Skip to content

Instantly share code, notes, and snippets.

@jaredwinick
Last active April 21, 2020 10:44
Show Gist options
  • Save jaredwinick/cabf91b2d8931d1db66fe9892aeeeb43 to your computer and use it in GitHub Desktop.
Save jaredwinick/cabf91b2d8931d1db66fe9892aeeeb43 to your computer and use it in GitHub Desktop.
Apache Zeppelin Object Exchange for passing a DataFrame of feature vectors from the Scala Spark interpreter to PySpark to get a numpy array

Apache Zeppelin has a helpful feature in its Spark Interpreter called Object Exchange. This allows you to pass objects, including DataFrames, between Scala and Python paragraphs of the same notebook. You can do your data prep/feature engineering with the Scala Spark Interpreter, and then pass off a DataFrame containing the features to PySpark for use with libraries like NumPy and scikit-learn. Also with Zeppelin's support for matplotlib you have a pretty good setup for poking around and testing out machine learning on your data.

import org.apache.spark.mllib.linalg.Vectors
case class TestClass(features: org.apache.spark.mllib.linalg.Vector)
val df = sqlContext.createDataFrame(
List(
TestClass(Vectors.sparse(4, Seq((0, 1.0), (2, 1.0)))),
TestClass(Vectors.sparse(4, Seq((1, 1.0), (2, 1.0))))))
z.put("df", df)
%pyspark
import numpy as np
from pyspark.sql import DataFrame
df = DataFrame(z.get("df"), sqlContext)
data = df.rdd.map(lambda row: row["features"].toArray()).collect()
npdata = np.array(data)
print np.shape(npdata)
print npdata
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment