Apache Zeppelin's Spark interpreter has a helpful feature called Object Exchange, which lets you pass objects, including DataFrames, between Scala and Python paragraphs of the same notebook. You can do your data prep and feature engineering with the Scala Spark interpreter, then hand a DataFrame of features off to PySpark for use with libraries like NumPy and scikit-learn. Combined with Zeppelin's support for matplotlib, this makes a pretty good setup for poking around and experimenting with machine learning on your data.
Apache Zeppelin Object Exchange for passing a DataFrame of feature vectors from the Scala Spark interpreter to PySpark to get a numpy array
import org.apache.spark.mllib.linalg.Vectors

// A simple case class so the DataFrame has a "features" vector column
case class TestClass(features: org.apache.spark.mllib.linalg.Vector)

val df = sqlContext.createDataFrame(
  List(
    TestClass(Vectors.sparse(4, Seq((0, 1.0), (2, 1.0)))),
    TestClass(Vectors.sparse(4, Seq((1, 1.0), (2, 1.0))))))

// Put the DataFrame into the ZeppelinContext so other interpreters can read it
z.put("df", df)
%pyspark
import numpy as np
from pyspark.sql import DataFrame

# Pull the Scala DataFrame back out of the ZeppelinContext and wrap it
# as a PySpark DataFrame
df = DataFrame(z.get("df"), sqlContext)

# Convert each sparse feature vector to a dense array and collect to the driver
data = df.rdd.map(lambda row: row["features"].toArray()).collect()
npdata = np.array(data)
print(np.shape(npdata))
print(npdata)
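
From here the collected array can go straight into scikit-learn in the same %pyspark paragraph. The following is only a sketch under assumptions not in the gist: the model choice (KMeans), n_clusters=2, and a Zeppelin version that renders matplotlib figures inline.

%pyspark
# Sketch only: the model and n_clusters=2 are arbitrary choices for
# illustration, not part of the original gist.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

kmeans = KMeans(n_clusters=2).fit(npdata)
print(kmeans.labels_)

# Plot the first two feature dimensions, colored by cluster assignment;
# assumes Zeppelin's inline matplotlib rendering.
plt.scatter(npdata[:, 0], npdata[:, 1], c=kmeans.labels_)
plt.show()

Because collect() brings the feature vectors to the driver, this pattern only makes sense once the feature matrix is small enough to fit in driver memory.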