Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save dennyglee/7a04e87f3e60a446b51904055bd42ff0 to your computer and use it in GitHub Desktop.
Save dennyglee/7a04e87f3e60a446b51904055bd42ff0 to your computer and use it in GitHub Desktop.
Accessing DataFrame with [('features', 'vector'), ('label', 'double')] schema
from pyspark.mllib.linalg import Vectors
# Sample dataset
data = sc.parallelize([
(0.0, [0.0, 1.0, 2.0]),
(1.0, [1.0, 2.0, 3.0]),
(3.0, [2.0, 3.0, 4.0]),
(2.0, [3.0, 4.0, 5.0])
])
# Load each word and create row object
parts = data.map(lambda t: Row(label=t[0], features=Vectors.dense(t[1])))
# Infer schema (using reflection)
df = parts.toDF()
# Run selectExpr
df.selectExpr("max(label) as max_value","min(label) as min_value").show()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment