@dapangmao
Created March 18, 2015 15:00
Two ways to transform RDD to DataFrame in Spark
1. Add schema after becoming DataFrame
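# rdd1 is assumed to be an existing RDD of Row objects; column types are inferred from the data.
# inferSchema is deprecated since Spark 1.3; sqlCtx.createDataFrame(rdd1) is its replacement.
sqlCtx.inferSchema(rdd1)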
2. Add schema before becoming DataFrame
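import os
from pyspark.sql import Row

# Read class.txt from the current working directory; sc is an existing SparkContext.
current_path = os.getcwd()
rdd = sc.textFile(os.path.join(current_path, 'class.txt'))

def transform(x):
    # Split a whitespace-delimited line and cast each field to its type.
    y = x.split()
    return str(y[0]), str(y[1]), int(y[2]), float(y[3]), float(y[4])

# Row("name", ...) creates a row class whose field names become the column names.
varnames = Row("name", "sex", "age", "height", "weight")
df = rdd.map(transform).map(lambda x: varnames(*x)).toDF()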
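
The parsing step can be checked without a Spark cluster. The sketch below applies the same transform function to a single line and wraps the result in a named record. The sample line is hypothetical (the contents of class.txt are not shown here), and collections.namedtuple is only a stand-in for pyspark.sql.Row so that the snippet runs in plain Python.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (an assumption, used so this runs without Spark).
Student = namedtuple("Student", ["name", "sex", "age", "height", "weight"])

def transform(x):
    # Split a whitespace-delimited line and cast each field to its type.
    y = x.split()
    return str(y[0]), str(y[1]), int(y[2]), float(y[3]), float(y[4])

# Hypothetical input line: name, sex, age, height, weight.
sample_line = "Alice F 13 56.5 84.0"

# Same pattern as varnames(*x) in the Spark version: unpack the tuple into named fields.
record = Student(*transform(sample_line))
print(record)
```

Casting inside transform matters because the field types of each record determine the column types Spark assigns when toDF() is called; without the casts every column would be a string.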