@msampathkumar
Last active June 30, 2020 07:33
PySpark Commonly Used Commands

Set up a SparkSession

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Magic').getOrCreate()

File IO

df = spark.read.format("csv").option("multiline", True).option("delimiter", ",").load(file_path)

df.coalesce(1).write.mode("overwrite").format(file_type).save(file_path2)

Spark authentication to a data store (if you have to read from a Blob Store/Data Lake/..)

spark.sparkContext._jsc.hadoopConfiguration().set(account_name, account_key)

Cool Stuff

df.rdd.zipWithIndex() # pairs each row with its index: ("Sam", "Physics") ==> (("Sam", "Physics"), 0)

df.rdd.zipWithIndex().map(lambda row: (row[1], row[0])) # (("Sam", "Physics"), 0) ==> (0, ("Sam", "Physics")) # Key-Value pairs

rdd -> toDF -> df

new_df = df.rdd.zipWithIndex().map(lambda row: (row[1], row[0])).toDF(["ID", "Data"])
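
Conceptually, `zipWithIndex` plus the key-value `map` above behave like Python's `enumerate` with the pair swapped. A pure-Python sketch (the `rows` data is illustrative, not from the gist):

```python
# Illustrative rows standing in for the RDD's contents.
rows = [("Sam", "Physics"), ("Alex", "Maths")]

# zipWithIndex pairs each element with its index: (element, index)
zipped = [(row, idx) for idx, row in enumerate(rows)]

# The map swaps the pair to (index, element), giving key-value pairs.
kv = [(idx, row) for row, idx in zipped]
print(kv[0])  # → (0, ('Sam', 'Physics'))
```

The swap matters because Spark's pair-RDD operations (`groupByKey`, `reduceByKey`, ...) treat the first tuple element as the key.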

to see data structure

new_df.printSchema()

to count rows

print(new_df.count())

to filter

new_df = new_df.filter(new_df.Name == "Sam")

to find max & min

df.agg({'ID': 'min'}).collect()[0]['min(ID)'] # passing a dict in `agg`
df.agg({'ID': 'max'}).collect()[0]['max(ID)'] # passing a dict in `agg`
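
What those aggregations compute can be checked in plain Python — a sketch over an illustrative `ID` column (the data is made up for the example):

```python
# Illustrative rows: (ID, Name) tuples standing in for the DataFrame.
rows = [(1, "Sam"), (2, "Alex"), (3, "Kim")]

# The dict-style agg above computes min/max over the ID column.
ids = [row[0] for row in rows]
print(min(ids), max(ids))  # → 1 3
```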

group by

df.rdd.groupBy(lambda row: row.Name).mapValues(list) # DataFrame groupBy has no mapValues; drop to the RDD API
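
The group-into-lists pattern is equivalent to building a dict of lists keyed by name. A pure-Python sketch with illustrative data:

```python
from collections import defaultdict

# Illustrative (Name, Subject) rows standing in for the RDD.
rows = [("Sam", "Physics"), ("Alex", "Maths"), ("Sam", "Chemistry")]

# Mirrors rdd.groupBy(key).mapValues(list): collect each key's rows into a list.
groups = defaultdict(list)
for name, subject in rows:
    groups[name].append(subject)

print(dict(groups))  # → {'Sam': ['Physics', 'Chemistry'], 'Alex': ['Maths']}
```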

Cache in memory

df.persist()
# do stuff,..
df.unpersist()

Tracking/Debugging stuff

spark.sparkContext.setJobGroup("title", 'description here')

df.head(5)

df.rdd.take(5)