@msampathkumar
Last active June 30, 2020 07:33
PySpark Commonly Used Commands

Set up a SparkSession

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Magic').getOrCreate()

File IO

df = spark.read.format("csv").option("multiline", True).option("delimiter", ",").load(file_path)

df.coalesce(1).write.mode("overwrite").format(file_type).save(file_path2)

Spark authentication to a data store (if you have to read from a Blob Store/Data Lake/..)

spark.sparkContext._jsc.hadoopConfiguration().set(account_name, account_key)

Cool Stuff

df.rdd.zipWithIndex() # pairs each row with its index: ("Sam", "Physics") ==> (("Sam", "Physics"), 0)

df.rdd.zipWithIndex().map(lambda row: (row[1], row[0])) # (("Sam", "Physics"), 0) ==> (0, ("Sam", "Physics")) # Key-Value pairs

rdd -> toDF -> df

new_df = df.rdd.zipWithIndex().map(lambda row: (row[1], row[0])).toDF(["ID", "Data"])
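
Conceptually, `zipWithIndex` plus the key-value `map` above behave like Python's `enumerate` with the pair swapped. A pure-Python sketch (the `rows` data is illustrative, not from the gist):

```python
# Illustrative rows standing in for the RDD's contents.
rows = [("Sam", "Physics"), ("Alex", "Maths")]

# zipWithIndex pairs each element with its index: (element, index)
zipped = [(row, idx) for idx, row in enumerate(rows)]

# The map swaps the pair to (index, element), giving key-value pairs.
kv = [(idx, row) for row, idx in zipped]
print(kv[0])  # → (0, ('Sam', 'Physics'))
```

The swap matters because Spark's pair-RDD operations (`groupByKey`, `reduceByKey`, ...) treat the first tuple element as the key.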

to see data structure

new_df.printSchema()

to count rows

print(new_df.count())

to filter

new_df = new_df.filter(new_df.Name == "Sam")

to find max & min

df.agg({'ID': 'min'}).collect()[0]['min(ID)'] # passing a dict in `agg`
df.agg({'ID': 'max'}).collect()[0]['max(ID)'] # passing a dict in `agg`
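
What those aggregations compute can be checked in plain Python — a sketch over an illustrative `ID` column (the data is made up for the example):

```python
# Illustrative rows: (ID, Name) tuples standing in for the DataFrame.
rows = [(1, "Sam"), (2, "Alex"), (3, "Kim")]

# The dict-style agg above computes min/max over the ID column.
ids = [row[0] for row in rows]
print(min(ids), max(ids))  # → 1 3
```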

group by

df.rdd.groupBy(lambda row: row.Name).mapValues(list) # DataFrame groupBy has no mapValues; drop to the RDD API
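
The group-into-lists pattern is equivalent to building a dict of lists keyed by name. A pure-Python sketch with illustrative data:

```python
from collections import defaultdict

# Illustrative (Name, Subject) rows standing in for the RDD.
rows = [("Sam", "Physics"), ("Alex", "Maths"), ("Sam", "Chemistry")]

# Mirrors rdd.groupBy(key).mapValues(list): collect each key's rows into a list.
groups = defaultdict(list)
for name, subject in rows:
    groups[name].append(subject)

print(dict(groups))  # → {'Sam': ['Physics', 'Chemistry'], 'Alex': ['Maths']}
```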

Cache in memory

df.persist()
# do stuff,..
df.unpersist()

Tracking/Debugging stuff

spark.sparkContext.setJobGroup("title", 'description here')

df.head(5)

df.rdd.take(5)