@aarroyoc
Created January 19, 2020 22:24
PySpark example
# ~/spark-3.0xxxx/bin/spark-submit sparky.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
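# Word count: split each line into words, pair each word with 1, then sum the 1s per word
inputfile = sc.textFile("input.txt")
counts = (
    inputfile.flatMap(lambda line: line.split())  # split() also drops empty tokens from repeated spaces
    .map(lambda word: (word, 1))
    .reduceByKey(lambda x, y: x + y)
)
counts.saveAsTextFile("output")  # fails if the "output" directory already exists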
# Accumulator: a shared counter that tasks can only add to; the driver reads the total
accum = sc.accumulator(0)

def ac(x):
    global accum
    accum += x

sc.parallelize([1, 2, 3, 4]).foreach(ac)
print(accum.value)