@koverholt
Last active August 29, 2015 14:17
Example Wordcount in Spark
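from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    # Configure the application; replace {hostname} with the Spark master's host.
    conf = SparkConf()
    conf.setMaster("spark://{hostname}:7077")
    conf.setAppName("WordCount")
    sc = SparkContext(conf=conf)

    # Read the input text as an RDD of lines.
    lines = sc.textFile("/mnt/gluster/pg2591.txt")

    # Strip or space out punctuation, lowercase each line, split it into words,
    # count each word, then sort by count in descending order.
    counts = lines.map(lambda line: line.replace('"', '').replace("'", '').replace('.', ' ').replace(',', ' ').replace(':', '').replace(';', ' ').lower()) \
        .flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .map(lambda x: (x[1], x[0])) \
        .sortByKey(False)

    # Write (count, word) pairs; the output directory must not already exist.
    counts.saveAsTextFile("/mnt/gluster/wordcount")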
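
To check the text-cleaning and counting logic without a cluster, the same steps can be approximated in plain Python. This is only a sketch: `normalize` and `word_count` are hypothetical helper names that do not appear in the gist, and `collections.Counter` stands in for `reduceByKey`. Words with equal counts are sorted alphabetically here, which the Spark job does not guarantee.

```python
from collections import Counter


def normalize(line):
    """Apply the same punctuation handling and lowercasing as the Spark job."""
    return (line.replace('"', '').replace("'", '')
                .replace('.', ' ').replace(',', ' ')
                .replace(':', '').replace(';', ' ')
                .lower())


def word_count(lines):
    """Return (count, word) pairs sorted by count, highest first."""
    counts = Counter()
    for line in lines:
        # Equivalent of flatMap(split) followed by reduceByKey(add).
        counts.update(normalize(line).split())
    # Swap to (count, word) to match the job's output layout.
    pairs = ((n, w) for w, n in counts.items())
    return sorted(pairs, key=lambda p: (-p[0], p[1]))


if __name__ == "__main__":
    sample = ["The cat; the hat.", "A cat's hat: THE end,"]
    for count, word in word_count(sample):
        print(count, word)
```

Because apostrophes are deleted, not replaced by a space, `cat's` is counted as `cats`, a separate word from `cat`. The Spark version behaves the same way.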