@zezutom
Created December 13, 2015 21:52
Spark's RDD transformations to arrive at a word count solution
// Assumes rdd is an RDD[String] holding lines of text
rdd
  .flatMap(_.split("\\s+"))           // Split each line on runs of whitespace
  .map(_.replaceAll("[,.!?:;]", "")   // Remove punctuation and convert to lowercase
    .trim
    .toLowerCase)
  .filter(_.nonEmpty)                 // Drop empty strings left over after cleaning
  .map(word => (word, 1))             // Pair each word with an initial count of 1
  .reduceByKey(_ + _)                 // Sum the counts for each word
  .sortByKey()                        // Sort the word counts in lexical order of the words
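
To try the same steps without a Spark cluster, here is a minimal sketch on plain Scala collections. It is an approximation, not part of the original gist: `lines` is hypothetical sample input, and `groupBy` plus `sortBy` stand in for Spark's `reduceByKey` and `sortByKey`, which are not available on an ordinary `Seq`.

```scala
// Local sketch of the word count pipeline on plain Scala collections (no Spark).
// `lines` is hypothetical sample input used only for illustration.
val lines = Seq("The quick brown fox.", "The lazy dog!  The end?")

val wordCounts: Seq[(String, Int)] = lines
  .flatMap(_.split("\\s"))                      // Split on any whitespace character
  .map(_.replaceAll("[,.!?:;]", "")             // Remove punctuation and lowercase
    .trim
    .toLowerCase)
  .filter(_.nonEmpty)                           // Drop empty strings (e.g. from double spaces)
  .map(word => (word, 1))                       // Pair each word with a count of 1
  .groupBy(_._1)                                // Stand-in for reduceByKey: group pairs by word
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
  .toSeq
  .sortBy(_._1)                                 // Stand-in for sortByKey: lexical order

wordCounts.foreach(println)
// (brown,1)
// (dog,1)
// (end,1)
// (fox,1)
// (lazy,1)
// (quick,1)
// (the,3)
```

The double space in the sample input produces an empty token after the split, which is why the filter step matters before counting.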