@khan5v
Created August 29, 2015 16:49
Shows how to filter out entries in Spark whose key occurs more than X times
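# Read the data from a file; sc is an existing SparkContext and logFile a path
# Assumption: each line is "key<TAB>value"; adjust the split to the real format
logData = sc.textFile(logFile).map(lambda line: tuple(line.split("\t", 1)))
X = 10
# For each key, count its entries and keep the keys that occur more than X times
outliers = (logData.map(lambda kv: (kv[0], 1))
                   .reduceByKey(lambda a, b: a + b)
                   .filter(lambda kv: kv[1] > X)
                   .cache())
# Remove every entry whose key is among the outliers
reducedLogData = logData.subtractByKey(outliers).cache()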
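
The same filtering logic can be sketched in plain Python, which may help when checking the expected result on a small sample before running it on a cluster. This is only an illustration, not part of the gist: `drop_frequent_keys`, `sample`, and the threshold of 2 are hypothetical names and values, and the input is assumed to be a list of (key, value) pairs.

```python
from collections import Counter


def drop_frequent_keys(pairs, threshold):
    """Return the pairs whose key occurs at most `threshold` times."""
    # Count entries per key (plays the role of map + reduceByKey)
    counts = Counter(key for key, _ in pairs)
    # Keys that occur more than `threshold` times (the filter step)
    outliers = {key for key, n in counts.items() if n > threshold}
    # Keep only entries whose key is not an outlier (the subtractByKey step)
    return [(key, value) for key, value in pairs if key not in outliers]


# Hypothetical sample data: key "a" occurs three times, "c" twice, "b" once
sample = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("c", 5), ("c", 6)]
print(drop_frequent_keys(sample, 2))  # [('b', 4), ('c', 5), ('c', 6)]
```

With a threshold of 2, every entry for key "a" is dropped while "b" and "c" are kept, which is the result the Spark version should produce for the same input, up to ordering.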