# Overview of Spark features

## Initialize Spark in Python

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

## Load data

### Text

filename = 'a.txt'
lines = sc.textFile(filename)

### From code

lines = sc.parallelize(["a", "b"])

### JSON

from pyspark.sql import SQLContext

# The JSON reader lives on SQLContext, not on SparkContext
sqlContext = SQLContext(sc)
filename = "data.json"
people = sqlContext.read.json(filename)
people.registerTempTable("people")

### Load a directory of JSON files

import json

# textFile reads every file in the directory line by line,
# so each line must be a complete JSON document
my_RDD_strings = sc.textFile(path_to_dir_with_JSON_files)
my_RDD_dictionaries = my_RDD_strings.map(json.loads)

### JSON support in Spark SQL

from pyspark.sql import SQLContext

sql = SQLContext(sc)
filename = "data.json"
# jsonFile is the Spark 1.x API; sql.read.json(filename) is the newer equivalent
people = sql.jsonFile(filename)
people.registerTempTable("people")
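
Once the table is registered it can be queried with SQL; a minimal sketch (the name and age columns are assumptions about the data, not from the original):

# Returns a DataFrame of the matching rows
teenagers = sql.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagers.collect()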

## Actions on a single RDD

### Reduce

# Sum the elements of the RDD
total = series.reduce(lambda x, y: x + y)

### Collect

# Pull the entire RDD back to the driver; the result must fit in memory
rdd.collect()

### Count

num_elems = rdd.count()

### Count by value

# Returns a dict-like mapping of each distinct value to its count
counts = rdd.countByValue()

### Fold

# Like reduce, but seeded with a zero value; the operation must be supplied
a = b.fold(0, lambda x, y: x + y)

### Aggregate
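
aggregate generalizes fold: the accumulator may have a different type than the elements. A minimal sketch that computes a (sum, count) pair and derives the mean, assuming an RDD of numbers named nums:

# seqOp folds one element into a partition's accumulator;
# combOp merges accumulators from different partitions
sum_count = nums.aggregate(
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
mean = sum_count[0] / float(sum_count[1])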

### For each

# Apply `func` to each element on the workers; returns nothing
rdd.foreach(func)

### Take

# Take first 10 values
values = rdd.take(10)

### Top

# Take top 10 values
values = rdd.top(10)

### Sample with/without replacement

# Take 10 samples with replacement
values = rdd.takeSample(True, 10)
# The same without replacement
values = rdd.takeSample(False, 10)

## Transformations

### Filter

# Keep only the lines containing 'errors'
lines = sc.textFile('a.txt')
errors = lines.filter(lambda x: 'errors' in x)

### Union

badlines = errors.union(warnings)

### Intersection

both_attr = tall.intersection(slim)

### Subtract

only_tall = tall.subtract(slim)

### Distinct

unique = companies.distinct()

### Map

squares = data.map(lambda x: x * x)

### FlatMap

words = lines.flatMap(lambda line: line.split(" "))


## Numeric operations

Available on RDDs of numbers:

- count
- mean
- sum
- max
- min
- variance / sampleVariance
- stdev / sampleStdev
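
A quick sketch; stats() computes all of them in one pass (the sample RDD is illustrative):

nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])
nums.mean()       # 2.5
nums.variance()   # population variance; sampleVariance() divides by n - 1
nums.stats()      # StatCounter with count, mean, sum, max, min, variance, stdev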

## Shared variables

### Accumulator

# A counter that workers can only add to; the driver reads .value
counter = sc.accumulator(0)
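
A minimal usage sketch (the file and counter names are illustrative):

blank_lines = sc.accumulator(0)
sc.textFile('a.txt').foreach(lambda line: blank_lines.add(1) if line == '' else None)
print(blank_lines.value)  # only readable on the driver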


### Broadcast

# Ship a read-only value (here an illustrative dict) to every worker once
lookup = sc.broadcast({"a": 1, "b": 2})
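
Workers read the broadcast value through .value:

keys = sc.parallelize(["a", "b", "a"])
nums = keys.map(lambda k: lookup.value[k])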





## Actions on single pair RDDs

- countByKey
- collectAsMap
- lookup
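
A quick sketch of all three (the sample pairs are illustrative):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.countByKey()     # {'a': 2, 'b': 1}
pairs.collectAsMap()   # {'a': 3, 'b': 2}; duplicate keys keep a single value
pairs.lookup("a")      # [1, 3]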

## Transformations on two pair RDDs

- subtractByKey
- join
- rightOuterJoin
- leftOuterJoin
- cogroup
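
A sketch with two small pair RDDs (contents are illustrative; output order may vary):

a = sc.parallelize([("x", 1), ("y", 2)])
b = sc.parallelize([("x", 3)])
a.subtractByKey(b).collect()   # [('y', 2)]
a.join(b).collect()            # [('x', (1, 3))]
a.leftOuterJoin(b).collect()   # [('x', (1, 3)), ('y', (2, None))]
a.cogroup(b)                   # key -> (values from a, values from b) as iterables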

## Transformations on a single pair RDD

- reduceByKey
- groupByKey
- combineByKey
- mapValues
- flatMapValues
- keys
- values
- sortByKey
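
A sketch of the most common ones (sample data is illustrative; output order may vary):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.reduceByKey(lambda x, y: x + y).collect()   # [('a', 4), ('b', 2)]
pairs.mapValues(lambda v: v * 10).collect()       # [('a', 10), ('b', 20), ('a', 30)]
pairs.keys().collect()                            # ['a', 'b', 'a']
pairs.sortByKey().collect()                       # [('a', 1), ('a', 3), ('b', 2)]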

## DataFrame operations

- show
- select
- filter
- groupBy
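
A sketch against the people DataFrame registered earlier (the age column is an assumption):

people.show()
people.select("age").show()
people.filter(people.age > 21).show()
people.groupBy("age").count().show()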