@elliottcordo
Last active August 29, 2015 14:13
Some initial stuff from SparkSQL meetup

###create an EMR cluster with the latest Spark version --> should be fine if you don't want to use Hive or Parquet

aws emr create-cluster --name SparkCluster --ami-version 3.2.1 --instance-type m3.xlarge --instance-count 3 --ec2-attributes KeyName=<your key homey> --applications Name=Hive --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=\["-v1.2.0.a"\] 

###ok, now you have a cluster - do some slicing and dicing through pyspark. Client mode is fine, but be sure to start it in a screen session. You will also need to play with the executor parameters.

./spark/bin/pyspark --master yarn --deploy-mode client --num-executors 12 --executor-memory 2g --executor-cores 4
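
Since the executor flags need tuning, a quick back-of-the-envelope check can help. The sketch below compares what the flags above request with the raw capacity of the cluster created earlier. It assumes an m3.xlarge has 4 vCPUs and 15 GiB of RAM, and that with `--instance-count 3` one node is the master and the other two run executors. The function names are made up for illustration, and memory overhead for YARN and the OS is ignored.

```python
# Rough sizing check for the pyspark flags above (a sketch, not a tuning guide).
# Assumptions: m3.xlarge = 4 vCPUs / 15 GiB RAM; with --instance-count 3,
# one node is the master and the other two run executors.

def requested_resources(num_executors, executor_memory_gb, executor_cores):
    """Total cores and executor heap (GiB) requested by the spark flags."""
    return {
        "cores": num_executors * executor_cores,
        "memory_gb": num_executors * executor_memory_gb,
    }

def cluster_capacity(instance_count, vcpus_per_node=4, memory_gb_per_node=15):
    """Raw capacity of the worker nodes, before YARN and OS overhead."""
    workers = instance_count - 1  # assumption: one node is the master
    return {
        "cores": workers * vcpus_per_node,
        "memory_gb": workers * memory_gb_per_node,
    }

requested = requested_resources(num_executors=12, executor_memory_gb=2, executor_cores=4)
capacity = cluster_capacity(instance_count=3)

for key in ("cores", "memory_gb"):
    print(key, "requested:", requested[key], "raw capacity:", capacity[key])
```

Under these assumptions the flags ask for more cores than the two worker nodes physically have. How YARN reacts to that depends on its scheduler configuration, which is one reason the numbers are worth adjusting per cluster.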

###here is a little snippet of JSON digestion

from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)

reviews = hiveContext.jsonFile("s3://caserta-bucket1/yelp-academic-dataset/yelp_academic_dataset_review.json")

reviews.printSchema()

reviews.registerTempTable("reviews")

hiveContext.sql("select count(1) from reviews").collect()
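
To get a feel for what the snippet above does without a cluster, here is a plain-Python sketch of the same idea. `jsonFile` expects one complete JSON object per line and works out the schema by scanning the records. The sample records and the `infer_schema` helper below are hypothetical, made up for illustration. They are not the real Yelp data and not Spark's implementation.

```python
import json
from collections import Counter

# Hypothetical sample records, made up for illustration; the real file is much larger.
sample_lines = [
    '{"review_id": "r1", "stars": 5, "text": "Great tacos"}',
    '{"review_id": "r2", "stars": 2, "text": "Slow service"}',
    '{"review_id": "r3", "stars": 5, "text": "Would return", "date": "2014-01-02"}',
]

def infer_schema(lines):
    """Map each field name to the set of Python type names seen for it."""
    schema = {}
    for line in lines:
        record = json.loads(line)  # one complete JSON object per line
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

records = [json.loads(line) for line in sample_lines]
schema = infer_schema(sample_lines)
row_count = len(records)  # plays the role of: select count(1) from reviews
stars_histogram = Counter(r["stars"] for r in records)  # like: group by stars

for field in sorted(schema):
    print(field, sorted(schema[field]))
print("rows:", row_count)
print("by stars:", dict(stars_histogram))
```

Note that a field missing from some records (here `date`) still ends up in the inferred schema. This is why a pretty-printed, multi-line JSON file will not load this way: each line has to parse on its own.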