### Create an EMR cluster on the latest Spark version (v1.2.0.a in the bootstrap args below). Should be fine if you don't want to use Hive or Parquet.
aws emr create-cluster --name SparkCluster --ami-version 3.2.1 --instance-type m3.xlarge --instance-count 3 --ec2-attributes KeyName=<your key homey> --applications Name=Hive --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=\["-v1.2.0.a"\]
### OK, now you have a cluster. Do some slicing and dicing through pyspark. Client mode is fine, but be sure to start it inside a screen session so the shell survives a dropped SSH connection. You will also need to play with the executor parameters.
./spark/bin/pyspark --master yarn --deploy-mode client --num-executors 12 --executor-memory 2g --executor-cores 4
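### A rough Python sketch for picking starting values for the three executor flags above. Everything here is an assumption rather than something from these notes: suggest_executor_flags is a hypothetical helper, the per-node memory and vcore figures must be read from your own YARN config (yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores), and the 10% / 384 MB overhead reserve is only a guess at what YARN needs on top of the executor heap. It also ignores the container taken by the application master.

```python
def suggest_executor_flags(worker_nodes, yarn_mem_mb_per_node, vcores_per_node,
                           cores_per_executor=2, overhead_percent=10,
                           min_overhead_mb=384):
    """Return (num_executors, executor_memory_mb, executor_cores) that fit in YARN.

    All inputs are assumptions supplied by the caller; nothing is read
    from the cluster.
    """
    # How many executors of the requested width fit on one node.
    executors_per_node = max(1, vcores_per_node // cores_per_executor)
    # Split the node's YARN memory evenly between those executors.
    container_mb = yarn_mem_mb_per_node // executors_per_node
    # Leave room for off-heap overhead inside each container (assumed reserve).
    overhead_mb = max(min_overhead_mb, container_mb * overhead_percent // 100)
    executor_memory_mb = container_mb - overhead_mb
    if executor_memory_mb <= 0:
        raise ValueError("not enough memory per container for an executor")
    return worker_nodes * executors_per_node, executor_memory_mb, cores_per_executor


def format_flags(num_executors, executor_memory_mb, executor_cores):
    """Render the result as pyspark command-line flags."""
    return "--num-executors {0} --executor-memory {1}m --executor-cores {2}".format(
        num_executors, executor_memory_mb, executor_cores)


# Example: 2 worker nodes with 4 vcores each; 11520 MB per node is an assumed figure.
print(format_flags(*suggest_executor_flags(2, 11520, 4)))
```

### Treat the output as a starting point only, then adjust while watching the YARN resource manager UI.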
### Here is a little snippet of JSON digestion (run it in the pyspark shell started above, where sc already exists)
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
reviews = hiveContext.jsonFile("s3://caserta-bucket1/yelp-academic-dataset/yelp_academic_dataset_review.json")
reviews.printSchema()
reviews.registerTempTable("reviews")
hiveContext.sql("select count(1) from reviews").collect()
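### jsonFile expects one complete JSON object per line, so a quick local check on a handful of lines can save a failed job. A minimal sketch in plain Python: summarize_json_lines is a hypothetical helper, and the sample records are made up for illustration (they are not rows from the Yelp dataset).

```python
import json
from collections import Counter


def summarize_json_lines(lines):
    """Return (good_records, bad_lines, key_counts) for JSON-lines input."""
    good_records = 0
    bad_lines = 0
    key_counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except ValueError:
            bad_lines += 1  # truncated or malformed JSON
            continue
        if not isinstance(record, dict):
            bad_lines += 1  # valid JSON, but not an object
            continue
        good_records += 1
        key_counts.update(record.keys())
    return good_records, bad_lines, dict(key_counts)


# Made-up sample: two good records and one truncated line.
sample = [
    '{"review_id": "r1", "stars": 4, "text": "good"}',
    '{"review_id": "r2", "stars": 2}',
    '{"review_id": "r3", "stars": ',
]
good, bad, keys = summarize_json_lines(sample)
print(good, bad, sorted(keys))
```

### In the pyspark shell, the same helper can be fed a small sample of the real file, e.g. summarize_json_lines(sc.textFile(path).take(100)), where path is your own S3 location.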