elliottcordo / yelp_pyspark_example.py
Last active August 29, 2015 14:08
yelp pyspark example
#launch the shell with: MASTER=yarn-client /home/hadoop/spark/bin/pyspark
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
#------------------------------------------------
#load some users from S3 and split each CSV line into fields
lines = sc.textFile("s3://caserta-bucket1/yelp/in/users/users.txt")
parts = lines.map(lambda l: l.split(","))
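The gist preview cuts off at the split; a minimal sketch of the likely continuation, assuming (hypothetically) that users.txt carries user_id and name columns:

#hypothetical continuation -- column names are assumed, not from the gist
users = parts.map(lambda p: Row(user_id=p[0], name=p[1]))
schema_users = sqlContext.inferSchema(users)  #Spark 1.1-era API
schema_users.registerTempTable("users")
sqlContext.sql("select count(*) from users").collect()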
elliottcordo / yelp_pig_join.pig
Created October 28, 2014 04:00
yelp_pig_join
REGISTER 's3://caserta-bucket1/libs/elephant-bird-pig.jar';
REGISTER 's3://caserta-bucket1/libs/elephant-bird-core.jar';
REGISTER 's3://caserta-bucket1/libs/elephant-bird-hadoop-compat.jar';
REGISTER 's3://caserta-bucket1/libs/json-simple.jar';

-- load each business record as a JSON map
business = LOAD 's3://caserta-bucket1/yelp-academic-dataset/yelp_academic_dataset_business.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- the preview truncates mid-statement; a plausible projection (field names assumed)
business_cleaned = FOREACH business
    GENERATE $0#'business_id' AS business_id, $0#'name' AS name;
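For comparison, the same JSON load is a one-liner in Spark SQL; a sketch using the Spark 1.1+ jsonFile API, which is not part of this gist:

#illustrative Spark SQL equivalent (run from a pyspark shell)
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
business = sqlContext.jsonFile("s3://caserta-bucket1/yelp-academic-dataset/yelp_academic_dataset_business.json")
business.registerTempTable("business")
sqlContext.sql("select name from business limit 10").collect()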
elliottcordo / redshift_ntile.sql
Last active August 29, 2015 14:09
redshift ntile query
drop table zzt;
create temporary table zzt as
with n_tile as
(
  -- one bucket per 50 rows, capped at 5 buckets
  select case when cnt > 5 then 5 else cnt end as cnt
  from
    ( select count(1)/50 as cnt
      from temp.godaddy_viewing_summary_daily_visit ) a
)
-- the preview truncates here; an assumed main select (column names illustrative)
select v.*, ntile(n.cnt) over (order by v.visit_count) as tile
from temp.godaddy_viewing_summary_daily_visit v
cross join n_tile n;
elliottcordo / emr-spark.sh
Created November 17, 2014 19:15
emr spark cluster
aws emr create-cluster --name SparkCluster --ami-version 3.2 --instance-type m3.xlarge --instance-count 3 --ec2-attributes KeyName=caserta-1 --applications Name=Hive --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark
elliottcordo / emr-spark-pyspark-fix.sh
Created November 24, 2014 20:30
emr spark pyspark fix
#the Spark assembly shipped on EMR is packed so that Python's zipimport can't read it;
#unpack it and repack it with the Java 6 jar tool so pyspark works
unzip -d tmp1 spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar
cd tmp1
#run the line below assuming openjdk is not installed on your EMR cluster (it's probably not)
sudo yum install -y java-1.6.0-openjdk-devel.x86_64
/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.33.x86_64/bin/jar cvmf META-INF/MANIFEST.MF ../spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar .
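A quick smoke test from the repaired pyspark shell (an arbitrary check, not part of the gist):

#should return 45 if the assembly jar is now readable
sc.parallelize(range(10)).sum()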
elliottcordo / spark_emr
Last active August 29, 2015 14:11
spark cluster emr command
aws emr create-cluster --name SparkCluster --ami-version 3.2.1 --instance-type m3.xlarge --instance-count 3 --ec2-attributes KeyName=caserta-1 --applications Name=Hive --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=\["-v1.1.0.d"\]
elliottcordo / emr_spark_thrift_on_yarn
Created December 15, 2014 22:21
EMR spark thrift server
#on the cluster, start the Thrift JDBC server against YARN
/home/hadoop/spark/sbin/start-thriftserver.sh --master yarn-client
#ssh tunnel: forward unused local port 8157 to the Thrift server's port 10000
ssh -i ~/caserta-1.pem -N -L 8157:ec2-54-221-27-21.compute-1.amazonaws.com:10000 [email protected]
#see this for JDBC config on the client: http://blogs.aws.amazon.com/bigdata/post/TxT7CJ0E7CRX88/Using-Amazon-EMR-with-SQL-Workbench-and-other-BI-Tools
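With the tunnel up, any Hive Thrift client can hit localhost:8157; a minimal sketch in Python, assuming the PyHive package (not mentioned in the gist):

#hypothetical client-side check through the ssh tunnel
from pyhive import hive
conn = hive.connect(host="localhost", port=8157)
cur = conn.cursor()
cur.execute("show tables")
print(cur.fetchall())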
elliottcordo / emr-spark-hive-context-fix
Created December 15, 2014 22:23
Spark hiveContext Fix
#give Spark a hive-site.xml so HiveContext can reach the Hive metastore
cp /home/hadoop/hive/conf/hive-default.xml /home/hadoop/spark/conf/hive-site.xml
#prepend the metastore's connection-pool and MySQL driver jars to SPARK_CLASSPATH
sed -i 's/SPARK_CLASSPATH=\"/&\/home\/hadoop\/hive\/lib\/bonecp-0.8.0.RELEASE.jar:\/home\/hadoop\/hive\/lib\/mysql-connector-java-5.1.30.jar:/' /home/hadoop/spark/conf/spark-env.sh
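After the fix, a HiveContext should come up cleanly in the pyspark shell; a quick check (nothing here is from the gist):

#verify the metastore connection works
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("show tables").collect()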
elliottcordo / spark_thrift_standalone
Created December 15, 2014 22:29
starting spark thrift server standalone
#point the shell at the standalone Spark master (spark:// scheduler) instead of YARN
/home/hadoop/spark/bin/pyspark --master spark://ip-10-63-51-140.ec2.internal:7077
elliottcordo / gist:88baf8233b4165a939a4
Last active August 29, 2015 14:13
Some initial stuff from SparkSQL meetup

###create an EMR cluster on the latest Spark version --> should be fine if you don't want to use Hive or Parquet

aws emr create-cluster --name SparkCluster --ami-version 3.2.1 --instance-type m3.xlarge --instance-count 3 --ec2-attributes KeyName=<your key homey> --applications Name=Hive --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=\["-v1.2.0.a"\] 

###OK, now you have a cluster - do some slicing and dicing through pyspark; client mode is fine, but be sure to start it in a screen session, and you'll also need to play with the executor parameters

./spark/bin/pyspark --master yarn --deploy-mode client --num-executors 12 --executor-memory 2g --executor-cores 4
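
Once the shell is up, the slicing and dicing might look like this; a sketch that assumes the Yelp business JSON from the earlier gists (path and field names carried over, not from these notes):

#example aggregation -- field names assumed from the Yelp business schema
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
business = sqlContext.jsonFile("s3://caserta-bucket1/yelp-academic-dataset/yelp_academic_dataset_business.json")
business.registerTempTable("business")
sqlContext.sql("select state, count(*) as cnt from business group by state order by cnt desc").collect()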