I started by looking at this five-year-old but still quite useful file in this repo: https://github.com/alexmilowski/emr/tree/master/spark
Spark-word-count-on-aws-emr
ravsau commented on Sep 19, 2019:
Code:
aws emr create-cluster --ami-version 3.2.1 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium \
  --name SparkCluster --enable-debugging --tags Name=emr \
  --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark \
  --ec2-attributes KeyName=your-key --log-uri s3://your-bucket

This command launches the EMR cluster. Replace your-key with the name of your EC2 key pair and your-bucket with your S3 log bucket.
EMR then launches the cluster, and the instances show up in the EC2 console.
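You can also watch the cluster from the command line instead of the console. A minimal sketch: the j-XXXXXXXXXXXXX cluster ID below is a placeholder (create-cluster prints the real ID, and aws emr list-clusters shows it too), and ~/your-key.pem stands in for your key pair file.

# List clusters that are starting, running, or waiting
aws emr list-clusters --active

# Block until the cluster is up, then print its state
aws emr wait cluster-running --cluster-id j-XXXXXXXXXXXXX
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --query 'Cluster.Status.State' --output text

# SSH to the master node (uses the key pair passed to create-cluster)
aws emr ssh --cluster-id j-XXXXXXXXXXXXX --key-pair-file ~/your-key.pem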
awk -F: '{print $1, $2}' output.txt | sort -nk2

This splits each line of the word count output on the colon and sorts the result numerically by the second column (the count).
One catch: the job output files may be Snappy-compressed, in which case they can't be read as plain text directly; see the sketch below.
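One way around Snappy-compressed part files is to let Hadoop decompress them when reading: hadoop fs -text picks the codec from the file extension, and the Snappy libraries are typically available on EMR nodes. This is only a sketch; s3://your-bucket/output/ is a placeholder for wherever your job actually wrote its results.

# Run on the master node; adjust the S3 path to match your job's output location
hadoop fs -text 's3://your-bucket/output/part-*' > output.txt

# The pipeline above then works on the plain-text copy
awk -F: '{print $1, $2}' output.txt | sort -nk2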