
@ravsau
Last active September 21, 2019 02:31
Spark-word-count-on-aws-emr
@ravsau
Author

ravsau commented Sep 19, 2019

image

@Sanjogsharma

Sanjogsharma commented Sep 19, 2019

Code:

aws emr create-cluster --ami-version 3.2.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium --name SparkCluster --enable-debugging --tags Name=emr --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark --ec2-attributes KeyName=**Your Key** --log-uri s3://**YOUR BUCKET**

@ravsau

ravsau commented Sep 19, 2019

This command launches the EMR cluster. Replace your-key with your EC2 key pair name and your-bucket with your S3 bucket name.

aws emr create-cluster --ami-version 3.2.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium --name SparkCluster --enable-debugging --tags Name=emr --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark --ec2-attributes KeyName=your-key --log-uri s3://your-bucket

EMR launches the cluster's instances, which you can view in the EC2 console.
image
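
To follow the cluster from the command line as well, the ID printed by create-cluster can be captured and reused. The snippet below is only a sketch: it assumes the command prints a JSON document with a "ClusterId" field, and the helper name extract_cluster_id and the sample ID are made up for illustration.

```shell
# Hypothetical helper: pull the ClusterId value out of create-cluster's
# JSON output, read from stdin. Assumes a "ClusterId": "..." field is present.
extract_cluster_id() {
  sed -n 's/.*"ClusterId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Made-up sample response; a real cluster ID will differ.
sample_response='{ "ClusterId": "j-EXAMPLE12345" }'
cluster_id=$(printf '%s\n' "$sample_response" | extract_cluster_id)
echo "$cluster_id"

# With a real cluster, the state could then be checked with something like:
#   aws emr describe-cluster --cluster-id "$cluster_id" \
#     --query 'Cluster.Status.State' --output text
```

In practice you would pipe the real create-cluster output into the helper instead of the sample string.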

@ravsau

ravsau commented Sep 21, 2019

awk -F: '{print $1, $2}' output.txt | sort -k2,2n
Sorts the output numerically by the second column.
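
A quick way to try the sort without the real data is to run it on a few sample lines. This is a sketch that assumes output.txt holds "word: count" lines; the file name sample_output.txt and its contents are invented for the example, and the field separator here also trims the space after the colon.

```shell
# Hypothetical sample in "word: count" form (the real output.txt format is an assumption).
printf 'spark: 12\nemr: 3\naws: 7\n' > sample_output.txt

# Split on ':' plus any following spaces, print word and count as two columns,
# then sort numerically on the second column only.
sorted=$(awk -F': *' '{print $1, $2}' sample_output.txt | sort -k2,2n)
echo "$sorted"
```

Appending r to the key (sort -k2,2nr) would list the most frequent words first.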

@ravsau

ravsau commented Sep 21, 2019

image

@ravsau

ravsau commented Sep 21, 2019

Resized because we hit an out-of-memory (OOM) error.

image

@ravsau

ravsau commented Sep 21, 2019

The problem may be the Snappy-compressed files ☝️

@ravsau

ravsau commented Sep 21, 2019

:trollface:
