@RooseveltAdvisors
Last active September 20, 2019 15:37
Run Jupyter Notebook and JupyterHub on Amazon EMR

Jupyter on EMR allows users to save their work on Amazon S3 rather than on local storage on the EMR cluster (master node).

To store notebooks on S3, use:

--notebook-dir <s3://your-bucket/folder/>

To store notebooks in a directory different from the user’s home directory, use:

--notebook-dir <local directory>
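
In the cluster command further down, this option is passed to the bootstrap action as two consecutive entries in its "Args" list. As a minimal sketch, assuming you want to catch a mistyped location before launching, the helper below prints that pair. The function name notebook_dir_args and the path /mnt/notebooks are illustrative and are not part of the bootstrap script.

```shell
# Hypothetical helper (not part of the bootstrap script): print the two
# "Args" entries for --notebook-dir, rejecting anything that is neither
# an s3:// URI nor an absolute local path.
notebook_dir_args() {
  nb_dir="$1"
  case "$nb_dir" in
    s3://?*|/?*)
      printf '"--notebook-dir","%s"\n' "$nb_dir"
      ;;
    *)
      echo "expected an s3:// URI or an absolute path: $nb_dir" >&2
      return 1
      ;;
  esac
}

notebook_dir_args "s3://your-bucket/folder/"   # S3 location
notebook_dir_args "/mnt/notebooks"             # local directory (example path)
```

Each call prints a fragment that can be pasted into the "Args" array of the bootstrap action; a relative path such as notebooks is rejected with a non-zero exit status.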

The following example CLI command launches a three-node (m3.xlarge) EMR 5.6.0 cluster, one master and two core nodes, with the bootstrap action (BA). The BA installs the kernels and packages selected by its arguments, including the pandas and ggplot Python packages, and sets:

the Jupyter port to 8880
the password to jupyter (the command below passes no password argument, so check the script's default before relying on this)
the JupyterHub port to 8001
aws emr create-cluster \
  --auto-scaling-role EMR_AutoScaling_DefaultRole \
  --applications Name=Hadoop Name=Hive Name=Pig Name=Hue Name=Spark Name=Ganglia Name=Presto Name=Tez \
  --bootstrap-actions '[{"Path":"s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5-latest.sh","Args":["--toree","--ds-packages","--ml-packages","--python-packages","pandas ggplot","--port","8880","--jupyterhub","--jupyterhub-port","8001","--spark-opts","--packages=com.typesafe:config:1.3.1,org.datasyslab:geospark:0.8.0,com.vividsolutions:jts:1.13,com.databricks:spark-avro_2.11:3.0.0,org.elasticsearch:elasticsearch-spark_2.11:2.4.0","--notebook-dir","s3://yuan.mobiquitynetworks.com/workspace/","--cached-install","--s3fs","--python3"],"Name":"Install Jupyter notebook"}]' \
  --ec2-attributes '{"KeyName":"<your-ec2-key>","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-1b58686f","EmrManagedSlaveSecurityGroup":"sg-2418c05e","EmrManagedMasterSecurityGroup":"sg-79e63e03"}' \
  --service-role EMR_DefaultRole \
  --enable-debugging \
  --release-label emr-5.6.0 \
  --log-uri 's3n://aws-logs-452442550777-us-west-2/elasticmapreduce/' \
  --name 'Jupyter Notebook' \
  --instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master - 1"}]' \
  --scale-down-behavior TERMINATE_AT_INSTANCE_HOUR \
  --region us-west-2
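
The --bootstrap-actions value is dense JSON, so it can help to assemble it from a readable argument list before pasting it into the command. The following is a rough sketch under stated assumptions: json_array is a hypothetical helper (not an AWS CLI feature), it only handles values without double quotes or backslashes, and the argument list is a shortened subset of the one used above.

```shell
# Hypothetical helper: join its arguments into a JSON array of strings.
# Assumes no value contains a double quote or a backslash.
json_array() {
  out=""
  for arg in "$@"; do
    out="${out:+$out,}\"$arg\""
  done
  printf '[%s]' "$out"
}

# Shortened subset of the bootstrap arguments from the command above.
ARGS=$(json_array --toree --python3 --port 8880 --jupyterhub --jupyterhub-port 8001)

# Wrap the arguments in the bootstrap-action object expected by the CLI.
BOOTSTRAP=$(printf '[{"Path":"%s","Args":%s,"Name":"Install Jupyter notebook"}]' \
  "s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5-latest.sh" \
  "$ARGS")

# Print for review before passing it as: --bootstrap-actions "$BOOTSTRAP"
echo "$BOOTSTRAP"
```

Printing the value first makes it easy to check the ports and options by eye; nothing here contacts AWS.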

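Replace <your-ec2-key> with the name of your EC2 key pair, and replace the --notebook-dir value (s3://yuan.mobiquitynetworks.com/workspace/ in this example) with the S3 bucket and folder where you store notebooks. The subnet ID, security group IDs, and log URI are also specific to the original account and need to be replaced with your own. You can also change the instance types to suit your needs and budget.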
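
One way to keep these substitutions in one place is to hold them in shell variables and build the JSON option values from them. This is only a sketch: the variable names and the sample values (my-key-pair, s3://your-bucket/folder/) are illustrative, and the EC2 attributes are trimmed to the key name and instance profile, so add your own subnet and security groups if you need them.

```shell
# Values to edit; the variable names are illustrative, not AWS CLI settings.
EC2_KEY_NAME="my-key-pair"               # name of an existing EC2 key pair
NOTEBOOK_DIR="s3://your-bucket/folder/"  # where notebooks are stored
INSTANCE_TYPE="m3.xlarge"                # change to suit needs and budget

# Trimmed EC2 attributes: key pair and instance profile only.
EC2_ATTRIBUTES=$(printf '{"KeyName":"%s","InstanceProfile":"EMR_EC2_DefaultRole"}' \
  "$EC2_KEY_NAME")

# Same layout as the example: two core nodes and one master node.
INSTANCE_GROUPS=$(printf '[{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"%s","Name":"Core - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"%s","Name":"Master - 1"}]' \
  "$INSTANCE_TYPE" "$INSTANCE_TYPE")

# Print for review; these would be passed as
#   --ec2-attributes "$EC2_ATTRIBUTES" --instance-groups "$INSTANCE_GROUPS"
# with "$NOTEBOOK_DIR" following "--notebook-dir" in the bootstrap Args.
echo "$EC2_ATTRIBUTES"
echo "$INSTANCE_GROUPS"
echo "$NOTEBOOK_DIR"
```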


@keplerCP7

Hi yuanzhaoYZ, is there a way to connect it to Apache Livy to manage the EMR Spark cluster?
