Update 2019-10-08:
Unfortunately, this script can no longer run successfully as a bootstrap action. On the bright side, you can run it as a step, so if you execute it before all other steps, you can still treat it as a "bootstrap". The instructions below have been updated to reflect this.
- You will first have to download the gist to a file and then upload it to S3 in a bucket of your choice.
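A minimal sketch of that upload, assuming the AWS CLI is configured; the bucket name is a placeholder (it matches the example script location used further below):

```sh
# Save the gist locally, then copy it to your own bucket.
# "my-bucket" and the emr/bootstrap/ prefix are placeholders; use your own.
aws s3 cp install-rstudio-server.sh s3://my-bucket/emr/bootstrap/install-rstudio-server.sh
```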
- Using the AWS EMR Console, create a cluster and choose advanced options.
- In Step 1, make sure you check the Spark x.x.x checkbox if you want to make use of the sparklyr library in RStudio. You can customize the Spark version by choosing a different EMR release version.
- Add a step by selecting Custom JAR and clicking Configure.
- For the Name you can fill in something like Install RStudio Server.
- For JAR location fill in something like `s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar`. If you are not running in us-east-1, change the region accordingly.
- As Arguments, add the following (a complete CLI example follows the argument list below):
  - Something like `s3://my-bucket/emr/bootstrap/install-rstudio-server.sh`. This is mandatory; it is the location of the script on S3. The EMR cluster must have permissions to read from that location.
  - `--sd-version` - optional, default is 1.1.463. The script downloads the artefact from the daily builds bucket. You can use a CLI command like `aws s3 ls s3://rstudio-dailybuilds/rstudio-` to check which versions are available.
  - `--sd-user` - optional, defaults to drwho. RStudio Server needs a real system user; the script creates one as part of the bootstrap process.
  - `--sd-pass` - optional, defaults to tardis. This is the password for the user specified above. If you are going to use the default credentials, make sure the EMR cluster is not Internet accessible, as this could be a serious security vulnerability.
  - `--spark-version` - optional, defaults to 2.4.3. sparklyr, which is installed as part of the bootstrap process, needs a locally downloaded version of Spark. You should make sure that this version matches the Spark version installed on the cluster. This is only relevant if you are actually going to use the sparklyr capabilities.

| EMR release | --spark-version |
|-------------|-----------------|
| 4.0.0 | 1.4.1 |
| 4.1.0 | 1.5.0 |
| 4.2.0 | 1.5.2 |
| 4.3.0 | 1.6.0 |
| ..... | ..... |
| 4.5.0 | 1.6.1 |
| ..... | ..... |
| 4.7.2 | 1.6.2 |
| ..... | ..... |
| 5.0.0 | 2.0.0 |
| 5.0.3 | 2.0.1 |
| ..... | ..... |
| 5.2.0 | 2.0.2 |
| ..... | ..... |
| 5.3.0 | 2.1.0 |
| ..... | ..... |
| 5.6.0 | 2.1.1 |
| ..... | ..... |
| 5.8.0 | 2.2.0 |
| ..... | ..... |
| 6.0.0 | 2.4.3 (default) |
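For reference, the whole step can also be added from the AWS CLI. This is only a sketch: the cluster id is a placeholder, the script location matches the example above, and the argument values shown are the defaults:

```sh
# Add the install step to a running cluster; run it before any other steps.
# j-XXXXXXXXXXXXX is a placeholder cluster id; adjust the region and S3 paths.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=Install RStudio Server,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://my-bucket/emr/bootstrap/install-rstudio-server.sh,--sd-user,drwho,--sd-pass,tardis,--spark-version,2.4.3]'
```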
- After the cluster has started, you will need to access your cluster's master address and specify port 8787. RStudio Server is only available on the master instance. Depending on where your cluster is launched, you might need to establish a tunnel/proxy connection.
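A minimal tunnel sketch over SSH, assuming the key pair used to launch the cluster and the master's public DNS name (both placeholders here):

```sh
# Forward local port 8787 to RStudio Server on the EMR master node.
# Replace the key path and host with your own values.
ssh -i ~/my-emr-key.pem -N -L 8787:localhost:8787 \
  hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
# Then open http://localhost:8787 in your browser.
```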
- After logging in using the default/custom credentials provided, you can connect to the Spark cluster with the following script:
```r
library(sparklyr)
library(dplyr)

# Connect to the cluster through YARN
sc <- spark_connect(master = "yarn-client")
```
Take a look at my other R-related gists:
This works exactly as documented. Now that Spark 2.0 has been released, I updated its version in the script and also updated to the latest RStudio version available on S3 (it's 1.0.19 as of today, Sept 19, 2016). It took a long time to load (about 30 minutes with 1 master and 2 core nodes), although I think the majority of the time was taken by the pre-bootstrapping steps (installing Hadoop, Spark, Hive et al.).
After the cluster came up, I logged into RStudio on the master node (port 8787) using the user id and password I had provided in the script. RStudio came up in a snap. Very exciting.
I also believe we can connect to Spark using Spark "local" mode on the master node, like:

```r
library(sparklyr)

# Connect to a local Spark instance on the master node
sc <- spark_connect(master = "local")
```