AWS EMR Apache Drill using SSM and not bootstrap

Apache Drill on EMR

Taking over the management of an EMR cluster that is created on demand and terminated after use, I set about "tidying it up". Part of the tidy-up was simplifying the installation of Apache Drill.
Any improvements gratefully accepted.

Approach

I abandoned the bootstrap-script approach to installing Drill because I wanted to use the cluster's default installation of ZooKeeper.
Using SSM also meant that I could install and tune Drill after the cluster was created, without the risk of a failing bootstrap action blocking cluster creation.

Step 1: Install SSM

AWS states that the SSM agent is pre-installed on "Amazon Linux base AMIs dated 2017.09 and later" (https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-install-ssm-agent.html).
I install it explicitly with the script below, run as a bootstrap action when the EMR cluster is created ("bootstrap_for_ssm.bash"):

#!/usr/bin/env bash

# will need ssm to start drill
cd /tmp
sudo yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_386/amazon-ssm-agent.rpm

Once the cluster is ready, the remaining steps are driven from Python using boto3 and SSM.
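To block until the cluster is ready, one option is boto3's built-in EMR waiter. A minimal sketch, assuming job_flow_id holds the ClusterId returned when the cluster was created:

import boto3

emr = boto3.client('emr')
# Polls describe_cluster until the cluster reaches RUNNING or WAITING
emr.get_waiter('cluster_running').wait(ClusterId=job_flow_id)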

Step 2: Configure ZooKeeper with SSM

We need to update zoo.cfg on each node, passing in the instance details, to meet the prerequisite "(Required) Running a ZooKeeper quorum". The format is described here: https://zookeeper.apache.org/doc/r3.1.2/zookeeperStarted.html#sc_RunningReplicatedZooKeeper
To build the server-list string with boto3:

emr = boto3.client('emr')
# one "server.N=<private-ip>:2888:3888" line per cluster instance, wrapped in quotes for the shell
zk_server = '"' + '\n'.join(['server.{0}={1}:2888:3888'.format(n + 1, ele['PrivateIpAddress']) for n, ele in enumerate(emr.list_instances(ClusterId=job_flow_id)['Instances'])]) + '"'

We update the zoo.cfg using the script "zookeeper_config.bash":

#!/usr/bin/env bash

# arguments
zk_server="$1"
echo "zk_server: $zk_server"

# Set ZooKeeper zoo.cfg
cat << EOF | tee /tmp/zoo.cfg
tickTime=2000
initLimit=100
syncLimit=5
dataDir=/tmp/zookeeper
clientPort=2181
$zk_server
EOF

mkdir -p /tmp/zookeeper

# over-write default zoo.cfg
sudo cp /tmp/zoo.cfg /usr/lib/zookeeper/conf/zoo.cfg
sudo cp /tmp/zoo.cfg /etc/zookeeper/conf/zoo.cfg

# restart zookeeper
sudo /usr/lib/zookeeper/bin/zkServer.sh restart

To send the SSM command we need the list of instance IDs for this cluster / job_flow_id:

cluster_instances = [ele['Ec2InstanceId'] for ele in emr.list_instances(ClusterId=job_flow_id)['Instances']]

Upload "zookeeper_config.bash" to S3 (a sketch of the upload follows).
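A minimal sketch of the upload, assuming a hypothetical bucket name and key:

s3 = boto3.client('s3')
# bucket and key are placeholders - use the same path referenced in sourceInfo below
s3.upload_file('zookeeper_config.bash', 'my-emr-scripts', 'scripts/zookeeper_config.bash')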
We can now send this script to the instances using SSM (installed in Step 1):

ssm = boto3.client('ssm')
# send command to nodes
zookeeper_start = ssm.send_command(
    InstanceIds=cluster_instances,
    DocumentName='AWS-RunRemoteScript', Comment='Configure Zookeeper',
    Parameters={
        'sourceType':['S3'],
        'sourceInfo':['{"path": "https://s3....zookeeper_confg.bash"}'],
        'commandLine':['zookeeper_config.bash {0}'.format(zk_server)]
    },
    OutputS3Region='...',
    OutputS3BucketName='...',
    OutputS3KeyPrefix='...')

Use a wait loop to block until the SSM command has completed on every instance; a sketch follows.
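A minimal sketch of such a loop, using the zookeeper_start response from above and boto3's list_commands to poll the aggregate command status:

import time

command_id = zookeeper_start['Command']['CommandId']
# Poll the overall command status until it leaves the in-flight states
while True:
    status = ssm.list_commands(CommandId=command_id)['Commands'][0]['Status']
    if status not in ('Pending', 'InProgress', 'Cancelling'):
        break
    time.sleep(10)
# status is now Success, Failed, TimedOut, or Cancelled - check it before moving on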

Step 3: Install Drill using SSM

Use the same SSM RunRemoteScript call as above to run the Drill install script as a remote script (a sketch of the call follows). I pass in the memory settings from Python (to set the heap and the maximum direct memory) and an already-built $zk_list ("zkhostname1:port,zkhostname2:port,zkhostname3:port").
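A minimal sketch of that call, under a few assumptions: the memory values and S3 paths are placeholders, the variable names DRILL_HEAP and DRILL_MAX_DIRECT_MEMORY match the drill-env.sh template further down, and zk_list is built from the same instance listing used earlier with ZooKeeper's default client port 2181:

# Build the ZooKeeper connection string: "host1:2181,host2:2181,..."
zk_list = ','.join(
    '{0}:2181'.format(ele['PrivateIpAddress'])
    for ele in emr.list_instances(ClusterId=job_flow_id)['Instances']
)

# Placeholder memory settings - tune to the instance type in use
drill_memory_heap = 'export DRILL_HEAP="8G"'
drill_memory_max = 'export DRILL_MAX_DIRECT_MEMORY="16G"'

drill_install = ssm.send_command(
    InstanceIds=cluster_instances,
    DocumentName='AWS-RunRemoteScript', Comment='Install Apache Drill',
    Parameters={
        'sourceType': ['S3'],
        'sourceInfo': ['{"path": "https://s3....bootstrap_drill.bash"}'],
        'commandLine': ["bootstrap_drill.bash '{0}' '{1}' {2}".format(
            drill_memory_heap, drill_memory_max, zk_list)]
    },
    OutputS3Region='...',
    OutputS3BucketName='...',
    OutputS3KeyPrefix='...')

"bootstrap_drill.bash" itself looks like this: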

#!/usr/bin/env bash

# Bootstrap Apache Drill

# arguments
drill_memory_heap="$1"
drill_memory_max="$2"
zk_list="$3"

echo "drill_memory_heap: $drill_memory_heap"
echo "drill_memory_max: $drill_memory_max"
echo "zk_list: $zk_list"

# Distributed Mode Prerequisites
# You can install Apache Drill on one or more nodes to run it in a clustered environment.

# Prerequisites
# Before you install Drill on nodes in a cluster, ensure that the cluster meets the following prerequisites:

# (Required) Running Oracle JDK version 7 or version 8 if running Drill 1.6 or later.
java -version

# (Required) Running a ZooKeeper quorum
# cluster id
cluster_id=$( cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId" )
echo "cluster_id: $cluster_id"

# (Recommended) Running a Hadoop cluster
# (Recommended) Using DNS

# Download the latest version of Apache Drill here or from the Apache Drill mirror site with the command appropriate for your system:
sudo curl -o /tmp/drill.tar.gz http://www-eu.apache.org/dist/drill/drill-1.12.0/apache-drill-1.12.0.tar.gz --connect-timeout 240 --max-time 300 --retry 5

# Extract the tarball to the directory of your choice, such as /opt:
sudo mkdir -p /opt/drill
sudo tar -xzvf /tmp/drill.tar.gz -C /opt/drill --strip-components 1
sudo rm -f /tmp/drill.tar.gz

# update /opt/drill/conf/drill-override.conf
# Add the Drill cluster ID and the ZooKeeper host names and port numbers to
# configure the connection to your ZooKeeper quorum.
cat << EOF | tee /tmp/drill-override.conf
drill.exec: {
  cluster-id: "$cluster_id",
  buffer.size: 1000,
  zk.connect: "$zk_list",
  impersonation: { enabled: false },
  profiles.store.inmemory: true
}
EOF

# back up original /opt/drill/conf/drill-override.conf
sudo cp /opt/drill/conf/drill-override.conf /opt/drill/conf/drill-override_1.conf
# update /opt/drill/conf/drill-override.conf to final location
sudo cp /tmp/drill-override.conf /opt/drill/conf/drill-override.conf

# update core-site.xml
cat << EOF | tee /tmp/core-site.xml
<configuration>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>50000</value>
  </property>
</configuration>
EOF

# back up original /opt/drill/conf/core-site.xml
sudo cp /opt/drill/conf/core-site.xml /opt/drill/conf/core-site_1.xml
# copy /tmp/core-site.xml to its final location
sudo cp /tmp/core-site.xml /opt/drill/conf/core-site.xml

# copy in the jdbc for sqlserver
aws s3 cp s3://.../jars/sqljdbc42.jar /opt/drill/jars/3rdparty/sqljdbc42.jar

sudo chown -R hadoop:hadoop /opt/drill/

# update drill-env.sh
cat << EOF | tee /tmp/drill-env.sh
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#-----------------------------------------------------------------------------
# This file provides a variety of site-specific settings to control Drill
# launch settings. These are settings required when launching the Drillbit
# or sqlline processes using Java. Some settings are for both, some for one
# or the other.
#
# Variables may be set in one of four places:
#
#   Environment (per run)
#   drill-env.sh (this file, per site)
#   distrib-env.sh (per distribution)
#   drill-config.sh (Drill defaults)
#
# Properties "inherit" from items lower on the list, and may be "overridden" by items
# higher on the list. In the environment, just set the variable:
#
#   export FOO=value
#
# To support inheritance from the environment, you must set values as shown below:
#
#   export FOO=${FOO:-"value"}
#
# or a more specialized form.
# Amount of heap memory for the Drillbit process. Values are those supported by
# the Java -Xms option. The default is 4G.
$drill_memory_heap
# Maximum amount of direct memory to allocate to the Drillbit in the format
# supported by -XX:MaxDirectMemorySize. Default is 8G.
$drill_memory_max
# Value for the JVM -XX:MaxPermSize option for the Drillbit. Default is 512M.
#export DRILLBIT_MAX_PERM=${DRILLBIT_MAX_PERM:-"512M"}
# Native library path passed to Java. Note: use this form instead
# of the old form of DRILLBIT_JAVA_OPTS="-Djava.library.path=<dir>"
# The old form is not compatible with Drill-on-YARN.
# export DRILL_JAVA_LIB_PATH="<lib1>:<lib2>"
# Value for the code cache size for the Drillbit. Because the Drillbit generates
# code, it benefits from a large cache. Default is 1G.
#export DRILLBIT_CODE_CACHE_SIZE=${DRILLBIT_CODE_CACHE_SIZE:-"1G"}
# Provide a customized host name for when the default mechanism is not accurate
#export DRILL_HOST_NAME=`hostname`
# Base name for Drill log files. Files are named ${DRILL_LOG_NAME}.out, etc.
# DRILL_LOG_NAME="drillbit"
# Location to place Drill logs. Set to $DRILL_HOME/log by default.
#export DRILL_LOG_DIR=${DRILL_LOG_DIR:-$DRILL_HOME/conf}
# Location to place the Drillbit pid file when running as a daemon using
# drillbit.sh start.
# Set to $DRILL_HOME by default.
#export DRILL_PID_DIR=${DRILL_PID_DIR:-$DRILL_HOME}
# Custom JVM arguments to pass to the both the Drillbit and sqlline. Typically
# used to override system properties as shown below. Empty by default.
#export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Dproperty=value"
# As above, but only for the Drillbit. Empty by default.
#export DRILLBIT_JAVA_OPTS="$DRILLBIT_JAVA_OPTS -Dproperty=value"
# Process priority (niceness) for the Drillbit when running as a daemon.
# Defaults to 0.
#export DRILL_NICENESS=${DRILL_NICENESS:-0}
# Custom class path for Drill. In general, you should put your custom libraries into
# your site directory's jars subfolder ($DRILL_HOME/conf/jars by default, but can be
# customized with DRILL_CONF_DIR or the --config argument. But, if you must reference
# jar files in other locations, you can add them here. These jars are added to the
# Drill classpath after all Drill-provided jars. Empty by default.
# custom="/your/path/here:/your/second/path"
# if [ -z "$DRILL_CLASSPATH" ]; then
#   export DRILL_CLASSPATH=${DRILL_CLASSPATH:$custom}
# else
#   export DRILL_CLASSPATH="$custom"
# fi
# Extension classpath for things like HADOOP, HBase and so on. Set as above.
# EXTN_CLASSPATH=...
# Note that one environment variable can't be set here: DRILL_CONF_DIR.
# That variable tells Drill the location of this file, so this file can't
# set it. Instead, you can set it in the environment, or using the
# --config option of drillbit.sh or sqlline.
#-----------------------------------------------------------------------------
# The following are "advanced" options seldom used except when diagnosing
# complex issues.
#
# The prefix class path appears before any Drill-provided classpath entries.
# Use it to override Drill jars with specialized versions.
#export DRILL_CLASSPATH_PREFIX=...
# Enable garbage collection logging in the Drillbit. Logging goes to
# $DRILL_LOG_DIR/drillbit.gc. A value of 1 enables logging, all other values
# (including the default unset value) disables logging.
#export SERVER_LOG_GC=${SERVER_LOG_GC:-1}
# JVM options when running the sqlline Drill client. For example, adjust the
# JVM heap memory here. These are used ONLY in non-embedded mode; these
# are client-only settings. (The Drillbit settings are used when Drill
# is embedded.)
#export SQLLINE_JAVA_OPTS="-XX:MaxPermSize=512M"
# Arguments passed to sqlline (the Drill shell) at all times: whether
# Drill is embedded in Sqlline or not.
#export DRILL_SHELL_JAVA_OPTS="..."
# Location Drill should use for temporary files, such as downloaded dynamic UDFs jars.
# Set to "/tmp" by default.
#
# export DRILL_TMP_DIR="..."
# Block to put environment variable known to both Sqlline and Drillbit, but needs to be
# differently set for both. OR set for one and unset for other.
#
# if [ "$DRILLBIT_CONTEXT" = "1" ]; then
#   Set environment variable value to be consumed by Drillbit
# else
#   Set environment variable value to be consumed by Sqlline
# fi
#
EOF

# back up original /opt/drill/conf/drill-env.sh
sudo cp /opt/drill/conf/drill-env.sh /opt/drill/conf/drill-env_1.sh
# copy /tmp/drill-env.sh to its final location
sudo cp /tmp/drill-env.sh /opt/drill/conf/drill-env.sh

# Start Drill in Distributed Mode
sudo /opt/drill/bin/drillbit.sh restart

Use the same wait loop as before to block until the SSM command completes on every instance. Hopefully you are now good to go.
