@timrobertson100
Last active February 8, 2024 06:08
Spark 2.4 on CDH 5.12
Based on the ideas in the article below, expanded to enable Hive support:
https://www.linkedin.com/pulse/running-spark-2xx-cloudera-hadoop-distro-cdh-deenar-toraskar-cfa/
wget https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-without-hadoop.tgz
tar -xvzf spark-2.4.8-bin-without-hadoop.tgz
cd spark-2.4.8-bin-without-hadoop
cp -R /etc/spark2/conf/* conf/
cp /etc/hive/conf/hive-site.xml conf/
sed -i "s#\(.*SPARK_HOME\)=.*#\1=$(pwd)#" conf/spark-env.sh
sed -i 's/spark.master=yarn-client/spark.master=yarn/' conf/spark-defaults.conf
sed -i '/spark.yarn.jar/d' conf/spark-defaults.conf
sed -i '/NavigatorAppListener/d' conf/spark-defaults.conf
sed -i '/NavigatorQueryListener/d' conf/spark-defaults.conf
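# As a sanity check, the effect of the SPARK_HOME rewrite can be seen on a
# throwaway copy of the file (the old parcel path below is only a placeholder):
tmp=$(mktemp -d) && cd "$tmp"
printf 'export SPARK_HOME=/some/old/parcel/path\n' > spark-env.sh
# the \1 back-reference keeps everything up to '=' and swaps in the current dir
sed -i "s#\(.*SPARK_HOME\)=.*#\1=$(pwd)#" spark-env.sh
cat spark-env.sh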
# We don't set spark.yarn.jars or spark.yarn.archive, so the local jars/ folder goes on the classpath
# It is uploaded on each job submission, rather than read from a shared HDFS folder as CDH does
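An alternative (not used in this gist) is to stage a zip of the jars folder in HDFS once and point `spark.yarn.archive` at it, which avoids the per-job upload; the HDFS path here is purely illustrative:

```
# spark-defaults.conf -- illustrative alternative, path is an example
spark.yarn.archive    hdfs:///user/spark/share/spark-2.4.8-jars.zip
```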
cd jars
# download the correct jline (https://issues.apache.org/jira/browse/SPARK-25783)
wget https://repo1.maven.org/maven2/jline/jline/2.14.3/jline-2.14.3.jar
# bring down the spark hive jars
wget https://repo1.maven.org/maven2/org/apache/spark/spark-hive_2.11/2.4.8/spark-hive_2.11-2.4.8.jar
wget https://repo1.maven.org/maven2/org/spark-project/hive/hive-exec/1.2.1.spark2/hive-exec-1.2.1.spark2.jar
cd ..
# copy transitive dependency jars from the Cloudera parcel, required for Hive
# (adjust the parcel version in the path below to match your install)
cp /opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/jars/hive-* jars/
cp /opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/jars/libthrift-0.9.3.jar jars/
cp /opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/jars/commons*.jar jars/
cd ..
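# One caveat worth flagging: the "-bin-without-hadoop" distribution needs the
# cluster's Hadoop jars supplied via SPARK_DIST_CLASSPATH. The spark-env.sh
# copied from /etc/spark2/conf normally sets this on CDH; the snippet below
# (run against a scratch conf dir, since a real cluster isn't assumed here)
# shows the guard you could add if yours doesn't:
tmp=$(mktemp -d) && mkdir -p "$tmp/conf"
printf '# minimal spark-env.sh without the variable\n' > "$tmp/conf/spark-env.sh"
# append SPARK_DIST_CLASSPATH only if it is not already set in the file
grep -q SPARK_DIST_CLASSPATH "$tmp/conf/spark-env.sh" || \
  printf 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)\n' >> "$tmp/conf/spark-env.sh"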
# now run the spark-shell
./bin/spark-shell
# verify Hive support works with
scala> val a = spark.sql("show databases");
a: org.apache.spark.sql.DataFrame = [databaseName: string]
scala> a.show
+---------------+
| databaseName|
+---------------+
| analytics|
... etc