@chicagobuss
Last active June 11, 2022 11:53
How to get Spark 1.6.0 with Hadoop 2.6 working with S3
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <description>AWS access key ID. Omit for Role-based authentication.</description>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <description>AWS secret key. Omit for Role-based authentication.</description>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
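The same credentials can also be set at runtime on the SparkContext's Hadoop configuration instead of in the Hadoop XML config above. A minimal sketch (the key values are placeholders, as in the XML):

# Minimal sketch, not from the original gist: setting the s3a keys at runtime
# on the live Hadoop configuration instead of in the XML config.
from pyspark import SparkContext

sc = SparkContext()
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")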
#!/usr/bin/env bash
export DEFAULT_HADOOP_HOME=/usr/lib/hadoop
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=/usr/lib/spark/python/:/usr/lib/spark/python/lib/py4j-0.9-src.zip
export SPARK_DIST_CLASSPATH=/usr/lib/spark/conf/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/client/*:/usr/lib/spark/lib/*
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zoo1:2181,zoo2:2181,zoo3:2181"
export STANDALONE_SPARK_MASTER_HOST="spark-master1,spark-master2"
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_MEMORY=4g
export SPARK_DRIVER_MEMORY=3g
export SPARK_DAEMON_MEMORY=1g
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.executor.extraClassPath /var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar:/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar
spark.driver.extraClassPath /var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar:/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar
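The executor-side settings above can also be applied from PySpark when building the context; a rough sketch below assumes the same jar paths. spark.driver.extraClassPath is kept in the properties file because, per the Spark docs, it cannot be set through SparkConf in client mode once the driver JVM has already started.

# Rough sketch, not part of the original gist: the s3a implementation and the
# executor classpath set programmatically. The driver classpath still belongs in
# spark-defaults.conf (or --driver-class-path) because the driver JVM is already
# running by the time this code executes.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .set("spark.executor.extraClassPath",
             "/var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar:"
             "/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar"))
sc = SparkContext(conf=conf)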
## Launch this with ./bin/spark-submit pyspark-s3a-example.py from /usr/lib/spark as root
## Or even better:
# ./bin/spark-submit --master=spark://spark-master-1:7077,spark-master-2:7077 pyspark-s3a-example.py
from pyspark import SparkContext

# Connect to the standalone HA masters (same URLs as the spark-submit line above).
sc = SparkContext('spark://spark-master-1:7077,spark-master-2:7077')

dataFile = "s3a://dabucket/sample.csv"
data = sc.textFile(dataFile)

# Drop the header row, then parse the third column of each remaining row as an int.
header = data.take(1)[0]
rows = data.filter(lambda line: line != header)
lines = rows.map(lambda line: int(line.split(',')[2])).collect()
print(lines)
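Writing results back out goes through the same s3a scheme; a short follow-up sketch (the output prefix is hypothetical):

# Hypothetical output prefix; persists the filtered rows back to the bucket via s3a.
rows.saveAsTextFile("s3a://dabucket/output")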
# A working Spark 1.6.0 / Hadoop 2.6 configuration for talking to S3 with s3a:
############################################################################
# First the ridiculous part - if you have any of these files, delete them.
rm ${HADOOP_HOME}/lib/aws-java-sdk-s3-1.10.6.jar
rm ${HADOOP_HOME}/lib/aws-java-sdk-core-1.10.6.jar
rm /usr/lib/hadoop/hadoop-aws-2.6.0-cdh5.7.0.jar
###################################################################
Big thanks to:
http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
https://gist.github.com/thekensta/21068ef1b6f4af08eb09
@johntmyers

Do the extraClassPath declarations actually work for you in spark.properties?

I have to put mine in spark-defaults.conf for S3A to work properly.

@chicagobuss
Author

Yes, that worked for me, but it's worth noting that I'm primarily using Spark by submitting PySpark jobs (and IPython notebooks) - maybe that's why this works for me.

@bfleming-ciena

I put all the changes in the spark-defaults.conf file.

How do my executors get the AWS jar and the Hadoop jar that I downloaded on my master node? I'm still getting:

java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

Thanks
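One possible fix (a rough sketch, not from the original thread; it assumes the jars sit at the paths above on the node that launches the job) is to let Spark ship the jars to the executors with spark.jars instead of expecting identical local copies on every worker:

# Rough sketch: spark.jars distributes the listed jars to the executors at job start.
from pyspark import SparkConf, SparkContext

conf = SparkConf().set(
    "spark.jars",
    "/var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar,/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar")
sc = SparkContext(conf=conf)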

@Vivekdjango

Hi @chicagobuss, I need your help since I am also facing a ClassNotFoundException and have tried every possible solution, but still no luck. Please help me out; it's an urgent project that I have to fix.

I have posted my question on Stack Overflow:
https://stackoverflow.com/questions/72562423/pyspark-on-jupyterhub-k8s-unable-to-query-data-class-org-apache-hadoop-fs

And on Databricks as well:

https://community.databricks.com/s/feed/0D58Y00008otNmfSAE
