How to get spark 1.6.0 with hadoop 2.6 working with s3
core-site.xml -- put your S3 credentials here (or omit both properties and use IAM role-based authentication):
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <description>AWS access key ID. Omit for Role-based authentication.</description>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <description>AWS secret key. Omit for Role-based authentication.</description>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
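If you'd rather not (or can't) edit core-site.xml, the same two s3a keys can be set at runtime on the Hadoop configuration that Spark carries. A minimal sketch, assuming a Spark 1.6-era PySpark context; the key values are placeholders, not real credentials:

from pyspark import SparkContext

sc = SparkContext('local[*]', 's3a-credentials-demo')

# Same properties as core-site.xml, set programmatically (placeholder values).
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', 'YOUR_ACCESS_KEY')
hadoop_conf.set('fs.s3a.secret.key', 'YOUR_SECRET_KEY')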
spark-env.sh -- environment for the standalone cluster (ZooKeeper-backed HA with two masters):
#!/usr/bin/env bash
export DEFAULT_HADOOP_HOME=/usr/lib/hadoop
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=/usr/lib/spark/python/:/usr/lib/spark/python/lib/py4j-0.9-src.zip
export SPARK_DIST_CLASSPATH=/usr/lib/spark/conf/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/client/*:/usr/lib/spark/lib/*
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zoo1:2181,zoo2:2181,zoo3:2181"
export STANDALONE_SPARK_MASTER_HOST="spark-master1,spark-master2"
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_MEMORY=4g
export SPARK_DRIVER_MEMORY=3g
export SPARK_DAEMON_MEMORY=1g
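A quick way to confirm the PYTHONPATH entries above took effect is to import pyspark from a plain Python shell; a sketch assuming the paths from spark-env.sh:

import py4j      # comes from the bundled py4j-0.9-src.zip on PYTHONPATH
import pyspark

print(pyspark.__file__)  # should resolve somewhere under /usr/lib/spark/python/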
spark-defaults.conf -- tell Spark to use the S3AFileSystem and put the two required jars on every JVM's classpath:
spark.hadoop.fs.s3a.impl        org.apache.hadoop.fs.s3a.S3AFileSystem
spark.executor.extraClassPath   /var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar:/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar
spark.driver.extraClassPath     /var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar:/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar
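Those classpath entries only help if the jars actually exist at those paths on every node. A tiny sanity check, with the paths copied from the config above:

import os

# Paths from spark-defaults.conf; run this on each node (driver and workers).
for jar in ('/var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar',
            '/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar'):
    print(jar, 'OK' if os.path.exists(jar) else 'MISSING')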
pyspark-s3a-example.py -- a minimal end-to-end test that reads a CSV from S3 over s3a:
## Launch this with ./bin/spark-submit pyspark-s3a-example.py from /usr/lib/spark as root
## Or even better:
## ./bin/spark-submit --master spark://spark-master-1:7077,spark-master-2:7077 pyspark-s3a-example.py
from pyspark import SparkContext

# First positional argument is the master URL; list both masters for HA standalone.
sc = SparkContext('spark://spark-master-1:7077,spark-master-2:7077')

data_file = "s3a://dabucket/sample.csv"
lines = sc.textFile(data_file)               # renamed from 'input', which shadows a builtin
header = lines.take(1)[0]                    # grab the header row
rows = lines.filter(lambda line: line != header)
values = rows.map(lambda line: int(line.split(',')[2])).collect()  # third column as ints
print(values)
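If the read succeeds, writing back through the same connector is a cheap end-to-end check, since it exercises the same credentials and jars. A hypothetical follow-on, assuming the sc and rows objects from the script above and an output prefix you can write to:

# Hypothetical output path -- saveAsTextFile fails if the prefix already exists.
rows.saveAsTextFile('s3a://dabucket/sample-out')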
# A working spark 1.6.0 / hadoop 2.6 configuration for talking to s3 with s3a:
############################################################################
# First the ridiculous part - if you have any of these files, delete them.
# They clash with the aws-java-sdk-1.7.4 / hadoop-aws-2.7.1 pair used above
# (hadoop-aws 2.7.1 was built against SDK 1.7.4), and having both on the
# classpath is a classic source of ClassNotFoundException at runtime.
rm ${HADOOP_HOME}/lib/aws-java-sdk-s3-1.10.6.jar
rm ${HADOOP_HOME}/lib/aws-java-sdk-core-1.10.6.jar
rm /usr/lib/hadoop/hadoop-aws-2.6.0-cdh5.7.0.jar
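A quick scan for leftovers after the rm commands; just a sketch, assuming the default locations used in this gist (adjust the globs if your layout differs):

import glob
import os

# Look for the conflicting jars the rm commands above are meant to remove.
hadoop_home = os.environ.get('HADOOP_HOME', '/usr/lib/hadoop')
leftovers = glob.glob(os.path.join(hadoop_home, 'lib', 'aws-java-sdk-*-1.10.6.jar'))
leftovers += glob.glob('/usr/lib/hadoop/hadoop-aws-*-cdh*.jar')
print('conflicting jars still present:', leftovers or 'none')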
###################################################################
big thanks to:
http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
https://gist.github.com/thekensta/21068ef1b6f4af08eb09
Hi @chicagobuss, I need your help: I'm also facing a ClassNotFoundException and have tried every possible solution with no luck. Please help me out; it's an urgent project that I have to fix.
I have posted my question on Stack Overflow:
https://stackoverflow.com/questions/72562423/pyspark-on-jupyterhub-k8s-unable-to-query-data-class-org-apache-hadoop-fs
And on Databricks as well:
https://community.databricks.com/s/feed/0D58Y00008otNmfSAE