<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <description>AWS access key ID. Omit for Role-based authentication.</description>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <description>AWS secret key. Omit for Role-based authentication.</description>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
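If you'd rather not hard-code credentials in core-site.xml, the same two keys can also be set at runtime from pyspark. A minimal sketch, assuming an existing SparkContext and the same placeholder keys as above; note that sc._jsc is an internal pyspark handle, not a public API:

# Sketch: set the s3a credentials on the Hadoop configuration at runtime
# instead of in core-site.xml. YOUR_ACCESS_KEY / YOUR_SECRET_KEY are the
# same placeholders used in the XML above.
from pyspark import SparkContext

sc = SparkContext(appName="s3a-credentials-example")
hadoop_conf = sc._jsc.hadoopConfiguration()  # internal handle to the JavaSparkContext
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")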
#!/usr/bin/env bash
export DEFAULT_HADOOP_HOME=/usr/lib/hadoop
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=/usr/lib/spark/python/:/usr/lib/spark/python/lib/py4j-0.9-src.zip
export SPARK_DIST_CLASSPATH=/usr/lib/spark/conf/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/client/*:/usr/lib/spark/lib/*
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zoo1:2181,zoo2:2181,zoo3:2181"
export STANDALONE_SPARK_MASTER_HOST="spark-master1,spark-master2"
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_MEMORY=4g
export SPARK_DRIVER_MEMORY=3g
export SPARK_DAEMON_MEMORY=1g
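A quick way to check that the PYTHONPATH and SPARK_HOME exported above are actually picked up is to import pyspark from a plain Python shell. A small smoke test, assuming the paths in the spark-env snippet above:

# Smoke test: with the PYTHONPATH above exported, pyspark and py4j import
# cleanly outside of bin/pyspark. This only verifies the Python side of the setup.
import pyspark
import py4j

print(pyspark.__file__)  # expect something under /usr/lib/spark/python/pyspark/
print(py4j.__file__)     # expect the py4j-0.9 zip from PYTHONPATH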
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.executor.extraClassPath /var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar:/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar
spark.driver.extraClassPath /var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar:/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar
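The same settings can also be applied through SparkConf instead of spark-defaults.conf; a sketch, with the jar paths copied from the lines above. One caveat: spark.driver.extraClassPath set this way has no effect on an already-running driver JVM, which is why spark-defaults.conf (or spark-submit flags) is the safer place for it.

from pyspark import SparkConf, SparkContext

# Sketch: the spark-defaults.conf entries above expressed as SparkConf options.
# The executor classpath set here still works because executors start later;
# the driver classpath really does need to go in spark-defaults.conf.
conf = (SparkConf()
        .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .set("spark.executor.extraClassPath",
             "/var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar:/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar"))
sc = SparkContext(conf=conf)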
## Launch this with ./bin/spark-submit pyspark-s3a-example.py from /usr/lib/spark as root
## Or even better:
# ./bin/spark-submit --master=spark://spark-master-1,spark-master-2 pyspark-s3a-example.py
from pyspark import SparkContext

# Connect to the HA standalone master pair
sc = SparkContext('spark://spark-master-1:7077,spark-master-2:7077')

dataFile = "s3a://dabucket/sample.csv"
input = sc.textFile(dataFile)

# Skip the header row, then pull the third column out as integers
header = input.take(1)[0]
rows = input.filter(lambda line: line != header)
lines = rows.map(lambda line: int(line.split(',')[2])).collect()
print(lines)
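As a small variation on the example above, the same column can be aggregated on the executors instead of collected back to the driver; a sketch using the same (placeholder) bucket path:

from pyspark import SparkContext

# Sketch: sum the third column on the cluster instead of collect()-ing every value.
sc = SparkContext(appName="s3a-sum-example")
rows = sc.textFile("s3a://dabucket/sample.csv")
header = rows.first()
total = (rows.filter(lambda line: line != header)
             .map(lambda line: int(line.split(',')[2]))
             .sum())
print(total)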
# A working spark 1.6.0 / hadoop 2.6 configuration for talking to s3 with s3a:
############################################################################
# First the ridiculous part - if you have any of these files, delete them.
rm ${HADOOP_HOME}/lib/aws-java-sdk-s3-1.10.6.jar
rm ${HADOOP_HOME}/lib/aws-java-sdk-core-1.10.6.jar
rm /usr/lib/hadoop/hadoop-aws-2.6.0-cdh5.7.0.jar
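After deleting the conflicting jars, one way to confirm the remaining hadoop-aws and aws-java-sdk jars resolve is to ask the driver JVM for the two classes s3a needs. A sketch using pyspark's internal _jvm gateway (a py4j handle, not a public API):

from pyspark import SparkContext

# Sketch: if either forName() call fails, a jar is still missing or an old
# conflicting copy is still on the classpath.
sc = SparkContext(appName="s3a-classpath-check")
jvm = sc._jvm  # internal py4j gateway into the driver JVM
print(jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem"))
print(jvm.java.lang.Class.forName("com.amazonaws.services.s3.AmazonS3Client"))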
###################################################################
big thanks to:
http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
https://gist.github.com/thekensta/21068ef1b6f4af08eb09
Yes, that worked for me, but it's worth noting that I'm primarily using Spark by submitting pyspark jobs (and IPython notebooks) - maybe that's why this works for me.
I put all the changes in the spark-defaults.conf file.
How do my executors get the AWS jar and the Hadoop jar I downloaded on my master node? I'm still getting this:
java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException
Thanks
Hi @chicagobuss, I need your help since I am also facing a ClassNotFoundException and have tried every possible solution, but still no luck. Please help me out; it's an urgent project that I have to fix.
I have posted my question on Stack Overflow:
https://stackoverflow.com/questions/72562423/pyspark-on-jupyterhub-k8s-unable-to-query-data-class-org-apache-hadoop-fs
And on Databricks as well:
Do the extraClassPath declarations actually work for you in spark.properties?
I have to put mine in spark-defaults.conf for S3A to work properly.