<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <description>AWS access key ID. Omit for Role-based authentication.</description>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <description>AWS secret key. Omit for Role-based authentication.</description>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
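If editing core-site.xml isn't an option, the same two keys can usually be set at runtime on the Hadoop configuration from PySpark instead. A minimal sketch, assuming the same master URL as the example below and placeholder credentials:

# Sketch: set the s3a keys on the Hadoop configuration at runtime instead of in core-site.xml.
# YOUR_ACCESS_KEY / YOUR_SECRET_KEY are placeholders, as above.
from pyspark import SparkContext

sc = SparkContext('spark://spark-master-1:7077,spark-master-2:7077')

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

# Any s3a:// path should now resolve with these credentials, e.g.:
# sc.textFile("s3a://dabucket/sample.csv").count()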
#!/usr/bin/env bash
export DEFAULT_HADOOP_HOME=/usr/lib/hadoop
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=/usr/lib/spark/python/:/usr/lib/spark/python/lib/py4j-0.9-src.zip
export SPARK_DIST_CLASSPATH=/usr/lib/spark/conf/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/client/*:/usr/lib/spark/lib/*
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zoo1:2181,zoo2:2181,zoo3:2181"
export STANDALONE_SPARK_MASTER_HOST="spark-master1,spark-master2"
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_MEMORY=4g
export SPARK_DRIVER_MEMORY=3g
export SPARK_DAEMON_MEMORY=1g
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.executor.extraClassPath /var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar:/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar
spark.driver.extraClassPath /var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar:/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar
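The same properties can also be passed programmatically through a SparkConf rather than spark-defaults.conf; a rough sketch with the same jar paths is below. Note that spark.driver.extraClassPath generally has to be known before the driver JVM starts, so for the driver it really belongs in spark-defaults.conf or on the spark-submit command line.

# Sketch: the same settings as spark-defaults.conf, set via SparkConf.
# Jar paths mirror the ones above; adjust for your layout.
from pyspark import SparkConf, SparkContext

aws_jars = ("/var/lib/hadoop/lib/aws-java-sdk-1.7.4.jar:"
            "/var/lib/hadoop/lib/hadoop-aws-2.7.1.jar")

conf = (SparkConf()
        .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .set("spark.executor.extraClassPath", aws_jars)
        # driver.extraClassPath set here only takes effect if the driver JVM
        # hasn't started yet; otherwise keep it in spark-defaults.conf
        .set("spark.driver.extraClassPath", aws_jars))

sc = SparkContext(conf=conf)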
## Launch this with ./bin/spark-submit pyspark-s3a-example.py from /usr/lib/spark as root
## Or even better:
# ./bin/spark-submit --master=spark://spark-master-1:7077,spark-master-2:7077 pyspark-s3a-example.py
from pyspark import SparkContext

# Connect to the standalone HA master pair (both masters listed, comma-separated)
sc = SparkContext('spark://spark-master-1:7077,spark-master-2:7077')

dataFile = "s3a://dabucket/sample.csv"
input = sc.textFile(dataFile)

# Drop the header row, then pull the third column out of each remaining line as an int
header = input.take(1)[0]
rows = input.filter(lambda line: line != header)
lines = rows.map(lambda line: int(line.split(',')[2])).collect()
print lines
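If the read works, writes through s3a can be smoke-tested the same way. A one-line sketch; the output prefix is an assumption, not part of the gist:

# Sketch: write the extracted column back out to confirm s3a writes also work.
sc.parallelize(lines).saveAsTextFile("s3a://dabucket/sample-output")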
# A working Spark 1.6.0 / Hadoop 2.6 configuration for talking to S3 with s3a:
############################################################################
# First the ridiculous part - if you have any of these files, delete them.
# (They conflict with the aws-java-sdk-1.7.4 / hadoop-aws-2.7.1 jars referenced above.)
rm ${HADOOP_HOME}/lib/aws-java-sdk-s3-1.10.6.jar
rm ${HADOOP_HOME}/lib/aws-java-sdk-core-1.10.6.jar
rm /usr/lib/hadoop/hadoop-aws-2.6.0-cdh5.7.0.jar
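A quick way to confirm the remaining jars are actually visible to the driver is to resolve the S3A class through py4j. A small sketch, assuming a running SparkContext named sc; it should throw a Py4JJavaError if the class isn't on the classpath:

# Sketch: check that S3AFileSystem resolves from the driver JVM.
sc._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
print "S3AFileSystem found on the driver classpath"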
###################################################################
Big thanks to:
http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
https://gist.github.com/thekensta/21068ef1b6f4af08eb09
Yes, that worked for me, but it's worth noting that I'm primarily using Spark by submitting PySpark jobs (and IPython notebooks) - maybe that's why this works for me.
I put all the changes in the spark-defaults.conf file.
How do my executors get the AWS jar and the Hadoop jar I downloaded on my master node? I'm still getting:
java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException
Thanks
Hi @chicagobuss, I need your help since I am also facing a ClassNotFoundException and have tried every possible solution with no luck. Please help me out; it's an urgent project that I have to fix.
I have posted my question on Stack Overflow:
https://stackoverflow.com/questions/72562423/pyspark-on-jupyterhub-k8s-unable-to-query-data-class-org-apache-hadoop-fs
And on Databricks as well:
Do the extraClassPath declarations actually work for you in spark.properties?
I have to put mine in spark-defaults.conf for S3A to work properly.