Spark 2.0.0 and Hadoop 2.7 with s3a setup

Standalone Spark 2.0.0 with s3

Tested with:

  • Spark 2.0.0 pre-built for Hadoop 2.7
  • Mac OS X 10.11
  • Python 3.5.2

Goal

Use s3 within pyspark with minimal hassle.
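
As a sketch of the end state, once the configuration below is in place, reading from s3 in pyspark should look like this (the bucket and object path are placeholders):

# Run inside the pyspark shell; `spark` is the SparkSession that Spark 2.0 provides.
# The bucket name and key below are placeholders, not real locations.
df = spark.read.csv("s3a://your-bucket/path/to/file.csv")
df.show(5)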

Load required libraries

If $SPARK_HOME/conf/spark-defaults.conf does not exist, create it by copying $SPARK_HOME/conf/spark-defaults.conf.template.

In $SPARK_HOME/conf/spark-defaults.conf include:

spark.jars.packages                com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
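
On the next pyspark launch, Spark resolves these two packages through Ivy and adds the jars to the classpath. As a quick sanity check that the setting was actually picked up from spark-defaults.conf, you can read it back inside pyspark (a minimal sketch):

# Should print the packages line configured above.
print(sc.getConf().get("spark.jars.packages"))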

AWS Credentials

In $SPARK_HOME/conf/hdfs-site.xml include:

<?xml version="1.0"?>
<configuration>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_KEY_HERE</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_HERE</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_KEY_HERE</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_HERE</value>
</property>
</configuration>
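
To confirm that the keys in hdfs-site.xml are visible to Hadoop, you can read them back off the Hadoop configuration from inside pyspark (an optional check, sketched here):

# Prints the access key configured above; prints None if the file was not picked up.
print(sc._jsc.hadoopConfiguration().get("fs.s3a.access.key"))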

Notes

Per HADOOP-12420, the rule for s3a, now and in the future, is to use a consistent version of the Amazon libraries with which Hadoop was built.

With a future version of Spark built against Hadoop 2.8, you should be able to use the smaller aws-java-sdk-s3 artifact instead of the full aws-java-sdk.

Things that didn't work

  1. Defining aws_access_key_id and aws_secret_access_key in ~/.aws/credentials, e.g.:

    [default]
    aws_access_key_id=YOUR_KEY_HERE
    aws_secret_access_key=YOUR_SECRET_HERE
    
    [profile_foo]
    aws_access_key_id=YOUR_KEY_HERE
    aws_secret_access_key=YOUR_SECRET_HERE
    
  2. Setting the PYSPARK_SUBMIT_ARGS environment variable, e.g.:

    export PYSPARK_SUBMIT_ARGS="--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell" 

Things that also worked but were less optimal

  1. Calling pyspark with the --packages argument:

    pyspark --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
  2. Defining AWS credentials in code, e.g.:

    # Set these before the first s3a access; Hadoop caches the FileSystem it creates.
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_KEY_HERE")
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_HERE")