### Tested with:
- Spark 2.0.0 pre-built for Hadoop 2.7
- Mac OS X 10.11
- Python 3.5.2
Use S3 within pyspark with minimal hassle.
If `$SPARK_HOME/conf/spark-defaults.conf` does not exist, create a copy from `$SPARK_HOME/conf/spark-defaults.conf.template`.

In `$SPARK_HOME/conf/spark-defaults.conf`, include:

    spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2

In `$SPARK_HOME/conf/hdfs-site.xml`, include (note: no stray whitespace inside the `<value>` elements):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_KEY_HERE</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_SECRET_HERE</value>
      </property>
      <property>
        <name>fs.s3n.awsAccessKeyId</name>
        <value>YOUR_KEY_HERE</value>
      </property>
      <property>
        <name>fs.s3n.awsSecretAccessKey</name>
        <value>YOUR_SECRET_HERE</value>
      </property>
    </configuration>

Per HADOOP-12420, the rule for making s3a work, now and in the future, is to "use a consistent version of the amazon libraries with which hadoop was built". With a future version of Spark built against Hadoop 2.8, you should be able to use aws-sdk-s3.

Other options include:
- Defining `aws_access_key_id` and `aws_secret_access_key` in `~/.aws/credentials`, e.g.:

        [default]
        aws_access_key_id=YOUR_KEY_HERE
        aws_secret_access_key=YOUR_SECRET_HERE

        [profile_foo]
        aws_access_key_id=YOUR_KEY_HERE
        aws_secret_access_key=YOUR_SECRET_HERE
- Setting the `PYSPARK_SUBMIT_ARGS` environment variable, e.g.:

        export PYSPARK_SUBMIT_ARGS="--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell"
- Calling `pyspark` with the `--packages` argument:

        pyspark --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
- Defining AWS credentials in code (a fuller sketch follows this list), e.g.:

        sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_KEY_HERE")
        sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_HERE")
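Tying the `PYSPARK_SUBMIT_ARGS` and in-code-credentials options together, here is a minimal sketch of a standalone script. It is not from the original: the bucket, the object key, and the use of the third-party `findspark` package are all assumptions.

```python
import os

# Assumption: make the pyspark module importable from a plain Python
# interpreter. With a pre-built Spark distribution this can be done with
# findspark (pip install findspark); alternatively add $SPARK_HOME/python
# to PYTHONPATH and skip these two lines.
import findspark
findspark.init()

# Same packages as in spark-defaults.conf; must be set before the
# SparkContext is created, and must end with "pyspark-shell".
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.amazonaws:aws-java-sdk:1.7.4,"
    "org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell"
)

from pyspark import SparkContext

sc = SparkContext(appName="s3a-example")

# Credentials defined in code, as in the last bullet above.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_KEY_HERE")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_HERE")

# "my-bucket" and "some/key.txt" are hypothetical placeholders.
print(sc.textFile("s3a://my-bucket/some/key.txt").take(5))

sc.stop()
```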