Custom S3 endpoints with Spark

To use custom S3 endpoints with the latest Spark distribution, you need to add an external package (`hadoop-aws`). Custom endpoints can then be configured according to the docs.

Use the `hadoop-aws` package

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2

The `hadoop-aws` version should match the Hadoop version your Spark distribution was built against; `2.7.2` is the version used here.

SparkContext configuration

Add this to your application, or run it in the spark-shell:

sc.hadoopConfiguration.set("fs.s3a.endpoint", "<<ENDPOINT>>")
sc.hadoopConfiguration.set("fs.s3a.access.key", "<<ACCESS_KEY>>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<<SECRET_KEY>>")

If your endpoint doesn't support HTTPS, then you'll need the following:

sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
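
The settings above can be gathered in one place so that they are applied together. The following is a minimal sketch, not part of the original instructions: `S3aSettings` and `build` are hypothetical names, and the endpoint and keys in the usage comment are placeholders.

```scala
// Hypothetical helper: collects the S3A settings from this section into one map.
object S3aSettings {
  def build(
      endpoint: String,
      accessKey: String,
      secretKey: String,
      useSsl: Boolean = true // set to false if the endpoint does not support HTTPS
  ): Map[String, String] =
    Map(
      "fs.s3a.endpoint" -> endpoint,
      "fs.s3a.access.key" -> accessKey,
      "fs.s3a.secret.key" -> secretKey,
      "fs.s3a.connection.ssl.enabled" -> useSsl.toString
    )
}

// Usage in spark-shell, where `sc` is predefined (placeholder values):
// S3aSettings.build("<<ENDPOINT>>", "<<ACCESS_KEY>>", "<<SECRET_KEY>>", useSsl = false)
//   .foreach { case (k, v) => sc.hadoopConfiguration.set(k, v) }
```

Keeping the keys in a single map makes it easier to load the values from environment variables or a secrets store instead of hard-coding them.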

S3 URL usage

You can use s3a URLs like this:

s3a://<<BUCKET>>/<<FOLDER>>/<<FILE>>

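It is also possible to put the credentials in the path, although this exposes the secret key wherever the URL is logged or displayed, and newer Hadoop releases no longer support it: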

s3a://<<ACCESS_KEY>>:<<SECRET_KEY>>@<<BUCKET>>/<<FOLDER>>/<<FILE>>
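
Once the endpoint and credentials are set on the Hadoop configuration, a plain s3a URL (without embedded credentials) can be passed to the usual Spark read methods. The sketch below assumes a hypothetical helper `S3aUrl.path` for building the URL; the bucket and key names in the usage comment are placeholders.

```scala
// Hypothetical helper: builds an s3a URL from a bucket name and an object key.
object S3aUrl {
  def path(bucket: String, key: String): String =
    s"s3a://$bucket/${key.stripPrefix("/")}" // avoid a double slash after the bucket
}

// Usage in spark-shell, where `sc` is predefined (placeholder names):
// val lines = sc.textFile(S3aUrl.path("<<BUCKET>>", "<<FOLDER>>/<<FILE>>"))
// lines.take(5).foreach(println)
```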

For S3-compatible services (like Ceph, Minio)

If you have not set up a domain name for the service, enable path-style access:

sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")

If a service does not support the default signer, use the old signer:

sc.hadoopConfiguration.set("fs.s3a.signing-algorithm", "S3SignerType")
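
These options can also be supplied as Spark properties instead of being set on `sc.hadoopConfiguration`, because Spark copies every property prefixed with `spark.hadoop.` into the Hadoop configuration. The following sketch assumes a hypothetical `S3CompatibleConf` object; it only illustrates the key mapping and is not taken from the original instructions.

```scala
// Hypothetical helper: the S3-compatible settings from this section,
// plus a conversion to "spark.hadoop.*" property names.
object S3CompatibleConf {
  val hadoopSettings: Map[String, String] = Map(
    "fs.s3a.path.style.access" -> "true",
    "fs.s3a.signing-algorithm" -> "S3SignerType"
  )

  // Prefix each Hadoop key so Spark forwards it to the Hadoop configuration.
  def asSparkProperties(settings: Map[String, String]): Map[String, String] =
    settings.map { case (k, v) => s"spark.hadoop.$k" -> v }
}

// Usage in an application (requires Spark on the classpath):
// import org.apache.spark.SparkConf
// val conf = new SparkConf()
//   .setAll(S3CompatibleConf.asSparkProperties(S3CompatibleConf.hadoopSettings))
```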

minyk/custom_s3_endpoint_in_spark.md
