@minyk
Forked from tobilg/custom_s3_endpoint_in_spark.md
Last active December 28, 2021 08:03
Description on how to use a custom S3 endpoint (like Rados Gateway for Ceph)

Custom S3 endpoints with Spark

To use custom S3 endpoints with a stock Spark distribution, you need to add an external package (hadoop-aws). Custom endpoints can then be configured according to the Hadoop S3A documentation.

Use the hadoop-aws package

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2

Note that the hadoop-aws version should match the Hadoop version your Spark build uses.

SparkContext configuration

Add this to your application, or in the spark-shell:

sc.hadoopConfiguration.set("fs.s3a.endpoint", "<<ENDPOINT>>")
sc.hadoopConfiguration.set("fs.s3a.access.key", "<<ACCESS_KEY>>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<<SECRET_KEY>>")

If your endpoint doesn't support HTTPS, then you'll need the following:

sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
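The same settings can also be supplied up front through SparkConf, using Spark's spark.hadoop.* passthrough prefix. A minimal sketch, with the same placeholders as above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Spark copies any "spark.hadoop.*" property into the Hadoop configuration,
// so this is equivalent to calling sc.hadoopConfiguration.set(...) later.
val conf = new SparkConf()
  .setAppName("custom-s3-endpoint")
  .set("spark.hadoop.fs.s3a.endpoint", "<<ENDPOINT>>")
  .set("spark.hadoop.fs.s3a.access.key", "<<ACCESS_KEY>>")
  .set("spark.hadoop.fs.s3a.secret.key", "<<SECRET_KEY>>")
val sc = new SparkContext(conf)
```

This keeps credentials out of application code paths that run after the context is created, and works the same way with spark-submit's --conf flags.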

S3 url usage

You can use s3a URLs like this:

s3a://<<BUCKET>>/<<FOLDER>>/<<FILE>>
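For example, reading a text file from such a URL (the bucket and path here are hypothetical):

```scala
// Assumes sc is configured with the endpoint and credentials above;
// bucket and object key are placeholders.
val lines = sc.textFile("s3a://my-bucket/logs/2021-12-28.log")
println(s"line count: ${lines.count()}")
```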

Also, it is possible to embed the credentials in the URL (note that this is insecure, since the secret key can leak into logs and shell history):

s3a://<<ACCESS_KEY>>:<<SECRET_KEY>>@<<BUCKET>>/<<FOLDER>>/<<FILE>>

For S3 API compatibles (like Ceph or MinIO)

If you don't have DNS set up for virtual-hosted-style bucket addressing, just enable path-style access:

sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")

If the default signing algorithm is not supported by the service, fall back to the old signer:

sc.hadoopConfiguration.set("fs.s3a.signing-algorithm", "S3SignerType")
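
Putting the pieces together for a MinIO-style deployment, a minimal sketch (host, port, bucket, and credentials are assumptions):

```scala
// Hypothetical MinIO endpoint without TLS: combine the endpoint and
// credential settings with path-style access and SSL disabled.
sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://minio.local:9000")
sc.hadoopConfiguration.set("fs.s3a.access.key", "<<ACCESS_KEY>>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<<SECRET_KEY>>")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
val data = sc.textFile("s3a://my-bucket/data/input.csv")
```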