To use custom S3 endpoints with the latest Spark distribution, you need to add an external package (hadoop-aws). Custom endpoints can then be configured as described in the Hadoop S3A documentation.
bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2
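If you build the session programmatically instead of launching spark-shell, the same dependency can be pulled in through spark.jars.packages. A minimal PySpark sketch (the app name is arbitrary; the package version matches the command above):
from pyspark.sql import SparkSession

# Pull in hadoop-aws at session start so the s3a:// filesystem is available
spark = (SparkSession.builder
         .appName("s3a-custom-endpoint")
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.2")
         .getOrCreate())
sc = spark.sparkContext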
Add this to your application, or in the spark-shell:
sc.hadoopConfiguration.set("fs.s3a.endpoint", "<<ENDPOINT>>");
sc.hadoopConfiguration.set("fs.s3a.access.key","<<ACCESS_KEY>>");
sc.hadoopConfiguration.set("fs.s3a.secret.key","<<SECRET_KEY>>");
If your endpoint doesn't support HTTPS, then you'll need the following:
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false");
You can then use s3a URLs like this:
s3a://<<BUCKET>>/<<FOLDER>>/<<FILE>>
It is also possible to embed the credentials in the path:
s3a://<<ACCESS_KEY>>:<<SECRET_KEY>>@<<BUCKET>>/<<FOLDER>>/<<FILE>>
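Putting it together, a quick read test from PySpark (the path components are placeholders; any DataFrame reader works, CSV is used here only as an example):
# Replace the placeholders with a real bucket, prefix and object name
df = spark.read.csv("s3a://<<BUCKET>>/<<FOLDER>>/<<FILE>>", header=True, inferSchema=True)
df.show(5)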
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint","https://.us-east-1.vpce.amazonaws.com")
spark._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "false")
spark._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint.region","us-east-1")
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "*****")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
df = spark.read.csv("s3a://*****/dt=2022-05-27/*",header=True,inferSchema=True)
Reading from the private bucket then fails with:
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request;
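An S3A 400 Bad Request is often caused by an endpoint, region, or request-signing mismatch. Before changing anything, it can help to print the effective values the connector will actually use; a small diagnostic sketch for the same session:
# Print the effective S3A settings in the running session (unset keys print as None)
hconf = spark._jsc.hadoopConfiguration()
for key in ("fs.s3a.endpoint", "fs.s3a.endpoint.region",
            "fs.s3a.connection.ssl.enabled", "fs.s3a.path.style.access",
            "fs.s3a.access.key"):
    print(key, "=", hconf.get(key))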