tobilg/custom_s3_endpoint_in_spark.md

Last active July 31, 2024 10:22

Star (23) You must be signed in to star a gist
Fork (11) You must be signed in to fork a gist

Select an option

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/tobilg/e03dbc474ba976b9f235.js"></script>
Save tobilg/e03dbc474ba976b9f235 to your computer and use it in GitHub Desktop.

Download ZIP

Description on how to use a custom S3 endpoint (like Rados Gateway for Ceph)

Raw

custom_s3_endpoint_in_spark.md

Custom S3 endpoints with Spark

To be able to use custom endpoints with the latest Spark distribution, one needs to add an external package (hadoop-aws). Then, custum endpoints can be configured according to docs.

Use the `hadoop-aws` package

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2

SparkContext configuration

Add this to your application, or in the spark-shell:

sc.hadoopConfiguration.set("fs.s3a.endpoint", "<<ENDPOINT>>");
sc.hadoopConfiguration.set("fs.s3a.access.key","<<ACCESS_KEY>>");
sc.hadoopConfiguration.set("fs.s3a.secret.key","<<SECRET_KEY>>");

If your endpoint doesn't support HTTPS, then you'll need the following:

sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false");

S3 url usage

You can use s3a urls like this:

s3a://<<BUCKET>>/<<FOLDER>>/<<FILE>>

Also, it is possible to use the credentials in the path:

s3a://<<ACCESS_KEY>>:<<SECRET_KEY>>@<<BUCKET>>/<<FOLDER>>/<<FILE>>

cipri-tom commented Apr 7, 2017

It was this simple! Incredible. Took me a few days to stumble upon this. Thank you !

For PySpark, you don't have direct access to sc.hadoopContext so you have to use (from here):

>>> sc._jsc.hadoopContext().set("fs.s3a.endpoint", "<<ENDPOINT>>")
>>> ...

david-wb commented Oct 18, 2017

I'm running a s3 service locally using atlassianlabs/local stack. I want to pass the local s3 endpoint to the hadoop context like this:

.set("fs.s3a.endpoint", "http://aws-local:9000")

Is it possible to do this?

Saurabh7 commented Jan 24, 2018

I have cluster in southeast region and have to read/write from both south, southeast region.
I sometimes encounter Moved Permanently (Service: Amazon S3; Status Code: 301 when I have multiple jobs running.
Any ideas on how to fix this ?

steveloughran commented Jun 6, 2018 •

edited

Loading

@Saurabh7: you need to use per bucket configs there; the newer AWS S3 locations don't work with the classic endpoint which the v4 signing API used to support

spark.hadoop.fs.s3a.asia-bucket.endpoint s3.ap-northeast-2.amazonaws.com

consult the Hadoop AWS doc for the details

suryajayanthi commented Jan 6, 2019

I'm running a s3 service locally using atlassianlabs/local stack. I want to pass the local s3 endpoint to the hadoop context like this:
.set("fs.s3a.endpoint", "http://aws-local:9000")
Is it possible to do this?

Hi David-wb,

You method of accessing s3 work?

Thanks.

rockb1017 commented Jan 28, 2019

I am using Ceph RGW and trying to get data directly from Spark. Looking at DEBUG logs, it uses bucket name as prefix of endpoint url. ex)BUCKETNAME.ceph_host:ceph_port/KEY.
on my RGW, I do not have this set up. So it is not working for me.

wuxg commented Apr 19, 2019

Hi,
Is that a way to set different endpoint for each datanode?
I am trying to set a local radosgw as s3 endpoint for each datanode.

minyk commented May 28, 2019

@rockb1017 try this configuration(https://github.com/apache/hadoop/blob/branch-2.9.2/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java#L75):

<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>

anirtek commented Dec 3, 2019

I am using spark structured streaming. So when should I set the values using sc.hadoopConfiguration()? Although I am new to spark, I am not sure if we can set these values after the spark session is created. Can someone help me our here?

ghost commented Jan 10, 2020

Idk how common this use case is, but I have a pipeline that would be greatly simplified if I was able to use multiple S3 endpoints in a single spark submit. That is, I have a minio endpoint along with a standard aws s3 endpoint. Is there a way to configure the session on the fly such that I can use both?

anirtek commented Apr 6, 2020

@Jamesthegiantpeach were you able to resolve your issue?

ysennoun commented Apr 19, 2020

Hello, fantastic work !
It works for me when I use a custom s3 (minio on Kubernetes) without TLS. But when I use my S3 with self-signed certificate TLS, event if I set the following value:

spark.hadoop.fs.s3a.endpoint=https://<my-endpoint-name>:9000

I get this exception:

Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on bucket:
com.amazonaws.SdkClientException: Unable to execute HTTP request: sun.security.validator.ValidatorException:
PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException:
unable to find valid certification path to requested target: 
Unable to execute HTTP request: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: 
unable to find valid certification path to requested target at org.apache.hadoop.fs.s3a.S3AUtils.translateException

I really don't find in Internet how to set the path to the self-signed certificate. Do you ?

minyk commented Apr 20, 2020 •

edited

Loading

@ysennoun IMO, the JVM(which runs spark) needs a custom ca cert. Googling like this: "custom certs into jvm"

shubhamprasad0 commented Nov 20, 2020

Idk how common this use case is, but I have a pipeline that would be greatly simplified if I was able to use multiple S3 endpoints in a single spark submit. That is, I have a minio endpoint along with a standard aws s3 endpoint. Is there a way to configure the session on the fly such that I can use both?

@Jamesthegiantpeach: I have a similar use case. How did you solve this problem in your case?

amityadav-alphonso commented Feb 1, 2023 •

edited

Loading

I am getting this error when I am reading S3 local instance.

The confs I have set are -

val sparkConf = new SparkConf().setAppName("testing") .set("fs.s3a.endpoint", "http://127.0.0.1:9090") .set("fs.s3a.multipart.size", "104857600) .set("fs.s3a.connection.ssl.enabled", "false") .set("fs.s3a.access.key", "adfadf") .set("fs.s3a.secret.key", "qerqer")

java.lang.IllegalArgumentException
at java.base/java.util.concurrent.ThreadPoolExecutor.(ThreadPoolExecutor.java:1293)
at java.base/java.util.concurrent.ThreadPoolExecutor.(ThreadPoolExecutor.java:1215)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:280)

BKRami commented May 22, 2023

You have to set the style because it is a different from retrieving from AWS directly
.set("fs.s3a.path.style.access", "true")

Pankajpluang123 commented Jul 31, 2024

spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint","https://.us-east-1.vpce.amazonaws.com")
spark._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "false")
spark._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint.region","us-east-1")
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "*****")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

df = spark.read.csv("s3a://*****/dt=2022-05-27/*",header=True,inferSchema=True)
error :Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request;
Private bucket

tobilg/custom_s3_endpoint_in_spark.md

Custom S3 endpoints with Spark

Use the hadoop-aws package

SparkContext configuration

S3 url usage

cipri-tom commented Apr 7, 2017

Uh oh!

david-wb commented Oct 18, 2017

Uh oh!

Saurabh7 commented Jan 24, 2018

Uh oh!

steveloughran commented Jun 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

suryajayanthi commented Jan 6, 2019

Uh oh!

rockb1017 commented Jan 28, 2019

Uh oh!

wuxg commented Apr 19, 2019

Uh oh!

minyk commented May 28, 2019

Uh oh!

anirtek commented Dec 3, 2019

Uh oh!

ghost commented Jan 10, 2020

Uh oh!

anirtek commented Apr 6, 2020

Uh oh!

ysennoun commented Apr 19, 2020

Uh oh!

minyk commented Apr 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shubhamprasad0 commented Nov 20, 2020

Uh oh!

amityadav-alphonso commented Feb 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BKRami commented May 22, 2023

Uh oh!

Pankajpluang123 commented Jul 31, 2024

Uh oh!

Use the `hadoop-aws` package

steveloughran commented Jun 6, 2018 •

edited

Loading

minyk commented Apr 20, 2020 •

edited

Loading

amityadav-alphonso commented Feb 1, 2023 •

edited

Loading