Frequently, our EMR applications need to perform cross-account reads and writes, i.e., the cluster is created under one AWS billing account, but the data lives under another (let's call it the "guest account"). Because of security concerns, we cannot grant blanket S3 access to the guest account. Instead, we rely on the assume-role feature of AWS STS to provide temporary credentials for read/write operations. The basic logic for calling the STS service is not difficult, but there are some pitfalls when you want to integrate assume-role authentication with EMRFS.
For Hadoop/Spark, the authentication process is handled within the file system itself, so application code can write to an S3 file without worrying about the underlying nitty-gritty details. EMRFS is an implementation of the S3 file system, and it provides an extension point so you can plug in your own custom credential provider. You can enable cross-account read/write access from Hadoop/Spark with the following steps:
- Create a role in the guest account that has read/write permissions on the data
- Make sure your EMR cluster can assume this role
- Write a custom credential provider that assumes the role through the STS AssumeRole API (a sketch follows this list)
- Copy the credential provider jar to the CLASSPATH of your EMR application (e.g. /usr/share/aws/emrfs/auxlib)
- Update emrfs-site as described here
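For the custom credential provider step, a minimal sketch is shown below. It assumes the AWS SDK for Java v1 (the SDK EMRFS loads); the class name, role ARN, and session name are placeholders, not anything prescribed by EMRFS. The class simply delegates to the SDK's STSAssumeRoleSessionCredentialsProvider, which calls sts:AssumeRole and refreshes the temporary credentials before they expire.

```java
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider;

/**
 * Minimal sketch of a custom EMRFS credential provider that assumes a role
 * owned by the guest account. The role ARN below is a placeholder.
 */
public class GuestAccountCredentialsProvider implements AWSCredentialsProvider {

    // Hypothetical role ARN; in practice this would come from configuration.
    private static final String ROLE_ARN =
            "arn:aws:iam::123456789012:role/guest-account-read-write";

    // The SDK provider handles calling sts:AssumeRole and renewing the
    // temporary credentials when they are close to expiring.
    private final STSAssumeRoleSessionCredentialsProvider delegate =
            new STSAssumeRoleSessionCredentialsProvider.Builder(ROLE_ARN, "emrfs-session")
                    .build();

    @Override
    public AWSCredentials getCredentials() {
        return delegate.getCredentials();
    }

    @Override
    public void refresh() {
        delegate.refresh();
    }
}
```

The fully qualified name of a class like this is what goes into the emrfs-site setting (typically the fs.s3.customAWSCredentialsProvider property) referenced in the last step.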
There is a serious limitation to the custom credential provider due to the way EMRFS caches credential providers. Specifically, S3 will try a chain of credential providers (at a minimum, the custom credential provider and the default AWS credential provider) and cache the last working provider for all subsequent S3 access until the credentials expire. This means you cannot use two different credential providers within the same EMR application. Say you want to use one credential for s3://one_bucket/data and another for s3://another_bucket/..; there is simply no way to do that, because the Hadoop S3 file system will always reuse the credential that succeeded before.
One way to fix this is to allow the S3 URI to carry the assume-role name and have the custom credential provider assume different roles for different URIs. For example, we may have s3://one_bucket/data?use-role=role1 and s3://another_bucket/data?use-role=role2.
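To make the idea concrete, here is a rough sketch of the per-role lookup such a fix would need. EMRFS does not currently hand the S3 URI to the credential provider, so the class, the use-role parsing, and the account id below are all hypothetical; the point is only that a provider could keep one STS delegate per role name parsed from the URI.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider;

/**
 * Hypothetical per-URI role selection: one cached STS provider per
 * role name carried in the use-role query parameter.
 */
public class RoleByUriCredentialsSketch {

    // Placeholder account id used to build the role ARN.
    private static final String GUEST_ACCOUNT_ID = "123456789012";

    private final Map<String, STSAssumeRoleSessionCredentialsProvider> providers =
            new ConcurrentHashMap<>();

    public AWSCredentials credentialsFor(URI s3Uri) {
        String roleName = extractUseRole(s3Uri);
        STSAssumeRoleSessionCredentialsProvider provider = providers.computeIfAbsent(
                roleName,
                name -> new STSAssumeRoleSessionCredentialsProvider.Builder(
                        "arn:aws:iam::" + GUEST_ACCOUNT_ID + ":role/" + name,
                        "emrfs-" + name).build());
        return provider.getCredentials();
    }

    // Pull the value of the use-role query parameter, e.g.
    // s3://one_bucket/data?use-role=role1 -> "role1".
    private static String extractUseRole(URI uri) {
        String query = uri.getQuery();
        if (query != null) {
            for (String param : query.split("&")) {
                String[] kv = param.split("=", 2);
                if (kv.length == 2 && kv[0].equals("use-role")) {
                    return kv[1];
                }
            }
        }
        throw new IllegalArgumentException("No use-role parameter on " + uri);
    }
}
```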
Thank you for sharing these insights...!
I see that the open-sourced zillow/aws-custom-credentials-provider was developed by you.
I am curious about some details, as I am not able to find the answers myself.
In the credentials-provider, you implemented the refresh method.
How does this refresh method get invoked by the underlying Hadoop services?
Is it safe to assume that having an implementation of the refresh method is sufficient, and that the Hadoop service will invoke it?
I found the following explanation in Hadoop 3's AssumedRoleCredentialProvider:
and the description in the configuration for timeout duration is:
I tried to scan through the Hadoop code base to figure out how this refresh is possible, but so far I have failed to pinpoint the relevant code block.
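For reference, the contract I am asking about seems to be just these two methods (paraphrased from the AWS SDK for Java v1, which the provider implements):

```java
// Paraphrase of com.amazonaws.auth.AWSCredentialsProvider (AWS SDK for Java v1).
public interface AWSCredentialsProvider {

    // Returns credentials used to sign the S3 requests.
    AWSCredentials getCredentials();

    // Forces a refresh of the credentials; my question is who calls this and when.
    void refresh();
}
```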