Frequently, our EMR applications need to perform cross-account reads and writes, i.e., the cluster is created under one AWS billing account, but the data lives under another (let's call it the "guest account"). Because of security concerns, we cannot grant blanket S3 access to the guest account. Instead, we rely on the assume-role feature of AWS STS to provide temporary credentials for read/write operations. The basic logic for calling the STS service is not difficult, but there are some pitfalls when you want to integrate assume-role authentication with EMRFS.
For Hadoop/Spark, the authentication process is handled within the file system itself, so application code can write to an S3 file without worrying about the underlying nitty-gritty details. EMRFS is an implementation of the S3 file system, and it provides an extension point so you can plug in your own custom credential provider. You can enable cross-account read/write access from Hadoop/Spark with the following steps:
- Create a role in the guest account that has read/write permissions on the data
- Make sure your EMR cluster can assume this role
- Write a custom credential provider that assumes the role through the STS AssumeRole API (a sketch follows this list)
- Copy the credential provider jar to the CLASSPATH of your EMR application (e.g. /usr/share/aws/emrfs/auxlib)
- Update emrfs-site as described here
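For the custom credential provider step, a minimal sketch is shown below. It assumes the AWS SDK for Java v1 (the SDK EMRFS loads); the class name, role ARN, and session name are placeholders, not anything prescribed by EMRFS. The class simply delegates to the SDK's STSAssumeRoleSessionCredentialsProvider, which calls sts:AssumeRole and refreshes the temporary credentials before they expire.

```java
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider;

/**
 * Minimal sketch of a custom EMRFS credential provider that assumes a role
 * owned by the guest account. The role ARN below is a placeholder.
 */
public class GuestAccountCredentialsProvider implements AWSCredentialsProvider {

    // Hypothetical role ARN; in practice this would come from configuration.
    private static final String ROLE_ARN =
            "arn:aws:iam::123456789012:role/guest-account-read-write";

    // The SDK provider handles calling sts:AssumeRole and renewing the
    // temporary credentials when they are close to expiring.
    private final STSAssumeRoleSessionCredentialsProvider delegate =
            new STSAssumeRoleSessionCredentialsProvider.Builder(ROLE_ARN, "emrfs-session")
                    .build();

    @Override
    public AWSCredentials getCredentials() {
        return delegate.getCredentials();
    }

    @Override
    public void refresh() {
        delegate.refresh();
    }
}
```

The fully qualified name of a class like this is what goes into the emrfs-site setting (typically the fs.s3.customAWSCredentialsProvider property) referenced in the last step.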
There is a serious limitation to the custom credential provider due to the way EMRFS caches credential providers. Specifically, S3 will try a chain of credential providers (at a minimum, the custom credential provider and the default AWS credential provider) and cache the last working provider for all subsequent S3 access until the credentials expire. This means you cannot use two different credential providers within the same EMR application. Say you want to use one credential for s3://one_bucket/data and another for s3://another_bucket/..; there is simply no way to do that, because the Hadoop S3 file system will always reuse the credential that succeeded before.
One way to fix this is to allow the S3 URI to carry the assume-role name and have the custom credential provider assume different roles for different URIs. For example, we may have s3://one_bucket/data?use-role=role1 and s3://another_bucket/data?use-role=role2.
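To make the idea concrete, here is a rough sketch of the per-role lookup such a fix would need. EMRFS does not currently hand the S3 URI to the credential provider, so the class, the use-role parsing, and the account id below are all hypothetical; the point is only that a provider could keep one STS delegate per role name parsed from the URI.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider;

/**
 * Hypothetical per-URI role selection: one cached STS provider per
 * role name carried in the use-role query parameter.
 */
public class RoleByUriCredentialsSketch {

    // Placeholder account id used to build the role ARN.
    private static final String GUEST_ACCOUNT_ID = "123456789012";

    private final Map<String, STSAssumeRoleSessionCredentialsProvider> providers =
            new ConcurrentHashMap<>();

    public AWSCredentials credentialsFor(URI s3Uri) {
        String roleName = extractUseRole(s3Uri);
        STSAssumeRoleSessionCredentialsProvider provider = providers.computeIfAbsent(
                roleName,
                name -> new STSAssumeRoleSessionCredentialsProvider.Builder(
                        "arn:aws:iam::" + GUEST_ACCOUNT_ID + ":role/" + name,
                        "emrfs-" + name).build());
        return provider.getCredentials();
    }

    // Pull the value of the use-role query parameter, e.g.
    // s3://one_bucket/data?use-role=role1 -> "role1".
    private static String extractUseRole(URI uri) {
        String query = uri.getQuery();
        if (query != null) {
            for (String param : query.split("&")) {
                String[] kv = param.split("=", 2);
                if (kv.length == 2 && kv[0].equals("use-role")) {
                    return kv[1];
                }
            }
        }
        throw new IllegalArgumentException("No use-role parameter on " + uri);
    }
}
```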
Thank you for sharing these insights...!
I see that the open-sourced zillow/aws-custom-credentials-provider was developed by you.
I am curious about some details, as I am not able to find the answers myself.
In the credentials-provider, you implemented the refresh method.
How does this refresh method get invoked by the underlying Hadoop services?
Is it safe to assume that having an implementation of the refresh method is sufficient, and that the Hadoop service will invoke it?
I found the following explanation in Hadoop 3's AssumedRoleCredentialProvider:
and the description in the configuration for timeout duration is:
I tried to scan through the Hadoop code base to figure out how this refresh is possible, but so far I have failed to pinpoint the relevant code block.
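For reference, the contract I am asking about seems to be just these two methods (paraphrased from the AWS SDK for Java v1, which the provider implements):

```java
// Paraphrase of com.amazonaws.auth.AWSCredentialsProvider (AWS SDK for Java v1).
public interface AWSCredentialsProvider {

    // Returns credentials used to sign the S3 requests.
    AWSCredentials getCredentials();

    // Forces a refresh of the credentials; my question is who calls this and when.
    void refresh();
}
```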