Wei (Wayne) Liu smartnose

Custom credential provider for AWS EMRFS and Spark applications

Background

Our EMR applications frequently need to perform cross-account reads and writes, i.e., the cluster is created under one AWS billing account, but the data lives under another (let's call it the "guest account"). Because of security concerns, we cannot grant blanket S3 access to the guest account. Instead, we should rely on the assume-role function of AWS STS to provide ephemeral authentication for read/write transactions. The basic logic for calling the STS service is not difficult, but there are some pitfalls when you want to integrate assume-role authentication with EMRFS.
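Because assume-role credentials are ephemeral, whatever hands them out has to fetch fresh ones before the old ones expire. The sketch below shows that idea only, under stated assumptions: `TempCredentials`, `RefreshingCredentials`, `fetch`, `now` and `margin` are hypothetical names, and `fetch` stands in for the real STS call, which is not shown here.

```scala
import java.time.{Duration, Instant}

// Hypothetical holder for what a temporary session consists of:
// access key id, secret access key, session token, and an expiry time.
final case class TempCredentials(
    accessKeyId: String,
    secretAccessKey: String,
    sessionToken: String,
    expiresAt: Instant)

// Caches temporary credentials and re-fetches them shortly before expiry.
// `fetch` is a stand-in for the actual assume-role call (an assumption);
// `now` is injectable so the refresh logic can be exercised without waiting.
final class RefreshingCredentials(
    fetch: () => TempCredentials,
    now: () => Instant = () => Instant.now(),
    margin: Duration = Duration.ofMinutes(5)) {

  private var cached: Option[TempCredentials] = None

  def get(): TempCredentials = synchronized {
    // Valid only if we are still more than `margin` away from expiry.
    val stillValid = cached.exists(c => now().plus(margin).isBefore(c.expiresAt))
    if (!stillValid) cached = Some(fetch())
    cached.get
  }
}
```

The safety margin is a design choice: refreshing a few minutes early avoids handing out credentials that expire in the middle of a long S3 transfer.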

Custom credential provider

For Hadoop/Spark, the authentication process is handled within the file system itself, so the application code can write to an S3 file without worrying about the underlying nitty-gritty details. EMRFS is an implementation o

Hadoop File Systems Through Code

Here, I'm trying to explain how various file systems (hdfs, s3, emrfs) interact with Hadoop. Understanding this helps address some of the tricky problems that arise during development, e.g. authentication and performance issues. Hadoop file systems nowadays support a variety of applications. Specifically, I'll focus on EMRFS and Spark.

Given a URI (e.g. s3://mybucket/objectname), Spark interacts with the Hadoop file system API through the DataSource.write function.

       val caseInsensitiveOptions = new CaseInsensitiveMap(options)
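Separately from Spark's own code path, it helps to see how a URI ends up at a particular file system. Hadoop picks the implementation class from the URI scheme by looking up a configuration key of the form `fs.<scheme>.impl`. The sketch below imitates that lookup with a plain `Map` instead of a real Hadoop `Configuration`; `SchemeLookup`, `implFor` and `sampleConf` are hypothetical names, and the sample entries (in particular the class bound to `s3`) are assumptions about a typical setup rather than values read from a cluster.

```scala
import java.net.URI

object SchemeLookup {
  // Illustrative stand-in for a Hadoop Configuration; the entries are assumptions.
  val sampleConf: Map[String, String] = Map(
    "fs.hdfs.impl" -> "org.apache.hadoop.hdfs.DistributedFileSystem",
    "fs.s3a.impl"  -> "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3.impl"   -> "com.amazon.ws.emr.hadoop.fs.EmrFileSystem"
  )

  // Returns the configured implementation class name for the URI's scheme,
  // or None if the URI has no scheme or the scheme is not configured.
  def implFor(uri: String, conf: Map[String, String] = sampleConf): Option[String] =
    Option(new URI(uri).getScheme).flatMap(scheme => conf.get(s"fs.$scheme.impl"))
}
```

For instance, `SchemeLookup.implFor("s3://mybucket/objectname")` looks up `fs.s3.impl` in the sample map, while a path with no scheme yields `None`.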

Spark internals through code

Nothing gives you more detail about Spark internals than actually reading its source code. In addition, you get to learn many design techniques and improve your Scala coding skills. These are the random notes I make while reading the Spark code. The best way to follow the notes is to load the Spark code into an IDE, e.g. IntelliJ, and navigate the code alongside them.

Genesis - creation of a spark cluster

The scripts for creating a Spark cluster are start-master.sh and start-slave.sh. Read them carefully and you will see that the two scripts are very similar except for the value of the $CLASS variable. For start-master.sh, the value is CLASS="org.apache.spark.deploy.master.Master", while the value for start-slave.sh is shown below with more context.

# NOTE: This exact class name is matched downstream by SparkSubmit.

	# Vespa seems to be the effort of 18 people at its core
	How do I know? Every folder in the Vespa source has an OWNERS file containing the aliases of people who probably own the code or are responsible for build breaks.
	Reading and deduplicating these lists returns 18 logins as of April 2nd, 2018:
	bjorncs
	gjoranv
	arnej27959
	havardpe
	yngve
	aressem
	bratseth