### Tested with:
- Spark 2.0.0 pre-built for Hadoop 2.7
- Mac OS X 10.11
- Python 3.5.2
Use S3 from PySpark with minimal hassle.
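As a minimal sketch of the idea (the bucket name and credential variables below are placeholders, and s3a access also requires the matching `hadoop-aws` jar on the classpath):

```python
# Sketch: the standard fs.s3a.* Hadoop properties PySpark needs for S3 access.
# Key names are the real property names; the values are placeholders.
def s3a_conf(access_key, secret_key):
    """Return the Hadoop property/value pairs for s3a credentials."""
    return {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
    }

# In a live session you would apply them to the SparkContext, e.g.:
#   for k, v in s3a_conf(ACCESS_KEY, SECRET_KEY).items():
#       sc._jsc.hadoopConfiguration().set(k, v)
#   rdd = sc.textFile("s3a://my-bucket/some/key.txt")
```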
```shell
# Install R + RStudio on Ubuntu 14.04
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
# Ubuntu 12.04: precise
# Ubuntu 14.04: trusty
# Ubuntu 16.04: xenial
# Basic format of next line: deb https://<my.favorite.cran.mirror>/bin/linux/ubuntu <your ubuntu codename>/
sudo add-apt-repository 'deb https://ftp.ussg.iu.edu/CRAN/bin/linux/ubuntu trusty/'
sudo apt-get update
```
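The repository setup above covers R itself; a plausible completion of the install (the RStudio `.deb` filename is a placeholder — check rstudio.com for the current version):

```shell
# Install R from the CRAN repository added above
sudo apt-get install r-base r-base-dev

# Install RStudio Desktop from a downloaded .deb (filename is a placeholder)
sudo apt-get install gdebi-core
sudo gdebi rstudio-<version>-amd64.deb
```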
1. What is the difference between the Secondary NameNode, Checkpoint NameNode, and Backup Node? (The Secondary NameNode is a poorly named component of Hadoop.)
2. What are the side data distribution techniques?
3. What is shuffling in MapReduce?
4. What is partitioning?
5. Can we change the file cached by the DistributedCache?
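Question 4 can be illustrated with a sketch: a partitioner decides which reducer receives each key, typically by hashing, which mirrors Hadoop's default `HashPartitioner` (the function names here are illustrative, not Hadoop API):

```python
# Sketch of the default hash-partitioning rule used in MapReduce:
# every record with the same key is routed to the same reducer.
def partition(key, num_reducers):
    """Return the reducer index for a given key (mirrors HashPartitioner)."""
    return hash(key) % num_reducers

def group_by_reducer(pairs, num_reducers):
    """Assign (key, value) pairs to reducer buckets."""
    buckets = {r: [] for r in range(num_reducers)}
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

Because all values for a key land in the same bucket, each reducer can aggregate per key independently — that is the property the shuffle phase (question 3) exists to guarantee.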
```java
// Find the minimum path sum (from root to leaf)
public static int minPathSum(TreeNode root) {
    if (root == null) return 0;
    int sum = root.val;
    int leftSum = minPathSum(root.left);
    int rightSum = minPathSum(root.right);
    if (leftSum < rightSum) {
        sum += leftSum;
    } else {
        sum += rightSum;
    }
    return sum;
}
```
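The same recursion can be sketched in Python for a quick check (`TreeNode` here is a minimal stand-in, not from the original gist). Note that, as in the Java version, a node with a single missing child is treated as a zero-cost branch, so this logic is only reliable on full trees:

```python
# Minimal stand-in tree node for exercising the recursion above.
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def min_path_sum(root):
    """Minimum root-to-leaf path sum (same logic as the Java version)."""
    if root is None:
        return 0
    return root.val + min(min_path_sum(root.left), min_path_sum(root.right))
```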
```python
# MWS API docs at http://docs.developer.amazonservices.com/en_US/orders-2013-09-01/Orders_Datatypes.html#Order
# MWS Scratchpad at https://mws.amazonservices.com/scratchpad/index.html
# Boto docs at http://docs.pythonboto.org/en/latest/ref/mws.html?#module-boto.mws
from boto.mws.connection import MWSConnection

...

# Provide your credentials.
conn = MWSConnection(
    aws_access_key_id=AWS_ACCESS_KEY_ID,        # placeholder variables --
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,  # fill in your own values
    Merchant=MERCHANT_ID)
```
Picking the right architecture = Picking the right battles + Managing trade-offs
People

| | | |
|---|---|---|
| :bowtie: | :smile: | :laughing: |
| :blush: | :smiley: | :relaxed: |
| :smirk: | :heart_eyes: | :kissing_heart: |
| :kissing_closed_eyes: | :flushed: | :relieved: |
| :satisfied: | :grin: | :wink: |
| :stuck_out_tongue_winking_eye: | :stuck_out_tongue_closed_eyes: | :grinning: |
| :kissing: | :kissing_smiling_eyes: | :stuck_out_tongue: |
/!\ Be very careful with your setup: any misconfiguration makes the whole git config fail silently! Go through this guide step by step and it should be fine.
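As an illustration (the host aliases, key filenames, and identities below are hypothetical), mapping one SSH key per repository host in `~/.ssh/config` looks like:

```
# ~/.ssh/config -- one Host alias per identity/repository
Host github-work
    HostName github.com
    User git
    IdentityFile ~/.ssh/id_rsa_work
    IdentitiesOnly yes

Host github-personal
    HostName github.com
    User git
    IdentityFile ~/.ssh/id_rsa_personal
    IdentitiesOnly yes
```

You then clone with the alias, e.g. `git clone git@github-work:org/repo.git`, so ssh picks the matching key; `IdentitiesOnly yes` stops ssh from offering every loaded key.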
In `~/.ssh/config`, set a separate SSH key for each repository.

```python
import multiprocessing

def do_this(number):
    print(number)
    return number * 2

# Create a list to iterate over.
# (Note: multiprocessing workers receive one item at a time.)
some_list = range(0, 10)
```
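The snippet above stops before the worker pool itself; a minimal completion (the `parallel_double` wrapper and pool size are added here for illustration, not from the gist):

```python
import multiprocessing

def do_this(number):
    return number * 2

def parallel_double(items, workers=2):
    """Fan items out to a worker pool and collect results in input order."""
    with multiprocessing.Pool(workers) as pool:
        return pool.map(do_this, items)

if __name__ == "__main__":
    print(parallel_double(range(0, 10)))
```

`Pool.map` handles the one-item-at-a-time constraint noted above: it pickles each element, sends it to a worker, and reassembles the results in order.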
```python
import copy

# write to a path using the Hudi format
def hudi_write(df, schema, table, path, mode, hudi_options):
    # NOTE: the values below are placeholder strings from the source;
    # replace them with your actual column names and settings.
    hudi_options = {
        "hoodie.datasource.write.recordkey.field": "recordkey",
        "hoodie.datasource.write.precombine.field": "precombine_field",
        "hoodie.datasource.write.partitionpath.field": "partitionpath_field",
        "hoodie.datasource.write.operation": "write_operation",
        "hoodie.datasource.write.table.type": "table_type",
    }
    # (snippet truncated in the source; the save call below is a
    # reconstruction of the usual Hudi datasource write pattern)
    df.write.format("hudi").options(**hudi_options).mode(mode).save(path)
```