{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Amazon Web Services"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The story goes that Amazon used to see a 400% surge in traffic during the Christmas shopping period. To handle this sudden spike, they provisioned additional data centres for the necessary compute power. For the other 11 months of the year, however, that infrastructure lay idle, so a lot of other teams inside the company started finding ways to use it to run their own compute jobs. \n",
"\n",
"Later, it was suggested that people outside the company be allowed to use it as well, and thus Amazon Web Services ([AWS](http://aws.amazon.com)) was born. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Today"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Today, AWS is a large suite of services. You can safely say that a large company could run its whole infrastructure on AWS with no issues. \n",
"\n",
"* Want a small cluster to run a Hadoop job and then tear it down? That's easy: there's [Elastic MapReduce](http://aws.amazon.com/emr).\n",
"* Finding it too time-consuming to copy data into HDFS from your local network? Why are you doing that at all? Use [S3](http://aws.amazon.com/s3).\n",
"* Short-lived is too short-lived for you? Want something stable that runs for months? That's fine, use [EC2](http://aws.amazon.com/ec2).\n",
"* Want private machines for your databases, and to host a static website alongside them? Use a [Virtual Private Cloud](http://aws.amazon.com/vpc). \n",
"* Struggling with DNS problems? Use [Route53](http://aws.amazon.com/route53). \n",
"* Too many services and not sure how to template-ify them? Use CloudFormation! \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Elastic MapReduce"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Hadoop cluster can run MR jobs (and, these days, many other kinds of jobs) on a cluster of machines, a.k.a. nodes. These nodes come in two types: the namenode (metadata server) and the data nodes, which also run the compute. One of Hadoop's key design decisions was that the compute is taken to the data and not vice-versa, so the amount of data movement is minimized. To set a cluster up yourself, you just have to do the following: \n",
"\n",
"* Create an SSH public/private key pair and share it across hundreds of nodes. \n",
"* Set up DNS and reverse DNS lookups and make sure they work. \n",
"* Install the Hadoop dependencies on each of the machines and build all the JAR files. \n",
"* Hadoop alone isn't enough, so you should also set up Hive, Spark, Pig, and other services. \n",
"* Running so many services means you need something to take care of logging, so add yet another Apache service. \n",
"* Have enough patience to make the respected late Nelson Mandela feel bad about his. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If it wasn't clear already, setting up a Hadoop cluster from scratch isn't a typical lunch-hour activity. It needs a thorough knowledge of Linux networking, coupled with an understanding of Hadoop's dependencies and of what exactly you need. EMR abstracts away a lot of that complexity and lets you focus on the problem at hand. Consider the following problems: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* I have some data on AWS S3. For example, I have all of Shakespeare's works, and I want to find the count of each unique word. How do I do it? \n",
"\n",
"* I have Amazon reviews of all products on S3. I want to copy them to HDFS, run Hive jobs on them, and copy the output of those jobs back to S3. How do I do this? \n",
"\n",
"* I just want to geek out on Spark. Just tell me what to do. And make sure one script does everything! "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Enter Boto\n",
"\n",
"Boto is a Python library that lets you drive several AWS services from the comfort of a Python script. \n",
"\n",
"* Want to create buckets on S3? Done. \n",
"\n",
"* Want to start an EMR cluster with a jobflow? Done. \n",
"\n",
"* Create an IAM role? Done. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import boto"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## S3\n",
"\n",
"Amazon Simple Storage Service (S3) was a breakthrough of sorts. Before it came out, you were charged a flat rate for, say, 100 GB of storage. It didn't matter if you only ever used 20 GB of it. And if there was a sudden surge and your business urgently needed 200 GB, only God could help you. \n",
"\n",
"S3 got rid of all that. Amazon made a simple offer: \"Pay us for what you use. And hey, if you need more, just put the data in and we will scale it for you!\" Woah! \n",
"\n",
"It's like infinite storage! Must be expensive? Nope. It's cheap. Like, commodity-storage cheap. So, how does S3 work? Let's find out using boto. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Credentials\n",
"\n",
"AWS needs to know who you are. This is handled by an AWS_ACCESS_KEY_ID and an AWS_SECRET_ACCESS_KEY. You can generate this pair in your own account, or, if you aren't the root user, ask one of the root users to generate a key pair for you. **Keep this safe!** \n",
"\n",
"It's annoying to regenerate keys all the time, and the secret key can only be saved once, when it is created. So keep it somewhere safe. Except on GitHub. Yeah, please don't put it in a public GitHub repo or, you know, tweet it. That's a no-no. \n",
"\n",
"To store your credentials for boto, create a ~/.boto file and put your credentials in it like this: \n",
"\n",
"    [Credentials]\n",
"    aws_access_key_id = access_key_id_here \n",
"    aws_secret_access_key = secret_access_key_here"
]
},
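{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, a minimal sketch (not required for the rest of the tutorial): if you'd rather not rely on the ~/.boto file, boto also picks up the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, and you can pass credentials to a connection explicitly. The key values below are placeholders, not real credentials."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import os\n",
"from boto.s3.connection import S3Connection\n",
"\n",
"# Option 1: export the standard environment variables before connecting;\n",
"# boto checks these when no config file is found.\n",
"os.environ['AWS_ACCESS_KEY_ID'] = 'access_key_id_here'  # placeholder\n",
"os.environ['AWS_SECRET_ACCESS_KEY'] = 'secret_access_key_here'  # placeholder\n",
"\n",
"# Option 2: pass the credentials directly to the connection object.\n",
"conn = S3Connection(aws_access_key_id='access_key_id_here',\n",
"                    aws_secret_access_key='secret_access_key_here')"
]
},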
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Have you done the above? If yes, great. If not, what are you waiting for?! Let's do it so we can move ahead in the tutorial."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Back to S3"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from boto.s3.connection import S3Connection"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"conn = S3Connection()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[<Bucket: aws-logs-756672503175-us-east-1>,\n",
" <Bucket: aws-logs-756672503175-us-west-2>,\n",
" <Bucket: cross-region-transfer2>,\n",
" <Bucket: datateched>,\n",
" <Bucket: obiee-test-data>,\n",
" <Bucket: simba-bamboo-data-lake>,\n",
" <Bucket: simba-drivers>,\n",
" <Bucket: simba-dynamo-bucket>,\n",
" <Bucket: simba-dynamo-eastvirginia>,\n",
" <Bucket: simba-dynamo-us-west-2>,\n",
" <Bucket: simba-hivejdbc-ami31-log>,\n",
" <Bucket: simba-hivejdbc-ami32-log>,\n",
" <Bucket: simba-hivejdbc-release-ami31-log>,\n",
" <Bucket: simba-hivejdbc-release-ami32-log>,\n",
" <Bucket: simba-impalajdbc-release-ami32-log>,\n",
" <Bucket: simba-perftest>,\n",
" <Bucket: simba-private-bucket>,\n",
" <Bucket: simba-shared-tpch>,\n",
" <Bucket: simba-tcpdump-files>,\n",
" <Bucket: simba-tpch>,\n",
" <Bucket: simbaawsbillingreport>,\n",
" <Bucket: testingmove>]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conn.get_all_buckets()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<Bucket: simba-demo-emr-bucket>"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conn.create_bucket('simba-demo-emr-bucket')"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"bucket = conn.get_bucket('simba-demo-emr-bucket')"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"databucket = conn.get_bucket('datateched')"
]
},
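{
"cell_type": "markdown",
"metadata": {},
"source": [
"S3 stores objects under keys inside buckets. Before moving on to EMR, here is a minimal sketch of putting an object into the bucket we just created and reading it back; the key name and contents below are made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from boto.s3.key import Key\n",
"\n",
"# create a key (object) in the bucket and upload a small string as its contents\n",
"k = Key(bucket)\n",
"k.key = 'hello.txt'\n",
"k.set_contents_from_string('hello, S3!')\n",
"\n",
"# read the object back and list everything currently in the bucket\n",
"print(k.get_contents_as_string())\n",
"for key in bucket.list():\n",
"    print(key.name)"
]
},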
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## An EMR setup example"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Let's create a small cluster and run a wordcount job on some data we already have on S3. \n",
"\n",
"The things we need are: \n",
"\n",
"* Some data in a bucket on S3. \n",
"* A \"step\" that describes which dataset is your input, where your output goes, and where the script you want to run as part of the job lives. \n",
"* A cluster definition that says which roles run the job, how many instances to use, and so on. Let's see how this looks. "
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# let's take care of the imports first\n",
"import boto.emr  # provides boto.emr.connect_to_region, used below\n",
"from boto.emr.connection import EmrConnection\n",
"from boto.emr.step import StreamingStep\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# connect to EMR in the region where the cluster should run\n",
"conn = boto.emr.connect_to_region('us-west-2')\n",
"\n",
"# a streaming step: mapper/reducer, plus the input and output locations on S3\n",
"step = StreamingStep(name='My WordCount example',\n",
"                     mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',\n",
"                     reducer='aggregate',\n",
"                     input='s3n://elasticmapreduce/samples/wordcount/input',\n",
"                     output='s3n://datateched/wordcount_output_2')\n",
"\n",
"# launch a small cluster and run the step on it; keep_alive=True keeps the\n",
"# cluster running after the step finishes, so remember to terminate it later\n",
"jobid = conn.run_jobflow(name='wordcountwhat?',\n",
"                         master_instance_type='m1.small', slave_instance_type='m1.small',\n",
"                         num_instances=4,\n",
"                         job_flow_role='myinstanceprofile',\n",
"                         service_role='EMR_DefaultRole',\n",
"                         steps=[step], log_uri='s3n://datateched',\n",
"                         enable_debugging=True, keep_alive=True,\n",
"                         ami_version='latest')"
]
},
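{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since keep_alive=True leaves the cluster running after the step completes, you will want to check on it and eventually shut it down. Here is a minimal sketch of polling the job flow state and then terminating it; the polling interval is arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import time\n",
"\n",
"# poll the job flow until it reaches a terminal (or waiting) state\n",
"while True:\n",
"    state = conn.describe_jobflow(jobid).state\n",
"    print(state)\n",
"    if state in ('COMPLETED', 'FAILED', 'TERMINATED', 'WAITING'):\n",
"        break\n",
"    time.sleep(30)  # arbitrary polling interval\n",
"\n",
"# because keep_alive=True, the cluster keeps running (and billing) until we stop it\n",
"conn.terminate_jobflow(jobid)"
]
},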
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# What just happened? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a step back and discuss some of the things we did. \n",
"\n",
"### Step. What's that, yo?\n",
"\n",
"A step is the basic unit of work in an EMR setup: an MR job, a Spark job, something that has an input/output and a mapper/reducer setup. Note that in classical MR you move your mapper and reducer to the data, hence the explicit mention of both. What about Spark? For Spark, we could write a Spark application in Python, put the script on S3, and then run it as a step; the step's type tells EMR what kind of job it is. \n",
"\n",
"Moreover, we can simplify this by writing .json files for each of the jobs we run. A StreamingStep is internally just a set of key-value pairs, which maps nicely onto JSON and makes our life easier (json, FTW!)\n",
"\n"
]
},
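{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make that concrete, here is a quick sketch (assuming the step object defined above is still around) that peeks at how boto represents the StreamingStep: a JAR to run plus a flat list of arguments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# peek at boto's representation of the step defined earlier: a JAR to run\n",
"# plus a flat list of arguments (-mapper, -reducer, -input, -output, ...)\n",
"print(step.name)\n",
"print(step.jar())\n",
"print(step.args())"
]
},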
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What about that roles business?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Good question! Identity and Access Management, IAM for short, is how access to AWS is managed. Any organization with more than 3 people, i.e. all organizations, needs a framework to control access to resources, and that's where IAM enters the picture. It lets you do several things. For example: \n",
"\n",
"* Create a policy for access to resources on AWS, roughly of the form {\"Resource\": \"S3\", \"Action\": \"*\"}. \n",
"  \n",
"  This policy can then be attached to an IAM user, and he/she will be able to use S3 for read/write/delete. \n",
"  \n",
"**Q: I only want to give read access though!**\n",
"\n",
"Ans: Valid concern! That can be addressed by changing the Action key, which can be fine-grained down to specific operations; a sketch of what a real policy document looks like follows below. \n",
"\n",
"\n"
]
},
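{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a rough sketch of what a read-only S3 policy document actually looks like and how you might attach it to an IAM user from boto. The user name and policy name below are made up for illustration, and the document is a simplified example rather than something to copy verbatim."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import json\n",
"import boto\n",
"\n",
"# a simplified read-only S3 policy document (illustrative, not exhaustive)\n",
"read_only_s3_policy = {\n",
"    \"Version\": \"2012-10-17\",\n",
"    \"Statement\": [{\n",
"        \"Effect\": \"Allow\",\n",
"        \"Action\": [\"s3:GetObject\", \"s3:ListBucket\"],\n",
"        \"Resource\": \"*\"\n",
"    }]\n",
"}\n",
"\n",
"# attach it to a (hypothetical) IAM user as an inline policy\n",
"iam = boto.connect_iam()\n",
"iam.put_user_policy('some-analyst', 'read-only-s3', json.dumps(read_only_s3_policy))"
]
},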
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
} |