maneesh disodia maneeshdisodia

`JupyterHub` on `AWS`

EC2 Setup

Log in to AWS
Go to a sensible region
Start a new instance with Ubuntu Trusty (14.04) - compute-optimised instances have a high vCPU:memory ratio, and the lowest-cost CPU time. c4.2xlarge is a decent choice.
Set security group (firewall) to have ports 22, 80, and 443 open (SSH, HTTP, HTTPS)
If you want a static IP address (for long-running instances) then select Elastic IP for this VM
If you want to use HTTPS, you'll probably need a paid certificate, or to use Amazon's Route 53 to get a non-Amazon domain (to avoid region blocking).
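The firewall step above can also be written down as data. The sketch below builds the three inbound rules in the shape that boto3's `authorize_security_group_ingress` accepts for its `IpPermissions` argument. The helper name `build_ingress_rules`, the open `0.0.0.0/0` CIDR, and the placeholder group ID in the comment are assumptions for illustration, not part of the original setup.

```python
# Ports from the setup list: SSH, HTTP, HTTPS.
PORTS = {22: "SSH", 80: "HTTP", 443: "HTTPS"}


def build_ingress_rules(ports=PORTS, cidr="0.0.0.0/0"):
    """Return one TCP ingress rule per port (hypothetical helper)."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr, "Description": name}],
        }
        for port, name in sorted(ports.items())
    ]


rules = build_ingress_rules()

# With boto3 installed and AWS credentials configured, the rules could be
# applied like this (the group ID is a placeholder, not a real value):
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(GroupId="sg-...", IpPermissions=rules)
```

Opening port 22 to every address is convenient but broad; passing a narrower `cidr` for SSH is a common tightening.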

GCP Dataflow Pipelines

This gist is a detailed walkthrough of how to deploy Python Dataflow pipelines in GCP so that they run without external IPs. Full code samples are available below.

This walkthrough assumes you have already authenticated with the gcloud login commands and have the appropriate IAM privileges to execute these operations.

Step 1 - Gather application dependencies

Since we plan to use no external IPs on our Dataflow worker nodes, we must package all of our application dependencies for an offline deployment. I highly recommend using a virtual environment, because your global environment will contain far more packages than your single application requires.

Dump your application dependencies into a single file.
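One way to produce that file from inside the virtual environment is sketched below. It uses only the standard library (`importlib.metadata`, Python 3.8+) to write every installed distribution as a pinned `name==version` line, which is roughly what `pip freeze` prints for ordinary packages. The file name `requirements.txt` and the helper name `dump_requirements` are assumptions, since the original does not show which command it uses.

```python
from importlib import metadata
from pathlib import Path


def dump_requirements(path="requirements.txt"):
    """Write pinned name==version lines for the active environment (hypothetical helper)."""
    pins = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        version = meta.get("Version") if meta is not None else None
        if name and version:  # skip broken installs with incomplete metadata
            pins.add(f"{name}=={version}")
    lines = sorted(pins, key=str.lower)
    Path(path).write_text("".join(line + "\n" for line in lines))
    return lines
```

Calling `dump_requirements()` with the virtual environment active writes `requirements.txt` in the current directory. Packages installed in editable mode or from a VCS URL are pinned by version only here, so those entries are worth checking by hand.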

	import re

	def search(text, n):
	    '''Searches for text, and retrieves n words either side of the text, which are returned separately'''
	    # One word, optionally preceded by non-word characters.
	    word = r"\W*([\w]+)"
	    # n words, the literal word 'place', then n more words.
	    groups = re.search(r'{}\W{}{}'.format(word*n, 'place', word*n), text).groups()
	    return groups[:n], groups[n:]


	t = "The world is a small place, we should try to take care of it."
	search(t, 3)
	#(('is', 'a', 'small'), ('we', 'should', 'try'))

	import numpy as np
	import pandas as pd

	#create fake data example taken from stackoverflow
	df_example = pd.DataFrame({'CG':np.random.randint(0, 5, 100), 'Morph':np.random.choice(['S', 'E'], 100), 'R':np.random.rand(100) * -100})

	def my_agg(x):
	    # Sort ascending so the most negative R comes first.
	    x = x.sort_values('R')
	    morph = x.head(1)['Morph'].values[0]
	    diff = x.iloc[0]['R'] - x.iloc[1]['R']
	    diff2 = -2.5 * np.log10(sum(10 ** (-0.4 * x['R'])))
	    # Share of 'S' among all rows except the first.
	    prop = (x['Morph'].iloc[1:] == 'S').mean()
	    return pd.Series([morph, diff, diff2, prop], index=['morph', 'diff', 'diff2', 'prop'])
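The snippet defines `my_agg` without showing a call. A likely use, assumed here, is one call per `CG` group through `groupby(...).apply`. The small hand-written frame `df_small` stands in for the random data so the result can be checked by eye, and selecting the `Morph` and `R` columns before `apply` is a choice made for this sketch so the function never sees the grouping column.

```python
import numpy as np
import pandas as pd


def my_agg(x):
    x = x.sort_values('R')                       # most negative R first
    morph = x['Morph'].iloc[0]                   # Morph of the first row
    diff = x['R'].iloc[0] - x['R'].iloc[1]       # gap between the two lowest R values
    diff2 = -2.5 * np.log10(np.sum(10 ** (-0.4 * x['R'])))
    prop = (x['Morph'].iloc[1:] == 'S').mean()   # share of 'S' among the remaining rows
    return pd.Series([morph, diff, diff2, prop],
                     index=['morph', 'diff', 'diff2', 'prop'])


# Deterministic stand-in for df_example: two groups, each with at least two rows.
df_small = pd.DataFrame({
    'CG': [0, 0, 0, 1, 1],
    'Morph': ['S', 'E', 'S', 'E', 'S'],
    'R': [-10.0, -30.0, -20.0, -5.0, -15.0],
})

# One output row per CG value, with the four aggregated columns.
result = df_small.groupby('CG')[['Morph', 'R']].apply(my_agg)
print(result)
```

Because `my_agg` reads the second row of each group, a group with a single row would raise an `IndexError`; filtering such groups out first is one way to guard against that.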

	import multiprocessing
	import pandas as pd
	import numpy as np

	def _apply_df(args):
	    df, func, kwargs = args
	    return df.apply(func, **kwargs)

	def apply_by_multiprocessing(df, func, **kwargs):
	    workers = kwargs.pop('workers')

	#!/usr/bin/env python

	import os

	path = 'data'
	os.chdir(path)
	files = sorted(os.listdir(os.getcwd()), key=os.path.getmtime)

	oldest = files[0]
	newest = files[-1]

	import pandas as pd
	from google.cloud import firestore

	db = firestore.Client()
	users = list(db.collection(u'users').stream())

	users_dict = list(map(lambda x: x.to_dict(), users))
	df = pd.DataFrame(users_dict)

	from pyspark.sql.functions import pandas_udf, PandasUDFType

	df = spark.createDataFrame(
	    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
	    ("id", "v"))


	def my_function(df, by="id", column="v", value=1.0):
	    schema = "{} long, {} double".format(by, column)
