- Log in to AWS
- Go to a sensible region
- Start a new instance with Ubuntu Trusty (14.04) - compute-optimised instances have a high vCPU:memory ratio, and the lowest-cost CPU time. c4.2xlarge is a decent choice.
- Set the security group (firewall) to have ports 22, 80, and 443 open (SSH, HTTP, HTTPS)
- If you want a static IP address (for long-running instances), allocate an Elastic IP and associate it with this VM
- If you want to use HTTPS, you'll probably need a paid certificate, or to use Amazon's Route 53 to get a non-Amazon domain (to avoid region blocking).
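If you'd rather script these steps than click through the console, here is a minimal sketch using boto3; the region, AMI ID, and key name are placeholders you would swap for your own, and the security group is created in your default VPC.

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')

# Security group with ports 22, 80, and 443 open (SSH, HTTP, HTTPS)
sg = ec2.create_security_group(GroupName='web-ssh',
                               Description='SSH, HTTP and HTTPS')
for port in (22, 80, 443):
    ec2.authorize_security_group_ingress(GroupId=sg['GroupId'],
                                         IpProtocol='tcp', FromPort=port,
                                         ToPort=port, CidrIp='0.0.0.0/0')

# Launch a compute-optimised instance; 'ami-xxxxxxxx' stands in for an
# Ubuntu Trusty AMI in your region
resp = ec2.run_instances(ImageId='ami-xxxxxxxx', InstanceType='c4.2xlarge',
                         MinCount=1, MaxCount=1, KeyName='my-key',
                         SecurityGroupIds=[sg['GroupId']])
instance_id = resp['Instances'][0]['InstanceId']

# Optional: a static IP for long-running instances (the instance must be
# running before the association succeeds)
ec2.get_waiter('instance_running').wait(InstanceIds=[instance_id])
eip = ec2.allocate_address(Domain='vpc')
ec2.associate_address(InstanceId=instance_id,
                      AllocationId=eip['AllocationId'])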
import multiprocessing
import numpy as np
import pandas as pd

def _apply_df(args):
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, **kwargs):
    # Number of worker processes arrives as the 'workers' keyword argument
    workers = kwargs.pop('workers')
    pool = multiprocessing.Pool(processes=workers)
    # Split the frame into one chunk per worker and apply func in parallel
    result = pool.map(_apply_df, [(chunk, func, kwargs)
                                  for chunk in np.array_split(df, workers)])
    pool.close()
    return pd.concat(result)
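A quick usage sketch (the square function and the toy frame are just illustrations):

def square(x):
    return x ** 2

if __name__ == '__main__':
    df = pd.DataFrame({'a': range(10), 'b': range(10)})
    # Apply square row-wise across four worker processes; 'workers' is
    # consumed by the helper and not forwarded to df.apply
    result = apply_by_multiprocessing(df, square, axis=1, workers=4)
    print(result)

Note that the function must be defined at module level so it can be pickled and sent to the worker processes.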
#!/usr/bin/env python
import os

path = 'data'
os.chdir(path)
# Sort the directory entries by modification time (oldest first)
files = sorted(os.listdir(os.getcwd()), key=os.path.getmtime)
oldest = files[0]
newest = files[-1]
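If you'd rather not change the working directory, an equivalent sketch (same 'data' directory assumed) joins the path onto each entry so getmtime resolves correctly:

import os

path = 'data'
# Full paths let os.path.getmtime resolve without a chdir
files = sorted((os.path.join(path, f) for f in os.listdir(path)),
               key=os.path.getmtime)
oldest, newest = files[0], files[-1]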
This gist is a detailed walkthrough on how to deploy Python Dataflow pipelines on GCP that run without external IPs. Full code samples are available below.
This walkthrough assumes you have already authenticated with the gcloud login commands and have the appropriate IAM privileges to execute these operations.
Since we are planning to use no external IPs on our Dataflow worker nodes, we must package up all of our application dependencies for an offline deployment. I highly recommend using a virtual environment, as your global environment will contain far more dependencies than your single application requires.
Dump your application dependencies into a single file.
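From inside the activated virtual environment, that file is typically produced with pip (assuming pip-managed dependencies):

pip freeze > requirements.txt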
import pandas as pd
from google.cloud import firestore

db = firestore.Client()
# Stream every document in the 'users' collection into memory
users = list(db.collection(u'users').stream())
# Convert each document snapshot to a plain dict, then build the frame
users_dict = list(map(lambda x: x.to_dict(), users))
df = pd.DataFrame(users_dict)
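One caveat worth knowing: to_dict() does not include the document ID. A small variant (same 'users' collection assumed) carries the ID into the frame as its own column:

import pandas as pd
from google.cloud import firestore

db = firestore.Client()
# Merge each document's fields with its ID so the frame keeps a key column
rows = [dict(doc.to_dict(), id=doc.id)
        for doc in db.collection(u'users').stream()]
df = pd.DataFrame(rows)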
= Why JIRA should use Neo4j

== Introduction

There are few developers in the world who have never used an issue tracker. But there are even fewer developers who have ever used an issue tracker backed by a graph database. This is a shame, because issue tracking maps much better onto a graph database than it does onto a relational database. Proof of that is the https://developer.atlassian.com/download/attachments/4227160/JIRA61_db_schema.pdf?api=v2[JIRA database schema].

Now obviously, the example below does not have all of the features that a tool like JIRA provides. But it is only a proof of concept; you could map every feature of JIRA onto a Neo4j database. What I've done below is take out some of the core functionalities and implement those.
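To make the claim concrete, here is a hypothetical sketch (not part of the original data set) of how an issue and its reporter might look as a graph, using the official Neo4j Python driver; the connection details, labels, and properties are illustrative assumptions.

from neo4j import GraphDatabase

# Connection details are placeholders for a local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

with driver.session() as session:
    # The reporter and the issue are nodes; who-reported-what is a
    # first-class relationship rather than a foreign key
    session.run(
        "CREATE (u:User {name: $name})-[:REPORTED]->"
        "(i:Issue {key: $key, summary: $summary})",
        name="Alice", key="PROJ-1", summary="Login page throws a 500")

driver.close()

Relationships like REPORTED or DEPENDS_ON are modelled directly here, whereas the relational schema linked above needs join tables and foreign keys to express the same structure.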
== The data set