maneesh disodia (maneeshdisodia)

  • AI Architect @ Altimetrik
  • India
@maneeshdisodia
maneeshdisodia / jupyterhub_aws.md
Created March 11, 2019 11:35 — forked from widdowquinn/jupyterhub_aws.md
Set up JupyterHub on AWS

JupyterHub on AWS

EC2 Setup

  • Log in to AWS
  • Go to a sensible region
  • Start a new instance with Ubuntu Trusty (14.04). Compute-optimised instances have a high vCPU:memory ratio and the lowest-cost CPU time; c4.2xlarge is a decent choice.
  • Set the security group (firewall) to have ports 22, 80, and 443 open (SSH, HTTP, HTTPS). A scripted sketch of these steps follows this list.
  • If you want a static IP address (for long-running instances), then associate an Elastic IP with this VM.
  • If you want to use HTTPS, you'll probably need a paid certificate, or you can use Amazon's Route 53 to get a non-Amazon domain (to avoid region blocking).
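
These console steps can also be scripted. Below is a minimal boto3 sketch (my own addition, not part of the forked gist); the region, AMI ID, and key-pair name are placeholders:

# Hedged sketch: launch a JupyterHub host with ports 22/80/443 open.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # pick a sensible region

sg = ec2.create_security_group(
    GroupName="jupyterhub",
    Description="SSH, HTTP, HTTPS for JupyterHub",
)
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": port, "ToPort": port,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
        for port in (22, 80, 443)
    ],
)
ec2.run_instances(
    ImageId="ami-xxxxxxxx",    # placeholder: an Ubuntu 14.04 AMI in your region
    InstanceType="c4.2xlarge",
    KeyName="my-key-pair",     # placeholder: existing key pair for SSH access
    MinCount=1,
    MaxCount=1,
    SecurityGroupIds=[sg["GroupId"]],
)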
@maneeshdisodia
maneeshdisodia / apply_df_by_multiprocessing.py
Created April 8, 2019 11:02 — forked from yong27/apply_df_by_multiprocessing.py
pandas DataFrame apply multiprocessing
import multiprocessing
import pandas as pd
import numpy as np

def _apply_df(args):
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, **kwargs):
    # 'workers' sets both the pool size and the number of row chunks.
    workers = kwargs.pop('workers')
    pool = multiprocessing.Pool(processes=workers)
    # Split into row chunks, apply in parallel, then reassemble in order.
    result = pool.map(_apply_df,
                      [(chunk, func, kwargs) for chunk in np.array_split(df, workers)])
    pool.close()
    return pd.concat(result)
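
A short usage sketch (my addition, assuming the completed function above): workers sets the pool size, and the remaining keyword arguments are forwarded to DataFrame.apply.

def square(x):
    return x ** 2  # top-level function so it can be pickled for the workers

if __name__ == '__main__':
    df = pd.DataFrame({'a': range(10), 'b': range(10)})
    # workers=4 splits df into 4 row chunks; axis=1 is forwarded to df.apply.
    result = apply_by_multiprocessing(df, square, axis=1, workers=4)
    print(result.head())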
@maneeshdisodia
maneeshdisodia / time-files-modified.py
Created September 28, 2021 08:29 — forked from benhosmer/time-files-modified.py
Find the oldest and newest file in a directory and sort them.
#!/usr/bin/env python
import os

path = 'data'
os.chdir(path)

# Sort the directory's entries by modification time, oldest first.
# getmtime sees the bare filenames because we changed into `path` above.
files = sorted(os.listdir(os.getcwd()), key=os.path.getmtime)
oldest = files[0]
newest = files[-1]
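
A possible follow-up (my addition, not in the gist) to print the results with readable timestamps:

import time

print(oldest, time.ctime(os.path.getmtime(oldest)))
print(newest, time.ctime(os.path.getmtime(newest)))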
@maneeshdisodia
maneeshdisodia / Offline-Dataflow.md
Created June 29, 2022 07:01 — forked from elavenrac/Offline-Dataflow.md
GCP Dataflow processing with no external IPs

GCP Dataflow Pipelines

This gist is a detailed walkthrough of how to deploy Python Dataflow pipelines in GCP so that they run without external IPs. Full code samples are available below.

This walkthrough assumes you have already authenticated with the gcloud login commands and have the appropriate IAM privileges to execute these operations.

Step 1 - Gather application dependencies

Since we are planning to use no external IPs on our Dataflow worker nodes, we must package up all of our application dependencies for an offline deployment. I highly recommend using a virtual environment, as your global environment will contain far more dependencies than your single application requires.

Dump your application dependencies into a single file.
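
A minimal sketch of this step (assuming a pip-based virtual environment; the equivalent of `pip freeze > requirements.txt`):

import subprocess
import sys

# Write the active environment's packages to requirements.txt.
with open("requirements.txt", "w") as fh:
    subprocess.run([sys.executable, "-m", "pip", "freeze"], stdout=fh, check=True)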

@maneeshdisodia
maneeshdisodia / firestore_to_pandas_dataframe.py
Created November 25, 2022 12:54 — forked from romicofre/firestore_to_pandas_dataframe.py
Load firestore table to pandas dataframe
import pandas as pd
from google.cloud import firestore

# Requires Google Cloud credentials with read access to Firestore.
db = firestore.Client()

# Stream every document in the 'users' collection and convert each to a dict.
users = list(db.collection(u'users').stream())
users_dict = list(map(lambda x: x.to_dict(), users))

# One row per document; columns are the union of the documents' fields.
df = pd.DataFrame(users_dict)
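
One common variant (my addition, not part of the forked gist) keeps each Firestore document's ID alongside its fields:

# Hypothetical variant: include the document ID as a column.
users_dict = [{**doc.to_dict(), 'id': doc.id}
              for doc in db.collection(u'users').stream()]
df = pd.DataFrame(users_dict)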
@maneeshdisodia
maneeshdisodia / spikes.ipynb
Created November 7, 2023 21:50 — forked from w121211/spikes.ipynb
Identifying Spikes in timeseries data with Pandas
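The notebook itself cannot be rendered here; as a rough illustration of the idea only (a hedged sketch, not the notebook's code), one common pandas approach flags points that stray from a rolling baseline by more than k rolling standard deviations:

import pandas as pd

def find_spikes(series: pd.Series, window: int = 24, k: float = 3.0) -> pd.Series:
    # Boolean mask: True where a point deviates from the centred rolling
    # median by more than k times the rolling standard deviation.
    rolling = series.rolling(window, center=True)
    return (series - rolling.median()).abs() > k * rolling.std()
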
= Why JIRA should use Neo4j
== Introduction
There are few developers in the world who have never used an issue tracker, but there are even fewer who have ever used an issue tracker backed by a graph database. This is a shame, because issue tracking maps much better onto a graph database than onto a relational database. Proof of that is the https://developer.atlassian.com/download/attachments/4227160/JIRA61_db_schema.pdf?api=v2[JIRA database schema].

Obviously, the example below does not have all of the features that a tool like JIRA provides; it is only a proof of concept, but you could map every feature of JIRA onto a Neo4j database. What I've done below is take some of the core functionalities and implement them.
== The data set