
Richie Frost richiefrost

@richiefrost
richiefrost / Dockerfile
Created July 24, 2019 08:02
Dockerfile for creating a textract Docker image
# python-pip is not packaged in Ubuntu 20.04 and later, so pin a release that still ships it
FROM ubuntu:18.04
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr \
    flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig libpulse-dev python-pip && \
    pip install textract
richiefrost / pipe_example.py
Created September 3, 2019 12:51
Using pipe() for Pandas DataFrames
def remove_null_cols(df):
    # dropna returns a new DataFrame, so the input is left untouched
    return df.dropna(how='all', axis=1)

def set_category_types(df, columns):
    # Work on a copy so the caller's DataFrame is not modified
    _df = df.copy()
    for col in columns:
        _df[col] = _df[col].astype('category')
    return _df
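As a minimal sketch of how helpers like these chain together with `DataFrame.pipe()`: the sample DataFrame and its column names (`plan`, `requests`, `notes`) are made up for illustration and are not part of the original gist.

```python
import pandas as pd

def remove_null_cols(df):
    # Drop columns in which every value is missing
    return df.dropna(how='all', axis=1)

def set_category_types(df, columns):
    # Convert the listed columns to the 'category' dtype on a copy
    _df = df.copy()
    for col in columns:
        _df[col] = _df[col].astype('category')
    return _df

# Hypothetical sample data; 'notes' is entirely empty
raw = pd.DataFrame({
    'plan': ['free', 'pro', 'free'],
    'requests': [10, 250, 35],
    'notes': [None, None, None],
})

# pipe() passes the DataFrame as the first argument and forwards the rest
clean = (
    raw
    .pipe(remove_null_cols)
    .pipe(set_category_types, columns=['plan'])
)
print(clean.dtypes)
```

Because each helper returns a new DataFrame, `raw` still holds the original data after the chain runs.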
richiefrost / AzureDatalakeDemo.py
Created September 10, 2019 17:01
Pandas read_csv from Azure Data Lake with interactive login
from azure.datalake.store import core, lib, multithread
import pandas as pd

class ADLSHelper:
    def __init__(self, store_name='mystorename'):
        """
        Initializing this helper prompts you for an interactive login to connect to your data lake store.
        Authentication goes through Azure Active Directory, and the token returned by
        the login process is used to connect to your Azure Data Lake instance.
        For production, you can authenticate with a username/password or a service principal instead.
        """

from time import time

def log_time(func):
    """Logs the time it took for func to execute"""
    def wrapper(*args, **kwargs):
        start = time()
        val = func(*args, **kwargs)
        end = time()
        duration = end - start
        print(f'{func.__name__} took {duration} seconds to run')
        # Pass the wrapped function's result back to the caller
        return val
    return wrapper
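A possible way to apply the timing decorator, sketched under a few assumptions: `slow_sum` is a hypothetical function, and `functools.wraps` and `time.perf_counter` are swapped in here as common choices rather than taken from the original snippet.

```python
from functools import wraps
from time import perf_counter

def log_time(func):
    """Logs the time it took for func to execute"""
    @wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = perf_counter()
        val = func(*args, **kwargs)
        duration = perf_counter() - start
        print(f'{func.__name__} took {duration} seconds to run')
        return val
    return wrapper

@log_time
def slow_sum(n):
    # Hypothetical workload: add up the integers below n
    total = 0
    for i in range(n):
        total += i
    return total

result = slow_sum(1000)
```

The decorator returns the wrapped function's value, so decorated functions behave the same as before apart from the log line.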
richiefrost / query_builder.py
Last active June 21, 2022 13:13
Simple use of the builder pattern to create a SQL query generator
class QueryBuilder:
    def __init__(self):
        self.select_value = ''
        self.from_table_name = ''
        self.where_value = ''
        self.groupby_value = ''

    def select(self, select_arg):
        self.select_value = select_arg
        # Returning self is what lets the calls be chained
        return self
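The preview stops after `select()`. One way the rest of the builder could look is sketched below; the method names `from_table`, `where`, `group_by`, and `build` are assumptions made for illustration, not the gist's actual API.

```python
class QueryBuilder:
    def __init__(self):
        self.select_value = ''
        self.from_table_name = ''
        self.where_value = ''
        self.groupby_value = ''

    def select(self, select_arg):
        self.select_value = select_arg
        return self

    # The methods below are hypothetical completions of the pattern
    def from_table(self, table_name):
        self.from_table_name = table_name
        return self

    def where(self, where_arg):
        self.where_value = where_arg
        return self

    def group_by(self, groupby_arg):
        self.groupby_value = groupby_arg
        return self

    def build(self):
        # Assemble only the clauses that were actually set
        parts = [f'SELECT {self.select_value}', f'FROM {self.from_table_name}']
        if self.where_value:
            parts.append(f'WHERE {self.where_value}')
        if self.groupby_value:
            parts.append(f'GROUP BY {self.groupby_value}')
        return ' '.join(parts)

query = (
    QueryBuilder()
    .select('country, COUNT(*)')
    .from_table('users')
    .where('active = 1')
    .group_by('country')
    .build()
)
print(query)
```

This sketch concatenates strings directly, so it is only suitable for trusted input; values coming from users should go through parameterized queries instead.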
richiefrost / dependency_injection.py
Last active January 9, 2022 22:06
Simple example of using dependency injection
# Don't do this
def get_data_bad(query_text):
    db = SQLDB()
    return db.get(query_text)

# What if you need to use a DocDB instance? Or a DynamoDB instance?
# Do this instead
def get_data(db, query_text):
    return db.get(query_text)
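To illustrate why passing `db` in helps, here is a small sketch with two stand-in backends. `SQLDB` and `InMemoryDB` as written here are hypothetical stubs, not real database clients; the only thing they share is a `get()` method.

```python
class SQLDB:
    # Hypothetical stub standing in for a real SQL client
    def get(self, query_text):
        return f'SQL result for: {query_text}'

class InMemoryDB:
    # Hypothetical fake backend, handy for unit tests
    def __init__(self, rows):
        self.rows = rows

    def get(self, query_text):
        return self.rows.get(query_text, [])

def get_data(db, query_text):
    # The caller decides which backend to use
    return db.get(query_text)

sql_result = get_data(SQLDB(), 'SELECT 1')
fake_result = get_data(InMemoryDB({'users': ['ada', 'grace']}), 'users')
print(sql_result)
print(fake_result)
```

Since `get_data` never constructs its own database object, swapping in a different store or a test double requires no change to the function itself.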
richiefrost / rfm_cluster.py
Last active June 7, 2020 07:02
RFM Clustering example
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
df = pd.read_csv('user_history.csv')
# With an integer `bins`, Pandas' cut method groups continuous values into equal-width bins
df['Frequency'] = pd.cut(df['RequestsPerMonth'], bins=4)
# Since lower recency is better, we need to reverse the order of the bins
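A rough sketch of the binning step with the recency order reversed. The sample values, the `DaysSinceLastRequest` column, and the integer score labels are assumptions for illustration; the original snippet reads its data from `user_history.csv` and is cut off before the recency step.

```python
import pandas as pd

# Hypothetical sample data in place of user_history.csv
df = pd.DataFrame({
    'RequestsPerMonth': [1, 12, 25, 40],
    'DaysSinceLastRequest': [2, 30, 61, 90],
})

# Higher frequency is better, so scores rise with the bin
df['Frequency'] = pd.cut(df['RequestsPerMonth'], bins=4, labels=[1, 2, 3, 4])

# Lower recency is better, so the labels run in reverse order
df['Recency'] = pd.cut(df['DaysSinceLastRequest'], bins=4, labels=[4, 3, 2, 1])

print(df)
```

Passing `labels` replaces the default interval labels with scores, which keeps both columns on the same "higher is better" scale before clustering.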
richiefrost / decision_tree.py
Last active June 12, 2020 23:25
Get the most salient attributes in a decision tree
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
# Get the most valuable customers, from step 2
df = pd.read_csv('high_value_customers.csv')
# Churned is our target. Why did they/didn't they churn?
X, y = df.drop('Churned', axis=1), df['Churned']
model = DecisionTreeClassifier()
model.fit(X, y)
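One common way to read off the most salient attributes after fitting is the tree's `feature_importances_` attribute. The sketch below uses a tiny made-up dataset; the columns `SupportTickets` and `MonthlySpend` are hypothetical stand-ins for whatever `high_value_customers.csv` contains.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sample data in place of high_value_customers.csv
df = pd.DataFrame({
    'SupportTickets': [0, 1, 5, 6, 0, 7, 1, 8],
    'MonthlySpend': [20, 35, 22, 30, 40, 25, 33, 21],
    'Churned': [0, 0, 1, 1, 0, 1, 0, 1],
})

X, y = df.drop('Churned', axis=1), df['Churned']

# random_state makes the fitted tree reproducible
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Pair each importance with its column name and sort, most important first
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances sum to 1 across features, so they are best read as relative weights within this one tree rather than as absolute effect sizes.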