Zax ZaxR

@ZaxR
ZaxR / nearest_neighbors.py
Last active January 17, 2023 10:17
Find nearest neighbors by lat/long using Haversine distance with a BallTree
"""
Example:
# All locations; also locations FROM which we want to find nearest neighbors
locations = pd.DataFrame({"LOCATION_NAME": ["Chicago, IL", "New York, NY", "San Francisco, CA"],
                          "LATITUDE": [1, 2, 3],
                          "LONGITUDE": [1, 2, 3],
                          "ID": [1, 2, 3]})
locations = locations.apply(lambda x: Location(location_name=x['LOCATION_NAME'],
                                               latitude=x['LATITUDE'],
                                               longitude=x['LONGITUDE'],
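The approach the gist's description names (Haversine distance with a BallTree) can be sketched with scikit-learn; the city coordinates and the Milwaukee query point below are illustrative stand-ins, not values from the gist:

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371

cities = ["Chicago, IL", "New York, NY", "San Francisco, CA"]
# BallTree's haversine metric expects [lat, lon] pairs in radians
coords = np.radians([[41.88, -87.63], [40.71, -74.01], [37.77, -122.42]])
tree = BallTree(coords, metric="haversine")

# Query point: Milwaukee, WI -- its nearest neighbor should be Chicago
query = np.radians([[43.04, -87.91]])
dist, idx = tree.query(query, k=1)
nearest = cities[idx[0][0]]
km = dist[0][0] * EARTH_RADIUS_KM  # haversine distances come back in radians
```

Building the tree once and querying many points is what makes this scale better than computing all pairwise distances.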
@ZaxR
ZaxR / advanced_pipelines.py
Last active November 13, 2020 22:14
Advanced sklearn Pipeline / GridSearchCV Workflow
import re
from collections import Counter
import numpy as np
import pandas as pd
from google.cloud import storage
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk import pos_tag  # apply to tokens to get part-of-speech tags
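A minimal sketch of the Pipeline / GridSearchCV pattern the gist's title describes, using synthetic data; the step names and parameter grid here are illustrative, not taken from the gist:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Steps are named so grid params can target them as "<step>__<param>"
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
best_C = grid.best_params_["clf__C"]
```

Because the scaler lives inside the pipeline, it is refit on each CV training fold, avoiding leakage from the validation fold.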
@ZaxR
ZaxR / google_cloud_helpers.py
Created February 13, 2019 04:56
Helpers for Google Cloud
from typing import Iterable, Optional
from uuid import uuid4

import pandas as pd
from google.cloud import bigquery, storage
from google.cloud.exceptions import NotFound


def gcs_file_exists(project_name: str, bucket_name: str, file_name: str) -> bool:
    """Check to see whether a specific Google Cloud Storage file exists."""
    client = storage.Client(project=project_name)
    return client.bucket(bucket_name).blob(file_name).exists(client)
@ZaxR
ZaxR / query.sql
Last active May 23, 2019 14:34
[Standard] Bigquery regex, adding missing dates, and forward filling
WITH some_data AS (
  SELECT
    _TDLinx_Store_Code_ as store_code,
    _UPC_ AS upc,
    DATE(PARSE_TIMESTAMP("%m/%d/%Y", REGEXP_EXTRACT(_Period_Description_Short_, "\\S+\\s*$"))) AS end_date,
    ___ AS dollars,
    _Units_ AS units
  FROM
    `project.dataset.table`
), end_dates AS (
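The "adding missing dates and forward filling" idea in the gist's title can be sketched in pandas rather than BigQuery SQL; the dates and unit counts below are made up for illustration:

```python
import pandas as pd

# Weekly sales with one missing week (2019-01-12) between the observations
sales = pd.DataFrame({"end_date": pd.to_datetime(["2019-01-05", "2019-01-19"]),
                      "units": [10, 12]}).set_index("end_date")

# Reindex onto a complete weekly calendar, then forward-fill the gaps
full_idx = pd.date_range("2019-01-05", "2019-01-19", freq="7D")
filled = sales.reindex(full_idx).ffill()
```

The SQL version does the same thing with a generated date series joined to the data, with a window function standing in for `ffill`.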
@ZaxR
ZaxR / .gitconfig
Created August 19, 2019 19:26
Git Aliases
...
[alias]
gone = "!f() { git branch -D $1 && git push origin --delete $1; }; f"
save = commit -am
wip = commit -am "WIP"
new = "!f() { git checkout -b $1 develop; }; f"
slog = log --abbrev-commit --color=auto --decorate=short --pretty=oneline --graph --branches=* --remotes=* --tags=*
sweep = "!f() { git remote prune origin && git branch --merged | egrep -v \"(^\\*|master|develop)\" | xargs git branch -d; }; f"
...
@ZaxR
ZaxR / id_cat_num_types.py
Created September 12, 2019 20:25
Identify the categorical and numeric columns in a pd.DataFrame
"""
Models often have both numeric and categorical features.
These types of features have different EDA flows and need separate transformations (e.g. OHE for categorical features).
I want to see how much of the pipeline can reliably be automated,
starting with identifying the columns we want to use in our model and what type they are.
Pandas offers `df.select_dtypes` which allows you to use include or excludes for types.
You can then use the following to split numeric and categorical:
"""
# splits the df itself; append .columns to get just the column names
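A minimal sketch of the `df.select_dtypes` split the note above describes, on a toy frame (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.5],
                   "qty": [3, 4],
                   "color": ["red", "blue"]})

# Split the frame by dtype; .columns gives just the names
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
```

`include="number"` catches all numeric dtypes (int, float, etc.), so everything left over is treated as categorical.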
# Parser as part of this interview challenge: https://github.com/GuildEducationInc/data-engineer-project
# Dataset: https://www.kaggle.com/rounakbanik/the-movies-dataset
# Alternative ETL process, including DB schema: https://github.com/guenthermi/the-movie-database-import
import ast
import pandas as pd
keywords_df = pd.read_csv("the-movies-dataset/keywords.csv", nrows=1000)
keywords_df = keywords_df.rename(columns={"id": "movie_id"})
@ZaxR
ZaxR / download_google_images.py
Created October 28, 2019 21:48
Download images from Google Images to Google Drive in a Colab notebook
!apt install chromium-chromedriver -q
!pip install google_images_download selenium -q
import sys
from fastai.vision import *
from google.colab import drive, files
from google_images_download import google_images_download
from selenium import webdriver # needed for google_images_download
@ZaxR
ZaxR / ml_model_architecture.py
Last active January 13, 2020 19:49
Attempt at a general ML model architecture
""" Demo Model class that can be used generally for ML projects.
Attempts to solve the following problems:
1. Ensure data preprocessing is consistent between data for model training and prediction
2. Have a common architecture for any data types / ML model types/libraries (as long as a consistent API is used)
3. Allow for easy swapping of preprocessing, modeling, and/or postprocessing
4. Enforce an input (and potentially output) data schema(s)
Bonus:
- Follows sklearn's estimator/transformer/predictor APIs, allowing use of sklearn Pipelines and GridSearchCV.
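The goals above can be sketched as a thin wrapper around an sklearn Pipeline; this is one possible shape, not the gist's actual class, and the default preprocessor/estimator choices are illustrative:

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class Model(BaseEstimator):
    """Bundle preprocessing and an estimator behind one fit/predict API,
    so training and prediction always share the same preprocessing."""

    def __init__(self, preprocessor=None, estimator=None):
        self.preprocessor = preprocessor if preprocessor is not None else StandardScaler()
        self.estimator = estimator if estimator is not None else LogisticRegression()

    def fit(self, X, y):
        # Chaining the steps in a Pipeline keeps them swappable (goal 3)
        # and sklearn-compatible (bonus goal)
        self._pipe = Pipeline([("prep", self.preprocessor), ("est", self.estimator)])
        self._pipe.fit(X, y)
        return self

    def predict(self, X):
        return self._pipe.predict(X)

model = Model().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
preds = model.predict([[0.0], [3.0]])
```

Schema enforcement (goal 4) would slot in as a validation step at the top of `fit` and `predict`; it is omitted here to keep the sketch short.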