Zax ZaxR

@ZaxR
ZaxR / nearest_neighbors.py
Last active January 17, 2023 10:17
Find nearest neighbors by lat/long using Haversine distance with a BallTree
"""
Example:
# All locations; also locations FROM which we want to find nearest neighbors
locations = pd.DataFrame({"LOCATION_NAME": ["Chicago, IL", "New York, NY", "San Francisco, CA"],
                          "LATITUDE": [1, 2, 3],
                          "LONGITUDE": [1, 2, 3],
                          "ID": [1, 2, 3]})
locations = locations.apply(lambda x: Location(location_name=x['LOCATION_NAME'],
                                               latitude=x['LATITUDE'],
                                               longitude=x['LONGITUDE'],
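The approach the gist's description names (Haversine distance with a BallTree) can be sketched with scikit-learn; the city coordinates and the Milwaukee query point below are illustrative stand-ins, not values from the gist:

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371

cities = ["Chicago, IL", "New York, NY", "San Francisco, CA"]
# BallTree's haversine metric expects [lat, lon] pairs in radians
coords = np.radians([[41.88, -87.63], [40.71, -74.01], [37.77, -122.42]])
tree = BallTree(coords, metric="haversine")

# Query point: Milwaukee, WI -- its nearest neighbor should be Chicago
query = np.radians([[43.04, -87.91]])
dist, idx = tree.query(query, k=1)
nearest = cities[idx[0][0]]
km = dist[0][0] * EARTH_RADIUS_KM  # haversine distances come back in radians
```

Building the tree once and querying many points is what makes this scale better than computing all pairwise distances.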
@ZaxR
ZaxR / advanced_pipelines.py
Last active November 13, 2020 22:14
Advanced sklearn Pipeline / GridSearchCV Workflow
import re
from collections import Counter
import numpy as np
import pandas as pd
from google.cloud import storage
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk import pos_tag  # apply to tokens to get part-of-speech tags
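A minimal sketch of the Pipeline / GridSearchCV pattern the gist's title describes, using synthetic data; the step names and parameter grid here are illustrative, not taken from the gist:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Steps are named so grid params can target them as "<step>__<param>"
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
best_C = grid.best_params_["clf__C"]
```

Because the scaler lives inside the pipeline, it is refit on each CV training fold, avoiding leakage from the validation fold.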
@ZaxR
ZaxR / google_cloud_helpers.py
Created February 13, 2019 04:56
Helpers for Google Cloud
from typing import Iterable, Optional
from uuid import uuid4

import pandas as pd
from google.cloud import bigquery, storage
from google.cloud.exceptions import NotFound


def gcs_file_exists(project_name: str, bucket_name: str, file_name: str) -> bool:
    """Check to see whether a specific Google Cloud Storage file exists."""
    client = storage.Client(project=project_name)
    return client.bucket(bucket_name).blob(file_name).exists(client)
@ZaxR
ZaxR / query.sql
Last active May 23, 2019 14:34
[Standard] Bigquery regex, adding missing dates, and forward filling
WITH some_data AS (
  SELECT
    _TDLinx_Store_Code_ as store_code,
    _UPC_ AS upc,
    DATE(PARSE_TIMESTAMP("%m/%d/%Y", REGEXP_EXTRACT(_Period_Description_Short_, "\\S+\\s*$"))) AS end_date,
    ___ AS dollars,
    _Units_ AS units
  FROM
    `project.dataset.table`
), end_dates AS (
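The "adding missing dates and forward filling" idea in the gist's title can be sketched in pandas rather than BigQuery SQL; the dates and unit counts below are made up for illustration:

```python
import pandas as pd

# Weekly sales with one missing week (2019-01-12) between the observations
sales = pd.DataFrame({"end_date": pd.to_datetime(["2019-01-05", "2019-01-19"]),
                      "units": [10, 12]}).set_index("end_date")

# Reindex onto a complete weekly calendar, then forward-fill the gaps
full_idx = pd.date_range("2019-01-05", "2019-01-19", freq="7D")
filled = sales.reindex(full_idx).ffill()
```

The SQL version does the same thing with a generated date series joined to the data, with a window function standing in for `ffill`.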
@ZaxR
ZaxR / .gitconfig
Created August 19, 2019 19:26
Git Aliases
...
[alias]
gone = "!f() { git branch -D $1 && git push origin --delete $1; }; f"
save = commit -am
wip = commit -am "WIP"
new = "!f() { git checkout -b $1 develop; }; f"
slog = log --abbrev-commit --color=auto --decorate=short --pretty=oneline --graph --branches=* --remotes=* --tags=*
sweep = "!f() { git remote prune origin && git branch --merged | egrep -v \"(^\\*|master|develop)\" | xargs git branch -d; }; f"
...
@ZaxR
ZaxR / id_cat_num_types.py
Created September 12, 2019 20:25
Identify the categorical and numeric columns in a pd.DataFrame
"""
Models often have both numeric and categorical features.
These types of features have different EDA flows and need separate transformations (e.g. OHE for categorical features).
I want to see how much of the pipeline can reliably be automated,
starting with identifying the columns we want to use in our model and what type they are.
Pandas offers `df.select_dtypes` which allows you to use include or excludes for types.
You can then use the following to split numeric and categorical:
"""
# splits the df itself; append .columns to get just the column names
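A minimal sketch of the `df.select_dtypes` split the note above describes, on a toy frame (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.5],
                   "qty": [3, 4],
                   "color": ["red", "blue"]})

# Split the frame by dtype; .columns gives just the names
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
```

`include="number"` catches all numeric dtypes (int, float, etc.), so everything left over is treated as categorical.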
# Parser as part of this interview challenge: https://github.com/GuildEducationInc/data-engineer-project
# Dataset: https://www.kaggle.com/rounakbanik/the-movies-dataset
# Alternative ETL process, including DB schema: https://github.com/guenthermi/the-movie-database-import
import ast
import pandas as pd
keywords_df = pd.read_csv("the-movies-dataset/keywords.csv", nrows=1000)
keywords_df = keywords_df.rename(columns={"id": "movie_id"})
@ZaxR
ZaxR / download_google_images.py
Created October 28, 2019 21:48
Download images from Google Images to Google Drive in a Colab notebook
!apt install chromium-chromedriver -q
!pip install google_images_download selenium -q
import sys
from fastai.vision import *
from google.colab import drive, files
from google_images_download import google_images_download
from selenium import webdriver # needed for google_images_download
@ZaxR
ZaxR / ml_model_architecture.py
Last active January 13, 2020 19:49
Attempt at a general ML model architecture
""" Demo Model class that can be used generally for ML projects.
Attempts to solve the following problems:
1. Ensure data preprocessing is consistent between data for model training and prediction
2. Have a common architecture for any data types / ML model types/libraries (as long as a consistent API is used)
3. Allow for easy swapping of preprocessing, modeling, and/or postprocessing
4. Enforce an input (and potentially output) data schema(s)
Bonus:
- Follows sklearn's estimator/transformer/predictor APIs, allowing use of sklearn Pipelines and GridSearchCV.
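The goals above can be sketched as a thin wrapper around an sklearn Pipeline; this is one possible shape, not the gist's actual class, and the default preprocessor/estimator choices are illustrative:

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class Model(BaseEstimator):
    """Bundle preprocessing and an estimator behind one fit/predict API,
    so training and prediction always share the same preprocessing."""

    def __init__(self, preprocessor=None, estimator=None):
        self.preprocessor = preprocessor if preprocessor is not None else StandardScaler()
        self.estimator = estimator if estimator is not None else LogisticRegression()

    def fit(self, X, y):
        # Chaining the steps in a Pipeline keeps them swappable (goal 3)
        # and sklearn-compatible (bonus goal)
        self._pipe = Pipeline([("prep", self.preprocessor), ("est", self.estimator)])
        self._pipe.fit(X, y)
        return self

    def predict(self, X):
        return self._pipe.predict(X)

model = Model().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
preds = model.predict([[0.0], [3.0]])
```

Schema enforcement (goal 4) would slot in as a validation step at the top of `fit` and `predict`; it is omitted here to keep the sketch short.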