""" | |
Example: | |
# All locations; also locations FROM which we want to find nearest neighbors | |
locations = pd.DataFrame({"LOCATION_NAME": ["Chicago, IL", "New York, NY", "San Fransisco, CA"], | |
"LATITUDE": [1, 2, 3], | |
"LONGITUDE": [1, 2, 3], | |
"ID": [1, 2, 3]}) | |
locations = locations.apply(lambda x: Location(location_name=x['LOCATION_NAME'], | |
latitude=x['LATITUDE'], | |
longitude=x['LONGITUDE'], |
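# The Location class itself isn't shown in this excerpt. Below is a hedged
# sketch of what it might look like; the field names and the haversine-based
# nearest-neighbor helper are assumptions, not the gist's actual code.
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Location:
    location_name: str
    latitude: float
    longitude: float
    id: Optional[int] = None

    def distance_to(self, other: "Location") -> float:
        """Great-circle distance in kilometers via the haversine formula."""
        lat1, lon1, lat2, lon2 = map(
            math.radians,
            [self.latitude, self.longitude, other.latitude, other.longitude],
        )
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(a))

    def nearest(self, others: "list[Location]") -> "Location":
        """Return the closest of the other locations."""
        return min((o for o in others if o is not self), key=self.distance_to)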
import re
from collections import Counter

import numpy as np
import pandas as pd
from google.cloud import storage
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk import pos_tag  # call on tokens to get part-of-speech tags
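# A minimal sketch (assumed usage, not from the gist) tying these imports together:
# tokenize, drop stopwords, stem, and tag parts of speech. Requires the NLTK data
# packages punkt, stopwords, and averaged_perceptron_tagger to be downloaded.
text = "The quick brown foxes are jumping over the lazy dogs"
tokens = word_tokenize(text.lower())
stop_words = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in content_tokens]  # e.g. ['quick', 'brown', 'fox', ...]
tagged = pos_tag(content_tokens)                   # e.g. [('quick', 'JJ'), ('brown', 'JJ'), ...]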
from typing import Iterable, Optional
from uuid import uuid4

import pandas as pd
from google.cloud import bigquery
from google.cloud import storage  # assumed import; the truncated preview only shows bigquery
from google.cloud.exceptions import NotFound


def gcs_file_exists(project_name: str, bucket_name: str, file_name: str) -> bool:
    """Check whether a specific Google Cloud Storage file exists."""
    # Body reconstructed; the gist preview cuts off after the docstring.
    client = storage.Client(project=project_name)
    return client.bucket(bucket_name).blob(file_name).exists()
WITH some_data AS (
    SELECT
        _TDLinx_Store_Code_ AS store_code,
        _UPC_ AS upc,
        DATE(PARSE_TIMESTAMP("%m/%d/%Y", REGEXP_EXTRACT(_Period_Description_Short_, "\\S+\\s*$"))) AS end_date,
        ___ AS dollars,
        _Units_ AS units
    FROM
        `project.dataset.table`
), end_dates AS (
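# A hedged Python sketch (not part of the gist) of running a query like the one
# above with the BigQuery client library; the project/dataset/table names below
# are placeholders, as in the SQL itself.
from google.cloud import bigquery

client = bigquery.Client(project="project")
sql = "SELECT store_code, upc, end_date, dollars, units FROM `project.dataset.table`"
df = client.query(sql).to_dataframe()  # to_dataframe requires the pandas extras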
...
[alias]
    gone = "!f() { git branch -D $1 && git push origin --delete $1; }; f"
    save = commit -am
    wip = commit -am "WIP"
    new = "!f() { git checkout -b $1 develop; }; f"
    slog = log --abbrev-commit --color=auto --decorate=short --pretty=oneline --graph --branches=* --remotes=* --tags=*
    sweep = "!f() { git remote prune origin && git branch --merged | egrep -v \"(^\\*|master|develop)\" | xargs git branch -d; }; f"
...
""" | |
Models often have both numeric and categorical features. | |
These types of features have different EDA flows and need separate trasformations (e.g. OHE for categorical features). | |
I want to see how much of the pipeline can reliably be automated, | |
starting with identifying the columns we want to use in our model and what type they are. | |
Pandas offers `df.select_dtypes` which allows you to use include or excludes for types. | |
You can then use the following to split numeric and categorical: | |
""" | |
# splits the df itself; use .categories at the end to just get col names |
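# A minimal sketch of the split described above (assumed, since the gist
# preview cuts off here; `df` is a placeholder DataFrame):
numeric_df = df.select_dtypes(include="number")
categorical_df = df.select_dtypes(include=["object", "category"])

numeric_cols = numeric_df.columns.tolist()
categorical_cols = categorical_df.columns.tolist()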
# Parser as part of this interview challenge: https://github.com/GuildEducationInc/data-engineer-project
# Dataset: https://www.kaggle.com/rounakbanik/the-movies-dataset
# Alternative ETL process, including DB schema: https://github.com/guenthermi/the-movie-database-import
import ast

import pandas as pd

keywords_df = pd.read_csv("the-movies-dataset/keywords.csv", nrows=1000)
keywords_df = keywords_df.rename(columns={"id": "movie_id"})
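# Hypothetical next step (the preview is truncated): in this dataset the
# "keywords" column holds stringified lists of dicts, which presumably explains
# the ast import above. One keyword per row after exploding:
keywords_df["keywords"] = keywords_df["keywords"].apply(ast.literal_eval)
exploded = keywords_df.explode("keywords").dropna(subset=["keywords"])
exploded["keyword_id"] = exploded["keywords"].apply(lambda d: d["id"])
exploded["keyword_name"] = exploded["keywords"].apply(lambda d: d["name"])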
!apt install chromium-chromedriver -q
!pip install google_images_download selenium -q

import sys

from fastai.vision import *
from google.colab import drive, files
from google_images_download import google_images_download
from selenium import webdriver  # needed for google_images_download
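# A hedged sketch (assumed Colab setup; the keywords, limit, and Drive path are
# placeholders) of wiring the pieces together: point google_images_download at
# the chromedriver installed above and pull down a small image set.
sys.path.insert(0, "/usr/lib/chromium-browser/chromedriver")
response = google_images_download.googleimagesdownload()
response.download({
    "keywords": "golden retriever",
    "limit": 20,
    "chromedriver": "/usr/lib/chromium-browser/chromedriver",
})
drive.mount("/content/gdrive")  # persist downloads to Google Drive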
""" Demo Model class that can be used generally for ML projects. | |
Attempts to solve the following problems: | |
1. Ensure data preprocessing is consistent between data for model training and prediction | |
2. Have a common architecture for any data types / ML model types/libraries (as long as a consistent API is ued) | |
3. Allow for easy swapping of preprocessing, modeling, and/or postprocessing | |
4. Enforce an input (and potentially output) data schema(s) | |
Bonus: | |
- Follow's sklearn's estimator/transformer/predictor APIs, allowing use of sklearn Pipelines and GridSearchCV. |
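# A minimal sketch of the idea (assumed design; the class body is not shown in
# the preview). Composing a preprocessor and an estimator behind sklearn's
# fit/predict API keeps train-time and predict-time preprocessing identical.
from sklearn.base import BaseEstimator

class Model(BaseEstimator):
    def __init__(self, preprocessor, estimator):
        self.preprocessor = preprocessor
        self.estimator = estimator

    def fit(self, X, y=None):
        Xt = self.preprocessor.fit_transform(X, y)
        self.estimator.fit(Xt, y)
        return self

    def predict(self, X):
        # Reuses the fitted preprocessor, guaranteeing consistency (problem 1).
        return self.estimator.predict(self.preprocessor.transform(X))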