Sam Shleifer sshleifer

@sshleifer
sshleifer / pandorable_notes.md
Last active May 14, 2016 22:49
Notes on http://tomaugspurger.github.io/ Modern Pandas blogposts

Will immediately incorporate

  • df.assign(new_col=lambda x: x.px * 2) # assign takes keyword arguments (new_col is a placeholder name); x is the DataFrame itself, which will save us a lot of code
  • df.loc[df.index.get_level_values(1) == 'donger'] can be df.loc[pd.IndexSlice[:, 'donger'], :]
  • ser.sort_values(ascending=False).head() can be ser.nlargest(5). nsmallest also exists.
  • df.add_suffix is built into pandas
  • df.dropna(thresh=4) keeps only rows with at least thresh non-missing values; rows with fewer are dropped.
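
The idioms above, collected into one small sketch. The toy frame, the index level names, and the `px2` / `qty` column names are made up for illustration; only `px` and `'donger'` come from the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a two-level index and two numeric columns.
index = pd.MultiIndex.from_product(
    [["a", "b"], ["donger", "other"]], names=["outer", "inner"]
)
df = pd.DataFrame(
    {"px": [1.0, 2.0, 3.0, 4.0], "qty": [10.0, np.nan, 30.0, np.nan]},
    index=index,
)

# assign takes keyword arguments; a callable receives the DataFrame itself.
doubled = df.assign(px2=lambda x: x.px * 2)

# Two ways to select rows whose second index level is 'donger'.
by_mask = df.loc[df.index.get_level_values(1) == "donger"]
by_slice = df.loc[pd.IndexSlice[:, "donger"], :]

# nlargest is shorthand for sort_values(ascending=False).head(n).
top2 = df["px"].nlargest(2)

# add_suffix appends a string to every column name.
suffixed = df.add_suffix("_raw")

# thresh is the minimum number of non-NA values a row needs to be kept.
complete = df.dropna(thresh=2)
```

With `thresh=2` on a two-column frame, only rows with no missing values survive, so the rows where `qty` is NaN are dropped.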

Could be useful

  • pd.TimeGrouper('H') (spelled pd.Grouper(freq='H') in later pandas versions)
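
A rough sketch of hourly grouping. It assumes a newer pandas where `pd.Grouper(freq=...)` has taken over from `pd.TimeGrouper`, and it writes the frequency as `"60min"` because the spelling of the hourly alias has changed between releases. The timestamps and the `v` column are invented.

```python
import pandas as pd

# Hypothetical irregular timestamps spread over three clock hours.
ts = pd.DataFrame(
    {"v": [1, 2, 3, 4]},
    index=pd.to_datetime(
        [
            "2016-05-14 10:05",
            "2016-05-14 10:40",
            "2016-05-14 11:15",
            "2016-05-14 12:30",
        ]
    ),
)

# Group rows into hourly bins on the DatetimeIndex and sum within each bin.
hourly = ts.groupby(pd.Grouper(freq="60min"))["v"].sum()
```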
sshleifer / zimmerman_chap2.md
Last active July 4, 2016 01:11
Notes on Chapter 2 of Tom Zimmermann's Dissertation

[Paper](https://dash.harvard.edu/bitstream/handle/1/17467320/ZIMMERMANN-DISSERTATION-2015.pdf?sequence=1)

Intro: Econom(etr)ics vs. ML

  • Economics focuses on empirical relationships between features and outcomes; ML focuses on predicting outcomes.
  • Beta vs. yhat: cv.coeffs vs. cv.metrics.fscore.
  • TZ: Can test a relationship by checking whether including the variable in a big model improves predictions, thereby avoiding omitted-control issues.
  • This requires an ML approach (feature engineering) on investor behavior datasets!
  • Implementation details and robustness checks are more valuable than the actual results on the disposition effect.
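
A minimal sketch of the "does including the variable improve predictions" test, on synthetic data. This is not the dissertation's procedure: ordinary least squares stands in for the big model, and the data, coefficients, and the `holdout_mse` helper are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x_base = rng.normal(size=(n, 3))   # hypothetical control features
x_new = rng.normal(size=n)         # hypothetical variable under test
noise = rng.normal(scale=0.5, size=n)
y = x_base @ np.array([1.0, -2.0, 0.5]) + 1.5 * x_new + noise


def holdout_mse(features, target, n_train=300):
    """Fit least squares on the first n_train rows, score on the rest."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design[:n_train], target[:n_train], rcond=None)
    resid = target[n_train:] - design[n_train:] @ coef
    return float(np.mean(resid ** 2))


# Compare out-of-sample error without and with the candidate variable.
mse_without = holdout_mse(x_base, y)
mse_with = holdout_mse(np.column_stack([x_base, x_new]), y)
```

If `mse_with` is clearly lower than `mse_without` on held-out rows, the variable carries predictive information beyond the controls. That is the yhat-side analogue of a significant beta.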
sshleifer / apps.md
Last active September 1, 2023 15:12
My Favorite apps and workflow stuff (for mac/iOS/python)
sshleifer / kernel_trick.md
Last active October 19, 2016 19:19
Attempt at explaining the kernel trick in preparation for 6.867 Midterm

Problem: Mapping X into φ(X) space can be expensive, and φ(X) is usually needed only as an intermediate result inside a dot product such as <φ(x[i]), φ(x[j])>.

Trick to save computation time: If we have a φ for which <φ(x[i]), φ(x[j])> can be computed through a shortcut, we can use the shortcut instead of explicitly evaluating φ and storing the long intermediate result. The savings come from (a) avoiding calls to φ, and (b) taking the dot product over shorter vectors.

Example

φ(x) = (x[1]**2, sqrt(2)*x[1]* x[2], x[2]**2)

<φ(x), φ(z)> = (x[1]**2)*(z[1]**2) + 2*x[1]*x[2]*z[1]*z[2] + (x[2]**2)*(z[2]**2)
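
The sum above factors as (x[1]*z[1] + x[2]*z[2])**2, i.e. the squared dot product of the original 2-d vectors, which is the shortcut for this φ. A small numeric check follows. The helper names `phi`, `dot`, and `kernel` are chosen here for illustration, and the code is 0-indexed where the notes are 1-indexed.

```python
import math


def phi(x):
    """Explicit feature map from the example: 2-d input to 3-d output."""
    x1, x2 = x
    return (x1 ** 2, math.sqrt(2) * x1 * x2, x2 ** 2)


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def kernel(x, z):
    """Shortcut: square the dot product taken in the original 2-d space."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))  # builds both 3-d vectors, then multiplies
shortcut = kernel(x, z)         # never evaluates phi at all
```

Both routes give the same number up to floating-point rounding, but the shortcut works on the 2-d vectors directly and skips both calls to φ.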
sshleifer / imagerive.md
Created June 7, 2018 17:16
Imagerive Notes

WHERE IS THE DATA?

  • SSH into {FIXME} while connected to the ImageRive VPN (must be from a Windows machine).
  • All data is in /merantix_core/data/hospitals/imagerive/export
  • Anonymized reports in reports
  • anonymized_dicoms/
  • export/cases_new.json
  • export/patients_new.json

Normal Windows VPN connection.

sshleifer / generate_boxes_from_masks.py
Created May 1, 2019 17:21
Script for going from mask to bboxes (bbox branch)
import numpy as np
import pandas as pd
import pickle as pkl
import nrrd
import glob
import os
import sys
def find_bounding_box(mask, point, label):
    visited = set()
import SimpleITK as sitk
import numpy as np
mask_file = '/data/ct-cspine/test_set_w_masks_2019_05_01/cspine_fx_seg/Cspine_fx_seg/5616571.nrrd'
array_file = '/data/ct-cspine/processed-studies/data_20180524_161757/anonymized_data/images/test/5616571.npy'
def projectImage(reference, moving, interpolate='linear'):
    # projects moving image onto reference image space
    # use interpolate='NN' for segmentation masks
    resample = sitk.ResampleImageFilter()
    resample.SetReferenceImage(reference)
"""Modified from https://github.com/gan3sh500/mixmatch-pytorch/blob/master/layer.py
Implementation of """
def mixmatch(X_labeled, y, X_unlabeled, model, augment_fn, T=0.5, K=2, alpha=0.75):
    """Generate labeled and unlabeled batches for mixmatch. Helpers are below. Use in dataloader."""
    xb = augment_fn(X_labeled)
    n_labeled = len(xb)
    ub = [augment_fn(X_unlabeled) for _ in range(K)]  # unlabeled
    qb = sharpen(sum(map(model, ub)) / K, T)
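
The `sharpen` helper is not shown in this preview. A minimal sketch of temperature sharpening as usually defined for MixMatch (raise each class probability to 1/T, then renormalize) is given below. It uses NumPy rather than PyTorch to stay self-contained, and the input row is made up.

```python
import numpy as np


def sharpen(probs, T):
    """Temperature-sharpen class probabilities along the last axis."""
    powered = probs ** (1.0 / T)
    return powered / powered.sum(axis=-1, keepdims=True)


# Hypothetical averaged prediction over K augmentations of one example.
avg = np.array([[0.6, 0.3, 0.1]])
sharp = sharpen(avg, T=0.5)
```

With T < 1 the largest class gains mass and the distribution becomes more peaked, while T = 1 leaves it unchanged.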
sshleifer / hardness_grid.py
Created May 29, 2019 15:02
ideal grid/api for hardness sampling
pg1 = update_batch_size(ParameterGrid({
    'lr': [1e-4, 1e-3, 3e-3, 1e-2, .05, 1e-1],
    'label_smoothing': [True, False],
    'size': [128],
    'bs': [256],
    'hardness_percentile': [.75, .5, .25, .1]  # top 50%, top 25%
}))
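
To see how many runs a grid like this implies, here is a standard-library sketch that expands the same dictionary into one config per combination. The `expand` function is a hypothetical stand-in written for this note. It is not `ParameterGrid` itself, and it does not model `update_batch_size`, whose behavior is not shown.

```python
from itertools import product

grid = {
    'lr': [1e-4, 1e-3, 3e-3, 1e-2, .05, 1e-1],
    'label_smoothing': [True, False],
    'size': [128],
    'bs': [256],
    'hardness_percentile': [.75, .5, .25, .1],
}


def expand(grid):
    """Return one dict per combination of the listed values."""
    keys = sorted(grid)
    combos = product(*(grid[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


configs = expand(grid)
```

The grid has 6 × 2 × 1 × 1 × 4 = 48 configurations, so each extra value added to a list multiplies the number of training runs.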
sshleifer / gcp_setup_help.sh
Last active March 17, 2020 19:37
GCP Setup Instructions
#!/usr/bin/env bash
# Make an instance here:
# https://console.cloud.google.com/marketplace/details/click-to-deploy-images/deeplearning?_ga=2.50258406.1502354465.1584473811-759161763.1583556304
# Don't enable JupyterLab
# Note that if you work at curai, this has moved to https://github.com/curai/experiments/blob/master/shleifer/gcp_setup.md
# Follow these instructions until the start of the "First-time setup script" section
# https://github.com/cs231n/gcloud/