@tjunxiang92
Created November 16, 2018 02:15
Getting a small sample from a large dataset

Take the first 1000 lines of the CSV (header line included)

head -n1000 dataset.csv > small.csv

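Generate a random subset of nyc_taxi_train_dataset.csv: sample 100000 data rows, then put the header line back on top. gshuf is GNU shuf as installed by Homebrew coreutils on macOS; on Linux the same command is just shuf.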

tail -n +2 nyc_taxi_train_dataset.csv | gshuf -n 100000 > processed.csv
head -n1 nyc_taxi_train_dataset.csv | cat - processed.csv > temp && mv temp processed.csv
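
The two random-sampling commands can also be combined into a single step that needs no temp file. This is only a sketch, assuming a POSIX shell and GNU shuf; the helper name sample_csv and the demo file are hypothetical and not part of the original commands.

```shell
#!/bin/sh
set -eu

# Use whichever GNU shuf is on PATH: `shuf` on Linux, `gshuf` on macOS (Homebrew coreutils).
SHUF=$(command -v shuf || command -v gshuf)

# sample_csv INPUT N OUTPUT  (hypothetical helper)
# Writes the header line of INPUT, followed by N randomly chosen data rows, to OUTPUT.
sample_csv() {
  {
    head -n 1 "$1"                      # header line, kept first
    tail -n +2 "$1" | "$SHUF" -n "$2"   # random sample of the remaining rows
  } > "$3"
}

# Demo on a throwaway file: a header plus 50 numbered rows.
demo_in=$(mktemp)
demo_out=$(mktemp)
{ echo "id,value"; seq 1 50 | sed 's/$/,x/'; } > "$demo_in"

sample_csv "$demo_in" 10 "$demo_out"
wc -l < "$demo_out"   # line count of the sample: header + 10 rows
```

For the dataset above, the equivalent call would be sample_csv nyc_taxi_train_dataset.csv 100000 processed.csv. Grouping the two commands in braces sends both outputs to one redirect, which is why the temp file and mv are not needed.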