Generate a small CSV from the first 1000 lines of the dataset (this includes the header row, if the file has one)
head -n1000 dataset.csv > small.csv
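Generate a subset of nyc_taxi_train_dataset.csv by sampling 100000 random data rows, then prepend the header row (gshuf is GNU shuf as installed by Homebrew coreutils on macOS; on Linux the command is shuf)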
tail -n +2 nyc_taxi_train_dataset.csv | gshuf -n 100000 > processed.csv
head -n1 nyc_taxi_train_dataset.csv | cat - processed.csv > temp && mv temp processed.csv
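The two sampling commands above can be folded into a single step that writes the header and the shuffled rows in one pass, with no temp file. This is a minimal sketch, not part of the original notes: the function name sample_csv and its argument order are hypothetical, and it assumes the input has exactly one header line and that either shuf or gshuf is installed.

```shell
# sample_csv INPUT OUTPUT N  (hypothetical helper, for illustration only)
# Writes the header of INPUT followed by N randomly chosen data rows to OUTPUT.
sample_csv() {
  local input="$1" output="$2" n="$3"

  # Use shuf where it exists (Linux); otherwise fall back to gshuf (macOS + coreutils).
  local shuf_cmd=gshuf
  if command -v shuf >/dev/null 2>&1; then
    shuf_cmd=shuf
  fi

  {
    head -n 1 "$input"                          # header row first
    tail -n +2 "$input" | "$shuf_cmd" -n "$n"   # then N random data rows
  } > "$output"
}
```

Usage, matching the commands above: sample_csv nyc_taxi_train_dataset.csv processed.csv 100000. OUTPUT must be a different file from INPUT, since the redirect truncates OUTPUT before anything is read.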