@fearofcode
Last active December 23, 2017 21:20
postgres loading example
example output from running scripts in https://github.com/fearofcode/nyc-taxi-data follows.
base script, 2TB HDD, vanilla postgres settings fresh from ubuntu install. 4 CPUs, 8 GB VM RAM. ubuntu 16.04, postgres 10.
sedding a 14 million line CSV and then COPYing it in:
Thu Dec 21 23:30:18 PST 2017: beginning load for data/yellow_tripdata_2009-01.csv
COPY 14092413
Thu Dec 21 23:34:57 PST 2017: finished raw load for data/yellow_tripdata_2009-01.csv
time: 4m39s
effective copy rate: 50510 rows/s
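
for reference, the base script isn't reproduced here; judging from the split version below, it presumably pipes sed straight into a client-side COPY, something like this (a reconstruction, not the author's exact code):

sed $'s/\r$//' $filename | sed '/^$/d' | psql nyc-taxi-data -c "COPY yellow_tripdata_staging ${schema} FROM stdin CSV HEADER;"

the effective rate is just rows over wall time: 14092413 rows / 279 s ≈ 50510 rows/s.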
next, break the sed preprocessing out into a separate step that writes to a file, then COPY that file in:
# write the cleaned CSV to a separate file instead of piping it straight into psql
preprocessed_filename="${filename}-processed"
echo "`date`: preprocessing ${filename}"
# strip trailing carriage returns, then drop blank lines
sed $'s/\r$//' $filename | sed '/^$/d' > $preprocessed_filename
echo "`date`: done preprocessing. beginning load for ${preprocessed_filename}"
# server-side COPY: postgres reads the file directly, so it needs an absolute path
psql nyc-taxi-data -c "COPY yellow_tripdata_staging ${schema} FROM '$(pwd)/${preprocessed_filename}' CSV HEADER;"
echo "`date`: done copying ${preprocessed_filename}"
Fri Dec 22 01:08:18 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Fri Dec 22 01:08:57 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Fri Dec 22 01:11:32 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
time: 3m14s (postgresql copy: 2m35s)
effective copy rate: 65546 rows/s
- change yellow_tripdata_staging to be unlogged (SQL sketch after the conf changes below)
postgresql conf changes:
synchronous_commit = off
shared_buffers = 4GB
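the unlogged change is a one-line DDL statement; a minimal sketch, since the exact command isn't shown in the output above:

psql nyc-taxi-data -c "ALTER TABLE yellow_tripdata_staging SET UNLOGGED;"  # skips WAL writes for this table; its contents are truncated on crash recovery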
restart postgresql, clear page cache:
sudo sh -c "echo 1 > /proc/sys/vm/drop_caches"
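the restart command isn't shown; on ubuntu 16.04 it's presumably something like this (assumption):

sudo systemctl restart postgresql
sync  # flush dirty pages first so the drop actually clears everything
sudo sh -c "echo 1 > /proc/sys/vm/drop_caches"  # 1 = page cache only; 3 would also drop dentries/inodes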
now rerun import:
Fri Dec 22 01:27:14 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Fri Dec 22 01:27:51 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Fri Dec 22 01:28:41 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
time: 1m27s (postgresql copy: 50s)
effective copy rate: 183018 rows/s
let's try restarting and deleting the processed file just in case there was some caching effect:
./reset.sh
rm data/yellow_tripdata_2009-01.csv-processed
./import_trip_data.sh
Fri Dec 22 01:38:37 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Fri Dec 22 01:39:10 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Fri Dec 22 01:40:04 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
time: 1m27s (postgresql copy: 54s)
effective copy rate: 173980 rows/s
postgres-only rate: 260970 rows/s
reverting the shared_buffers change so that only the unlogged table and synchronous_commit = off are in effect:
Fri Dec 22 02:06:45 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Fri Dec 22 02:07:21 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Fri Dec 22 02:08:30 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
Postgres-only copy time: 1m9s (vs 50-55-ish seconds)
so it isn't just that throwing more RAM at it makes it faster. the unlogged table and synchronous_commit = off together make a huge difference.
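for reference, both reverts are one-line postgresql.conf changes back to the stock postgres 10 defaults (assumed values):

shared_buffers = 128MB      # assumed packaged default
synchronous_commit = on     # the default; restores waiting for the WAL flush at commit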
reverting the synchronous_commit change as well (unlogged table only):
Fri Dec 22 02:11:27 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Fri Dec 22 02:11:49 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Fri Dec 22 02:13:00 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
Postgres-only copy time: 1m11s
delete the data/yellow_tripdata_2009-01.csv-processed file and restart to make sure there aren't any caching effects:
Fri Dec 22 02:15:55 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Fri Dec 22 02:16:30 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Fri Dec 22 02:17:37 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
Postgres copy time: 1m7s.
So unlogged tables make a big difference on their own, and combining the performance settings makes a significant further difference.
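a common way to keep the unlogged speedup without giving up durability for good (not shown in these scripts, so an assumption about workflow rather than the author's code): bulk-load into the unlogged staging table, then flip it to logged once the load succeeds:

psql nyc-taxi-data -c "ALTER TABLE yellow_tripdata_staging SET LOGGED;"  # rewrites the table into the WAL; slow, but the data is durable afterward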
now let's try the above exercise with all enhancements enabled (unlogged table, asynchronous commit, larger shared buffers) but on an SSD:
Sat Dec 23 13:16:16 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Sat Dec 23 13:16:34 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Sat Dec 23 13:17:29 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
only 18 seconds to preprocess data this time, but about the same amount of time (55 seconds) to copy the CSV for a total time of 1m13s.
so, to make Postgres go like a bat out of hell on commodity hardware:
- give postgres lots of RAM + CPU cores + an SSD/NVMe
- turn off durability features your application doesn't need (unlogged tables, synchronous_commit = off)
when "Postgres is slow", that's because it's providing safety features you're unaware of or might not need, not because "it's slow" or "can't scale".