@fearofcode
Last active December 23, 2017 21:20
postgres loading example
example output from running scripts in https://github.com/fearofcode/nyc-taxi-data follows.
base script, 2TB HDD, vanilla postgres settings fresh from ubuntu install. 4 CPUs, 8 GB VM RAM. ubuntu 16.04, postgres 10.
sedding a 14 million line CSV and then COPYing it in:
Thu Dec 21 23:30:18 PST 2017: beginning load for data/yellow_tripdata_2009-01.csv
COPY 14092413
Thu Dec 21 23:34:57 PST 2017: finished raw load for data/yellow_tripdata_2009-01.csv
time: 4m39s
effective copy rate: 50510 rows/s
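
for reference, the base script isn't reproduced here; judging from the split version below, it presumably pipes sed straight into a client-side COPY, something like this (a reconstruction, not the author's exact code):

sed $'s/\r$//' $filename | sed '/^$/d' | psql nyc-taxi-data -c "COPY yellow_tripdata_staging ${schema} FROM stdin CSV HEADER;"

the effective rate is just rows over wall time: 14092413 rows / 279 s ≈ 50510 rows/s.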
next, break the sed preprocessing out into a separate step that writes to a file, then COPY that file in:
# write the cleaned CSV to a separate file instead of piping it straight into psql
preprocessed_filename="${filename}-processed"
echo "`date`: preprocessing ${filename}"
# strip trailing carriage returns, then drop blank lines
sed $'s/\r$//' $filename | sed '/^$/d' > $preprocessed_filename
echo "`date`: done preprocessing. beginning load for ${preprocessed_filename}"
# server-side COPY: postgres reads the file directly, so it needs an absolute path
psql nyc-taxi-data -c "COPY yellow_tripdata_staging ${schema} FROM '$(pwd)/${preprocessed_filename}' CSV HEADER;"
echo "`date`: done copying ${preprocessed_filename}"
Fri Dec 22 01:08:18 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Fri Dec 22 01:08:57 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Fri Dec 22 01:11:32 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
time: 3m14s (postgresql copy: 2m35s)
effective copy rate: 65546 rows/s
- change yellow_tripdata_staging to be unlogged (SQL sketch after the conf changes below)
postgresql conf changes:
synchronous_commit = off
shared_buffers = 4GB
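the unlogged change is a one-line DDL statement; a minimal sketch, since the exact command isn't shown in the output above:

psql nyc-taxi-data -c "ALTER TABLE yellow_tripdata_staging SET UNLOGGED;"  # skips WAL writes for this table; its contents are truncated on crash recovery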
restart postgresql, clear page cache:
sudo sh -c "echo 1 > /proc/sys/vm/drop_caches"
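the restart command isn't shown; on ubuntu 16.04 it's presumably something like this (assumption):

sudo systemctl restart postgresql
sync  # flush dirty pages first so the drop actually clears everything
sudo sh -c "echo 1 > /proc/sys/vm/drop_caches"  # 1 = page cache only; 3 would also drop dentries/inodes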
now rerun import:
Fri Dec 22 01:27:14 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Fri Dec 22 01:27:51 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Fri Dec 22 01:28:41 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
time: 1m27s (postgresql copy: 50s)
effective copy rate: 183018 rows/s
let's try restarting and deleting the processed file just in case there was some caching effect:
./reset.sh
rm data/yellow_tripdata_2009-01.csv-processed
./import_trip_data.sh
Fri Dec 22 01:38:37 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Fri Dec 22 01:39:10 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Fri Dec 22 01:40:04 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
time: 1m27s (postgresql copy: 54s)
effective copy rate: 173980 rows/s
postgres-only rate: 260970 rows/s
reverting the shared_buffers change so that only the unlogged table and synchronous_commit = off are in effect:
Fri Dec 22 02:06:45 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Fri Dec 22 02:07:21 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Fri Dec 22 02:08:30 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
Postgres-only copy time: 1m9s (vs 50-55-ish seconds)
so it isn't just that throwing more RAM at it makes it faster. the unlogged table and synchronous_commit = off together make a huge difference.
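for reference, both reverts are one-line postgresql.conf changes back to the stock postgres 10 defaults (assumed values):

shared_buffers = 128MB      # assumed packaged default
synchronous_commit = on     # the default; restores waiting for the WAL flush at commit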
reverting the synchronous_commit change as well (unlogged table only):
Fri Dec 22 02:11:27 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Fri Dec 22 02:11:49 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Fri Dec 22 02:13:00 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
Postgres-only copy time: 1m11s
delete the data/yellow_tripdata_2009-01.csv-processed file and restart to make sure there aren't any caching effects:
Fri Dec 22 02:15:55 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Fri Dec 22 02:16:30 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Fri Dec 22 02:17:37 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
Postgres copy time: 1m7s.
So unlogged tables make a big difference on their own, and combining the performance settings makes a significant further difference.
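a common way to keep the unlogged speedup without giving up durability for good (not shown in these scripts, so an assumption about workflow rather than the author's code): bulk-load into the unlogged staging table, then flip it to logged once the load succeeds:

psql nyc-taxi-data -c "ALTER TABLE yellow_tripdata_staging SET LOGGED;"  # rewrites the table into the WAL; slow, but the data is durable afterward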
now let's try the above exercise with all enhancements enabled (unlogged table, asynchronous commit, larger shared buffers) but on an SSD:
Sat Dec 23 13:16:16 PST 2017: preprocessing data/yellow_tripdata_2009-01.csv
Sat Dec 23 13:16:34 PST 2017: done preprocessing. beginning load for data/yellow_tripdata_2009-01.csv-processed
COPY 14092413
Sat Dec 23 13:17:29 PST 2017: done copying data/yellow_tripdata_2009-01.csv-processed
only 18 seconds to preprocess data this time, but about the same amount of time (55 seconds) to copy the CSV for a total time of 1m13s.
so, to make Postgres go like a bat out of hell on commodity hardware:
- give postgres lots of RAM + CPU cores + an SSD/NVMe
- turn off durability features your application doesn't need (unlogged tables, synchronous_commit = off)
when "Postgres is slow", that's because it's providing safety features you're unaware of or might not need, not because "it's slow" or "can't scale".