Credit: Shaun Thomas
via: http://www.postgresql.org/message-id/[email protected]

Two Necessary Kernel Tweaks for Linux Systems

From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Two Necessary Kernel Tweaks for Linux Systems
Date: 2013-01-02 21:46:25
Message-ID: [email protected]
Hey everyone!

After much testing and hair-pulling, we've confirmed two kernel settings
that should always be modified on production Linux systems, especially
newer ones running the Completely Fair Scheduler (CFS) rather than the
old O(1) scheduler.
If you want to follow along, these are:

  /proc/sys/kernel/sched_migration_cost
  /proc/sys/kernel/sched_autogroup_enabled

These correspond to the sysctl settings:

  kernel.sched_migration_cost
  kernel.sched_autogroup_enabled
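To check the current values before changing anything, either interface
can be read directly with standard tools:

  # Both of these report the same tunables:
  sysctl kernel.sched_migration_cost kernel.sched_autogroup_enabled
  cat /proc/sys/kernel/sched_migration_cost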
What do these settings do?
--------------------------
* sched_migration_cost

The migration cost is the total time the scheduler will consider a
migrated process "cache hot" and thus less likely to be re-migrated. By
default this is 0.5ms (500000 ns), and as the process table grows, it
eventually causes the scheduler to break down. On our systems, after a
smooth degradation with increasing connection count, system CPU spiked
from 20% to 70% sustained and TPS was cut by 5-10x once we crossed some
invisible connection-count threshold. For us, that was a pgbench run
with 900 or more clients.
The migration cost should be increased on almost any server system
running many processes, so systems like PostgreSQL or Apache would
benefit from a higher migration cost. We've had good luck with a
setting of 5ms (5000000 ns) instead.

When the breakdown occurs, system CPU (as obtained from sar) increases
from 20% on a heavy pgbench (scale 3500 on a 72GB system) to over 70%,
and %nice/%user is cut by half or more. A higher migration cost
essentially eliminates this artificial throttle.
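As a sketch of how to apply the 5ms value, assuming the usual sysctl
mechanisms (note that newer kernels rename this knob to
kernel.sched_migration_cost_ns, so check which name your kernel
exposes):

  # Runtime change; takes effect immediately but is lost on reboot:
  sysctl -w kernel.sched_migration_cost=5000000

  # Persistent change: add a line like this to /etc/sysctl.conf:
  kernel.sched_migration_cost = 5000000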
* sched_autogroup_enabled

This is a relatively new patch that Linus lauded back in late 2010. It
groups tasks by TTY so that perceived desktop responsiveness improves.
But on server systems, large daemons like PostgreSQL are launched from
the same pseudo-TTY and are effectively starved of CPU cycles in favor
of less important tasks.

The default setting is 1 (enabled) on some platforms. By setting this
to 0 (disabled), we saw an outright 30% performance boost on the same
pgbench test. A fully cached scale-3500 database on a 72GB system went
from 67k TPS to 82k TPS with 900 client connections.
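Disabling it follows the same pattern (again a sketch, assuming
/etc/sysctl.conf is used for persistence):

  # Runtime change:
  sysctl -w kernel.sched_autogroup_enabled=0

  # Persistent change in /etc/sysctl.conf:
  kernel.sched_autogroup_enabled = 0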
Total Benefit
-------------
At higher connection counts, such as on systems that can't use pooling
or that make extensive use of prepared queries, these settings can
massively affect performance. At 900 connections, our test systems
managed 17k TPS unaltered, but 85k TPS after these two modifications.
Even with this performance boost, we still had 40% CPU free instead of
0%. In effect, the logarithmic performance of the new scheduler is
restored under large process tables.
Some systems will have a higher "cracking" point than others. The
effect is amplified when a system is under high memory pressure, so a
large number of expensive queries across many concurrent connections is
the easiest way to replicate these results.
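For reference, a run along the lines described above could look like
the following; the database name, thread count, and duration here are
illustrative, not taken from the original tests:

  # Initialize a scale-3500 pgbench database:
  pgbench -i -s 3500 bench

  # Hammer it with 900 client connections (clients must be a multiple
  # of threads, hence -j 12):
  pgbench -c 900 -j 12 -T 300 bench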
Admins migrating from older systems (RHEL 5.x) may find this especially
shocking, because the old O(1) scheduler was too "stupid" to have these
advanced features, hence it was impossible to cause this kind of
behavior.
There's probably still a little room for improvement here, since 30-40%
CPU remains unclaimed in our larger tests, and I'd like to see the drop
from the ideal throughput (175k TPS at 24 connections) reduced. But
these kernel tweaks are rarely discussed anywhere, it seems; there
doesn't appear to be any consensus on how these (and other) scheduler
settings should be modified under different usage scenarios.
I just figured I'd share, since we found this info so beneficial.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com