Chef Server 12.0.8+ contains a new version of our underlying postgresql client library and a new version of the sqerl library which provideds an abstraction over that. These updates included the following change:
https://github.com/chef/sqerl/commit/11d08f363fa61958f5e46d2560447931f88ae3ee
which results in the following query being sent before each query that we send to sqerl:
set statement_timeout=5000
If this query is also counted as a transaction, that would account for the doubling in the transaction rate seen when upgrading to 12.0.8.
Since the statement_timeout is (by default) a per-session configurable, we can set it once at initialze. Thus, we can move the timeout statement into the initialization of the sqerl worker. The following PR does that:
To test the fix, I generated simple load against a standalone Chef Server built off of master. I then measured transactions on a 1 second interval for 5 minutes using a query very similar to what munin uses. On master, without the proposed fix, we observe the following transaction rate data:
V1
Min. : 0.0
1st Qu.:480.0
Median :512.0
Mean :511.1
3rd Qu.:549.0
Max. :771.0
I then stopped the load generator and erchef. I rebuilt oc_erchef with a new version of sqerl with the above fix, restarted the load generator, and restarted the monitoring application. With the change, we observe the following transaction rate data:
V1
Min. : 0.0
1st Qu.:230.0
Median :252.0
Mean :249.1
3rd Qu.:271.5
Max. :395.0
puts "Starting up"
threads = []
1.upto(4) do |i|
threads << Thread.new do
loop do
puts "Request from thread #{i}"
api.get("/nodes")
end
end
end
threads.each {|t| t.join}
started as: knife exec -u usera -k usera.pem test.rb -s 'https://localhost/organizations/orga'
after creating the appropriate user and organization.
QUERY="SELECT sum(pg_stat_get_db_xact_commit(oid)) FROM pg_database WHERE datname='opscode_chef'"
# all of our calls are queries only, so rollbacks are very unlikely. Munin, however does measure them
# by adding something like the following:
#UNION ALL
#SELECT 'rollback',sum(pg_stat_get_db_xact_rollback(oid)) FROM pg_database;
#
get_commit_count() {
su opscode-pgsql -c "/opt/opscode/embedded/bin/psql opscode_chef -t -c \"$QUERY\"" | head -1 | awk '{print $1}'
}
OLD_COUNT=$(get_commit_count)
while true; do
NEW_COUNT=$(get_commit_count)
echo $(echo "$NEW_COUNT - $OLD_COUNT" | bc)
OLD_COUNT=$NEW_COUNT
sleep 1
done
started with: bash monitor.sh 2>/dev/null >tx.log
. manually stopped with an interrupt after 5 minutes.