If you're unfamiliar with the issue, there is an intermittent failure when running Capybara tests on any machine (spec/requests/
or features/
):
undefined method `db_charmer_connection_name' for #<ActiveRecord::ConnectionAdapters::Mysql2Adapter:0x00000009f16ec8>
/var/lib/jenkins/.rvm/gems/ruby-1.9.3-p0/gems/db-charmer-1.7.0/lib/db_charmer/active_record/class_attributes.rb:85:in `db_charmer_remapped_connection'
/var/lib/jenkins/.rvm/gems/ruby-1.9.3-p0/gems/db-charmer-1.7.0/lib/db_charmer/active_record/connection_switching.rb:34:in `connection_with_magic'
The failure is particularly annoying because if it occurs, it will cause all tests in the same process to fail with the same error.
Again, if you're unfamiliar, db-charmer
is a gem that allows us to use different databases (ie. slaves) for certain models or certain queries.
For example, when we say User.on_db(:slave0).count
, the query will be executed on the first slave database, instead of the normal connection (User.connection
).
Charles had espoused recently that it might be a multithreading issue with autoload
. While his instincts about thread safety were spot on, (since Capybara runs in a separate thread from rspec/cucumber), the loading of the gem was not at fault. Simply put, a race condition was causing the db-charmer
class variables to be in an unexpected state.
db-charmer
works by hijacking the ActiveRecord::Base.connection
method with good ol' alias_method_chain
:
class << self
def connection_with_magic
connection_proxy || connection_without_magic
end
alias_method_chain :connection, :magic
end
Initially, all classes have a connection_proxy
of nil
. Connection switching is achieved in the User.on_db(new_conn, &block)
method, which swaps in a connection_proxy
before running the block, and replaces the old one before returning. A simplification of this is as follows:
def self.on_db(new_conn, &block)
self.connection_level += 1
old_proxy = self.connection_proxy
self.connection_proxy = new_conn
yield(&block)
self.connection_proxy = old_proxy
self.connection_level -= 1
end
If you're not already smelling a concurrency problem, you should be. If one thread is interrupted in the middle of an on_db
call, all other threads get new_conn
when asking for connection
.
We will come back to this issue. The real problem occurs when two threads interleave the above code. (Which explains why this is so intermittent; the timing has to be very exact.)
Here's an example:
# Both threads are running User.on_db(:slave0).first, which gets translated to
# User.on_db(slave_connection) { User.first }
# Thread 1 # Thread 2
self.connection_level += 1 # 1
old_proxy = self.connection_proxy # nil
self.connection_proxy = slave_connection
yield(&block) self.connection_level += 1 # 2
old_proxy = self.connection_proxy # slave_connection
self.connection_proxy = slave_connection
self.connection_proxy = old_proxy # nil yield(&block)
self.connection_level -= 1 # 1
self.connection_proxy = old_proxy # slave_connection
self.connection_level -= 1 # 0
# BAD!!
# Now, in all threads, connection_level == 0 but connection_proxy == slave_connection
The state that I just described, where connection_level
is zero but connection_proxy
is a non-nil connection object, is simply unexpected by db-charmer
. In this state, any calls to User.connection
will throw NameError: undefined method 'db_charmer_connection_name'
.
(This isn't apparent by my above simplification of self.connection_with_magic
, you can check the code for yourself to see how this state throws an error.) That's why if the error pops up during tests, it will cause all remaining tests to fail, since User.connection
is simply broken.
The fix I propose is to synchronize thread access to db_charmer_connection_proxy
. We can't just slap a mutex around on_db
since it could be recursive. (on_db
calls can be nested.)
My solution is to have threads acquire a lock on the outermost invocation of on_db
, and not release that lock until they are out of that outermost invocation. That way, no two threads can be inside of on_db
at the same time.
Here's the updated method with synchronization code:
def self.on_db(new_conn, &block)
# Acquire the lock if starting a new chain of on_db calls
if Thread.current[:in_on_db_call]
in_outer_call = false
else
connection_proxy_mutex.lock
Thread.current[:in_on_db_call] = true
in_outer_call = true
end
self.connection_level += 1
old_proxy = self.connection_proxy
self.connection_proxy = new_conn
yield(&block)
self.connection_proxy = old_proxy
self.connection_level -= 1
# Release the lock if leaving the outer call
if in_outer_call
connection_proxy_mutex.unlock
Thread.current[:in_on_db_call] = false
end
end
In addition to preventing two threads from being in on_db
calls at the same time, we also don't want a thread to execute connection_with_magic
if another thread is in on_db
. So we acquire the lock for all threads not inside on_db
:
def connection_with_magic
block = lambda { connection_proxy || connection_without_magic }
if Thread.current[:in_on_db_call]
block.call
else
connection_proxy_mutex.synchronize &block
end
end
The fact is that db-charmer
should never have been used in a multithreaded environment. This band-aid will hopefully help. I'm not guaranteeing that this will fix all thread issues, since db-charmer
was simply not written with threads in mind.
This fix does indeed seem to prevent the pesky error. The build is currently green on CI, and I am doing repeated builds to be confident. In order to debug this issue, I had a local request spec with which I was able to pretty consistently recreate the error. On local runs of the spec using my fork, I haven't gotten any reds. It is also green against the db-charmer
test suite.
The commit for my fix can be found here: https://github.com/change/db-charmer/commit/8f6b7cd38a70c47882a6c7417b35b3ec5b3afbeb
Thanks for joining me on this quest, onto more intermittent failure squashing!
Why not just move connection_level and connection_proxy into thread-local storage, possibly keyed to the object ID in case of multiple connections being used in one thread?
You avoid bottlenecking your code to only allowing one thread to talk to a database at a time that way, no?