db-charmer

The Problem

If you're unfamiliar with the issue, there is an intermittent failure when running Capybara tests on any machine (spec/requests/ or features/):

undefined method `db_charmer_connection_name' for #<ActiveRecord::ConnectionAdapters::Mysql2Adapter:0x00000009f16ec8>
/var/lib/jenkins/.rvm/gems/ruby-1.9.3-p0/gems/db-charmer-1.7.0/lib/db_charmer/active_record/class_attributes.rb:85:in `db_charmer_remapped_connection'
/var/lib/jenkins/.rvm/gems/ruby-1.9.3-p0/gems/db-charmer-1.7.0/lib/db_charmer/active_record/connection_switching.rb:34:in `connection_with_magic'

The failure is particularly annoying because if it occurs, it will cause all tests in the same process to fail with the same error.

Again, if you're unfamiliar, db-charmer is a gem that allows us to use different databases (ie. slaves) for certain models or certain queries. For example, when we say User.on_db(:slave0).count, the query will be executed on the first slave database, instead of the normal connection (User.connection).

Charles had espoused recently that it might be a multithreading issue with autoload. While his instincts about thread safety were spot on, (since Capybara runs in a separate thread from rspec/cucumber), the loading of the gem was not at fault. Simply put, a race condition was causing the db-charmer class variables to be in an unexpected state.

The Explanation

db-charmer works by hijacking the ActiveRecord::Base.connection method with good ol' alias_method_chain:

class << self
  def connection_with_magic
    connection_proxy || connection_without_magic
  end
  alias_method_chain :connection, :magic
end

Initially, all classes have a connection_proxy of nil. Connection switching is achieved in the User.on_db(new_conn, &block) method, which swaps in a connection_proxy before running the block, and replaces the old one before returning. A simplification of this is as follows:

def self.on_db(new_conn, &block)

  self.connection_level += 1
  old_proxy = self.connection_proxy
  self.connection_proxy = new_conn

  yield(&block)

  self.connection_proxy = old_proxy
  self.connection_level -= 1

end

If you're not already smelling a concurrency problem, you should be. If one thread is interrupted in the middle of an on_db call, all other threads get new_conn when asking for connection. We will come back to this issue. The real problem occurs when two threads interleave the above code. (Which explains why this is so intermittent; the timing has to be very exact.) Here's an example:

# Both threads are running User.on_db(:slave0).first, which gets translated to
# User.on_db(slave_connection) { User.first }

# Thread 1                                        # Thread 2

self.connection_level += 1                # 1
old_proxy = self.connection_proxy         # nil
self.connection_proxy = slave_connection

yield(&block)                                     self.connection_level += 1                # 2
                                                  old_proxy = self.connection_proxy         # slave_connection
                                                  self.connection_proxy = slave_connection

self.connection_proxy = old_proxy        # nil    yield(&block)
self.connection_level -= 1               # 1

                                                  self.connection_proxy = old_proxy         # slave_connection
                                                  self.connection_level -= 1                # 0

# BAD!!
# Now, in all threads, connection_level == 0 but connection_proxy == slave_connection

The state that I just described, where connection_level is zero but connection_proxy is a non-nil connection object, is simply unexpected by db-charmer. In this state, any calls to User.connection will throw NameError: undefined method 'db_charmer_connection_name'. (This isn't apparent by my above simplification of self.connection_with_magic, you can check the code for yourself to see how this state throws an error.) That's why if the error pops up during tests, it will cause all remaining tests to fail, since User.connection is simply broken.

The Solution

The fix I propose is to synchronize thread access to db_charmer_connection_proxy. We can't just slap a mutex around on_db since it could be recursive. (on_db calls can be nested.) My solution is to have threads acquire a lock on the outermost invocation of on_db, and not release that lock until they are out of that outermost invocation. That way, no two threads can be inside of on_db at the same time. Here's the updated method with synchronization code:

def self.on_db(new_conn, &block)

  # Acquire the lock if starting a new chain of on_db calls
  if Thread.current[:in_on_db_call]
    in_outer_call = false
  else
    connection_proxy_mutex.lock
    Thread.current[:in_on_db_call] = true
    in_outer_call = true
  end

  self.connection_level += 1
  old_proxy = self.connection_proxy
  self.connection_proxy = new_conn

  yield(&block)

  self.connection_proxy = old_proxy
  self.connection_level -= 1

  # Release the lock if leaving the outer call
  if in_outer_call
    connection_proxy_mutex.unlock
    Thread.current[:in_on_db_call] = false
  end

end

In addition to preventing two threads from being in on_db calls at the same time, we also don't want a thread to execute connection_with_magic if another thread is in on_db. So we acquire the lock for all threads not inside on_db:

def connection_with_magic
  block = lambda { connection_proxy || connection_without_magic }
  if Thread.current[:in_on_db_call]
    block.call
  else
    connection_proxy_mutex.synchronize &block
  end
end

The Conclusion

The fact is that db-charmer should never have been used in a multithreaded environment. This band-aid will hopefully help. I'm not guaranteeing that this will fix all thread issues, since db-charmer was simply not written with threads in mind.

This fix does indeed seem to prevent the pesky error. The build is currently green on CI, and I am doing repeated builds to be confident. In order to debug this issue, I had a local request spec with which I was able to pretty consistently recreate the error. On local runs of the spec using my fork, I haven't gotten any reds. It is also green against the db-charmer test suite.

The commit for my fix can be found here: https://github.com/change/db-charmer/commit/8f6b7cd38a70c47882a6c7417b35b3ec5b3afbeb

Thanks for joining me on this quest, onto more intermittent failure squashing!

chasestubblefield/gist:2339681

db-charmer

The Problem

The Explanation

The Solution

The Conclusion

MrJoy commented Oct 19, 2012

Uh oh!

chasestubblefield commented Nov 29, 2012

Uh oh!

kovyrin commented Mar 3, 2013

Uh oh!