Horrible Haskell FFI concurrency talloc bug

Notes about a very hard-to-debug bug:

Eventually I deduced that a lack of thread safety in talloc was causing memory corruption and (most of the time) an abort() when talloc itself detected the corruption.

So I wrapped pthread mutexes around all operations that might call into talloc, including routines called from finalisers.

This resulted in instant deadlocks. Why? Threads were being created with forkIO rather than forkOS, so they were not bound threads. Because the execution of an unbound Haskell thread can move among OS threads, in some cases the lock was acquired on one OS thread and the release was attempted from a different OS thread (which does not actually unlock it), so the next attempt to acquire the lock deadlocked.
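The difference between the two kinds of thread can be checked with isCurrentThreadBound from Control.Concurrent. The following is a minimal sketch for illustration only, not code from the project: probeBound is a made-up helper name, and the program is assumed to be linked with -threaded (forkOS is not supported otherwise, hence the rtsSupportsBoundThreads guard).

```haskell
import Control.Concurrent
  (forkIO, forkOS, isCurrentThreadBound, rtsSupportsBoundThreads)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- | Hypothetical helper: start a thread with the given fork function
-- and report whether that thread is bound to a single OS thread.
probeBound :: (IO () -> IO a) -> IO Bool
probeBound fork = do
  result <- newEmptyMVar
  _ <- fork (isCurrentThreadBound >>= putMVar result)
  takeMVar result

main :: IO ()
main
  | not rtsSupportsBoundThreads =
      -- Without the threaded RTS there are no bound threads to probe.
      putStrLn "Not linked with -threaded; forkOS is unavailable."
  | otherwise = do
      viaForkIO <- probeBound forkIO
      viaForkOS <- probeBound forkOS
      putStrLn ("forkIO thread bound: " ++ show viaForkIO)
      putStrLn ("forkOS thread bound: " ++ show viaForkOS)
```

Only the forkOS thread should report that it is bound. A lock with OS-thread ownership semantics, such as a pthread mutex, is only safe to take and release from Haskell when both calls are guaranteed to happen on the same OS thread, which is what a bound thread provides.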

OK, so we have to use bound threads for everything. No worries, switch to forkOS.

Still, instant deadlocks. So, what now? A footnote in the Control.Concurrent module documentation contains a clue:

There is a subtle interaction between deadlock detection and finalizers (as created by newForeignPtr or the functions in System.Mem.Weak): if a thread is blocked waiting for a finalizer to run, then the thread will be considered deadlocked and sent an exception. So preferably don't do this, but if you have no alternative then it is possible to prevent the thread from being considered deadlocked by making a StablePtr pointing to it. Don't forget to release the StablePtr later with freeStablePtr.
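For a thread that already exists (the main thread, say), the advice in that footnote can be applied from inside the thread itself. This is a minimal sketch under that reading of the documentation, not code from the project; withStableThread is a hypothetical name.

```haskell
import Control.Concurrent (myThreadId)
import Control.Exception (bracket)
import Foreign.StablePtr (freeStablePtr, newStablePtr)

-- | Hypothetical helper: run an action while a StablePtr to the
-- current thread's ThreadId exists, so the thread stays reachable
-- for the duration of the action. The StablePtr is freed afterwards,
-- even if the action throws.
withStableThread :: IO a -> IO a
withStableThread action =
  bracket
    (myThreadId >>= newStablePtr)  -- StablePtr to our own ThreadId
    freeStablePtr                  -- always release it
    (const action)                 -- run the wrapped action

main :: IO ()
main = do
  r <- withStableThread (pure (6 * 7 :: Int))
  print r
```

This only covers the calling thread. A thread spawned elsewhere needs the StablePtr set up at the point where it is forked.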

OK, so a strong dose of StablePtr should fix it. The following forkOS wrapper automatically creates and destroys a StablePtr to the new thread spawned to execute the action:

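import Control.Concurrent (ThreadId, forkOS)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (bracket)
import Foreign.StablePtr (freeStablePtr, newStablePtr)

-- | Variant of forkOS that creates a StablePtr to the new thread.
-- The StablePtr keeps the thread reachable, so the runtime does not
-- treat it as deadlocked while it is blocked waiting for a finalizer
-- to run. The StablePtr is freed when the action finishes or throws.
--
forkOSStableThread :: IO () -> IO ThreadId
forkOSStableThread action = do
  mv <- newEmptyMVar  -- new MVar to hold the StablePtr
  let
    newAction =
      bracket
        (takeMVar mv)   -- wait to receive the StablePtr
        freeStablePtr   -- free the StablePtr
        (const action)  -- execute the original action
  tid <- forkOS newAction
  sp <- newStablePtr tid
  putMVar mv sp
  pure tid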

And with that (and a small patch to brick where the main event loop thread gets spawned) the deadlocks are gone.

But the crash behaviour persists :(

frasertweedale/bug.rst