gist:fac76caa775b3ad63ac8

Introduction

This is a short write-up that explains why test/addons/make-callback-recurse/test.js in github.com/nodejs/node/pull/4507 fails with the following error message:

$ node test.js
d2 error
/Users/JulienGilli/dev/node/node/test/addons/make-callback-recurse/test.js:88
    throw new Error('test d1');
    ^

Error: test d1
    at Domain.<anonymous> (/Users/JulienGilli/dev/node/node/test/addons/make-callback-recurse/test.js:88:11)
    at Domain.<anonymous> (/Users/JulienGilli/dev/node/node/test/common.js:395:15)
    at Domain.run (domain.js:228:14)
    at checkDomains (/Users/JulienGilli/dev/node/node/test/addons/make-callback-recurse/test.js:84:6)
    at Immediate.<anonymous> (/Users/JulienGilli/dev/node/node/test/addons/make-callback-recurse/test.js:69:7)
    at Immediate._onImmediate (/Users/JulienGilli/dev/node/node/test/common.js:395:15)
    at processImmediate [as _immediateCallback] (timers.js:392:17)
$

Technical explanation of why the test fails

The relevant part of this test is the following code snippet:

const d1 = domain.create();
const d2 = domain.create();

d1.on('error', function() { process._rawDebug('d1 error') });
d2.on('error', function() { process._rawDebug('d2 error') });

d1.run(common.mustCall(function() {
  makeCallback({domain:d2}, function() {
    throw new Error('test d2');
  });
  throw new Error('test d1');
}));

Instead, what we might expect (we'll see why it's actually a wrong assumption) is that the second error thrown (throw new Error('test d1');) is caught by the domain in which it runs, which is d1, instead of not being caught and making the process exit.

For instance, let's take a look at what happens with similar code that doesn't use the makeCallback binding:

var domain = require('domain');

var d1 = domain.create();
var d2 = domain.create();


d1.on('error', function() { console.error('d1 on error'); });
d2.on('error', function() { console.error('d2 on error'); });

d1.run(function() {
  d2.run(function() {
    throw new Error('boom from d2');
  });
  throw new Error('boom from d1'); 
});

When run with node v4.2.1, this code doesn't exit with a non-zero exit code, and no uncaught error bubbles up:

$ node /tmp/test.js 
d2 on error
$

So why would using the makeCallback binding change this behavior?

First, the latest code snippet actually behaves much differently than the first one. In the first code snippet that uses the makeCallback binding, the second error is thrown, whereas in the second snippet that doesn't use the custom makeCallback binding, the second error is not thrown.

The reason why the second error is not thrown in the second snippet is simply that, after the first error is thrown, the call stack is unwound until there's an handler for that exception that can be found on the stack. In this case, the only handler that can be found is an external handler setup when bootstrapping node.

That handler is "verbose", which means that it generates a "message" that is handled by node::FatalException. When the error is handled by node::FatalException, the execution doesn't return from where it left off, which is right after the first error was thrown. Instead, the JavaScript code terminates its execution and the node event loop runs. However since there's nothing holding the loop open, the program terminates its execution successfuly.

So the question should now be: why in the first code snippet does execution continue after the first error is thrown? This first error is thrown from within a JavaScript function called by node::MakeCallback. What's important here is that node::MakeCallback sets up its own external exception handler, in addition to the one we already mentioned that was setup during node's bootstrap.

That exception handler is also verbose, and so node::FatalException (which in turns emits domains' error events) will also be called. However since in the first code snippet the custom makeCallback binding is called synchronously, here's how the stack roughly looks in terms of external exception handlers:

external exception handler 1
-----
call to node.js' "main" function
---
call to d1.run
----
call to anonymous function that runs within domain d1 
----
call to node::MakeCallback
----
external exception handler 2
----
call to anonymous JS function that throws

So when the first error is thrown, the stack is unwound until "external exception handler 2" is found, and then the rest of the anonymous function that runs within domain d1 can resume its execution and throw the second error.

However, by that that time, the domains stack and the active domain was cleared by the domain's error handler that was called after the first error was thrown. So when it's time to handle the second error in node::FatalException, no domain is found on the stack, and the process exits with an error.

Further questions/reflexions

Not clearing the whole domains stack on an uncaught error

One might think that the domains stack should not be cleared any time an uncaught error is handled by a domain that is part of a stack of nested domains. Here's a small quick patch that only pops the domain at the top of the stack in this case:

diff --git a/lib/domain.js b/lib/domain.js
index 630d02e..2ddb794 100644
--- a/lib/domain.js
+++ b/lib/domain.js
@@ -65,11 +65,12 @@ Domain.prototype._errorHandler = function errorHandler(er) {
   function emitError() {
     var handled = self.emit('error', er);
 
-    // Exit all domains on the stack.  Uncaught exceptions end the
-    // current tick and no domains should be left on the stack
-    // between ticks.
-    stack.length = 0;
-    exports.active = process.domain = null;
+    stack.pop();
+    if (stack.length) {
+      exports.active = process.domain = stack[stack.length - 1];
+    } else {
+      exports.active = process.domain = null;
+    }
 
     return handled;
   }

and the output with the changes in github.com/nodejs/node/pull/4507 is what we'd "intuitively" expect:

$ cd test/addons/make-callback-recurse 
$ ../../../node test.js                                    
d2 error
d1 error
$

Now what does make test say about this change?

➜  node git:(master) ✗ make test
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C out BUILDTYPE=Release V=1
make[1]: Nothing to be done for `all'.
ln -fs out/Release/node node
Running main() from gtest_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from UtilTest
[ RUN      ] UtilTest.ListHead
[       OK ] UtilTest.ListHead (0 ms)
[----------] 1 test from UtilTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 1 test.
/usr/bin/python tools/test.py --mode=release message parallel sequential -J
=== release test-domain-nested ===                                             
Path: parallel/test-domain-nested
assert.js:89
  throw new assert.AssertionError({
  ^
AssertionError: 4 == 0
    at process.<anonymous> (/Users/JulienGilli/dev/node/node/test/parallel/test-domain-nested.js:9:10)
    at emitOne (events.js:83:20)
    at process.emit (events.js:170:7)
Command: out/Release/node /Users/JulienGilli/dev/node/node/test/parallel/test-domain-nested.js
[00:45|% 100|+ 1001|-   1]: Done                                               
make: *** [test] Error 1
➜  node git:(master) ✗

Let's see why test/parallel/test-domain-nested.js fails. Here's the code:

'use strict';
// Make sure that the nested domains don't cause the domain stack to grow

require('../common');
var assert = require('assert');
var domain = require('domain');

process.on('exit', function(c) {
  assert.equal(domain._stack.length, 0);
});

domain.create().run(function() {
  domain.create().run(function() {
    domain.create().run(function() {
      domain.create().on('error', function(e) {
          // Don't need to do anything here
      }).run(function() {
        throw new Error('died');
      });
    });
  });
});

So the test expects all domains to be exited in the process' exit event's handler. And intuitively it makes sense: an error is thrown in the innermost domain, and so all enclosing domains have exited, thus when the process' exit event's handler run, there's no domain left on the stack.

Actually, that's not what happens. When the error is thrown, domains do not exit. The innermost domain's error handler pops the innermost domain from the stack, and the process' exit event is emitted (via node::MakeCallback, which pushes another domain on the stack).

So at that point the domains stack has the three non-exited domains + the domain entered to emit the process' 'exit' event, which is 4, and thus !== 0: the test fails.

Unwinding domains stack, but stop at each external exception handler boundary

A more sensible way to solve the problem mentioned in the introduction is to add a marker in the domains stack every time we add an additional external exception handler. See https://gist.github.com/misterdjules/8fc28539c3e095ee2e35 which is a patch on top of nodejs/node#4507 that implements this idea.

With this patch applied to nodejs/node#4507, all tests pass.

The problem with that approach is that now, the following code throws two errors:

const d1 = domain.create();
const d2 = domain.create();

d1.on('error', function() { process._rawDebug('d1 error') });
d2.on('error', function() { process._rawDebug('d2 error') });

d1.run(common.mustCall(function() {
  makeCallback({domain:d2}, function() {
    throw new Error('test d2');
  });
  throw new Error('test d1');
}));

whereas the following code throws only one error:

var domain = require('domain');

var d1 = domain.create();
var d2 = domain.create();

d1.on('error', function() { console.error('d1 on error'); });
d2.on('error', function() { console.error('d2 on error'); });

d1.run(function() {
  d2.run(function() {
    throw new Error('boom from d2');
  });
  throw new Error('boom from d1'); 
});

So unless one knows and understands that the makeCallback JavaScript binding actually adds an external exception handler on the stack, which is what allows the JavaScript code to resume its execution and throw the second error, it can be very confusing and unintuitive.

Not adding any external exception handler in node::*::MakeCallback calls

Having an external exception handler in node::*::MakeCallback means that the callstack will not be unwound past node::*::MakeCallback if an uncaught exception is thrown. It means that the following code:

const binding = require('./build/Release/binding');
const makeCallback = binding.makeCallback;

process.on('uncaughtException', function() {});

makeCallback({}, function() {
  console.log('throwing error');
  throw new Error('boom');
});
console.log('foo');

would execute the call to console.log('foo'), which is inconsistent with what would happen if we'd replace the call to the node::MakeCallback JS binding with any JS function that calls its callback synchronously. That makes node::MakeCallback "special" and inconsistent.

Removing the external exception handler from node::*::MakeCallback would make the code sample above not execute the console.log('foo'); statement, while the error would still be handled by any domain error handler or process 'uncaughtException' event handler.

The problem with that approach though is that when cb->Call returns within node::*::MakeCallback, any exception thrown from within the JavaScript function that was executed is still pending, and so doing any calls to the JS layer after cb->Call returned seems dangerous, and possibly has undefined behavior.

That means that Domain.prototype.exit and async-wrap post-hooks wouldn't run. But maybe then can be called at another point? Domain.prototype.exit could be called when the domain's error is handled (from within node::FatalException), I'm not sure about async-wrap post-hooks.