My team has been working on an Ember.js application for the better part of a year now. About a month or two ago, we started seeing the following in our Travis CI build output intermittently:
...
ok 483 PhantomJS 2.0 - Acceptance: stats/filter: filter by service
ok 484 PhantomJS 2.0 - Acceptance: stats/filter: filter by region
not ok 485 PhantomJS - Browser "phantomjs /home/travis/build/fastly/Tango/node_modules/ember-cli/node_modules/testem/assets/phantom.js http://localhost:7357/5881" exited unexpectedly.
A Google search suggested it might be a corrupt PhantomJS build. So we talked to the Travis CI folks and the Ember.js folks, and changed our configuration to download PhantomJS 2.0.0 at the start of each build and use that version instead:
before_install:
  - wget https://s3.amazonaws.com/travis-phantomjs/phantomjs-2.0.0-ubuntu-12.04.tar.bz2
  - tar -xjf phantomjs-2.0.0-ubuntu-12.04.tar.bz2
  - sudo mkdir -p /usr/local/phantomjs/bin/
  - sudo mv phantomjs /usr/local/phantomjs/bin/phantomjs
No help.
The Travis CI folks also suggested we try using their new Trusty build environment:
sudo: required
dist: trusty
That fixed it! For a few weeks. Then the failures started coming back.
I thought that maybe something we were doing in our code was causing an exception in PhantomJS. So I bisected across roughly 1,200 recent commits, building each candidate several times on Travis CI to determine whether it passed reliably. The bisect landed on a rather banal commit that changed the logic of a minor helper:
- return value === null || value === 'none' ? '—' : value;
+ return Ember.isBlank(value) || value === 'none' ? '—' : value;
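For what it's worth, the change isn't entirely behavior-free: Ember.isBlank also treats undefined, empty strings, and whitespace-only strings as blank, where the old check only caught null. A quick illustration (the helper name here is made up, and the snippet assumes the Ember global or an import in an ember-cli module):

// Sketch of the helper logic before and after the change.
function displayValueOld(value) {
  return value === null || value === 'none' ? '—' : value;
}
function displayValueNew(value) {
  return Ember.isBlank(value) || value === 'none' ? '—' : value;
}

displayValueOld(undefined); // => undefined
displayValueNew(undefined); // => '—'
displayValueOld('   ');     // => '   '
displayValueNew('   ');     // => '—'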
I didn't believe that was the cause, so I went back and ran some of the "good" commits again. They now failed even though they had passed several times.
Since builds failed intermittently but the version of PhantomJS didn't seem to play a role, I began to suspect resource exhaustion. My first instinct was a memory leak. The Travis CI folks said that theory fit with the Trusty builds passing for a while, since that environment has more memory and runs with more isolation.
I confirmed the memory leak by running the test suite in Chrome on my machine and taking repeated heap snapshots. They grew continuously from test to test. Garbage collection would kick in every so often and clean up some, but not nearly all, of the allocated memory.
Here are snapshots one and two from the same test suite run, separated by about 20 seconds.
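If you want a rougher but automated signal than manual snapshots, Chrome exposes a non-standard performance.memory counter. A minimal sketch, assuming it lives in the test helper (that location is an assumption, not something from our setup):

// After each test, log Chrome's non-standard heap counter so steady growth
// across hundreds of tests is visible in the console. performance.memory
// only exists in Chrome, so guard for it.
QUnit.testDone(function(details) {
  if (window.performance && window.performance.memory) {
    var usedMB = (window.performance.memory.usedJSHeapSize / 1024 / 1024).toFixed(1);
    console.log('[heap] ' + usedMB + ' MB after ' + details.module + ': ' + details.name);
  }
});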
Finding a memory leak in thousands of lines of app code plus tens of thousands of lines of library code was daunting. And I needed to unblock my team. So I decided the best thing to do for now was to paper over the problem by splitting up the build into multiple suites.
I added ember-exam and set Travis up to run builds in parallel:
env:
  matrix:
    - EXAM_SUITE=1
    - EXAM_SUITE=2
    - EXAM_SUITE=3
    - EXAM_SUITE=4
script:
  - node_modules/ember-cli/bin/ember exam --split=4 --split-file="$EXAM_SUITE"
Unfortunately, ember-exam has a bug that causes non-test files in tests/ to be filtered out of the splits, which wreaked havoc on our test suite. That can absolutely be fixed, but I needed to focus on the fastest thing that would unblock my team.
Ember CLI's test command comes with some rudimentary built-in filtering: it lets you run all tests matching a positive substring. Good enough for now! I made sure that every call to module('...') had a prefix of "Acceptance: " or "Unit: " and then configured Travis to run those (very uneven) suites in parallel:
env:
  matrix:
    - TEST_FILTER="Acceptance: "
    - TEST_FILTER="Unit: "
    - TEST_FILTER="JSCS -"
    - TEST_FILTER="JSHint -"
script:
  - node_modules/ember-cli/bin/ember test --filter="$TEST_FILTER"
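For reference, those prefixes just live in the module names themselves, something like this (file paths and module names here are illustrative):

// tests/acceptance/stats-filter-test.js (illustrative)
module('Acceptance: stats/filter', {
  // beforeEach/afterEach elided
});

// tests/unit/helpers/display-value-test.js (illustrative)
module('Unit: helpers/display-value');

Since --filter is just a substring match against the full test name (which includes the module), the "Acceptance: " prefix is enough to select a whole suite.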
Now the build was green, and it ran a little faster as a bonus!
But this introduced a new problem. Each of the four jobs that passed would kick off the deploy step that published the compiled Ember app to our staging server. That's fine if all four pass, but if one fails, we don't want that build published!
There has been an issue open on Travis CI since February 2013 asking for some sort of after_all_jobs_succeed hook. Unfortunately, there has been no movement.
There is, however, a third-party tool called travis-after-all: a Node module that polls the Travis API for build status and runs a callback once all jobs have finished. Unfortunately, it doesn't yet work with private Travis accounts. OAuth is always a beast to deal with, and it's even worse when you can't run the code locally to test it.
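The polling idea itself is simple enough to sketch, even if the authentication part is not. Roughly (this illustrates the pattern, not travis-after-all's actual API; fetchJobStates is a hypothetical function that would hit the Travis API with an access token):

// Poll for the state of every job in the current build, and run a callback
// from one designated job once all of them have finished.
function waitForSiblingJobs(fetchJobStates, onAllDone, intervalMs) {
  var timer = setInterval(function() {
    fetchJobStates(function(err, states) { // e.g. ['passed', 'started', ...]
      if (err) { return; } // keep polling on transient errors
      var finished = states.every(function(s) {
        return s === 'passed' || s === 'failed' || s === 'errored' || s === 'canceled';
      });
      if (finished) {
        clearInterval(timer);
        onAllDone(states.every(function(s) { return s === 'passed'; }));
      }
    });
  }, intervalMs || 5000);
}

// Usage sketch: deploy only when every sibling job is green.
// waitForSiblingJobs(fetchJobStates, function(allPassed) {
//   if (allPassed) { /* trigger the staging deploy */ }
// });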
What I need to do, but am dreading:
- patch travis-after-all to support GitHub OAuth
- patch ember-exam to support non-test modules in tests/
- find and fix the memory leaks
Is there a way to get more information from PhantomJS about what happened before it crashed, or even why it crashed? Maybe that would provide a clue. Does the crash always occur at stats/filter: filter by region? Does that mean this feature's code is what triggers the crash?
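One cheap way to start answering that: log every test as it starts, so the last line of Travis output before the crash names the test that was running. A minimal sketch for the test helper (again, the location is an assumption):

QUnit.testStart(function(details) {
  // When PhantomJS dies, the last '[started]' line in the Travis log
  // identifies the test that was in flight.
  console.log('[started] ' + details.module + ': ' + details.name);
});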