My team has been working on an Ember.js application for the better part of a year now. About a month or two ago, we started seeing the following in our Travis CI build output intermittently:
...
ok 483 PhantomJS 2.0 - Acceptance: stats/filter: filter by service
ok 484 PhantomJS 2.0 - Acceptance: stats/filter: filter by region
not ok 485 PhantomJS - Browser "phantomjs /home/travis/build/fastly/Tango/node_modules/ember-cli/node_modules/testem/assets/phantom.js http://localhost:7357/5881" exited unexpectedly.
A Google search suggested it might be a corrupt PhantomJS build. So we talked to the Travis CI folks and the Ember.js folks, and changed our configuration to download PhantomJS 2.0.0 at the start of each build and use that version instead:
before_install:
  - wget https://s3.amazonaws.com/travis-phantomjs/phantomjs-2.0.0-ubuntu-12.04.tar.bz2
  - tar -xjf phantomjs-2.0.0-ubuntu-12.04.tar.bz2
  - sudo mkdir -p /usr/local/phantomjs/bin/
  - sudo mv phantomjs /usr/local/phantomjs/bin/phantomjs
No help.
The Travis CI folks also suggested we try using their new Trusty build environment:
sudo: required
dist: trusty
That fixed it! For a few weeks. Then the failures started coming back.
I thought that maybe something we were doing in our code was causing an exception in PhantomJS. So I bisected across roughly 1,200 recent commits, building each candidate several times on Travis CI to determine whether it passed reliably. The bisect landed on a rather banal commit that changed the logic of a minor helper:
- return value === null || value === 'none' ? '—' : value;
+ return Ember.isBlank(value) || value === 'none' ? '—' : value;
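For what it's worth, the change isn't entirely behavior-free: Ember.isBlank also treats undefined, empty strings, and whitespace-only strings as blank, where the old check only caught null. A quick illustration (the helper name here is made up, and the snippet assumes the Ember global or an import in an ember-cli module):

// Sketch of the helper logic before and after the change.
function displayValueOld(value) {
  return value === null || value === 'none' ? '—' : value;
}
function displayValueNew(value) {
  return Ember.isBlank(value) || value === 'none' ? '—' : value;
}

displayValueOld(undefined); // => undefined
displayValueNew(undefined); // => '—'
displayValueOld('   ');     // => '   '
displayValueNew('   ');     // => '—'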
I didn't believe that was the cause, so I went back and ran some of the "good" commits again. They now failed even though they had passed several times.
Since builds failed intermittently but the version of PhantomJS didn't seem to play a role, I began to suspect resource exhaustion. My first instinct was a memory leak. The Travis CI folks said that theory fit with the Trusty builds passing for a while, since that environment has more memory and runs with more isolation.
I confirmed the memory leak by running the test suite in Chrome on my machine and taking repeated heap snapshots. They grew continuously from test to test. Garbage collection would kick in every so often and clean up some, but not nearly all, of the allocated memory.
Here are snapshots one and two from the same test suite run, separated by about 20 seconds.
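If you want a rougher but automated signal than manual snapshots, Chrome exposes a non-standard performance.memory counter. A minimal sketch, assuming it lives in the test helper (that location is an assumption, not something from our setup):

// After each test, log Chrome's non-standard heap counter so steady growth
// across hundreds of tests is visible in the console. performance.memory
// only exists in Chrome, so guard for it.
QUnit.testDone(function(details) {
  if (window.performance && window.performance.memory) {
    var usedMB = (window.performance.memory.usedJSHeapSize / 1024 / 1024).toFixed(1);
    console.log('[heap] ' + usedMB + ' MB after ' + details.module + ': ' + details.name);
  }
});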
Finding a memory leak in thousands of lines of app code plus tens of thousands of lines of library code was daunting. And I needed to unblock my team. So I decided the best thing to do for now was to paper over the problem by splitting up the build into multiple suites.
I added ember-exam and set Travis up to run builds in parallel:
env:
  matrix:
    - EXAM_SUITE=1
    - EXAM_SUITE=2
    - EXAM_SUITE=3
    - EXAM_SUITE=4
script:
  - node_modules/ember-cli/bin/ember exam --split=4 --split-file="$EXAM_SUITE"
Unfortunately, ember-exam has a bug that causes non-test files in tests/ to be filtered out of the splits, which wreaked havoc on our test suite. That can absolutely be fixed, but I needed to focus on the fastest thing that would unblock my team.
Ember CLI's test command comes with some rudimentary built-in filtering: it lets you run all tests matching a positive substring. Good enough for now! I made sure that every call to module('...') had a prefix of "Acceptance: " or "Unit: " and then configured Travis to run those (very uneven) suites in parallel:
env:
  matrix:
    - TEST_FILTER="Acceptance: "
    - TEST_FILTER="Unit: "
    - TEST_FILTER="JSCS -"
    - TEST_FILTER="JSHint -"
script:
  - node_modules/ember-cli/bin/ember test --filter="$TEST_FILTER"
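For reference, those prefixes just live in the module names themselves, something like this (file paths and module names here are illustrative):

// tests/acceptance/stats-filter-test.js (illustrative)
module('Acceptance: stats/filter', {
  // beforeEach/afterEach elided
});

// tests/unit/helpers/display-value-test.js (illustrative)
module('Unit: helpers/display-value');

Since --filter is just a substring match against the full test name (which includes the module), the "Acceptance: " prefix is enough to select a whole suite.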
Now the build was green, and it ran a little faster as a bonus!
But this introduced a new problem. Each of the four jobs that passed would kick off the deploy step that published the compiled Ember app to our staging server. That's fine if all four pass, but if one fails, we don't want that build published!
There has been an issue open on Travis CI since February 2013 asking for some sort of after_all_jobs_succeed hook. Unfortunately, there has been no movement.
There is, however, a third-party tool called travis-after-all: a Node module that polls the Travis API for build status and runs a callback once all jobs have finished. Unfortunately, it doesn't yet work with private Travis accounts. OAuth is always a beast to deal with, and it's even worse when you can't run the code locally to test it.
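The polling idea itself is simple enough to sketch, even if the authentication part is not. Roughly (this illustrates the pattern, not travis-after-all's actual API; fetchJobStates is a hypothetical function that would hit the Travis API with an access token):

// Poll for the state of every job in the current build, and run a callback
// from one designated job once all of them have finished.
function waitForSiblingJobs(fetchJobStates, onAllDone, intervalMs) {
  var timer = setInterval(function() {
    fetchJobStates(function(err, states) { // e.g. ['passed', 'started', ...]
      if (err) { return; } // keep polling on transient errors
      var finished = states.every(function(s) {
        return s === 'passed' || s === 'failed' || s === 'errored' || s === 'canceled';
      });
      if (finished) {
        clearInterval(timer);
        onAllDone(states.every(function(s) { return s === 'passed'; }));
      }
    });
  }, intervalMs || 5000);
}

// Usage sketch: deploy only when every sibling job is green.
// waitForSiblingJobs(fetchJobStates, function(allPassed) {
//   if (allPassed) { /* trigger the staging deploy */ }
// });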
What I need to do, but am dreading:
- patch travis-after-all to support GitHub OAuth
- patch ember-exam to support non-test modules in tests/
- find and fix the memory leaks
Is there a way to get more information from PhantomJS about what happened before it crashed, or even why it crashed? Maybe that would provide a clue. Does the crash always occur at stats/filter: filter by region? Does that mean this feature's code is what triggers the crash?
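One cheap way to start answering that: log every test as it starts, so the last line of Travis output before the crash names the test that was running. A minimal sketch for the test helper (again, the location is an assumption):

QUnit.testStart(function(details) {
  // When PhantomJS dies, the last '[started]' line in the Travis log
  // identifies the test that was in flight.
  console.log('[started] ' + details.module + ': ' + details.name);
});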