We want to describe our flow configuration: what's allowed, and what's not. We want that description in a format that can be read by people who aren't network engineers, including prose commentary and links to bugs. And we want to be able to verify the description programmatically, so that we know it's accurate.
Programmatic tests!
Python test code, complete with docstrings, comments, and a few utilities to link test cases to bugs.
This allows us to be as specific as we need to be, while not defining every possible flow. We can leave things unspecified when they don't matter.
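Here's a rough sketch of what such a test file can look like. The hostnames, addresses, app names, and bug number are placeholders, and the `bug` helper is just one way such a bug-linking utility might be written; see test_deploystudio.py for the real thing.

    # Sketch of a fwunit-style test file. The calls shown here (Rules,
    # assertPermits, assertDenies) approximate fwunit's interface; the
    # addresses and bug number are illustrative placeholders.
    from fwunit.tests import Rules

    # flow data gathered from the SRXes by the last nightly run
    rules = Rules('enterprise.pkl')

    def bug(number):
        """Hypothetical helper tying a test case to a Bugzilla bug."""
        def wrap(fn):
            fn.bug_url = 'https://bugzilla.mozilla.org/show_bug.cgi?id=%s' % number
            return fn
        return wrap

    @bug(999999)  # placeholder bug number
    def test_build_hosts_reach_deploystudio():
        """Build hosts can image themselves from the DeployStudio server."""
        # 'deploystudio' is an "app" named in fwunit.yaml
        rules.assertPermits('10.26.48.0/24', '10.26.48.10', 'deploystudio')

    def test_deploystudio_not_public():
        """Hosts outside the build network cannot reach DeployStudio."""
        rules.assertDenies('8.8.8.8', '10.26.48.10', 'deploystudio')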
I developed fwunit to gather information about flows as implemented (pulling directly from the SRXes and from AWS).
Then I wrote a collection of test files, stored in a private firewall-tests repository.
On fwunit1.private.releng.scl3.mozilla.com, there's a nightly fwunit run followed by a run of the tests.
I get emailed on test failure.
There's also an easy way to re-run the tests against last night's data, which is useful when updating the test scripts.
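The nightly job amounts to something like the following driver. This is a sketch, not the actual script on fwunit1: the fwunit invocation, the choice of test runner, and the mail addresses are all assumptions here.

    #!/usr/bin/env python
    """Hypothetical nightly driver -- a sketch, not the real fwunit1 job.

    Assumes `fwunit fwunit.yaml` refreshes the flow data, the tests run
    under pytest, and failures are mailed via the local MTA.
    """
    import smtplib
    import subprocess
    from email.mime.text import MIMEText

    def run(cmd):
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True)
        return proc.returncode, proc.stdout

    def main():
        # refresh flow data from the SRXes and AWS, then run the test suite
        for cmd in (['fwunit', 'fwunit.yaml'],
                    ['pytest', 'firewall-tests/']):
            status, output = run(cmd)
            if status != 0:
                msg = MIMEText(output)
                msg['Subject'] = 'fwunit failure: {}'.format(' '.join(cmd))
                msg['From'] = msg['To'] = 'releng@example.com'  # placeholder
                with smtplib.SMTP('localhost') as smtp:
                    smtp.send_message(msg)
                break  # don't run tests against stale data

    if __name__ == '__main__':
        main()

Running only the test step by hand, against the data files left by the last nightly run, gives the "last night's data" mode mentioned above.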
See test_deploystudio.py for a real example of a test script.
The fwunit config is in fwunit.yaml, which describes how configuration is pulled from the firewalls and AWS, and how "apps" are determined.
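Roughly, it looks like this. This is a sketch only: the keys approximate fwunit's config format, and the hostnames and credentials are placeholders.

    # sketch of a fwunit.yaml; keys are approximate, values are placeholders
    enterprise:
        type: srx
        firewall: fw1.example.com
        ssh_username: fwunit
        ssh_password: sekrit
        output: enterprise.pkl

    aws:
        type: aws
        regions: [us-east-1, us-west-1]
        output: aws.pkl

The "app" definitions (mapping ports and protocols to names like 'deploystudio') are configured in this file as well, but omitted from the sketch.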
Already, this has been a big win:
- Discovered lots of incorrectly configured flows (either too broad or too narrow). The too-narrow flows would have hurt us when trying to operate in a degraded mode. https://bugzilla.mozilla.org/show_bug.cgi?id=1050854
- Eliminated a long list of redundant flow configurations, secure in the knowledge that the changes were correct. Also in https://bugzilla.mozilla.org/show_bug.cgi?id=1050854
- Successfully refactored AWS-related flows to be defined in security groups instead of at the firewalls. This resulted in more restrictive flows within AWS and a simpler overall configuration. https://bugzilla.mozilla.org/show_bug.cgi?id=1058225
- It turns out that fwunit is useful outside of release engineering!
There's still more to do:
- Find a way to involve more groups than just me in this process
- Write tests for more bits of the network:
- pulse
- slaveapi
- vcs sync
- winadmin
- relengweb
- partner-repack
- celery
- mozpool
- blobuploader
- dev-stage, dev-master, etc.
- cruncher
- nagios
- Add features to suit opsec's needs
This is actually a little vague -- updating tests is a lot to ask of either netops or releng, and the value is difficult to quantify. When the tests fail, it can be very difficult to figure out why, and how to fix them, even with a deep background in firewall configs, AWS security groups, and our IPv4 architecture. As an engineer, it's tempting to treat this as a test suite rather than as a document, but the document is the primary value. So how do we meet that objective while asking as little as possible of other teams?