The goal of this project was to integrate several BCC tools into Performance Co-Pilot (PCP) and Vector. Integrating the BCC tools into the PCP framework provides a number of benefits, including 24/7 monitoring, archiving of the collected metrics, and exporting them to other systems. Furthermore, Vector can consume the metrics in near real-time and display them in the browser with meaningful visualizations, e.g., heat maps and flame graphs. Adequate visualization of collected performance metrics makes it easier to identify and resolve performance issues.
The following BCC tools were integrated into PCP and Vector:
- execsnoop: traces new processes
- runqlat: records the scheduler run queue latency as a histogram
- profile: records stack traces at a specific interval
- biolatency*: records block device I/O latency as a histogram
- biotop*: summarizes which processes are performing block I/O
- ext4dist, xfsdist, zfsdist: trace read/write/open/fsync latencies as histograms
- tcplife*: summarizes TCP sessions
- tcpretrans: traces TCP retransmits
- tcptop: summarizes TCP throughput by host and port
* The PCP module was already implemented; a few changes were required for the tcplife and biotop modules.
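Several of the tools above report latencies as histograms with power-of-two buckets. The following is a minimal sketch of that bucketing, not the code used by BCC or the PCP modules; the function names and the sample values are hypothetical and chosen only for illustration.

```python
from collections import Counter


def log2_bucket(value: int) -> int:
    """Return the power-of-two bucket index for a non-negative integer."""
    # bit_length() is floor(log2(value)) + 1 for value > 0 and 0 for value == 0,
    # so the values 0 and 1 both land in bucket 0.
    return max(value.bit_length() - 1, 0)


def log2_histogram(latencies_us):
    """Map (low, high) bucket ranges to the number of samples in each range."""
    counts = Counter(log2_bucket(v) for v in latencies_us)
    hist = {}
    for idx in sorted(counts):
        low = 0 if idx == 0 else 2 ** idx
        high = 2 ** (idx + 1) - 1
        hist[(low, high)] = counts[idx]
    return hist


# Hypothetical latency samples in microseconds.
samples = [0, 1, 3, 5, 6, 12, 900]
histogram = log2_histogram(samples)
for (low, high), count in histogram.items():
    print(f"{low:>6} -> {high:<6} : {count}")
```

A heat map can then be thought of as a sequence of such histograms, one per sampling interval, with the count in each bucket mapped to a color.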
The following widgets were added to Vector:
- heatmap (based on d3-heatmap2)
- table
- flamegraph (based on d3-flame-graph)
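A flame graph widget needs the sampled stack traces as a hierarchy rather than as a flat list. The sketch below shows one way to fold per-stack sample counts into a nested name/value/children structure of the kind d3-flame-graph consumes. It is an illustration under assumptions, not the code used in PCP or Vector: the function name, the input format (tuples of frames, root first), and the sample data are hypothetical.

```python
def build_flame_tree(stack_counts):
    """Fold {(frame, frame, ...): sample_count} into a nested hierarchy.

    Each node's "value" is the number of samples in which that frame
    appeared at that position, i.e. it includes all of its children.
    """
    root = {"name": "root", "value": 0, "children": []}
    for frames, count in stack_counts.items():
        node = root
        node["value"] += count
        for frame in frames:
            # Reuse an existing child with the same name, or create one.
            child = next((c for c in node["children"] if c["name"] == frame), None)
            if child is None:
                child = {"name": frame, "value": 0, "children": []}
                node["children"].append(child)
            child["value"] += count
            node = child
    return root


# Hypothetical sample data: stack (root frame first) -> number of samples.
stacks = {
    ("main", "read"): 3,
    ("main", "write"): 2,
    ("idle",): 5,
}
tree = build_flame_tree(stacks)
```

Serialized as JSON, a structure like this can be handed to the browser, where the width of each frame is proportional to its value.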
All main goals, including one stretch goal (profile), have been merged into their respective repositories.
The latest stable PCP release, version 4.1.1, includes all new BCC tools. The Vector widgets have been merged into the master branch of the Vector repository.
Both projects are hosted on GitHub, and every change was integrated through pull requests.
A considerable amount of time was spent on implementing and debugging the automated QA tests. In a few cases, race conditions and background processes influenced the test results. Because these bugs occurred only occasionally, debugging them was troublesome, especially on Travis CI, where each complete run took about 20 minutes and the log output was the only source of information.
Another difficult-to-catch bug occurred with the last module, profile. This module used a background thread, which ran at a glacial pace, although the same code ran at normal speed in the main thread. The problem was that the Python Global Interpreter Lock (GIL) wasn't released while the Python PMDA was waiting for new instructions, so the background thread was effectively blocked (thanks to Marko Myllynen and Frank Ch. Eigler for debugging this issue!).
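The pattern involved can be sketched in pure Python. This is not the PMDA code and does not reproduce the bug, which was located in native code holding the GIL while blocked. It only illustrates the working arrangement: a background thread collects data while the main thread waits in a call that releases the GIL. The class and function names are hypothetical.

```python
import threading
import time


class BackgroundSampler:
    """Hypothetical stand-in for a module that collects data in a thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._samples = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            with self._lock:
                self._samples += 1
            # Event.wait() releases the GIL while it sleeps.
            self._stop.wait(0.001)

    def snapshot(self):
        with self._lock:
            return self._samples

    def stop(self):
        self._stop.set()
        self._thread.join()


def wait_for_requests(sampler, duration_s):
    """Stand-in for a main loop that blocks until a request arrives."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        # time.sleep() releases the GIL, so the sampler thread keeps running.
        # A native call that blocks here while still holding the GIL would
        # starve the sampler thread instead.
        time.sleep(0.01)
    return sampler.snapshot()


sampler = BackgroundSampler()
sampler.start()
collected = wait_for_requests(sampler, 0.2)
sampler.stop()
print(f"samples collected while the main thread was waiting: {collected}")
```

In a C extension, the equivalent of the sleeping call is a blocking region wrapped in Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS, which is the standard CPython mechanism for releasing the GIL around blocking native code.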
Vector uses nvd3, which requires d3 version 3, whereas d3-heatmap2 and d3-flame-graph require d3 version 4. The "fix" for d3-heatmap2 was a tiny wrapper around the renamed functions, but d3-flame-graph relies on features that are new in d3 version 4. I therefore also included d3 version 4 in the project, renamed its global variable to d3v4, and modified the module definition of the bundled d3-flame-graph to use this variable instead. Vector is currently undergoing a major refactoring (replacing AngularJS with React and nvd3 with semiotic, and upgrading d3 to version 4), so this workaround will soon be obsolete.
I would like to thank my mentors Marko Myllynen and Martin Spier for their support, code reviews and responsiveness. I would also like to thank Nathan Scott for merging my PRs and for the occasional pings about failed Travis CI builds, Frank Ch. Eigler for helping to debug the GIL issue, and Mark Goodwin for fixing the PMDA shutdown issue. Furthermore, thanks to Brendan Gregg for creating the original BCC tools. Finally, I would like to thank Google for giving me this opportunity.
A few screenshots of the new Vector widgets:
biolatency
tcplife
profile