Collecting performance data, either in test or production, is only half the story. Why? First, all data is wrong ... by definition. Data is the result of some kind of measurement process and, like cutting a piece of lumber to a measured length, there's always some deviation between the expected length (the pencil line) and the actual cut length (made by the saw blade). The important question is, how much deviation should be tolerated? The building industry has standards that define a tolerance for blueprint lengths, viz., ± 1/8" (or ± 3mm). What is the tolerance for performance engineering measurements?
Second, the only sensible way to address this question is to compare measurements with their expected values. Expected values come from a "blueprint" called a performance model: a calculable abstraction that characterizes the system being measured. This notion should come as no surprise. Accounting departments do this every day, viz., construct a forecast model of expected revenue and compare actual revenue against it. In Accounting, this procedure is de rigueur. In performance engineering, such a procedure seems to be either unknown or ignored.
In fact, the performance engineering procedure can be summarized in a simple diagram.
For example, given a performance metric, C(N), that is measurable as a function of load, N (i.e., concurrent processes or threads running in test or prod), C(N) can be defined in terms of both its performance data and a corresponding performance model. Both characterizations of C(N) must be in agreement—within some acceptable tolerance. This relationship between data and a model challenges the generally held notion that data are sacrosanct. Guerrilla Mantra 1.16: Data Are Not Divine. A fortiori, the quip often attributed to Einstein: If the data don't fit the model, change the data.
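To make that comparison concrete, here is a minimal sketch in Python (the model function, the data values, and the 5% tolerance are purely illustrative) of checking measured C(N) values against the expected values from a model:

```python
# Sketch: compare measured C(N) data with model expectations, within a tolerance.
# The model function, the data values, and the 5% tolerance are illustrative.

def within_tolerance(measured, expected, tol=0.05):
    """True if the relative deviation from the expected value is inside the band."""
    return abs(measured - expected) <= tol * expected

# Hypothetical throughput measurements C(N) at increasing load N
data = {1: 98.0, 4: 360.0, 8: 640.0, 16: 1020.0}

# Placeholder performance model; in practice this comes from a queueing model
def model_C(N):
    return 100.0 * N / (1 + 0.02 * (N - 1))

for N, c_meas in data.items():
    c_exp = model_C(N)
    print(f"N={N:2d}  measured={c_meas:7.1f}  expected={c_exp:7.1f}  "
          f"agree: {within_tolerance(c_meas, c_exp)}")
```

Measurements that fall outside the tolerance band are the ones that demand an explanation, either in the data collection or in the model.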
The Guerrilla approach involves a set of techniques intended to overcome this lack of rigor in performance engineering by providing both students and professionals with a lingua franca that forces rigorous requirements to the surface. The base language comes from queueing theory because there is a 1-to-1 correspondence between the performance metrics that characterize queues and the performance metrics that characterize computer systems. Indeed, all computer systems, from your smartphone to Facebook.com, can be represented as a directed graph of queues.
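As a sketch of that correspondence (the tiers and service demands below are hypothetical), a small web system can be written down as a directed graph of queues, and its maximum throughput is then bounded by the operational bottleneck law, X_max = 1/D_max:

```python
# Sketch: a small web system as a directed graph of queues.
# Tier names and per-request service demands D (in seconds) are hypothetical.

queueing_network = {
    "load_balancer": {"demand": 0.001, "feeds": ["app_server"]},
    "app_server":    {"demand": 0.010, "feeds": ["database"]},
    "database":      {"demand": 0.025, "feeds": []},
}

# Operational bottleneck law: system throughput cannot exceed 1 / max(D_i)
bottleneck = max(queueing_network, key=lambda k: queueing_network[k]["demand"])
D_max = queueing_network[bottleneck]["demand"]
print(f"Bottleneck: {bottleneck}, throughput bound = {1.0 / D_max:.0f} requests/sec")
```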
A sad fact is that most queueing theory textbooks are written by mathematicians, for mathematicians. Those who need to understand queueing models most, like performance engineers, understandably find the math inscrutable, if not totally repugnant. The math is complicated because real queueing dynamics is indeed very complicated. Like computer systems, queues are nondeterministic and that means the dynamics can only be known in probability. Next time you're in a grocery store, pay close attention and try to predict what will happen in the next 60 seconds or so. The mathematical complexity in trying to make such predictions arises from fluctuations in the queue, i.e., instantaneous changes of state. Who could have guessed that the current customer at the checkout register would inadvertently force the cashier to request a price-check? Now, your estimated waiting time is suddenly wrong! And there's no way you could have seen that fluctuation coming.
The primary trick in the Guerrilla approach is to turn the fluctuations off. The effect on the queueing model is to reduce the highly unintuitive mathematics of applied probability theory to simple high-school algebra. Moreover, this fluctuation-free queue is not some crippled distortion. It is completely correct and therefore can be enhanced with fluctuations, if the need arises. In my experience, however, the insights gained from a simple Guerrilla model make obvious what needs to be corrected to improve performance in the real computer system. Any concerns about adding the missing fluctuations are quickly forgotten. Instead, the most important action is to rinse and repeat the Guerrilla procedure. In other words, apply the identified performance corrections to the actual computer system, measure its performance again and then compare those new data to the same Guerrilla performance model. This process of iterative improvement is repeated until everyone is satisfied that the system performance goal can be reached.
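To see what turning the fluctuations off buys you algebraically, here is a sketch contrasting a textbook M/M/1 queue, where fluctuations produce the residence time R = S/(1 - ρ), with a fluctuation-free (deterministic) counterpart, where no waiting builds up below saturation. The service time and utilizations are assumed for illustration:

```python
# Sketch: residence time with and without fluctuations (illustrative numbers).
S = 0.010  # mean service time in seconds (assumed)

def R_fluctuating(rho):
    """Textbook M/M/1 residence time: R = S / (1 - rho)."""
    return S / (1.0 - rho)

def R_fluctuation_free(rho):
    """Deterministic queue: no waiting accumulates below saturation, so R = S."""
    return S if rho < 1.0 else float("inf")

for rho in (0.25, 0.50, 0.75, 0.90, 0.95):
    print(f"rho={rho:.2f}  with fluctuations: {R_fluctuating(rho)*1000:6.1f} ms"
          f"  without: {R_fluctuation_free(rho)*1000:5.1f} ms")
```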
Students also need to understand that it is not sufficient simply to know individual performance metrics. Any performance metric, no matter how obscure, e.g., Google's "page speed", must belong to one of the following three metric types (how they fit together is sketched after the list):
- Time (the zeroth metric)
- Rate
- Number
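These three types are not independent. As a sketch with made-up numbers, Little's law, N = X R, ties a Number metric to the product of a Rate metric and a Time metric:

```python
# Sketch: the three metric types related by Little's law, N = X * R (assumed numbers).
X = 500.0    # Rate:   throughput in requests per second
R = 0.040    # Time:   mean residence time per request, in seconds
N = X * R    # Number: mean number of requests resident in the system

print(f"Rate X = {X:.0f} req/s, Time R = {R*1000:.0f} ms  ->  Number N = {N:.0f} requests")
```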
Our brains have evolved to predict linearly, whereas computer systems behave nonlinearly. That means performance predictions cannot be guessed. They must be computed. The Guerrilla approach intrinsically captures the nonlinear relationship between performance metrics.
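As a quick numerical illustration (the service time and utilizations are assumed, and the residence-time formula is the same textbook one used in the earlier sketch), a linear extrapolation from two measurements seriously underestimates what the nonlinear computation gives:

```python
# Sketch: linear extrapolation versus nonlinear computation (assumed values).
S = 0.010                      # mean service time in seconds
R = lambda rho: S / (1 - rho)  # textbook residence time with fluctuations

# "Measure" at two utilizations, extrapolate linearly to a third, then compute.
r50, r75 = R(0.50), R(0.75)
linear_guess = r75 + (r75 - r50) / (0.75 - 0.50) * (0.90 - 0.75)
print(f"Linear guess at rho=0.90:   {linear_guess*1000:.0f} ms")
print(f"Computed value at rho=0.90: {R(0.90)*1000:.0f} ms")
```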
Another aspect of the Guerrilla approach pertains to data visualization. Time-series plots (the analog of Wall Street charts), which are so ubiquitous in modern performance monitoring tools, are not the best cognitive match for our brain. Anywhere from a quarter to a half of our cerebral cortex is devoted to visual processing. That makes our brain one of the best cognitive tools for pattern recognition. But our brain is hungry to see patterns, even when none are there! So, looking for patterns in time-series data should be done using rigorous statistical tools. One of the best, if not the best, is the forecast package in R. However, even with such powerful tools at our disposal, one needs to keep in mind that this type of statistical forecasting is really just a sophisticated form of data trending.
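In the same spirit as R's forecast package, here is a minimal Python sketch (statsmodels standing in for forecast, and a synthetic series standing in for real monitoring samples) of fitting a trend model and extrapolating it:

```python
# Sketch: statistical trending of a monitored metric.
# statsmodels is used here as a Python stand-in for R's forecast package,
# and the series below is synthetic (slow drift plus noise), not real data.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
t = np.arange(200)
series = 40 + 0.05 * t + rng.normal(0, 2, size=t.size)   # e.g., CPU utilization samples

fit = ARIMA(series, order=(1, 1, 1)).fit()   # a simple ARIMA(1,1,1) trend model
forecast = fit.forecast(steps=24)            # extrapolate 24 samples ahead

print("Next 24 forecast values:", np.round(forecast, 1))
```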
An often overlooked limitation of statistical trending is that it can only make predictions based on the information already contained in the existing data set. In that sense, forecasting cannot predict effects that have not already occurred. In particular, forecasting cannot predict future bottlenecks in the actual computer system if they have not already been detected. Guerrilla Mantra 1.20: You never remove a bottleneck, you only move it. Clearly, forecasting cannot work in the case where you would like to predict the performance, e.g., scalability, of a computer system that has not yet been built. Once again, that is a role of Guerrilla queueing models. Accordingly, explicit time-series plots need to be transformed into implicit-time plots so as to make clear the nonlinear relationship between metrics. This is the basis of the Universal Scalability Law.
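For reference, the USL expresses relative capacity as C(N) = N / (1 + α(N - 1) + βN(N - 1)). The sketch below, with illustrative contention (α) and coherency (β) values, computes the scalability curve and the load at which it peaks:

```python
import math

# Sketch: the Universal Scalability Law with illustrative coefficients.
alpha = 0.05    # contention (serialization) parameter, assumed
beta  = 0.0002  # coherency (crosstalk) parameter, assumed

def usl(N):
    """Relative capacity C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return N / (1 + alpha * (N - 1) + beta * N * (N - 1))

N_peak = math.sqrt((1 - alpha) / beta)   # load at which C(N) is maximal
for N in (1, 16, 32, 64, 128):
    print(f"N={N:3d}  C(N)={usl(N):6.2f}")
print(f"Scalability peaks near N = {N_peak:.0f}")
```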
Beyond the technical content of the Guerrilla approach just outlined, I've always had trepidations about going online. One reason is that I'm old-school, preferring students to be physically present so that I can see visual cues that allow me to gauge how well I'm being understood, or not. Then came Covid-19, which changed everything. Everyone now has Zoom, and the video quality has improved to the point where visual cues are now as good as, or sometimes better than, those in a physical classroom. Additionally, real-time delivery (rather than on-demand video) enables me to learn directly from students. Most recently, I was given a demo of a k8s application in a way that would otherwise have taken me a lot of time to learn by myself. It's always gratifying when the student teaches the teacher.
Several examples of applying the Guerrilla approach to various computer systems, including Cloud and GenAI, will be presented. Full details are covered in the online Guerrilla Training classes.