When we are asked to optimize the performance of a system, we should
take a model-based, stochastic approach. An “optimization loop” is a
process that guides you through using that model to improve your
system’s performance.
Introduction
Crap about me
Who I work for
Etc
What we’re going to talk about
Model-based performance engineering
The optimization loop
Caveat Auditor
We will consider the specific case of transactional, web-based systems
The most important bits of what we’re going to talk about will apply
more broadly
Nothing said here is canon
Your mileage may vary
Model-based Optimization
What’s wrong with this picture?
Near ship time, someone says, “Time to make it faster”
You go to your users to get performance requirements
They make up a number (e.g. 500ms) and give it to you
You measure performance of the home page and tweak until it’s fast
Problems with optimization as commonly practiced
Done late
Done haphazardly
Done in the absence of an adequate model
Or any model at all
Performance is not a scalar quantity
“How fast is the system?” rarely has an answer
Your system probably doesn’t do just one thing
This point is about transaction mix
The world is not deterministic!
E.g. When you go to the supermarket, it does not always take you
exactly 15 minutes and 37 seconds.
Need a stochastic model of your system’s performance
Models are super useful
What is the 99% case for response time when I have 10 users?
What does it look like if I get mentioned on HN and have 10,000?
How many servers will I need to handle a spike?
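A minimal sketch of answering the first question in Clojure, assuming we keep the raw latency samples from a benchmark run. The sample numbers are invented for illustration; the point is to report percentiles of a distribution rather than a single average.
#+begin_src clojure
;; Sketch: treat response time as a distribution, not a scalar.
;; latencies-ms is an invented stand-in for samples from a benchmark run.
(def latencies-ms [12 14 14 15 16 18 21 25 31 44 47 52 60 95 180])

(defn percentile
  "Nearest-rank percentile of a collection of numbers; p is in [0, 100]."
  [p xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        idx    (-> (* (/ p 100) n) Math/ceil long dec (max 0))]
    (nth sorted idx)))

(percentile 50 latencies-ms) ;=> 25
(percentile 99 latencies-ms) ;=> 180
#+end_src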
Models are super wrong
“Remember that all models are wrong; the practical question
is how wrong do they have to be to not be useful.”
George E. P. Box, Empirical Model-Building and Response Surfaces
Be aware of assumptions in the model
Testing versus production
For example, how big is the data in your test set? Will the system
behave the same when the database has two years’ worth of data in it?
Is the network different? Does that matter?
Weirdness in test results
That little bump that always shows up: what the hell does that even
mean? Is it sampling error? Some weird second-order effect?
Weird stuff that users do
Extrapolation to load-balanced environments
Or to additional servers in the farm
One Useful Model
For transactional systems
Latency distribution vs. throughput
For a given transaction mix
There might be other constraints
For example, on our system, we also had to meet data freshness
constraints.
Other models exist
Depending on your system (e.g. batch) you might need to design one.
Show typical curve
Note the oddities in the curve - model is wrong
Questions we can now answer
Can we meet customer expectations?
How much capacity do we need to handle spikes?
What is it going to cost us to run this system?
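A hedged sketch of how the model answers these questions. The latency-vs-throughput numbers, the 200 ms target, and the 3000 rps spike are all invented for illustration; in practice they come from your benchmarks and your stakeholders.
#+begin_src clojure
;; Invented stand-in for the measured model: offered load (rps) -> p99 latency (ms).
(def p99-by-rps
  (sorted-map 100 12, 250 15, 500 22, 750 40, 1000 85, 1250 210, 1500 900))

(defn max-rps-meeting
  "Highest offered load at which p99 latency stays under target-ms."
  [target-ms model]
  (->> model
       (take-while (fn [[_ p99]] (< p99 target-ms)))
       (map key)
       (apply max)))

(defn servers-needed
  "Servers required to absorb a spike, assuming near-linear scaling
   behind a load balancer (an assumption the model does not verify)."
  [spike-rps target-ms model]
  (long (Math/ceil (/ spike-rps (max-rps-meeting target-ms model)))))

(max-rps-meeting 200 p99-by-rps)     ;=> 1000
(servers-needed 3000 200 p99-by-rps) ;=> 3
#+end_src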
Start performance work early
Early enough to make a difference
Develop automated benchmarks that run frequently
Every weekend
Every night
On every checkin
In the absence of anything else, useful as a relative measure
Holy smokes! It’s three times slower today!
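A small sketch of that relative check, assuming each run produces a summary map; the metric names, the numbers, and the 1.5x threshold are invented.
#+begin_src clojure
;; Sketch: use nightly benchmarks as a relative measure.
;; Summary maps are invented; in practice they come from last night's and
;; tonight's runs (e.g. median latency in ms at a fixed offered load).
(def baseline {:median-ms 21 :p99-ms 180})
(def tonight  {:median-ms 65 :p99-ms 600})

(defn regression?
  "True when tonight's value for metric is more than factor times the baseline."
  [metric factor]
  (> (/ (metric tonight) (metric baseline)) factor))

(when (regression? :median-ms 1.5)
  (println "Holy smokes! Median latency is"
           (format "%.1fx" (double (/ (:median-ms tonight) (:median-ms baseline))))
           "the baseline."))
#+end_src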
Performance has high potential for requiring redesign
“Any problem in computer science can be solved with another layer of
indirection.” –David Wheeler
“Except for the problem of too many layers of indirection.” –Kevlin Henney
Working with Stakeholders
Get them educated as early as possible
Foster a stochastic mindset
Find out what questions need to be answered
But don’t let early apathy or indecision stop you
Include them on the distribution for automated load testing
Consider making them the gatekeeper for optimization decisions
The Optimization Loop
The Optimization Loop
Benchmark
Analyze
Recommend
Optimize
Benchmarking
Measure the parameters of your model
Transaction mix is critical!
Don’t mistake # of threads for load
Understand what you are measuring!
E.g. are you measuring the response time at the client or the server?
Which one do you want to measure?
Don’t forget to look for errors!
Errors essentially change the transaction mix, and the happy path and
the exception path can have wildly different perf characteristics.
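A sketch of why both points matter inside the load generator itself: pick transactions by weight and tally errors separately, since the failure path can be far faster (or slower) than the happy path. The weights are invented, and issue-request! is a hypothetical stand-in for whatever your load tool or HTTP client actually does.
#+begin_src clojure
;; Sketch: weighted transaction mix with per-transaction error counting.
(def transaction-mix
  {:view-home   70   ; invented weights: 70% / 20% / 10% of traffic
   :search      20
   :place-order 10})

(defn pick-transaction
  "Pick a transaction type at random, proportional to its weight."
  [mix]
  (let [total (reduce + (vals mix))
        roll  (rand-int total)]
    (loop [[[tx w] & more] (seq mix), acc 0]
      (if (< roll (+ acc w)) tx (recur more (+ acc w))))))

(defn issue-request!
  "Hypothetical request fn; replace with your real client.
   Returns {:latency-ms n, :error? bool}."
  [tx]
  {:latency-ms (rand-int 100), :error? (zero? (rand-int 50))})

(defn run-step
  "Issue one request and tally count, errors, and latencies per transaction."
  [stats]
  (let [tx (pick-transaction transaction-mix)
        {:keys [latency-ms error?]} (issue-request! tx)]
    (-> stats
        (update-in [tx :count]     (fnil inc 0))
        (update-in [tx :errors]    (fnil + 0) (if error? 1 0))
        (update-in [tx :latencies] (fnil conj []) latency-ms))))

;; Drive 1000 requests and inspect the per-transaction tallies.
(reduce (fn [stats _] (run-step stats)) {} (range 1000))
#+end_src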
Lots of tools
Load generators
jmeter
ab
httperf
Analysis
Excel
Clojure!
Analysis
Figure out what is determining the perf of your system
Generally done via profiling
And lots of hard thinking
Does not necessarily need to be analysis of benchmark run
Lots of tools
E.g. YourKit
Recommendation
Generally, looking for the one slowest thing
Especially in the early rounds
This will get way harder the longer you go
Use the data!
I.e. don’t assume
The best recommendation is “We’re done. Let’s stop.”
Optimization
Fix the one slowest thing
Don’t assume you know what this is!
Point out that I’ve seen experienced developers go straight for the
algorithmic optimizations, when network calls were four orders of
magnitude slower. Also maybe note the logging thing we ran into.
You may find redesign is necessary
This is why you want to start early
One recent experience
One recent experience
A web service written in pure Clojure
Initial benchmarks at ~75 peak rps, ~25 ms avg latency
Did several rounds of optimization over about a month
Achieved 1500 rps, 10ms latency @ 99% confidence
Think we can scale close to linearly with load balancing
Customer thrilled
Weird stuff that happened
Never tested load-balanced
Would the network become the limiting factor? Can the load balancer
handle the volume?
Uncovered an issue with huge volume of log data
Generate something like 2GB of data per hour. Customer wanted to keep
it all. Had to figure out what data was actually important to keep and
figure out how to get it off the machine, since syslog completely
choked when doing log forwarding.
Third-round analysis showed our logging tool was the bottleneck
Stuart and Aaron wrote a C/Java wrapper that had a huge impact on perf.
A one character change in a config file had big perf impact
Other Considerations
Stress testing
We didn’t talk about this, but it’s important. You might be able to
leverage your benchmarks by running them at or beyond your peak load,
which you now actually know.
Fin
Thanks!
Try the fish!
Questions?
Feedback
Tim 2011-10-21
Summary: the big takeaways are that a stochastic/model-driven approach
is key, that transaction mix is critical, and that the
benchmark/analyze/recommend/optimize loop pays off. We should also put
more emphasis on doing it earlier rather than later.
Move the description of the system lower, so that it’s sort of a payoff. “Hey we did this, and it resulted in the following awesome thing.”
Urge performance testing early on
The “requires redesign” comment is sort of throwaway - needs to be emphasized
Expand on goal of finding the questions that need to be answered
Important points
It’s a stochastic world
Transaction mix is critical
The cycle
This motivates the approach
Make the point of the difference between Analysis and Benchmarking
Stress that the earlier you do this, the better
Better to run your first benchmark and find out you’re done
Also good if you run this all the time, so you can tell if you need to enter the loop again
Still a safety net
Fogus 2011-10-21
I’d like to see a discussion of accidentally measuring the wrong thing. I hope my meaning is clear.
I believe it is. Not sure whether I can fit it into a 40-minute talk (especially being a motormouth), but I'll keep it in mind during the next round of edits.
Tim 2011-10-26
Summary wording is weird.
Slides @ 33 & 44 - odd juxtaposition
“Day one” is a bit strong - “early enough to make a difference”
Doesn’t think we should have the “Develop a model” slide.
Instead, talk about the fact that other models exist and that you
might need to develop one.
Consider moving working with stakeholders above the loop slide
At the very least, needs to be above the section on “our experience”, since I’m going to wind up telling that part of the story before that slide comes up
Don’t forget the story about how we replaced the logger and got a huge perf improvement
And about how we generate huge amounts of log data
And about how we went to async with a one-character change in the config file
"Day one" is a bit strong - "early enough to make a difference"
Doesn't think we should have the "Develop a model" slide. Instead, talk about the fact that other models exist and that you might need to develop one.
Consider moving working with stakeholders above the loop slide
** At the very least, needs to be above the section on "our experience", since I'm going to wind up telling that part of the story before that slide comes up
Don't forget the story about how we replaced the logger and got a huge perf improvment
** And about how we generate huge amounts of log data
** And about how we went to async with a one-character change in the config file
** It's a stochastic world
** Transaction mix is critical
** The cycle
** This motivates the approach
** Better to run your first benchmark and find out you're done
** Still a safety net