When we are asked to optimize the performance of a system, we should
take a model-based, stochastic approach. An “optimization loop” is a
process that guides you through using that model to improve your
system’s performance.
Introduction
Crap about me
Who I work for
Etc
What we’re going to talk about
Model-based performance engineering
The optimization loop
Caveat Auditor
We will consider the specific case of transactional, web-based systems
The most important bits of what we’re going to talk about will apply
more broadly
Nothing said here is canon
Your mileage may vary
Model-based Optimization
What’s wrong with this picture?
Near ship time, someone says, “Time to make it faster”
You go to your users to get performance requirements
They make up a number (e.g. 500ms) and give it to you
You measure performance of the home page and tweak until it’s fast
Problems with optimization as commonly practiced
Done late
Done haphazardly
Done in the absence of an adequate model
Or any model at all
Performance is not a scalar quantity
“How fast is the system?” rarely has an answer
Your system probably doesn’t do just one thing
This point is about transaction mix
The world is not deterministic!
E.g. When you go to the supermarket, it does not always take you
exactly 15 minutes and 37 seconds.
Need a stochastic model of your system’s performance
Models are super useful
What is the 99% case for response time when I have 10 users?
What does it look like if I get mentioned on HN and have 10,000?
How many servers will I need to handle a spike?
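A minimal sketch of answering the first question in Clojure, assuming we keep the raw latency samples from a benchmark run. The sample numbers are invented for illustration; the point is to report percentiles of a distribution rather than a single average.
#+begin_src clojure
;; Sketch: treat response time as a distribution, not a scalar.
;; latencies-ms is an invented stand-in for samples from a benchmark run.
(def latencies-ms [12 14 14 15 16 18 21 25 31 44 47 52 60 95 180])

(defn percentile
  "Nearest-rank percentile of a collection of numbers; p is in [0, 100]."
  [p xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        idx    (-> (* (/ p 100) n) Math/ceil long dec (max 0))]
    (nth sorted idx)))

(percentile 50 latencies-ms) ;=> 25
(percentile 99 latencies-ms) ;=> 180
#+end_src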
Models are super wrong
“Remember that all models are wrong; the practical question
is how wrong do they have to be to not be useful.”
George E. P. Box, Empirical Model-Building and Response Surfaces
Be aware of assumptions in the model
Testing versus production
For example, how big is the data in your test set? Will the system
behave the same when the database has two years’ worth of data in it?
Is the network different? Does that matter?
Weirdness in test results
That little bump that always shows up: what the hell does that even
mean? Is it sampling error? Some weird second-order effect?
Weird stuff that users do
Extrapolation to load-balanced environments
Or to additional servers in the farm
One Useful Model
For transactional systems
Latency distribution vs. throughput
For a given transaction mix
There might be other constraints
For example, on our system, we also had to meet data freshness
constraints.
Other models exist
Depending on your system (e.g. batch) you might need to design one.
Show typical curve
Note the oddities in the curve - model is wrong
Questions we can now answer
Can we meet customer expectations?
How much capacity do we need to handle spikes?
What is it going to cost us to run this system?
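A hedged sketch of how the model answers these questions. The latency-vs-throughput numbers, the 200 ms target, and the 3000 rps spike are all invented for illustration; in practice they come from your benchmarks and your stakeholders.
#+begin_src clojure
;; Invented stand-in for the measured model: offered load (rps) -> p99 latency (ms).
(def p99-by-rps
  (sorted-map 100 12, 250 15, 500 22, 750 40, 1000 85, 1250 210, 1500 900))

(defn max-rps-meeting
  "Highest offered load at which p99 latency stays under target-ms."
  [target-ms model]
  (->> model
       (take-while (fn [[_ p99]] (< p99 target-ms)))
       (map key)
       (apply max)))

(defn servers-needed
  "Servers required to absorb a spike, assuming near-linear scaling
   behind a load balancer (an assumption the model does not verify)."
  [spike-rps target-ms model]
  (long (Math/ceil (/ spike-rps (max-rps-meeting target-ms model)))))

(max-rps-meeting 200 p99-by-rps)     ;=> 1000
(servers-needed 3000 200 p99-by-rps) ;=> 3
#+end_src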
Start performance work early
Early enough to make a difference
Develop automated benchmarks that run frequently
Every weekend
Every night
On every checkin
In the absence of anything else, useful as a relative measure
Holy smokes! It’s three times slower today!
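A small sketch of that relative check, assuming each run produces a summary map; the metric names, the numbers, and the 1.5x threshold are invented.
#+begin_src clojure
;; Sketch: use nightly benchmarks as a relative measure.
;; Summary maps are invented; in practice they come from last night's and
;; tonight's runs (e.g. median latency in ms at a fixed offered load).
(def baseline {:median-ms 21 :p99-ms 180})
(def tonight  {:median-ms 65 :p99-ms 600})

(defn regression?
  "True when tonight's value for metric is more than factor times the baseline."
  [metric factor]
  (> (/ (metric tonight) (metric baseline)) factor))

(when (regression? :median-ms 1.5)
  (println "Holy smokes! Median latency is"
           (format "%.1fx" (double (/ (:median-ms tonight) (:median-ms baseline))))
           "the baseline."))
#+end_src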
Performance has high potential for requiring redesign
“Any problem in computer science can be solved with another layer of
indirection.” –David Wheeler
“Except for the problem of too many layers of indirection.” –Kevlin Henney
Working with Stakeholders
Get them educated as early as possible
Foster a stochastic mindset
Find out what questions need to be answered
But don’t let early apathy or indecision stop you
Include them on the distribution for automated load testing
Consider making them the gatekeeper for optimization decisions
The Optimization Loop
The Optimization Loop
Benchmark
Analyze
Recommend
Optimize
Benchmarking
Measure the parameters of your model
Transaction mix is critical!
Don’t mistake # of threads for load
Understand what you are measuring!
E.g. are you measuring the response time at the client or the server?
Which one do you want to measure?
Don’t forget to look for errors!
Errors essentially change the transaction mix, and the happy path and
the exception path can have wildly different perf characteristics.
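A sketch of why both points matter inside the load generator itself: pick transactions by weight and tally errors separately, since the failure path can be far faster (or slower) than the happy path. The weights are invented, and issue-request! is a hypothetical stand-in for whatever your load tool or HTTP client actually does.
#+begin_src clojure
;; Sketch: weighted transaction mix with per-transaction error counting.
(def transaction-mix
  {:view-home   70   ; invented weights: 70% / 20% / 10% of traffic
   :search      20
   :place-order 10})

(defn pick-transaction
  "Pick a transaction type at random, proportional to its weight."
  [mix]
  (let [total (reduce + (vals mix))
        roll  (rand-int total)]
    (loop [[[tx w] & more] (seq mix), acc 0]
      (if (< roll (+ acc w)) tx (recur more (+ acc w))))))

(defn issue-request!
  "Hypothetical request fn; replace with your real client.
   Returns {:latency-ms n, :error? bool}."
  [tx]
  {:latency-ms (rand-int 100), :error? (zero? (rand-int 50))})

(defn run-step
  "Issue one request and tally count, errors, and latencies per transaction."
  [stats]
  (let [tx (pick-transaction transaction-mix)
        {:keys [latency-ms error?]} (issue-request! tx)]
    (-> stats
        (update-in [tx :count]     (fnil inc 0))
        (update-in [tx :errors]    (fnil + 0) (if error? 1 0))
        (update-in [tx :latencies] (fnil conj []) latency-ms))))

;; Drive 1000 requests and inspect the per-transaction tallies.
(reduce (fn [stats _] (run-step stats)) {} (range 1000))
#+end_src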
Lots of tools
Load generators
jmeter
ab
httperf
Analysis
Excel
Clojure!
Analysis
Figure out what is determining the perf of your system
Generally done via profiling
And lots of hard thinking
Does not necessarily need to be analysis of benchmark run
Lots of tools
E.g. YourKit
Recommendation
Generally, looking for the one slowest thing
Especially in the early rounds
This will get way harder the longer you go
Use the data!
I.e. don’t assume
The best recommendation is “We’re done. Let’s stop.”
Optimization
Fix the one slowest thing
Don’t assume you know what this is!
Point out that I’ve seen experienced developers go straight for the
algorithmic optimizations, when network calls were four orders of
magnitude slower. Also maybe note the logging thing we ran into.
You may find redesign is necessary
This is why you want to start early
One recent experience
One recent experience
A web service written in pure Clojure
Initial benchmarks at ~75 peak rps, ~25 ms avg latency
Did several rounds of optimization over about a month
Achieved 1500 rps, 10ms latency @ 99% confidence
Think we can scale close to linearly with load balancing
Customer thrilled
Weird stuff that happened
Never tested load-balanced
Would the network become the limiting factor? Can the load balancer
handle the volume?
Uncovered an issue with huge volume of log data
Generate something like 2GB of data per hour. Customer wanted to keep
it all. Had to figure out what data was actually important to keep and
figure out how to get it off the machine, since syslog completely
choked when doing log forwarding.
Third-round analysis showed our logging tool was the bottleneck
Stuart and Aaron wrote a C/Java wrapper that had a huge impact on perf.
A one character change in a config file had big perf impact
Other Considerations
Stress testing
We didn’t talk about this, but it’s important. You might be able to
leverage your benchmarks by running them at or beyond your peak load,
which you now actually know.
Fin
Thanks!
Try the fish!
Questions?
Feedback
Tim 2011-10-21
Summary: the big takeaways are that a stochastic/model-driven approach
is key, that transaction mix is critical, and that the
benchmark/analyze/recommend/optimize loop pays off. We should also put
more emphasis on doing it earlier rather than later.
Move the description of the system lower, so that it’s sort of a payoff. “Hey we did this, and it resulted in the following awesome thing.”
Urge performance testing early on
The “requires redesign” comment is sort of throwaway - needs to be emphasized
Expand on goal of finding the questions that need to be answered
Important points
It’s a stochastic world
Transaction mix is critical
The cycle
This motivates the approach
Make the point of the difference between Analysis and Benchmarking
Stress that the earlier you do this, the better
Better to run your first benchmark and find out you’re done
Also good if you run this all the time, so you can tell if you need to enter the loop again
Still a safety net
Fogus 2011-10-21
I’d like to see a discussion of accidentally measuring the wrong thing. I hope my meaning is clear.
I believe it is. Not sure whether I can fit it into a 40-minute talk (especially being a motormouth), but I'll keep it in mind during the next round of edits.
Tim 2011-10-26
Summary wording is weird.
Slides @ 33 & 44 - odd juxtaposition
“Day one” is a bit strong - “early enough to make a difference”
Doesn’t think we should have the “Develop a model” slide.
Instead, talk about the fact that other models exist and that you
might need to develop one.
Consider moving working with stakeholders above the loop slide
At the very least, needs to be above the section on “our experience”, since I’m going to wind up telling that part of the story before that slide comes up
Don’t forget the story about how we replaced the logger and got a huge perf improvement
And about how we generate huge amounts of log data
And about how we went to async with a one-character change in the config file
"Day one" is a bit strong - "early enough to make a difference"
Doesn't think we should have the "Develop a model" slide. Instead, talk about the fact that other models exist and that you might need to develop one.
Consider moving working with stakeholders above the loop slide
** At the very least, needs to be above the section on "our experience", since I'm going to wind up telling that part of the story before that slide comes up
Don't forget the story about how we replaced the logger and got a huge perf improvment
** And about how we generate huge amounts of log data
** And about how we went to async with a one-character change in the config file
** It's a stochastic world
** Transaction mix is critical
** The cycle
** This motivates the approach
** Better to run your first benchmark and find out you're done
** Still a safety net