Looking at software from a different angle, as we do when profiling and analyzing data, has numerous benefits: it requires and deepens our understanding of the problems the software solves, it exposes us to unexpected discoveries and insights, and it puts us in a position to suggest improvements.
Because there is a large element of skill involved, and it is rarely taught in Computer Science degrees, practice is essential to acquire it.
Reading sequentially and processing each element from:
- Case 1: an array `Foo arr[N];`
- Case 2: an array of pointers `Foo *arr[N];` (pointing to "random" locations)
- Case 3: a linked list of `Foo`s
- Order the cases
- Explain the ordering
- Measure
- https://www.icloud.com/keynote/05iUiTiZFlOBN7qZny3Vr8bfQ#Engine_programmers_should_know Engine programmers should know… (insights on practice, problem solving process and pitfalls)
- http://www.gdcvault.com/play/1021866/Code-Clinic-2015-How-to How to Write Code the Compiler Can Actually Optimize by Mike Acton (shows the process of analyzing bottlenecks, i.e. finding the resource most in demand, and verifying hypotheses; the value of looking at distributions rather than a single value; the value of practice, with an example of that practice)
- https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/
- https://www.facebook.com/perftidbits/photos/pcb.504731853211324/504731296544713/?type=3&theater Performance by Design by Rico Mariani
- https://docs.google.com/presentation/d/1ST3mZgxmxqlpCFkdDhtgw116MQdCr2Fax2yjd8Az6zM/edit#slide=id.g238f34177c_0_82 OOP Pitfalls talk 2017 by Tony Albrecht
- https://shipilev.net/blog/2014/java-scala-divided-we-fail/ Shows that our formal and informal training as software engineers often leaves holes in certain areas, such as how to properly design and interpret an experiment; shows how "known inputs" can skew a benchmark in a certain direction and mislead.
- http://gpuopen.com/video-memory-profiling-wpa/
- https://bling.kapsi.fi/blog/x86-memory-access-visualization.html
- https://twitter.com/Reedbeta/status/828402842184486912 nice example of unicode usage visualisation
- http://stackoverflow.com/questions/29551516/glclear-takes-too-long-android-opengl-es-2/29553399#29553399 asynchronous APIs can be hard to profile
- https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/ Fabian Giesen records his profiling+optimization work of an Intel library for occlusion culling
- https://macton.smugmug.com/Other/2008-07-30-by-Eye-Fi/n-qN9F8/i-Z2vc8sC On the importance of focusing on the data + final transform + actual constraints
- https://hero.handmade.network/episode/code/day112 A Mental Model of CPU Performance by Casey Muratori. (how to think about optimization/performance, CPU instruction level, exposure to vocabulary around lower-level optimizations, cycles / sec)
- https://gist.github.com/ocornut/cb980ea183e848685a36 Memory, Cache, CPU optimization resources (compiled by Omar Cornut + others)
- http://www.codercorner.com/blog/?page_id=1632 The Box Pruning Series
- https://youtu.be/gNsF8Q1Nh2Q?list=PLHxtyCq_WDLXFAEA-lYoRNQIezL_vaSX-&t=1538 Stepanov, Programming Conversations
- http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html CPU Utilization Is Wrong
- http://aras-p.info/blog/2017/08/08/Unreasonable-Effectiveness-of-Profilers/ An example of studying build times with a tracing profiler
- https://www.snellman.net/blog/archive/2017-08-19-slow-ps4-downloads/ Gives a good picture of what a performance experiment looks like, from establishing a mental model and hypothesis onward.
- https://www.youtube.com/watch?v=lJ8ydIuPFeU How NOT to measure Latency by Gil Tene
- https://www.facebook.com/notes/rico-marianis-performance-tidbits/a-brief-word-about-some-common-metrics/489778214706688/ Mach-O metrics (macOS, iOS) and their usefulness
- https://www.facebook.com/perftidbits/photos/ms.c.eJw9zdsNwEAIA8GOIl4Hpv~;GIp3jfI5goWBlmDhd2TlPfcZO75jckW5pv~;POe~_VD~;~;3wXshLq4dz32X~_D~;VgH~_pRd~;9Abrpl0CUvnZ~;X7z3HC~;fpMSM~-.bps.a.480408715643638.1073741829.414843438866833/480408725643637/?type=3&theater Performance Troubleshooting Techniques: keep the big picture in view; look for correlated resource consumption (e.g. a memory-usage change alongside a CPU-usage increase) to drive analysis around a period of time
- http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html Memo about IRIX 5.1’s issues and how the team scrambled to address them after the fact
- http://ubiquity.acm.org/article.cfm?id=1513451 The Fallacy Of Premature Optimization
- https://www.youtube.com/watch?v=DxP--1yEgKQ Toolbox and attitude towards performance, by Peter Steinbach
- Paul Khuong https://twitter.com/pkhuong http://pvk.ca
- Mike Acton https://twitter.com/mike_acton http://www.macton.ninja/
- Fabian Giesen https://fgiesen.wordpress.com
- Brendan Gregg http://www.brendangregg.com
- Rico Mariani https://www.facebook.com/notes/rico-marianis-performance-tidbits/
From http://users.ece.utexas.edu/~adnan/pike.html:
- Rule 1. You can’t tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don’t try to second guess and put in a speed hack until you’ve proven that’s where the bottleneck is.
- Rule 2. Measure. Don’t tune for speed until you’ve measured, and even then don’t unless one part of the code overwhelms the rest.
- Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don’t get fancy. (Even if n does get big, use Rule 2 first.)
- Rule 4. Fancy algorithms are buggier than simple ones, and they’re much harder to implement. Use simple algorithms as well as simple data structures.
- Rule 5. Data dominates. If you’ve chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
Software creation is often presented as building a model that is then run "in reality."
The activity of profiling and analyzing data + code is about reversing that arrow:
From http://realworldrisk.com/:
When and if we model, we go from reality to models not from models to reality