systems_performance.md

Systems Performance 2nd edition

Do a quick performance check in 60 seconds
Use a number of different tools available in Unix
Use flame graphs of the call stack if you have access to them
The best performance wins come from eliminating unnecessary work, for example a thread stuck in a loop, or from eliminating bad config
Mantras: Don't do it (eliminate); do it, but don't do it again (caching); do it less (e.g. less frequent polling); do it when they're not looking; do it concurrently; do it more cheaply
Latency is an essential performance metric: the time for an operation to complete. Examples of operations:

Application request
Database query
File system operation
We can reduce latency by decreasing disk reads, for example through caching

Actionable Chain of Events

Counter --> Statistics --> Metrics --> Alerts
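
As a rough illustration of this chain, the sketch below turns raw counter readings into per-second rates (statistics), summarizes them into one value (a metric), and compares that value to a threshold (an alert). The sample values, the one-second interval, and the threshold are all made up for the example.

```python
from statistics import mean

def rates_from_counter(samples, interval_s):
    """Convert cumulative counter readings into per-second rates."""
    return [(b - a) / interval_s for a, b in zip(samples, samples[1:])]

def check_alert(rates, threshold):
    """Metric: mean rate over the window. Alert when it exceeds the threshold."""
    metric = mean(rates)
    return metric, metric > threshold

# Hypothetical cumulative error counter, sampled once per second.
counter_samples = [100, 102, 105, 115, 140]
rates = rates_from_counter(counter_samples, interval_s=1)
metric, alert = check_alert(rates, threshold=5.0)
print(rates, metric, alert)
```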

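Profiling tools build a picture of a target by taking samples or measurements, most commonly of CPU activity. Flame graphs visualize those CPU samples and show where CPU time is being spent (the CPU footprint).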

The x-axis shows the stack profile population, sorted alphabetically (it is not the passage of time), and the y-axis shows stack depth, counting from zero at the bottom. Each rectangle represents a stack frame. The wider a frame is, the more often it was present in the stacks. The top edge shows what is on-CPU, and beneath it is its ancestry. Original flame graphs use random colors to help visually differentiate adjacent frames. Variations include inverting the y-axis (an "icicle graph"), changing the hue to indicate code type, and using a color spectrum to convey an additional dimension.
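
To make the "width = how often a frame was present" idea concrete, here is a minimal sketch that counts sampled stacks and computes the width of each frame at a given depth. The stack samples and function names are hypothetical; real tools produce this kind of semicolon-joined "folded" stack from profiler output before rendering.

```python
from collections import Counter

def fold_stacks(samples):
    """Count identical stacks. Each sample is a tuple of frames, root first."""
    return Counter(";".join(stack) for stack in samples)

def frame_widths(folded, depth):
    """Sample count per frame at a depth (0 = bottom), keyed by its ancestry."""
    widths = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        if depth < len(frames):
            # Frames only merge when they share the same ancestors.
            widths[";".join(frames[: depth + 1])] += count
    return widths

# Hypothetical stack samples taken by a CPU profiler.
samples = [
    ("main", "parse", "read"),
    ("main", "parse", "read"),
    ("main", "compute"),
    ("main", "parse"),
]
folded = fold_stacks(samples)
for stack in sorted(folded):  # alphabetical order, like the x-axis
    print(stack, folded[stack])
print(dict(frame_widths(folded, 1)))
```

Here `parse` is present in three of the four samples, so it would be drawn three times as wide as `compute`.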

Tracing - Event-based recording where data is saved for later analysis.

Linux 60-second checklist

Also here: if you only have a bit of time to profile your system.

In 60 seconds you can get a high-level idea of system resource usage and running processes by running the following ten commands. Look for errors and saturation metrics, as they are both easy to interpret, and then resource utilization. Saturation is where a resource has more load than it can handle, and can be exposed either as the length of a request queue, or time spent waiting.

Don't use only top just because you don't know the other tools; that creates a streetlight effect.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

High-level terminology

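IOPS - input/output operations per second, a measure of the rate of data transfer operations
Latency - a measure of the time an operation spends waiting to be serviced
Saturation - the degree to which a resource has queued work that it cannot service
Hit ratio - the number of times the needed data is found in the cache (hits) divided by total accesses (hits + misses)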
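
The hit ratio definition is simple arithmetic, sketched here with made-up counts. The function name and the choice to return 0.0 for an unused cache are assumptions for the example.

```python
def hit_ratio(hits, misses):
    """Hit ratio = hits / (hits + misses)."""
    total = hits + misses
    if total == 0:
        return 0.0  # assumption: treat a cache with no accesses as 0
    return hits / total

# Hypothetical counts: 990 hits and 10 misses.
print(hit_ratio(990, 10))
```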

Performance tradeoffs:

Good -- Fast -- Cheap (pick two); for IT projects: High-performance -- On-time -- Inexpensive

File system record size: small record sizes perform better for random I/O; larger record sizes will improve streaming workloads

Types of caches

Performance tuning is most effective when done closest to the work performed
**MRU** - most recently used
LRU - least recently used
MFU - most frequently used
LFU - least frequently used
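
As one example of these policies, here is a minimal LRU sketch: when the cache is full, the entry that was used least recently is evicted. The class name, capacity, and keys are made up for illustration, and this is not how any particular kernel or database cache is implemented.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: evicts the least recently used key when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest use first, newest use last

    def get(self, key):
        if key not in self.items:
            return None  # cache miss
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used
cache.put("c", 3)  # cache is full, so "b" is evicted
print(list(cache.items))
```

An LFU variant would track a use count per key and evict the key with the lowest count instead.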

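Cold cache - empty, or populated with unwanted data. Hit ratio is zero, or near zero as it begins to warm up.
Warm cache - populated with useful data but doesn't have a high enough hit ratio to be considered hot.
Hot cache - populated with commonly requested data and has a high hit ratio.

Cold --> Warm --> Hot
Hit ratio improving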

Cache tuning: Aim to cache as high in the stack as possible, closer to where the work is performed, which reduces the operational overhead of cache hits.

p. 61: performance Mantras

State the goals of the study and define system boundaries
List system services and possible outcomes
Select performance metrics
List system and workload parameters
Select factors and their values
Select the workload
Design the experiments
Analyze and interpret the data
Present the results
If necessary, start over

Disk Utilization (p. 65)

Disk utilization can become a problem even before it hits 100%. To find the bottleneck:

Measure the rate of server requests, and monitor this rate over time
Measure hardware and software resource usage
Express server requests in terms of resources used
Extrapolate server requests to the known limits for each resource
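
The steps above can be sketched as a small calculation. This assumes that resource usage scales linearly with the request rate, which is a simplification, and every number below (request rate, utilization percentages) is hypothetical.

```python
def max_request_rate(current_rate, utilization_pct, limit_pct=100.0):
    """Linearly extrapolate the request rate at which a resource hits its limit."""
    if utilization_pct <= 0:
        raise ValueError("utilization must be positive")
    return current_rate * limit_pct / utilization_pct

# Hypothetical measurements taken while serving 1000 requests/s.
current_rate = 1000
usage_pct = {"cpu": 40.0, "disk": 25.0}

limits = {name: max_request_rate(current_rate, pct) for name, pct in usage_pct.items()}
bottleneck = min(limits, key=limits.get)  # the resource that runs out first
print(limits, bottleneck)
```

With these numbers the CPU runs out first, at roughly 2500 requests/s, so it is the likely bottleneck. A lower `limit_pct` can be passed to model a resource that becomes a problem before 100%.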

Constraints:

**Hardware:**

CPU Utilization
Memory Usage
Disk IOPS
Disk Throughput
Disk Capacity

**Software:**

Virtual memory usage
Processes/tasks
File descriptors

Sharding - a common strategy for databases, where data is split into logical components, each managed by its own database
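
One simple way to split data is to hash a key and map it to a shard number, sketched below. Hash-based routing is an assumption chosen for the example (range-based or directory-based splits are also common), and the function name and key format are hypothetical.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# The same key always routes to the same shard (database).
for user in ["user:1", "user:2", "user:3"]:
    print(user, "-> shard", shard_for(user, num_shards=4))
```

A stable hash is used rather than Python's built-in `hash()`, because the built-in is randomized per process for strings and would route the same key differently after a restart.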

p. 106 - CPU-bound versus I/O-bound:

CPU-bound: performing heavy compute, such as scientific and mathematical analysis
I/O-bound: performing I/O, such as web servers and file servers, where low latency is important
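
A rough way to tell the two apart is to compare CPU time with wall-clock time for the same piece of work: a CPU-bound task spends most of its elapsed time on-CPU, while an I/O-bound task spends most of it waiting. The sketch below uses `time.sleep` as a stand-in for waiting on I/O, and the function names are made up for the example.

```python
import time

def cpu_fraction(func):
    """Return CPU time divided by wall-clock time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0

def busy():
    """CPU-bound stand-in: pure computation."""
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

def waiting():
    """I/O-bound stand-in: sleeping simulates waiting on a disk or network."""
    time.sleep(0.2)

print("busy:", round(cpu_fraction(busy), 2))
print("waiting:", round(cpu_fraction(waiting), 2))
```

A fraction close to 1 suggests CPU-bound work; a fraction close to 0 suggests the time is going to waiting.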
