zlib-ng CLAUDE.md

Project Basics

Use the CMake build system.
Always check which commits HEAD, BASE, and other branch names point to, as they can change often.
To build for an architecture other than the host architecture, use llvm-clang unless gcc is specified.

arch/ - Architecture specific optimizations
test/ - Unit tests written using Google Test Framework (gtest_zlib project)
test/benchmarks - Performance benchmark testing using Google Benchmark Framework (benchmark_zlib project)

To enable benchmark_zlib use -D BUILD_TESTING=ON -D WITH_BENCHMARKS=ON.
Always configure CMake with -D BUILD_SHARED_LIBS=OFF to reduce linking time.
Isolate benchmark runs by configuring and compiling them in separate build directories.
benchmark_zlib can be found in the build directory's test/benchmarks directory.
Run benchmark_zlib --benchmark_list_tests=true to list all benchmarks.
When running benchmark_zlib with --benchmark_repetitions, also use --benchmark_report_aggregates_only=true.
Run benchmark processes sequentially; running them concurrently can cause contention and unreliable results.

Use git worktree to check out the contender branch, then configure and build the baseline and the contender in separate directories.
Run benchmarks with --benchmark_out=<file>.json --benchmark_out_format=json to produce input files for comparison.

Clone https://github.com/google/benchmark into the .benchmark directory.
Create a .venv virtual environment and install the requirements with pip3 install -r requirements.txt.
Compare the results with tools/compare.py benchmarks <benchmark_baseline> <benchmark_contender> [benchmark options].

Create a new GitHub gist with the summary of the results.
Start the title of the gist and the filename of the gist with the name of the project.
Always include the machine specs in the summary.

Object files can be found in the build directory's CMakeFiles/zlib-ng.dir subdirectory.
Use objdump or dumpbin to disassemble, or /arriba:extract-asm to extract a specific function from .o, .obj, or .s files.
Always compare assembly before/after to verify an optimization has the intended effect.

When reviewing extracted assembly, check:

Instruction count — total instructions, compare before/after.
Memory operations — count loads vs stores; flag unnecessary spills to stack.
Register pressure — identify stack spills ([sp, #...] on AArch64, (%rsp) / (%rbp) on x86) that indicate the compiler ran out of registers.
Branch density — count conditional branches in hot loops; fewer branches = better pipelining.
SIMD utilization — check for vector instructions (stp q, movi v on AArch64; vmov, vpadd, vpshuf on x86) vs scalar fallbacks.
Call overhead — external calls (bl, call) in hot paths force register saves; prefer inlined operations.
Loop structure — identify the back-edge branch and count instructions per iteration.
Constant materialization — mov immediates or adrp/ldr from constant pools; repeated materialization of the same constant suggests missed CSE.

Source-level techniques — always verify results in assembly for the architecture of interest or at least x86-64 and AArch64:

Prefer branchless computation using bit masking when the zero case is a no-op.
Look for ways to optimize using bit tricks.
Reduce unnecessary casts by looking at where the data is coming from and how it is being used.
Keep hot variables in registers across inline function boundaries using locals and pass-by-pointer.
Minimize live variables in hot loops to reduce register pressure and avoid stack spills.
Audit multi-way branches for unreachable paths.
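
As a minimal sketch of the first technique above (branchless computation via bit masking), the example below uses a bitwise CRC-32 step. It is illustrative only and not taken from zlib-ng; all function names are hypothetical. The mask is all ones when the low bit is set and zero otherwise, so the "bit clear" case becomes an XOR with zero, which is a no-op and needs no branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY 0xEDB88320u /* reflected CRC-32 polynomial */

/* Branchy form: the compiler may emit one conditional branch per bit. */
static uint32_t crc32_bit_branchy(uint32_t crc) {
    if (crc & 1u)
        return (crc >> 1) ^ CRC32_POLY;
    return crc >> 1;
}

/* Masked form: mask is 0xFFFFFFFF when the low bit is set, otherwise 0.
 * XOR with 0 leaves the value unchanged, so no branch is required. */
static uint32_t crc32_bit_masked(uint32_t crc) {
    uint32_t mask = 0u - (crc & 1u);
    return (crc >> 1) ^ (CRC32_POLY & mask);
}

/* Hypothetical helper type so both variants can share one loop. */
typedef uint32_t (*crc32_bit_fn)(uint32_t);

static uint32_t crc32_bitwise(const uint8_t *buf, size_t len, crc32_bit_fn step) {
    uint32_t crc = 0xFFFFFFFFu;            /* standard initial value */
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)  /* process one bit at a time */
            crc = step(crc);
    }
    return crc ^ 0xFFFFFFFFu;              /* standard final XOR */
}

uint32_t crc32_branchy(const uint8_t *buf, size_t len) {
    return crc32_bitwise(buf, len, crc32_bit_branchy);
}

uint32_t crc32_masked(const uint8_t *buf, size_t len) {
    return crc32_bitwise(buf, len, crc32_bit_masked);
}

/* Self-check: both variants must agree, and must match the standard
 * CRC-32 check value for the ASCII string "123456789". */
void crc32_masked_selfcheck(void) {
    static const uint8_t msg[] = "123456789";
    assert(crc32_masked(msg, 9) == 0xCBF43926u);
    assert(crc32_masked(msg, 9) == crc32_branchy(msg, 9));
}
```

A compiler may already turn the branchy form into a conditional select (cmov on x86-64, csel on AArch64), in which case the two variants can produce near-identical code. This is why the result should still be checked in the disassembly for each target rather than assumed from the source.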