Skip to content

Instantly share code, notes, and snippets.

@ayourtch
Last active May 11, 2026 08:34
Show Gist options
  • Select an option

  • Save ayourtch/23fab9e2cda25ffa33d11ed0017fe499 to your computer and use it in GitHub Desktop.

Select an option

Save ayourtch/23fab9e2cda25ffa33d11ed0017fe499 to your computer and use it in GitHub Desktop.
GPU profiling guide (written by Claude)

GPU Kernel Profiling with ncu — A Practical Guide

Written: 2026-05-11 Audience: developer familiar with CUDA but new to hardware-counter profiling Based on: a real profiling session of a large language model inference engine on an NVIDIA GB10 (Grace Blackwell, sm_121, CUDA 13.0)


Background: what we were optimizing and why

This guide uses a concrete example throughout. Understanding the example context helps make the metrics feel real rather than abstract.

The program: a custom CUDA inference engine for DeepSeek-V4-Flash, a large language model (LLM). The engine takes a text prompt and generates a response token by token. "Prefill" is the phase where all the prompt tokens are processed at once before generation starts — it is compute-intensive and dominates latency for long prompts.

The architecture — Mixture of Experts (MoE): DeepSeek uses a MoE transformer. Instead of one large feed-forward network per layer, there are many small "expert" networks, and each token is routed to a few of them. In practice this means the two most expensive operations per layer are:

  • moe_mid: the "up-projection" expert GEMM — projects activations to a wider space
  • moe_down: the "down-projection" expert GEMM — projects back to the model dimension

These two kernels are the dominant cost of prefill.

Quantization — IQ2_XXS and Q2_K: The model weights are stored in compressed quantized formats to fit in GPU memory and reduce memory bandwidth:

  • IQ2_XXS ("importance-quantized 2-bit extra-extra-small"): weights stored as approximately 2 bits per value using a learned lookup table (LUT). To use a weight, the kernel looks up its float value in a table — these LUT lookups are a key source of memory traffic.
  • Q2_K ("2-bit k-quant"): a different 2-bit quantization scheme using block-level scales and min values. Slightly different access pattern from IQ2_XXS.

Both formats require dequantization before the multiply-accumulate can happen: load the compressed weight → look up or decode its float value → multiply by activation.

The performance baseline: after enabling chunked layer-major prefill, the engine ran at ~111 t/s on the GB10. An nsys timeline profiler (see Section 1) showed the kernel time breakdown:

  • moe_mid_iq2_xxs_kernel — 65.8% of prefill time
  • moe_down_q2_k_kernel — 23.2% of prefill time
  • Everything else — 11% combined

With 89% of runtime in two kernels, the question became: why are these kernels slow, and what is the right lever to speed them up? That is what ncu is for.


1. The two NVIDIA profiling tools

NVIDIA provides two complementary profilers that serve different purposes.

nsys (Nsight Systems) — timeline profiler

nsys answers "what is happening and when?"

It records a timeline of every kernel launch, memory copy, API call, and CUDA event. You can see which kernels run in what order, whether they overlap, and how long each takes as a fraction of total runtime. nsys also does a fast roofline estimate by sampling DRAM bandwidth and SM utilization at coarse granularity — enough to flag a kernel as "memory-bound" or "compute-bound" without deep-diving into it.

nsys is fast — roughly 5-10% overhead. You can profile a complete multi-second inference run.

nsys profile --stats=true ./my-inference-engine --model weights.gguf --prompt "hello" -n 1

After the run, nsys prints a kernel summary sorted by total GPU time. This is the first step in any optimization workflow: find the kernel that owns the most time. In our case nsys told us moe_mid = 65.8% and moe_down = 23.2% of prefill — done in under a minute, with zero code changes.

ncu (Nsight Compute) — hardware counter profiler

ncu answers "WHY is a specific kernel fast or slow?"

It reads the GPU's on-chip performance monitoring units (PMUs) — hardware counters inside each SM, L1 cache, L2 cache, and memory controller. These counters accumulate counts of events like: warp instructions issued, warp instructions stalled, L1 cache bytes read, tensor-core instructions executed. Because they are hardware registers inside the chip, they report exact counts — not samples.

ncu is slow — it needs to replay the target kernel many times (14 times in our session). Use nsys to rank kernels by impact first, then ncu to root-cause the winner.

What ncu revealed for our two kernels:

  • moe_mid_iq2_xxs_kernel: L1 texture pipeline fully saturated — the LUT dequantization is flooding the L1 cache with lookups, causing 41% of warp cycles to stall waiting for the L1/MIO pipe to accept the next request.
  • moe_down_q2_k_kernel: only 50% SM occupancy due to high register pressure (80 registers per thread → only 6 thread blocks fit on each SM). With so few active warps, the SM cannot hide L1 cache latency, causing 45% of warp cycles to stall on L1 hits.

These are two completely different root causes requiring two different fixes.


2. The access control problem

By default, NVIDIA restricts hardware-counter access to the root user. This is controlled by a parameter in the nvidia kernel module:

NVreg_RestrictProfilingToAdminUsers=1   (default: ON)

A non-root user attempting to run ncu gets:

==ERROR== ERR_NVGPUCTRPERM: Permission issue with Performance Monitoring
          Unable to profile all kernels as the CUDA application running under
          user [...] does not have sufficient privileges.

Important: lowering /proc/sys/kernel/perf_event_paranoid to -1 does not fix this. That sysctl governs Linux's own perf subsystem. NVIDIA's restriction is enforced entirely inside the GPU driver, independently.

Two ways to unblock ncu:

Option A — immediate, no reboot required: run ncu as root

sudo ncu [flags] -- ./my_program [args]

Option B — permanent, survives reboots: disable the restriction via modprobe config

# Create the config file (requires root)
echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' \
    | sudo tee /etc/modprobe.d/nvidia-prof.conf

# Also set perf_event_paranoid for full perf tool access
echo 'kernel.perf_event_paranoid = -1' \
    | sudo tee /etc/sysctl.d/99-perf.conf

# Reboot for the module parameter to take effect
sudo reboot

After the reboot, any user can run ncu without sudo.


3. The memory backup problem (and --replay-mode application)

When ncu collects hardware counters it must replay the kernel multiple times — once per group of metrics that can be collected simultaneously on the hardware. For each replay, the GPU must be in exactly the same memory state as when the kernel originally ran.

The default replay mode (--replay-mode kernel) achieves this by:

  1. Before the kernel runs: save the entire GPU memory state to host RAM or disk
  2. Run the kernel, collect one pass of metrics
  3. Restore the saved GPU memory state
  4. Run the kernel again, collect another pass
  5. Repeat for every pass needed

This works well for small programs. For a large language model inference engine with 82 GB of weights loaded into GPU memory, it is a disaster: each save/restore cycle moves 82 GB, and with 14 passes that is 14 × 82 GB = 1.1 TB of data movement. The run stalls at "0%" indefinitely (or runs out of disk space). ncu even warns you:

==WARNING== Backing up device or system memory to file. Kernel replay might be very slow.
            Consider using "--replay-mode application" to avoid memory save-and-restore.

The fix: --replay-mode application

Instead of saving/restoring GPU state, ncu re-launches the entire application once per metric pass. The application loads the model from scratch each time, runs the inference, and ncu intercepts the target kernel launches to collect that pass's counters.

ncu --replay-mode application [other flags] -- ./my_program [args]

The trade-off:

Kernel replay Application replay
App launches 1 N passes
Memory save/restore Yes, per pass No
Good when Model is small, app startup is slow Model is large, app startup is fast

For our 82 GB model with 14 passes and ~30s startup:

  • Kernel replay: 14 × 82 GB restore ≈ infeasible
  • Application replay: 14 × (30s startup + 125s inference) ≈ 35 minutes

For small CUDA programs (say, a benchmark with a few MB of data), kernel replay is fine and faster. For large model inference, always use application replay.


4. The metric collection process

How passes work

A GPU SM contains a fixed number of hardware performance counter registers, and they are organized into groups ("counter sets") where only one set can be read at a time. A single counter set might let you measure "SM throughput + L1 hit rate" simultaneously, but measuring "tensor-core instructions + DRAM bytes" requires a different counter set.

If you request 22 metrics that span 14 different counter sets, ncu needs 14 passes. In our profiling script we collected 22 metrics (6 groups: throughput%, compute pipeline, occupancy, stall reasons, DRAM byte counts, launch config), which ncu scheduled into 14 passes.

With -s 8 -c 5 (skip 8 warmup launches, profile 5), each pass runs 5 kernel instances, so: 14 passes × 5 instances = 70 kernel executions per kernel type profiled.

Specifying metrics

ncu --metrics "sm__throughput.avg.pct_of_peak_sustained_elapsed,\
               smsp__inst_executed_pipe_tensor.avg,\
               smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct" \
    -- ./program

Metric names follow the pattern <unit>__<counter>.<aggregation>.<normalization>:

  • sm__throughput.avg.pct_of_peak_sustained_elapsed → average SM throughput as % of theoretical peak, over elapsed kernel time
  • smsp__inst_executed_pipe_tensor.avg → average tensor-core warp instructions per SM sub-partition (SMSP)
  • smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct → % of active cycles a warp spends stalled waiting for the MIO memory pipeline

Architecture-specific metric names: metric names are not fully portable across GPU generations. On NVIDIA Hopper and Blackwell (sm_90 and sm_121), the DRAM memory controller counters use fb__ (framebuffer) names rather than dram__:

What you want Ampere/Ada (sm_8x/sm_89) Hopper/Blackwell (sm_90/sm_121)
DRAM read bytes dram__bytes_read.sum fb__bytes_read.sum
DRAM write bytes dram__bytes_write.sum fb__bytes_write.sum
DRAM throughput % dram__throughput.avg.pct_... fb__throughput.avg.pct_...

In our session, dram__bytes_read.sum returned no data on the GB10 (sm_121). Using fb__bytes_read.sum in future scripts would give the correct DRAM byte counts needed for arithmetic intensity calculation.

Kernel filtering and warmup skip

ncu -k "regex:moe_mid_iq2_xxs_kernel"  # only profile matching kernels
    -s 8                                # skip first 8 matching launches
    -c 5                                # then profile the next 5

The -s (skip) flag is important for inference workloads. Early kernel launches in a neural network often correspond to the first few layers, where GPU caches are cold and are not representative of the steady-state performance the real workload sees. By skipping 8 launches (roughly the first few transformer layers), you profile kernels that see a warm and realistic cache state.

Output format flags

ncu --page raw             # show all collected metrics; default shows curated sections only
    --print-summary per-kernel  # statistics across the N profiled instances
    --print-fp             # float values, no rounding to integers
    -o output_dir/kernel   # save binary .ncu-rep for ncu-ui GUI viewer
    -- ./program \
    2>&1 | tee output_dir/kernel.txt   # also capture to text file for grep/awk

Without --page raw you only get a curated "section" summary that omits most raw counters. For root-cause analysis you want all the numbers.


5. Understanding the key metrics

SM throughput %

sm__throughput.avg.pct_of_peak_sustained_elapsed

The fraction of time the SM's instruction dispatch pipeline was issuing instructions, averaged over the kernel's elapsed time. This is roughly "how busy was the SM".

Value Interpretation
95-100% SM fully occupied — check stall reasons to see what it's waiting on
50-70% SM often idle — likely an occupancy or launch config issue
<30% SM frequently idle — poor GPU utilization

Pitfall: high SM throughput does not mean high compute throughput. The SM dispatches load/store instructions too — a kernel can show 97% SM throughput while spending 40% of that time stalled waiting for a memory dependency to resolve. In our moe_mid case, SM throughput was 97.9% yet the kernel was clearly memory-bound (41% mio_throttle stall).

L1tex throughput %

l1tex__throughput.avg.pct_of_peak_sustained_elapsed

Utilization of the L1 texture cache pipeline. All global loads, texture fetches, and shared memory operations route through this pipeline. When L1tex throughput matches SM throughput and both are near 100%, the SM is fully occupied with memory operations.

In our moe_mid kernel both SM% and L1tex% were 97.9% — the SM's work was dominated by L1 cache accesses (the IQ2_XXS lookup-table dequantization), not by compute.

L2 (lts) throughput %

lts__throughput.avg.pct_of_peak_sustained_elapsed

Utilization of the L2 cache ("LTS" = L2 tile/slice). A high value here means many L1 cache misses are reaching L2; a low value means L1 is mostly hitting.

Our two kernels differed dramatically:

  • moe_mid: 40.8% — significant L2 traffic, consistent with LUT misses escaping L1
  • moe_down: 4.7% — almost no L2 traffic; the Q2_K weights appear to be L1-resident (or arriving via a memory path — NVLink-C2C on Grace Blackwell — that bypasses L2 as measured by lts__ counters)

Tensor-core instructions

smsp__inst_executed_pipe_tensor.avg

Tensor-core warp instructions per SM sub-partition. If this is zero, the kernel is doing all its matrix math as scalar FP32/FP16 FMA instructions, leaving the tensor cores — which are typically 8-16× faster than scalar FMA for matrix ops — completely idle.

Both our kernels showed TC = 0. For kernels that spend the majority of their time on dequant + matrix-vector multiply, this is the highest-ceiling optimization lever: restructure the kernel to use FP16/BF16 tensor-core GEMMs.

FMA instructions

smsp__inst_executed_pipe_fma.avg

Scalar FP32 FMA warp instructions per SM sub-partition. Together with TC=0, a high FMA count tells you the kernel is doing a lot of multiply-accumulate work entirely through the slower scalar FP32 pipe, not tensor cores.

Occupancy %

sm__warps_active.avg.pct_of_peak_sustained_active

Fraction of the SM's maximum concurrent warp slots that were actually filled. On our GB10 (48 SMs × 48 max warps/SM = 2304 total slots), 81% means ~1866 warps active; 50% means ~1152.

Occupancy matters because GPUs hide memory latency by switching to a different ready warp when the current warp stalls. With more active warps, the scheduler has more choices and can better fill stall gaps. At 50% occupancy you have half the latency- hiding capacity compared to 100%.

Low occupancy is often the root cause of short_scoreboard stalls: even though L1 latency is only 4-8 cycles, if there are only 24 active warps per SM instead of 48, many cycles have no ready warp to run.

Occupancy limiters

launch__occupancy_limit_registers    (blocks/SM limited by register file)
launch__occupancy_limit_shared_mem   (blocks/SM limited by shared memory)
launch__occupancy_limit_warps        (blocks/SM hardware maximum)

These three tell you which resource is the binding constraint. The actual blocks/SM is the minimum of all three. Common cases:

Register-limited (our moe_down case):

47 regs/thread × 128 threads/block = 6,016 regs/block
65,536 regs/SM ÷ 6,016 regs/block = 10.9 → floor = 10 blocks/SM  (moe_mid)

80 regs/thread × 128 threads/block = 10,240 regs/block
65,536 regs/SM ÷ 10,240 regs/block = 6.4 → floor = 6 blocks/SM   (moe_down)

80 registers per thread is high — it means the compiler is keeping many intermediate values simultaneously live. The fix is __launch_bounds__ (see Section 6).

Shared-memory-limited (not our case, but common in attention kernels): A kernel using 32 KB of shared memory per block with 48 KB total available gets only 1 block/SM regardless of register count. This was the failure mode for an earlier FA-2 K-tile attempt in this project where shmem grew to 66 KB.

Warp stall reasons

These are the most diagnostic metrics for understanding what is holding a kernel back. Each measures the percentage of "active warp cycles" (cycles where at least one warp is active) during which warps were stalled for that reason.

Stall reason ncu metric name What it means
mio_throttle smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct The L1/L2 memory I/O (MIO) pipeline is backed up. Warps are issuing memory requests faster than the pipe can accept them. L1tex throughput is typically near 100% when this is high.
short_scoreboard smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct Warp waiting for an L1 cache hit to return (~4-8 cycles). Data is in L1 but the warp has a dependent instruction that can't run until the load completes. Fix: more active warps to hide this latency.
long_scoreboard smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct Warp waiting for L2 or DRAM access (~100-1000 cycles). Needs many more warps to hide; even 100% occupancy may not fully cover DRAM latency.
math_throttle smsp__warp_issue_stalled_math_throttle_per_warp_active.pct Warp waiting for the FP32/FP64/INT compute pipeline. True compute bottleneck. Fix: tensor cores, algorithmic reduction.
not_selected smsp__warp_issue_stalled_not_selected_per_warp_active.pct Warp is ready but the warp scheduler picked a different warp. Indicates there are too many ready warps for the available issue slots — actually a sign of good utilization.
wait smsp__warp_issue_stalled_wait_per_warp_active.pct Warp at a __syncthreads() barrier, waiting for other warps in the same block.

What our two kernels showed:

moe_mid_iq2_xxs_kernel:

mio_throttle:     41.2%  ← dominant — L1 texture pipe congested by LUT lookups
short_scoreboard: 16.4%
long_scoreboard:  13.4%
not_selected:      9.9%
wait:              8.9%

The LUT dequantization for IQ2_XXS issues many texture fetches per element. Even with good occupancy (81%), the fetch rate exceeds the L1 pipe's acceptance rate.

moe_down_q2_k_kernel:

short_scoreboard: 44.9%  ← dominant — not enough warps to hide L1 hit latency
long_scoreboard:  18.2%
wait:             16.3%
mio_throttle:      5.7%  ← low; L1 pipe is not the problem here
not_selected:      5.0%

Same GPU, same operation type, completely different bottleneck — because moe_down has 80 regs/thread → only 6 blocks/SM → 24 active warps/SM → not enough to hide 4-8 cycle L1 latency. mio_throttle is low because the L1 pipe is not congested; it just isn't being fed fast enough due to the warp shortfall.

The key lesson: two kernels doing similar work (quantized matmul) can have completely different bottlenecks. Always measure each kernel separately.


6. The __launch_bounds__ hint

CUDA allows a kernel to tell the compiler its expected launch configuration:

__global__ __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM)
void my_kernel(...)

maxThreadsPerBlock: the maximum block size you will use. The compiler uses this as a budget: since blocks won't be larger than this, it only needs enough registers to support this many threads simultaneously in the register file.

minBlocksPerSM: the minimum number of blocks per SM you want to achieve. The compiler will try to reduce per-thread register usage to meet this — potentially by spilling registers to "local memory" (off-chip, slow) if necessary.

Example: moe_down currently uses 80 registers/thread → 6 blocks/SM. To target 9 blocks:

__global__ __launch_bounds__(128, 9)   // 128 threads/block, want ≥9 blocks/SM
void ds4_cuda_routed_moe_down_q2_k_kernel(...)

For 9 blocks/SM with 128 threads/block:

max regs/block = 65,536 / 9 = 7,281
max regs/thread = 7,281 / 128 = 56 → compiler targets ≤56 regs/thread

The risk — register spilling: if the kernel genuinely needs more than 56 registers to do its work without spilling, the compiler stores the excess registers in "local memory" (a per-thread private region in DRAM). Local memory accesses are slow — similar to global memory latency. If spilling is heavy, __launch_bounds__ can make things worse.

How to verify whether the hint helped:

# Check register count in PTX assembly output
nvcc -Xptxas -v my_kernel.cu 2>&1 | grep "moe_down"
# → "used N registers, X bytes lmem" — if lmem > 0, there is spilling

# Then run ncu again and check:
# 1. launch__registers_per_thread decreased
# 2. launch__occupancy_limit_registers increased
# 3. sm__warps_active increased
# 4. short_scoreboard stall % decreased
# 5. Overall kernel time decreased

Never assume __launch_bounds__ helped — measure it.


7. The roofline model (simplified)

The roofline model is a framework for understanding whether a kernel is limited by compute throughput or by memory bandwidth.

Arithmetic intensity (AI): floating-point operations per byte of memory transferred

AI = total FLOPs executed
     ─────────────────────
     total DRAM bytes read + written

Plotting AI against the hardware's compute-to-bandwidth ratio (the "ridge point") tells you which resource is the binding constraint:

  • AI > ridge point → compute-bound (do fewer FLOPs, or use faster compute like TC)
  • AI < ridge point → memory-bound (move less data, use caching, increase reuse)

For our GB10 (Grace Blackwell unified memory):

  • FP32 compute peak: roughly 600+ TFLOPS (theoretical)
  • Memory bandwidth: ~273 GB/s (LPDDR5X)
  • Ridge point: ~2200 FLOP/byte — a very high compute-to-bandwidth ratio

For a quantized matmul with IQ2_XXS weights: each 2-bit weight byte produces roughly 2-4 multiply-accumulate operations. That puts AI around 4-8 FLOP/byte, far below the ridge point of 2200. The kernels are solidly memory-bound. Tensor cores are great at compute, but they won't help a kernel that is waiting on memory.

Where TC still matters despite being memory-bound:

Even in a memory-bound kernel, restructuring for tensor cores can help if:

  1. The restructuring separates the dequant (memory-heavy) from the matmul (compute-heavy)
  2. The dequantized weights fit in L1/L2 cache for the GEMM pass
  3. The GEMM pass then becomes compute-bound at a higher arithmetic intensity

This is the "dequant → cache → TC GEMM" strategy: a potential future lever for our kernels.

Computing AI with ncu:

AI ≈ smsp__sass_thread_inst_executed_op_ffma_pred_on.sum × 2 FLOPs/FMA
     ──────────────────────────────────────────────────────────────────
     fb__bytes_read.sum + fb__bytes_write.sum   (use fb__ on Hopper/Blackwell)

In our session, dram__bytes_read.sum returned nothing on GB10 (wrong metric name for sm_121 — should be fb__bytes_read.sum). We used L2 throughput as a qualitative proxy.


8. A complete profiling workflow

Here is the end-to-end process used in this session, generalized for any CUDA inference workload:

Step 1: Identify hot kernels with nsys
─────────────────────────────────────
nsys profile --stats=true ./my_program [args]

→ Outputs a ranked table: kernel name, total time, % of runtime.
→ Find the 1-2 kernels that own >50% of the relevant phase's time.
→ Time cost: 2-5 minutes including the inference run.

Step 2: Write an ncu spec script
─────────────────────────────────────
Capture the ncu command with:
  --replay-mode application       (mandatory for large-model inference)
  -k "regex:<kernel_name>"        (target only the hot kernel)
  -s <skip> -c <count>            (skip cold-cache warmup launches)
  --metrics "<metric1>,<metric2>" (see metric list in Section 5)
  --page raw --print-summary per-kernel --print-fp
  -o <output_dir>/<kernel_name>

Recommended minimal metric set (~5-6 passes, ~10 min per kernel):
  sm__throughput.avg.pct_of_peak_sustained_elapsed
  l1tex__throughput.avg.pct_of_peak_sustained_elapsed
  lts__throughput.avg.pct_of_peak_sustained_elapsed
  smsp__inst_executed_pipe_tensor.avg
  sm__warps_active.avg.pct_of_peak_sustained_active
  smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct
  smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct
  smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct
  launch__occupancy_limit_registers

Step 3: Unblock ncu
─────────────────────────────────────
Either: sudo ncu [flags] -- ./program   (immediate)
Or: write /etc/modprobe.d/nvidia-prof.conf + reboot (permanent)

Step 4: Run ncu
─────────────────────────────────────
bash my_ncu_spec.sh 2>&1 | tee output.txt

For 9-metric set: ~6 passes × ~inference_time_per_pass
For 22-metric set: ~14 passes × ~inference_time_per_pass

Step 5: Parse results
─────────────────────────────────────
grep key metric names from the .txt output.
The three columns shown are min / max / avg across the N profiled kernel instances.

Step 6: Diagnose and decide
─────────────────────────────────────
Use the bottleneck→lever table (Section 9) to identify the right fix.
Write down your measurement-derived projection before implementing.
Never project a speedup from theory alone — only from Phase 0 measurement.

9. The bottleneck-to-lever map

Bottleneck (from ncu) Likely root cause Recommended lever
mio_throttle high (>20%) L1 texture pipeline congested — too many texture/cache requests Move lookup tables to shared memory; coalesce global loads; reduce LUT access frequency
short_scoreboard high (>25%) Low occupancy → too few warps to hide L1 hit latency Reduce register pressure (__launch_bounds__, algorithmic simplification)
long_scoreboard high (>20%) L2/DRAM latency exposed — L1 miss rate is high Increase L2 reuse (tiling, blocking); prefetch; increase occupancy if possible
math_throttle high (>20%) Compute pipeline saturated Use tensor cores; reduce FLOPs algorithmically
TC = 0 + kernel is matmul-heavy No tensor core use Restructure for WMMA or cuBLAS GEMM with TC-compatible data layout
Low occupancy (<60%) + register-limited Too many registers per thread __launch_bounds__(block_size, target_blocks_per_sm)
Low occupancy (<60%) + shmem-limited Too much shared memory per block Reduce shmem (smaller tiles, streaming instead of blocking)
Both SM% and L1tex% low Too few active warps, scheduler idles Increase grid size; check block size vs warp count

10. Lessons learned from this session

1. --replay-mode application is mandatory for large models. For any inference workload that fills most of GPU memory, default kernel replay mode will either take hours or run out of disk space. Always use application replay for LLM inference profiling. The ncu warning text even tells you this.

2. Budget time correctly: 22 metrics = 14 passes ≈ 30 min per kernel. For a first diagnosis, a 9-metric set (5-6 passes, ~10 min) is usually enough to identify the dominant bottleneck. Reserve the full 22-metric set for when you need precise stall breakdowns and occupancy limiter detail.

3. Metric names are architecture-specific. dram__bytes_read.sum works on Ampere (sm_80/86); on Hopper/Blackwell (sm_90/sm_121) you need fb__bytes_read.sum. Always verify your metric names return data before committing to a long profiling run — add a 1-pass sanity check first.

4. Two kernels doing the same operation can have completely different bottlenecks. Our moe_mid and moe_down both do quantized matmul but had opposite primary stalls (mio_throttle vs short_scoreboard) because of a difference in quantization format (IQ2_XXS LUT-heavy vs Q2_K higher-register-count). The fix for one makes no difference for the other. Profile each kernel individually.

5. nsys first, ncu second — always. nsys tells you which kernel to care about in minutes. ncu tells you why in 30 minutes. Spending 30 minutes on the wrong kernel (one that owns 2% of runtime, say) is pure waste.

6. __launch_bounds__ is a hint, not a guarantee — verify with ncu. The compiler may or may not achieve the target register count without spilling. After adding __launch_bounds__, re-run ncu and confirm: register count decreased, occupancy increased, the relevant stall reason decreased, and kernel time actually improved.

7. Derive projections from measurement, not from structural analysis. "This kernel does a 2D matmul and tensor cores are 8× faster than scalar FMA, therefore I project 8×" is wrong. Tensor cores help the compute portion; if the kernel is memory- bound, the memory part still takes the same time. Measure first with a quick Phase 0 prototype, then derive a realistic projection from the measured improvement. Structural napkin math consistently overshoots.


11. Quick reference

# 0. Make sure ncu is in PATH
export PATH="/usr/local/cuda-13.0/bin:$PATH"

# 1. Quick nsys kernel ranking
nsys profile --stats=true ./my_program [args]
# → look for the top 1-2 kernels by total GPU time

# 2. Minimal ncu run (~5-6 passes, good for first diagnosis)
ncu --replay-mode application \
    -k "regex:my_hot_kernel" \
    -s 4 -c 3 \
    --metrics "sm__throughput.avg.pct_of_peak_sustained_elapsed,\
l1tex__throughput.avg.pct_of_peak_sustained_elapsed,\
lts__throughput.avg.pct_of_peak_sustained_elapsed,\
smsp__inst_executed_pipe_tensor.avg,\
sm__warps_active.avg.pct_of_peak_sustained_active,\
smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct,\
smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct,\
smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct,\
launch__occupancy_limit_registers" \
    --page raw --print-summary per-kernel --print-fp \
    -- ./my_program [args] \
    2>&1 | tee output_minimal.txt

# 3. Full ncu run (~14 passes, saves binary for ncu-ui)
ncu --replay-mode application \
    -k "regex:my_hot_kernel" \
    -s 8 -c 5 \
    --metrics "${MY_FULL_METRICS}" \
    -o output/my_kernel \
    --page raw --print-summary per-kernel --print-fp \
    -- ./my_program [args] \
    2>&1 | tee output/my_kernel.txt

# 4. Extract key numbers from text output
grep -E "sm__throughput\.avg|smsp__inst_executed_pipe_tensor\.avg|\
smsp__warp_issue_stalled_mio_throttle|smsp__warp_issue_stalled_short_scoreboard|\
sm__warps_active\.avg|launch__occupancy_limit_registers" output/my_kernel.txt \
    | grep -v "max\.\|min\.\|sum\."

# 5. Open in GUI (on a machine with a display)
ncu-ui output/my_kernel.ncu-rep

# 6. Permanent unblock (requires root + reboot)
echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' \
    | sudo tee /etc/modprobe.d/nvidia-prof.conf
echo 'kernel.perf_event_paranoid = -1' \
    | sudo tee /etc/sysctl.d/99-perf.conf

Hardware used in examples: NVIDIA GB10 (Grace Blackwell, sm_121), CUDA 13.0, 48 SMs, 273 GB/s LPDDR5X bandwidth. Written 2026-05-11.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment