Written: 2026-05-11 Audience: developer familiar with CUDA but new to hardware-counter profiling Based on: a real profiling session of a large language model inference engine on an NVIDIA GB10 (Grace Blackwell, sm_121, CUDA 13.0)
This guide uses a concrete example throughout. Understanding the example context helps make the metrics feel real rather than abstract.
The program: a custom CUDA inference engine for DeepSeek-V4-Flash, a large language model (LLM). The engine takes a text prompt and generates a response token by token. "Prefill" is the phase where all the prompt tokens are processed at once before generation starts — it is compute-intensive and dominates latency for long prompts.
The architecture — Mixture of Experts (MoE): DeepSeek uses a MoE transformer. Instead of one large feed-forward network per layer, there are many small "expert" networks, and each token is routed to a few of them. In practice this means the two most expensive operations per layer are:
- moe_mid: the "up-projection" expert GEMM — projects activations to a wider space
- moe_down: the "down-projection" expert GEMM — projects back to the model dimension
These two kernels are the dominant cost of prefill.
Quantization — IQ2_XXS and Q2_K: The model weights are stored in compressed quantized formats to fit in GPU memory and reduce memory bandwidth:
- IQ2_XXS ("importance-quantized 2-bit extra-extra-small"): weights stored as approximately 2 bits per value using a learned lookup table (LUT). To use a weight, the kernel looks up its float value in a table — these LUT lookups are a key source of memory traffic.
- Q2_K ("2-bit k-quant"): a different 2-bit quantization scheme using block-level scales and min values. Slightly different access pattern from IQ2_XXS.
Both formats require dequantization before the multiply-accumulate can happen: load the compressed weight → look up or decode its float value → multiply by activation.
The performance baseline: after enabling chunked layer-major prefill, the engine ran at ~111 t/s on the GB10. An nsys timeline profiler (see Section 1) showed the kernel time breakdown:
moe_mid_iq2_xxs_kernel— 65.8% of prefill timemoe_down_q2_k_kernel— 23.2% of prefill time- Everything else — 11% combined
With 89% of runtime in two kernels, the question became: why are these kernels slow, and what is the right lever to speed them up? That is what ncu is for.
NVIDIA provides two complementary profilers that serve different purposes.
nsys answers "what is happening and when?"
It records a timeline of every kernel launch, memory copy, API call, and CUDA event. You can see which kernels run in what order, whether they overlap, and how long each takes as a fraction of total runtime. nsys also does a fast roofline estimate by sampling DRAM bandwidth and SM utilization at coarse granularity — enough to flag a kernel as "memory-bound" or "compute-bound" without deep-diving into it.
nsys is fast — roughly 5-10% overhead. You can profile a complete multi-second inference run.
nsys profile --stats=true ./my-inference-engine --model weights.gguf --prompt "hello" -n 1After the run, nsys prints a kernel summary sorted by total GPU time. This is the first step in any optimization workflow: find the kernel that owns the most time. In our case nsys told us moe_mid = 65.8% and moe_down = 23.2% of prefill — done in under a minute, with zero code changes.
ncu answers "WHY is a specific kernel fast or slow?"
It reads the GPU's on-chip performance monitoring units (PMUs) — hardware counters inside each SM, L1 cache, L2 cache, and memory controller. These counters accumulate counts of events like: warp instructions issued, warp instructions stalled, L1 cache bytes read, tensor-core instructions executed. Because they are hardware registers inside the chip, they report exact counts — not samples.
ncu is slow — it needs to replay the target kernel many times (14 times in our session). Use nsys to rank kernels by impact first, then ncu to root-cause the winner.
What ncu revealed for our two kernels:
moe_mid_iq2_xxs_kernel: L1 texture pipeline fully saturated — the LUT dequantization is flooding the L1 cache with lookups, causing 41% of warp cycles to stall waiting for the L1/MIO pipe to accept the next request.moe_down_q2_k_kernel: only 50% SM occupancy due to high register pressure (80 registers per thread → only 6 thread blocks fit on each SM). With so few active warps, the SM cannot hide L1 cache latency, causing 45% of warp cycles to stall on L1 hits.
These are two completely different root causes requiring two different fixes.
By default, NVIDIA restricts hardware-counter access to the root user. This is
controlled by a parameter in the nvidia kernel module:
NVreg_RestrictProfilingToAdminUsers=1 (default: ON)
A non-root user attempting to run ncu gets:
==ERROR== ERR_NVGPUCTRPERM: Permission issue with Performance Monitoring
Unable to profile all kernels as the CUDA application running under
user [...] does not have sufficient privileges.
Important: lowering /proc/sys/kernel/perf_event_paranoid to -1 does not fix
this. That sysctl governs Linux's own perf subsystem. NVIDIA's restriction is enforced
entirely inside the GPU driver, independently.
Two ways to unblock ncu:
Option A — immediate, no reboot required: run ncu as root
sudo ncu [flags] -- ./my_program [args]Option B — permanent, survives reboots: disable the restriction via modprobe config
# Create the config file (requires root)
echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' \
| sudo tee /etc/modprobe.d/nvidia-prof.conf
# Also set perf_event_paranoid for full perf tool access
echo 'kernel.perf_event_paranoid = -1' \
| sudo tee /etc/sysctl.d/99-perf.conf
# Reboot for the module parameter to take effect
sudo rebootAfter the reboot, any user can run ncu without sudo.
When ncu collects hardware counters it must replay the kernel multiple times — once per group of metrics that can be collected simultaneously on the hardware. For each replay, the GPU must be in exactly the same memory state as when the kernel originally ran.
The default replay mode (--replay-mode kernel) achieves this by:
- Before the kernel runs: save the entire GPU memory state to host RAM or disk
- Run the kernel, collect one pass of metrics
- Restore the saved GPU memory state
- Run the kernel again, collect another pass
- Repeat for every pass needed
This works well for small programs. For a large language model inference engine with 82 GB of weights loaded into GPU memory, it is a disaster: each save/restore cycle moves 82 GB, and with 14 passes that is 14 × 82 GB = 1.1 TB of data movement. The run stalls at "0%" indefinitely (or runs out of disk space). ncu even warns you:
==WARNING== Backing up device or system memory to file. Kernel replay might be very slow.
Consider using "--replay-mode application" to avoid memory save-and-restore.
The fix: --replay-mode application
Instead of saving/restoring GPU state, ncu re-launches the entire application once per metric pass. The application loads the model from scratch each time, runs the inference, and ncu intercepts the target kernel launches to collect that pass's counters.
ncu --replay-mode application [other flags] -- ./my_program [args]The trade-off:
| Kernel replay | Application replay | |
|---|---|---|
| App launches | 1 | N passes |
| Memory save/restore | Yes, per pass | No |
| Good when | Model is small, app startup is slow | Model is large, app startup is fast |
For our 82 GB model with 14 passes and ~30s startup:
- Kernel replay: 14 × 82 GB restore ≈ infeasible
- Application replay: 14 × (30s startup + 125s inference) ≈ 35 minutes ✓
For small CUDA programs (say, a benchmark with a few MB of data), kernel replay is fine and faster. For large model inference, always use application replay.
A GPU SM contains a fixed number of hardware performance counter registers, and they are organized into groups ("counter sets") where only one set can be read at a time. A single counter set might let you measure "SM throughput + L1 hit rate" simultaneously, but measuring "tensor-core instructions + DRAM bytes" requires a different counter set.
If you request 22 metrics that span 14 different counter sets, ncu needs 14 passes. In our profiling script we collected 22 metrics (6 groups: throughput%, compute pipeline, occupancy, stall reasons, DRAM byte counts, launch config), which ncu scheduled into 14 passes.
With -s 8 -c 5 (skip 8 warmup launches, profile 5), each pass runs 5 kernel instances,
so: 14 passes × 5 instances = 70 kernel executions per kernel type profiled.
ncu --metrics "sm__throughput.avg.pct_of_peak_sustained_elapsed,\
smsp__inst_executed_pipe_tensor.avg,\
smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct" \
-- ./programMetric names follow the pattern <unit>__<counter>.<aggregation>.<normalization>:
sm__throughput.avg.pct_of_peak_sustained_elapsed→ average SM throughput as % of theoretical peak, over elapsed kernel timesmsp__inst_executed_pipe_tensor.avg→ average tensor-core warp instructions per SM sub-partition (SMSP)smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct→ % of active cycles a warp spends stalled waiting for the MIO memory pipeline
Architecture-specific metric names: metric names are not fully portable across GPU
generations. On NVIDIA Hopper and Blackwell (sm_90 and sm_121), the DRAM memory
controller counters use fb__ (framebuffer) names rather than dram__:
| What you want | Ampere/Ada (sm_8x/sm_89) | Hopper/Blackwell (sm_90/sm_121) |
|---|---|---|
| DRAM read bytes | dram__bytes_read.sum |
fb__bytes_read.sum |
| DRAM write bytes | dram__bytes_write.sum |
fb__bytes_write.sum |
| DRAM throughput % | dram__throughput.avg.pct_... |
fb__throughput.avg.pct_... |
In our session, dram__bytes_read.sum returned no data on the GB10 (sm_121). Using
fb__bytes_read.sum in future scripts would give the correct DRAM byte counts needed
for arithmetic intensity calculation.
ncu -k "regex:moe_mid_iq2_xxs_kernel" # only profile matching kernels
-s 8 # skip first 8 matching launches
-c 5 # then profile the next 5The -s (skip) flag is important for inference workloads. Early kernel launches
in a neural network often correspond to the first few layers, where GPU caches are
cold and are not representative of the steady-state performance the real workload sees.
By skipping 8 launches (roughly the first few transformer layers), you profile kernels
that see a warm and realistic cache state.
ncu --page raw # show all collected metrics; default shows curated sections only
--print-summary per-kernel # statistics across the N profiled instances
--print-fp # float values, no rounding to integers
-o output_dir/kernel # save binary .ncu-rep for ncu-ui GUI viewer
-- ./program \
2>&1 | tee output_dir/kernel.txt # also capture to text file for grep/awkWithout --page raw you only get a curated "section" summary that omits most raw
counters. For root-cause analysis you want all the numbers.
sm__throughput.avg.pct_of_peak_sustained_elapsed
The fraction of time the SM's instruction dispatch pipeline was issuing instructions, averaged over the kernel's elapsed time. This is roughly "how busy was the SM".
| Value | Interpretation |
|---|---|
| 95-100% | SM fully occupied — check stall reasons to see what it's waiting on |
| 50-70% | SM often idle — likely an occupancy or launch config issue |
| <30% | SM frequently idle — poor GPU utilization |
Pitfall: high SM throughput does not mean high compute throughput. The SM dispatches load/store instructions too — a kernel can show 97% SM throughput while spending 40% of that time stalled waiting for a memory dependency to resolve. In our moe_mid case, SM throughput was 97.9% yet the kernel was clearly memory-bound (41% mio_throttle stall).
l1tex__throughput.avg.pct_of_peak_sustained_elapsed
Utilization of the L1 texture cache pipeline. All global loads, texture fetches, and shared memory operations route through this pipeline. When L1tex throughput matches SM throughput and both are near 100%, the SM is fully occupied with memory operations.
In our moe_mid kernel both SM% and L1tex% were 97.9% — the SM's work was dominated by L1 cache accesses (the IQ2_XXS lookup-table dequantization), not by compute.
lts__throughput.avg.pct_of_peak_sustained_elapsed
Utilization of the L2 cache ("LTS" = L2 tile/slice). A high value here means many L1 cache misses are reaching L2; a low value means L1 is mostly hitting.
Our two kernels differed dramatically:
- moe_mid: 40.8% — significant L2 traffic, consistent with LUT misses escaping L1
- moe_down: 4.7% — almost no L2 traffic; the Q2_K weights appear to be L1-resident
(or arriving via a memory path — NVLink-C2C on Grace Blackwell — that bypasses L2
as measured by
lts__counters)
smsp__inst_executed_pipe_tensor.avg
Tensor-core warp instructions per SM sub-partition. If this is zero, the kernel is doing all its matrix math as scalar FP32/FP16 FMA instructions, leaving the tensor cores — which are typically 8-16× faster than scalar FMA for matrix ops — completely idle.
Both our kernels showed TC = 0. For kernels that spend the majority of their time on dequant + matrix-vector multiply, this is the highest-ceiling optimization lever: restructure the kernel to use FP16/BF16 tensor-core GEMMs.
smsp__inst_executed_pipe_fma.avg
Scalar FP32 FMA warp instructions per SM sub-partition. Together with TC=0, a high FMA count tells you the kernel is doing a lot of multiply-accumulate work entirely through the slower scalar FP32 pipe, not tensor cores.
sm__warps_active.avg.pct_of_peak_sustained_active
Fraction of the SM's maximum concurrent warp slots that were actually filled. On our GB10 (48 SMs × 48 max warps/SM = 2304 total slots), 81% means ~1866 warps active; 50% means ~1152.
Occupancy matters because GPUs hide memory latency by switching to a different ready warp when the current warp stalls. With more active warps, the scheduler has more choices and can better fill stall gaps. At 50% occupancy you have half the latency- hiding capacity compared to 100%.
Low occupancy is often the root cause of short_scoreboard stalls: even though L1 latency is only 4-8 cycles, if there are only 24 active warps per SM instead of 48, many cycles have no ready warp to run.
launch__occupancy_limit_registers (blocks/SM limited by register file)
launch__occupancy_limit_shared_mem (blocks/SM limited by shared memory)
launch__occupancy_limit_warps (blocks/SM hardware maximum)
These three tell you which resource is the binding constraint. The actual blocks/SM is the minimum of all three. Common cases:
Register-limited (our moe_down case):
47 regs/thread × 128 threads/block = 6,016 regs/block
65,536 regs/SM ÷ 6,016 regs/block = 10.9 → floor = 10 blocks/SM (moe_mid)
80 regs/thread × 128 threads/block = 10,240 regs/block
65,536 regs/SM ÷ 10,240 regs/block = 6.4 → floor = 6 blocks/SM (moe_down)
80 registers per thread is high — it means the compiler is keeping many intermediate
values simultaneously live. The fix is __launch_bounds__ (see Section 6).
Shared-memory-limited (not our case, but common in attention kernels): A kernel using 32 KB of shared memory per block with 48 KB total available gets only 1 block/SM regardless of register count. This was the failure mode for an earlier FA-2 K-tile attempt in this project where shmem grew to 66 KB.
These are the most diagnostic metrics for understanding what is holding a kernel back. Each measures the percentage of "active warp cycles" (cycles where at least one warp is active) during which warps were stalled for that reason.
| Stall reason | ncu metric name | What it means |
|---|---|---|
mio_throttle |
smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct |
The L1/L2 memory I/O (MIO) pipeline is backed up. Warps are issuing memory requests faster than the pipe can accept them. L1tex throughput is typically near 100% when this is high. |
short_scoreboard |
smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct |
Warp waiting for an L1 cache hit to return (~4-8 cycles). Data is in L1 but the warp has a dependent instruction that can't run until the load completes. Fix: more active warps to hide this latency. |
long_scoreboard |
smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct |
Warp waiting for L2 or DRAM access (~100-1000 cycles). Needs many more warps to hide; even 100% occupancy may not fully cover DRAM latency. |
math_throttle |
smsp__warp_issue_stalled_math_throttle_per_warp_active.pct |
Warp waiting for the FP32/FP64/INT compute pipeline. True compute bottleneck. Fix: tensor cores, algorithmic reduction. |
not_selected |
smsp__warp_issue_stalled_not_selected_per_warp_active.pct |
Warp is ready but the warp scheduler picked a different warp. Indicates there are too many ready warps for the available issue slots — actually a sign of good utilization. |
wait |
smsp__warp_issue_stalled_wait_per_warp_active.pct |
Warp at a __syncthreads() barrier, waiting for other warps in the same block. |
What our two kernels showed:
moe_mid_iq2_xxs_kernel:
mio_throttle: 41.2% ← dominant — L1 texture pipe congested by LUT lookups
short_scoreboard: 16.4%
long_scoreboard: 13.4%
not_selected: 9.9%
wait: 8.9%
The LUT dequantization for IQ2_XXS issues many texture fetches per element. Even with good occupancy (81%), the fetch rate exceeds the L1 pipe's acceptance rate.
moe_down_q2_k_kernel:
short_scoreboard: 44.9% ← dominant — not enough warps to hide L1 hit latency
long_scoreboard: 18.2%
wait: 16.3%
mio_throttle: 5.7% ← low; L1 pipe is not the problem here
not_selected: 5.0%
Same GPU, same operation type, completely different bottleneck — because moe_down has 80 regs/thread → only 6 blocks/SM → 24 active warps/SM → not enough to hide 4-8 cycle L1 latency. mio_throttle is low because the L1 pipe is not congested; it just isn't being fed fast enough due to the warp shortfall.
The key lesson: two kernels doing similar work (quantized matmul) can have completely different bottlenecks. Always measure each kernel separately.
CUDA allows a kernel to tell the compiler its expected launch configuration:
__global__ __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM)
void my_kernel(...)maxThreadsPerBlock: the maximum block size you will use. The compiler uses this as
a budget: since blocks won't be larger than this, it only needs enough registers to
support this many threads simultaneously in the register file.
minBlocksPerSM: the minimum number of blocks per SM you want to achieve. The compiler
will try to reduce per-thread register usage to meet this — potentially by spilling
registers to "local memory" (off-chip, slow) if necessary.
Example: moe_down currently uses 80 registers/thread → 6 blocks/SM. To target 9 blocks:
__global__ __launch_bounds__(128, 9) // 128 threads/block, want ≥9 blocks/SM
void ds4_cuda_routed_moe_down_q2_k_kernel(...)For 9 blocks/SM with 128 threads/block:
max regs/block = 65,536 / 9 = 7,281
max regs/thread = 7,281 / 128 = 56 → compiler targets ≤56 regs/thread
The risk — register spilling: if the kernel genuinely needs more than 56 registers
to do its work without spilling, the compiler stores the excess registers in "local
memory" (a per-thread private region in DRAM). Local memory accesses are slow — similar
to global memory latency. If spilling is heavy, __launch_bounds__ can make things worse.
How to verify whether the hint helped:
# Check register count in PTX assembly output
nvcc -Xptxas -v my_kernel.cu 2>&1 | grep "moe_down"
# → "used N registers, X bytes lmem" — if lmem > 0, there is spilling
# Then run ncu again and check:
# 1. launch__registers_per_thread decreased
# 2. launch__occupancy_limit_registers increased
# 3. sm__warps_active increased
# 4. short_scoreboard stall % decreased
# 5. Overall kernel time decreasedNever assume __launch_bounds__ helped — measure it.
The roofline model is a framework for understanding whether a kernel is limited by compute throughput or by memory bandwidth.
Arithmetic intensity (AI): floating-point operations per byte of memory transferred
AI = total FLOPs executed
─────────────────────
total DRAM bytes read + written
Plotting AI against the hardware's compute-to-bandwidth ratio (the "ridge point") tells you which resource is the binding constraint:
- AI > ridge point → compute-bound (do fewer FLOPs, or use faster compute like TC)
- AI < ridge point → memory-bound (move less data, use caching, increase reuse)
For our GB10 (Grace Blackwell unified memory):
- FP32 compute peak: roughly 600+ TFLOPS (theoretical)
- Memory bandwidth: ~273 GB/s (LPDDR5X)
- Ridge point: ~2200 FLOP/byte — a very high compute-to-bandwidth ratio
For a quantized matmul with IQ2_XXS weights: each 2-bit weight byte produces roughly 2-4 multiply-accumulate operations. That puts AI around 4-8 FLOP/byte, far below the ridge point of 2200. The kernels are solidly memory-bound. Tensor cores are great at compute, but they won't help a kernel that is waiting on memory.
Where TC still matters despite being memory-bound:
Even in a memory-bound kernel, restructuring for tensor cores can help if:
- The restructuring separates the dequant (memory-heavy) from the matmul (compute-heavy)
- The dequantized weights fit in L1/L2 cache for the GEMM pass
- The GEMM pass then becomes compute-bound at a higher arithmetic intensity
This is the "dequant → cache → TC GEMM" strategy: a potential future lever for our kernels.
Computing AI with ncu:
AI ≈ smsp__sass_thread_inst_executed_op_ffma_pred_on.sum × 2 FLOPs/FMA
──────────────────────────────────────────────────────────────────
fb__bytes_read.sum + fb__bytes_write.sum (use fb__ on Hopper/Blackwell)
In our session, dram__bytes_read.sum returned nothing on GB10 (wrong metric name for
sm_121 — should be fb__bytes_read.sum). We used L2 throughput as a qualitative proxy.
Here is the end-to-end process used in this session, generalized for any CUDA inference workload:
Step 1: Identify hot kernels with nsys
─────────────────────────────────────
nsys profile --stats=true ./my_program [args]
→ Outputs a ranked table: kernel name, total time, % of runtime.
→ Find the 1-2 kernels that own >50% of the relevant phase's time.
→ Time cost: 2-5 minutes including the inference run.
Step 2: Write an ncu spec script
─────────────────────────────────────
Capture the ncu command with:
--replay-mode application (mandatory for large-model inference)
-k "regex:<kernel_name>" (target only the hot kernel)
-s <skip> -c <count> (skip cold-cache warmup launches)
--metrics "<metric1>,<metric2>" (see metric list in Section 5)
--page raw --print-summary per-kernel --print-fp
-o <output_dir>/<kernel_name>
Recommended minimal metric set (~5-6 passes, ~10 min per kernel):
sm__throughput.avg.pct_of_peak_sustained_elapsed
l1tex__throughput.avg.pct_of_peak_sustained_elapsed
lts__throughput.avg.pct_of_peak_sustained_elapsed
smsp__inst_executed_pipe_tensor.avg
sm__warps_active.avg.pct_of_peak_sustained_active
smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct
smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct
smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct
launch__occupancy_limit_registers
Step 3: Unblock ncu
─────────────────────────────────────
Either: sudo ncu [flags] -- ./program (immediate)
Or: write /etc/modprobe.d/nvidia-prof.conf + reboot (permanent)
Step 4: Run ncu
─────────────────────────────────────
bash my_ncu_spec.sh 2>&1 | tee output.txt
For 9-metric set: ~6 passes × ~inference_time_per_pass
For 22-metric set: ~14 passes × ~inference_time_per_pass
Step 5: Parse results
─────────────────────────────────────
grep key metric names from the .txt output.
The three columns shown are min / max / avg across the N profiled kernel instances.
Step 6: Diagnose and decide
─────────────────────────────────────
Use the bottleneck→lever table (Section 9) to identify the right fix.
Write down your measurement-derived projection before implementing.
Never project a speedup from theory alone — only from Phase 0 measurement.
| Bottleneck (from ncu) | Likely root cause | Recommended lever |
|---|---|---|
mio_throttle high (>20%) |
L1 texture pipeline congested — too many texture/cache requests | Move lookup tables to shared memory; coalesce global loads; reduce LUT access frequency |
short_scoreboard high (>25%) |
Low occupancy → too few warps to hide L1 hit latency | Reduce register pressure (__launch_bounds__, algorithmic simplification) |
long_scoreboard high (>20%) |
L2/DRAM latency exposed — L1 miss rate is high | Increase L2 reuse (tiling, blocking); prefetch; increase occupancy if possible |
math_throttle high (>20%) |
Compute pipeline saturated | Use tensor cores; reduce FLOPs algorithmically |
TC = 0 + kernel is matmul-heavy |
No tensor core use | Restructure for WMMA or cuBLAS GEMM with TC-compatible data layout |
| Low occupancy (<60%) + register-limited | Too many registers per thread | __launch_bounds__(block_size, target_blocks_per_sm) |
| Low occupancy (<60%) + shmem-limited | Too much shared memory per block | Reduce shmem (smaller tiles, streaming instead of blocking) |
| Both SM% and L1tex% low | Too few active warps, scheduler idles | Increase grid size; check block size vs warp count |
1. --replay-mode application is mandatory for large models.
For any inference workload that fills most of GPU memory, default kernel replay mode
will either take hours or run out of disk space. Always use application replay for
LLM inference profiling. The ncu warning text even tells you this.
2. Budget time correctly: 22 metrics = 14 passes ≈ 30 min per kernel. For a first diagnosis, a 9-metric set (5-6 passes, ~10 min) is usually enough to identify the dominant bottleneck. Reserve the full 22-metric set for when you need precise stall breakdowns and occupancy limiter detail.
3. Metric names are architecture-specific.
dram__bytes_read.sum works on Ampere (sm_80/86); on Hopper/Blackwell (sm_90/sm_121)
you need fb__bytes_read.sum. Always verify your metric names return data before
committing to a long profiling run — add a 1-pass sanity check first.
4. Two kernels doing the same operation can have completely different bottlenecks. Our moe_mid and moe_down both do quantized matmul but had opposite primary stalls (mio_throttle vs short_scoreboard) because of a difference in quantization format (IQ2_XXS LUT-heavy vs Q2_K higher-register-count). The fix for one makes no difference for the other. Profile each kernel individually.
5. nsys first, ncu second — always.
nsys tells you which kernel to care about in minutes. ncu tells you why in 30 minutes.
Spending 30 minutes on the wrong kernel (one that owns 2% of runtime, say) is pure waste.
6. __launch_bounds__ is a hint, not a guarantee — verify with ncu.
The compiler may or may not achieve the target register count without spilling. After
adding __launch_bounds__, re-run ncu and confirm: register count decreased, occupancy
increased, the relevant stall reason decreased, and kernel time actually improved.
7. Derive projections from measurement, not from structural analysis. "This kernel does a 2D matmul and tensor cores are 8× faster than scalar FMA, therefore I project 8×" is wrong. Tensor cores help the compute portion; if the kernel is memory- bound, the memory part still takes the same time. Measure first with a quick Phase 0 prototype, then derive a realistic projection from the measured improvement. Structural napkin math consistently overshoots.
# 0. Make sure ncu is in PATH
export PATH="/usr/local/cuda-13.0/bin:$PATH"
# 1. Quick nsys kernel ranking
nsys profile --stats=true ./my_program [args]
# → look for the top 1-2 kernels by total GPU time
# 2. Minimal ncu run (~5-6 passes, good for first diagnosis)
ncu --replay-mode application \
-k "regex:my_hot_kernel" \
-s 4 -c 3 \
--metrics "sm__throughput.avg.pct_of_peak_sustained_elapsed,\
l1tex__throughput.avg.pct_of_peak_sustained_elapsed,\
lts__throughput.avg.pct_of_peak_sustained_elapsed,\
smsp__inst_executed_pipe_tensor.avg,\
sm__warps_active.avg.pct_of_peak_sustained_active,\
smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct,\
smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct,\
smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct,\
launch__occupancy_limit_registers" \
--page raw --print-summary per-kernel --print-fp \
-- ./my_program [args] \
2>&1 | tee output_minimal.txt
# 3. Full ncu run (~14 passes, saves binary for ncu-ui)
ncu --replay-mode application \
-k "regex:my_hot_kernel" \
-s 8 -c 5 \
--metrics "${MY_FULL_METRICS}" \
-o output/my_kernel \
--page raw --print-summary per-kernel --print-fp \
-- ./my_program [args] \
2>&1 | tee output/my_kernel.txt
# 4. Extract key numbers from text output
grep -E "sm__throughput\.avg|smsp__inst_executed_pipe_tensor\.avg|\
smsp__warp_issue_stalled_mio_throttle|smsp__warp_issue_stalled_short_scoreboard|\
sm__warps_active\.avg|launch__occupancy_limit_registers" output/my_kernel.txt \
| grep -v "max\.\|min\.\|sum\."
# 5. Open in GUI (on a machine with a display)
ncu-ui output/my_kernel.ncu-rep
# 6. Permanent unblock (requires root + reboot)
echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' \
| sudo tee /etc/modprobe.d/nvidia-prof.conf
echo 'kernel.perf_event_paranoid = -1' \
| sudo tee /etc/sysctl.d/99-perf.confHardware used in examples: NVIDIA GB10 (Grace Blackwell, sm_121), CUDA 13.0, 48 SMs, 273 GB/s LPDDR5X bandwidth. Written 2026-05-11.