@shreyansh26
Forked from getianao/ncu_metrics.md
Created January 30, 2025 05:46
ncu --list-sets  # The configuration for sets. A set defines a set of sections.
ncu --list-sections  # The configuration for sections. A section defines a set of metrics.
ncu --query-metrics   # All individual metrics.
ncu --query-metrics-mode suffix --metrics <metrics list> # Check various suffixes for a base metric name.

ncu_cli

NCU Usage

ncu --metrics ${METRICS} (--set full) -f -o ${OUTPUT} ${COMMAND}
ncu --import ${OUTPUT} --print-units=base
ncu --import ${OUTPUT} --page=details --print-metric-name=name --print-details=all
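
The collection command above can also be assembled from a script. The following is a minimal sketch, assuming a hypothetical application `./my_app` and a hypothetical helper named `build_ncu_command`; it uses only the flags that already appear in these notes (`--metrics`, `--set full`, `-f`, `-o`) and only builds the argument list, without running anything.

```python
import shlex


def build_ncu_command(command, output, metrics=None, use_full_set=False):
    """Return the argv list for an ncu collection run (hypothetical helper)."""
    argv = ["ncu"]
    if metrics:
        # --metrics takes a comma-separated list of metric names.
        argv += ["--metrics", ",".join(metrics)]
    if use_full_set:
        argv += ["--set", "full"]
    # -f overwrites an existing report, -o names the report file.
    argv += ["-f", "-o", output]
    argv += list(command)
    return argv


# "./my_app 1024" is a placeholder for the profiled command.
argv = build_ncu_command(
    ["./my_app", "1024"],
    "report",
    metrics=["l1tex__t_sector_hit_rate", "lts__t_sector_hit_rate"],
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv)` on a machine where `ncu` is installed.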

Metrics:

  • Warp divergence:
    • smsp__thread_inst_executed_per_inst_executed
    • the ratio of active threads that are not predicated off over the maximum number of threads per warp for each executed instruction
  • L1 cache hit rate:
    • l1tex__t_sector_hit_rate
    • # of sector hits per sector
  • L2 cache hit rate:
    • lts__t_sector_hit_rate
    • proportion of L2 sector lookups that hit
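
To make the warp divergence metric concrete, here is a small model in Python. It is a sketch under simplifying assumptions, not how the hardware counter is implemented: each warp-level instruction is represented only by its number of active threads, predication is ignored, and the instruction counts are made up for illustration.

```python
WARP_SIZE = 32  # threads per warp


def avg_active_threads_per_inst(active_counts):
    """Thread-level instructions divided by warp-level instructions.

    active_counts holds one entry per executed warp instruction:
    the number of threads that were active for it.
    """
    thread_inst = sum(active_counts)
    warp_inst = len(active_counts)
    return thread_inst / warp_inst


# No divergence: 10 instructions, all 32 threads active each time.
uniform = [WARP_SIZE] * 10

# if/else on lane parity: each side runs 5 instructions with 16 active threads.
divergent = [16] * 5 + [16] * 5

print(avg_active_threads_per_inst(uniform))                # 32.0
print(avg_active_threads_per_inst(divergent) / WARP_SIZE)  # 0.5
```

In this model a value well below 32 (or a ratio well below 1.0) indicates that many instructions ran with only part of the warp active.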

Occupancy: Executing other warps when one warp is paused or stalled is the only way to hide latencies and keep the hardware busy. Higher occupancy does not always mean higher performance: beyond a certain point, additional occupancy does not improve performance. Low occupancy, however, always interferes with the ability to hide memory latency and degrades performance. See the CUDA C++ Best Practices Guide.

Causes of low theoretical occupancy: the per-SM limits on warps, blocks, registers, and shared memory, combined with the register and shared memory usage of each block.

Causes of low achieved occupancy: workload imbalance within a block or between blocks, and small kernel launch dimensions.

  • Achieved occupancy:
    • sm__warps_active.avg.pct_of_peak_sustained_active
    • cumulative # of warps in flight
  • Theoretical occupancy
    • sm__maximum_warps_per_active_cycle_pct
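
The limits behind theoretical occupancy can be worked through by hand. The sketch below is a simplified estimate, assuming placeholder per-SM limits (64 warps, 32 blocks, 65536 registers, 64 KiB shared memory) that are not taken from these notes and that differ between GPUs; it also ignores register and shared memory allocation granularity, so the occupancy reported by ncu can differ.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, max_blocks_per_sm=32,
                          regs_per_sm=65536, smem_per_sm=65536, warp_size=32):
    """Return (occupancy, limiting resource) for one kernel configuration.

    All per-SM defaults are illustrative placeholders; query the real
    values for your device.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division

    # Maximum resident blocks per SM allowed by each resource.
    limits = {
        "warps": max_warps_per_sm // warps_per_block,
        "blocks": max_blocks_per_sm,
    }
    regs_per_block = regs_per_thread * warp_size * warps_per_block
    if regs_per_block > 0:
        limits["registers"] = regs_per_sm // regs_per_block
    if smem_per_block > 0:
        limits["shared_mem"] = smem_per_sm // smem_per_block

    limiter = min(limits, key=limits.get)
    resident_warps = limits[limiter] * warps_per_block
    return resident_warps / max_warps_per_sm, limiter


# 256 threads/block, 64 registers/thread, no shared memory.
print(theoretical_occupancy(256, 64, 0))          # (0.5, 'registers')
# 256 threads/block, 32 registers/thread, 48 KiB shared memory per block.
print(theoretical_occupancy(256, 32, 48 * 1024))  # (0.125, 'shared_mem')
```

The limiting resource is the one to reduce first: fewer registers per thread in the first case, less shared memory per block in the second.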

Eligible warp: From the set of eligible warps, the scheduler selects a single warp from which to issue one or more instructions (the issued warp). To increase the number of eligible warps, reduce the time the active warps are stalled by inspecting the top stall reasons.

  • Theoretical warps per scheduler
    • smsp__maximum_warps_avg_per_active_cycle
  • Active warps per scheduler
    • smsp__warps_active
  • Eligible warps per scheduler
    • smsp__warps_eligible
  • Warp Stall

A request means something different in the L1 cache and the L2 cache, so sectors per request (sector/request) is not directly comparable between the two:

  • L1TEX: a request comes from an instruction, i.e. one memory operation issued by a warp. One instruction (request) can touch multiple sectors, depending on how well the memory accesses are coalesced.
  • L2: a request comes from the sectors required by the threads. A request (cache line: 128 B) contains four sectors (32 B each), and only some of those sectors may be used by L1.
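
The effect of coalescing on sectors per L1TEX request can be estimated by counting the distinct 32 B sectors that one warp-wide access touches. This is a minimal sketch, assuming a 32-thread warp, 4-byte elements, and a fixed byte stride between neighbouring threads; the function name and parameters are illustrative only.

```python
SECTOR_BYTES = 32  # sector size used in the notes above


def sectors_for_warp_access(base_addr, stride_bytes, elem_bytes=4, warp_size=32):
    """Count the distinct sectors touched by one warp-wide load or store."""
    sectors = set()
    for lane in range(warp_size):
        start = base_addr + lane * stride_bytes
        end = start + elem_bytes - 1
        # An element may straddle a sector boundary, so cover its whole range.
        for sector in range(start // SECTOR_BYTES, end // SECTOR_BYTES + 1):
            sectors.add(sector)
    return len(sectors)


print(sectors_for_warp_access(0, 4))    # 4  (fully coalesced: one 128 B line)
print(sectors_for_warp_access(0, 8))    # 8  (stride 2: half of each sector unused)
print(sectors_for_warp_access(0, 128))  # 32 (one sector per thread)
print(sectors_for_warp_access(16, 4))   # 5  (coalesced but misaligned base)
```

A fully coalesced 4-byte access needs 4 sectors per request; values approaching 32 mean most of the transferred bytes are wasted.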