
@getianao
Last active January 30, 2025 05:46
ncu --list-sets  # The configuration for sets. A set defines a set of sections.
ncu --list-sections  # The configuration for sections. A section defines a set of metrics.
ncu --query-metrics   # All individual metrics.
ncu --query-metrics-mode suffix --metrics <metrics list> # Check various suffixes for a base metric name.

ncu_cli

NCU Usage

ncu --metrics ${METRICS} (--set full) -f -o ${OUTPUT} ${COMMAND}
ncu --import ${OUTPUT} --print-units=base
ncu --import ${OUTPUT} --page=details --print-metric-name=name --print-details=all

Metrics:

  • Warp divergence:
    • smsp__thread_inst_executed_per_inst_executed
    • the ratio of active threads that are not predicated off over the maximum number of threads per warp for each executed instruction
  • L1 cache hit rate:
    • l1tex__t_sector_hit_rate
    • # of sector hits per sector
  • L2 cache hit rate:
    • lts__t_sector_hit_rate
    • # proportion of L2 sector lookups that hit
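To make the warp divergence metric above more concrete, here is a minimal Python sketch of the ratio it reports: thread-level instructions executed divided by warp-level instructions executed. This is a toy model, not how Nsight Compute collects the counter; the function name `avg_threads_per_inst` and the bitmask trace format are assumptions made for illustration.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def avg_threads_per_inst(active_masks):
    """Average number of threads that execute each warp-level instruction.

    active_masks: one 32-bit integer per executed instruction; bit i is set
    if thread i executed that instruction (hypothetical trace format).
    """
    if not active_masks:
        return 0.0
    thread_inst = sum(bin(m & 0xFFFFFFFF).count("1") for m in active_masks)
    return thread_inst / len(active_masks)


# No divergence: all 32 threads execute each of 4 instructions.
uniform = [0xFFFFFFFF] * 4

# Divergent branch: 2 common instructions, then each half of the warp
# executes its own side of the branch (2 instructions per side).
divergent = [0xFFFFFFFF] * 2 + [0x0000FFFF] * 2 + [0xFFFF0000] * 2

print(avg_threads_per_inst(uniform))    # 32.0
print(avg_threads_per_inst(divergent))  # 128 / 6, about 21.33
```

A value close to 32 suggests little divergence or predication; lower values mean many lanes sit idle per issued instruction.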

Occupancy: Executing other warps when one warp is paused or stalled is the only way to hide latencies and keep the hardware busy. Higher occupancy does not always mean higher performance; beyond a certain point, additional occupancy does not improve performance. However, low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation. See the CUDA C++ Best Practices Guide.

Causes of low theoretical occupancy: the per-SM limits on warps, blocks, registers, and shared memory, combined with the registers and shared memory each block uses.

Causes of low achieved occupancy: workload imbalance within or between blocks, or small kernel launch dimensions.

  • Achieved occupancy:
    • sm__warps_active.avg.pct_of_peak_sustained_active
    • # cumulative # of warps in flight
  • Theoretical occupancy:
    • sm__maximum_warps_per_active_cycle_pct
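The limiters listed above (warps, blocks, registers, and shared memory per SM) can be combined into a rough theoretical-occupancy estimate. The sketch below is a simplified model: the default hardware limits are assumed values that vary by compute capability, register and shared memory allocation granularity is ignored, and the function name `theoretical_occupancy` is hypothetical. Treat the figure ncu reports as authoritative.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, max_blocks_per_sm=32,
                          regs_per_sm=65536, smem_per_sm=100 * 1024,
                          warp_size=32):
    """Return (occupancy, limiter) for one kernel configuration.

    All *_per_sm defaults are assumed limits; check your GPU's actual values.
    regs_per_thread must be greater than zero.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * warp_size

    # Resident blocks per SM allowed by each resource.
    limits = {
        "warps": max_warps_per_sm // warps_per_block,
        "blocks": max_blocks_per_sm,
        "registers": regs_per_sm // regs_per_block,
        "shared_mem": (smem_per_sm // smem_per_block
                       if smem_per_block else max_blocks_per_sm),
    }
    limiter = min(limits, key=limits.get)  # tightest limit wins
    resident_warps = limits[limiter] * warps_per_block
    return resident_warps / max_warps_per_sm, limiter


print(theoretical_occupancy(256, 32, 0))          # (1.0, 'warps')
print(theoretical_occupancy(256, 64, 0))          # (0.5, 'registers')
print(theoretical_occupancy(256, 32, 48 * 1024))  # (0.25, 'shared_mem')
```

Doubling register usage from 32 to 64 per thread halves the estimate in this model, which is the kind of effect the theoretical occupancy metric exposes.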

Eligible Warp: From the set of eligible warps, the scheduler selects a single warp from which to issue one or more instructions (the Issued Warp). To increase the number of eligible warps, reduce the time the active warps spend stalled by inspecting the top stall reasons.

  • Theoretical warps per scheduler:
    • smsp__maximum_warps_avg_per_active_cycle
  • Active warps per scheduler:
    • smsp__warps_active
  • Eligible warps per scheduler:
    • smsp__warps_eligible
  • Warp Stall
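The relationship between active, eligible, and issued warps can be illustrated with a small Python model of one scheduler. This is only a sketch: the per-cycle `(active, eligible)` trace format and the function name `scheduler_stats` are assumptions, and real counters are collected by hardware rather than from such a trace.

```python
def scheduler_stats(cycles):
    """Average warp counts per active cycle for a single scheduler.

    cycles: list of (active_warps, eligible_warps) samples, one per cycle
    (hypothetical trace format).
    """
    active_cycles = [(a, e) for a, e in cycles if a > 0]
    n = len(active_cycles)
    if n == 0:
        return {"active": 0.0, "eligible": 0.0, "issued": 0.0}
    return {
        "active": sum(a for a, _ in active_cycles) / n,
        "eligible": sum(e for _, e in active_cycles) / n,
        # At most one warp is selected per cycle, and only if one is eligible.
        "issued": sum(1 for _, e in active_cycles if e > 0) / n,
    }


# 8 warps resident, but in half of the cycles every warp is stalled.
trace = [(8, 2), (8, 0), (8, 1), (8, 0)]
print(scheduler_stats(trace))
# {'active': 8.0, 'eligible': 0.75, 'issued': 0.5}
```

In this trace the scheduler issues in only half of its cycles even though 8 warps are active, which is why reducing stall time matters more than adding warps once occupancy is adequate.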

Requests mean different things in the L1 and L2 caches (compare sectors per request):

  • L1TEX: a request comes from one instruction, i.e. a memory operation issued by a warp. One instruction (request) can touch multiple sectors, depending on how well the memory accesses are coalesced.
  • L2: a request is made for the sectors that are required. A request (cache line: 128 B) contains four sectors (32 B each), and only some of those sectors may be used by L1.
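As a rough illustration of how coalescing changes the number of sectors per L1TEX request, the sketch below counts the distinct 32 B sectors touched by the 32 threads of one warp-level load. It is a simplified model that ignores caching and hardware replay details; the function name `sectors_per_request` is hypothetical.

```python
SECTOR_BYTES = 32  # sector size; four sectors make up a 128 B cache line


def sectors_per_request(addresses, access_bytes=4):
    """Count distinct 32 B sectors touched by one warp-level memory request.

    addresses: byte address accessed by each thread of the warp.
    access_bytes: size of each thread's access (4 for a float or int).
    """
    sectors = set()
    for addr in addresses:
        first = addr // SECTOR_BYTES
        last = (addr + access_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    return len(sectors)


coalesced = [4 * t for t in range(32)]    # 128 contiguous bytes
strided = [128 * t for t in range(32)]    # each thread in its own cache line
broadcast = [0] * 32                      # every thread reads the same word

print(sectors_per_request(coalesced))  # 4
print(sectors_per_request(strided))    # 32
print(sectors_per_request(broadcast))  # 1
```

A fully coalesced 4-byte load needs 4 sectors per request, while the strided pattern needs 32, so the sectors-per-request ratio is a quick check on access patterns.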