ncu_metrics.md

NCU Usage

ncu --metrics ${METRICS} (--set full) -f -o ${OUTPUT} ${COMMAND}
ncu --import ${OUTPUT} --print-units=base
ncu --import ${OUTPUT} --page=details --print-metric-name=name --print-details=all
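As a minimal sketch of how the placeholders above might be filled in: the helper `join_metrics`, the report name `profile_out`, and the application `./my_app` are hypothetical names chosen for illustration. The metric names are the ones listed in this note. The command is printed rather than executed so it can be reviewed first.

```shell
# Join all arguments with commas, the list format that --metrics expects.
# Runs in a subshell so the IFS change does not leak into the caller.
join_metrics() (
    IFS=,
    printf '%s\n' "$*"
)

METRICS=$(join_metrics \
    smsp__thread_inst_executed_per_inst_executed \
    l1tex__t_sector_hit_rate \
    lts__t_sector_hit_rate)
OUTPUT=profile_out   # hypothetical report name
COMMAND=./my_app     # hypothetical application to profile

# Print the profiling command instead of running it.
echo "ncu --metrics ${METRICS} -f -o ${OUTPUT} ${COMMAND}"
```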

Metrics:

Warp divergence:
- smsp__thread_inst_executed_per_inst_executed
- the ratio of active threads that are not predicated off over the maximum number of threads per warp for each executed instruction
L1 cache hit rate:
- l1tex__t_sector_hit_rate
- # of sector hits per sector
L2 cache hit rate:
- lts__t_sector_hit_rate
- proportion of L2 sector lookups that hit

Occupancy: Executing other warps when one warp is paused or stalled is the only way to hide latencies and keep the hardware busy. Higher occupancy does not always equate to higher performance; there is a point above which additional occupancy does not improve performance. However, low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation. See the CUDA C++ Best Practices Guide.

Causes of low theoretical occupancy: the per-SM limits on warps, blocks, registers, and shared memory, combined with the registers and shared memory each block uses
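The way these limits interact can be sketched with integer arithmetic: each resource caps the number of resident blocks per SM, and the tightest cap determines the theoretical occupancy. The per-SM limits below are illustrative assumptions, not the values for any particular GPU, and the sketch ignores register and shared-memory allocation granularity, so it can differ slightly from what ncu reports.

```shell
# Illustrative per-SM limits (assumptions; look up the real values for your GPU).
WARP_SIZE=32
MAX_WARPS_PER_SM=64
MAX_BLOCKS_PER_SM=32
REGS_PER_SM=65536
SMEM_PER_SM=98304   # bytes

# Print the smaller of two integers.
min() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }

# Usage: theoretical_occupancy_pct THREADS_PER_BLOCK REGS_PER_THREAD SMEM_PER_BLOCK_BYTES
# Runs in a subshell so its working variables do not leak into the caller.
theoretical_occupancy_pct() (
    threads=$1
    regs=$2
    smem=$3
    warps_per_block=$(( (threads + WARP_SIZE - 1) / WARP_SIZE ))

    # Start from the block limit, then apply each resource cap in turn.
    blocks=$MAX_BLOCKS_PER_SM
    blocks=$(min "$blocks" $(( MAX_WARPS_PER_SM / warps_per_block )))
    if [ "$regs" -gt 0 ]; then
        regs_per_block=$(( regs * warps_per_block * WARP_SIZE ))
        blocks=$(min "$blocks" $(( REGS_PER_SM / regs_per_block )))
    fi
    if [ "$smem" -gt 0 ]; then
        blocks=$(min "$blocks" $(( SMEM_PER_SM / smem )))
    fi

    # Resident warps as a percentage of the per-SM warp limit.
    echo $(( blocks * warps_per_block * 100 / MAX_WARPS_PER_SM ))
)

theoretical_occupancy_pct 256 64 16384   # register-limited: prints 50
theoretical_occupancy_pct 256 32 0       # no resource is limiting: prints 100
```

In the first call, registers allow only 4 blocks of 8 warps each, so 32 of the 64 warp slots are filled.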

Causes of low achieved occupancy: workload imbalance within or between blocks, small kernel launch dimensions

Achieved occupancy:
- sm__warps_active.avg.pct_of_peak_sustained_active
- cumulative number of warps in flight
Theoretical occupancy:
- sm__maximum_warps_per_active_cycle_pct

Eligible warps: An eligible warp is an active warp that is ready to issue its next instruction. From the set of eligible warps, the scheduler selects a single warp from which to issue one or more instructions (the issued warp). To increase the number of eligible warps, reduce the time the active warps are stalled by inspecting the top stall reasons.

Theoretical warps per scheduler:
- smsp__maximum_warps_avg_per_active_cycle
Active warps per scheduler:
- smsp__warps_active
Eligible warps per scheduler:
- smsp__warps_eligible
Warp stalls:
- See ncu
- Stackoverflow: Scoreboard
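To relate the scheduler metrics above, a rough sketch: an active warp that is not eligible is stalled, so the stalled share is one minus eligible over active. The helper name `stalled_pct` and the input values are hypothetical. In practice the inputs would be the per-cycle averages reported for smsp__warps_active and smsp__warps_eligible.

```shell
# Usage: stalled_pct ACTIVE_WARPS ELIGIBLE_WARPS
# Prints the percentage of active warps that are not eligible (i.e. stalled).
stalled_pct() {
    awk -v active="$1" -v eligible="$2" \
        'BEGIN { printf "%.1f\n", (active - eligible) / active * 100 }'
}

# Hypothetical reading: 8 active warps per scheduler, 2 of them eligible.
stalled_pct 8 2   # prints 75.0
```

A high value here is a cue to look at the top stall reasons before trying to raise occupancy further.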
