ncu --list-sets # The configuration for sets. A set defines a set of sections.
ncu --list-sections # The configuration for sections. A section defines a set of metrics.
ncu --query-metrics # All individual metrics.
ncu --query-metrics-mode suffix --metrics <metrics list> # Check various suffixes for a base metric name.
ncu --metrics ${METRICS} [--set full] -f -o ${OUTPUT} ${COMMAND} # Profile ${COMMAND} and save a report; --set full is optional.
ncu --import ${OUTPUT} --print-units=base
ncu --import ${OUTPUT} --page=details --print-metric-name=name --print-details=all
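The profile-then-import workflow above can also be scripted. The following is a minimal Python sketch, assuming `ncu` is on the PATH. The helper name `build_ncu_command`, the report name `report`, and the binary `./my_app` are hypothetical; only flags that already appear in these notes are used.

```python
import shlex


def build_ncu_command(metrics, output, command, full_set=False):
    """Assemble an `ncu` profiling command as an argument list.

    Uses only the flags from the notes above:
    --metrics, --set full, -f (overwrite) and -o (report file).
    """
    args = ["ncu", "--metrics", ",".join(metrics)]
    if full_set:
        args += ["--set", "full"]
    args += ["-f", "-o", output]
    # The profiled application and its own arguments go last.
    args += shlex.split(command)
    return args


if __name__ == "__main__":
    cmd = build_ncu_command(
        ["l1tex__t_sector_hit_rate", "lts__t_sector_hit_rate"],
        "report",          # hypothetical report name
        "./my_app --size 1024",  # hypothetical application
    )
    # Print the command line; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a string avoids shell quoting problems when the list is handed to `subprocess.run`.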
Metrics:
- Warp divergence:
smsp__thread_inst_executed_per_inst_executed
- the ratio of active threads that are not predicated off over the maximum number of threads per warp for each executed instruction
- L1 cache hit rate:
l1tex__t_sector_hit_rate
- # of sector hits per sector
- L2 cache hit rate:
lts__t_sector_hit_rate
- proportion of L2 sector lookups that hit
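As a rough illustration of how the ratios above are formed from raw counts, here is a small Python sketch. The function names and counter values are hypothetical (they are not ncu output), and a warp size of 32 threads is assumed.

```python
WARP_SIZE = 32  # assumption: 32 threads per warp


def avg_threads_per_instruction(thread_inst_executed, inst_executed):
    """Average number of active threads per executed warp instruction."""
    return thread_inst_executed / inst_executed


def warp_efficiency_pct(thread_inst_executed, inst_executed):
    """The same ratio expressed as a percentage of a full warp."""
    avg = avg_threads_per_instruction(thread_inst_executed, inst_executed)
    return 100.0 * avg / WARP_SIZE


def hit_rate_pct(sector_hits, sector_lookups):
    """Share of sector lookups that hit in the cache, in percent."""
    return 100.0 * sector_hits / sector_lookups


if __name__ == "__main__":
    # Hypothetical counts: 1000 warp instructions, 24000 thread instructions.
    print(avg_threads_per_instruction(24000, 1000))
    print(warp_efficiency_pct(24000, 1000))
    # Hypothetical counts: 300 hits out of 400 sector lookups.
    print(hit_rate_pct(300, 400))
```

A value well below 32 threads per instruction suggests divergence or heavy predication in the kernel.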
Occupancy: Executing other warps when one warp is paused or stalled is the only way to hide latencies and keep the hardware busy. Higher occupancy does not always mean higher performance; there is a point above which additional occupancy does not improve performance. However, low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation. See the CUDA C++ Best Practices Guide.
Causes of low theoretical occupancy: the per-SM limits on warps, blocks, registers, and shared memory, and the per-block limits on registers and shared memory.
Causes of low achieved occupancy: workload imbalance within or between blocks, and a launch that is too small to fill the GPU.
- Achieved occupancy:
sm__warps_active.avg.pct_of_peak_sustained_active
- cumulative # of warps in flight
- Theoretical occupancy:
sm__maximum_warps_per_active_cycle_pct
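To see how the per-SM limits bound theoretical occupancy, the following Python sketch computes the number of resident blocks allowed by each resource and reports the tightest one. The `SmLimits` defaults are hypothetical placeholders (real values depend on the GPU architecture), and hardware allocation granularity for registers and shared memory is ignored, so the result is only an approximation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmLimits:
    # Hypothetical per-SM limits; look up the real ones for your GPU.
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 65536
    warp_size: int = 32


def theoretical_occupancy(threads_per_block, regs_per_thread,
                          smem_per_block, sm=SmLimits()):
    """Return (occupancy in [0, 1], name of the limiting resource)."""
    # Round the block size up to whole warps.
    warps_per_block = -(-threads_per_block // sm.warp_size)
    regs_per_block = regs_per_thread * warps_per_block * sm.warp_size

    # Maximum resident blocks per SM allowed by each resource.
    limits = {
        "warps": sm.max_warps // warps_per_block,
        "blocks": sm.max_blocks,
        "registers": (sm.registers // regs_per_block
                      if regs_per_block else sm.max_blocks),
        "shared_mem": (sm.shared_mem_bytes // smem_per_block
                       if smem_per_block else sm.max_blocks),
    }
    limiter = min(limits, key=limits.get)
    active_warps = limits[limiter] * warps_per_block
    return active_warps / sm.max_warps, limiter


if __name__ == "__main__":
    # 256 threads/block, 64 registers/thread, 8 KiB shared memory per block.
    print(theoretical_occupancy(256, 64, 8192))
```

With these assumed limits, registers allow only 4 resident blocks of 8 warps each, so the sketch reports 50% occupancy limited by registers.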
Eligible Warp: From the set of eligible warps, the scheduler selects a single warp from which to issue one or more instructions (the Issued Warp). To increase the number of eligible warps, inspect the top stall reasons and reduce the time the active warps spend stalled.
- Theoretical warps per scheduler:
smsp__maximum_warps_avg_per_active_cycle
- Active warps per scheduler:
smsp__warps_active
- Eligible warps per scheduler:
smsp__warps_eligible
Warp Stall
- See ncu
- Stackoverflow: Scoreboard
Requests are counted differently in the L1 and L2 caches (sectors per request):
- L1TEX: a request comes from one memory instruction executed by a warp. One instruction (request) can touch multiple sectors, depending on how well the memory accesses are coalesced.
- L2: a request is issued for the sectors that the threads actually need. A request covers one cache line (128 B), which contains four sectors (32 B each), and only some of those sectors may be needed by L1.
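The sector-per-request idea can be illustrated by counting which 32 B sectors and 128 B cache lines one warp-wide load touches. This is a simplified Python sketch: the function name and the address patterns are hypothetical, and it models only address arithmetic, not the actual cache behavior.

```python
SECTOR_BYTES = 32       # sector size from the notes above
CACHE_LINE_BYTES = 128  # cache line size from the notes above


def count_sectors_and_lines(addresses, access_bytes=4):
    """Count distinct sectors and cache lines touched by one warp request.

    `addresses` holds the byte address accessed by each thread of the warp.
    """
    sectors = set()
    for addr in addresses:
        first = addr // SECTOR_BYTES
        last = (addr + access_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    lines = {s * SECTOR_BYTES // CACHE_LINE_BYTES for s in sectors}
    return len(sectors), len(lines)


if __name__ == "__main__":
    # Coalesced: 32 threads read consecutive 4-byte values.
    coalesced = [4 * t for t in range(32)]
    # Strided: each thread reads a 4-byte value 128 bytes apart.
    strided = [128 * t for t in range(32)]
    print(count_sectors_and_lines(coalesced))
    print(count_sectors_and_lines(strided))
```

The coalesced pattern touches 4 sectors in 1 cache line, while the strided pattern touches 32 sectors in 32 cache lines for the same single request.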