@mrprajesh
Last active January 9, 2024 02:59
Learn nvprof - Profiling CUDA Programs

nvprof - NVIDIA Profiler

nvprof is NVIDIA's command-line profiler. It can be run on any executable, including CUDA programs.

How to use it?

nvprof ./executable

If nvprof is not on your PATH, use its absolute path: /usr/local/cuda/bin/nvprof or /usr/local/cuda-<MAJOR.minor>/bin/nvprof, where MAJOR.minor is your installed CUDA version.
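The lookup described above can be sketched as a small shell function. This is only an illustration: the function name find_nvprof and its optional ROOT argument are hypothetical, and it assumes CUDA is installed under /usr/local as in the paths given here.

```shell
# find_nvprof [ROOT]   (hypothetical helper, not part of CUDA)
# Print the path of an nvprof binary: the one on PATH if present, otherwise
# the first executable match under ROOT (default: /usr/local).
find_nvprof() {
    if command -v nvprof >/dev/null 2>&1; then
        command -v nvprof
        return 0
    fi
    nvprof_root=${1:-/usr/local}
    # Try the unversioned install first, then any versioned cuda-<MAJOR.minor> one.
    for nvprof_candidate in "$nvprof_root"/cuda/bin/nvprof "$nvprof_root"/cuda-*/bin/nvprof; do
        if [ -x "$nvprof_candidate" ]; then
            printf '%s\n' "$nvprof_candidate"
            return 0
        fi
    done
    return 1
}
```

With the function defined, "$(find_nvprof)" ./executable runs whichever nvprof was found.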

What you can see

  1. Number of times each kernel is invoked
  2. Kernel execution time
  3. Time taken by device-to-host (DtoH) and host-to-device (HtoD) memory copies
  4. ..

How to get all/certain metrics from nvprof

  1. How to list all the metrics available for the device? It is a long list; see the end of this document.

nvprof --query-metrics

  2. How to collect all metrics?

nvprof --metrics all ./executable

  3. How to collect a specific metric, say DRAM reads?

nvprof --metrics dram_read_transactions ./executable
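The --metrics option also accepts a comma-separated list, which is usually much faster than collecting all metrics. A minimal sketch of a wrapper follows; the function name profile_metrics and the NVPROF variable (used to point at a specific nvprof binary) are assumptions made for this example, not part of nvprof.

```shell
# profile_metrics METRIC[,METRIC...] PROGRAM [ARGS...]   (hypothetical helper)
# Run PROGRAM under nvprof, collecting only the listed metrics.
# NVPROF is an assumed override for the binary to use, e.g.
# NVPROF=/usr/local/cuda/bin/nvprof; it defaults to nvprof on PATH.
profile_metrics() {
    metrics=$1
    shift
    "${NVPROF:-nvprof}" --metrics "$metrics" "$@"
}
```

For example, profile_metrics dram_read_transactions,dram_write_transactions ./executable collects both DRAM counters in one run.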

List of available metrics (name: description)

  • inst_per_warp: Average number of instructions executed by each warp
  • branch_efficiency: Ratio of non-divergent branches to total branches
  • warp_execution_efficiency: Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor
  • warp_nonpred_execution_efficiency: Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor
  • inst_replay_overhead: Average number of replays for each instruction executed
  • shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
  • shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
  • local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
  • local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
  • gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load.
  • gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store
  • shared_store_transactions: Number of shared memory store transactions
  • shared_load_transactions: Number of shared memory load transactions
  • local_load_transactions: Number of local memory load transactions
  • local_store_transactions: Number of local memory store transactions
  • gld_transactions: Number of global memory load transactions
  • gst_transactions: Number of global memory store transactions
  • sysmem_read_transactions: Number of system memory read transactions
  • sysmem_write_transactions: Number of system memory write transactions
  • l2_read_transactions: Memory read transactions seen at L2 cache for all read requests
  • l2_write_transactions: Memory write transactions seen at L2 cache for all write requests
  • dram_read_transactions: Device memory read transactions
  • dram_write_transactions: Device memory write transactions
  • global_hit_rate: Hit rate for global loads in unified l1/tex cache
  • local_hit_rate: Hit rate for local loads and stores
  • gld_requested_throughput: Requested global memory load throughput
  • gst_requested_throughput: Requested global memory store throughput
  • gld_throughput: Global memory load throughput
  • gst_throughput: Global memory store throughput
  • local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
  • tex_cache_hit_rate: Unified cache hit rate
  • l2_tex_read_hit_rate: Hit rate at L2 cache for all read requests from texture cache
  • l2_tex_write_hit_rate: Hit Rate at L2 cache for all write requests from texture cache
  • dram_read_throughput: Device memory read throughput
  • dram_write_throughput: Device memory write throughput
  • tex_cache_throughput: Unified cache throughput
  • l2_tex_read_throughput: Memory read throughput seen at L2 cache for read requests from the texture cache
  • l2_tex_write_throughput: Memory write throughput seen at L2 cache for write requests from the texture cache
  • l2_read_throughput: Memory read throughput seen at L2 cache for all read requests
  • l2_write_throughput: Memory write throughput seen at L2 cache for all write requests
  • sysmem_read_throughput: System memory read throughput
  • sysmem_write_throughput: System memory write throughput
  • local_load_throughput: Local memory load throughput
  • local_store_throughput: Local memory store throughput
  • shared_load_throughput: Shared memory load throughput
  • shared_store_throughput: Shared memory store throughput
  • gld_efficiency: Ratio of requested global memory load throughput to required global memory load throughput.
  • gst_efficiency: Ratio of requested global memory store throughput to required global memory store throughput.
  • tex_cache_transactions: Unified cache read transactions
  • flop_count_dp: Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
  • flop_count_dp_add: Number of double-precision floating-point add operations executed by non-predicated threads.
  • flop_count_dp_fma: Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
  • flop_count_dp_mul: Number of double-precision floating-point multiply operations executed by non-predicated threads.
  • flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.
  • flop_count_sp_add: Number of single-precision floating-point add operations executed by non-predicated threads.
  • flop_count_sp_fma: Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
  • flop_count_sp_mul: Number of single-precision floating-point multiply operations executed by non-predicated threads.
  • flop_count_sp_special: Number of single-precision floating-point special operations executed by non-predicated threads.
  • inst_executed: The number of instructions executed
  • inst_issued: The number of instructions issued
  • dram_utilization: The utilization level of the device memory relative to the peak utilization on a scale of 0 to 10
  • sysmem_utilization: The utilization level of the system memory relative to the peak utilization
  • stall_inst_fetch: Percentage of stalls occurring because the next assembly instruction has not yet been fetched
  • stall_exec_dependency: Percentage of stalls occurring because an input required by the instruction is not yet available
  • stall_memory_dependency: Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding
  • stall_texture: Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
  • stall_sync: Percentage of stalls occurring because the warp is blocked at a __syncthreads() call
  • stall_other: Percentage of stalls occurring due to miscellaneous reasons
  • stall_constant_memory_dependency: Percentage of stalls occurring because of immediate constant cache miss
  • stall_pipe_busy: Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy
  • shared_efficiency: Ratio of requested shared memory throughput to required shared memory throughput
  • inst_fp_32: Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
  • inst_fp_64: Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
  • inst_integer: Number of integer instructions executed by non-predicated threads
  • inst_bit_convert: Number of bit-conversion instructions executed by non-predicated threads
  • inst_control: Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
  • inst_compute_ld_st: Number of compute load/store instructions executed by non-predicated threads
  • inst_misc: Number of miscellaneous instructions executed by non-predicated threads
  • inst_inter_thread_communication: Number of inter-thread communication instructions executed by non-predicated threads
  • issue_slots: The number of issue slots used
  • cf_issued: Number of issued control-flow instructions
  • cf_executed: Number of executed control-flow instructions
  • ldst_issued: Number of issued local, global, shared and texture memory load and store instructions
  • ldst_executed: Number of executed local, global, shared and texture memory load and store instructions
  • atomic_transactions: Global memory atomic and reduction transactions
  • atomic_transactions_per_request: Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction
  • l2_atomic_throughput: Memory read throughput seen at L2 cache for atomic and reduction requests
  • l2_atomic_transactions: Memory read transactions seen at L2 cache for atomic and reduction requests
  • l2_tex_read_transactions: Memory read transactions seen at L2 cache for read requests from the texture cache
  • stall_memory_throttle: Percentage of stalls occurring because of memory throttle
  • stall_not_selected: Percentage of stalls occurring because warp was not selected
  • l2_tex_write_transactions: Memory write transactions seen at L2 cache for write requests from the texture cache
  • flop_count_hp: Number of half-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
  • flop_count_hp_add: Number of half-precision floating-point add operations executed by non-predicated threads.
  • flop_count_hp_mul: Number of half-precision floating-point multiply operations executed by non-predicated threads.
  • flop_count_hp_fma: Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
  • inst_fp_16: Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
  • ipc: Instructions executed per cycle
  • issued_ipc: Instructions issued per cycle
  • issue_slot_utilization: Percentage of issue slots that issued at least one instruction, averaged across all cycles
  • sm_efficiency: The percentage of time at least one warp is active on a specific multiprocessor
  • achieved_occupancy: Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
  • eligible_warps_per_cycle: Average number of warps that are eligible to issue per active cycle
  • shared_utilization: The utilization level of the shared memory relative to peak utilization
  • l2_utilization: The utilization level of the L2 cache relative to the peak utilization on a scale of 0 to 10
  • tex_utilization: The utilization level of the unified cache relative to the peak utilization
  • ldst_fu_utilization: The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions
  • cf_fu_utilization: The utilization level of the multiprocessor function units that execute control-flow instructions on a scale of 0 to 10
  • tex_fu_utilization: The utilization level of the multiprocessor function units that execute global, local and texture memory instructions on a scale of 0 to 10
  • special_fu_utilization: The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions on a scale of 0 to 10
  • half_precision_fu_utilization: The utilization level of the multiprocessor function units that execute 16 bit floating-point instructions and integer instructions on a scale of 0 to 10
  • single_precision_fu_utilization: The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions
  • double_precision_fu_utilization: The utilization level of the multiprocessor function units that execute double-precision floating-point instructions
  • flop_hp_efficiency: Ratio of achieved to peak half-precision floating-point operations
  • flop_sp_efficiency: Ratio of achieved to peak single-precision floating-point operations
  • flop_dp_efficiency: Ratio of achieved to peak double-precision floating-point operations
  • sysmem_read_utilization: The read utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
  • sysmem_write_utilization: The write utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
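
Going by the descriptions in this list, flop_count_sp counts each multiply-accumulate as 2 operations while flop_count_sp_fma counts it as 1, so the total should equal add + mul + 2 * fma (and likewise for the dp and hp counters). A minimal sketch of that arithmetic, with a hypothetical function name and made-up counter values in the usage note:

```shell
# total_flops ADD MUL FMA   (hypothetical helper)
# Combine the separate add, mul and fma counters into a total operation
# count, weighting each fused multiply-accumulate as 2 operations.
total_flops() {
    echo $(( $1 + $2 + 2 * $3 ))
}
```

For instance, total_flops 10 20 30 prints 90. Comparing this against the reported flop_count_sp is a quick sanity check on a profile.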