nvprof is NVIDIA's command-line profiler. It can profile any executable, including CUDA programs.
nvprof ./executable
If you need the absolute path, use `/usr/local/cuda/bin/nvprof` or `/usr/local/cuda-<MAJOR.minor>/bin/nvprof`, where `MAJOR.minor` is the installed CUDA version.
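
The lookup described above can be scripted. The following is a minimal sketch, assuming a POSIX system and the install locations named above; `find_nvprof` and its `cuda_root` parameter are hypothetical names introduced here for illustration.

```python
import glob
import os
import shutil
from typing import Optional


def find_nvprof(cuda_root: str = "/usr/local/cuda") -> Optional[str]:
    """Return a path to nvprof, or None if it cannot be found.

    Hypothetical helper: checks PATH first, then <cuda_root>/bin/nvprof,
    then any versioned sibling such as <cuda_root>-<MAJOR.minor>/bin/nvprof.
    """
    on_path = shutil.which("nvprof")
    if on_path:
        return on_path
    candidates = [os.path.join(cuda_root, "bin", "nvprof")]
    # Versioned installs; sorted lexicographically, which is not strict version order.
    candidates += sorted(glob.glob(cuda_root + "-*/bin/nvprof"))
    for candidate in candidates:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


path = find_nvprof()
print(path if path else "nvprof not found")
```

Checking `PATH` first means an already configured environment wins over the default install directories.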
The output includes:
- Number of times each kernel is invoked
- Kernel execution time
- Time spent on device-to-host (DtoH) and host-to-device (HtoD) memory copies
- ...
- How do you list all the metrics available for the device? The list is long; see the end of this file.
nvprof --query-metrics
- How do you collect all metrics?
nvprof --metrics all ./executable
- How do you collect a specific metric, say DRAM reads?
nvprof --metrics dram_read_transactions ./executable
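
Several metrics can be requested in one run by passing a comma-separated list to `--metrics`. The sketch below only builds the command line; `nvprof_metrics_command` is a hypothetical helper name, and actually running the command assumes nvprof and a CUDA-capable GPU are available.

```python
from typing import List, Sequence


def nvprof_metrics_command(
    executable: str, metrics: Sequence[str], nvprof: str = "nvprof"
) -> List[str]:
    """Build an nvprof argument list that collects the given metrics.

    Hypothetical helper: metric names are joined with commas, the form
    nvprof accepts for --metrics.
    """
    if not metrics:
        raise ValueError("at least one metric name is required")
    return [nvprof, "--metrics", ",".join(metrics), executable]


cmd = nvprof_metrics_command(
    "./executable", ["dram_read_transactions", "dram_write_transactions"]
)
print(" ".join(cmd))
# prints: nvprof --metrics dram_read_transactions,dram_write_transactions ./executable

# To run it for real (needs nvprof and a GPU):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Returning a list rather than a single string lets the result be passed straight to `subprocess.run` without shell quoting.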
- Available metrics, listed as name: description
- inst_per_warp: Average number of instructions executed by each warp
- branch_efficiency: Ratio of non-divergent branches to total branches
- warp_execution_efficiency: Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor
- warp_nonpred_execution_efficiency: Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor
- inst_replay_overhead: Average number of replays for each instruction executed
- shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
- shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
- local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
- local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
- gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load.
- gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store
- shared_store_transactions: Number of shared memory store transactions
- shared_load_transactions: Number of shared memory load transactions
- local_load_transactions: Number of local memory load transactions
- local_store_transactions: Number of local memory store transactions
- gld_transactions: Number of global memory load transactions
- gst_transactions: Number of global memory store transactions
- sysmem_read_transactions: Number of system memory read transactions
- sysmem_write_transactions: Number of system memory write transactions
- l2_read_transactions: Memory read transactions seen at L2 cache for all read requests
- l2_write_transactions: Memory write transactions seen at L2 cache for all write requests
- dram_read_transactions: Device memory read transactions
- dram_write_transactions: Device memory write transactions
- global_hit_rate: Hit rate for global loads in unified L1/tex cache
- local_hit_rate: Hit rate for local loads and stores
- gld_requested_throughput: Requested global memory load throughput
- gst_requested_throughput: Requested global memory store throughput
- gld_throughput: Global memory load throughput
- gst_throughput: Global memory store throughput
- local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
- tex_cache_hit_rate: Unified cache hit rate
- l2_tex_read_hit_rate: Hit rate at L2 cache for all read requests from texture cache
- l2_tex_write_hit_rate: Hit rate at L2 cache for all write requests from texture cache
- dram_read_throughput: Device memory read throughput
- dram_write_throughput: Device memory write throughput
- tex_cache_throughput: Unified cache throughput
- l2_tex_read_throughput: Memory read throughput seen at L2 cache for read requests from the texture cache
- l2_tex_write_throughput: Memory write throughput seen at L2 cache for write requests from the texture cache
- l2_read_throughput: Memory read throughput seen at L2 cache for all read requests
- l2_write_throughput: Memory write throughput seen at L2 cache for all write requests
- sysmem_read_throughput: System memory read throughput
- sysmem_write_throughput: System memory write throughput
- local_load_throughput: Local memory load throughput
- local_store_throughput: Local memory store throughput
- shared_load_throughput: Shared memory load throughput
- shared_store_throughput: Shared memory store throughput
- gld_efficiency: Ratio of requested global memory load throughput to required global memory load throughput.
- gst_efficiency: Ratio of requested global memory store throughput to required global memory store throughput.
- tex_cache_transactions: Unified cache read transactions
- flop_count_dp: Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
- flop_count_dp_add: Number of double-precision floating-point add operations executed by non-predicated threads.
- flop_count_dp_fma: Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
- flop_count_dp_mul: Number of double-precision floating-point multiply operations executed by non-predicated threads.
- flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.
- flop_count_sp_add: Number of single-precision floating-point add operations executed by non-predicated threads.
- flop_count_sp_fma: Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
- flop_count_sp_mul: Number of single-precision floating-point multiply operations executed by non-predicated threads.
- flop_count_sp_special: Number of single-precision floating-point special operations executed by non-predicated threads.
- inst_executed: The number of instructions executed
- inst_issued: The number of instructions issued
- dram_utilization: The utilization level of the device memory relative to the peak utilization on a scale of 0 to 10
- sysmem_utilization: The utilization level of the system memory relative to the peak utilization
- stall_inst_fetch: Percentage of stalls occurring because the next assembly instruction has not yet been fetched
- stall_exec_dependency: Percentage of stalls occurring because an input required by the instruction is not yet available
- stall_memory_dependency: Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding
- stall_texture: Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
- stall_sync: Percentage of stalls occurring because the warp is blocked at a `__syncthreads()` call
- stall_other: Percentage of stalls occurring due to miscellaneous reasons
- stall_constant_memory_dependency: Percentage of stalls occurring because of immediate constant cache miss
- stall_pipe_busy: Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy
- shared_efficiency: Ratio of requested shared memory throughput to required shared memory throughput
- inst_fp_32: Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
- inst_fp_64: Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
- inst_integer: Number of integer instructions executed by non-predicated threads
- inst_bit_convert: Number of bit-conversion instructions executed by non-predicated threads
- inst_control: Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
- inst_compute_ld_st: Number of compute load/store instructions executed by non-predicated threads
- inst_misc: Number of miscellaneous instructions executed by non-predicated threads
- inst_inter_thread_communication: Number of inter-thread communication instructions executed by non-predicated threads
- issue_slots: The number of issue slots used
- cf_issued: Number of issued control-flow instructions
- cf_executed: Number of executed control-flow instructions
- ldst_issued: Number of issued local, global, shared and texture memory load and store instructions
- ldst_executed: Number of executed local, global, shared and texture memory load and store instructions
- atomic_transactions: Global memory atomic and reduction transactions
- atomic_transactions_per_request: Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction
- l2_atomic_throughput: Memory read throughput seen at L2 cache for atomic and reduction requests
- l2_atomic_transactions: Memory read transactions seen at L2 cache for atomic and reduction requests
- l2_tex_read_transactions: Memory read transactions seen at L2 cache for read requests from the texture cache
- stall_memory_throttle: Percentage of stalls occurring because of memory throttle
- stall_not_selected: Percentage of stalls occurring because warp was not selected
- l2_tex_write_transactions: Memory write transactions seen at L2 cache for write requests from the texture cache
- flop_count_hp: Number of half-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
- flop_count_hp_add: Number of half-precision floating-point add operations executed by non-predicated threads.
- flop_count_hp_mul: Number of half-precision floating-point multiply operations executed by non-predicated threads.
- flop_count_hp_fma: Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
- inst_fp_16: Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
- ipc: Instructions executed per cycle
- issued_ipc: Instructions issued per cycle
- issue_slot_utilization: Percentage of issue slots that issued at least one instruction, averaged across all cycles
- sm_efficiency: The percentage of time at least one warp is active on a specific multiprocessor
- achieved_occupancy: Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
- eligible_warps_per_cycle: Average number of warps that are eligible to issue per active cycle
- shared_utilization: The utilization level of the shared memory relative to peak utilization
- l2_utilization: The utilization level of the L2 cache relative to the peak utilization on a scale of 0 to 10
- tex_utilization: The utilization level of the unified cache relative to the peak utilization
- ldst_fu_utilization: The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions
- cf_fu_utilization: The utilization level of the multiprocessor function units that execute control-flow instructions on a scale of 0 to 10
- tex_fu_utilization: The utilization level of the multiprocessor function units that execute global, local and texture memory instructions on a scale of 0 to 10
- special_fu_utilization: The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions on a scale of 0 to 10
- half_precision_fu_utilization: The utilization level of the multiprocessor function units that execute 16 bit floating-point instructions and integer instructions on a scale of 0 to 10
- single_precision_fu_utilization: The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions
- double_precision_fu_utilization: The utilization level of the multiprocessor function units that execute double-precision floating-point instructions
- flop_hp_efficiency: Ratio of achieved to peak half-precision floating-point operations
- flop_sp_efficiency: Ratio of achieved to peak single-precision floating-point operations
- flop_dp_efficiency: Ratio of achieved to peak double-precision floating-point operations
- sysmem_read_utilization: The read utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
- sysmem_write_utilization: The write utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
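
The `flop_count_*` descriptions above imply a simple relationship: the total counts each multiply-accumulate as 2 operations, while the `_fma` metric counts it as 1. The sketch below works through that arithmetic. The counts and the kernel time are made-up numbers, and `total_flops` and `achieved_gflops` are hypothetical helper names, not part of nvprof.

```python
def total_flops(add: int, mul: int, fma: int) -> int:
    """Combine per-type counts as described for flop_count_sp/dp/hp.

    Each multiply-accumulate contributes 1 to flop_count_*_fma
    but 2 to the total flop_count_*.
    """
    return add + mul + 2 * fma


def achieved_gflops(flops: int, kernel_time_s: float) -> float:
    """Floating-point operations per second, in GFLOP/s."""
    if kernel_time_s <= 0:
        raise ValueError("kernel time must be positive")
    return flops / kernel_time_s / 1e9


# Made-up values standing in for flop_count_sp_add, _mul and _fma.
sp_add, sp_mul, sp_fma = 1_000, 500, 2_000
flops = total_flops(sp_add, sp_mul, sp_fma)
print(flops)  # prints 5500 (1000 + 500 + 2 * 2000)

# Made-up kernel execution time of 1 microsecond.
print(achieved_gflops(flops, 1e-6))
```

Dividing the achieved rate by the device's peak rate gives the kind of ratio that `flop_sp_efficiency` describes; the peak value depends on the GPU and is not assumed here.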