@alexcrichton
Created January 17, 2025 19:05
[~/code/wasmtime[pulley-less-instruction-loads]] $ /opt/intel/oneapi/vtune/latest/bin64/vtune -collect uarch-exploration ./target/x86_64-unknown-linux-gnu/release/wasmtime run -C cache=n --target pulley64 --invoke run --preload env=time.wasm ../wasmi-benchmarks/benches/res/wasm/coremark-minimal.wasm
vtune: Warning: To profile kernel modules during the session, make sure they are available in the /lib/modules/kernel_version/ location.
vtune: Collection started. To stop the collection, either press CTRL-C or enter from another console window: vtune -r /home/alex/code/wasmtime/r001ue -command stop.
warning: using `--invoke` with a function that returns values is experimental and may break in the future
1557.9965
vtune: Collection stopped.
vtune: Using result path `/home/alex/code/wasmtime/r001ue'
vtune: Executing actions 19 % Resolving information for `libc.so.6'
vtune: Warning: Cannot locate file `vmlinux'.
vtune: Executing actions 20 % Resolving information for `wasmtime'
vtune: Warning: Cannot locate debugging information for the Linux kernel. Source-level analysis will not be possible. Function-level analysis will be limited to kernel symbol tables. See the Enabling Linux Kernel Analysis topic in the product online help for instructions.
vtune: Warning: Cannot locate debugging information for file `/home/alex/code/wasmtime/target/x86_64-unknown-linux-gnu/release/wasmtime'.
vtune: Executing actions 75 % Generating a report Elapsed Time: 19.991s
Clockticks: 107,513,600,000
Performance-core (P-core): 107,459,200,000
Efficient-core (E-core): 54,400,000
Instructions Retired: 286,480,000,000
Performance-core (P-core): 286,432,000,000
Efficient-core (E-core): 48,000,000
CPI Rate: 0.375
Performance-core (P-core): 0.375
Efficient-core (E-core): 1.133
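The CPI figures above follow directly from the two counters reported before them: CPI is clockticks divided by instructions retired. A quick arithmetic check, using the per-core-type values copied verbatim from this report:

```python
# Verify the CPI (cycles per instruction) values in the VTune summary.
# Counter values are copied from the report above.
p_core_clockticks = 107_459_200_000
p_core_instructions = 286_432_000_000
e_core_clockticks = 54_400_000
e_core_instructions = 48_000_000

def cpi(clockticks: int, instructions: int) -> float:
    """CPI is retired clockticks divided by instructions retired."""
    return clockticks / instructions

print(round(cpi(p_core_clockticks, p_core_instructions), 3))  # 0.375
print(round(cpi(e_core_clockticks, e_core_instructions), 3))  # 1.133
```

A P-core CPI of 0.375 corresponds to roughly 2.7 instructions retired per cycle, consistent with the interpreter loop keeping the retirement pipeline busy.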
Performance-core (P-core)
Retiring: 100.0% of Pipeline Slots
| A high fraction of pipeline slots was utilized by useful work. While
| the goal is to make this metric value as high as possible, a high
| Retiring value for non-vectorized code could prompt you to consider
| code vectorization. Vectorization enables more computation to be done
| without significantly increasing the number of instructions, thus
| improving performance. Note that this metric value may be highlighted
| due to a Microcode Sequencer (MS) issue, in which case performance
| can be improved by avoiding use of the MS.
|
Light Operations: 100.0% of Pipeline Slots
| The CPU retired light-weight operations (those requiring no more
| than one uop) in a significant fraction of cycles. This correlates
| with the total number of instructions used by the program. The
| optimum uops-per-instruction ratio is 1. While this is the most
| desirable case, high values can still present opportunities for
| performance optimization.
|
FP Arithmetic: 0.0% of uOps
FP x87: 0.0% of uOps
FP Scalar: 0.0% of uOps
FP Vector: 0.0% of uOps
128-bit FP Vector: 0.0% of uOps
256-bit FP Vector: 0.0% of uOps
Integer Operations: 0.0% of uOps
128-bit Integer Vector Operations: 0.0% of uOps
256-bit Vector Operations: 0.0% of uOps
Memory Operations: 93.2% of Pipeline Slots
| For a significant fraction of pipeline slots the CPU was
| retiring memory operations - uops for memory load or store
| accesses.
|
Fused Instructions: 0.7% of Pipeline Slots
Non Fused Branches: 15.0% of Pipeline Slots
| For a significant fraction of slots the CPU was retiring
| branch instructions that were not fused. Non-conditional
| branches like direct JMP or CALL would count here. Can be
| used to detect fusable conditional jumps.
|
Other: 100.0% of Pipeline Slots
| This metric represents a non-floating-point (FP) uop fraction
| the CPU has executed. If your application has no FP
| operations, this is likely to be the biggest fraction.
|
Nop Instructions: 0.0% of Pipeline Slots
Shuffles_256b: 0.0% of Pipeline Slots
Heavy Operations: 20.9% of Pipeline Slots
| CPU retired heavy-weight operations (instructions that required
| 2+ uops) in a significant fraction of cycles.
|
Few Uops Instructions: 20.9% of Pipeline Slots
| This metric represents the fraction of slots where the CPU was
| retiring instructions that are decoded into two or up to
| ([SNB+] four; [ADL+] five) uops. This highly correlates with
| the number of uops in such instructions.
|
Microcode Sequencer: 0.0% of Pipeline Slots
Assists: 0.0% of Pipeline Slots
Page Faults: 0.0% of Pipeline Slots
FP Assists: 0.0% of Pipeline Slots
AVX Assists: 0.0% of Pipeline Slots
CISC: 0.0% of Pipeline Slots
Front-End Bound: 25.1% of Pipeline Slots
| Issue: A significant portion of Pipeline Slots is remaining empty due
| to issues in the Front-End.
|
| Tips: Make sure the working set size of the code is not too large,
| that the code layout does not require too many memory accesses per
| cycle to get enough instructions to fill four pipeline slots, and
| check for microcode assists.
|
Front-End Latency: 4.8% of Pipeline Slots
ICache Misses: 0.1% of Clockticks
ITLB Overhead: 0.0% of Clockticks
Branch Resteers: 0.5% of Clockticks
Mispredicts Resteers
Clears Resteers
Unknown Branches: 0.0% of Clockticks
MS Switches: 0.0% of Clockticks
Length Changing Prefixes: 0.0% of Clockticks
DSB Switches: 1.2% of Clockticks
Front-End Bandwidth: 20.3% of Pipeline Slots
| This metric represents the fraction of slots during which the CPU
| was stalled due to front-end bandwidth issues, such as
| inefficiencies in the instruction decoders or code restrictions on
| caching in the DSB (decoded uOps cache). In such cases, the
| front-end typically delivers a non-optimal number of uOps to the
| back-end.
|
Front-End Bandwidth MITE: 0.1% of Pipeline Slots
Decoder-0 Alone: 0.0% of Pipeline Slots
Front-End Bandwidth DSB: 5.4% of Pipeline Slots
Front-End Bandwidth LSD: 0.4% of Pipeline Slots
(Info) DSB Coverage: 88.5%
(Info) LSD Coverage: 7.1%
(Info) DSB Misses: 100.0% of Pipeline Slots
Bad Speculation: 0.0% of Pipeline Slots
Branch Mispredict: 37.6% of Pipeline Slots
Machine Clears: 0.0% of Pipeline Slots
Back-End Bound: 100.0% of Pipeline Slots
| A significant portion of pipeline slots is remaining empty. When
| operations take too long in the back-end, they introduce bubbles in
| the pipeline that ultimately cause fewer pipeline slots containing
| useful work to be retired per cycle than the machine is capable of
| supporting. This opportunity cost results in slower execution.
| Long-latency operations like divides and memory operations can cause
| this, as can too many operations being directed to a single
| execution port (for example, more multiply operations arriving in
| the back-end per cycle than the execution unit can support).
|
Memory Bound: 73.0% of Pipeline Slots
| The metric value is high. This can indicate that a significant
| fraction of execution pipeline slots could be stalled due to demand
| memory loads and stores. Use Memory Access analysis to get the
| metric broken down by memory hierarchy, with memory bandwidth
| information and correlation with memory objects.
|
L1 Bound: 16.4% of Clockticks
DTLB Overhead: 0.1% of Clockticks
Load STLB Hit: 0.0% of Clockticks
Load STLB Miss: 0.1% of Clockticks
Loads Blocked by Store Forwarding: 0.0% of Clockticks
Lock Latency: 0.0% of Clockticks
Split Loads: 0.0% of Clockticks
4K Aliasing: 0.0% of Clockticks
FB Full: 0.0% of Clockticks
L2 Bound: 0.1% of Clockticks
L3 Bound: 0.0% of Clockticks
Contested Accesses: 0.0% of Clockticks
Data Sharing: 0.0% of Clockticks
L3 Latency: 0.0% of Clockticks
SQ Full: 0.0% of Clockticks
DRAM Bound: 0.0% of Clockticks
Memory Bandwidth: 0.0% of Clockticks
Memory Latency: 0.0% of Clockticks
Store Bound: 0.0% of Clockticks
Store Latency: 0.0% of Clockticks
False Sharing: 0.0% of Clockticks
Split Stores: 0.0%
Streaming Stores: 0.0% of Clockticks
DTLB Store Overhead: 0.0% of Clockticks
Store STLB Hit: 0.0% of Clockticks
Store STLB Miss: 0.0% of Clockticks
Core Bound: 100.0% of Pipeline Slots
| This metric represents how much of the bottleneck was due to Core
| (non-memory) issues. A shortage of hardware compute resources and
| dependencies between the software's instructions are both
| categorized under Core Bound. Hence it may indicate that the machine
| ran out of out-of-order (OOO) resources, that certain execution
| units are overloaded, or that dependencies in the program's data or
| instruction flow are limiting performance (e.g. FP-chained
| long-latency arithmetic operations).
|
Divider: 0.0% of Clockticks
Serializing Operations: 0.2% of Clockticks
Slow Pause: 0.0% of Clockticks
C01 Wait: 0.0% of Clockticks
C02 Wait: 0.0% of Clockticks
Memory Fence: 0.0% of Clockticks
Port Utilization: 20.1% of Clockticks
| Issue: A significant fraction of cycles was stalled due to
| Core non-divider-related issues.
|
| Tips: Use vectorization to reduce pressure on the execution
| ports as multiple elements are calculated with same uOp.
|
Cycles of 0 Ports Utilized: 0.0% of Clockticks
Mixing Vectors: 0.0% of Clockticks
Cycles of 1 Port Utilized: 20.1% of Clockticks
| This metric represents the fraction of cycles where the CPU
| executed a total of 1 uop per cycle on all execution ports
| (Logical Processor cycles since ICL, Physical Core cycles
| otherwise). This can be due to heavy data dependencies among
| software instructions, or to oversubscribing a particular
| hardware resource. In some cases with high 1_Port_Utilized
| and L1 Bound, this metric can point to an L1 data-cache
| latency bottleneck that does not necessarily manifest as
| complete execution starvation (due to the short L1 latency,
| e.g. walking a linked list) - looking at the assembly can be
| helpful. Note that this metric value may be highlighted due
| to an L1 Bound issue.
|
Cycles of 2 Ports Utilized: 4.4% of Clockticks
Cycles of 3+ Ports Utilized: 45.4% of Clockticks
| This metric represents the fraction of Core cycles where the
| CPU executed a total of 3 or more uops per cycle on all
| execution ports (Logical Processor cycles since ICL, Physical
| Core cycles otherwise).
|
ALU Operation Utilization: 28.6% of Clockticks
Port 0: 32.9% of Clockticks
Port 1: 34.0% of Clockticks
Port 6: 34.4% of Clockticks
Load Operation Utilization: 23.7% of Clockticks
Store Operation Utilization: 14.3% of Clockticks
Efficient-core (E-core)
Retiring: 0.0% of Pipeline Slots
General Retirement: 0.0% of Pipeline Slots
FP Arithmetic: 0.0% of Pipeline Slots
Other: 0.0% of Pipeline Slots
Microcode Sequencer: 0.0% of Pipeline Slots
Front-End Bound: 0.0% of Pipeline Slots
Front-End Latency: 0.0% of Pipeline Slots
ICache Misses: 0.0% of Pipeline Slots
ITLB Overhead: 0.0% of Pipeline Slots
BACLEARS: 0.0% of Pipeline Slots
Branch Resteers: 0.0% of Pipeline Slots
Front-End Bandwidth: 0.0% of Pipeline Slots
Cisc: 0.0% of Pipeline Slots
Decode: 0.0% of Pipeline Slots
Pre-Decode Wrong: 0.0% of Pipeline Slots
Front-End Other: 0.0% of Pipeline Slots
Bad Speculation: 100.0% of Pipeline Slots
| A significant proportion of pipeline slots containing useful work is
| being cancelled. This can be caused by mispredicted branches or by
| machine clears. Note that this metric value may be highlighted due to
| a Branch Resteers issue.
|
Branch Mispredict: 88.2% of Pipeline Slots
| Issue: A significant proportion of branches are mispredicted,
| leading to excessive wasted work, or to Back-End stalls due to
| the machine needing to recover its state from a speculative path.
|
| Tips:
|
| 1. Identify heavily mispredicted branches and consider making
| your algorithm more predictable or reducing the number of
| branches. You can add more work to 'if' statements and move them
| higher in the code flow for earlier execution. If using 'switch'
| or 'case' statements, put the most commonly executed cases first.
| Avoid using virtual function pointers for heavily executed calls.
|
| 2. Use profile-guided optimization in the compiler.
|
| See the Intel 64 and IA-32 Architectures Optimization Reference
| Manual for general strategies to address branch misprediction
| issues.
|
Machine Clears: 0.0% of Pipeline Slots
Machine Clear: 0.0% of Pipeline Slots
SMC Machine Clear: 0.000
MO Machine Clear Overhead: 0.000
FP Assists: 0.000
Disambiguation: 0.000
Page Faults: 0.000
Fast Machine Clears: 0.0% of Pipeline Slots
Back-End Bound: 0.0% of Pipeline Slots
Core Bound: 0.0% of Clockticks
Memory Bound: 0.0% of Clockticks
Store Bound: 0.0% of Clockticks
L1 Bound: 0.0% of Clockticks
Loads Blocked by Store Forwarding: 0.0% of Clockticks
Load STLB Hit: 0.0% of Clockticks
Load STLB Miss: 0.0% of Clockticks
Other L1: 0.0% of Clockticks
L2 Bound: 0.0% of Clockticks
L3 Bound: 0.0% of Clockticks
DRAM Bound: 0.0% of Clockticks
Other Load Store: 0.0% of Clockticks
Back-End Bound Auxiliary: 0.0% of Pipeline Slots
Resource Bound: 0.0% of Pipeline Slots
Memory Scheduler: 0.0% of Pipeline Slots
ST Buffer: 0.000
LD Buffer: 0.000
RSV: 0.000
Non-memory Scheduler: 0.0% of Pipeline Slots
Register: 0.0% of Pipeline Slots
Full Re-order Buffer (ROB): 0.0% of Pipeline Slots
Allocation Restriction: 0.0% of Pipeline Slots
Serializing Operations: 0.0% of Pipeline Slots
Average CPU Frequency: 18.963 GHz
Total Thread Count: 66
Paused Time: 0s
Effective Physical Core Utilization: 1.2% (0.283 out of 24)
| The metric value is low, which may signal poor utilization of
| physical CPU cores caused by:
| - load imbalance
| - threading runtime overhead
| - contended synchronization
| - thread/process underutilization
| - incorrect affinity that utilizes logical cores instead of physical
| cores
| Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
| or run the Locks and Waits analysis to identify parallel bottlenecks for
| other parallel runtimes.
|
Effective Logical Core Utilization: 0.9% (0.284 out of 32)
| The metric value is low, which may signal poor utilization of
| logical CPU cores. Consider improving physical core utilization as
| the first step, and then look at opportunities to utilize logical
| cores, which in some cases can improve processor throughput and the
| overall performance of multi-threaded applications.
|
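The two utilization percentages above are just the reported effective core counts divided by the machine's totals from the platform info below (24 physical cores, 32 logical CPUs). A quick check of the arithmetic:

```python
# Effective core utilization = effective cores used / cores available.
# Values are copied from the report: 0.283 of 24 physical cores,
# 0.284 of 32 logical CPUs.
physical_utilization = 0.283 / 24
logical_utilization = 0.284 / 32

print(f"{physical_utilization:.1%}")  # 1.2%
print(f"{logical_utilization:.1%}")   # 0.9%
```

The near-single-core utilization is expected here: the interpreted CoreMark workload is single-threaded, so only one of the 24 physical cores is kept busy.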
Collection and Platform Info
Application Command Line: ./target/x86_64-unknown-linux-gnu/release/wasmtime "run" "-C" "cache=n" "--target" "pulley64" "--invoke" "run" "--preload" "env=time.wasm" "../wasmi-benchmarks/benches/res/wasm/coremark-minimal.wasm"
Operating System: 6.8.0-51-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=24.04 DISTRIB_CODENAME=noble DISTRIB_DESCRIPTION="Ubuntu 24.04.1 LTS"
Computer Name: fhaweignbi
Result Size: 48.0 MB
Collection start time: 18:59:44 17/01/2025 UTC
Collection stop time: 19:00:04 17/01/2025 UTC
Collector Type: Driverless Perf system-wide sampling
CPU
Name: Intel(R) microarchitecture code named Raptorlake-DT
Frequency: 3.187 GHz
Logical CPU Count: 32
Cache Allocation Technology
Level 2 capability: not detected
Level 3 capability: not detected
If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <my_result_dir>.
Alternatively, you may view the report in the csv format: vtune -report
<report_name> -format=csv.
vtune: Executing actions 100 % done