Created January 17, 2025 19:05
[~/code/wasmtime[pulley-less-instruction-loads]] $ /opt/intel/oneapi/vtune/latest/bin64/vtune -collect uarch-exploration ./target/x86_64-unknown-linux-gnu/release/wasmtime run -C cache=n --target pulley64 --invoke run --preload env=time.wasm ../wasmi-benchmarks/benches/res/wasm/coremark-minimal.wasm
vtune: Warning: To profile kernel modules during the session, make sure they are available in the /lib/modules/kernel_version/ location.
vtune: Collection started. To stop the collection, either press CTRL-C or enter from another console window: vtune -r /home/alex/code/wasmtime/r001ue -command stop.
warning: using `--invoke` with a function that returns values is experimental and may break in the future
1557.9965
vtune: Collection stopped.
vtune: Using result path `/home/alex/code/wasmtime/r001ue'
vtune: Executing actions 19 % Resolving information for `libc.so.6'
vtune: Warning: Cannot locate file `vmlinux'.
vtune: Executing actions 20 % Resolving information for `wasmtime'
vtune: Warning: Cannot locate debugging information for the Linux kernel. Source-level analysis will not be possible. Function-level analysis will be limited to kernel symbol tables. See the Enabling Linux Kernel Analysis topic in the product online help for instructions.
vtune: Warning: Cannot locate debugging information for file `/home/alex/code/wasmtime/target/x86_64-unknown-linux-gnu/release/wasmtime'.
vtune: Executing actions 75 % Generating a report
Elapsed Time: 19.991s
Clockticks: 107,513,600,000
Performance-core (P-core): 107,459,200,000
Efficient-core (E-core): 54,400,000
Instructions Retired: 286,480,000,000
Performance-core (P-core): 286,432,000,000
Efficient-core (E-core): 48,000,000
CPI Rate: 0.375
Performance-core (P-core): 0.375
Efficient-core (E-core): 1.133
Performance-core (P-core)
Retiring: 100.0% of Pipeline Slots
| A high fraction of pipeline slots was utilized by useful work. While
| the goal is to make this metric value as big as possible, a high
| Retiring value for non-vectorized code could prompt you to consider
| code vectorization. Vectorization enables doing more computations
| without significantly increasing the number of instructions, thus
| improving the performance. Note that this metric value may be
| highlighted due to Microcode Sequencer (MS) issue, so the performance
| can be improved by avoiding using the MS.
|
Light Operations: 100.0% of Pipeline Slots
| CPU retired light-weight operations (ones which require no more
| than one uop) in a significant fraction of cycles. This
| correlates with total number of instructions used by the program.
| Optimum value of uops-per-instruction ratio is 1. While this is
| the most desirable metric, high values can also provide
| opportunities for performance optimizations.
|
FP Arithmetic: 0.0% of uOps
FP x87: 0.0% of uOps
FP Scalar: 0.0% of uOps
FP Vector: 0.0% of uOps
128-bit FP Vector: 0.0% of uOps
256-bit FP Vector: 0.0% of uOps
Integer Operations: 0.0% of uOps
128-bit Integer Vector Operations: 0.0% of uOps
256-bit Vector Operations: 0.0% of uOps
Memory Operations: 93.2% of Pipeline Slots
| For a significant fraction of pipeline slots the CPU was
| retiring memory operations - uops for memory load or store
| accesses.
|
Fused Instructions: 0.7% of Pipeline Slots
Non Fused Branches: 15.0% of Pipeline Slots
| For a significant fraction of slots the CPU was retiring
| branch instructions that were not fused. Non-conditional
| branches like direct JMP or CALL would count here. Can be
| used to detect fusable conditional jumps.
|
Other: 100.0% of Pipeline Slots
| This metric represents a non-floating-point (FP) uop fraction
| the CPU has executed. If your application has no FP
| operations, this is likely to be the biggest fraction.
|
Nop Instructions: 0.0% of Pipeline Slots
Shuffles_256b: 0.0% of Pipeline Slots
Heavy Operations: 20.9% of Pipeline Slots
| CPU retired heavy-weight operations (instructions that required
| 2+ uops) in a significant fraction of cycles.
|
Few Uops Instructions: 20.9% of Pipeline Slots
| This metric represents fraction of slots where the CPU was
| retiring instructions that that are decoder into two or up to
| ([SNB+] four; [ADL+] five) uops. This highly-correlates with
| the number of uops in such instructions.
|
Microcode Sequencer: 0.0% of Pipeline Slots
Assists: 0.0% of Pipeline Slots
Page Faults: 0.0% of Pipeline Slots
FP Assists: 0.0% of Pipeline Slots
AVX Assists: 0.0% of Pipeline Slots
CISC: 0.0% of Pipeline Slots
Front-End Bound: 25.1% of Pipeline Slots
| Issue: A significant portion of Pipeline Slots is remaining empty due
| to issues in the Front-End.
|
| Tips: Make sure the code working size is not too large, the code
| layout does not require too many memory accesses per cycle to get
| enough instructions for filling four pipeline slots, or check for
| microcode assists.
|
Front-End Latency: 4.8% of Pipeline Slots
ICache Misses: 0.1% of Clockticks
ITLB Overhead: 0.0% of Clockticks
Branch Resteers: 0.5% of Clockticks
Mispredicts Resteers
Clears Resteers
Unknown Branches: 0.0% of Clockticks
MS Switches: 0.0% of Clockticks
Length Changing Prefixes: 0.0% of Clockticks
DSB Switches: 1.2% of Clockticks
Front-End Bandwidth: 20.3% of Pipeline Slots
| This metric represents a fraction of slots during which CPU was
| stalled due to front-end bandwidth issues, such as inefficiencies
| in the instruction decoders or code restrictions for caching in
| the DSB (decoded uOps cache). In such cases, the front-end
| typically delivers a non-optimal amount of uOps to the back-end.
|
Front-End Bandwidth MITE: 0.1% of Pipeline Slots
Decoder-0 Alone: 0.0% of Pipeline Slots
Front-End Bandwidth DSB: 5.4% of Pipeline Slots
Front-End Bandwidth LSD: 0.4% of Pipeline Slots
(Info) DSB Coverage: 88.5%
(Info) LSD Coverage: 7.1%
(Info) DSB Misses: 100.0% of Pipeline Slots
| %DSB_MissesIssueTextAll
|
Bad Speculation: 0.0% of Pipeline Slots
Branch Mispredict: 37.6% of Pipeline Slots
Machine Clears: 0.0% of Pipeline Slots
Back-End Bound: 100.0% of Pipeline Slots
| A significant portion of pipeline slots are remaining empty. When
| operations take too long in the back-end, they introduce bubbles in
| the pipeline that ultimately cause fewer pipeline slots containing
| useful work to be retired per cycle than the machine is capable to
| support. This opportunity cost results in slower execution. Long-
| latency operations like divides and memory operations can cause this,
| as can too many operations being directed to a single execution port
| (for example, more multiply operations arriving in the back-end per
| cycle than the execution unit can support).
|
Memory Bound: 73.0% of Pipeline Slots
| The metric value is high. This can indicate that the significant
| fraction of execution pipeline slots could be stalled due to
| demand memory load and stores. Use Memory Access analysis to have
| the metric breakdown by memory hierarchy, memory bandwidth
| information, correlation by memory objects.
|
L1 Bound: 16.4% of Clockticks
DTLB Overhead: 0.1% of Clockticks
Load STLB Hit: 0.0% of Clockticks
Load STLB Miss: 0.1% of Clockticks
Loads Blocked by Store Forwarding: 0.0% of Clockticks
Lock Latency: 0.0% of Clockticks
Split Loads: 0.0% of Clockticks
4K Aliasing: 0.0% of Clockticks
FB Full: 0.0% of Clockticks
L2 Bound: 0.1% of Clockticks
L3 Bound: 0.0% of Clockticks
Contested Accesses: 0.0% of Clockticks
Data Sharing: 0.0% of Clockticks
L3 Latency: 0.0% of Clockticks
SQ Full: 0.0% of Clockticks
DRAM Bound: 0.0% of Clockticks
Memory Bandwidth: 0.0% of Clockticks
Memory Latency: 0.0% of Clockticks
Store Bound: 0.0% of Clockticks
Store Latency: 0.0% of Clockticks
False Sharing: 0.0% of Clockticks
Split Stores: 0.0%
Streaming Stores: 0.0% of Clockticks
DTLB Store Overhead: 0.0% of Clockticks
Store STLB Hit: 0.0% of Clockticks
Store STLB Miss: 0.0% of Clockticks
Core Bound: 100.0% of Pipeline Slots
| This metric represents how much Core non-memory issues were of a
| bottleneck. Shortage in hardware compute resources, or
| dependencies software's instructions are both categorized under
| Core Bound. Hence it may indicate the machine ran out of an OOO
| resources, certain execution units are overloaded or dependencies
| in program's data- or instruction- flow are limiting the
| performance (e.g. FP-chained long-latency arithmetic operations).
|
Divider: 0.0% of Clockticks
Serializing Operations: 0.2% of Clockticks
Slow Pause: 0.0% of Clockticks
C01 Wait: 0.0% of Clockticks
C02 Wait: 0.0% of Clockticks
Memory Fence: 0.0% of Clockticks
Port Utilization: 20.1% of Clockticks
| Issue: A significant fraction of cycles was stalled due to
| Core non-divider-related issues.
|
| Tips: Use vectorization to reduce pressure on the execution
| ports as multiple elements are calculated with same uOp.
|
Cycles of 0 Ports Utilized: 0.0% of Clockticks
Mixing Vectors: 0.0% of Clockticks
Cycles of 1 Port Utilized: 20.1% of Clockticks
| This metric represents cycles fraction where the CPU
| executed total of 1 uop per cycle on all execution ports
| (Logical Processor cycles since ICL, Physical Core cycles
| otherwise). This can be due to heavy data-dependency
| among software instructions, or oversubscribing a
| particular hardware resource. In some other cases with
| high 1_Port_Utilized and L1 Bound, this metric can point
| to L1 data-cache latency bottleneck that may not
| necessarily manifest with complete execution starvation
| (due to the short L1 latency e.g. walking a linked list)
| - looking at the assembly can be helpful. Note that this
| metric value may be highlighted due to L1 Bound issue.
|
Cycles of 2 Ports Utilized: 4.4% of Clockticks
Cycles of 3+ Ports Utilized: 45.4% of Clockticks
| This metric represents Core cycles fraction CPU executed
| total of 3 or more uops per cycle on all execution ports
| (Logical Processor cycles since ICL, Physical Core cycles
| otherwise).
|
ALU Operation Utilization: 28.6% of Clockticks
Port 0: 32.9% of Clockticks
Port 1: 34.0% of Clockticks
Port 6: 34.4% of Clockticks
Load Operation Utilization: 23.7% of Clockticks
Store Operation Utilization: 14.3% of Clockticks
Efficient-core (E-core)
Retiring: 0.0% of Pipeline Slots
General Retirement: 0.0% of Pipeline Slots
FP Arithmetic: 0.0% of Pipeline Slots
Other: 0.0% of Pipeline Slots
Microcode Sequencer: 0.0% of Pipeline Slots
Front-End Bound: 0.0% of Pipeline Slots
Front-End Latency: 0.0% of Pipeline Slots
ICache Misses: 0.0% of Pipeline Slots
ITLB Overhead: 0.0% of Pipeline Slots
BACLEARS: 0.0% of Pipeline Slots
Branch Resteers: 0.0% of Pipeline Slots
Front-End Bandwidth: 0.0% of Pipeline Slots
Cisc: 0.0% of Pipeline Slots
Decode: 0.0% of Pipeline Slots
Pre-Decode Wrong: 0.0% of Pipeline Slots
Front-End Other: 0.0% of Pipeline Slots
Bad Speculation: 100.0% of Pipeline Slots
| A significant proportion of pipeline slots containing useful work are
| being cancelled. This can be caused by mispredicting branches or by
| machine clears. Note that this metric value may be highlighted due to
| Branch Resteers issue.
|
Branch Mispredict: 88.2% of Pipeline Slots
| Issue:: A significant proportion of branches are mispredicted,
| leading to excessive wasted work or Backend stalls due to the
| machine need to recover its state from a speculative path.
|
| Tips:
|
| 1. Identify heavily mispredicted branches and consider making
| your algorithm more predictable or reducing the number of
| branches. You can add more work to 'if' statements and move them
| higher in the code flow for earlier execution. If using 'switch'
| or 'case' statements, put the most commonly executed cases first.
| Avoid using virtual function pointers for heavily executed calls.
|
| 2. Use profile-guided optimization in the compiler.
|
| See the Intel 64 and IA-32 Architectures Optimization Reference
| Manual for general strategies to address branch misprediction
| issues.
|
Machine Clears: 0.0% of Pipeline Slots
Machine Clear: 0.0% of Pipeline Slots
SMC Machine Clear: 0.000
MO Machine Clear Overhead: 0.000
FP Assists: 0.000
Disambiguation: 0.000
Page Faults: 0.000
Fast Machine Clears: 0.0% of Pipeline Slots
Back-End Bound: 0.0% of Pipeline Slots
Core Bound: 0.0% of Clockticks
Memory Bound: 0.0% of Clockticks
Store Bound: 0.0% of Clockticks
L1 Bound: 0.0% of Clockticks
Loads Blocked by Store Forwarding: 0.0% of Clockticks
Load STLB Hit: 0.0% of Clockticks
Load STLB Miss: 0.0% of Clockticks
Other L1: 0.0% of Clockticks
L2 Bound: 0.0% of Clockticks
L3 Bound: 0.0% of Clockticks
DRAM Bound: 0.0% of Clockticks
Other Load Store: 0.0% of Clockticks
Back-End Bound Auxiliary: 0.0% of Pipeline Slots
Resource Bound: 0.0% of Pipeline Slots
Memory Scheduler: 0.0% of Pipeline Slots
ST Buffer: 0.000
LD Buffer: 0.000
RSV: 0.000
Non-memory Scheduler: 0.0% of Pipeline Slots
Register: 0.0% of Pipeline Slots
Full Re-order Buffer (ROB): 0.0% of Pipeline Slots
Allocation Restriction: 0.0% of Pipeline Slots
Serializing Operations: 0.0% of Pipeline Slots
Average CPU Frequency: 18.963 GHz
Total Thread Count: 66
Paused Time: 0s
Effective Physical Core Utilization: 1.2% (0.283 out of 24)
| The metric value is low, which may signal a poor physical CPU cores
| utilization caused by:
| - load imbalance
| - threading runtime overhead
| - contended synchronization
| - thread/process underutilization
| - incorrect affinity that utilizes logical cores instead of physical
| cores
| Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
| or run the Locks and Waits analysis to identify parallel bottlenecks for
| other parallel runtimes.
|
Effective Logical Core Utilization: 0.9% (0.284 out of 32)
| The metric value is low, which may signal a poor logical CPU cores
| utilization. Consider improving physical core utilization as the first
| step and then look at opportunities to utilize logical cores, which in
| some cases can improve processor throughput and overall performance of
| multi-threaded applications.
|
Collection and Platform Info
Application Command Line: ./target/x86_64-unknown-linux-gnu/release/wasmtime "run" "-C" "cache=n" "--target" "pulley64" "--invoke" "run" "--preload" "env=time.wasm" "../wasmi-benchmarks/benches/res/wasm/coremark-minimal.wasm"
Operating System: 6.8.0-51-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=24.04 DISTRIB_CODENAME=noble DISTRIB_DESCRIPTION="Ubuntu 24.04.1 LTS"
Computer Name: fhaweignbi
Result Size: 48.0 MB
Collection start time: 18:59:44 17/01/2025 UTC
Collection stop time: 19:00:04 17/01/2025 UTC
Collector Type: Driverless Perf system-wide sampling
CPU
Name: Intel(R) microarchitecture code named Raptorlake-DT
Frequency: 3.187 GHz
Logical CPU Count: 32
Cache Allocation Technology
Level 2 capability: not detected
Level 3 capability: not detected
If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <my_result_dir>.
Alternatively, you may view the report in the csv format: vtune -report
<report_name> -format=csv.
vtune: Executing actions 100 % done
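
As a sanity check on the summary above, the three CPI Rate lines can be reproduced from the Clockticks and Instructions Retired counters, assuming CPI Rate is simply clockticks divided by instructions retired. The sketch below is not part of the VTune output; the dictionary keys and the `cpi` helper are illustrative names, and the counter values are copied by hand from the report.

```python
# Counter values copied from the VTune summary above.
clockticks = {
    "total": 107_513_600_000,
    "p_core": 107_459_200_000,
    "e_core": 54_400_000,
}
instructions_retired = {
    "total": 286_480_000_000,
    "p_core": 286_432_000_000,
    "e_core": 48_000_000,
}


def cpi(ticks: int, instructions: int) -> float:
    """Cycles per instruction: clockticks divided by instructions retired."""
    return ticks / instructions


# Round to three decimals, the precision the report prints.
cpi_rates = {
    name: round(cpi(clockticks[name], instructions_retired[name]), 3)
    for name in clockticks
}

for name, value in cpi_rates.items():
    print(f"{name}: {value:.3f}")
```

Under that assumption the computed values agree with the reported 0.375 (overall and P-core) and 1.133 (E-core). Nearly all cycles and instructions land on the P-cores, so the E-core figures come from a very small sample.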