@alexcrichton
Created January 17, 2025 19:05
[~/code/wasmtime[pulley-less-instruction-loads]] $ /opt/intel/oneapi/vtune/latest/bin64/vtune -collect uarch-exploration ./target/x86_64-unknown-linux-gnu/release/wasmtime run -C cache=n --target pulley64 --invoke run --preload env=time.wasm ../wasmi-benchmarks/benches/res/wasm/coremark-minimal.wasm
vtune: Warning: To profile kernel modules during the session, make sure they are available in the /lib/modules/kernel_version/ location.
vtune: Collection started. To stop the collection, either press CTRL-C or enter from another console window: vtune -r /home/alex/code/wasmtime/r001ue -command stop.
warning: using `--invoke` with a function that returns values is experimental and may break in the future
1557.9965
vtune: Collection stopped.
vtune: Using result path `/home/alex/code/wasmtime/r001ue'
vtune: Executing actions 19 % Resolving information for `libc.so.6'
vtune: Warning: Cannot locate file `vmlinux'.
vtune: Executing actions 20 % Resolving information for `wasmtime'
vtune: Warning: Cannot locate debugging information for the Linux kernel. Source-level analysis will not be possible. Function-level analysis will be limited to kernel symbol tables. See the Enabling Linux Kernel Analysis topic in the product online help for instructions.
vtune: Warning: Cannot locate debugging information for file `/home/alex/code/wasmtime/target/x86_64-unknown-linux-gnu/release/wasmtime'.
vtune: Executing actions 75 % Generating a report Elapsed Time: 19.991s
Clockticks: 107,513,600,000
Performance-core (P-core): 107,459,200,000
Efficient-core (E-core): 54,400,000
Instructions Retired: 286,480,000,000
Performance-core (P-core): 286,432,000,000
Efficient-core (E-core): 48,000,000
CPI Rate: 0.375
Performance-core (P-core): 0.375
Efficient-core (E-core): 1.133
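The CPI figures above follow directly from the two counters reported before them: CPI is clockticks divided by instructions retired. A quick arithmetic check, using the per-core-type values copied verbatim from this report:

```python
# Verify the CPI (cycles per instruction) values in the VTune summary.
# Counter values are copied from the report above.
p_core_clockticks = 107_459_200_000
p_core_instructions = 286_432_000_000
e_core_clockticks = 54_400_000
e_core_instructions = 48_000_000

def cpi(clockticks: int, instructions: int) -> float:
    """CPI is retired clockticks divided by instructions retired."""
    return clockticks / instructions

print(round(cpi(p_core_clockticks, p_core_instructions), 3))  # 0.375
print(round(cpi(e_core_clockticks, e_core_instructions), 3))  # 1.133
```

A P-core CPI of 0.375 corresponds to roughly 2.7 instructions retired per cycle, consistent with the interpreter loop keeping the retirement pipeline busy.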
Performance-core (P-core)
Retiring: 100.0% of Pipeline Slots
| A high fraction of pipeline slots was utilized by useful work. While
| the goal is to make this metric value as high as possible, a high
| Retiring value for non-vectorized code could prompt you to consider
| code vectorization. Vectorization enables more computation to be done
| without significantly increasing the number of instructions, thus
| improving performance. Note that this metric value may be highlighted
| due to a Microcode Sequencer (MS) issue, in which case performance
| can be improved by avoiding use of the MS.
|
Light Operations: 100.0% of Pipeline Slots
| The CPU retired light-weight operations (those requiring no more
| than one uop) in a significant fraction of cycles. This correlates
| with the total number of instructions used by the program. The
| optimum uops-per-instruction ratio is 1. While this is the most
| desirable case, high values can still present opportunities for
| performance optimization.
|
FP Arithmetic: 0.0% of uOps
FP x87: 0.0% of uOps
FP Scalar: 0.0% of uOps
FP Vector: 0.0% of uOps
128-bit FP Vector: 0.0% of uOps
256-bit FP Vector: 0.0% of uOps
Integer Operations: 0.0% of uOps
128-bit Integer Vector Operations: 0.0% of uOps
256-bit Vector Operations: 0.0% of uOps
Memory Operations: 93.2% of Pipeline Slots
| For a significant fraction of pipeline slots the CPU was
| retiring memory operations - uops for memory load or store
| accesses.
|
Fused Instructions: 0.7% of Pipeline Slots
Non Fused Branches: 15.0% of Pipeline Slots
| For a significant fraction of slots the CPU was retiring
| branch instructions that were not fused. Non-conditional
| branches like direct JMP or CALL would count here. Can be
| used to detect fusable conditional jumps.
|
Other: 100.0% of Pipeline Slots
| This metric represents a non-floating-point (FP) uop fraction
| the CPU has executed. If your application has no FP
| operations, this is likely to be the biggest fraction.
|
Nop Instructions: 0.0% of Pipeline Slots
Shuffles_256b: 0.0% of Pipeline Slots
Heavy Operations: 20.9% of Pipeline Slots
| CPU retired heavy-weight operations (instructions that required
| 2+ uops) in a significant fraction of cycles.
|
Few Uops Instructions: 20.9% of Pipeline Slots
| This metric represents the fraction of slots where the CPU was
| retiring instructions that are decoded into two or up to
| ([SNB+] four; [ADL+] five) uops. This highly correlates with
| the number of uops in such instructions.
|
Microcode Sequencer: 0.0% of Pipeline Slots
Assists: 0.0% of Pipeline Slots
Page Faults: 0.0% of Pipeline Slots
FP Assists: 0.0% of Pipeline Slots
AVX Assists: 0.0% of Pipeline Slots
CISC: 0.0% of Pipeline Slots
Front-End Bound: 25.1% of Pipeline Slots
| Issue: A significant portion of Pipeline Slots is remaining empty due
| to issues in the Front-End.
|
| Tips: Make sure the working set size of the code is not too large,
| that the code layout does not require too many memory accesses per
| cycle to get enough instructions to fill four pipeline slots, and
| check for microcode assists.
|
Front-End Latency: 4.8% of Pipeline Slots
ICache Misses: 0.1% of Clockticks
ITLB Overhead: 0.0% of Clockticks
Branch Resteers: 0.5% of Clockticks
Mispredicts Resteers
Clears Resteers
Unknown Branches: 0.0% of Clockticks
MS Switches: 0.0% of Clockticks
Length Changing Prefixes: 0.0% of Clockticks
DSB Switches: 1.2% of Clockticks
Front-End Bandwidth: 20.3% of Pipeline Slots
| This metric represents the fraction of slots during which the CPU
| was stalled due to front-end bandwidth issues, such as
| inefficiencies in the instruction decoders or code restrictions on
| caching in the DSB (decoded uOps cache). In such cases, the
| front-end typically delivers a non-optimal number of uOps to the
| back-end.
|
Front-End Bandwidth MITE: 0.1% of Pipeline Slots
Decoder-0 Alone: 0.0% of Pipeline Slots
Front-End Bandwidth DSB: 5.4% of Pipeline Slots
Front-End Bandwidth LSD: 0.4% of Pipeline Slots
(Info) DSB Coverage: 88.5%
(Info) LSD Coverage: 7.1%
(Info) DSB Misses: 100.0% of Pipeline Slots
Bad Speculation: 0.0% of Pipeline Slots
Branch Mispredict: 37.6% of Pipeline Slots
Machine Clears: 0.0% of Pipeline Slots
Back-End Bound: 100.0% of Pipeline Slots
| A significant portion of pipeline slots is remaining empty. When
| operations take too long in the back-end, they introduce bubbles in
| the pipeline that ultimately cause fewer pipeline slots containing
| useful work to be retired per cycle than the machine is capable of
| supporting. This opportunity cost results in slower execution.
| Long-latency operations like divides and memory operations can cause
| this, as can too many operations being directed to a single
| execution port (for example, more multiply operations arriving in
| the back-end per cycle than the execution unit can support).
|
Memory Bound: 73.0% of Pipeline Slots
| The metric value is high. This can indicate that a significant
| fraction of execution pipeline slots could be stalled due to demand
| memory loads and stores. Use Memory Access analysis to get the
| metric broken down by memory hierarchy, with memory bandwidth
| information and correlation with memory objects.
|
L1 Bound: 16.4% of Clockticks
DTLB Overhead: 0.1% of Clockticks
Load STLB Hit: 0.0% of Clockticks
Load STLB Miss: 0.1% of Clockticks
Loads Blocked by Store Forwarding: 0.0% of Clockticks
Lock Latency: 0.0% of Clockticks
Split Loads: 0.0% of Clockticks
4K Aliasing: 0.0% of Clockticks
FB Full: 0.0% of Clockticks
L2 Bound: 0.1% of Clockticks
L3 Bound: 0.0% of Clockticks
Contested Accesses: 0.0% of Clockticks
Data Sharing: 0.0% of Clockticks
L3 Latency: 0.0% of Clockticks
SQ Full: 0.0% of Clockticks
DRAM Bound: 0.0% of Clockticks
Memory Bandwidth: 0.0% of Clockticks
Memory Latency: 0.0% of Clockticks
Store Bound: 0.0% of Clockticks
Store Latency: 0.0% of Clockticks
False Sharing: 0.0% of Clockticks
Split Stores: 0.0%
Streaming Stores: 0.0% of Clockticks
DTLB Store Overhead: 0.0% of Clockticks
Store STLB Hit: 0.0% of Clockticks
Store STLB Miss: 0.0% of Clockticks
Core Bound: 100.0% of Pipeline Slots
| This metric represents how much of the bottleneck was due to Core
| (non-memory) issues. A shortage of hardware compute resources and
| dependencies between the software's instructions are both
| categorized under Core Bound. Hence it may indicate that the machine
| ran out of out-of-order (OOO) resources, that certain execution
| units are overloaded, or that dependencies in the program's data or
| instruction flow are limiting performance (e.g. FP-chained
| long-latency arithmetic operations).
|
Divider: 0.0% of Clockticks
Serializing Operations: 0.2% of Clockticks
Slow Pause: 0.0% of Clockticks
C01 Wait: 0.0% of Clockticks
C02 Wait: 0.0% of Clockticks
Memory Fence: 0.0% of Clockticks
Port Utilization: 20.1% of Clockticks
| Issue: A significant fraction of cycles was stalled due to
| Core non-divider-related issues.
|
| Tips: Use vectorization to reduce pressure on the execution
| ports as multiple elements are calculated with same uOp.
|
Cycles of 0 Ports Utilized: 0.0% of Clockticks
Mixing Vectors: 0.0% of Clockticks
Cycles of 1 Port Utilized: 20.1% of Clockticks
| This metric represents the fraction of cycles where the CPU
| executed a total of 1 uop per cycle on all execution ports
| (Logical Processor cycles since ICL, Physical Core cycles
| otherwise). This can be due to heavy data dependencies among
| software instructions, or to oversubscribing a particular
| hardware resource. In some cases with high 1_Port_Utilized
| and L1 Bound, this metric can point to an L1 data-cache
| latency bottleneck that does not necessarily manifest as
| complete execution starvation (due to the short L1 latency,
| e.g. walking a linked list) - looking at the assembly can be
| helpful. Note that this metric value may be highlighted due
| to an L1 Bound issue.
|
Cycles of 2 Ports Utilized: 4.4% of Clockticks
Cycles of 3+ Ports Utilized: 45.4% of Clockticks
| This metric represents the fraction of Core cycles where the
| CPU executed a total of 3 or more uops per cycle on all
| execution ports (Logical Processor cycles since ICL, Physical
| Core cycles otherwise).
|
ALU Operation Utilization: 28.6% of Clockticks
Port 0: 32.9% of Clockticks
Port 1: 34.0% of Clockticks
Port 6: 34.4% of Clockticks
Load Operation Utilization: 23.7% of Clockticks
Store Operation Utilization: 14.3% of Clockticks
Efficient-core (E-core)
Retiring: 0.0% of Pipeline Slots
General Retirement: 0.0% of Pipeline Slots
FP Arithmetic: 0.0% of Pipeline Slots
Other: 0.0% of Pipeline Slots
Microcode Sequencer: 0.0% of Pipeline Slots
Front-End Bound: 0.0% of Pipeline Slots
Front-End Latency: 0.0% of Pipeline Slots
ICache Misses: 0.0% of Pipeline Slots
ITLB Overhead: 0.0% of Pipeline Slots
BACLEARS: 0.0% of Pipeline Slots
Branch Resteers: 0.0% of Pipeline Slots
Front-End Bandwidth: 0.0% of Pipeline Slots
Cisc: 0.0% of Pipeline Slots
Decode: 0.0% of Pipeline Slots
Pre-Decode Wrong: 0.0% of Pipeline Slots
Front-End Other: 0.0% of Pipeline Slots
Bad Speculation: 100.0% of Pipeline Slots
| A significant proportion of pipeline slots containing useful work is
| being cancelled. This can be caused by mispredicted branches or by
| machine clears. Note that this metric value may be highlighted due to
| a Branch Resteers issue.
|
Branch Mispredict: 88.2% of Pipeline Slots
| Issue: A significant proportion of branches are mispredicted,
| leading to excessive wasted work, or to Back-End stalls due to
| the machine needing to recover its state from a speculative path.
|
| Tips:
|
| 1. Identify heavily mispredicted branches and consider making
| your algorithm more predictable or reducing the number of
| branches. You can add more work to 'if' statements and move them
| higher in the code flow for earlier execution. If using 'switch'
| or 'case' statements, put the most commonly executed cases first.
| Avoid using virtual function pointers for heavily executed calls.
|
| 2. Use profile-guided optimization in the compiler.
|
| See the Intel 64 and IA-32 Architectures Optimization Reference
| Manual for general strategies to address branch misprediction
| issues.
|
Machine Clears: 0.0% of Pipeline Slots
Machine Clear: 0.0% of Pipeline Slots
SMC Machine Clear: 0.000
MO Machine Clear Overhead: 0.000
FP Assists: 0.000
Disambiguation: 0.000
Page Faults: 0.000
Fast Machine Clears: 0.0% of Pipeline Slots
Back-End Bound: 0.0% of Pipeline Slots
Core Bound: 0.0% of Clockticks
Memory Bound: 0.0% of Clockticks
Store Bound: 0.0% of Clockticks
L1 Bound: 0.0% of Clockticks
Loads Blocked by Store Forwarding: 0.0% of Clockticks
Load STLB Hit: 0.0% of Clockticks
Load STLB Miss: 0.0% of Clockticks
Other L1: 0.0% of Clockticks
L2 Bound: 0.0% of Clockticks
L3 Bound: 0.0% of Clockticks
DRAM Bound: 0.0% of Clockticks
Other Load Store: 0.0% of Clockticks
Back-End Bound Auxiliary: 0.0% of Pipeline Slots
Resource Bound: 0.0% of Pipeline Slots
Memory Scheduler: 0.0% of Pipeline Slots
ST Buffer: 0.000
LD Buffer: 0.000
RSV: 0.000
Non-memory Scheduler: 0.0% of Pipeline Slots
Register: 0.0% of Pipeline Slots
Full Re-order Buffer (ROB): 0.0% of Pipeline Slots
Allocation Restriction: 0.0% of Pipeline Slots
Serializing Operations: 0.0% of Pipeline Slots
Average CPU Frequency: 18.963 GHz
Total Thread Count: 66
Paused Time: 0s
Effective Physical Core Utilization: 1.2% (0.283 out of 24)
| The metric value is low, which may signal poor utilization of
| physical CPU cores caused by:
| - load imbalance
| - threading runtime overhead
| - contended synchronization
| - thread/process underutilization
| - incorrect affinity that utilizes logical cores instead of physical
| cores
| Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
| or run the Locks and Waits analysis to identify parallel bottlenecks for
| other parallel runtimes.
|
Effective Logical Core Utilization: 0.9% (0.284 out of 32)
| The metric value is low, which may signal poor utilization of
| logical CPU cores. Consider improving physical core utilization as
| the first step, and then look at opportunities to utilize logical
| cores, which in some cases can improve processor throughput and the
| overall performance of multi-threaded applications.
|
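The two utilization percentages above are just the reported effective core counts divided by the machine's totals from the platform info below (24 physical cores, 32 logical CPUs). A quick check of the arithmetic:

```python
# Effective core utilization = effective cores used / cores available.
# Values are copied from the report: 0.283 of 24 physical cores,
# 0.284 of 32 logical CPUs.
physical_utilization = 0.283 / 24
logical_utilization = 0.284 / 32

print(f"{physical_utilization:.1%}")  # 1.2%
print(f"{logical_utilization:.1%}")   # 0.9%
```

The near-single-core utilization is expected here: the interpreted CoreMark workload is single-threaded, so only one of the 24 physical cores is kept busy.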
Collection and Platform Info
Application Command Line: ./target/x86_64-unknown-linux-gnu/release/wasmtime "run" "-C" "cache=n" "--target" "pulley64" "--invoke" "run" "--preload" "env=time.wasm" "../wasmi-benchmarks/benches/res/wasm/coremark-minimal.wasm"
Operating System: 6.8.0-51-generic DISTRIB_ID=Ubuntu DISTRIB_RELEASE=24.04 DISTRIB_CODENAME=noble DISTRIB_DESCRIPTION="Ubuntu 24.04.1 LTS"
Computer Name: fhaweignbi
Result Size: 48.0 MB
Collection start time: 18:59:44 17/01/2025 UTC
Collection stop time: 19:00:04 17/01/2025 UTC
Collector Type: Driverless Perf system-wide sampling
CPU
Name: Intel(R) microarchitecture code named Raptorlake-DT
Frequency: 3.187 GHz
Logical CPU Count: 32
Cache Allocation Technology
Level 2 capability: not detected
Level 3 capability: not detected
If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <my_result_dir>.
Alternatively, you may view the report in the csv format: vtune -report
<report_name> -format=csv.
vtune: Executing actions 100 % done