Some people have mentioned that memcpy
and memmove
functions are hot
when profiling some WebAssembly benchmarks. Some examples:
I've been looking at perf profiles for wasm unity benchmark a bit recently and see that some of the hottest functions are doing memcpy or memset like things. If this is any indication of normal wasm code patterns, I think we could see significant improvement with an intrinsic so it may be worth prioritizing.
In a number of game engines I've been optimizing and benchmarking, interestingly the performance of memcpy() does show up relatively high in profiles. (~2%-5% of total execution time)
I implemented a prototype implementation of a move_memory
instruction in v8 which just calls out
to v8's MemMove
function. I compared
this to an implementation generated by emscripten and currently used in the Unity demo. This implementation aligns then performs copies using i32.load
and i32.store
. I've also included performance achieved by unrolling this loop manually and increasing the size to i64
.
Each test copies size
bytes from one address to another, non-overlapping. This is repeated N
times. Each row copies a total of 1 Gib of data, and only touches 1 Mib of memory in the source and destination ranges.
This is the core loop:
let mask = Mib - 1;
let start = performance.now();
for (let i = 0; i < N; ++i) {
f(dst_base + dst, src_base + src, size);
dst = (dst + size) & mask;
src = (src + size) & mask;
}
let end = performance.now();
Here are the results on my machine (x86_64, 2.9GHz, L1 32k, L2 256k, L3 256k):
intrinsic | i64 load/store x 4 | i64 load/store x 2 | i32 load/store x 2 | i32 load/store | |
---|---|---|---|---|---|
size=32b, N=33554432 | 1.382 Gib/s | 1.565 Gib/s | 1.493 Gib/s | 1.275 Gib/s | 1.166 Gib/s |
size=64b, N=16777216 | 3.285 Gib/s | 2.669 Gib/s | 2.383 Gib/s | 1.861 Gib/s | 1.639 Gib/s |
size=128b, N=8388608 | 6.162 Gib/s | 3.993 Gib/s | 3.480 Gib/s | 2.433 Gib/s | 2.060 Gib/s |
size=256b, N=4194304 | 9.939 Gib/s | 5.323 Gib/s | 4.462 Gib/s | 2.724 Gib/s | 2.213 Gib/s |
size=512b, N=2097152 | 15.777 Gib/s | 6.377 Gib/s | 4.913 Gib/s | 3.231 Gib/s | 2.457 Gib/s |
size=1.0Kib, N=1048576 | 17.902 Gib/s | 7.010 Gib/s | 6.112 Gib/s | 3.568 Gib/s | 2.614 Gib/s |
size=2.0Kib, N=524288 | 19.870 Gib/s | 8.248 Gib/s | 6.915 Gib/s | 3.764 Gib/s | 2.699 Gib/s |
size=4.0Kib, N=262144 | 20.940 Gib/s | 9.145 Gib/s | 7.400 Gib/s | 3.871 Gib/s | 2.729 Gib/s |
size=8.0Kib, N=131072 | 21.162 Gib/s | 9.258 Gib/s | 7.672 Gib/s | 3.925 Gib/s | 2.763 Gib/s |
size=16.0Kib, N=65536 | 20.991 Gib/s | 9.758 Gib/s | 7.756 Gib/s | 3.945 Gib/s | 2.773 Gib/s |
size=32.0Kib, N=32768 | 22.504 Gib/s | 9.956 Gib/s | 7.861 Gib/s | 3.966 Gib/s | 2.780 Gib/s |
size=64.0Kib, N=16384 | 22.534 Gib/s | 10.088 Gib/s | 7.931 Gib/s | 3.974 Gib/s | 2.782 Gib/s |
size=128.0Kib, N=8192 | 29.728 Gib/s | 10.032 Gib/s | 7.934 Gib/s | 3.975 Gib/s | 2.782 Gib/s |
size=256.0Kib, N=4096 | 29.742 Gib/s | 10.116 Gib/s | 7.625 Gib/s | 3.886 Gib/s | 2.781 Gib/s |
size=512.0Kib, N=2048 | 29.994 Gib/s | 10.090 Gib/s | 7.627 Gib/s | 3.985 Gib/s | 2.785 Gib/s |
size=1.0Mib, N=1024 | 11.760 Gib/s | 10.091 Gib/s | 7.959 Gib/s | 3.989 Gib/s | 2.787 Gib/s |
This proposal introduces 2 new instructions:
move_memory
:
Copy data from a source memory region to destination region; these regions may overlap: the copy is performed as if the source region was first copied to a temporary buffer, then the temporary buffer is copied to the destination region.
The instruction has the signature [i32 i32 i32] -> []
. The parameters are, in order:
- top-2: destination address
- top-1: source address
- top-0: size of memory region in bytes
set_memory
: Set all bytes in a memory region to a given byte.
The instruction has the signature [i32 i32 i32] -> []
. The parameters are, in order:
- top-2: destination address
- top-1: byte value to set
- top-0: size of memory region in bytes
Unlike most other memory operations, the bulk operations do not have a memarg
immediate.
instr ::== ...
| move_memory
| set_memory
move_memory
- The memory
C.mems[0]
must be defined in the context. - Then the instruction is valid with type
[i32 i32 i32] -> []
.
set_memory
- The memory
C.mems[0]
must be defined in the context. - Then the instruction is valid with type
[i32 i32 i32] -> []
.
move_memory
- Let
F
be the current frame. - Assert: due to validation,
F.module.memaddrs[0]
exists. - Let
a
be the memory addressF.module.memaddrs[0]
. - Assert: due to validation,
S.mems[a]
exists. - Let
mem
be the memory instanceS.mems[a]
. - Assert: due to validation, a value of value type
i32
is on the top of the stack. - Pop the value
i32.const n
from the stack. - Assert: due to validation, a value of value type
i32
is on the top of the stack. - Pop the value
i32.const s
from the stack. - If
s + n
is larger than the length ofmem.data
, then:- Trap.
- Assert: due to validation, a value of value type
i32
is on the top of the stack. - Pop the value
i32.const d
from the stack. - If
d + n
is larger than the length ofmem.data
, then:- Trap.
- Let
b*
be the byte sequencemem.data[s:n]
. - Replace the bytes
mem.data[d:n]
withb*
.
set_memory
- Let
F
be the current frame. - Assert: due to validation,
F.module.memaddrs[0]
exists. - Let
a
be the memory addressF.module.memaddrs[0]
. - Assert: due to validation,
S.mems[a]
exists. - Let
mem
be the memory instanceS.mems[a]
. - Assert: due to validation, a value of value type
i32
is on the top of the stack. - Pop the value
i32.const n
from the stack. - Assert: due to validation, a value of value type
i32
is on the top of the stack. - Pop the value
i32.const c
from the stack. - Assert: due to validation, a value of value type
i32
is on the top of the stack. - Pop the value
i32.const d
from the stack. - If
d + n
is larger than the length ofmem.data
, then:- Trap.
- Let
c_w
be the result of computingwrap_{32,8}(c)
. - Let
b*
be the byte sequence{bytes_i8(c_w)}^n
. - Replace the bytes
mem.data[d:n]
withb*
.
instr ::= ...
| 0xC5 0x00 => move_memory
| 0xC6 0x00 => set_memory
Note that this skips 0xC0..0xC4
because those are currently proposed to be used for the
new sign-extending operators (see https://github.com/WebAssembly/threads/blob/master/proposals/threads/Overview.md#new-sign-extending-operators).
An immediate byte is included for future extensions. It currently must be zero.
plaininstr_I ::= ...
| `move_memory` => move_memory
| `set_memory` => set_memory
Why do we need operators? Couldn't the embedding just export memmove and meset functions?