Some people have mentioned that memcpy and memmove functions are hot
when profiling some WebAssembly benchmarks. Some examples:
I've been looking at perf profiles for wasm unity benchmark a bit recently and see that some of the hottest functions are doing memcpy or memset like things. If this is any indication of normal wasm code patterns, I think we could see significant improvement with an intrinsic so it may be worth prioritizing.
In a number of game engines I've been optimizing and benchmarking, interestingly the performance of memcpy() does show up relatively high in profiles. (~2%-5% of total execution time)
I implemented a prototype implementation of a move_memory instruction in v8 which just calls out
to v8's MemMove function. I compared
this to an implementation generated by emscripten and currently used in the Unity demo. This implementation aligns then performs copies using i32.load and i32.store. I've also included performance achieved by unrolling this loop manually and increasing the size to i64.
Each test copies size bytes from one address to another, non-overlapping. This is repeated N times. Each row copies a total of 1 Gib of data, and only touches 1 Mib of memory in the source and destination ranges.
This is the core loop:
let mask = Mib - 1;
let start = performance.now();
for (let i = 0; i < N; ++i) {
f(dst_base + dst, src_base + src, size);
dst = (dst + size) & mask;
src = (src + size) & mask;
}
let end = performance.now();
Here are the results on my machine (x86_64, 2.9GHz, L1 32k, L2 256k, L3 256k):
| intrinsic | i64 load/store x 4 | i64 load/store x 2 | i32 load/store x 2 | i32 load/store | |
|---|---|---|---|---|---|
| size=32b, N=33554432 | 1.382 Gib/s | 1.565 Gib/s | 1.493 Gib/s | 1.275 Gib/s | 1.166 Gib/s |
| size=64b, N=16777216 | 3.285 Gib/s | 2.669 Gib/s | 2.383 Gib/s | 1.861 Gib/s | 1.639 Gib/s |
| size=128b, N=8388608 | 6.162 Gib/s | 3.993 Gib/s | 3.480 Gib/s | 2.433 Gib/s | 2.060 Gib/s |
| size=256b, N=4194304 | 9.939 Gib/s | 5.323 Gib/s | 4.462 Gib/s | 2.724 Gib/s | 2.213 Gib/s |
| size=512b, N=2097152 | 15.777 Gib/s | 6.377 Gib/s | 4.913 Gib/s | 3.231 Gib/s | 2.457 Gib/s |
| size=1.0Kib, N=1048576 | 17.902 Gib/s | 7.010 Gib/s | 6.112 Gib/s | 3.568 Gib/s | 2.614 Gib/s |
| size=2.0Kib, N=524288 | 19.870 Gib/s | 8.248 Gib/s | 6.915 Gib/s | 3.764 Gib/s | 2.699 Gib/s |
| size=4.0Kib, N=262144 | 20.940 Gib/s | 9.145 Gib/s | 7.400 Gib/s | 3.871 Gib/s | 2.729 Gib/s |
| size=8.0Kib, N=131072 | 21.162 Gib/s | 9.258 Gib/s | 7.672 Gib/s | 3.925 Gib/s | 2.763 Gib/s |
| size=16.0Kib, N=65536 | 20.991 Gib/s | 9.758 Gib/s | 7.756 Gib/s | 3.945 Gib/s | 2.773 Gib/s |
| size=32.0Kib, N=32768 | 22.504 Gib/s | 9.956 Gib/s | 7.861 Gib/s | 3.966 Gib/s | 2.780 Gib/s |
| size=64.0Kib, N=16384 | 22.534 Gib/s | 10.088 Gib/s | 7.931 Gib/s | 3.974 Gib/s | 2.782 Gib/s |
| size=128.0Kib, N=8192 | 29.728 Gib/s | 10.032 Gib/s | 7.934 Gib/s | 3.975 Gib/s | 2.782 Gib/s |
| size=256.0Kib, N=4096 | 29.742 Gib/s | 10.116 Gib/s | 7.625 Gib/s | 3.886 Gib/s | 2.781 Gib/s |
| size=512.0Kib, N=2048 | 29.994 Gib/s | 10.090 Gib/s | 7.627 Gib/s | 3.985 Gib/s | 2.785 Gib/s |
| size=1.0Mib, N=1024 | 11.760 Gib/s | 10.091 Gib/s | 7.959 Gib/s | 3.989 Gib/s | 2.787 Gib/s |
This proposal introduces 2 new instructions:
move_memory:
Copy data from a source memory region to destination region; these regions may overlap: the copy is performed as if the source region was first copied to a temporary buffer, then the temporary buffer is copied to the destination region.
The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:
- top-2: destination address
- top-1: source address
- top-0: size of memory region in bytes
set_memory: Set all bytes in a memory region to a given byte.
The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:
- top-2: destination address
- top-1: byte value to set
- top-0: size of memory region in bytes
Unlike most other memory operations, the bulk operations do not have a memarg
immediate.
instr ::== ...
| move_memory
| set_memory
move_memory
- The memory
C.mems[0]must be defined in the context. - Then the instruction is valid with type
[i32 i32 i32] -> [].
set_memory
- The memory
C.mems[0]must be defined in the context. - Then the instruction is valid with type
[i32 i32 i32] -> [].
move_memory
- Let
Fbe the current frame. - Assert: due to validation,
F.module.memaddrs[0]exists. - Let
abe the memory addressF.module.memaddrs[0]. - Assert: due to validation,
S.mems[a]exists. - Let
membe the memory instanceS.mems[a]. - Assert: due to validation, a value of value type
i32is on the top of the stack. - Pop the value
i32.const nfrom the stack. - Assert: due to validation, a value of value type
i32is on the top of the stack. - Pop the value
i32.const sfrom the stack. - If
s + nis larger than the length ofmem.data, then:- Trap.
- Assert: due to validation, a value of value type
i32is on the top of the stack. - Pop the value
i32.const dfrom the stack. - If
d + nis larger than the length ofmem.data, then:- Trap.
- Let
b*be the byte sequencemem.data[s:n]. - Replace the bytes
mem.data[d:n]withb*.
set_memory
- Let
Fbe the current frame. - Assert: due to validation,
F.module.memaddrs[0]exists. - Let
abe the memory addressF.module.memaddrs[0]. - Assert: due to validation,
S.mems[a]exists. - Let
membe the memory instanceS.mems[a]. - Assert: due to validation, a value of value type
i32is on the top of the stack. - Pop the value
i32.const nfrom the stack. - Assert: due to validation, a value of value type
i32is on the top of the stack. - Pop the value
i32.const cfrom the stack. - Assert: due to validation, a value of value type
i32is on the top of the stack. - Pop the value
i32.const dfrom the stack. - If
d + nis larger than the length ofmem.data, then:- Trap.
- Let
c_wbe the result of computingwrap_{32,8}(c). - Let
b*be the byte sequence{bytes_i8(c_w)}^n. - Replace the bytes
mem.data[d:n]withb*.
instr ::= ...
| 0xC5 0x00 => move_memory
| 0xC6 0x00 => set_memory
Note that this skips 0xC0..0xC4 because those are currently proposed to be used for the
new sign-extending operators (see https://github.com/WebAssembly/threads/blob/master/proposals/threads/Overview.md#new-sign-extending-operators).
An immediate byte is included for future extensions. It currently must be zero.
plaininstr_I ::= ...
| `move_memory` => move_memory
| `set_memory` => set_memory
Why do we need operators? Couldn't the embedding just export memmove and meset functions?