Recently, I've been working with Emscripten, producing WebAssembly from C++. As a result, I've become curious about how much the wasm binary grows by including this or that code. Even though caring about code size seems to have gone out of fashion since we stopped moving programs around on floppy disks, it does still matter somewhat on the web, as every client has to download the binary, possibly over a cellular connection.
So, I was curious: Given a wasm binary of a certain size, can I determine how many bytes each source file contributed? There might be some tool that does this, but I didn't manage to find one. I did, however, find a way.
To have something concrete to discuss, I present this silly little program that populates an array with pseudo-random numbers, sorts them, and prints them out:
#include <vector>
#include <algorithm>
#include <cstdio>
#include <cstdlib>

int main() {
    std::vector<int> quux(25);
    srand(42);
    for(auto & e : quux) {
        e = rand();
    }
    std::sort(quux.begin(), quux.end());
    for(auto e : quux) {
        printf("%d\n", e);
    }
    return 0;
}
I compiled this with -O2 -g4, which produces a binary of 28 KB. The debug flag is needed to produce the information I will use. Running wasm-objdump produces the following info:
$ wasm-objdump -h main.wasm
main.wasm: file format wasm 0x1
Sections:
Type start=0x0000000b end=0x00000096 (size=0x0000008b) count: 20
Import start=0x00000099 end=0x00000208 (size=0x0000016f) count: 17
Function start=0x0000020a end=0x0000025b (size=0x00000051) count: 80
Global start=0x0000025d end=0x0000026c (size=0x0000000f) count: 2
Export start=0x0000026f end=0x000003d8 (size=0x00000169) count: 24
Elem start=0x000003da end=0x00000401 (size=0x00000027) count: 1
Code start=0x00000405 end=0x00006425 (size=0x00006020) count: 80
Data start=0x00006428 end=0x00006749 (size=0x00000321) count: 22
Custom start=0x0000674c end=0x0000706e (size=0x00000922) "name"
Custom start=0x00007070 end=0x0000708f (size=0x0000001f) "sourceMappingURL"
That is, the code segment is 24 KB and is located between 0x405 and 0x6425.
Emscripten can produce debug output (the -g4 flag) in the form of JavaScript sourcemaps, of which the specification can be found here.
The sourcemap for my silly program is a JSON file that looks like this:
{
    "version": 3,
    "sources": [
        "main.cpp",
        "emsdk/fastcomp/emscripten/system/include/libcxx/vector",
        "emsdk/fastcomp/emscripten/system/include/libcxx/iterator",
        "emsdk/fastcomp/emscripten/system/include/libcxx/algorithm",
        "emsdk/fastcomp/emscripten/system/include/libcxx/memory",
        "emsdk/fastcomp/emscripten/system/include/libcxx/stdexcept",
        "emsdk/fastcomp/emscripten/system/include/libcxx/new"
    ],
    "names": [],
    "mappings": "2yEAMA,MACA,IC+9CA,OAgBA,cCAA,SF7+CA,...,SKhtBA,OA0CA"
}
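To give a feel for the outer structure of the mappings field, here is a minimal sketch: segments are separated by commas and generated-code lines by semicolons. The excerpt below is shortened and made up for illustration.

```python
# Split a mappings string into per-line lists of segments.
# ';' separates generated-code lines, ',' separates segments.
mappings = "2yEAMA,MACA,cCAA;SF7+CA"  # made-up excerpt

lines = [line.split(",") if line else [] for line in mappings.split(";")]
print(lines)  # [['2yEAMA', 'MACA', 'cCAA'], ['SF7+CA']]
```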
The list of sources is pretty obvious, but the mappings field is a sequence of segments mapping positions in the generated code to positions in the source files. The actual encoding is interesting, and is well explained here.
A segment contains (up to) five pieces of information:
- A starting column in the generated code.
- A source file index.
- A starting line in the source file.
- A starting column in the source file.
- A symbol index.
The line number in the generated code is implicit, as the segments are emitted line by line and the next line is encoded as ;.
The wasm sourcemaps appear to interpret the wasm-binary as a single line of bytes, where a column in the generated code is a byte offset. I interpreted a segment to start at its starting position and continue until the next segment starts.
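Under that interpretation, a segment's byte size is simply the distance to the next segment's start. A tiny sketch with made-up offsets:

```python
# Segment start offsets (made up); each segment runs until the next
# one begins, so the sizes are the pairwise differences.
starts = [0x514, 0x520, 0x537, 0x540]
sizes = [b - a for a, b in zip(starts, starts[1:])]
print(sizes)  # [12, 23, 9]
```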
Thus, my idea was to parse the sourcemap, run through the segments, and aggregate the segment lengths (i.e. segment byte sizes) on a source-by-source basis. I made a small Python script (see the end of the post) that does this, and it produces the following:
7 bytes emsdk/fastcomp/emscripten/system/include/libcxx/memory
8 bytes emsdk/fastcomp/emscripten/system/include/libcxx/algorithm
100 bytes emsdk/fastcomp/emscripten/system/include/libcxx/new
296 bytes emsdk/fastcomp/emscripten/system/include/libcxx/iterator
365 bytes main.cpp
Accounted for 776 bytes
Range is [0x514, 0x81c] (length is 776)
The range covered by the segments lies within the code segment (between 0x405 and 0x6425); however, only 776 bytes of the 24 KB are covered. This is probably some constant overhead that grows relatively large for such a small program; when I run the script on a decently sized application, the coverage grows above 90%.
Update: Looking at the disassembly, I discovered that the uncovered range is the wasm side of the C library. In hindsight, not very surprising.
Now, there is not too much to learn from such a silly example. But on larger code bases there is some insight to be had. In my experience, the byte count for a cpp-file corresponds roughly to the reduction in wasm size when that file is removed from the build. Thus, one can get an idea of the cost of a feature by summing the byte counts of the cpp-files belonging to that feature.
Header files are less conclusive. However, if a header file comes up high in the list, it might indicate that it gets inlined a lot and that the inlining produces a lot of code. So this might highlight code that could benefit from being moved from a header file into a cpp-file.
To conclude, parsing the sourcemaps in this way has at least given me some insight into how much each feature of an application costs byte-wise. It is pretty simple, and, for all I know, it might be a well-known standard approach I just didn't know about. But I found it interesting.
Here is the source code for the parser/aggregator. It is in no way polished and I am not a great Python programmer, so be warned that it will probably crash and burn horribly.
import json


class WasmBlamer:
    def _debase64(self, c):
        v = ord(c)
        if ord('A') <= v <= ord('Z'):
            return v - ord('A')
        elif ord('a') <= v <= ord('z'):
            return v - ord('a') + 26
        elif ord('0') <= v <= ord('9'):
            return v - ord('0') + 52
        elif v == ord('+'):
            return 62
        elif v == ord('/'):
            return 63
        assert False, '"' + c + '" is not a valid base64 character'

    def _decodeVLQ(self, array):
        rv = []
        while array:
            e = self._debase64(array[0])
            array.pop(0)
            # The first sextet holds the sign bit and four value bits.
            sign = 1 if (e & 0x1) == 0 else -1
            value = (e >> 1) & 0xF
            offset = 4
            # Bit 0x20 is the continuation bit; each following sextet
            # holds five more value bits.
            while e & 0x20:
                e = self._debase64(array[0])
                array.pop(0)
                assert value < (1 << offset)
                value = value | ((e & 0x1f) << offset)
                offset += 5
            rv.append(sign * value)
        return rv

    def _handleSegment(self, segment):
        dst_col_next = 0
        src_col_next = 0
        src_row_next = 0
        src_next = -1
        if segment:
            vlq = self._decodeVLQ(segment)
            dst_col_next = self.dst_col + vlq[0]
            if self.dst_col_min < 0 or dst_col_next < self.dst_col_min:
                self.dst_col_min = dst_col_next
            self.dst_col_max = max(self.dst_col_max, dst_col_next)
            if 4 <= len(vlq):
                src_next = max(0, self.src) + vlq[1]
                src_row_next = self.src_row + vlq[2]
                src_col_next = self.src_col + vlq[3]
            if 5 <= len(vlq):
                pass  # ignore symbols for now
        # Attribute the bytes up to this segment's start to the
        # previous segment's source file.
        if 0 <= self.src:
            self.blame[self.src] += max(0, dst_col_next - self.dst_col)
        self.dst_col = dst_col_next
        self.src_col = src_col_next
        self.src_row = src_row_next
        self.src = src_next

    def _printReport(self):
        report = list(zip(self.blame, self.sources))
        report.sort(key=lambda e: e[0])
        total = 0
        for item in report:
            print("%8d bytes %s" % (item[0], item[1]))
            total += item[0]
        print("\nAccounted for %d bytes" % total)
        print("Range is [%#0x, %#0x] (length is %d)" % (self.dst_col_min, self.dst_col_max,
                                                        self.dst_col_max - self.dst_col_min))

    def _processMappings(self):
        segment = []
        for c in self.mappings:
            if c == ',' or c == ';':
                self._handleSegment(segment)
                segment.clear()
                if c == ';':
                    self.dst_col = 0
                    self.dst_row += 1
            else:
                segment.append(c)
        self._handleSegment(segment)  # flush the final segment

    def process(self, path):
        with open(path) as file:
            sourcemap = json.load(file)
        self.sources = sourcemap['sources']
        self.names = sourcemap['names']
        self.mappings = sourcemap['mappings']
        self.dst_col_min = -1
        self.dst_col_max = 0
        self.dst_col = 0
        self.dst_row = 0
        self.src = -1
        self.src_col = 0
        self.src_row = 0
        self.blame = [0] * len(self.sources)
        self._processMappings()
        self._printReport()


if __name__ == "__main__":
    import sys
    if len(sys.argv) != 2:
        print("Usage: %s <binary>.wasm.map" % sys.argv[0], file=sys.stderr)
        sys.exit(-1)
    mapfile = sys.argv[1]
    blamer = WasmBlamer()
    blamer.process(mapfile)
Thanks for the link!
In theory, such tools should work. I tried source-map-explorer on some wasm files, and it does consume the sourcemap. However, it doesn't manage to tie bytes to sources; it just reports everything as 100% unmapped. My guess is that the tool assumes something about the structure of source file names (npm packages, I guess?), and since these are all .h and .cpp files, things break.