Recently, I've been working with Emscripten, producing WebAssembly from C++. As a result, I've become curious about how much the wasm binary grows by including this or that code. Even though caring about code size seems to have gone out of fashion since we stopped moving programs around on floppy disks, it does still matter somewhat on the web, as every client has to download the binary, possibly over a cellular connection.
So, I was curious: Given a wasm binary of a certain size, can I determine how many bytes each source file contributed? There might be some tool that does this, but I didn't manage to find one. I did, however, find a way.
To have something concrete to discuss, I present this silly little program that populates an array with pseudo-random numbers, sorts them, and prints them out:
#include <vector>
#include <algorithm>
#include <cstdio>
#include <cstdlib>

int main() {
    std::vector<int> quux(25);
    srand(42);
    for(auto & e : quux) {
        e = rand();
    }
    std::sort(quux.begin(), quux.end());
    for(auto e : quux) {
        printf("%d\n", e);
    }
    return 0;
}
I compiled this with -O2 -g4, which produces a binary of 28 KB. The debug flag is needed to produce the information I will use. Running wasm-objdump produces the following info:
$ wasm-objdump -h main.wasm
main.wasm: file format wasm 0x1
Sections:
Type start=0x0000000b end=0x00000096 (size=0x0000008b) count: 20
Import start=0x00000099 end=0x00000208 (size=0x0000016f) count: 17
Function start=0x0000020a end=0x0000025b (size=0x00000051) count: 80
Global start=0x0000025d end=0x0000026c (size=0x0000000f) count: 2
Export start=0x0000026f end=0x000003d8 (size=0x00000169) count: 24
Elem start=0x000003da end=0x00000401 (size=0x00000027) count: 1
Code start=0x00000405 end=0x00006425 (size=0x00006020) count: 80
Data start=0x00006428 end=0x00006749 (size=0x00000321) count: 22
Custom start=0x0000674c end=0x0000706e (size=0x00000922) "name"
Custom start=0x00007070 end=0x0000708f (size=0x0000001f) "sourceMappingURL"
That is, the code segment is 24 KB and is located between 0x405 and 0x6425.
Emscripten can produce debug output (the -g4 flag) in the form of JavaScript sourcemaps, of which the specification can be found here.
The sourcemap for my silly program is a JSON file that looks like this:
{
    "version": 3,
    "sources": [
        "main.cpp",
        "emsdk/fastcomp/emscripten/system/include/libcxx/vector",
        "emsdk/fastcomp/emscripten/system/include/libcxx/iterator",
        "emsdk/fastcomp/emscripten/system/include/libcxx/algorithm",
        "emsdk/fastcomp/emscripten/system/include/libcxx/memory",
        "emsdk/fastcomp/emscripten/system/include/libcxx/stdexcept",
        "emsdk/fastcomp/emscripten/system/include/libcxx/new"
    ],
    "names": [],
    "mappings": "2yEAMA,MACA,IC+9CA,OAgBA,cCAA,SF7+CA,...,SKhtBA,OA0CA"
}
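To give a feel for the outer structure of the mappings field, here is a minimal sketch: segments are separated by commas and generated-code lines by semicolons. The excerpt below is shortened and made up for illustration.

```python
# Split a mappings string into per-line lists of segments.
# ';' separates generated-code lines, ',' separates segments.
mappings = "2yEAMA,MACA,cCAA;SF7+CA"  # made-up excerpt

lines = [line.split(",") if line else [] for line in mappings.split(";")]
print(lines)  # [['2yEAMA', 'MACA', 'cCAA'], ['SF7+CA']]
```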
The list of sources is pretty obvious, but the mappings field is a sequence of segments mapping positions in the generated code to positions in the source files. The actual encoding is interesting, and is well explained here.
A segment contains (up to) five pieces of information:
- A starting column in the generated code.
- A source file index.
- A starting line in the source file.
- A starting column in the source file.
- A symbol index.
The line number in the generated code is implicit, as the segments are emitted line by line and the next line is encoded as ;.
The wasm sourcemaps appear to interpret the wasm-binary as a single line of bytes, where a column in the generated code is a byte offset. I interpreted a segment to start at its starting position and continue until the next segment starts.
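Under that interpretation, a segment's byte size is simply the distance to the next segment's start. A tiny sketch with made-up offsets:

```python
# Segment start offsets (made up); each segment runs until the next
# one begins, so the sizes are the pairwise differences.
starts = [0x514, 0x520, 0x537, 0x540]
sizes = [b - a for a, b in zip(starts, starts[1:])]
print(sizes)  # [12, 23, 9]
```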
Thus, my idea was to parse the sourcemap, run through the segments, and aggregate the segment lengths (i.e. segment byte sizes) on a source-by-source basis. I made a small Python script (see the end of the post) that does this, and it produces the following:
7 bytes emsdk/fastcomp/emscripten/system/include/libcxx/memory
8 bytes emsdk/fastcomp/emscripten/system/include/libcxx/algorithm
100 bytes emsdk/fastcomp/emscripten/system/include/libcxx/new
296 bytes emsdk/fastcomp/emscripten/system/include/libcxx/iterator
365 bytes main.cpp
Accounted for 776 bytes
Range is [0x514, 0x81c] (length is 776)
The range covered by the segments lies within the code segment (between 0x405 and 0x6425); however, only 776 bytes of the 24 KB are covered. This is probably some constant overhead that grows relatively large for such a small program; when I run the script on a decently sized application, the coverage grows above 90%.
Update: Looking at the disassembly, I discovered that the uncovered range is the wasm side of the C library. In hindsight, not very surprising.
Now, there is not too much to learn from such a silly example. But on larger code bases there is some insight to be had. In my experience, the byte count for a cpp-file corresponds roughly to the reduction in wasm size when that file is removed from the build. Thus, one can get an idea of the cost of a feature by summing the byte counts of the cpp-files belonging to that feature.
Header files are less conclusive. However, if a header file comes up high in the list, it might indicate that it gets inlined a lot and that the inlining produces a lot of code. So this might highlight code that could benefit from being moved from a header file into a cpp-file.
To conclude, parsing the sourcemaps in this way has at least given me some insight into how much each feature of an application costs byte-wise. It is pretty simple, and, for all I know, it might be a well-known standard approach I just didn't know about. But I found it interesting.
Here is the source code for the parser/aggregator. It is in no way polished and I am not a great Python programmer, so be warned that it will probably crash and burn horribly.
import json


class WasmBlamer:
    def _debase64(self, c):
        v = ord(c)
        if ord('A') <= v <= ord('Z'):
            return v - ord('A')
        elif ord('a') <= v <= ord('z'):
            return v - ord('a') + 26
        elif ord('0') <= v <= ord('9'):
            return v - ord('0') + 52
        elif v == ord('+'):
            return 62
        elif v == ord('/'):
            return 63
        assert False, '"' + c + '" is not a valid base64 character'

    def _decodeVLQ(self, array):
        rv = []
        while array:
            e = self._debase64(array[0])
            array.pop(0)
            # The first sextet holds the sign bit and four value bits.
            sign = 1 if (e & 0x1) == 0 else -1
            value = (e >> 1) & 0xF
            offset = 4
            # Bit 0x20 is the continuation bit; each following sextet
            # holds five more value bits.
            while e & 0x20:
                e = self._debase64(array[0])
                array.pop(0)
                assert value < (1 << offset)
                value = value | ((e & 0x1f) << offset)
                offset += 5
            rv.append(sign * value)
        return rv

    def _handleSegment(self, segment):
        dst_col_next = 0
        src_col_next = 0
        src_row_next = 0
        src_next = -1
        if segment:
            vlq = self._decodeVLQ(segment)
            dst_col_next = self.dst_col + vlq[0]
            if self.dst_col_min < 0 or dst_col_next < self.dst_col_min:
                self.dst_col_min = dst_col_next
            self.dst_col_max = max(self.dst_col_max, dst_col_next)
            if 4 <= len(vlq):
                src_next = max(0, self.src) + vlq[1]
                src_row_next = self.src_row + vlq[2]
                src_col_next = self.src_col + vlq[3]
            if 5 <= len(vlq):
                pass  # ignore symbols for now
        # Attribute the bytes up to this segment's start to the
        # previous segment's source file.
        if 0 <= self.src:
            self.blame[self.src] += max(0, dst_col_next - self.dst_col)
        self.dst_col = dst_col_next
        self.src_col = src_col_next
        self.src_row = src_row_next
        self.src = src_next

    def _printReport(self):
        report = list(zip(self.blame, self.sources))
        report.sort(key=lambda e: e[0])
        total = 0
        for item in report:
            print("%8d bytes %s" % (item[0], item[1]))
            total += item[0]
        print("\nAccounted for %d bytes" % total)
        print("Range is [%#0x, %#0x] (length is %d)" % (self.dst_col_min, self.dst_col_max,
                                                        self.dst_col_max - self.dst_col_min))

    def _processMappings(self):
        segment = []
        for c in self.mappings:
            if c == ',' or c == ';':
                self._handleSegment(segment)
                segment.clear()
                if c == ';':
                    self.dst_col = 0
                    self.dst_row += 1
            else:
                segment.append(c)
        self._handleSegment(segment)  # flush the final segment

    def process(self, path):
        with open(path) as file:
            sourcemap = json.load(file)
        self.sources = sourcemap['sources']
        self.names = sourcemap['names']
        self.mappings = sourcemap['mappings']
        self.dst_col_min = -1
        self.dst_col_max = 0
        self.dst_col = 0
        self.dst_row = 0
        self.src = -1
        self.src_col = 0
        self.src_row = 0
        self.blame = [0] * len(self.sources)
        self._processMappings()
        self._printReport()


if __name__ == "__main__":
    import sys
    if len(sys.argv) != 2:
        print("Usage: %s <binary>.wasm.map" % sys.argv[0], file=sys.stderr)
        sys.exit(-1)
    mapfile = sys.argv[1]
    blamer = WasmBlamer()
    blamer.process(mapfile)
Thanks for the link!
In theory, such tools should work. I tried source-map-explorer on some wasm files, and it does consume the sourcemap. However, it doesn't manage to tie bytes to sources; it just reports everything as 100% unmapped. My guess is that the tool assumes something about the structure of source file names (npm packages, I guess?), and since these are all .h and .cpp files, things break.