Quantization vs. simplification in TopoJSON

Summary

Quantization removes information by reducing the precision of each coordinate, effectively snapping each point to a regular grid.

Simplification removes information by removing points applying a heuristic to make the least-perceptible change.

Mike Bostock's answer on SO

The total size of your geometry is controlled by two factors: the number of points and the number of digits (the precision) of each coordinate.

Say you have a large geometry with 1,000,000 points, where each two-dimensional point is represented as longitude in ±180° and latitude in ±90°:

[-90.07231180399987,29.501753271000098],[-90.06635619599979,29.499494248000133],…

Real numbers can have arbitrary precision (in JSON; in JavaScript they are limited by the precision of IEEE 754) and thus an infinite number of digits. But in practice the above is pretty typical, so say each coordinate has 18 digits. Including extra symbols ([, ] and ,), each point takes at most 1 + 18 + 1 + 18 + 1 = 39 bytes to encode in JSON, and the entire geometry is about 39 * 1,000,000 ≈ 39MB.

Now say we convert these real numbers to integers: both longitude and latitude are reduced to integers x and y where 0 ≤ x ≤ 99 and 0 ≤ y ≤ 99. A simple mapping between real-number points ⟨λ,φ⟩ and integer coordinates ⟨x,y⟩ is:

x = floor((λ + 180) / 360 * 100); y = floor((φ + 90) / 180 * 100);

Since each coordinate now takes at most 2 digits to encode, each point takes at most 1 + 2 + 1 + 2 + 1 = 7 bytes to encode in JSON, and the entire geometry is about 7MB; we reduced the total size by 82%.

Of course, nothing comes for free: if you remove too much information, you will no longer be able to display the geometry accurately. The rule of thumb is that the size of your grid should be at least twice as big as the largest expected display size for the entire map. For example, if you’re displaying a world map in a 960×500 pixel space, then the default 10,000×10,000 (-q 1e4) is a reasonable choice.

So, quantization removes information by reducing the precision of each coordinate, effectively snapping each point to a regular grid. This reduces the size of the generated TopoJSON file because each coordinate is represented as an integer (such as between 0 and 9,999) with fewer digits.

In contrast, simplification removes information by removing points, applying a heuristic that tries to measure the visual salience of each point and removing the least-noticeable points. There are many different methods of simplification, but the Visvalingam method used by the TopoJSON reference implementation is described in my Line Simplification article so I won’t repeat myself here.

While quantization and simplification address these two different types of information mostly independently, there’s an additional complication: quantization is applied before the topology is constructed, whereas simplification is necessarily applied after to preserve the topology. Since quantization frequently introduces coincident points ([24,62],[24,62],[24,62]…), and coincident points are removed, quantization can also remove points.

The reason that quantization is applied before the topology is constructed is that geometric inputs are often not topologically valid. For example, if you takes a shapefile of Nevada counties and combine it with a shapefile of Nevada’s state border, the coordinates in one shapefile might not exactly match the coordinates in the other shapefile. By quantizing the coordinates before constructing the topology, you snap the coordinates to a regular grid and can get a cleaner topology with fewer arcs, hopefully correctly identifying all shared arcs. (Of course, if you over-quantize, then you can cause too many coincident points and get self-intersecting arcs, which causes other problems.)

In a future release, maybe 1.5.0, TopoJSON will allow you to control the quantization before the topology is constructed independently from the quantization of the output TopoJSON file. Thus, you could use a finer grid (or no grid at all!) to compute the topology, then simplify, then use a coarser grid appropriate for a low-resolution screen display. For now, these are tied together, so I recommend using a finer grid (e.g., -q 1e6) that produces a clean topology, at the expense of a slightly larger file. Since TopoJSON also uses delta-encoded coordinates, you rarely pay the full price for all the digits anyway!

Since JSON numbers are base 10, you should use a grid size that’s a power of ten to make most efficient use of the encoding. Also, unless your input is known to be topologically valid you may want to use -q 1e5 or 1e6 to produce a cleaner topology. And lastly because your browser uses antialiasing when rendering, it can use subpixel positions and so it’s beneficial to use a grid size that’s (somewhat) finer than the pixel grid.

espinielli/README.md

Summary

Mike Bostock's answer on SO