Recently @dragostis at Google released an experimental vector graphics renderer called Forma.
The renderer has a pretty cool set of goals: portability, performance, simplicity, and size. Graphics and GPU computation models are a topic that I'm pretty interested in learning more about personally, and this project seems like an especially accessible / well-written codebase to learn from.
> I'm very happy to see this work! The era of rendering vector graphics in GPU compute shaders is upon us, and I have no doubt we'll start seeing these in production soon, as there's just such a performance advantage over CPU rendering, and I believe trying to run vector 2D graphics through the GPU rasterization pipeline doesn't quite work.
>
> This code is simpler than Vello (the new name for piet-gpu), focused on vector path rendering. It's also a strong demo of the power of WebGPU, while also having a performant software-only pipeline. I definitely encourage people to take a closer look.
>
> —Raph Levien
Here are my personal notes from reading through Forma today; the codebase is about 13,000 lines of Rust.
I have a good background in Rust, systems programming, and graphics. I don't know much about GPUs outside of writing OpenGL shaders though, so I prepared by reading some code and other resources in the last few days. The most relevant ones to this exploration were:
- GPU-Puzzles, a set of puzzles teaching the CUDA execution model.
- Learn WGPU, a classic graphics tutorial, except written in Rust/WGPU instead of C++/OpenGL.
I've also been reading other libraries recently like JAX, GPU.js, and hopefully soon Rust-GPU. Those aren't very related to Forma though.
- The unique design goal is parallelization through GPU compute shaders, rather than the traditional "tessellate on CPU, then shade on GPU" pipeline (e.g., Skia). For CPU code, they use rayon's work-stealing model, plus SIMD they implemented themselves with unsafe architecture-specific intrinsics. For GPU, the code needs to be cross-platform, so they use WGPU.
- It's all one crate. The `forma/src` folder is ~13K lines of Rust code, not counting blank lines or comments, although a lot of those lines are unit tests. Some files are thousands of lines long.
- The codebase liberally uses semantic newtype definitions. It's nice!
- About half of the code is in the `cpu` folder. I think that's because it includes definitions of optimized render buffers, caches, and layouts. Those are always used regardless of backend.
- The `gpu` folder is pretty small, only 1.5K lines of Rust code and ~1.3K lines of WGSL. The WGSL shaders each implement one `@compute` utility: rasterize, sort, and paint.
- There is a styling library and a math library. They seem to be pretty well-designed.
- The SVG rendering example is a pretty good, clear API demo.
- The high-level flow has a few main parts.
- The single drawing primitive is a `Path`, which is similar to an SVG Path. It starts at a point and can move to other points with a line, quadratic Bézier, or cubic Bézier.
- You insert multiple paths into a `Layer`, which internally merges all the paths together into a single geometry. A layer can have `Props` and an affine transform attached. The props specify how the layer is drawn: fill rule, fill (solid, gradient, or texture) + blend mode, or clip.
- A `Composition` is a stack of layers numbered in order from back to front.
  - This is the top-level object in Forma.
  - There's a `use_composition` dirty-flag trick: the flag is invalidated every time a change to the rendering happens, so the drawing is only updated after a resize or other operation.
  - Why does the composition need to be reused? It includes caches for props and layers.
- The composition can be rendered using one of two entry point functions, on the `cpu::renderer::Renderer` or the `gpu::renderer::Renderer` object. (A toy sketch of this object model in plain Rust follows this list.)
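To keep this object model straight in my head, here's a toy sketch of the hierarchy in plain Rust. These are not Forma's actual types or API, just my simplified mental model of the concepts above:

```rust
// A toy model of Forma's object hierarchy. These are NOT the library's
// real definitions; names and fields are simplified for illustration.

/// One drawing command within a path, always ending at a new point.
enum PathCommand {
    LineTo([f32; 2]),
    QuadTo([f32; 2], [f32; 2]),            // control point, end point
    CubicTo([f32; 2], [f32; 2], [f32; 2]), // two control points, end point
}

/// The single drawing primitive: a start point plus Bézier commands.
struct Path {
    start: [f32; 2],
    commands: Vec<PathCommand>,
}

enum FillRule {
    NonZero,
    EvenOdd,
}

enum Fill {
    Solid([f32; 4]), // RGBA color
    Gradient,        // color stops omitted
    Texture,         // image reference omitted
}

/// How a layer is drawn (blend mode and clip omitted for brevity).
struct Props {
    fill_rule: FillRule,
    fill: Fill,
}

/// Multiple paths merged into a single geometry, plus props and a transform.
struct Layer {
    paths: Vec<Path>,
    props: Props,
    transform: [f32; 6], // 2x3 affine matrix
}

/// The top-level object: a stack of layers ordered back to front.
struct Composition {
    layers: Vec<Layer>,
}
```

The real types additionally carry the caches and optimized buffers mentioned above, which is why the composition is meant to be reused across frames.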
- Pipeline: curves -> segments -> pixel segments -> sorted pixel segments -> painted tiles
- This step (curves → line segments) is done fully on the CPU. Each `Composition` has a `SegmentBuffer`, which stores the line segments making up each layer's paths in an optimized way. These line segments are tagged with a `GeomId` for their layer.
- This is also the stage where transforms are applied, once per line segment.
- There is a curvature-adaptive splitting algorithm used for Béziers (a sketch of the idea follows this list).
- At this point, all of the remaining steps work directly with individual pixels and are handled by the `Renderer` backend, for performance reasons.
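I didn't dig into Forma's exact flattening criterion, but here's a minimal sketch of the general curvature-adaptive idea for quadratic Béziers, assuming the usual recursive-subdivision approach (which may differ from what Forma actually does): split the curve until each piece is flat enough to replace with a line segment.

```rust
// Minimal curvature-adaptive flattening sketch; not Forma's algorithm.

type Point = [f32; 2];

fn midpoint(a: Point, b: Point) -> Point {
    [(a[0] + b[0]) * 0.5, (a[1] + b[1]) * 0.5]
}

/// Distance from the control point to the p0-p2 chord: a cheap flatness
/// proxy, since flat curves have their control point near that line.
/// (Degenerate chords are treated as flat here.)
fn flatness(p0: Point, p1: Point, p2: Point) -> f32 {
    let dx = p2[0] - p0[0];
    let dy = p2[1] - p0[1];
    ((p1[0] - p0[0]) * dy - (p1[1] - p0[1]) * dx).abs()
        / (dx * dx + dy * dy).sqrt().max(1e-6)
}

/// Emit line-segment endpoints approximating the quadratic Bézier (p0, p1, p2).
fn flatten_quad(p0: Point, p1: Point, p2: Point, tolerance: f32, out: &mut Vec<Point>) {
    if flatness(p0, p1, p2) <= tolerance {
        out.push(p2); // flat enough: one line segment to the endpoint
    } else {
        // de Casteljau split at t = 0.5 into two smaller quadratics.
        let q0 = midpoint(p0, p1);
        let q1 = midpoint(p1, p2);
        let mid = midpoint(q0, q1);
        flatten_quad(p0, q0, mid, tolerance, out);
        flatten_quad(mid, q1, p2, tolerance, out);
    }
}
```

The tolerance controls the trade-off: high-curvature regions get split more times than nearly straight ones, so the segment count adapts to the curve instead of being fixed.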
- Rendering happens in small rectangular chunks called tiles: 16x16 px on the CPU or 16x4 px on the GPU.
- A pixel segment is the intersection of a line segment with a pixel. In other words, the line segment can cover only part of a pixel; this is effectively anti-aliasing.
- These are stored in the Rust code as a densely packed 64-bit number. It looks like the GPU / `rasterizer.wgsl` code uses the exact same 64-bit representation, having to pack segments with bitshifts and other logic.
- The GPU pass dispatches up to 16 parallel workgroups, each with a `@workgroup_size` of 256 threads; each thread takes 1 / (16 * 256) of the segments.
- The output buffer is pre-allocated to the exact correct size based on how many pixels each line segment passes through. This is computed using the "Manhattan segment length": basically, 1 plus the number of horizontal and vertical pixel boundaries it crosses, which can be computed really quickly in parallel with rayon (sketched after this list).
- Since this is purely geometric, we can ignore the styles and clip rules at this stage.
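The "Manhattan segment length" counting rule is easy to convince yourself of with a few lines of code. This is my own paraphrase of the rule as stated above, not Forma's implementation:

```rust
// Count pixel segments produced by one line segment: 1 plus the number of
// horizontal and vertical pixel boundaries it crosses. (The convention for
// endpoints landing exactly on a boundary is glossed over here.)

/// Number of integer grid lines between a and b on one axis.
fn boundaries_crossed(a: f32, b: f32) -> u32 {
    let (lo, hi) = if a < b { (a, b) } else { (b, a) };
    (hi.floor() as i64 - lo.floor() as i64).max(0) as u32
}

/// Pixel-segment count for a line from p0 to p1.
fn manhattan_length(p0: [f32; 2], p1: [f32; 2]) -> u32 {
    1 + boundaries_crossed(p0[0], p1[0]) + boundaries_crossed(p0[1], p1[1])
}

fn main() {
    // Crosses x = 1 and x = 2, so it touches pixels 0, 1, and 2: three segments.
    assert_eq!(manhattan_length([0.5, 0.5], [2.5, 0.5]), 3);
}
```

Each segment's count is independent of every other segment's, so summing them (and presumably prefix-summing to get per-segment write offsets) parallelizes trivially, which is where rayon comes in.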
- My first question: what does it even mean to sort pixel segments? Recall that they are stored as 64-bit integers, packed with several fields. This step's goal is to literally just sort them as a `&[u64]` array, i.e., as integers (a small demo of why this works follows this list).
- The CPU implementation is only one line of code. They use the crumsort library's parallel sort implementation. Crumsort is an adaptive quicksort-mergesort hybrid.
- For the GPU, it is more complex. There is a folder called `conveyor_sort` and a file called `sort.wgsl`. I don't know how this works. At a high level, it appears to sort "blocks" of length 64 * 9, then merge them together over a logarithmic number of iterations.
- Seems like some kind of parallel merge sort? Searching online suggests that's common on GPUs. I don't have much background on this.
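Whatever the merge strategy is, the key insight is simpler: because the most significant bits of each packed segment are (TileY, TileX, LayerId), a plain integer sort groups segments by tile, then by layer within each tile. A small demonstration, with made-up field widths (Forma's real bit layout is different):

```rust
// Demo: sorting packed u64 keys groups pixel segments by tile, then layer.
// Field widths here are invented for illustration.

const TILE_Y_SHIFT: u64 = 48;
const TILE_X_SHIFT: u64 = 32;
const LAYER_SHIFT: u64 = 16;

fn pack(tile_y: u64, tile_x: u64, layer: u64, local: u64) -> u64 {
    (tile_y << TILE_Y_SHIFT) | (tile_x << TILE_X_SHIFT) | (layer << LAYER_SHIFT) | local
}

fn main() {
    let mut segments = vec![
        pack(1, 0, 2, 7),
        pack(0, 3, 1, 0),
        pack(0, 3, 0, 4),
        pack(1, 0, 2, 1),
    ];
    // Forma uses crumsort's parallel sort on the CPU; a plain unstable
    // integer sort shows the same grouping effect.
    segments.sort_unstable();
    for s in &segments {
        println!(
            "tile = ({}, {}), layer = {}",
            s >> TILE_Y_SHIFT,
            (s >> TILE_X_SHIFT) & 0xffff,
            (s >> LAYER_SHIFT) & 0xffff,
        );
    }
}
```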
- Okay, finally we can do the fun, visual part. So far it's just been geometry.
- Recall that we sorted the pixel segments. The most significant bits are (TileY, TileX, LayerId). This means that we can render each tile & each layer within each tile.
- We need to fill shapes, but so far we've only split the boundaries of those shapes into pixels. To fill in the interiors, we paint tiles one row at a time, from left to right. For each pixel row, we keep track of which layers carry over from the tiles to the left of the current one, and what percentage of the pixel is shaded (a scalar sketch of this carry trick follows this list).
- This is the most complicated step and the first step where CPU and GPU differ substantially. The CPU uses a "layer workbench" struct that applies optimization passes before using multithreaded SIMD operations to paint the layers.
- The GPU has an 800-line WGSL compute shader to paint everything. This compute shader notably uses shared memory marked `var<workgroup>` to load segments, and corresponding `workgroupBarrier()` calls to synchronize memory access. It uses a 16x4 workgroup to paint an entire row of tiles at a time (recall GPU tiles are 16x4 px).
- All of the math / blend modes are implemented on both CPU + GPU.
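Here's a scalar toy version of the carry trick for one layer and one pixel row, using a nonzero fill rule. The names and the area/cover split are my own simplification; Forma's real painter tracks this per layer across tile rows with SIMD and workgroups:

```rust
// Toy row painter: boundary pixel segments contribute an `area` (partial
// coverage inside their own pixel) and a `cover` (winding that carries to
// every pixel to their right). A left-to-right pass fills the interior.

#[derive(Clone, Copy, Default)]
struct PixelDelta {
    area: f32,  // signed partial coverage inside this pixel
    cover: f32, // signed winding carried rightward
}

/// Nonzero fill rule: per-pixel coverage for one row.
fn paint_row(deltas: &[PixelDelta]) -> Vec<f32> {
    let mut acc = 0.0; // winding carried over from pixels to the left
    deltas
        .iter()
        .map(|d| {
            let coverage = (acc + d.area).abs().min(1.0);
            acc += d.cover; // carry this pixel's winding to the right
            coverage
        })
        .collect()
}

fn main() {
    // A shape whose left edge lands in pixel 1 and right edge in pixel 4.
    let mut row = vec![PixelDelta::default(); 6];
    row[1] = PixelDelta { area: 0.5, cover: 1.0 };
    row[4] = PixelDelta { area: -0.5, cover: -1.0 };
    assert_eq!(paint_row(&row), vec![0.0, 0.5, 1.0, 1.0, 0.5, 0.0]);
}
```

The interior pixels (2 and 3) have no segments at all; they get filled purely from the carried `cover`, which is why tiles to the left have to be painted before the current one.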
- Finally, to get the resulting image, a very short shader on the GPU draws the texture, and we're done. :)
It surprised me how well-delineated the steps in Forma could be. The codebase is really beautifully written. Despite using performance-sensitive systems APIs including SIMD, everything was written very clearly and felt super safe because it was typechecked. There weren't many comments, but I could get the idea of what was going on pretty well by following rust-analyzer's "Go to definition" and skimming code.
I find it wonderful that it's possible to describe a parallelized vector graphics rendering process so concisely. Forma is a nice bit of code to learn about how low-level rendering libraries work, as well as how to architect compute shaders in general.