@raphlinus
raphlinus / kernel-2.md
Last active February 11, 2020 23:08
Description of fancy subgroup piet-metal kernel 2

Kernel 2 processes all the segments in the fill and stroke items. Here we'll concentrate on fill (stroke is similar).

Its input is a list of fill items for this tilegroup, produced by kernel 1, plus read access to the scene, both for the items themselves and for their lists of points.

Its output is, for each item, a background fill and a list of segments. (A potential complication is that segments can be either "fill" or "fill edge".)

This note refers to the piet-metal source extensively. For the most part, it covers the PietItem_Fill case (lines 248..365).

Some simplifications: we'll treat the item list as a vec with len and index operations. In practice it is likely to be fragmented, to make dynamic allocation easier for kernel 1. We'll also write the output code as pseudocode (it will have to do similar dynamic-allocation tricks).
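
To make the shape of the per-item work concrete, here is a rough CPU-side Python sketch, not the Metal kernel itself. It assumes segments are already flattened to straight lines, represents a tile as (x0, y0, x1, y1), and the helper names and winding convention are illustrative rather than piet-metal's.

```python
def x_at_y(seg, y):
    # x coordinate where a non-horizontal line segment crosses height y.
    (ax, ay), (bx, by) = seg
    t = (y - ay) / (by - ay)
    return ax + t * (bx - ax)

def process_fill_item(segments, tile):
    # Return (backdrop, tile_segments) for one fill item.
    # backdrop is the winding contribution of crossings that happen
    # entirely to the left of the tile; tile_segments are the segments
    # that actually touch the tile and need per-pixel work later.
    x0, y0, x1, y1 = tile
    backdrop = 0
    tile_segments = []
    for seg in segments:
        (ax, ay), (bx, by) = seg
        # Conservative bbox test: does this segment touch the tile?
        if (min(ax, bx) < x1 and max(ax, bx) > x0
                and min(ay, by) < y1 and max(ay, by) > y0):
            tile_segments.append(seg)
        # Winding number: segment crosses the tile's top edge (y = y0)
        # strictly to the left of the tile.
        if min(ay, by) <= y0 < max(ay, by) and x_at_y(seg, y0) < x0:
            backdrop += 1 if by > ay else -1
    return backdrop, tile_segments
```

The real kernel does this per tile in parallel and appends its output through dynamic allocation rather than a Python list, but the division of labor, a backdrop count plus a short list of segments per item, is the same.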

@raphlinus
raphlinus / bitmagic.py
Created February 12, 2020 04:19
A Python scratch file used in support of working out piet-gpu kernels
def ctz(x):
    # Count trailing zeros of a 32-bit value (returns 32 for x == 0).
    if x == 0: return 32
    r = 0
    while (x % 2) == 0:
        r += 1
        x >>= 1
    return r

def clz(x):
    # Count leading zeros of a 32-bit value (returns 32 for x == 0).
    for k in range(31, -1, -1):
        if x & (1 << k):
            return 31 - k
    return 32
@raphlinus
raphlinus / timing_results_hybrid_shuffle.txt
Created February 24, 2020 03:22
mac results on transpose-timing-tests (git hash 781dcf54fc8f32fa2acf54c7a0261defe09ef1be)
compiling kernel transpose-hybrid-shuffle-WGS=(32,1)...
num bms: 4096, num dispatch groups: 4096
GPU results verified!
task name:Vk-HybridShuffle-TG=32
device: Intel(R) Iris(TM) Plus Graphics 640
num BMs: 4096, TG size: 32
CPU loops: 101, GPU loops: 1001
timestamp stats (N = 101): 0.00 +/- 0.00 ms
instant stats (N = 101): 108.47 +/- 8.75 ms
backend: metal, device: Intel(R) Iris(TM) Plus Graphics 640
metal-threadgroup-Intel(R) Iris(TM) Plus Graphics 640
kernel type: threadgroup
cpu_execs: 2, gpu_execs: 5001
transpose-threadgroup-WGS=(1,32) kernel already compiled...
num bms: 4096, num dispatch groups: 4096
GPU results verified!
task name:metal-threadgroup-WGS=(32, 32)
TG size: 32
timestamp stats (N = 2): 0.00 +/- 0.00 ms
@raphlinus
raphlinus / transpose_blogpost_outline.md
Last active April 2, 2020 01:33
Matrix transpose blog post outline
  • Writing for the GPU involves breaking problems into primitives.

    • Some primitives can naturally run in parallel; this is the easy part.

    • Others are used to coordinate work between different threads.

    • An example of this is transposing a square bit matrix; piet-gpu uses it to assign work to tiles. (A scalar reference version is sketched after this outline.)

    • This post will examine the performance of that transpose task in detail.
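
For reference, a minimal CPU-side Python version of that 32x32 bit-matrix transpose, using the classic Hacker's Delight block swaps. This is only the scalar baseline; the GPU variants the post compares (threadgroup memory, subgroup shuffle, hybrid) are not shown here.

```python
def transpose32(a):
    # In-place transpose of a 32x32 bit matrix stored as a list of 32
    # 32-bit row words (Hacker's Delight style recursive block swaps).
    m = 0x0000FFFF
    j = 16
    while j != 0:
        k = 0
        while k < 32:
            t = (a[k] ^ (a[k + j] >> j)) & m
            a[k] ^= t
            a[k + j] ^= t << j
            k = (k + j + 1) & ~j
        j >>= 1
        m ^= m << j
    return a

# Smoke test: the identity matrix is its own transpose.
rows = [1 << i for i in range(32)]
assert transpose32(list(rows)) == rows
```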

@raphlinus
raphlinus / arclen_bound.rs
Last active April 10, 2020 20:18
Empirical measurement of cubic bez arclength bound
use kurbo::{CubicBez, ParamCurveArclen, Point};

/// A random point in the unit square.
fn randpt() -> Point {
    Point::new(rand::random(), rand::random())
}

/// A random cubic Bezier with all four control points in the unit square.
fn randbez() -> CubicBez {
    CubicBez::new(randpt(), randpt(), randpt(), randpt())
}
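
The rest of the gist (not shown in this preview) does the actual measurement with kurbo's arclength. As a rough illustration of the kind of experiment, here is a Python sketch comparing the elementary chord and control-polygon bounds against a brute-force arclength on random cubics; the specific bound measured in the Rust code may differ.

```python
import random

def cubic_point(p0, p1, p2, p3, t):
    # Evaluate a cubic Bezier at parameter t.
    mt = 1.0 - t
    x = mt**3 * p0[0] + 3 * mt**2 * t * p1[0] + 3 * mt * t**2 * p2[0] + t**3 * p3[0]
    y = mt**3 * p0[1] + 3 * mt**2 * t * p1[1] + 3 * mt * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def arclen_brute(p0, p1, p2, p3, n=1000):
    # Brute-force arclength via fine polyline subdivision.
    pts = [cubic_point(p0, p1, p2, p3, i / n) for i in range(n + 1)]
    return sum(dist(pts[i], pts[i + 1]) for i in range(n))

# chord length <= arclen <= control-polygon length; see how tight the
# bounds get over a batch of random curves in the unit square.
lo_ratio, hi_ratio = 1.0, 1.0
for _ in range(1000):
    ps = [(random.random(), random.random()) for _ in range(4)]
    chord = dist(ps[0], ps[3])
    poly = dist(ps[0], ps[1]) + dist(ps[1], ps[2]) + dist(ps[2], ps[3])
    arc = arclen_brute(*ps)
    lo_ratio = min(lo_ratio, chord / arc)
    hi_ratio = max(hi_ratio, poly / arc)
print("worst chord/arclen:", lo_ratio, "worst poly/arclen:", hi_ratio)
```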
@raphlinus
raphlinus / prefix_sum_draft.md
Last active April 28, 2020 04:26
Very much half-written draft of prefix sum post
layout: post
title: Prefix sum on Vulkan
date: 2020-04-21 11:29:42 -0800
categories: gpu

In this blog post are some initial explorations into implementing [prefix sum] on recent Vulkan. I have a rough first draft implementation which suggests that Vulkan is a viable platform for this work, but considerably more performance tuning and evaluation would be needed before I would be prepared to claim it is competitive with CUDA. Even so, I'm posting this now, as the rough explorations may be interesting to some, and I'm not sure I'll have the time and energy to do that follow-up work any time soon.
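
For anyone new to the term, the operation itself is simple; all of the interest is in doing it in parallel. A sequential reference version, given here only to fix the definition (not the Vulkan implementation):

```python
def exclusive_prefix_sum(xs):
    # out[i] is the sum of xs[0..i), so out[0] == 0.
    out = []
    total = 0
    for x in xs:
        out.append(total)
        total += x
    return out

# exclusive_prefix_sum([3, 1, 4, 1, 5]) == [0, 3, 4, 8, 9]
```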

Why prefix sum?

@raphlinus
raphlinus / nv_crash_log.txt
Created May 18, 2020 17:36
Log output from piet-gpu nv_crash_2 run
parsing time: 1.2863ms
flattening and encoding time: 1.8176ms
scene: 13239 elements
Element kernel time: 0.633ms
Binning kernel time: 0.166ms
Coarse kernel time: 0.133ms
Render kernel time: 0.001ms
start thread: 0
shared minimum element: 1
minimum element of this thread: 828
@raphlinus
raphlinus / sort_middle.md
Created May 30, 2020 02:35
Draft blog post of sort-middle
layout: post
title: A sort-middle architecture for 2D graphics
date: 2020-05-26 16:34:42 -0700
categories: rust graphics gpu

In my recent [piet-gpu update], I wrote that I was not satisfied with performance and teased a new approach. I'm happy to report that the new approach is looking very promising, and I'll describe it in some detail.

To recap, piet-gpu is a new high performance 2D rendering engine, currently a research prototype. While most 2D renderers fit the vector primitives into a GPU's rasterization pipeline, the brief for piet-gpu is to fully explore what's possible using the compute capabilities of modern GPUs. In short, it's a software renderer that is written to run efficiently on a highly parallel computer. Software rendering has been gaining more attention even for complex 3D scenes, as the traditional triangle-centric pipeline is less and less of a fit for high-end rendering. As a striking example, the new [Unreal 5] engine relies heavily on compute shaders for software rasterization.

@raphlinus
raphlinus / count to 10 in C
Created May 31, 2020 19:53
LLVM IR size comparison
; format string for printf: "%d\n"
@.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1

define i32 @main() #0 {
  %1 = alloca i32, align 4                ; slot for main's return value
  %i = alloca i32, align 4                ; the loop counter i
  store i32 0, i32* %1
  call void @llvm.dbg.declare(metadata !{i32* %i}, metadata !13), !dbg !15
  store i32 0, i32* %i, align 4, !dbg !16
  br label %2, !dbg !16                   ; jump to the loop condition