@raphlinus
raphlinus / kernel-2.md
Last active February 11, 2020 23:08
Description of fancy subgroup piet-metal kernel 2

Kernel 2 processes all the segments in the fill and stroke items. Here we'll concentrate on fill (stroke is similar).

Its input is a list of fill items for this tilegroup, produced by kernel 1, plus read access to the scene, both for the items themselves and for their lists of points.

Its output is, for each item, a background fill and a list of segments. (A potential complication is that segments can be either "fill" or "fill edge".)

This note refers to the piet-metal source extensively. For the most part, it covers the PietItem_Fill case (lines 248..365).

Some simplifications: we'll treat the item list as a vec with len and index operations. In practice it is likely to be fragmented, to make dynamic allocation easier for kernel 1. We'll also write the output code as pseudocode (it will have to do similar dynamic-allocation tricks).
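
To make the shape of the per-item work concrete, here is a rough CPU-side Python sketch, not the Metal kernel itself. It assumes segments are already flattened to straight lines, represents a tile as (x0, y0, x1, y1), and the helper names and winding convention are illustrative rather than piet-metal's.

```python
def x_at_y(seg, y):
    # x coordinate where a non-horizontal line segment crosses height y.
    (ax, ay), (bx, by) = seg
    t = (y - ay) / (by - ay)
    return ax + t * (bx - ax)

def process_fill_item(segments, tile):
    # Return (backdrop, tile_segments) for one fill item.
    # backdrop is the winding contribution of crossings that happen
    # entirely to the left of the tile; tile_segments are the segments
    # that actually touch the tile and need per-pixel work later.
    x0, y0, x1, y1 = tile
    backdrop = 0
    tile_segments = []
    for seg in segments:
        (ax, ay), (bx, by) = seg
        # Conservative bbox test: does this segment touch the tile?
        if (min(ax, bx) < x1 and max(ax, bx) > x0
                and min(ay, by) < y1 and max(ay, by) > y0):
            tile_segments.append(seg)
        # Winding number: segment crosses the tile's top edge (y = y0)
        # strictly to the left of the tile.
        if min(ay, by) <= y0 < max(ay, by) and x_at_y(seg, y0) < x0:
            backdrop += 1 if by > ay else -1
    return backdrop, tile_segments
```

The real kernel does this per tile in parallel and appends its output through dynamic allocation rather than a Python list, but the division of labor, a backdrop count plus a short list of segments per item, is the same.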

@raphlinus
raphlinus / bitmagic.py
Created February 12, 2020 04:19
A Python scratch file used in support of working out piet-gpu kernels
def ctz(x):
    # Count trailing zeros of a 32-bit value (returns 32 for x == 0).
    if x == 0: return 32
    r = 0
    while (x % 2) == 0:
        r += 1
        x >>= 1
    return r

def clz(x):
    # Count leading zeros of a 32-bit value (returns 32 for x == 0).
    for k in range(31, -1, -1):
        if x & (1 << k):
            return 31 - k
    return 32
@raphlinus
raphlinus / timing_results_hybrid_shuffle.txt
Created February 24, 2020 03:22
mac results on transpose-timing-tests (git hash 781dcf54fc8f32fa2acf54c7a0261defe09ef1be)
compiling kernel transpose-hybrid-shuffle-WGS=(32,1)...
num bms: 4096, num dispatch groups: 4096
GPU results verified!
task name:Vk-HybridShuffle-TG=32
device: Intel(R) Iris(TM) Plus Graphics 640
num BMs: 4096, TG size: 32
CPU loops: 101, GPU loops: 1001
timestamp stats (N = 101): 0.00 +/- 0.00 ms
instant stats (N = 101): 108.47 +/- 8.75 ms
backend: metal, device: Intel(R) Iris(TM) Plus Graphics 640
metal-threadgroup-Intel(R) Iris(TM) Plus Graphics 640
kernel type: threadgroup
cpu_execs: 2, gpu_execs: 5001
transpose-threadgroup-WGS=(1,32) kernel already compiled...
num bms: 4096, num dispatch groups: 4096
GPU results verified!
task name:metal-threadgroup-WGS=(32, 32)
TG size: 32
timestamp stats (N = 2): 0.00 +/- 0.00 ms
@raphlinus
raphlinus / transpose_blogpost_outline.md
Last active April 2, 2020 01:33
Matrix transpose blog post outline
  • Writing for the GPU involves breaking problems into primitives.

    • Some primitives can naturally run in parallel; this is the easy part.

    • Others are used to coordinate work between different threads.

    • An example of this is transposing a square bit matrix; piet-gpu uses it to assign work to tiles. (A scalar reference version is sketched after this outline.)

    • This post will examine the performance of that transpose task in detail.
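
For reference, a minimal CPU-side Python version of that 32x32 bit-matrix transpose, using the classic Hacker's Delight block swaps. This is only the scalar baseline; the GPU variants the post compares (threadgroup memory, subgroup shuffle, hybrid) are not shown here.

```python
def transpose32(a):
    # In-place transpose of a 32x32 bit matrix stored as a list of 32
    # 32-bit row words (Hacker's Delight style recursive block swaps).
    m = 0x0000FFFF
    j = 16
    while j != 0:
        k = 0
        while k < 32:
            t = (a[k] ^ (a[k + j] >> j)) & m
            a[k] ^= t
            a[k + j] ^= t << j
            k = (k + j + 1) & ~j
        j >>= 1
        m ^= m << j
    return a

# Smoke test: the identity matrix is its own transpose.
rows = [1 << i for i in range(32)]
assert transpose32(list(rows)) == rows
```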

@raphlinus
raphlinus / arclen_bound.rs
Last active April 10, 2020 20:18
Empirical measurement of cubic bez arclength bound
use kurbo::{CubicBez, ParamCurveArclen, Point};

/// A random point in the unit square.
fn randpt() -> Point {
    Point::new(rand::random(), rand::random())
}

/// A random cubic Bezier with all four control points in the unit square.
fn randbez() -> CubicBez {
    CubicBez::new(randpt(), randpt(), randpt(), randpt())
}
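
The rest of the gist (not shown in this preview) does the actual measurement with kurbo's arclength. As a rough illustration of the kind of experiment, here is a Python sketch comparing the elementary chord and control-polygon bounds against a brute-force arclength on random cubics; the specific bound measured in the Rust code may differ.

```python
import random

def cubic_point(p0, p1, p2, p3, t):
    # Evaluate a cubic Bezier at parameter t.
    mt = 1.0 - t
    x = mt**3 * p0[0] + 3 * mt**2 * t * p1[0] + 3 * mt * t**2 * p2[0] + t**3 * p3[0]
    y = mt**3 * p0[1] + 3 * mt**2 * t * p1[1] + 3 * mt * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def arclen_brute(p0, p1, p2, p3, n=1000):
    # Brute-force arclength via fine polyline subdivision.
    pts = [cubic_point(p0, p1, p2, p3, i / n) for i in range(n + 1)]
    return sum(dist(pts[i], pts[i + 1]) for i in range(n))

# chord length <= arclen <= control-polygon length; see how tight the
# bounds get over a batch of random curves in the unit square.
lo_ratio, hi_ratio = 1.0, 1.0
for _ in range(1000):
    ps = [(random.random(), random.random()) for _ in range(4)]
    chord = dist(ps[0], ps[3])
    poly = dist(ps[0], ps[1]) + dist(ps[1], ps[2]) + dist(ps[2], ps[3])
    arc = arclen_brute(*ps)
    lo_ratio = min(lo_ratio, chord / arc)
    hi_ratio = max(hi_ratio, poly / arc)
print("worst chord/arclen:", lo_ratio, "worst poly/arclen:", hi_ratio)
```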
@raphlinus
raphlinus / prefix_sum_draft.md
Last active April 28, 2020 04:26
Very much half-written draft of prefix sum post
layout: post
title: Prefix sum on Vulkan
date: 2020-04-21 11:29:42 -0800
categories: gpu

In this blog post are some initial explorations into implementing [prefix sum] on recent Vulkan. I have a rough first draft implementation which suggests that Vulkan is a viable platform for this work, but considerably more performance tuning and evaluation would be needed before I would be prepared to claim it is competitive with CUDA. Even so, I'm posting this now, as the rough explorations may be interesting to some, and I'm not sure I'll have the time and energy to do that follow-up work any time soon.
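
For anyone new to the term, the operation itself is simple; all of the interest is in doing it in parallel. A sequential reference version, given here only to fix the definition (not the Vulkan implementation):

```python
def exclusive_prefix_sum(xs):
    # out[i] is the sum of xs[0..i), so out[0] == 0.
    out = []
    total = 0
    for x in xs:
        out.append(total)
        total += x
    return out

# exclusive_prefix_sum([3, 1, 4, 1, 5]) == [0, 3, 4, 8, 9]
```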

Why prefix sum?

@raphlinus
raphlinus / nv_crash_log.txt
Created May 18, 2020 17:36
Log output from piet-gpu nv_crash_2 run
parsing time: 1.2863ms
flattening and encoding time: 1.8176ms
scene: 13239 elements
Element kernel time: 0.633ms
Binning kernel time: 0.166ms
Coarse kernel time: 0.133ms
Render kernel time: 0.001ms
start thread: 0
shared minimum element: 1
minimum element of this thread: 828
@raphlinus
raphlinus / sort_middle.md
Created May 30, 2020 02:35
Draft blog post of sort-middle
layout: post
title: A sort-middle architecture for 2D graphics
date: 2020-05-26 16:34:42 -0700
categories: rust graphics gpu

In my recent [piet-gpu update], I wrote that I was not satisfied with performance and teased a new approach. I'm happy to report that the new approach is looking very promising, and I'll describe it in some detail.

To recap, piet-gpu is a new high performance 2D rendering engine, currently a research prototype. While most 2D renderers fit the vector primitives into a GPU's rasterization pipeline, the brief for piet-gpu is to fully explore what's possible using the compute capabilities of modern GPUs. In short, it's a software renderer that is written to run efficiently on a highly parallel computer. Software rendering has been gaining more attention even for complex 3D scenes, as the traditional triangle-centric pipeline is less and less of a fit for high-end rendering. As a striking example, the new [Unreal 5] engine relies heavily on compute shaders for software rasterization.

@raphlinus
raphlinus / count to 10 in C
Created May 31, 2020 19:53
LLVM IR size comparison
; format string for printf: "%d\n"
@.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1

define i32 @main() #0 {
  %1 = alloca i32, align 4                ; slot for main's return value
  %i = alloca i32, align 4                ; the loop counter i
  store i32 0, i32* %1
  call void @llvm.dbg.declare(metadata !{i32* %i}, metadata !13), !dbg !15
  store i32 0, i32* %i, align 4, !dbg !16
  br label %2, !dbg !16                   ; jump to the loop condition