Description of fancy subgroup piet-metal kernel 2

Kernel 2 processes all the segments in the fill and stroke items. Here we'll concentrate on fill (stroke is similar).

Its input is: a list of fill items for this tilegroup, from kernel 1. Also access to the scene, for the items, and for the lists of points.

Its output is: for each item, a background fill and a list of segments. (there's potential complexity that the segments can be "fill" and "fill edge").

This note refers to the piet-metal source extensively. For the most part, it does the PietItem_Fill case (lines 248..365).

Some simplifications: we'll consider the item list a vec, with len and index operations. In practice, it is likely to be fragmented, to make dynamic allocation easier for kernel 1. We'll also write the code for output in pseudocode (it will have to do similar dynamic alloc tricks).

Pseudocode for scalar case:

struct Result {
    int backdrop;
    vec<segment> segs;
}

kernel2_fill(Buf scene, Buf item_lists) {
    vec<(PietItemRef, float2 offset)> item_list = "item list for this tilegroup"
    vec<Result> results;

    for (uint i = 0; i < item_list.len(); i++) {
        PietFillPacked fill = PietFill_read(scene, item_list.items[i]);
        Result result = default;
        for (uint j = 0; j + 1 < fill.n_points; j++) {
            float2 start = float2_read(scene, fill.points_ix + (j * sizeof(float2)));
            float2 end = float2_read(scene, fill.points_ix + ((j + 1) * sizeof(float2)));
            // This is segment (i, j) - we will refer to this below
            "compute intersections of line with this tile" /* lines 310..330 */
            if ("left ray intersects" /* line 331 */) {
                result.backdrop -= s00; /* see line 332 */
            }
            if ("intersects" /* line 334, 345 */) {
                result.segs.push("process segment" /* lines 338-351 */);
            }

        }
        results.push(result);
    }
}

Some preliminaries:

pos(i, j) = sum(item[k].n_points - 1 for k in 0..i) + j

This is a mapping from (i, j) pairs into integers in the range 0..sum(item[_].n_points - 1). The inverse mapping (invpos) is also defined. Our general strategy will be to iterate through these in chunks of subgroup_size.

We're actually going to take chunks of 0..subgroup size for i, so this definition is not quite right (TODO: fix up properly).

Pseudocode with subgroups (buckle up):

kernel2_fill(Buf scene, Buf item_lists) {
    vec<(PietItemRef, float2 offset)> item_list = "item list for this tilegroup"
    vec<Result> results;
    Result result = default;

    for (uint i0 = 0; i0 < item_list.len(); i0 += subgroup_size) {
        uint i = i0 + tix;
        uint n = 0;
        float2Ref points;
        if (i < item_list.len()) {
            n = PietFill_n_points(scene, item_list.items[i]) - 1;
            points = PietFill_points_ix(scene, item_list.items[i]);
        }
        // Note: subgroup_inclusive_add(n) is WavePrefixSum(n) + n, simd_prefix_inclusive_sum
        uint prefix_n = subgroup_inclusive_add(n);
        uint sum_n = subgroup_broadcast(prefix_n, subgroup_size - 1)
        uint ilast = 0;
        uint jlast = 0;
        for (uint k0 = 0; k0 < sum_n; k0 += subgroup_size) {
            uint k = k0 + tix;
            // we're now going to compute invpos(k)
            uint my_ix = prefix_n - k0 - 1;
            uint is_last = subgroup_or(my_ix < subgroup_size ? 1 << my_ix : 0);
            // bit b of is_last is set when ∃x k0 + b + 1 == prefix_n[x]

            // This is crafted so it's 0 when tix == 0
            uint last_count = (is_last << (31 - tix)) * 2;
            uint delta_i = popcnt(last_count);
            uint i_inv = ilast + delta_i;
            uint j_inv = delta_i == 0 ? jlast + tix : count_leading_zeros(last_count);

            uint intersects = 0;
            uint backdrop_pos = 0;
            uint backdrop_neg = 0;
            float2 start, end;
            if (i0 + i_inv < item_list.len()) {
                // if shuffle is not available, just read again from scene
                float2Ref this_points = simd_shuffle(points, i_inv);
                start = float2_read(scene, this_points + (j_inv * sizeof(float2)));
                end = float2_read(scene, this_points + ((j_inv + 1) * sizeof(float2)));

                if ("left ray for tilegroup intersects") {
                    float x = "solve line equation for y0 (top of tile)";
                    uint xray = floor((x - x0) / tile_width);
                    uint mask = ~0 << xray;
                    if (s00 < 0.0) {
                        backdrop_pos = mask;
                    } else {
                        backdrop_neg = mask;
                    }
                }
                uint xmin, xmax = "range of tiles that this line intersects";
                intersects = xmax > xmin ? (2 << (xmax - 1)) - (1 << xmin) : 0;
            }
            "transpose intersects, backdrop_pos, and backdrop_neg";

            // All bits up to and including the lowest bit set
            vote_t mask = is_last ^ (is_last - 1);
            result.backdrop += popcnt(backdrop_pos & mask) - popcnt(backdrop_neg & mask);
            vote_t bitmask = intersects | is_last;
            while (bitmask != 0) {
                uint k = count_trailing_zeros(bitmask);
                if (intersects & (1 << k) != 0) {
                    float2 this_start = simd_shuffle(start, k);
                    float2 this_end = simd_shuffle(end, k);

                    result.segs.push("process segment");
                }

                if ((is_last & (1 << k)) != 0) {
                    // last segment in item
                    results.push(result);
                    result = default;
                    // bits from the next one after this up to and including the next one set
                    vote_t mask = is_last ^ (is_last - (2 << k));
                    result.backdrop = popcnt(backdrop_pos & mask) - popcnt(backdrop_neg & mask);
                }

                bitmask &= bitmask - 1; // clears bottom bit
            }

            ilast += popcnt(is_last);
            jlast = subgroup_broadcast(j_inv, subgroup_size - 1) + 1;
        }
    }
}

raphlinus/kernel-2.md