@Themaister
Created March 30, 2016 04:31
N64 GPU-based LLE ideas
This is probably not the best place to put this, but whatever. I needed to get this down somewhere before I forget.
I've been looking at the latest efforts in software-rendered N64 emulation lately and the graphics programmer in me got interested.
Since I'm a (GP)GPU programmer I want to see this accelerated on the GPU somehow, but the naive "render triangles" method is inadequate I think, and trying to do LLE with an HLE mindset is fundamentally flawed.
Looking at Angrylion's RDP plugin, the N64 clearly has an extremely specific rasterization and interpolation algorithm
which I believe is not possible to recreate exactly with naive "draw triangle" in OpenGL or any other API (I haven't seen any evidence to the contrary).
Apparently all the OpenGL LLE implementations atm are horribly and utterly broken, so clearly it’s quite hard.
We need a custom rasterizer and this is where things will start to get interesting ...
The two most important things for faithful-looking results are an exact rasterizer and exact depth interpolation, comparison and precision,
and these are unfortunately the things OpenGL guarantees nothing about, except that they are consistent for that particular GPU.
OpenGL specifies its rasterizer based on an idealized barycentric scheme (cross products), while the N64 RDP is a scanline rasterizer
with particular fixed-point math to determine the start/end of each scanline,
plus what seems to be a highly custom way of dealing with AA using very exact coverage mask computation.
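Something like this is what I mean by an exact span test. This is only a sketch, assuming a hypothetical RDPCommandData layout that mirrors the edge coefficients of the RDP triangle command (xh/xm/xl plus dxhdy/dxmdy/dxldy in s15.16, yh/ym/yl in s11.2; field names are mine); the exact rounding, incremental edge stepping and coverage mask rules are only hinted at, even though those are precisely the hard part:
bool rasterizer_test_passes(uvec2 pixel, RDPCommandData tri, out uint coverage)
{
    int y = int(pixel.y) << 2;                       // pixel Y in s11.2 subpixel units
    if (y < tri.yh || y >= tri.yl)
        return false;

    // Major edge walks from yh; the minor side switches from the xm edge to the xl edge at ym.
    // Real code steps these incrementally per (sub-)scanline and needs wider intermediates.
    int dy      = y - tri.yh;
    int x_major = tri.xh + ((tri.dxhdy * dy) >> 2);  // s15.16
    int x_minor = (y < tri.ym)
        ? tri.xm + ((tri.dxmdy * dy) >> 2)
        : tri.xl + ((tri.dxldy * (y - tri.ym)) >> 2);

    int x = int(pixel.x) << 16;                      // pixel X in s15.16 units
    if (x < min(x_major, x_minor) || x > max(x_major, x_minor))
        return false;

    // The RDP actually builds a per-subpixel coverage mask here (and uses the lft flag
    // to know which edge is the left one); just pretend we have full coverage.
    coverage = 8u;
    return true;
}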
Another point is the combiner. Emulating this exactly clearly requires extremely specific fixed-point math, and
obviously the fixed-function blending unit in OpenGL can't do this, so we would need to do programmable blending and depth testing.
This isn't impossible; on tile-based mobile GPUs it's actually quite doable, but obviously far less efficient than fixed-function blending.
Desktop GPUs can do it as well, but it means a full ROP sync, which really isn't something you want to do on an immediate-mode GPU.
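For reference, each combiner cycle per channel has the shape (A - B) * C + D, with the inputs selected by mux bits from things like shade, texel, primitive and environment colors. A hand-wavy sketch of just the RGB math (the real hardware works on 9-bit signed intermediates with its own clamp/wrap and rounding rules, which this glosses over) could be:
ivec3 combine_cycle_rgb(ivec3 a, ivec3 b, ivec3 c, ivec3 d)
{
    // (A - B) * C + D, with C acting as a 0.8 fixed-point multiplier.
    return clamp((((a - b) * c + 0x80) >> 8) + d, ivec3(0), ivec3(255));
}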
To do a custom rasterizer and combiner, I think in theory it could work like this (rough fragment shader skeleton after the list):
LLE triangle:
- Draw a quad which is the bounding box (computed earlier on CPU). Pass along all relevant fixed point parameters as uniforms.
- Do a custom rasterizer check per pixel, discard if failed and interpolate everything by hand.
- Read per-pixel bits and do combine/depth manually, discard fragment if depth test fails, etc.
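A rough skeleton of that fragment shader, reusing the same hypothetical RDPCommandData and helper names as in the sketches above; the read-modify-write of the framebuffer from the fragment stage is exactly the programmable blending/depth part that needs framebuffer fetch, pixel local storage or heavy synchronization depending on the GPU:
layout(rgba8, binding = 0) coherent uniform image2D framebuffer_image;
uniform RDPCommandData rdp_triangle; // per-triangle fixed-point parameters, set per draw

void main()
{
    uvec2 pixel = uvec2(gl_FragCoord.xy);

    uint coverage;
    if (!rasterizer_test_passes(pixel, rdp_triangle, coverage))
        discard;

    InterpolateData interp = interpolate_color_and_sample_textures(pixel, rdp_triangle);

    Pixel dst = load_framebuffer_pixel(framebuffer_image, pixel);
    if (!depth_testing(dst, interp))
        discard;

    store_framebuffer_pixel(framebuffer_image, pixel,
                            combiner_and_blend(dst, interp, coverage));
}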
I don't think this is the way to go, though. The contention here between pixels is batshit insane and won't scale well.
Also, the only API which would enable this kind of feature in a standard way is Vulkan with its multipass feature.
Another challenge with the N64 is its tightly coupled framebuffer, so readbacks from the RDP are a must for correct emulation.
We need to be able to fire off RDP rendering work and quickly get results back.
The fragment stage comes very late in the graphics pipeline, so readbacks become very expensive;
reading back from render targets often triggers a "decompression" step on GPUs as well.
This becomes painful if we're running any other shaders like perhaps VI filter emulation or post-process shaders in frontend ...
Which leaves us with compute shader rasterization!
This might sound weird, but with modern compute-enabled GPUs this isn't as stupid as it sounds.
I think PCSX2 has done something similar, and there are some very well-performing CUDA-based software rasterizers out there.
We can implement our own tile-based scheme where combining and depth testing happens in GPU registers rather than hitting main memory.
We can buffer up an entire frame's worth of work, bin to tiles on the CPU and issue a compute dispatch that churns through all the tiles.
The RSP has already done all the hard work for us in vertex processing, so I think the structure could look something like this on the GPU:
layout(std430, binding = 0) readonly buffer TriangleData { RDPCommandData triangles[]; }; // Filled on CPU
struct PerTileData { uint num_triangles; uint triangle_indices_hitting_this_tile[MAX_TRIANGLES_PER_TILE]; }; // Filled on CPU
layout(std430, binding = 1) readonly buffer Tiles { PerTileData tile_data[]; };
layout(std430, binding = 2) buffer Framebuffer { Pixel pixels[]; }; // Streamed from RDRAM on rendering start. Flushed back out to RDRAM when needed (FB readback/modify on CPU).

uniform uint stride;       // Framebuffer width in pixels
uniform uint num_tiles_x;  // Number of tiles per row

layout(local_size_x = 8, local_size_y = 8) in; // 8x8 pixel tiles, arbitrary.

void main()
{
    uvec2 pixel = gl_GlobalInvocationID.xy;
    uint tile = gl_WorkGroupID.y * num_tiles_x + gl_WorkGroupID.x;

    // Keep the working pixel in registers for the whole tile pass;
    // memory is only touched once on entry and once on exit.
    Pixel frame_buffer = pixels[pixel.y * stride + pixel.x];

    uint num_triangles = tile_data[tile].num_triangles;
    for (uint i = 0u; i < num_triangles; i++)
    {
        RDPCommandData tri = triangles[tile_data[tile].triangle_indices_hitting_this_tile[i]];
        uint coverage;
        if (rasterizer_test_passes(pixel, tri, coverage))
        {
            InterpolateData interp = interpolate_color_and_sample_textures(pixel, tri);
            if (depth_testing(frame_buffer, interp))
                frame_buffer = combiner_and_blend(frame_buffer, interp, coverage);
        }
    }

    pixels[pixel.y * stride + pixel.x] = frame_buffer;
}
There are some problems with this right off the bat though.
If we batch up arbitrary triangles, the triangles can have completely different combiners and render states.
This becomes ugly really quickly since we need an ubershader to deal with all possible cases.
This hurts occupancy and adds lots of useless branches (but at least the branches are coherent!)
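To illustrate, the combiner_and_blend() call in the tile loop ends up hiding a big per-triangle branch over render state, something like this (state_index and the combine_* variants are made-up names, and the real state space is far bigger):
Pixel apply_render_state(Pixel frame_buffer, RDPCommandData tri, InterpolateData interp, uint coverage)
{
    switch (tri.state_index)
    {
    case COMBINER_SHADE_ONLY:
        return combine_shade(frame_buffer, interp, coverage);
    case COMBINER_TEX_MODULATE_SHADE:
        return combine_tex_modulate_shade(frame_buffer, interp, coverage);
    // ... one case per combiner/blender/render-state combination we decide to support ...
    default:
        return combine_generic(frame_buffer, tri, interp, coverage);
    }
}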
Other problems include textures changing a lot mid-frame (especially with that anaemic texture cache),
but this is probably not as bad, since we can use 2D texture arrays with mipmaps
to hold a large number of independent textures of all possible sizes in the same texture description.
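A hedged sketch of that texture array idea, with made-up names: all texture uploads for the frame get copied into layers of one big array texture, each triangle carries a layer index plus a scale factor for its real size, and we sample with explicit LODs (which we need in compute anyway, since there are no implicit derivatives):
layout(binding = 1) uniform sampler2DArray frame_textures;

vec4 sample_rdp_texture(RDPCommandData tri, vec2 st, float lod)
{
    vec2 uv = st * tri.texture_scale; // compensate for the padded layer size
    return textureLod(frame_textures, vec3(uv, float(tri.texture_layer)), lod);
}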
It is certainly doable to go through multiple tile passes per frame if we want to replace
the compute shader mid-frame due to changing combiners and rendering states.
Finding the right balance between the number of passes and longer, more complex shaders is a very tinkering-heavy implementation detail.
At least with Vulkan, we can hammer on with lots of dispatches, and custom linear allocators can
avoid any driver churn from allocating lots of buffers here and there … :3
A compute approach like this could be a significant improvement for the N64 LLE effort, I think.
Some GPUs can do compute and fragment work in parallel which lets us hide a lot of the processing.
Some GPUs even have dedicated compute queues (exposed in Vulkan for example), which can improve this even more.
This means compute doesn't have to wait for fragment work to complete first, and readbacks can be done cleanly and fast.
Sure, this approach won't get close to HLE rendering speed, but any half-decent GPU should still churn through this like butter;
this is 320x240 territory, not exactly 1080p :P
I’m playing around with a pure rasteriser using this approach with RDP triangle semantics to get a feel for how it works out.
Graphics programming at this level is quite fun :3