What this proposal is about

Creating a Compositor that will handle rendering and compute
Making it MUCH easier to move tasks around and reorder them
Fixing the barrier problem

What this proposal isn't

It's not about exposing the Compositor to the user
- This is possible. Juan tells me there is little demand for it.
- Not really the point of this proposal.
It's not about generating a UI for the Compositor and connecting nodes
- Also possible
- May be useful in the future if FXs get out of hand
It's not an optional component
- This will lie at the heart of Godot's RenderingDevice
It's not about solving execution order (e.g. which ops can be paired together so they can run in parallel via async compute)
- However it lays the foundation for this in future work
- We will split operations into isolated 'units of work' called passes.

Problems

Godot generates barriers "just in case"

Right now the following Godot code generates barriers:

RD::ComputeListID compute_list = RD::get_singleton()->compute_list_begin();

RD::get_singleton()->compute_list_dispatch_threads(compute_list, size.x, size.y, 1);
RD::get_singleton()->compute_list_add_barrier(compute_list);
RD::get_singleton()->draw_command_end_label();

RD::get_singleton()->compute_list_end(RD::BARRIER_MASK_COMPUTE);

Ideally this code alone should not generate barriers. Why? Because barriers are only needed when we need to enforce ordering. (They are also needed to change a resource's layout, but ordering is implicit in that case: if a resource needs to change layout, that means B needs to consume the output of A and must be executed after A is done.)

Therefore this snippet alone is meaningless without context. What gives it meaning is knowing what will execute next (and what executed before, if anything).
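To make the "context" argument concrete, here is a minimal sketch of the decision a barrier really depends on. The `Access` flags and `needs_barrier` are hypothetical names made up for this illustration (they are not Godot API), and the sketch deliberately ignores texture layout changes, which would force a barrier even between two reads.

```cpp
#include <cassert>

// Hypothetical access flags, mirroring the Read/Write split used later in this proposal.
enum Access : unsigned { kNone = 0x0, kRead = 0x1, kWrite = 0x2 };

// Returns true when work B must be synchronized against earlier work A
// on the same resource. Two reads never conflict; anything involving a
// write does (read-after-write, write-after-read, write-after-write).
inline bool needs_barrier(unsigned access_a, unsigned access_b) {
	if (access_a == kNone || access_b == kNone) {
		return false; // One side does not touch the resource at all.
	}
	return ((access_a | access_b) & kWrite) != 0;
}
```

The point is that the answer needs both `access_a` and `access_b`. A compute list that only knows about itself has just one of the two, so it can only guess.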

Operation order is too strict

The following is from Godot's SSIL (Screen Space Indirect Lighting) code:

void SSEffects::screen_space_indirect_lighting(Ref<RenderSceneBuffersRD> p_render_buffers,
		SSILRenderBuffers &p_ssil_buffers, uint32_t p_view, RID p_normal_buffer,
		const Projection &p_projection, const Projection &p_last_projection,
		const SSILSettings &p_settings) {

    SSILProjectionUniforms projection_uniforms;
    store_camera(p_last_projection, projection_uniforms.inv_last_frame_projection_matrix);
    RD::get_singleton()->buffer_update(
			ssil.projection_uniform_buffer, 0, sizeof(SSILProjectionUniforms), &projection_uniforms);

    //...

    RD::ComputeListID compute_list = RD::get_singleton()->compute_list_begin();
    // ...
    RD::get_singleton()->compute_list_dispatch_threads(compute_list,
                p_ssil_buffers.half_buffer_width, p_ssil_buffers.half_buffer_height, 1);
    // ...
	RD::get_singleton()->compute_list_end(RD::BARRIER_MASK_NO_BARRIER);

There are a lot of barriers generated that are part of the nature of this effect, but I want to focus on an unnecessary barrier generated by this call:

RD::get_singleton()->buffer_update(ssil.projection_uniform_buffer, 0, sizeof(SSILProjectionUniforms), &projection_uniforms);

This operation simply updates a GPU buffer with the projection matrix sent through screen_space_indirect_lighting.

The problem is that because projection_uniform_buffer needs to be updated via a buffer copy (its data doesn't seem to be CPU visible*), it needs a staging buffer and vkCmdCopyBuffer. And therefore... a barrier.

* It may be argued it'd be better if it's CPU visible, but that's for another discussion.

Therefore the SSIL effect creates TWO extra barriers: one before the vkCmdCopyBuffer call, and another after it.

This is completely unnecessary. projection_uniforms has been known since the beginning of the frame.

What Godot should have done is gather all operations that need to update buffers when frame rendering is about to start, and update all buffers together.

This way we can have one vkCmdCopyBuffer and two barriers for all operations, not per effect.
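A rough sketch of that batching idea follows. `FrameUploadQueue` and `FakeBuffer` are hypothetical names invented for this example; the "barriers" are only counted, and the copies are plain `memcpy` calls standing in for staging-buffer copies. Callers are assumed to stay within the destination buffer's bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a GPU buffer; a real one would live in device memory.
struct FakeBuffer {
	std::vector<uint8_t> data;
};

class FrameUploadQueue {
	struct Pending {
		FakeBuffer *dst;
		size_t offset;
		std::vector<uint8_t> bytes;
	};
	std::vector<Pending> m_pending;

public:
	int barriers_issued = 0;

	// Called by each effect at frame start; records the copy and issues nothing.
	void enqueue(FakeBuffer *dst, size_t offset, const void *src, size_t size) {
		const uint8_t *p = static_cast<const uint8_t *>(src);
		m_pending.push_back({ dst, offset, std::vector<uint8_t>(p, p + size) });
	}

	// Called once per frame: one barrier before all copies, one after.
	void flush() {
		if (m_pending.empty()) {
			return;
		}
		++barriers_issued; // Make every destination writable by the copies.
		for (const Pending &u : m_pending) {
			std::memcpy(u.dst->data.data() + u.offset, u.bytes.data(), u.bytes.size());
		}
		++barriers_issued; // Make the copied data visible to the shaders.
		m_pending.clear();
	}
};

// Three effects upload their uniforms; returns the total barrier count.
inline int upload_three_effects() {
	FakeBuffer a{ std::vector<uint8_t>(16) };
	FakeBuffer b{ std::vector<uint8_t>(16) };
	FakeBuffer c{ std::vector<uint8_t>(16) };
	const float value = 1.0f;
	FrameUploadQueue queue;
	queue.enqueue(&a, 0, &value, sizeof(value));
	queue.enqueue(&b, 4, &value, sizeof(value));
	queue.enqueue(&c, 8, &value, sizeof(value));
	queue.flush();
	return queue.barriers_issued;
}
```

With per-effect updates the same three uploads would cost six barriers; here the count stays at two no matter how many effects enqueue data.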

Certain FXs will unavoidably need to do copies because, for some reason, their data only became available at the last minute. But effects like SSIL are not among them. We've identified a problem that needs solving. For data that is only available at the last minute, using CPU-visible memory for UBOs <= 64kb is feasible. That way we remove the need for the barrier without cost.

There may be some exceptions too. For example in VR it is common to query the view matrix at the last possible second (sometimes even after the commands have been generated, but before the work is submitted to the GPU) to minimize latency.

Another similar problem is that compute has a tendency to revert all the transitions it did, assuming rendering will come next. But then another compute effect is chained together, which transitions the resources yet again to something Compute needs.
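The cost of reverting "just in case" can be shown with a toy counter. `TrackedTexture` and both `chain_*` functions are hypothetical and only count layout transitions; they do not model the memory barrier that two dependent compute dispatches still need between them.

```cpp
#include <cassert>

enum class State { Sampling, Storage };

struct TrackedTexture {
	State current = State::Sampling;
	int transitions = 0;

	// Transition only if the texture is not already in the wanted state.
	void require(State wanted) {
		if (current != wanted) {
			current = wanted;
			++transitions;
		}
	}
};

// Each effect reverts to Sampling when done, assuming rendering comes next.
inline int chain_with_eager_revert(int num_effects) {
	TrackedTexture tex;
	for (int i = 0; i < num_effects; ++i) {
		tex.require(State::Storage); // Compute writes the texture.
		tex.require(State::Sampling); // Reverted "just in case".
	}
	return tex.transitions;
}

// Each effect only declares what it needs; the final consumer asks for Sampling.
inline int chain_with_lazy_transitions(int num_effects) {
	TrackedTexture tex;
	for (int i = 0; i < num_effects; ++i) {
		tex.require(State::Storage);
	}
	tex.require(State::Sampling); // The raster pass that finally samples it.
	return tex.transitions;
}
```

Under these assumptions, chaining N compute effects costs 2N transitions with eager reverts and always two when each consumer declares what it needs.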

Passes are hardcoded and manually maintained in C++

This doesn't need much explanation. It's the main reason everyone is saying that Godot needs a refactor.

RenderForwardClustered::_render_scene is a god function. I mean what the heck is this?:

bool can_continue_color = !scene_state.used_screen_texture && !using_ssr && !using_sss;
bool can_continue_depth = !(scene_state.used_depth_texture || scene_state.used_normal_texture) && !using_ssil && !using_ssr && !using_sss;

bool will_continue_color = (can_continue_color || draw_sky || draw_sky_fog_only || debug_voxelgis || debug_sdfgi_probes);
bool will_continue_depth = (can_continue_depth || draw_sky || draw_sky_fog_only || debug_voxelgis || debug_sdfgi_probes);

// ...

if (will_continue_color && using_separate_specular) {
    // close the specular framebuffer, as it's no longer used
    RD::get_singleton()->draw_list_begin(rb_data->get_specular_only_fb(), RD::INITIAL_ACTION_CONTINUE, RD::FINAL_ACTION_READ, RD::INITIAL_ACTION_CONTINUE, RD::FINAL_ACTION_CONTINUE);
    RD::get_singleton()->draw_list_end();
}

It has gotten out of hand and become very difficult to maintain. It's very difficult to mentally track what's going on depending on what setting is toggled.

There is an inherent difficulty given the number of effects and permutations that Godot must support, however this paradigm is not scaling with the difficulty.

Solution

The compositor's structure boils down to the following base (note: the definition of CompositorPass is incomplete on purpose, and we'll expand it in the next sections):

class BarrierSolver;

class CompositorPass {
protected:
	/// Performs all transitions that execute() needs to start
	virtual void analyze_barriers(BarrierSolver *barrier_solver);

public:
	/// Tells compositor they need to call us at frame_init()
	virtual bool needs_frame_init() const = 0;
	/// Effects like SSIL will upload SSILProjectionUniforms here
	/// THIS CALL SHOULD NOT ISSUE BARRIERS. THAT'S ITS ENTIRE
	/// POINT: GATHER AS MUCH AS POSSIBLE INTO THE SAME BARRIER.
	virtual void frame_init() = 0;
	/// Performs the actual rendering/compute operation
	virtual void execute(BarrierSolver *barrier_solver) = 0;
};

class Compositor {
	BarrierSolver *m_barrier_solver;

	std::vector<CompositorPass *> m_frame_init;
	std::vector<CompositorPass *> m_passes;

public:
	void build() {
		// Gather all passes that need frame_init
		m_frame_init.clear();
		for (CompositorPass *pass : m_passes) {
			if (pass->needs_frame_init())
				m_frame_init.push_back(pass);
		}
	}

	void iterate() {
		for (CompositorPass *pass : m_frame_init)
			pass->frame_init();

		for (CompositorPass *pass : m_passes) {
			pass->execute(m_barrier_solver);
		}
	}
};

In Godot current lingo:

RenderForwardClustered::_process_ssil -> CompositorPass::execute
RenderForwardClustered::_render_scene -> Compositor::iterate
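To show the call order this structure produces, here is a small self-contained sketch. `LoggingPass`, `add_pass()` and `run_frame()` are assumptions added for the example (the proposal does not say how passes get registered), and `BarrierSolver` is an empty placeholder.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

class BarrierSolver {}; // Placeholder; the real solver is out of scope here.

class CompositorPass {
public:
	virtual ~CompositorPass() = default;
	virtual bool needs_frame_init() const = 0;
	virtual void frame_init() = 0;
	virtual void execute(BarrierSolver *barrier_solver) = 0;
};

// Hypothetical pass that only records when it is called.
class LoggingPass : public CompositorPass {
	std::string m_name;
	bool m_wants_init;
	std::vector<std::string> *m_log;

public:
	LoggingPass(std::string name, bool wants_init, std::vector<std::string> *log) :
			m_name(std::move(name)), m_wants_init(wants_init), m_log(log) {}
	bool needs_frame_init() const override { return m_wants_init; }
	void frame_init() override { m_log->push_back(m_name + ":init"); }
	void execute(BarrierSolver *) override { m_log->push_back(m_name + ":execute"); }
};

class Compositor {
	BarrierSolver m_barrier_solver;
	std::vector<CompositorPass *> m_frame_init;
	std::vector<CompositorPass *> m_passes;

public:
	// Assumed helper; pass registration is not specified in the proposal.
	void add_pass(CompositorPass *pass) { m_passes.push_back(pass); }

	void build() {
		m_frame_init.clear();
		for (CompositorPass *pass : m_passes) {
			if (pass->needs_frame_init())
				m_frame_init.push_back(pass);
		}
	}

	void iterate() {
		for (CompositorPass *pass : m_frame_init)
			pass->frame_init();
		for (CompositorPass *pass : m_passes)
			pass->execute(&m_barrier_solver);
	}
};

// Builds a two-pass compositor, runs one frame and returns the call log.
inline std::vector<std::string> run_frame() {
	std::vector<std::string> log;
	LoggingPass shadows("shadows", false, &log);
	LoggingPass ssil("ssil", true, &log);
	Compositor compositor;
	compositor.add_pass(&shadows);
	compositor.add_pass(&ssil);
	compositor.build();
	compositor.iterate();
	return log;
}
```

Every frame_init() runs before the first execute(), even though the SSIL pass is last in the list. That is what lets all uploads share one pair of barriers.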

Analyzing barriers

Barrier solving has two parts:

Solving & issuing barriers that each job needs to start
- It guarantees to the compositor pass that its initial conditions are met
- Rendering only needs this
- This can be abstracted and be generic
- A future solution may consider reordering these passes based on this data
Solving barriers inside the job
- Compute-heavy operations with multiple dependent passes need this.
- e.g. Parallel reduction algorithms (depth downsampling, mipmap generation, etc).
- This will be handled by the class that derives from CompositorPass inside their execute() implementation.
- Cannot be reordered by a higher system because everything inside CompositorPass::execute is an atomic unit of work.
  - If reordering is too important, it could be split into multiple passes.
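For the second part (barriers inside the job), a pass with dependent dispatches would place its own barriers between them. The sketch below assumes a mip-chain downsample where each dispatch reads the level written by the previous one; `FakeComputeList` and `downsample_chain` are hypothetical and only record the command stream.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical recorder standing in for a compute list.
struct FakeComputeList {
	std::vector<std::string> commands;
	void dispatch(int mip) { commands.push_back("dispatch mip " + std::to_string(mip)); }
	void barrier() { commands.push_back("barrier"); }
};

// Generates mips 1..num_mips-1. Each dispatch reads the mip written by the
// previous dispatch, so a barrier is needed between consecutive dispatches.
inline void downsample_chain(FakeComputeList &list, int num_mips) {
	for (int mip = 1; mip < num_mips; ++mip) {
		if (mip > 1) {
			// The previous dispatch wrote mip - 1, which this dispatch reads.
			list.barrier();
		}
		list.dispatch(mip);
	}
}
```

Note that no barrier is emitted before the first dispatch or after the last one: those belong to the first part, which the compositor resolves from the declared inputs of this pass and of whatever runs next.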

The following code addresses the first part:

enum class ResourceState {
	Sampling,
	Storage,
	Rendering,
	RenderingReadOnly,
	CopySrc,
	CopyDst
};

enum class ResourceAccess {
	Undefined = 0x00,
	Read = 0x01,
	Write = 0x02,
	ReadWrite = Read | Write
};

struct CompositorInput {
	RID resource; // Can be a buffer or a texture
	ResourceState state; // Only for textures
	ResourceAccess access;
};

class CompositorPass {
protected:
	std::vector<CompositorInput> m_inputs;

	/// Performs all transitions that execute() needs to start
	virtual void analyze_barriers(BarrierSolver *barrier_solver) {
		for (const CompositorInput &input : m_inputs)
			barrier_solver->solve(input);

		barrier_solver->execute_transitions();
	}

public:
    // The rest stays the same
};

class RenderingPass : public CompositorPass {
	VkRenderPass m_render_pass; // May be null

public:
	RenderingPass(RID colourRt[], size_t numColourRt, RID depthBuffer, bool isDepthPrepass) {
		for (size_t i = 0u; i < numColourRt; ++i) {
			CompositorInput compoInput;
			compoInput.resource = colourRt[i];
			compoInput.state = ResourceState::Rendering;
			/// Rendering is a read/write operation
			compoInput.access = ResourceAccess::ReadWrite;
			m_inputs.push_back(compoInput);
		}

		if (!depthBuffer.is_null()) {
			CompositorInput compoInput;
			compoInput.resource = depthBuffer;
			if (isDepthPrepass) {
				compoInput.state = ResourceState::Rendering;
				compoInput.access = ResourceAccess::ReadWrite;
			} else {
				compoInput.state = ResourceState::RenderingReadOnly;
				compoInput.access = ResourceAccess::Read;
			}
			m_inputs.push_back(compoInput);
		}

		// whatever_texture is a placeholder for any texture this pass samples
		CompositorInput compoInput;
		compoInput.resource = whatever_texture;
		compoInput.state = ResourceState::Sampling;
		compoInput.access = ResourceAccess::Read;
		m_inputs.push_back(compoInput);
	}

	void execute(BarrierSolver *barrier_solver) override {
		analyze_barriers(barrier_solver);
		do_rendering();
	}
};

In this example, RenderingPass declares everything it will need at creation or setup time. Then analyze_barriers() evaluates those declarations every frame, generically.
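To give an idea of what "generically" could mean, here is a toy solver. It is an assumption-heavy sketch and not the design from the barrier solver proposal: `ToyBarrierSolver` and `ResourceId` are made up, transitions are only counted, and the hazard rule (first use, state change, or a write involving a Storage resource) is a simplification chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

enum class ResourceState { Sampling, Storage, Rendering, RenderingReadOnly, CopySrc, CopyDst };
enum class ResourceAccess : uint8_t { Undefined = 0x00, Read = 0x01, Write = 0x02, ReadWrite = 0x03 };

using ResourceId = uint64_t; // Stand-in for Godot's RID.

struct CompositorInput {
	ResourceId resource;
	ResourceState state;
	ResourceAccess access;
};

class ToyBarrierSolver {
	struct Tracked {
		ResourceState state;
		ResourceAccess access;
	};
	std::unordered_map<ResourceId, Tracked> m_tracked;
	std::vector<ResourceId> m_pending;

	static bool writes(ResourceAccess a) {
		return (static_cast<uint8_t>(a) & static_cast<uint8_t>(ResourceAccess::Write)) != 0;
	}

public:
	// Queues a transition when the resource is new, changes state, or is a
	// Storage resource where the previous or the new access writes to it.
	void solve(const CompositorInput &input) {
		auto it = m_tracked.find(input.resource);
		const bool hazard = it == m_tracked.end() || it->second.state != input.state ||
				(input.state == ResourceState::Storage &&
						(writes(it->second.access) || writes(input.access)));
		if (hazard) {
			m_pending.push_back(input.resource);
		}
		m_tracked[input.resource] = { input.state, input.access };
	}

	// Flushes queued transitions; returns true if there was anything to do.
	bool execute_transitions() {
		const bool did_transitions = !m_pending.empty();
		m_pending.clear();
		return did_transitions;
	}
};
```

Because the solver remembers the last declared state of every resource, a pass never needs to know who ran before it. It only declares what it needs.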

Merging rendering passes

The previous example assumes one thing: that each RenderingPass will have its own VkRenderPass.

However this is too inflexible. What if we want to issue five RenderingPass in a row? (we'll ignore any compute passes)

One RenderingPass for shadow mapping
One RenderingPass for early prepass
One RenderingPass for opaque objects
One RenderingPass for sky
One RenderingPass for transparent objects

The first two (shadow mapping & early prepass) need their own VkRenderPass.

However we're fairly certain that opaque + sky + transparent objects render within the same VkRenderPass and should not open and close the pass 3 times (something which Godot is currently doing!!!).

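This is relatively easy to do. A RenderingPass can be merged into the currently open VkRenderPass when all of the following are true: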

A VkRenderPass must be currently open.
The RenderingPass must not want to clear (clearing means we must discard the current contents of what we're writing in the current VkRenderPass).
The entries in m_inputs with ResourceState::Rendering or RenderingReadOnly must be the exact same as the one used to create the VkRenderPass currently open.
barrier_solver->execute_transitions() must conclude there are no transitions to execute.

class RenderingPass : public CompositorPass {
	void analyze_barriers(BarrierSolver *barrier_solver) override {
		for (const CompositorInput &input : m_inputs)
			barrier_solver->solve(input);

		bool must_set_render_pass = false;
		const bool did_transitions = barrier_solver->execute_transitions();
		if (!did_transitions) {
			if (!current_pass_set || !check_if_equal(current_pass_set, m_inputs)) {
				must_set_render_pass = true;
			}
		} else {
			must_set_render_pass = true;
		}

		if (must_set_render_pass) {
			if (!m_render_pass)
				m_render_pass = create_pass();
			if (current_pass_set)
				render_pass_close(current_pass_set);
			render_pass_begin(m_render_pass);
		}
	}
};

There are other details I'm leaving out.

For example, it would be convenient if render passes could be marked as "merge only", which means Godot will error out when a pass cannot be merged due to incorrect setup.
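The merge conditions, plus the "merge only" idea, can be written as one pure function. Everything named here (`Attachment`, `PassRequest`, `MergeResult`, `resolve_render_pass`) is hypothetical and only meant to illustrate the decision, with `ResourceId` standing in for RID.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using ResourceId = uint64_t; // Stand-in for Godot's RID.

struct Attachment {
	ResourceId resource;
	bool read_only; // true for ResourceState::RenderingReadOnly
};

inline bool operator==(const Attachment &a, const Attachment &b) {
	return a.resource == b.resource && a.read_only == b.read_only;
}

struct PassRequest {
	std::vector<Attachment> attachments;
	bool wants_clear;
	bool merge_only; // Hypothetical flag: failing to merge is a setup error.
};

enum class MergeResult { Merged, NewRenderPass, Error };

// Applies the four merge conditions to the currently open render pass.
inline MergeResult resolve_render_pass(bool pass_is_open,
		const std::vector<Attachment> &open_attachments,
		const PassRequest &next, bool did_transitions) {
	const bool can_merge = pass_is_open && !next.wants_clear &&
			open_attachments == next.attachments && !did_transitions;
	if (can_merge) {
		return MergeResult::Merged;
	}
	return next.merge_only ? MergeResult::Error : MergeResult::NewRenderPass;
}
```

In the five-pass example, sky and transparent objects would both resolve to `Merged` against the render pass opened by the opaque pass, assuming they do not clear and use the same attachments.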

Barrier solver

See godotengine/godot-proposals#7366

Custom code

A problem right now is that Godot users want to run custom compute shaders but don't know how to deal with barriers.

While we cannot fix this problem entirely for complex compute work (which inherently requires a basic understanding of parallel work synchronization), the vast majority of what users do boils down to running basic shaders (Compute or Raster), often in a single dispatch or a few, that receive an input and produce an output.

As a result, this system can make things easier for users because they just need to declare what inputs they depend on (and depending on what the interface for users looks like, this may even be intuitive, i.e. first ask for the input resources, then ask for what Compute Shader to run, and only those input resources can be used).

Such code would be run by a class that derives from CompositorPass.
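One possible shape for the "declare inputs first, then only those can be used" interface is sketched below. `UserComputePass`, `declare_input()` and `bind()` are invented for this illustration; the proposal does not define the user-facing API, and a real version would derive from CompositorPass and feed the declared inputs to the barrier solver.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using ResourceId = uint64_t; // Stand-in for Godot's RID.

// Hypothetical user-facing pass: inputs are declared up front, and the
// shader may only bind resources that were declared.
class UserComputePass {
	std::vector<ResourceId> m_declared;
	std::vector<ResourceId> m_bound;

public:
	void declare_input(ResourceId resource) { m_declared.push_back(resource); }

	// Returns false (and binds nothing) if the resource was never declared.
	bool bind(ResourceId resource) {
		if (std::find(m_declared.begin(), m_declared.end(), resource) == m_declared.end()) {
			return false;
		}
		m_bound.push_back(resource);
		return true;
	}

	size_t bound_count() const { return m_bound.size(); }
};
```

Since the pass cannot touch anything it did not declare, the declared list is a complete description of its dependencies, and the user never has to write a barrier.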

Final code

The code mockup would look more or less like this:

struct CompositorInput;
class BarrierSolver {
public:
	void solve(const CompositorInput &input);
	/// Returns true if any transition was issued
	bool execute_transitions();
};

enum class ResourceState {
	Sampling,
	Storage,
	Rendering,
	RenderingReadOnly,
	CopySrc,
	CopyDst
};

enum class ResourceAccess {
	Undefined = 0x00,
	Read = 0x01,
	Write = 0x02,
	ReadWrite = Read | Write
};

struct CompositorInput {
	RID resource; // Can be a buffer or a texture
	ResourceState state; // Only for textures
	// For many types of buffers & textures, the access can be derived:
	//	Vertex/index/ubo buffers is always Read
	//	Sampling textures is always Read
	//	Rendering is always ReadWrite
	//	RenderingReadOnly is always Read
	//	CopySrc is always Read
	//	CopyDst is always Write
	//
	// The following cannot be derived:
	//	Storage can be Read, Write, or ReadWrite
	ResourceAccess access;
};

class CompositorPass {
protected:
	std::vector<CompositorInput> m_inputs;

	/// Performs all transitions that execute() needs to start
	virtual void analyze_barriers(BarrierSolver *barrier_solver) {
		for (const CompositorInput &input : m_inputs)
			barrier_solver->solve(input);

		barrier_solver->execute_transitions();
	}

public:
	/// Tells compositor they need to call us at frame_init()
	virtual bool needs_frame_init() const = 0;
	/// Effects like SSIL will upload SSILProjectionUniforms here
	/// THIS CALL SHOULD NOT ISSUE BARRIERS. THAT'S ITS ENTIRE
	/// POINT: GATHER AS MUCH AS POSSIBLE INTO THE SAME BARRIER.
	virtual void frame_init() = 0;
	/// Performs the actual rendering/compute operation
	virtual void execute(BarrierSolver *barrier_solver) = 0;
};

class Compositor {
	BarrierSolver *m_barrier_solver;

	std::vector<CompositorPass *> m_frame_init;
	std::vector<CompositorPass *> m_passes;

public:
	void build() {
		// Gather all passes that need frame_init
		m_frame_init.clear();
		for (CompositorPass *pass : m_passes) {
			if (pass->needs_frame_init())
				m_frame_init.push_back(pass);
		}
	}

	void iterate() {
		for (CompositorPass *pass : m_frame_init)
			pass->frame_init();

		for (CompositorPass *pass : m_passes) {
			pass->execute(m_barrier_solver);
		}
	}
};

// Example body of frame_init() for a pass such as SSIL (the enclosing class is not shown):
{
	RD::get_singleton()->buffer_update(
			ssil.projection_uniform_buffer, 0, sizeof(SSILProjectionUniforms), &projection_uniforms);
}

class RenderingPass : public CompositorPass {
	VkRenderPass m_render_pass;

	void analyze_barriers(BarrierSolver *barrier_solver) override {
		for (const CompositorInput &input : m_inputs)
			barrier_solver->solve(input);

		bool must_set_render_pass = false;
		const bool did_transitions = barrier_solver->execute_transitions();
		if (!did_transitions) {
			if (!current_pass_set || !check_if_equal(current_pass_set, m_inputs)) {
				must_set_render_pass = true;
			}
		} else {
			must_set_render_pass = true;
		}

		if (must_set_render_pass) {
			if (!m_render_pass)
				m_render_pass = create_pass();
			if (current_pass_set)
				render_pass_close(current_pass_set);
			render_pass_begin(m_render_pass);
		}
	}
public:
	RenderingPass(RID colourRt[], size_t numColourRt, RID depthBuffer, bool isDepthPrepass) {
		for (size_t i = 0u; i < numColourRt; ++i) {
			CompositorInput compoInput;
			compoInput.resource = colourRt[i];
			compoInput.state = ResourceState::Rendering;
			/// Rendering is a read/write operation
			compoInput.access = ResourceAccess::ReadWrite;
			m_inputs.push_back(compoInput);
		}

		if (!depthBuffer.is_null()) {
			CompositorInput compoInput;
			compoInput.resource = depthBuffer;
			if (isDepthPrepass) {
				compoInput.state = ResourceState::Rendering;
				compoInput.access = ResourceAccess::ReadWrite;
			} else {
				compoInput.state = ResourceState::RenderingReadOnly;
				compoInput.access = ResourceAccess::Read;
			}
			m_inputs.push_back(compoInput);
		}

		CompositorInput compoInput;
		compoInput.resource = whatever_texture; // Placeholder for any texture this pass samples
		compoInput.state = ResourceState::Sampling;
		compoInput.access = ResourceAccess::Read;
		m_inputs.push_back(compoInput);
	}

	void execute(BarrierSolver *barrier_solver) override {
		analyze_barriers(barrier_solver);
		do_rendering();
	}
};
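The comment inside CompositorInput says the access can be derived from the state in most cases. A possible helper for that is sketched here; `derive_access` is a hypothetical name, and buffers (vertex/index/UBO) are left out because the mockup has no state value for them.

```cpp
#include <cassert>

enum class ResourceState { Sampling, Storage, Rendering, RenderingReadOnly, CopySrc, CopyDst };
enum class ResourceAccess { Undefined = 0x00, Read = 0x01, Write = 0x02, ReadWrite = Read | Write };

// Writes the implied access to out_access and returns true when the state
// fully determines it. Returns false for Storage, which the caller must
// specify explicitly (Read, Write or ReadWrite).
inline bool derive_access(ResourceState state, ResourceAccess &out_access) {
	switch (state) {
		case ResourceState::Sampling:
		case ResourceState::RenderingReadOnly:
		case ResourceState::CopySrc:
			out_access = ResourceAccess::Read;
			return true;
		case ResourceState::Rendering:
			out_access = ResourceAccess::ReadWrite;
			return true;
		case ResourceState::CopyDst:
			out_access = ResourceAccess::Write;
			return true;
		case ResourceState::Storage:
			return false;
	}
	return false;
}
```

With a helper like this, passes such as RenderingPass would only need to set `access` by hand for Storage resources.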

darksylinc/GodotCompositors.md