- [What is Vulkan? :: Vulkan Documentation Project](https://docs.vulkan.org/guide/latest/what_is_vulkan.html)
- [Vulkan Architecture](https://chromium.googlesource.com/external/github.com/KhronosGroup/Vulkan-Loader/+/HEAD/loader/LoaderAndLayerInterface.md)
- Instance creation issue: ERROR_INCOMPATIBLE_DRIVER
Most introductions to modern GPU architecture jump straight into speeds and feeds or the concept of Latency vs. Throughput:
- CPUs have fewer, larger, and faster cores, prioritizing latency (i.e., they are optimized for general-purpose computing).
- GPUs have many smaller, slower cores, prioritizing throughput (i.e., they are designed for data-parallel computations).
- Unlike CPUs, GPUs have fewer and smaller memory cache layers. This is because GPUs dedicate more transistors to computation and rely on massive parallelism to hide memory latency, rather than on large caches to avoid it.
A block diagram of Nvidia’s Hopper H100:
Zooming into a single streaming multi-processor:

GPUs operate using a command-response model. In this model, you send commands to the GPU, and once the commands are processed, the GPU notifies your application when it is ready to accept more work.
- When you begin exploring the Vulkan documentation, you’ll encounter terms like:
- Commands recorded into a command buffer
- Commands added to a command queue
- etc.
- The user-mode driver:
- Translates Vulkan API calls into GPU-specific commands.
- Manages Vulkan objects like pipelines, shaders, and command buffers.
- Handles API-level validation and error checking.
- Interacts with the kernel-mode driver for:
- Memory allocation requests.
- Queue submissions.
- Synchronization mechanisms.
- The kernel-mode driver:
- Manages low-level hardware interaction with the GPU.
- Handles memory management, including VRAM and shared memory.
- Processes DMA (Direct Memory Access) transfers.
- Submits command buffers to the GPU (e.g., via ring buffers or execution queues).
- Ensures GPU security and stability across processes.
- Handles interrupts and other hardware events.
Key Difference:
- User-Mode Driver: Focused on translating high-level API calls into hardware-specific instructions.
- Kernel-Mode Driver: Handles low-level hardware management and execution of commands on the GPU.
Linux: GPU devices typically implement `open` and `close` for opening/closing the device, `mmap` for sharing data between the application (which uses the user-mode driver as its deputy) and the kernel-mode driver, and `ioctl` for various controls.
The figure below shows how the host submits work to the GPU:
- The host communicates with the command processor (CP) of the GPU via a virtual memory region (i.e. a ring buffer) which is memory mapped to the GPU, accessible by the command processor.
- This enables communication between the CPU and GPU through entries in the command queue.
- The CPU transmits kernel launch packets to the GPU by writing them to the user mode command queue.
- The CP is responsible for decoding and dispatching the kernels in these command queues for execution.
- The CP accesses the command queue and schedules the kernels at the head for execution. This ensures that the kernels are dispatched for launch from these queues in order.
- A ring buffer has a read pointer, used by the GPU to fetch commands for execution, and a write pointer, used by the kernel mode driver to queue workloads.
- Application-generated command buffers are initially stored in system memory allocated by the user-mode driver. The kernel-mode driver doesn’t copy these large buffers into the ring buffer; instead, it inserts *indirect calls* referencing them. This allows multiple applications to prepare workloads concurrently without excessive locking.
- In summary, a ring buffer contains GPU setup/teardown commands, context-switching commands, and indirect calls to application-specific command buffers, along with other commands (e.g., for performance counters).
- During device initialization in a modern GPU driver stack, the kernel mode driver is responsible for setting up the IOMMU (Input-Output Memory Management Unit) and configuring the GPU page tables.
- Device Initialization:
- The kernel mode driver allocates memory regions for the GPU.
- Configures the IOMMU to map GPU virtual addresses to the correct physical addresses.
- Sets up and populates the GPU’s page tables with mappings for resources like textures, buffers, and shaders.
- In the GPU’s case, the IOMMU helps prevent Direct Memory Access (DMA) attacks by limiting which memory the GPU can access.
Note:
GPU page tables are physically located in system memory and managed in the kernel address space, ensuring both security and proper low-level memory management for the GPU. They map virtual GPU addresses (used by applications) to physical memory addresses, either in system memory (RAM) or video memory (VRAM) on the GPU. The kernel-mode driver is responsible for setting up, maintaining, and updating these page tables. When an application submits a workload, the user-mode driver requests memory allocations, and the kernel-mode driver ensures the GPU page tables are updated accordingly.
### Command processor:
- The command processor, also known as the graphics controller, retrieves commands from the command queue for execution. These commands include:
- Kernel launches
- Memory copy operations
- Other related tasks
- When people refer to GPU firmware, they are typically referring to the firmware that manages the command processor.
- A command queue contains a list of command packets, each of which has a defined format (i.e., the compute command format below).
- A command packet is a sequence of method calls.
- The example below is for NVIDIA GPUs based on the Fermi architecture: a command packet that performs a vector addition of 128 elements (a 1-dimensional grid).
- Method calls are encoded as sequences of 32-bit words.
- Sub-channels represent one of several slots (usually 8) in a hardware queue. Each slot gets “bound” to a specific task type, like compute or graphics, so commands can be routed efficiently to the right GPU engine without interference.
| Sequence | Subchannel | Method (Hex) | Parameter | Description |
|---|---|---|---|---|
| 1 | 0 | 0x0000 (BIND) | Compute object ID | Bind the Compute Object to a Subchannel |
| 2 | 0 | 0x0bd8 (LOCAL_SIZE) | Local memory size in bytes (e.g., 0) | Set local memory size |
| 3 | 0 | 0x0bdc (SHARED_SIZE) | Shared memory size in bytes (0) | Set shared memory size |
| 4 | 0 | 0x0be8 (CBANK_SIZE) | Constant bank size | Set constant bank size |
| 5 | 0 | 0x0be0 (CODE_ADDRESS_HIGH) | High 32 bits of SASS address | Set code address high |
| 6 | 0 | 0x0be4 (CODE_ADDRESS_LOW) | Low 32 bits of SASS address | Set code address low |
| 7 | 0 | 0x0bf0 (CONSTANT_BUFFER_LOAD) | Header for parameters (count=4) | Start loading parameters (using incrementing mode) |
| 8 | 0 | Incremental | 64-bit A_ptr (high/low as two words) | Parameter: A_ptr |
| 9 | 0 | Incremental | 64-bit B_ptr (high/low as two words) | Parameter: B_ptr |
| 10 | 0 | Incremental | 64-bit C_ptr (high/low as two words) | Parameter: C_ptr |
| 11 | 0 | Incremental | 32-bit n | Parameter: n |
| 12 | 0 | 0x0bc4 (GRID_DIM_X) | 1 | Set grid dimension X |
| 13 | 0 | 0x0bc8 (GRID_DIM_Y) | 1 | Set grid dimension Y |
| 14 | 0 | 0x0bcc (GRID_DIM_Z) | 1 | Set grid dimension Z |
| 15 | 0 | 0x0bb0 (BLOCK_DIM_X) | 128 | Set block dimension X |
| 16 | 0 | 0x0bb4 (BLOCK_DIM_Y) | 1 | Set block dimension Y |
| 17 | 0 | 0x0bb8 (BLOCK_DIM_Z) | 1 | Set block dimension Z |
| 18 | 0 | 0x0bd0 (REGISTER_COUNT) | Registers per thread (e.g., 8) | Set register count |
| 19 | 0 | 0x0bd4 (BARRIER_ALLOC) | 0 | Set number of barriers |
| 20 | 0 | 0x0bec (CACHE_CONFIG) | Preferred cache mode | Set cache configuration |
| 21 | 0 | 0x0bf4 (STACK_SIZE) | Per-thread stack size (if needed) | Set stack size |
| 22 | 0 | 0x0c00 (DISPATCH) | 0 (non-blocking) | Launch the kernel |
- A single `WorkCommandCP`, i.e., one compute command packet describing a compute dispatch and its supporting state/microsequence for the Apple GPU, via m1n1’s proxy structures.
- WorkCommandCP is the compute-command variant of the AGX “work command” structures that bundle a compute dispatch’s context, registers, microsequence, timestamps, and encoder/job metadata for submission through the firmware-managed queues, analogous to a single compute command buffer submission in modern UAPIs.
- Asahi’s UAPI work describes compute commands as streams of compute dispatches with command-level timestamps; `WorkCommandCP` in m1n1 is the reverse-engineered development format used to introspect those fields before the upstream Rust DRM driver/UAPI consolidation, so the snippet corresponds to one such compute submission record.
| # | Field | Size/Type | Description |
|---|---|---|---|
| 1 | magic | 32-bit, = 0x00000003 | `WorkCommandCP` “magic” identifying a compute work command packet |
| 2 | counter | 64-bit counter (V ≥ 13_0B4) | Monotonic sequence/counter used by newer firmware versions for tracking submissions |
| 3 | unk_4 | 32-bit | Unknown field following magic/counter used internally by firmware (kept for structure layout) |
| 4 | context_id | 32-bit context ID | Identifies the GPU context/VM for the compute job |
| 5 | event_control_addr | 64-bit GPU VA | Address of event-control block providing completion/fault signaling for the job |
| 6 | event_control | pointer -> EventControl | Resolved structure containing event bits/counters used by firmware signaling |
| 7 | unk_2c | 32-bit | Reserved/unknown slot kept from tracing; firmware expects it in layout |
| 8 | registers[] | 128 RegisterDefinition (G ≥ G14X) | Inline register programming block for compute stage on newer GPU gens |
| 9 | unk_g14x[] | 64 x u32 default 0 (G ≥ G14X) | Extra reserved space for G14X+ parts observed during RE |
| 10 | unk_buf | 0x50 bytes (G < G14X) | Legacy padding/unknown blob for pre-G14X layouts |
| 11 | compute_info | ComputeInfo (G < G14X) | Legacy compute configuration block for older GPUs (grid/block, resources) |
| 12 | registers_addr | 64-bit GPU VA | Pointer to an external register programming list when not inlined |
| 13 | register_count | 16-bit | Number of register entries to apply from registers_addr |
| 14 | registers_length | 16-bit | Byte length of the register programming payload |
| 15 | unk_pad | 0x24 bytes | Reserved pad matching firmware alignment requirements |
| 16 | microsequence_ptr | 64-bit GPU VA | Pointer to the microsequence (firmware-level control stream) for compute |
| 17 | microsequence_size | 32-bit | Size of the microsequence blob referenced above |
| 18 | microsequence | pointer -> MicroSequence | Decoded microsequence entries (dispatch/phase commands) |
| 19 | compute_info2 | ComputeInfo2 | Extended compute configuration (e.g., additional resources/limits) |
| 20 | encoder_params | EncoderParams | Parameters for firmware encoder interpreting microsequence/register lists |
| 21 | job_meta | JobMeta | Submission/job metadata (priority, queues, dependencies) |
| 22 | ts1 | TimeStamp | Primary timestamp record at command granularity |
| 23 | ts_pointers | TimeStampPointers | Pointers for firmware to write timestamps (GPU/CPU domains) |
| 24 | user_ts_pointers | TimeStampPointers | Additional timestamp pointers intended for userspace reporting |
| 25 | client_sequence | 8-bit | Small client-side sequence/id used for tracking submissions |
| 26 | unk_ts2 | TimeStamp (V ≥ 13_0B4) | Extra timestamp block used in newer firmware |
| 27 | unk_ts | TimeStamp (V ≥ 13_0B4) | Additional timestamp record (purpose under investigation) |
| 28 | unk_2e1 | 0x1c bytes default 0 (V ≥ 13_0B4) | Reserved/unknown; firmware expects presence |
| 29 | unk_flag | Flag (V ≥ 13_0B4) | Boolean/bitflag toggling optional firmware behavior |
| 30 | unk_pad | 0x10 bytes default 0 (V ≥ 13_0B4) | Additional reserved padding for newer versions |
| 31 | pad_2d9 | 0x7 bytes default 0 | Tail padding to satisfy alignment/size constraints |
- Apple/AGX compute submission is not programmed via subchannels/method registers in the public UAPI; instead, the firmware consumes structured command blocks (with register programming lists and microsequences) placed in GPU-accessible memory, so the Subchannel and Method columns are not applicable here.
- The “microsequence” plus register lists together stand in for the “methods” used on NVIDIA Fermi, while ComputeInfo/ComputeInfo2 capture dispatch geometry/resources akin to grid/block and LDS/stack config in the NVIDIA example.
- The `WorkCommandCP` instance above corresponds to a single compute command submission containing the state, control stream, and bookkeeping needed by the firmware to dispatch one compute workload, matching Asahi’s description that a compute command encapsulates one or more dispatches with command-level timestamps.
- Let’s do a recap ([[/Well, use the CPU to program the GPU (it’s a coprocessor):|Host]] vs. Device programming)
- Generic shader programming workflow
- In Vulkan, when you call `vkCreateGraphicsPipelines` or `vkCreateComputePipelines`, that’s when the Vulkan driver translates SPIR-V into the GPU’s native instruction set (ISA).
- Vulkan forces applications to compile pipelines ahead of time. This makes performance predictable since there’s no unexpected shader compilation during rendering.
- But we can also avoid this runtime compilation:
- Pipeline cache (`VkPipelineCache`): stores compiled pipelines so they don’t have to be recompiled next time.
- Vendor tools (NVIDIA, AMD, etc.): some vendors let you precompile SPIR-V to a GPU-specific binary offline, so the driver doesn’t need to compile at runtime.
- SPIR-V as the interchange format for the future.
- Shader dev tools and transpilers.
- What do shading languages look like, and what does SPIR-V look like?
- Let’s look at some shader code (to get a feel for shader programming).
- Notes on shader programming without host code:
- Interactive Shader Coding
- Shader playgrounds (Shadertoy, GLSL Sandbox)
- Game engines (Unity’s Shader Graph, Unreal’s HLSL-based material system)
- Compute-focused Shader-Like Programming (via MLIR)
- Mojo – Pythonic GPU programming, [[/High-Level Overview of GPU Driver Architecture:|no explicit host API management]]
- Triton – [[Pliron/Intuition:|Custom GPU kernels for ML workloads]], JIT-compiled via LLVM
- PyTorch / TensorFlow – High-level tensor computation mapped to GPU backends
# CUDA/PTX vs. SPIR-V
CUDA PTX is a great contrast point because it sits lower than CUDA C++ but higher than raw SASS (NVIDIA machine code). Unlike SPIR-V, PTX really does expose a more “CPU-like” model, even though the hardware underneath is SIMT. Let’s walk through the same feature list again:
⸻
- Unrestricted pointers: Yes.
- In PTX, pointers are just 64-bit integers (.u64) under the hood.
- You can do pointer arithmetic, cast them to/from integers, and manipulate them freely.
- Of course, the memory space still matters (global, shared, local, const), but once you’ve got an address, it’s just an integer.
⸻
- Generic pointers: Sort of.
- PTX pointers do carry address space tags (like global, shared, local), but since they’re integers you can bit-twiddle and reinterpret.
- There isn’t a built-in “generic pointer type” like OpenCL C’s generic, but you can roll your own by packing an address + space tag in a 64-bit value and branching at load/store time.
- NVIDIA’s compiler often optimizes around this.
⸻
- True function calls: Yes.
- PTX supports real call and ret instructions.
- A per-thread call stack exists in GPU local memory, so recursion is possible.
- That said: it’s slow and discouraged for performance reasons, but nothing in PTX forbids it.
- CUDA C++ recursion is officially supported since compute capability 2.0 (Fermi era).
⸻
- Function pointers: Yes.
- PTX supports indirect calls (call.uni with a register holding the function address).
- CUDA C++ lets you use function pointers on device code, though with restrictions on inlining and performance.
- This means you can actually write functional-style GPU code in CUDA, unlike SPIR-V.
⸻
- Arbitrary gotos: Yes.
- PTX has bra (branch) and @pred bra (conditional branch) instructions that can target arbitrary labels.
- Control flow does not need to be structured — you can write totally spaghetti PTX if you want.
- The compiler might restructure things for performance, but the IR itself allows unrestricted jumps.
⸻
✅ Summary table:
| Feature | SPIR-V | CUDA/PTX |
|---|---|---|
| Unrestricted pointers | ❌ strongly typed, no arithmetic | ✔ true integers, arithmetic allowed |
| Generic pointers | ❌ storage class encoded in type | (✔) possible with tricks, not native |
| True function calls | ✔ (but recursion forbidden) | ✔ recursion supported |
| Function pointers | ❌ | ✔ supported |
| Arbitrary gotos | ❌ structured only | ✔ unrestricted branches |
⸻
👉 So, PTX is much closer to “GPU assembly with a CPU-like flavor,” whereas SPIR-V is deliberately restricted and structured to make GPU driver backends simpler and more predictable.
If you want “full C++ semantics on GPU”, CUDA PTX is the closest you’ll get today. SPIR-V is more of a constrained IR to glue multiple shading languages into Vulkan.