ash_vulkan_bindings

Ash - Rust bindings to Vulkan


What is Vulkan?

Figure: [What is Vulkan? :: Vulkan Documentation Project](https://docs.vulkan.org/guide/latest/what_is_vulkan.html)

High-level Overview of Vulkan’s Architecture:

Figure: [Vulkan Architecture](https://chromium.googlesource.com/external/github.com/KhronosGroup/Vulkan-Loader/+/HEAD/loader/LoaderAndLayerInterface.md)

My first MoltenVK issue

Modern GPU Architecture:

Most introductions to modern GPU architecture jump straight into speeds and feeds or the concept of Latency vs. Throughput:

  • CPUs have fewer, larger, and faster cores, prioritizing latency (i.e., they are optimized for general-purpose computing).
  • GPUs have many smaller, slower cores, prioritizing throughput (i.e., they are designed for data-parallel computations).
  • Unlike CPUs, GPUs operate with fewer and smaller memory cache layers. They dedicate more transistors to computation instead, hiding memory latency by switching among many threads rather than relying on large caches.

A block diagram of Nvidia’s Hopper H100:

Zooming into a single streaming multiprocessor:


But how do you program (or talk to) a GPU?

Well, use the CPU to program the GPU (it’s a coprocessor):

GPUs operate using a command-response model. In this model, you send commands to the GPU, and once the commands are processed, the GPU notifies your application when it is ready to accept more work.

When you begin exploring the Vulkan documentation, you’ll encounter terms like:

  • Commands recorded into a command buffer
  • Commands added to a command queue
  • etc.

High-Level Overview of GPU Driver Architecture:


User-Mode Driver (ICD) Responsibilities:

  1. Translates Vulkan API calls into GPU-specific commands.
  2. Manages Vulkan objects like pipelines, shaders, and command buffers.
  3. Handles API-level validation and error checking.
  4. Interacts with the kernel-mode driver for:
    • Memory allocation requests.
    • Queue submissions.
    • Synchronization mechanisms.

Kernel-Mode Driver Responsibilities:

  1. Manages low-level hardware interaction with the GPU.
  2. Handles memory management, including VRAM and shared memory.
  3. Processes DMA (Direct Memory Access) transfers.
  4. Submits command buffers to the GPU (e.g., via ring buffers or execution queues).
  5. Ensures GPU security and stability across processes.
  6. Handles interrupts and other hardware events.

Key Difference:

  • User-Mode Driver: Focused on translating high-level API calls into hardware-specific instructions.
  • Kernel-Mode Driver: Handles low-level hardware management and execution of commands on the GPU.

Linux: GPU devices typically implement open and close for opening/closing the device, mmap for sharing data between the application (which uses the user mode driver as its deputy) and the kernel mode driver, and ioctl for various controls.
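
As a minimal sketch of that interface, the snippet below opens a DRM render node and asks the kernel-mode driver for its name via `DRM_IOCTL_VERSION` (a real ioctl from Linux’s `drm.h`). It assumes a Linux machine with a render node at `/dev/dri/renderD128` and the `libc` crate; the ioctl number is hard-coded here for brevity (`_IOWR('d', 0x00, struct drm_version)`) where real code would derive it with an ioctl macro (e.g., via the `nix` crate).

```rust
use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

// struct drm_version, from Linux's include/uapi/drm/drm.h.
#[repr(C)]
struct DrmVersion {
    version_major: libc::c_int,
    version_minor: libc::c_int,
    version_patchlevel: libc::c_int,
    name_len: usize,
    name: *mut libc::c_char,
    date_len: usize,
    date: *mut libc::c_char,
    desc_len: usize,
    desc: *mut libc::c_char,
}

// _IOWR('d', 0x00, struct drm_version); hard-coded for brevity.
const DRM_IOCTL_VERSION: libc::c_ulong = 0xc040_6400;

fn main() -> std::io::Result<()> {
    // open(): get a file descriptor to the kernel-mode driver.
    let dev = OpenOptions::new()
        .read(true)
        .write(true)
        .open("/dev/dri/renderD128")?;

    // ioctl() pass 1: the driver fills in the length of its name.
    let mut ver: DrmVersion = unsafe { std::mem::zeroed() };
    unsafe { libc::ioctl(dev.as_raw_fd(), DRM_IOCTL_VERSION, &mut ver as *mut DrmVersion) };

    // ioctl() pass 2: the driver copies the name into our buffer.
    let mut name = vec![0u8; ver.name_len];
    ver.name = name.as_mut_ptr() as *mut libc::c_char;
    unsafe { libc::ioctl(dev.as_raw_fd(), DRM_IOCTL_VERSION, &mut ver as *mut DrmVersion) };

    // Prints e.g. "amdgpu", "i915", "nouveau", or "asahi".
    println!("kernel-mode driver: {}", String::from_utf8_lossy(&name));
    Ok(())
}
```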

GPU command queues (aka GPU ring buffers):

The figure below shows how:

  • The host communicates with the command processor (CP) of the GPU via a virtual memory region (i.e. a ring buffer) which is memory mapped to the GPU, accessible by the command processor.
  • This enables communication between the CPU and GPU through entries in the command queue.
  • The CPU transmits kernel launch packets to the GPU by writing them to the user mode command queue.
  • The CP is responsible for decoding and dispatching the kernels in these command queues for execution.
  • The CP accesses the command queue and schedules the kernels at the head for execution. This ensures that the kernels are dispatched for launch from these queues in order.
Figure: CPU-GPU datapath

Additional notes about the GPU ring buffer:

  • A ring buffer has a read pointer, used by the GPU to fetch commands for execution, and a write pointer, used by the kernel mode driver to queue workloads.
  • Application-generated command buffers are initially stored in system memory allocated to the user mode driver. The kernel mode driver doesn’t copy these large buffers into the ring buffer but instead inserts indirect calls referencing them. This allows multiple applications to prepare workloads concurrently without excessive locking.
  • In summary, a ring buffer contains GPU setup/teardown commands, context-switching commands, and indirect calls to application-specific command buffers, along with other commands (e.g., for performance counters).
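
To make the pointer arithmetic concrete, here is a toy model of such a ring (the type names `RingBuffer` and `Packet` are my own invention; a real ring lives in GPU-visible memory and the pointers are mirrored in doorbell/MMIO registers): the kernel-mode driver advances the write pointer as it queues packets, and the command processor advances the read pointer as it consumes them, wrapping at the end.

```rust
#[derive(Clone, Copy)]
enum Packet {
    Setup(u32),                            // GPU setup/teardown or context-switch command
    IndirectCall { addr: u64, size: u32 }, // reference into an app's command buffer
}

struct RingBuffer {
    slots: Vec<Option<Packet>>,
    read: usize,  // advanced by the GPU's command processor
    write: usize, // advanced by the kernel-mode driver
}

impl RingBuffer {
    fn new(capacity: usize) -> Self {
        Self { slots: vec![None; capacity], read: 0, write: 0 }
    }

    fn is_full(&self) -> bool {
        (self.write + 1) % self.slots.len() == self.read
    }

    /// Kernel-mode driver side: queue a packet. Note that an app's command
    /// buffer is not copied here; only an indirect reference to it is.
    fn push(&mut self, p: Packet) -> Result<(), ()> {
        if self.is_full() {
            return Err(()); // must wait for the GPU to catch up
        }
        self.slots[self.write] = Some(p);
        self.write = (self.write + 1) % self.slots.len();
        Ok(())
    }

    /// Command-processor side: fetch the packet at the head, strictly in order.
    fn pop(&mut self) -> Option<Packet> {
        if self.read == self.write {
            return None; // ring is empty
        }
        let p = self.slots[self.read].take();
        self.read = (self.read + 1) % self.slots.len();
        p
    }
}
```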

GPU as a shared resource:

  • During device initialization in a modern GPU driver stack, the kernel mode driver is responsible for setting up the IOMMU (Input-Output Memory Management Unit) and configuring the GPU page tables.
  • Device Initialization:
    • The kernel mode driver allocates memory regions for the GPU.
    • Configures the IOMMU to map GPU virtual addresses to the correct physical addresses.
    • Sets up and populates the GPU’s page tables with mappings for resources like textures, buffers, and shaders.
  • In the GPU’s case, the IOMMU is used to prevent Direct Memory Access (DMA) attacks by limiting what memory the GPU can access.

Note:

GPU page tables are physically located in system memory and managed in the kernel address space, ensuring both security and proper low-level memory management for the GPU. They map virtual GPU addresses (used by applications) to physical memory addresses, either in system memory (RAM) or video memory (VRAM) on the GPU. The kernel-mode driver is responsible for setting up, maintaining, and updating these page tables. When an application submits a workload, the user-mode driver requests memory allocations, and the kernel-mode driver ensures the GPU page tables are updated accordingly.
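
The mechanics are easiest to see in a toy model. The sketch below is an illustrative single-level table with 4 KiB pages, not any real GPU’s multi-level format; it shows the two roles: the kernel-mode driver populates the table on allocation, and the GPU’s MMU/IOMMU performs the equivalent of `translate` on every access, faulting on anything the driver never mapped.

```rust
use std::collections::HashMap;

const PAGE_SHIFT: u64 = 12; // 4 KiB pages
const PAGE_MASK: u64 = (1 << PAGE_SHIFT) - 1;

/// Virtual page number -> physical page frame (in RAM or VRAM).
struct GpuPageTable {
    table: HashMap<u64, u64>,
}

impl GpuPageTable {
    /// Kernel-mode driver path: map a GPU virtual page to a physical frame.
    fn map(&mut self, gpu_va: u64, phys_frame: u64) {
        self.table.insert(gpu_va >> PAGE_SHIFT, phys_frame);
    }

    /// MMU/IOMMU path: translate an address, faulting (None) on unmapped
    /// pages; this is how DMA is confined to memory the driver handed out.
    fn translate(&self, gpu_va: u64) -> Option<u64> {
        let frame = *self.table.get(&(gpu_va >> PAGE_SHIFT))?;
        Some((frame << PAGE_SHIFT) | (gpu_va & PAGE_MASK))
    }
}

fn main() {
    let mut pt = GpuPageTable { table: HashMap::new() };
    pt.map(0x1000, 0x8_0000); // KMD maps one page for a buffer
    assert_eq!(pt.translate(0x1abc), Some((0x8_0000 << PAGE_SHIFT) | 0xabc));
    assert_eq!(pt.translate(0x2000), None); // unmapped: the access faults
}
```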

Command processor:

  • The command processor, also known as the graphics controller, retrieves commands from the command queue for execution. These commands include:
    • Kernel launches
    • Memory copy operations
    • Other related tasks
  • When people refer to GPU firmware, they are typically referring to the firmware that manages the command processor.

Multi-threaded, multi-queue command buffer recording:

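Here is what that looks like from the host side with ash, the Rust bindings this gist is about. A minimal sketch, assuming an already-initialized `ash::Device` (instance/device setup elided, error handling trimmed): because a `VkCommandPool` is externally synchronized, each recording thread gets its own pool and records independently; only queue submission needs cross-thread coordination.

```rust
use ash::vk;

/// Runs on each recording thread; ash device handles can be shared across threads.
unsafe fn record_on_thread(device: &ash::Device, queue_family_index: u32) -> vk::CommandBuffer {
    // One pool per thread: command pools are externally synchronized,
    // so two threads must never record through the same pool at once.
    let pool = device
        .create_command_pool(
            &vk::CommandPoolCreateInfo { queue_family_index, ..Default::default() },
            None,
        )
        .unwrap();

    let cmd = device
        .allocate_command_buffers(&vk::CommandBufferAllocateInfo {
            command_pool: pool,
            level: vk::CommandBufferLevel::PRIMARY,
            command_buffer_count: 1,
            ..Default::default()
        })
        .unwrap()[0];

    // "Commands recorded into a command buffer": nothing reaches the GPU yet;
    // we are just filling a CPU-side buffer the driver will later translate.
    device
        .begin_command_buffer(cmd, &vk::CommandBufferBeginInfo::default())
        .unwrap();
    // (A real recorder would bind a compute pipeline and descriptor sets first.)
    device.cmd_dispatch(cmd, 1, 1, 1);
    device.end_command_buffer(cmd).unwrap();

    cmd // handed back to a single submitting thread, which calls queue_submit
}
```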

What’s in a command buffer?

  • a list of command packets, each in a GPU-specific format (here, a compute command format)
    • a command packet is a sequence of method calls
    • the one below is for Nvidia GPUs based on the Fermi architecture; it is a command packet that performs a vector addition of 128 elements (a 1-dimensional grid)
    • method calls are encoded as a sequence of 32-bit words
    • sub-channels represent one of several slots (usually 8) in a hardware queue. Each slot gets “bound” to a specific task type, like compute or graphics, so commands can be routed efficiently to the right GPU engine without interference.
| Sequence | Subchannel | Method (Hex) | Parameter | Description |
| --- | --- | --- | --- | --- |
| 1 | 0 | 0x0000 (BIND) | Compute object ID | Bind the compute object to a subchannel |
| 2 | 0 | 0x0bd8 (LOCAL_SIZE) | Local memory size in bytes (e.g., 0) | Set local memory size |
| 3 | 0 | 0x0bdc (SHARED_SIZE) | Shared memory size in bytes (0) | Set shared memory size |
| 4 | 0 | 0x0be8 (CBANK_SIZE) | Constant bank size | Set constant bank size |
| 5 | 0 | 0x0be0 (CODE_ADDRESS_HIGH) | High 32 bits of SASS address | Set code address high |
| 6 | 0 | 0x0be4 (CODE_ADDRESS_LOW) | Low 32 bits of SASS address | Set code address low |
| 7 | 0 | 0x0bf0 (CONSTANT_BUFFER_LOAD) | Header for parameters (count=4) | Start loading parameters (using increasing mode) |
| 8 | 0 | Incremental | 64-bit A_ptr (high/low as two words) | Parameter: A_ptr |
| 9 | 0 | Incremental | 64-bit B_ptr (high/low as two words) | Parameter: B_ptr |
| 10 | 0 | Incremental | 64-bit C_ptr (high/low as two words) | Parameter: C_ptr |
| 11 | 0 | Incremental | 32-bit n | Parameter: n |
| 12 | 0 | 0x0bc4 (GRID_DIM_X) | 1 | Set grid dimension X |
| 13 | 0 | 0x0bc8 (GRID_DIM_Y) | 1 | Set grid dimension Y |
| 14 | 0 | 0x0bcc (GRID_DIM_Z) | 1 | Set grid dimension Z |
| 15 | 0 | 0x0bb0 (BLOCK_DIM_X) | 128 | Set block dimension X |
| 16 | 0 | 0x0bb4 (BLOCK_DIM_Y) | 1 | Set block dimension Y |
| 17 | 0 | 0x0bb8 (BLOCK_DIM_Z) | 1 | Set block dimension Z |
| 18 | 0 | 0x0bd0 (REGISTER_COUNT) | Registers per thread (e.g., 8) | Set register count |
| 19 | 0 | 0x0bd4 (BARRIER_ALLOC) | 0 | Set number of barriers |
| 20 | 0 | 0x0bec (CACHE_CONFIG) | Preferred cache mode | Set cache configuration |
| 21 | 0 | 0x0bf4 (STACK_SIZE) | Per-thread stack size (if needed) | Set stack size |
| 22 | 0 | 0x0c00 (DISPATCH) | 0 (non-blocking) | Launch the kernel |
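
For a feel of the encoding, the sketch below packs one of these method calls into its 32-bit header word, using the Fermi-era pushbuffer layout documented by the nouveau/envytools project (mode in bits 31:29, count in bits 28:16, subchannel in bits 15:13, method >> 2 in bits 12:0). Treat the exact bit positions as an assumption for illustration and consult the envytools docs before relying on them.

```rust
/// Pack a pushbuffer header word (Fermi-era layout, per nouveau/envytools).
fn method_header(mode: u32, count: u32, subchannel: u32, method: u32) -> u32 {
    assert!(method % 4 == 0, "methods are word-aligned register offsets");
    (mode << 29) | (count << 16) | (subchannel << 13) | (method >> 2)
}

fn main() {
    const INCREASING: u32 = 1; // auto-increment the method for each data word

    // Row 15 of the table: set BLOCK_DIM_X (0x0bb0) to 128 on subchannel 0.
    let header = method_header(INCREASING, 1, 0, 0x0bb0);
    let packet = [header, 128u32];
    println!("{:08x} {:08x}", packet[0], packet[1]);
}
```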

Asahi/AGX compute work submission

  • A single WorkCommandCP, i.e., one compute command packet describing a compute dispatch and its supporting state/microsequence for the Apple GPU via m1n1’s proxy structures.

What it represents

  • WorkCommandCP is the compute-command variant of the AGX “work command” structures that bundle a compute dispatch’s context, registers, microsequence, timestamps, and encoder/job metadata for submission through the firmware-managed queues, analogous to a single compute command buffer submission in modern UAPIs.
  • Asahi’s UAPI work discusses compute commands as streams of compute dispatches with command-level timestamps; WorkCommandCP in m1n1 is the reverse-engineered development format used to introspect those fields before the upstream Rust DRM driver/UAPI consolidation, so the snippet corresponds to one such compute submission record.

Field mapping table

| # | Field | Size / Type | Description |
| --- | --- | --- | --- |
| 1 | magic | 32-bit, value 0x3 | WorkCommandCP “magic” identifying a compute work command packet |
| 2 | counter | 64-bit counter (V ≥ 13_0B4) | Monotonic sequence/counter used by newer firmware versions for tracking submissions |
| 3 | unk_4 | 32-bit | Unknown field following magic/counter, used internally by firmware (kept for structure layout) |
| 4 | context_id | 32-bit context ID | Identifies the GPU context/VM for the compute job |
| 5 | event_control_addr | 64-bit GPU VA | Address of the event-control block providing completion/fault signaling for the job |
| 6 | event_control | pointer -> EventControl | Resolved structure containing event bits/counters used by firmware signaling |
| 7 | unk_2c | 32-bit | Reserved/unknown slot kept from tracing; firmware expects it in the layout |
| 8 | registers[] | 128 x RegisterDefinition (G ≥ G14X) | Inline register programming block for the compute stage on newer GPU generations |
| 9 | unk_g14x[] | 64 x u32, default 0 (G ≥ G14X) | Extra reserved space for G14X+ parts observed during RE |
| 10 | unk_buf | 0x50 bytes (G < G14X) | Legacy padding/unknown blob for pre-G14X layouts |
| 11 | compute_info | ComputeInfo (G < G14X) | Legacy compute configuration block for older GPUs (grid/block, resources) |
| 12 | registers_addr | 64-bit GPU VA | Pointer to an external register programming list when not inlined |
| 13 | register_count | 16-bit | Number of register entries to apply from registers_addr |
| 14 | registers_length | 16-bit | Byte length of the register programming payload |
| 15 | unk_pad | 0x24 bytes | Reserved pad matching firmware alignment requirements |
| 16 | microsequence_ptr | 64-bit GPU VA | Pointer to the microsequence (firmware-level control stream) for compute |
| 17 | microsequence_size | 32-bit | Size of the microsequence blob referenced above |
| 18 | microsequence | pointer -> MicroSequence | Decoded microsequence entries (dispatch/phase commands) |
| 19 | compute_info2 | ComputeInfo2 | Extended compute configuration (e.g., additional resources/limits) |
| 20 | encoder_params | EncoderParams | Parameters for the firmware encoder interpreting microsequence/register lists |
| 21 | job_meta | JobMeta | Submission/job metadata (priority, queues, dependencies) |
| 22 | ts1 | TimeStamp | Primary timestamp record at command granularity |
| 23 | ts_pointers | TimeStampPointers | Pointers for firmware to write timestamps (GPU/CPU domains) |
| 24 | user_ts_pointers | TimeStampPointers | Additional timestamp pointers intended for userspace reporting |
| 25 | client_sequence | 8-bit | Small client-side sequence/ID used for tracking submissions |
| 26 | unk_ts2 | TimeStamp (V ≥ 13_0B4) | Extra timestamp block used in newer firmware |
| 27 | unk_ts | TimeStamp (V ≥ 13_0B4) | Additional timestamp record (purpose under investigation) |
| 28 | unk_2e1 | 0x1c bytes, default 0 (V ≥ 13_0B4) | Reserved/unknown; firmware expects its presence |
| 29 | unk_flag | Flag (V ≥ 13_0B4) | Boolean/bitflag toggling optional firmware behavior |
| 30 | unk_pad | 0x10 bytes, default 0 (V ≥ 13_0B4) | Additional reserved padding for newer versions |
| 31 | pad_2d9 | 0x7 bytes, default 0 | Tail padding to satisfy alignment/size constraints |

Notes relative to the NVIDIA-style table

  • Apple/AGX compute submission is not programmed via subchannels/method registers in the public UAPI; instead, the firmware consumes structured command blocks (with register programming lists and microsequences) placed in GPU-accessible memory, so the Subchannel and Method columns from the NVIDIA-style table have no direct equivalent here.
  • The “microsequence” plus register lists together stand in for the “methods” used on NVIDIA Fermi, while ComputeInfo/ComputeInfo2 capture dispatch geometry/resources akin to the grid/block and LDS/stack configuration in the NVIDIA example.

Bottom line

  • The WorkCommandCP structure above corresponds to a single compute command submission containing the state, control stream, and bookkeeping needed by the firmware to dispatch one compute workload, matching Asahi’s description that a compute command encapsulates one or more dispatches with command-level timestamps; in other words, it represents one command packet.
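
For readers who think in structs, here is a hedged Rust transcription of the leading fields from the table. Names and widths follow the table, while exact offsets, padding, and the version gating (V ≥ 13_0B4, G ≥ G14X) are elided, so treat it as a reading aid rather than a substitute for m1n1’s Python definitions or the Asahi DRM driver’s types.

```rust
// Sketch of the head of a WorkCommandCP record (fields per the table above).
#[repr(C)]
struct WorkCommandCpHead {
    magic: u32,              // 0x3 identifies a compute work command
    counter: u64,            // monotonic submission counter (newer firmware only)
    unk_4: u32,              // unknown, kept for layout
    context_id: u32,         // GPU context/VM for this compute job
    event_control_addr: u64, // GPU VA of the completion/fault signaling block
    unk_2c: u32,             // reserved slot observed in traces
    // ... register lists, ComputeInfo, microsequence pointer/size, encoder
    // params, job metadata, timestamps, and version-gated padding follow,
    // per the table above.
}
```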

Let’s talk shaders:

  • Let’s do a recap (host vs. device programming)
  • Generic shader programming workflow
    • In Vulkan, when you call vkCreateGraphicsPipelines or vkCreateComputePipelines, that’s when the Vulkan driver translates SPIR-V into the GPU’s native instruction set (ISA).
    • Vulkan forces applications to compile pipelines ahead of time. This makes performance predictable since there’s no unexpected shader compilation during rendering.
    • But we can also avoid this runtime compilation (see the ash sketch after this list):
      • Pipeline Cache (VkPipelineCache): stores compiled shaders so they don’t have to be recompiled next time.
      • Vendor tools (NVIDIA, AMD, etc.): some vendors let you precompile SPIR-V to a GPU-specific binary offline, so the driver doesn’t need to compile at runtime.
  • SPIR-V as the interchange format for the future.
  • Shader dev tools and transpilers.
  • What do shading languages look like, and what does SPIR-V look like?
  • Let’s look at some shader code (to get a feel for shader programming)
  • Notes on shader programming without host code:
    • Interactive shader coding
      • Shader playgrounds (Shadertoy, GLSL Sandbox)
      • Game engines (Unity’s Shader Graph, Unreal’s HLSL-based material system)
    • Compute-focused shader-like programming (via MLIR)
      • Mojo – Pythonic GPU programming, no explicit host API management
      • Triton – custom GPU kernels for ML workloads, JIT-compiled via LLVM
      • PyTorch / TensorFlow – high-level tensor computation mapped to GPU backends
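
A minimal ash sketch of the VkPipelineCache route, assuming an already-created `ash::Device`, a compute `vk::ShaderModule`, and a `vk::PipelineLayout` (all elided), plus a hypothetical cache file `pipeline_cache.bin`: seed the cache with last run’s blob, build the pipeline (this is where SPIR-V is translated to the native ISA, or skipped on a cache hit), then persist the cache for next time.

```rust
use ash::vk;

unsafe fn build_with_cache(
    device: &ash::Device,
    shader: vk::ShaderModule,
    layout: vk::PipelineLayout,
    prev_blob: &[u8], // empty on the first run
) -> std::io::Result<()> {
    // Seed the cache with last run's blob (drivers validate and reject stale data).
    let cache = device
        .create_pipeline_cache(
            &vk::PipelineCacheCreateInfo {
                initial_data_size: prev_blob.len(),
                p_initial_data: prev_blob.as_ptr().cast(),
                ..Default::default()
            },
            None,
        )
        .unwrap();

    let entry = std::ffi::CString::new("main").unwrap();
    let stage = vk::PipelineShaderStageCreateInfo {
        stage: vk::ShaderStageFlags::COMPUTE,
        module: shader,
        p_name: entry.as_ptr(),
        ..Default::default()
    };

    // SPIR-V to native-ISA translation happens here (or is skipped on a hit).
    let _pipelines = device
        .create_compute_pipelines(
            cache,
            &[vk::ComputePipelineCreateInfo { stage, layout, ..Default::default() }],
            None,
        )
        .unwrap();

    // Persist the (possibly grown) cache blob for the next run.
    let blob = device.get_pipeline_cache_data(cache).unwrap();
    std::fs::write("pipeline_cache.bin", blob)
}
```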

CUDA/PTX vs. SPIR-V:

CUDA PTX is a great contrast point, because it sits lower than CUDA C++ but higher than raw SASS (NVIDIA machine code). Unlike SPIR-V, PTX really does expose a more “CPU-like” model, even though the hardware underneath is SIMT. Let’s map the usual SPIR-V restrictions onto PTX:

🔹 Unrestricted pointers

  • Yes.
  • In PTX, pointers are just 64-bit integers (.u64) under the hood.
  • You can do pointer arithmetic, cast them to/from integers, and manipulate them freely.
  • Of course, the memory space still matters (global, shared, local, const), but once you’ve got an address, it’s just an integer.

🔹 Generic pointers

  • Sort of.
  • PTX pointers do carry address space tags (like global, shared, local), but since they’re integers you can bit-twiddle and reinterpret.
  • There isn’t a built-in “generic pointer type” like OpenCL C’s generic, but you can roll your own by packing an address + space tag in a 64-bit value and branching at load/store time.
  • NVIDIA’s compiler often optimizes around this.

🔹 True function calls (with recursion)

  • Yes.
  • PTX supports real call and ret instructions.
  • A per-thread call stack exists in GPU local memory, so recursion is possible.
  • That said: it’s slow and discouraged for performance reasons, but nothing in PTX forbids it.
  • CUDA C++ recursion is officially supported since compute capability 2.0 (Fermi era).

🔹 Function pointers

  • Yes.
  • PTX supports indirect calls (call.uni with a register holding the function address).
  • CUDA C++ lets you use function pointers on device code, though with restrictions on inlining and performance.
  • This means you can actually write functional-style GPU code in CUDA, unlike SPIR-V.

🔹 Arbitrary gotos

  • Yes.
  • PTX has bra (branch) and @pred bra (conditional branch) instructions that can target arbitrary labels.
  • Control flow does not need to be structured — you can write totally spaghetti PTX if you want.
  • The compiler might restructure things for performance, but the IR itself allows unrestricted jumps.

Summary table:

| Feature | SPIR-V | CUDA/PTX |
| --- | --- | --- |
| Unrestricted pointers | ❌ strongly typed, no arithmetic | ✔ true integers, arithmetic allowed |
| Generic pointers | ❌ storage class encoded in type | (✔) possible with tricks, not native |
| True function calls | ✔ (but recursion forbidden) | ✔ recursion supported |
| Function pointers | ❌ not supported | ✔ supported |
| Arbitrary gotos | ❌ structured only | ✔ unrestricted branches |

👉 So, PTX is much closer to “GPU assembly with a CPU-like flavor,” whereas SPIR-V is deliberately restricted and structured to make GPU driver backends simpler and more predictable.

If you want “full C++ semantics on GPU”, CUDA PTX is the closest you’ll get today. SPIR-V is more of a constrained IR to glue multiple shading languages into Vulkan.
