- [What is Vulkan? :: Vulkan Documentation Project](https://docs.vulkan.org/guide/latest/what_is_vulkan.html)
- [Vulkan Architecture](https://chromium.googlesource.com/external/github.com/KhronosGroup/Vulkan-Loader/+/HEAD/loader/LoaderAndLayerInterface.md)
- Instance creation issue: ERROR_INCOMPATIBLE_DRIVER
Most introductions to modern GPU architecture jump straight into speeds and feeds or the concept of Latency vs. Throughput:
- CPUs have fewer, larger, and faster cores, prioritizing latency (i.e., they are optimized for general-purpose computing).
- GPUs have many smaller, slower cores, prioritizing throughput (i.e., they are designed for data-parallel computations).
- Unlike CPUs, GPUs have fewer and smaller memory cache layers. This is because GPUs dedicate more transistors to computation and rely on massive parallelism to hide memory latency, rather than on large caches to avoid it.
A block diagram of Nvidia’s Hopper H100:
Zooming into a single streaming multi-processor:

GPUs operate using a command-response model. In this model, you send commands to the GPU, and once the commands are processed, the GPU notifies your application when it is ready to accept more work.
- When you begin exploring the Vulkan documentation, you’ll encounter terms like:
- Commands recorded into a command buffer
- Commands added to a command queue
- etc.
- The user-mode driver:
- Translates Vulkan API calls into GPU-specific commands.
- Manages Vulkan objects like pipelines, shaders, and command buffers.
- Handles API-level validation and error checking.
- Interacts with the kernel-mode driver for:
- Memory allocation requests.
- Queue submissions.
- Synchronization mechanisms.
- The kernel-mode driver:
- Manages low-level hardware interaction with the GPU.
- Handles memory management, including VRAM and shared memory.
- Processes DMA (Direct Memory Access) transfers.
- Submits command buffers to the GPU (e.g., via ring buffers or execution queues).
- Ensures GPU security and stability across processes.
- Handles interrupts and other hardware events.
Key Difference:
- User-Mode Driver: Focused on translating high-level API calls into hardware-specific instructions.
- Kernel-Mode Driver: Handles low-level hardware management and execution of commands on the GPU.
Linux: GPU devices typically implement `open` and `close` for opening/closing the device, `mmap` for sharing data between the application (which uses the user-mode driver as its deputy) and the kernel-mode driver, and `ioctl` for various controls.
The figure below shows how the host submits work to the GPU:
- The host communicates with the command processor (CP) of the GPU via a virtual memory region (i.e. a ring buffer) which is memory mapped to the GPU, accessible by the command processor.
- This enables communication between the CPU and GPU through entries in the command queue.
- The CPU transmits kernel launch packets to the GPU by writing them to the user mode command queue.
- The CP is responsible for decoding and dispatching the kernels in these command queues for execution.
- The CP accesses the command queue and schedules the kernels at the head for execution. This ensures that the kernels are dispatched for launch from these queues in order.
- A ring buffer has a read pointer, used by the GPU to fetch commands for execution, and a write pointer, used by the kernel mode driver to queue workloads.
- Application-generated command buffers are initially stored in system memory allocated by the user-mode driver. The kernel-mode driver doesn’t copy these large buffers into the ring buffer; instead, it inserts *indirect calls* referencing them. This allows multiple applications to prepare workloads concurrently without excessive locking.
- In summary, a ring buffer contains GPU setup/teardown commands, context-switching commands, and indirect calls to application-specific command buffers, along with other commands (e.g., for performance counters).
- During device initialization in a modern GPU driver stack, the kernel mode driver is responsible for setting up the IOMMU (Input-Output Memory Management Unit) and configuring the GPU page tables.
- Device Initialization:
- The kernel mode driver allocates memory regions for the GPU.
- Configures the IOMMU to map GPU virtual addresses to the correct physical addresses.
- Sets up and populates the GPU’s page tables with mappings for resources like textures, buffers, and shaders.
- In the GPU’s case, the IOMMU helps prevent Direct Memory Access (DMA) attacks by limiting which memory the GPU can access.
Note:
GPU page tables are physically located in system memory and managed in the kernel address space, ensuring both security and proper low-level memory management for the GPU. They map virtual GPU addresses (used by applications) to physical memory addresses, either in system memory (RAM) or video memory (VRAM) on the GPU. The kernel-mode driver is responsible for setting up, maintaining, and updating these page tables. When an application submits a workload, the user-mode driver requests memory allocations, and the kernel-mode driver ensures the GPU page tables are updated accordingly.
### Command processor:
- The command processor, also known as the graphics controller, retrieves commands from the command queue for execution. These commands include:
- Kernel launches
- Memory copy operations
- Other related tasks
- When people refer to GPU firmware, they are typically referring to the firmware that manages the command processor.
- A command queue contains a list of command packets, each of which has a defined format (i.e., the compute command format below).
- A command packet is a sequence of method calls.
- The example below is for NVIDIA GPUs based on the Fermi architecture: a command packet that performs a vector addition of 128 elements (a 1-dimensional grid).
- Method calls are encoded as sequences of 32-bit words.
- Sub-channels represent one of several slots (usually 8) in a hardware queue. Each slot gets “bound” to a specific task type, like compute or graphics, so commands can be routed efficiently to the right GPU engine without interference.
| Sequence | Subchannel | Method (Hex) | Parameter | Description |
|---|---|---|---|---|
| 1 | 0 | 0x0000 (BIND) | Compute object ID | Bind the Compute Object to a Subchannel |
| 2 | 0 | 0x0bd8 (LOCAL_SIZE) | Local memory size in bytes (e.g., 0) | Set local memory size |
| 3 | 0 | 0x0bdc (SHARED_SIZE) | Shared memory size in bytes (0) | Set shared memory size |
| 4 | 0 | 0x0be8 (CBANK_SIZE) | Constant bank size | Set constant bank size |
| 5 | 0 | 0x0be0 (CODE_ADDRESS_HIGH) | High 32 bits of SASS address | Set code address high |
| 6 | 0 | 0x0be4 (CODE_ADDRESS_LOW) | Low 32 bits of SASS address | Set code address low |
| 7 | 0 | 0x0bf0 (CONSTANT_BUFFER_LOAD) | Header for parameters (count=4) | Start loading parameters (using incrementing mode) |
| 8 | 0 | Incremental | 64-bit A_ptr (high/low as two words) | Parameter: A_ptr |
| 9 | 0 | Incremental | 64-bit B_ptr (high/low as two words) | Parameter: B_ptr |
| 10 | 0 | Incremental | 64-bit C_ptr (high/low as two words) | Parameter: C_ptr |
| 11 | 0 | Incremental | 32-bit n | Parameter: n |
| 12 | 0 | 0x0bc4 (GRID_DIM_X) | 1 | Set grid dimension X |
| 13 | 0 | 0x0bc8 (GRID_DIM_Y) | 1 | Set grid dimension Y |
| 14 | 0 | 0x0bcc (GRID_DIM_Z) | 1 | Set grid dimension Z |
| 15 | 0 | 0x0bb0 (BLOCK_DIM_X) | 128 | Set block dimension X |
| 16 | 0 | 0x0bb4 (BLOCK_DIM_Y) | 1 | Set block dimension Y |
| 17 | 0 | 0x0bb8 (BLOCK_DIM_Z) | 1 | Set block dimension Z |
| 18 | 0 | 0x0bd0 (REGISTER_COUNT) | Registers per thread (e.g., 8) | Set register count |
| 19 | 0 | 0x0bd4 (BARRIER_ALLOC) | 0 | Set number of barriers |
| 20 | 0 | 0x0bec (CACHE_CONFIG) | Preferred cache mode | Set cache configuration |
| 21 | 0 | 0x0bf4 (STACK_SIZE) | Per-thread stack size (if needed) | Set stack size |
| 22 | 0 | 0x0c00 (DISPATCH) | 0 (non-blocking) | Launch the kernel |
- A single `WorkCommandCP`, i.e., one compute command packet describing a compute dispatch and its supporting state/microsequence for the Apple GPU, via m1n1’s proxy structures.
- WorkCommandCP is the compute-command variant of the AGX “work command” structures that bundle a compute dispatch’s context, registers, microsequence, timestamps, and encoder/job metadata for submission through the firmware-managed queues, analogous to a single compute command buffer submission in modern UAPIs.
- Asahi’s UAPI work describes compute commands as streams of compute dispatches with command-level timestamps; `WorkCommandCP` in m1n1 is the reverse-engineered development format used to introspect those fields before the upstream Rust DRM driver/UAPI consolidation, so the snippet corresponds to one such compute submission record.
| # | Field | Size/Type | Description |
|---|---|---|---|
| 1 | magic | 32-bit, = 0x00000003 | `WorkCommandCP` “magic” identifying a compute work command packet |
| 2 | counter | 64-bit counter (V ≥ 13_0B4) | Monotonic sequence/counter used by newer firmware versions for tracking submissions |
| 3 | unk_4 | 32-bit | Unknown field following magic/counter used internally by firmware (kept for structure layout) |
| 4 | context_id | 32-bit context ID | Identifies the GPU context/VM for the compute job |
| 5 | event_control_addr | 64-bit GPU VA | Address of event-control block providing completion/fault signaling for the job |
| 6 | event_control | pointer -> EventControl | Resolved structure containing event bits/counters used by firmware signaling |
| 7 | unk_2c | 32-bit | Reserved/unknown slot kept from tracing; firmware expects it in layout |
| 8 | registers[] | 128 RegisterDefinition (G ≥ G14X) | Inline register programming block for compute stage on newer GPU gens |
| 9 | unk_g14x[] | 64 x u32 default 0 (G ≥ G14X) | Extra reserved space for G14X+ parts observed during RE |
| 10 | unk_buf | 0x50 bytes (G < G14X) | Legacy padding/unknown blob for pre-G14X layouts |
| 11 | compute_info | ComputeInfo (G < G14X) | Legacy compute configuration block for older GPUs (grid/block, resources) |
| 12 | registers_addr | 64-bit GPU VA | Pointer to an external register programming list when not inlined |
| 13 | register_count | 16-bit | Number of register entries to apply from registers_addr |
| 14 | registers_length | 16-bit | Byte length of the register programming payload |
| 15 | unk_pad | 0x24 bytes | Reserved pad matching firmware alignment requirements |
| 16 | microsequence_ptr | 64-bit GPU VA | Pointer to the microsequence (firmware-level control stream) for compute |
| 17 | microsequence_size | 32-bit | Size of the microsequence blob referenced above |
| 18 | microsequence | pointer -> MicroSequence | Decoded microsequence entries (dispatch/phase commands) |
| 19 | compute_info2 | ComputeInfo2 | Extended compute configuration (e.g., additional resources/limits) |
| 20 | encoder_params | EncoderParams | Parameters for firmware encoder interpreting microsequence/register lists |
| 21 | job_meta | JobMeta | Submission/job metadata (priority, queues, dependencies) |
| 22 | ts1 | TimeStamp | Primary timestamp record at command granularity |
| 23 | ts_pointers | TimeStampPointers | Pointers for firmware to write timestamps (GPU/CPU domains) |
| 24 | user_ts_pointers | TimeStampPointers | Additional timestamp pointers intended for userspace reporting |
| 25 | client_sequence | 8-bit | Small client-side sequence/id used for tracking submissions |
| 26 | unk_ts2 | TimeStamp (V ≥ 13_0B4) | Extra timestamp block used in newer firmware |
| 27 | unk_ts | TimeStamp (V ≥ 13_0B4) | Additional timestamp record (purpose under investigation) |
| 28 | unk_2e1 | 0x1c bytes default 0 (V ≥ 13_0B4) | Reserved/unknown; firmware expects presence |
| 29 | unk_flag | Flag (V ≥ 13_0B4) | Boolean/bitflag toggling optional firmware behavior |
| 30 | unk_pad | 0x10 bytes default 0 (V ≥ 13_0B4) | Additional reserved padding for newer versions |
| 31 | pad_2d9 | 0x7 bytes default 0 | Tail padding to satisfy alignment/size constraints |
- Apple/AGX compute submission is not programmed via subchannels/method registers in the public UAPI; instead, the firmware consumes structured command blocks (with register programming lists and microsequences) placed in GPU-accessible memory, so the Subchannel and Method columns are not applicable here.
- The “microsequence” plus register lists together stand in for the “methods” used on NVIDIA Fermi, while ComputeInfo/ComputeInfo2 capture dispatch geometry/resources akin to grid/block and LDS/stack config in the NVIDIA example.
- The `WorkCommandCP` instance above corresponds to a single compute command submission containing the state, control stream, and bookkeeping needed by the firmware to dispatch one compute workload, matching Asahi’s description that a compute command encapsulates one or more dispatches with command-level timestamps.
- Let’s do a recap ([[/Well, use the CPU to program the GPU (it’s a coprocessor):|Host]] vs. Device programming)
- Generic shader programming workflow
- In Vulkan, when you call `vkCreateGraphicsPipelines` or `vkCreateComputePipelines`, that’s when the Vulkan driver translates SPIR-V into the GPU’s native instruction set (ISA).
- Vulkan forces applications to compile pipelines ahead of time. This makes performance predictable since there’s no unexpected shader compilation during rendering.
- But we can also avoid this runtime compilation:
- Pipeline cache (`VkPipelineCache`): stores compiled pipelines so they don’t have to be recompiled next time.
- Vendor tools (NVIDIA, AMD, etc.): some vendors let you precompile SPIR-V to a GPU-specific binary offline, so the driver doesn’t need to compile at runtime.
- SPIR-V as the interchange format for the future.
- Shader dev tools and transpilers.
- What do shading languages look like, and what does SPIR-V look like?
- Let’s look at some shader code (to get a feel for shader programming).
- Notes on shader programming without host code:
- Interactive Shader Coding
- Shader playgrounds (Shadertoy, GLSL Sandbox)
- Game engines (Unity’s Shader Graph, Unreal’s HLSL-based material system)
- Compute-focused Shader-Like Programming (via MLIR)
- Mojo – Pythonic GPU programming, [[/High-Level Overview of GPU Driver Architecture:|no explicit host API management]]
- Triton – [[Pliron/Intuition:|Custom GPU kernels for ML workloads]], JIT-compiled via LLVM
- PyTorch / TensorFlow – High-level tensor computation mapped to GPU backends
# CUDA/PTX vs. SPIR-V
CUDA PTX is a great contrast point because it sits lower than CUDA C++ but higher than raw SASS (NVIDIA machine code). Unlike SPIR-V, PTX really does expose a more “CPU-like” model, even though the hardware underneath is SIMT. Let’s walk through the same feature list again:
⸻
- Unrestricted pointers: Yes.
- In PTX, pointers are just 64-bit integers (.u64) under the hood.
- You can do pointer arithmetic, cast them to/from integers, and manipulate them freely.
- Of course, the memory space still matters (global, shared, local, const), but once you’ve got an address, it’s just an integer.
⸻
- Generic pointers: Sort of.
- PTX pointers do carry address space tags (like global, shared, local), but since they’re integers you can bit-twiddle and reinterpret.
- There isn’t a built-in “generic pointer type” like OpenCL C’s generic, but you can roll your own by packing an address + space tag in a 64-bit value and branching at load/store time.
- NVIDIA’s compiler often optimizes around this.
⸻
- True function calls: Yes.
- PTX supports real call and ret instructions.
- A per-thread call stack exists in GPU local memory, so recursion is possible.
- That said: it’s slow and discouraged for performance reasons, but nothing in PTX forbids it.
- CUDA C++ recursion is officially supported since compute capability 2.0 (Fermi era).
⸻
- Function pointers: Yes.
- PTX supports indirect calls (call.uni with a register holding the function address).
- CUDA C++ lets you use function pointers on device code, though with restrictions on inlining and performance.
- This means you can actually write functional-style GPU code in CUDA, unlike SPIR-V.
⸻
- Arbitrary gotos: Yes.
- PTX has bra (branch) and @pred bra (conditional branch) instructions that can target arbitrary labels.
- Control flow does not need to be structured — you can write totally spaghetti PTX if you want.
- The compiler might restructure things for performance, but the IR itself allows unrestricted jumps.
⸻
✅ Summary table:
| Feature | SPIR-V | CUDA/PTX |
|---|---|---|
| Unrestricted pointers | ❌ strongly typed, no arithmetic | ✔ true integers, arithmetic allowed |
| Generic pointers | ❌ storage class encoded in type | (✔) possible with tricks, not native |
| True function calls | ✔ (but recursion forbidden) | ✔ recursion supported |
| Function pointers | ❌ | ✔ supported |
| Arbitrary gotos | ❌ structured only | ✔ unrestricted branches |
⸻
👉 So, PTX is much closer to “GPU assembly with a CPU-like flavor,” whereas SPIR-V is deliberately restricted and structured to make GPU driver backends simpler and more predictable.
If you want “full C++ semantics on GPU”, CUDA PTX is the closest you’ll get today. SPIR-V is more of a constrained IR to glue multiple shading languages into Vulkan.