fxkamd/TinyGrad-notes.md

Last active April 24, 2025 04:56

Star (14) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/fxkamd/ffd02d66a2863e444ec208ea4f3adc48.js"></script>
Save fxkamd/ffd02d66a2863e444ec208ea4f3adc48 to your computer and use it in GitHub Desktop.

Observations about HSA and KFD backends in TinyGrad

Raw

This is Felix Kuehling, long time KFD driver architect. I started looking into the TinyGrad source code yesterday, focusing on ops_kfd.py, ops_hsa.py and driver/hsa.py, to understand how TinyGrad talks to our HW and help with the ongoing debugging effort from the top down. This analysis is based on this commit: https://github.com/tinygrad/tinygrad/tree/3de855ea50d72238deac14fc05cda2a611497778

I'm intrigued by the use of Python for low-level programming. I think I can learn something from your use of ctypes and clang2py for fast prototyping and test development. I want to share some observations based on my initial review.

ops_kfd looks pretty new, and I see many problems with it based on my long experience working on KFD. I think it's interesting, but probably not relevant for the most pressing problems at hand, so I'll cover that last.

ops_hsa uses ROCr APIs to manage GPU memory, create a user mode AQL queue for GPU kernel dispatch, async SDMA copies, and signal-based synchronization with barrier packets between the two. There is also some host-side synchronization used for lazy cleanup of reusable signals and freeing memory. I only see one potential problem so far:

AQLQueue.blit_packets writes multiple packets, header first. This is problematic because the AQL packet processor can start reading packets with a valid header even before you ring the doorbell and update the write-index and doorbell. I only see this used in HSAGraph, and I don't understand the rest of TinyGrad well enough yet to know, whether this can happen in a typical ResNet run
Even in submit_kernel and submit_barrier, you may need a memory barrier before writing the header, to make sure the writes complete in the right order in the CPU. I don't know if python does that implicitly, e.g. because of overheads in the interpreter

Now my notes on ops_kfd. There is a good chance I missed something and I pick up something new every time I look a the code, so please take these with a grain of salt:

In HWComputeQueue.submit AQL packet headers must be written after the packet contents. You may also need a memory barrier to ensure the writes complete in the rigth order in the CPU. The AQL packet processor can start working on packets as soon as it sees a valid header, even before you ring the doorbell
Sharing device.completion_signal: This can cause race conditions when overwriting or waiting for a signal value before the previous dispatch has completed. Before reusing a signal, you need to wait for it. KFDAllocator.copyout waits for the signal, but then reuses it for multiple SDMA commands in the loop. The wait in the end may get triggered by something that's not the last SDMA command. To avoid this, I'd only signal after the last SDMA command. In copyin I don't see any waiting at all before using the signal
AQLAllocator.transfer seems to use the destination device for the data copy. I would expect writing to be faster than reading (easier to hide latency), so using the source device may perform better
Is there some code I'm missing to map either the source or destination on the other GPU for AQLAllocator.transfer?
Operations on wptr and doorbells may not be atomic: This could cause race conditions if the HW sees half-complete values. I don't know ctypes very well, so I don't know what atomicity guarantees it makes
No virtual address alignments to optimize for huge pages: This will lead to bad TLB efficiency, more page table allocations, slower memory allocation and reduced access performance
No suballocator for small VRAM allocations: Similar to above, if you have many small allocations, it will lead to more memory management overhead and reduced access performance
Highest queue priority, I don't think this gains anything if all queues end up with the same priority but may risk other issues by starving kernel queues (if you ever need interop, mostly for video processing)
Mapping only one doorbell page per GPU: Each process has two doorbell pages per GPU. You should map both. Otherwise you may have problems if you're using more SDMA queues later that end up using some of the doorbells in the second page due to how doorbells get routed in the HW
Queue overruns are only detected after corrupting the queues
No fallback to shader-based copies when SDMA queues run out: There are a limited number of SDMA queues in the HW and we don't oversubscribe them at the moment because low latency is one of the big advantages of using SDMA over shader-based copies. When they run out, SDMA queue creation will fail. ROCr has a fallback to use shader-based copies for this. As long as you run a small number of processes concurrently and use a small number of SDMA queues per device, this is no problem
Using same BO for compute and SDMA read/write pointers
- Not a problem now, but be aware that the SDMA engine writes some queue usage information and internal scratch data after the RPTR
Circumventing ROCr breaks rocm-gdb. You won't be able to use it for debugging compute kernels

geohot commented Apr 5, 2024 •

edited

Loading

Thanks for the notes!

Yea, KFD is a lot newer and nowhere near ready, but is the direction we want to go in. We are working on the same thing for NVIDIA too tinygrad/tinygrad#4044

Ahh, interesting re: as soon as it sees a valid header, I didn't expect that. Why doesn't it wait for the doorbell? I will add a test for this behavior and make sure I write the header last + add CPU memory barrier. For the SDMA queue, do I have to write the header last too?

The signal behavior is bad re: completion_signal. I will refactor it to dynamically allocate signals. Though you do have to wait for it in the loop, otherwise the CPU can't know when it's safe to overwrite the buffer.

The RAM is so much faster than the PCI-E, so I don't think the device choice of src or dest matters. The test (TestHCQ.test_cross_device_copy_bandwidth) shows it's getting ~28 GB/s.

The first line of def transfer is dest_dev._gpu_map(src) mapping the src into the dest device.

Fair re: atomic write of doorbell, I'll look into this and confirm. I'm not sure ctypes is really designed for this, but I'd imagine 64-bit writes are atomic.

"virtual address alignments" I'm not sure what this is. Is there documentation on this and how it impacts performance?

Aware of a lack of suballocator. This will be handled higher up in tinygrad after a few more refactors to support offsetted buffers.

We only have one of each type of queue, so I doubt priority matters. Can lower it but I think it's a wash.

Are we not mapping both doorbell pages? self.doorbells = libc.mmap(0, 8192, should be two pages, right?

Yea, queue overrun behavior is bad, but I'm fine with just crashing the process if it happens for now. I just don't want it to be silent.

We only have 1 SDMA queue per device, we don't create new queues for transfer. Are these lower level queues somewhere you are talking about?

Will switch to different BO for the two queues.

I've never used rocm-gdb though I'm curious to explore how the debugging stuff works. We'll probably add back a very simple HIP backend for things like this when you don't need speed.

geohot commented Apr 5, 2024 •

edited

Loading

Two questions I've had while working on this

What's the difference between GTT and USERPTR on allocation? I copied what the ROCr driver uses for the different allocations, but what's the actual difference?

Is there a lower level queue than AQL queue? I see a KFD_IOC_QUEUE_TYPE_COMPUTE as well as KFD_IOC_QUEUE_TYPE_COMPUTE_AQL Can this be used to dispatch kernels and get lower level control, maybe with PM4 packets?

Author

fxkamd commented Apr 5, 2024

Thanks for the notes!

Yea, KFD is a lot newer and nowhere near ready, but is the direction we want to go in. We are working on the same thing for NVIDIA too tinygrad/tinygrad#4044

Ahh, interesting re: as soon as it sees a valid header, I didn't expect that. Why doesn't it wait for the doorbell? I will add a test for this behavior and make sure I write the header last + add CPU memory barrier. For the SDMA queue, do I have to write the header last too?

This applies to AQL queues only. The reason is, that the AQL queue ABI supports multiple producers. They can concurrently allocate space on the queue with atomic operations on the write index, then write their packets and finally ring the doorbell. Ringing the doorbell from one producer thread doesn't mean that all other producers have filled in their packets yet, so the doorbell is just a wakeup-call and the packet processor checks the packet headers to know when the packets are ready. The other way around, the queue will continue to process all valid packets and not wait for another doorbell to make sure it catches up with older work that was written into later queue slots.

The signal behavior is bad re: completion_signal. I will refactor it to dynamically allocate signals. Though you do have to wait for it in the loop, otherwise the CPU can't know when it's safe to overwrite the buffer.

Makes sense.

The RAM is so much faster than the PCI-E, so I don't think the device choice of src or dest matters. The test (TestHCQ.test_cross_device_copy_bandwidth) shows it's getting ~28 GB/s.

That's good to hear. The problem is PCIe latency. As long as the SDMA engines can track enough outstanding reads to hide that latency, you can sustain full bandwidth. It's easier with writes, because they can be posted and SDMA doesn't need to wait for their completion.

The first line of def transfer is dest_dev._gpu_map(src) mapping the src into the dest device.

Ah, I missed that.

Fair re: atomic write of doorbell, I'll look into this and confirm. I'm not sure ctypes is really designed for this, but I'd imagine 64-bit writes are atomic.

"virtual address alignments" I'm not sure what this is. Is there documentation on this and how it impacts performance?

Both the CPU and GPU use 4-level page tables (5 levels on the latest Intel and AMD CPUs). I'm using our GPU terminology here, but the structure is similar on x86 CPUs with different names: Each page table block (PTB) in the lowest level of the page table has 512 entries (PTEs) that each point to a 4KB page. The PTB represents 2MB of address space aligned at a 2MB boundary. The next level up is a page directory block (PDB) that has 512 entries (PDEs) pointing to a PTB each. It represents 1GB of address space. Instead of pointing to PTBs, it can point to a contiguous 2MB block. This saves memory in the page table, makes memory mapping faster. And it makes the TLB cache more efficient because it can represent much more address space with the same number of entries.

Our VRAM allocator in the kernel mode driver is optimized for 2MB or larger allocations and minimizes fragmentation into smaller pieces. But to map a contiguous 2MB block with a single PTE, it needs to be aligned on a 2MB boundary in virtual address space.

Similarly, the Linux memory manager supports "transparent huge pages" where it prefers 2MB pages for large allocations.

The performance impact depends on your workload. If you have mostly linear accesses or a small working data set, TLB efficiency is less important. But large working sets with random access patterns benefit from better TLB efficiency. We call this TLB reach. Our GPUs' TLBs are optimized to reach all local memory with 2MB pages. If you used 4KB page for everything, that would drop by a factor 512 in the worst case.

Aware of a lack of suballocator. This will be handled higher up in tinygrad after a few more refactors to support offsetted buffers.

We only have one of each type of queue, so I doubt priority matters. Can lower it but I think it's a wash.

Are we not mapping both doorbell pages? self.doorbells = libc.mmap(0, 8192, should be two pages, right?

You're right. Maybe I was thrown off by the address alignment to the nearest page in the line just above. I think that should be:

self.doorbells_base = self.aql_queue.doorbell_offset & (~0x1fff)

Yea, queue overrun behavior is bad, but I'm fine with just crashing the process if it happens for now. I just don't want it to be silent.

We only have 1 SDMA queue per device, we don't create new queues for transfer. Are these lower level queues somewhere you are talking about?

You can run out either by creating more queues (I see the HSA backend uses 2 SDMA engines), or by running multiple processes. If you don't do either of those things, you'll be fine without a fallback.

Will switch to different BO for the two queues.

I've never used rocm-gdb though I'm curious to explore how the debugging stuff works. We'll probably add back a very simple HIP backend for things like this when you don't need speed.

I haven't used it myself because I haven't written GPU kernels. But I've worked closely with the tools and ROCr architects on making it work. rocm-gdb is integrated with gdb (hopefully in upstream gdb at some point--it depends on some updates to DWARF). It lets you debug GPU kernels similar to how you'd debug a multi-threaded application on the CPU. Among other things you can enumerate running wave fronts, inspect registers, memory, single step programs, set breakpoints and stop when a GPU kernel throws a segfault.

But this depends on a trap handler (like an interrupt handler running in the GPU compute unit) to handle a bunch of exceptions and debug traps and send information back to the runtime or the debugger. The trap handler is loaded by ROCr. There is a handshake between debugger and ROCr to allow rocm-gdb to attach to the program before or after ROCr is initialized. rocm-gdb depends on a new-enough ROCr runtime to do all that.

geohot commented Apr 5, 2024 •

edited

Loading

Addressing issues here: tinygrad/tinygrad#4087

What happens when the AQL queue wraps around? Won't it find valid headers? Does it 0 the headers after it runs the commands?

I'm also not sure about needing to write the header last, agreed on the doorbell not being the gate if it's already running, but won't it not run past device.amd_aql_queue.write_dispatch_id?

nimlgen commented Apr 5, 2024 •

edited

Loading

AQLQueue.blit_packets writes multiple packets, header first. This is problematic because the AQL packet processor can start reading packets with a valid header even before you ring the doorbell and update the write-index and doorbell. I only see this used in HSAGraph, and I don't understand the rest of TinyGrad well enough yet to know, whether this can happen in a typical ResNet run

So, does the CP read packets that are after the write pointer? We memcpy packets and only then move write_pointer + doorbell.

Even in submit_kernel and submit_barrier, you may need a memory barrier before writing the header, to make sure the writes complete in the right order in the CPU. I don't know if python does that implicitly, e.g. because of overheads in the interpreter

We test on x86 which is TSO (so I expect we should not see any writes reordering visible to the memory subsystem). And I hope python doesn't do any reorderings.

Author

fxkamd commented Apr 5, 2024 •

edited

Loading

The AQL packet processor in the firmware invalidates the headers before it updates the read index. BTW, the encoding for invalid packets is format=1. format=0 is used for vendor packets. So just a 0-initialized queue buffer is full of valid vendor packets as far as the firmware is concerned. Yikes. It shouldn't read ahead of the write-index, I hope. I'm just reading up on this again the the HSA spec. See section 2.9 in HSA Platform System Architecture Specification.

There is also section 2.5 in HSA Runtime Programmer’s Reference Manual. It talks more about single/multi-producer queues and the submission ABI.

Author

fxkamd commented Apr 5, 2024

Two questions I've had while working on this

What's the difference between GTT and USERPTR on allocation? I copied what the ROCr driver uses for the different allocations, but what's the actual difference?

GTT memory is allocated as a buffer object in the kernel mode driver. This memory is not pageable in the traditional sense, though the memory manager does have its own swap mechanism. However, in the upstream kernel, it limits us to less than 1/2 of system memory capacity due to some limitations with memory accounting and the OOM killer.

To get around that, we use pageable memory (plain mmap) for most system memory allocations, and then map them for GPU access using userptr BOs or our newer SVM API. This memory can be paged freely by the Linux kernel, so our driver needs to handle MMU notifiers when that happens, so we can keep our GPU page tables in sync. Fortunately we've optimized things to make that rare (except when NUMA balancing is on).

We use pageable memory for most system memory allocations so we get access to the full system memory capacity. We use GTT for memory that must be accessed in kernel mode, or shared between processes with DMABufs. If you don't need the full capacity, using GTT may be slightly faster (no MMU notifiers, faster to allocate and free).

Is there a lower level queue than AQL queue? I see a KFD_IOC_QUEUE_TYPE_COMPUTE as well as KFD_IOC_QUEUE_TYPE_COMPUTE_AQL Can this be used to dispatch kernels and get lower level control, maybe with PM4 packets?

KFD_IOC_QUEUE_TYPE_COMPUTE uses PM4 packets. We use that in kfdtest for low-level testing with tiny hand-written assembly shaders. IME, doing dispatches with PM4 packets is a pain, requires low-level HW specs for writing GPU-specific dispatch registers and handling scratch memory, synchronization, cache flushes, HDP flushes etc. manually. There is lots of potential for getting things subtly wrong, or causing more hardware hangs by corrupting the wrong registers.

AQL is our "architected queuing language" that came out of the HSA initiative. It's very abstract by design to make it portable with a stable ABI, multi-producer semantics, well defined memory coherence, synchronization primitives and kernel calling conventions, which makes all of these features usable in 3rd-party runtime code such as your own. It also enables interoperability between different language runtimes. You've already seen the PM4 escape through a vendor packet that we use for icache flushes in the code-object loader. I find it unfortunate that we need that escape. Other than that, all our compute language runtimes for ROCm use AQL.

geohot commented Apr 5, 2024 •

edited

Loading

Cool, makes sense re: GTT and USERPTR. I switched most to GTT since we aren't doing any big allocations, but found USERPTR was required for readinto to work. Haven't noticed a performance difference.

Ahh, PM4 looks like what I was putting on the NVIDIA queues. https://github.com/geohot/cuda_ioctl_sniffer/blob/master/gpu_driver.cc I have a whole bunch of tinygrad work to do first, but I do want to move to PM4 eventually. Would love to see it documented. For now AQL is fine though if we aren't hitting dispatch bugs. The key thing is that our drivers are O(1) regardless of queue length and that we have lightweight ways to sync the queues, which is fine at the AQL+SDMA level.

I have a very repeatable GPU crash in KFD, appears as just a hang and looks like the same one we were hitting before. Happens on both 6.0.2 and 6.0.3 (looks to happen faster on 6.0.3). Occurs around 500 steps into training.

On current tinygrad master (164329a8ea71ac63eeec5adb526b1ab1a4eb5982) in the ResNet trainer
BS=768 GPUS=6 WANDB=1 BEAM=4 MODEL=resnet KFD=1 python3 examples/mlperf/model_train.py

Python traceback is a wait for 10s+ for a signal that never comes on the SDMA queue.

  File "/home/tiny/tinygrad/examples/mlperf/model_train.py", line 259, in <module>
    globals()[nm]()
  File "/home/tiny/tinygrad/examples/mlperf/model_train.py", line 144, in train_resnet
    next_proc = data_get(it)
  File "/home/tiny/tinygrad/examples/mlperf/model_train.py", line 126, in data_get
    return x.shard(GPUS, axis=0).realize(), Tensor(y, requires_grad=False).shard(GPUS, axis=0), cookie
  File "/home/tiny/tinygrad/tinygrad/tensor.py", line 142, in realize
    Tensor.corealize([self])
  File "/home/tiny/tinygrad/tinygrad/tensor.py", line 139, in corealize
    run_schedule(create_schedule(flatten([x.lazydata.lbs if isinstance(x.lazydata, MultiLazyBuffer) else [x.lazydata] for x in lst])))
  File "/home/tiny/tinygrad/tinygrad/engine/realize.py", line 5, in run_schedule
    def run_schedule(schedule:List[ScheduleItem]): CommandQueue(schedule)()
  File "/home/tiny/tinygrad/tinygrad/engine/commandqueue.py", line 99, in __call__
    fxn.exec([si.output, si.input])
  File "/home/tiny/tinygrad/tinygrad/device.py", line 48, in exec
    et = self(rawbufs, var_vals)
  File "/home/tiny/tinygrad/tinygrad/device.py", line 82, in __call__
    self.copy(dest, src)
  File "/home/tiny/tinygrad/tinygrad/device.py", line 72, in copy
    dest.allocator.copy_from_fd(dest._buf, src._buf.ud.fd, src._buf.offset, src.nbytes)
  File "/home/tiny/tinygrad/tinygrad/runtime/ops_kfd.py", line 281, in copy_from_fd
    if i != 0: self.device._wait_signal(self.device.signal_sdma)
  File "/home/tiny/tinygrad/tinygrad/runtime/ops_kfd.py", line 361, in _wait_signal
    if ret.wait_result != 0: raise RuntimeError(f"wait_result: {ret.wait_result}, {timeout} ms TIMEOUT!")
RuntimeError: wait_result: 1, 10000 ms TIMEOUT!

This is what's in dmesg.

[26262.469879] amdgpu 0000:83:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1202
[26262.470548] amdgpu 0000:83:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[26262.471185] amdgpu 0000:83:00.0: amdgpu: Failed to evict queue 5
[26262.471845] amdgpu: Failed to evict process queues
[26262.481377] amdgpu 0000:83:00.0: amdgpu: GPU recovery disabled.
[26262.481743] amdgpu: Failed to evict queues of pasid 0x8006
[26271.215221] amdgpu 0000:83:00.0: amdgpu: Failed to remove queue 4
[26271.350674] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[26271.351787] [drm:amdgpu_mes_flush_shader_debugger [amdgpu]] *ERROR* failed to set_shader_debugger
[26281.549387] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=20001, emitted seq=20005
[26281.551659] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[26281.553705] amdgpu 0000:83:00.0: amdgpu: GPU recovery disabled.

After the crash happens the GPU stays at 100% in rocm-smi. It was GPU 3 that crashed, the rest seemed to go back into low power mode when the process exited. I have UMR reg dumps of all the GPUs, anything I can look for in there?

(I can't rule out that we aren't doing something wrong in our KFD driver, so you are welcome to wait for a simpler repro)

Author

fxkamd commented Apr 5, 2024

AQLQueue.blit_packets writes multiple packets, header first. This is problematic because the AQL packet processor can start reading packets with a valid header even before you ring the doorbell and update the write-index and doorbell. I only see this used in HSAGraph, and I don't understand the rest of TinyGrad well enough yet to know, whether this can happen in a typical ResNet run

So, does the CP read packets that are after the write pointer? We memcpy packets and only then move write_pointer + doorbell.

Yeah, I think it shouldn't read ahead of the write index. Your sequence is different from what I would expect for AQL queues. It's probably fine since you're using it in single-producer mode.

Even in submit_kernel and submit_barrier, you may need a memory barrier before writing the header, to make sure the writes complete in the right order in the CPU. I don't know if python does that implicitly, e.g. because of overheads in the interpreter

We test on x86 which is TSO (so I expect we should not see any writes reordering visible to the memory subsystem). And I hope python doesn't do any reorderings.

I'm not a memory-model lawyer. These discussions can get very tricky. Typically this is handled by compilers and their implementation of atomics with the right acquire/release flags, or hidden inside synchronization primitives such as pthread mutexes or barriers. When you're dealing with AQL queues, HSA signals provide the same functionality. But when you're coordinating python code with the GPU directly, bypassing ROCr, you can rely on neither. In the kernel mode drivers we tend to use explicit memory barriers when doing lock-less synchronization between threads or with external devices.

Author

fxkamd commented Apr 9, 2024

Some more background about the hang you were seeing and a glimpse into the work we're doing on this issue.

ROCm uses user mode queues. Hangs in those queues are not handled by the kernel mode driver as long as the queue is preemptible. For example if you dispatch a persistent shader kernel that basically executes an infinite loop, that queue is going to "soft hang" indefinitely. But as long as the GPU scheduler firmware (MES) can preempt the queue, there is no problem. Other queues can still run, new processes can come along and create more queues and get their work executed on the GPU (assuming the persistent one isn't blocking all the wave slots).

If your kernel causes a page fault (e.g. some out-of-bounds memory access) that is handled by the kernel mode driver, and as long as the queue is still preemptible, it can just terminate your process and not affect any other process in the system. So far this is how things are supposed to work.

What you're seeing in the kernel log is when the MES scheduler firmware is becoming unresponsive after a queue in the CP failed to respond to a preemption request. We're looking into ways to improve the robustness in the scheduler or the driver to recover such situations without a GPU reset if possible, by killing the wavefronts of the offending queue. If that fails, a full GPU reset is still the last resort. This will kill all the applications currently running on the GPU. But new processes should be able to use the GPU after the reset.

When you disable SDMA, you're only disabling its use in the user mode runtime. It's always needed in kernel mode for some buffer management operations. An SDMA hang detected by the kernel mode driver could also be a symptom of something else going wrong in the GPU.

We're finding and fixing some issues with the GPU reset programming sequence in our Linux driver on Navi3. We're also working on the robustness of the MES scheduler so we can recover from more situations without a full GPU reset. At the same time we're looking into understanding what's causing the hangs in the first place. We have some reproductions of such issues with Tinygrad in AMD now, so we're making progress. Getting to the bottom of that may require a bunch of low-level driver hacking and JTAG debugging of the hardware state. Our goal is to make handling of application errors as robust as possible, so that you can get back to debugging your application.

2eQTu commented Apr 9, 2024

@fxkamd Thank you for the updates and frank technical discussion. I'm not affiliated with tinygrad, but am following along too. Looking forward to whatever aspects of the firmware component(s) can eventually be open-sourced. Seeing further down the stack is always helpful.

geohot commented Apr 10, 2024 •

edited

Loading

The same hang appears in HIP, HSA, and KFD, and I've confirmed removing SDMA (using kernels to copy instead) doesn't fix it. They all rely on AQL (and the complex MEC code to parse and run it), so all that seems left to do to try to fix it is PM4. tinygrad/tinygrad#4110

What I like with PM4 is I can use umr to see exactly what packet the GPU stalled on. I have gotten some hangs with PM4, but none of the non preemptible type. Thanks for clearing up the difference between the two hangs.

Is PM4 documented anywhere? Ideally what I want to do is treat the GPU like one queue that gets as close to the hardware as possible. Are the regCOMPUTE registers actually dispatching hardware, or is there more firmware somewhere scheduling them to shaders?

Author

fxkamd commented Apr 10, 2024

If the hang is caused by the packet processor or something that is controls directly, then you can maybe catch it with PM4 packets and UMR. Chances are that it's something triggered by the shader engines, or downstream from them. In that case, all UMR will tell you is, that it's hanging at the dispatch initiator. UMR can dump wavefronts, maybe that will tell you something: https://gitlab.freedesktop.org/tomstdenis/umr/-/blob/main/doc/sphinx/source/wave_status.rst?ref_type=heads

My understanding of PM4 dispatch is, that you use register writes to set up the state for the next dispatch (dimensions, kernel arguments, code object, scratch, register and LDS allocation, etc.). Then the actual dispatch packet kicks off hardware in the CP and other HW blocks that generate workgroups that then get scheduled on the compute units. That hardware (and maybe some firmware) also handles the tracking of completed wave fronts so that it knows when the entire dispatch is completed. Your PM4 commands also need to handle cache/HDP invalidation (before) and flushing (after) to ensure memory coherence around your dispatch, and signaling back to the host (through memory and interrupts) so that your runtime can wait for completed dispatches using an HSA signal. The RELEASE_MEM packet can handle cache flushing and signaling all in one.

The best public documentation of PM4 packets is probably in the open-source code that uses them: the amdgpu kernel mode driver (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdgpu/nvd.h) and Mesa. Also our kfdtest uses PM4 user mode queues, though it doesn't implement all the same dispatch functionality as AQL: https://github.com/ROCm/ROCT-Thunk-Interface/blob/master/tests/kfdtest/src/Dispatch.cpp

I saw you were concerned about out-of-order execution with AQL. I'm waiting for a co-worker to write up a good explanation of what it means. My superficial understanding is that it's a bit of a mis-nomer. The order of AQL packets is guaranteed as you set the BARRIER bit in all your packet headers. This is more about the scheduling of workgroups across the hierarchy of shader engines, arrays and compute units. This feature is needed on Navi3 for reliable CWSR (compute wave save restore), which MES and our driver depend on for preempting long-running compute waves.

geohot commented Apr 10, 2024

Yea, have to look more into the wave dumping stuff and understanding the status of the shader engines. Understood re: ACQUIRE_MEM and RELEASE_MEM, and now I understand what HDP is (the PCI-E bus) and why I need to flush it.

That's where I've been getting PM4 stuff from, was hoping there was something better. This is the sort of hardware documentation I'm hoping AMD releases, what happens after I poke regCOMPUTE_DISPATCH_INITIATOR?

My concern was that after reading this that instead of root causing the deadlock, the out-of-order bit was set and the test repro no longer crashed, so it was considered fixed. My real workload crashed even faster.
https://repo.radeon.com/.hidden/cfa27af7066b8ebd5c73d75110183a62/docs/Change%20Summary_6.0.3_Known_Issues%20(1).pdf

fxkamd/TinyGrad-notes.md

geohot commented Apr 5, 2024 •

edited

Loading

Uh oh!

geohot commented Apr 5, 2024 •

edited

Loading

Uh oh!

fxkamd commented Apr 5, 2024

Uh oh!

geohot commented Apr 5, 2024 •

edited

Loading

Uh oh!

nimlgen commented Apr 5, 2024 •

edited

Loading

Uh oh!

fxkamd commented Apr 5, 2024 •

edited

Loading

Uh oh!

fxkamd commented Apr 5, 2024

Uh oh!

geohot commented Apr 5, 2024 •

edited

Loading

Uh oh!

fxkamd commented Apr 5, 2024

Uh oh!

fxkamd commented Apr 9, 2024

Uh oh!

2eQTu commented Apr 9, 2024

Uh oh!

geohot commented Apr 10, 2024 •

edited

Loading

Uh oh!

fxkamd commented Apr 10, 2024

Uh oh!

geohot commented Apr 10, 2024

Uh oh!

fxkamd/TinyGrad-notes.md

geohot commented Apr 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

geohot commented Apr 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fxkamd commented Apr 5, 2024

Uh oh!

geohot commented Apr 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nimlgen commented Apr 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fxkamd commented Apr 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fxkamd commented Apr 5, 2024

Uh oh!

geohot commented Apr 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fxkamd commented Apr 5, 2024

Uh oh!

fxkamd commented Apr 9, 2024

Uh oh!

2eQTu commented Apr 9, 2024

Uh oh!

geohot commented Apr 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fxkamd commented Apr 10, 2024

Uh oh!

geohot commented Apr 10, 2024

Uh oh!

geohot commented Apr 5, 2024 •

edited

Loading

geohot commented Apr 5, 2024 •

edited

Loading

geohot commented Apr 5, 2024 •

edited

Loading

nimlgen commented Apr 5, 2024 •

edited

Loading

fxkamd commented Apr 5, 2024 •

edited

Loading

geohot commented Apr 5, 2024 •

edited

Loading

geohot commented Apr 10, 2024 •

edited

Loading