
@fxkamd
Created October 1, 2024 19:06
Solutions to problems with BERT training with tinygrad on AMD GPUs

Thank you to tiny corp for pointing out, in this Tweet, some problems running BERT training with tinygrad on AMD GPUs. A few engineers at AMD took a look at the problem and were quickly able to reproduce it.

What they found was an issue related to CWSR (compute wave save restore), a mechanism that allows our driver and firmware to preempt and reschedule long-running compute waves on our GPUs. The GFXv11 GPU line requires a workaround that sets COMPUTE_PGM_RSRC1.PRIV=1 when dispatching a compute kernel. Normally this is handled by the AQL dispatch packet, but since tinygrad uses a custom runtime, it needs to apply the workaround in its PM4-based dispatch. The workaround is specific to GFXv11 GPUs; other GPUs do not require it and should not use it. The following KFDTest patch can be used as a reference: https://github.com/ROCm/ROCT-Thunk-Interface/commit/507637ed5b82197eecbf483cdc1234939766549a
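For illustration, here is a minimal sketch of what forcing that bit looks like in a PM4-based dispatch path. The PACKET3/SET_SH_REG encoding follows the upstream amdgpu driver headers, but the COMPUTE_PGM_RSRC1 offset and the PRIV bit position are assumptions to verify against your own GFX11 register headers; the KFDTest patch linked above is the authoritative reference.

```c
/* Sketch: forcing COMPUTE_PGM_RSRC1.PRIV=1 in a PM4-based dispatch, per the
 * GFXv11 CWSR workaround described above. The register offset and the PRIV
 * bit position (bit 20) are assumptions -- check your GFX11 headers. */
#include <stdint.h>

#define PACKET3(op, n)       ((3u << 30) | (((op) & 0xFF) << 8) | ((n) & 0x3FFF))
#define PACKET3_SET_SH_REG   0x76
#define SH_REG_BASE          0x2C00   /* start of the SH register space */
#define mmCOMPUTE_PGM_RSRC1  0x2E12   /* assumed GFX11 offset */
#define COMPUTE_PGM_RSRC1__PRIV_MASK (1u << 20)

/* Emit a SET_SH_REG packet that writes COMPUTE_PGM_RSRC1 with PRIV forced on.
 * Returns the advanced indirect-buffer write pointer. */
static uint32_t *emit_pgm_rsrc1(uint32_t *ib, uint32_t rsrc1)
{
    *ib++ = PACKET3(PACKET3_SET_SH_REG, 1);
    *ib++ = mmCOMPUTE_PGM_RSRC1 - SH_REG_BASE;    /* packet register index */
    *ib++ = rsrc1 | COMPUTE_PGM_RSRC1__PRIV_MASK; /* GFXv11-only workaround */
    return ib;
}
```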

While investigating this issue, another potential problem was discovered: tinygrad is not setting the control stack size for user mode queues correctly (also related to CWSR). Applications written on top of HSA or HIP would not have this problem because AMD's ROCr runtime handles queue creation and the related buffer allocations for them. You can find the calculations for the correct CWSR area and control stack size here: https://gitlab.freedesktop.org/agd5f/linux/-/blob/a1fc9f584c4aaf8bc1ebfa459fc57a3f26a290d8/drivers/gpu/drm/amd/amdkfd/kfd_queue.c#L417
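Paraphrasing the linked kfd_queue.c, the sizing works out roughly as in the sketch below. The per-CU sizes and bytes-per-wave constant differ by GPU generation, and the per-wave debugger reservation is omitted, so treat the constants (taken here for the GFXv11 case) as illustrative; the kernel source is authoritative.

```c
/* Rough paraphrase of the control-stack / CWSR-area sizing in the linked
 * kfd_queue.c, specialized to the GFXv11 case. Constants are illustrative
 * and the per-wave debugger reservation is omitted. */
#include <stdint.h>

#define PAGE_ALIGN(x)  (((x) + 4095u) & ~4095u)

/* Per-CU state that must be saved: VGPRs + SGPRs + LDS + HW registers. */
#define VGPR_SIZE_PER_CU   0x60000u   /* GFXv11 value in kfd_queue.c */
#define SGPR_SIZE_PER_CU   0x4000u
#define LDS_SIZE_PER_CU    0x10000u
#define HWREG_SIZE_PER_CU  0x1000u

#define CTL_STACK_BYTES_PER_WAVE 12u  /* 12 on GFXv10+, 8 on older parts */
#define CTX_SAVE_AREA_HEADER     40u  /* HSA user context save area header */

/* Returns the total CWSR save area size for cu_num compute units and
 * reports the control stack and workgroup data sizes separately. */
static uint32_t cwsr_sizes(uint32_t cu_num,
                           uint32_t *ctl_stack_size, uint32_t *wg_data_size)
{
    uint32_t wave_num = cu_num * 32;  /* waves in flight on GFXv10+ */

    *wg_data_size = PAGE_ALIGN(cu_num * (VGPR_SIZE_PER_CU + SGPR_SIZE_PER_CU +
                                         LDS_SIZE_PER_CU + HWREG_SIZE_PER_CU));
    *ctl_stack_size = PAGE_ALIGN(CTX_SAVE_AREA_HEADER +
                                 wave_num * CTL_STACK_BYTES_PER_WAVE + 8);
    return *ctl_stack_size + *wg_data_size;
}
```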

The next ROCm release will include robustness improvements in KFD that will gracefully fail during user mode queue creation if the queue buffer pointers or sizes are missing or fail sanity checks. It will also prevent accidental unmapping or freeing of queue buffers while the queues exist. These patches have been in our public kernel code since July: https://lore.kernel.org/amd-gfx/[email protected]/T/

@geohot commented Oct 8, 2024

Thanks for the reply, just saw this!

We have CWSR disabled on our machines with `options amdgpu cwsr_enable=0` in `amdgpu.conf`, so I'm not sure it's the same issue. But we should certainly fix our runtime to work with CWSR.

Are there any docs about how CWSR works? Is it a hardware feature, or all implemented in the kernel driver?

Also, definitely appreciate gracefully failing in the kernel driver if things are wrong.

@fxkamd commented Oct 8, 2024

CWSR is a fundamental feature of our command submission/execution model and memory management model with user mode queues. Contrasting it with kernel mode command submission is like comparing preemptive scheduling with cooperative scheduling on CPUs.

Without CWSR we can only preempt dispatches at wavefront granularity, i.e. an endless loop in a shader kernel would hang the GPU and require a GPU reset to recover. CWSR allows preempting wavefronts at instruction granularity, which enables arbitrarily long-running or persistent shader kernels on the GPU without introducing stalls or hangs affecting the GPU scheduler and memory manager. We also use the saved GPU state from CWSR in the ROCm debugger. Therefore disabling CWSR is not recommended, and we don't test our software with CWSR disabled; that mode only exists for hardware bring-up, triage, and debugging purposes.

CWSR is implemented by a combination of CP firmware and a trap handler, which is like an interrupt handler running on the GPU compute units. The driver is only responsible for allocating the memory needed to save the GPU state (registers, LDS, and the metadata needed to restart waves later). When a request to preempt a queue that is currently executing a dispatch reaches the CP, it first tries to drain running waves from the compute units. If a grace period expires, it sends a trap to all the remaining waves, and the trap handler does the heavy lifting of saving register and LDS state to memory. The trap handler code can be found in our kernel mode driver. The CP writes restart information about the wavefronts to the control stack.

AFAIK resuming a queue is handled by the CP firmware and supporting fixed-function HW. After restoring the wave, register, and LDS state, it resumes execution of each wave at the instruction that was previously interrupted by the trap handler.

@geohot commented Oct 9, 2024

We're working on a full driver that will have just one system-wide kernel mode queue. I'm hoping that by not needing to support any graphics, multiuser stuff, or SVM memory, we can make something 100x simpler. (The end goal is to support multiuser at a higher abstraction level in tinygrad.) Hopefully this finally ends all instability, since everything at that point is statically scheduled. tinygrad/tinygrad#6923

tinygrad should never generate shaders with endless loops; the language isn't Turing complete and all programs must halt. So we don't need the ability to preempt, since it's never used. If you kill the process, I'm fine with a full GPU reset. Though with the fixes you suggested (merged in tinygrad/tinygrad@137ad55) we can test CWSR again.

Thanks for the pointer to the trap handler, the flow makes sense. When you say CP firmware, do you mean the MEC? I don't see any CP in `cat amdgpu_firmware_info`: https://github.com/geohot/7900xtx

@fxkamd commented Oct 9, 2024

Yes, in this context CP is another name for MEC. CP stands for "command processor". MEC stands for "Micro Engine Compute".

@fxkamd commented Oct 17, 2024

I wanted to mention another option here, short of writing a whole new kernel mode driver. The existing upstream driver has statically mapped kernel mode queues used by the amdgpu_cs ioctl for PM4 indirect buffer (IB) submission from user mode. This path is typically used by Mesa (OpenGL and Vulkan), but tinygrad could use it instead of the KFD user mode queues, since you have already eliminated most of the ROCm user mode stack that builds on KFD. You'd also want to use the GEM memory management APIs, and to wait for command completion you'd use a fence-wait ioctl. libdrm-amdgpu provides thin wrappers for all the ioctls you'd need.
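For a sense of the flow, a bare-bones submission through that path looks something like the sketch below. The entry points are real libdrm-amdgpu functions, but error handling, buffer sizing, and cleanup are stripped out, so treat it as an illustration of the sequence (GEM alloc, VA map, CS submit, fence wait) rather than a drop-in runtime.

```c
/* Minimal sketch of PM4 indirect-buffer submission via the amdgpu_cs path
 * using libdrm-amdgpu, as suggested above. Error handling, cleanup, and
 * realistic buffer sizing are omitted for brevity. */
#include <stdint.h>
#include <string.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

static int submit_ib(int drm_fd, const uint32_t *pm4, uint32_t ndw)
{
    amdgpu_device_handle dev;
    amdgpu_context_handle ctx;
    amdgpu_bo_handle ib_bo;
    amdgpu_bo_list_handle bo_list;
    amdgpu_va_handle va;
    uint64_t gpu_va;
    uint32_t major, minor, expired;
    void *cpu;

    amdgpu_device_initialize(drm_fd, &major, &minor, &dev);
    amdgpu_cs_ctx_create(dev, &ctx);

    /* Allocate a GEM buffer for the indirect buffer (IB) and fill it. */
    struct amdgpu_bo_alloc_request req = {
        .alloc_size = 4096, .phys_alignment = 4096,
        .preferred_heap = AMDGPU_GEM_DOMAIN_GTT,
    };
    amdgpu_bo_alloc(dev, &req, &ib_bo);
    amdgpu_bo_cpu_map(ib_bo, &cpu);
    memcpy(cpu, pm4, ndw * 4);

    /* Give the IB a GPU virtual address. */
    amdgpu_va_range_alloc(dev, amdgpu_gpu_va_range_general, 4096, 4096, 0,
                          &gpu_va, &va, 0);
    amdgpu_bo_va_op(ib_bo, 0, 4096, gpu_va, 0, AMDGPU_VA_OP_MAP);

    /* Submit the IB on the compute ring. */
    amdgpu_bo_list_create(dev, 1, &ib_bo, NULL, &bo_list);
    struct amdgpu_cs_ib_info ib = { .ib_mc_address = gpu_va, .size = ndw };
    struct amdgpu_cs_request cs = {
        .ip_type = AMDGPU_HW_IP_COMPUTE,
        .resources = bo_list,
        .number_of_ibs = 1,
        .ibs = &ib,
    };
    amdgpu_cs_submit(ctx, 0, &cs, 1);

    /* Wait for completion with a fence-wait ioctl. */
    struct amdgpu_cs_fence fence = {
        .context = ctx, .ip_type = AMDGPU_HW_IP_COMPUTE, .fence = cs.seq_no,
    };
    return amdgpu_cs_query_fence_status(&fence, AMDGPU_TIMEOUT_INFINITE,
                                        0, &expired);
}
```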

@FlorianHeigl commented Nov 11, 2024

@fxkamd if you talk of a "next ROCm" release, I pray it's not the one fully dropping gfx906!
(It would be a really huge relief to have a working stack once before it's deprecated.)
