fxkamd’s gists

Thank you to tiny corp for pointing out some problems running BERT training with Tinygrad on AMD GPUs in this Tweet. We had a few engineers at AMD take a look at the problem and they were quickly able to reproduce it.

What they found was an issue related to CWSR (compute wave save restore), which is a mechanism that allows our driver and firmware to preempt and reschedule long-running compute waves on our GPUs. The GFXv11 GPU line requires a workaround to set COMPUTE_PGM_RSRC1.PRIV=1 when dispatching a compute kernel. Normally this is handled by the AQL DISPATCH packet. However, since the Tinygrad implementation leverages a custom runtime, it requires this workaround in its PM4-based dispatch. This patch is specific to GFXv11 GPUs. Other GPUs do not require it and should not use this workaround. The following KFDTest patch can be used as a reference: https://github.com/ROCm/ROCT-Thunk-Interface/commit/507637ed5b82197eecbf483cdc1234939766549a

While inv

This is Felix Kuehling, long time KFD driver architect. I started looking into the TinyGrad source code yesterday, focusing on ops_kfd.py, ops_hsa.py and driver/hsa.py, to understand how TinyGrad talks to our HW and help with the ongoing debugging effort from the top down. This analysis is based on this commit: https://github.com/tinygrad/tinygrad/tree/3de855ea50d72238deac14fc05cda2a611497778

I'm intrigued by the use of Python for low-level programming. I think I can learn something from your use of ctypes and clang2py for fast prototyping and test development. I want to share some observations based on my initial review.

ops_kfd looks pretty new, and I see many problems with it based on my long experience working on KFD. I think it's interesting, but probably not relevant for the most pressing problems at hand, so I'll cover that last.

ops_hsa uses ROCr APIs to manage GPU memory, create a user mode AQL queue for GPU kernel dispatch, async SDMA copies, and signal-based synchronization with barrier packets

Felix Kuehling fxkamd