These are just some notes on my current understanding of the subtleties of the AGX memory model and the TLB/caching issues I'm seeing.
TLBI instructions issued from EL1 are not broadcast to the GPU when stage 2 translation is enabled. That's it. That's what the bug was.