Writing out the math used in Sonic MoE: https://arxiv.org/abs/2512.14080. Sec. 3.2 and Appendix C.
Goal: compute the backward pass without caching large tensors whose size would blow up as sparsity increases.
Tensors:
X_{ted}: input tensors
W^1_{edn}: up projection