Tensor Parallelism (TP) shards individual tensors across devices, following the Megatron-LM pattern:
- Column-wise: Input projections (q/k/v, gate/up)
- Row-wise: Output projections (o_proj, down_proj)
- Sequence Parallel: Shards activations along the sequence dimension to save memory
- Loss Parallel: Keeps logits sharded on the vocab dimension for efficient cross-entropy (see the sketch after this list)
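The plan above maps onto PyTorch's DTensor-based tensor-parallel API (`torch.distributed.tensor.parallel`) roughly as follows. This is a minimal sketch, not this project's actual code: the submodule names (`attention.wq`, `feed_forward.w1`, `tok_embeddings`, `output`, ...), the mesh size, and the `apply_tp` / `tp_loss` helpers are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    SequenceParallel,
    PrepareModuleInput,
    parallelize_module,
    loss_parallel,
)


def apply_tp(model: nn.Module, tp_size: int = 8) -> nn.Module:
    # Hypothetical helper; module paths below assume a Llama-style model.
    tp_mesh = init_device_mesh("cuda", (tp_size,), mesh_dim_names=("tp",))

    # Per-transformer-block plan: column-wise for input projections,
    # row-wise for output projections, sequence parallel for the norms.
    layer_plan = {
        "attention_norm": SequenceParallel(),        # activations sharded on seq dim
        "attention": PrepareModuleInput(
            input_layouts=(Shard(1),),                # seq-sharded coming in
            desired_input_layouts=(Replicate(),),     # all-gather before q/k/v
        ),
        "attention.wq": ColwiseParallel(),
        "attention.wk": ColwiseParallel(),
        "attention.wv": ColwiseParallel(),
        "attention.wo": RowwiseParallel(output_layouts=Shard(1)),  # reduce-scatter out
        "ffn_norm": SequenceParallel(),
        "feed_forward": PrepareModuleInput(
            input_layouts=(Shard(1),),
            desired_input_layouts=(Replicate(),),
        ),
        "feed_forward.w1": ColwiseParallel(),         # gate projection
        "feed_forward.w3": ColwiseParallel(),         # up projection
        "feed_forward.w2": RowwiseParallel(output_layouts=Shard(1)),  # down projection
    }
    for block in model.layers:
        parallelize_module(block, tp_mesh, layer_plan)

    # Embedding, final norm, and output head. The head keeps logits sharded on
    # the vocab dimension (Shard(-1)) so loss parallel can consume them directly.
    parallelize_module(
        model,
        tp_mesh,
        {
            "tok_embeddings": RowwiseParallel(
                input_layouts=Replicate(), output_layouts=Shard(1)
            ),
            "norm": SequenceParallel(),
            "output": ColwiseParallel(
                input_layouts=Shard(1),
                output_layouts=Shard(-1),
                use_local_output=False,               # keep logits as a sharded DTensor
            ),
        },
    )
    return model


def tp_loss(logits, labels):
    # Loss Parallel: cross-entropy over vocab-sharded logits without
    # all-gathering the full logits tensor. Both the loss computation and
    # the backward pass must run inside the loss_parallel() context.
    with loss_parallel():
        loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
        loss.backward()
    return loss
```

The `PrepareModuleInput` entries are what tie sequence parallelism to the column-/row-wise plan: norms emit sequence-sharded activations, which are all-gathered before the column-wise projections and reduce-scattered back after the row-wise ones, instead of paying a full all-reduce on replicated activations.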