Title: Notes on QKV Attention and Training Dynamics
Author: RyAnne Graff
Date: July 10, 2025
License: CC BY 4.0
Co-developed with assistance from ChatGPT (OpenAI).
Introduction
Transformer models rely on Query-Key-Value (QKV) attention as a core mechanism for capturing relationships between tokens. While QKV attention has proven effective, this document explores an observation about how the QKV projection weights are trained and raises questions about whether the way they are learned aligns with the broader roles they ultimately serve in the model.
This is not meant as a definitive critique, but rather an invitation to consider a possibly under-discussed property of Transformer training.
Initial Question
- What happens to QKV attention when the training data is presented in a different order? Would the learned QKV weights converge to the same values?
- If model behavior appears stable despite varying training order, does that indicate QKV attention is inherently robust — or could it be masking an underlying inconsistency in how the projections are learned?
Clarification
- This question does not refer to reversing tokens in a sequence.
- It refers to changing the order of training samples (e.g., documents or data segments) across training steps and observing how this affects the learned QKV projection weights (a minimal experiment sketch follows below).
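
One way to probe this empirically is sketched below. The assumptions are purely illustrative (PyTorch, a toy single-head attention layer, a random regression task, arbitrary hyperparameters): train two identically initialized copies of the same model on the same minibatches presented in different orders, then compare the learned W_Q matrices.

    # Illustrative sketch: does the order of training batches change the learned W_Q?
    import copy
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    d_model, seq_len, n_batches = 16, 8, 32

    # Toy "dataset": fixed random sequences with random regression targets.
    batches = [(torch.randn(4, seq_len, d_model), torch.randn(4, seq_len, d_model))
               for _ in range(n_batches)]

    class TinyAttention(nn.Module):
        """Single-head self-attention with only the QKV projections as parameters."""
        def __init__(self, d):
            super().__init__()
            self.W_Q = nn.Linear(d, d, bias=False)
            self.W_K = nn.Linear(d, d, bias=False)
            self.W_V = nn.Linear(d, d, bias=False)

        def forward(self, x):
            Q, K, V = self.W_Q(x), self.W_K(x), self.W_V(x)
            attn = torch.softmax(Q @ K.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
            return attn @ V

    def train(model, batch_order):
        opt = torch.optim.SGD(model.parameters(), lr=1e-2)
        for i in batch_order:
            x, y = batches[i]
            loss = ((model(x) - y) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return model

    base = TinyAttention(d_model)
    m1 = train(copy.deepcopy(base), list(range(n_batches)))            # one ordering
    m2 = train(copy.deepcopy(base), list(reversed(range(n_batches))))  # reversed ordering

    # If learning were order-independent, this relative difference would be ~0.
    diff = (m1.W_Q.weight - m2.W_Q.weight).norm() / m1.W_Q.weight.norm()
    print(f"relative difference in learned W_Q after reordering: {diff.item():.4f}")

A more serious version of this experiment would train to convergence, average over seeds, and compare functional behavior (attention patterns, ablation effects) as well as raw weights, since different weight matrices can implement the same function.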
Observation Summary

QKV Weights May Be Path-Dependent
- The QKV weights (W_Q, W_K, W_V) are updated incrementally, based only on the data in each minibatch.
- This means they may reflect the history and order of training, rather than a global understanding of the data.

Projections Are Computed Token-by-Token
- Each token's QKV is computed independently, without access to the full layer's state at that time.
- There doesn't appear to be a phase where QKV weights are revisited to match the stabilized behavior of a layer later in training (a short sketch of both of these points follows this summary).

Implications for Global Consistency
- If QKV weights are used to mediate relationships across all inputs, but are shaped only by local and transient signals, they may carry forward biases or assumptions that aren't re-evaluated.

Mismatch Between Emergent Behavior and Learned Projections
- Transformer layers often take on functional roles as training progresses.
- But the QKV weights may not be realigned to reflect the final role of their layer, which could introduce inefficiencies or inconsistencies.
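
To make the first two observations concrete, here is a small illustrative snippet (again assuming PyTorch; the shapes, loss, and learning rate are arbitrary). It shows that each token's query is just that token's embedding passed through the current W_Q, and that a training step updates W_Q using only the gradient from the current minibatch.

    # (a) Per-token projection and (b) per-minibatch update, in miniature.
    import torch

    torch.manual_seed(0)
    d = 8
    W_Q = torch.randn(d, d, requires_grad=True)  # one learned projection, no bias
    x = torch.randn(5, d)                        # embeddings for 5 tokens of one sequence

    # (a) Each token's query depends only on that token's embedding and the
    # current W_Q, not on the other tokens or on any later state of the layer.
    Q = x @ W_Q
    assert torch.allclose(Q[3], x[3] @ W_Q)

    # (b) A training step changes W_Q using only this minibatch's gradient;
    # nothing later revisits W_Q against the dataset as a whole.
    minibatch_loss = Q.pow(2).mean()             # stand-in for a real loss
    minibatch_loss.backward()
    with torch.no_grad():
        W_Q -= 1e-2 * W_Q.grad                   # SGD step driven by this batch alone

The attention scores themselves do mix information across tokens, but only after Q, K, and V have been produced by these fixed per-token linear maps.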
Potential Consequences (Speculative)
- Attention heads may specialize inconsistently between runs.
- Interpretability may be limited if the projections don't reflect layer behavior coherently.
- Some models may require scale or redundancy to compensate for projection drift.
Possible Direction (for Further Exploration)
Segment-Level QKV Context
One possible direction is to maintain multiple sets of QKV projection weights (W_Q, W_K, W_V) per layer, each corresponding to a different segment or partition of the overall training dataset. These could represent different thematic, structural, or statistical regions of the data. A global QKV set could then be composed or informed by these segment-specific sets, whether through averaging, weighting, or some learned combination, so that each layer's projections reflect a broader consensus rather than just the history of minibatch updates. A related idea is to compute QKV projections from summaries of small groups of tokens, in addition to the individual token projections.
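
As a very rough sketch of what the segment-level idea might look like in code (a hypothetical module, assuming PyTorch; the composition rule shown, a learned softmax mixture, is just one of the options mentioned above): keep one (W_Q, W_K, W_V) set per data segment and compose the projections actually used from those sets.

    # Hypothetical sketch of segment-level QKV sets composed into a global projection.
    import torch
    import torch.nn as nn

    class SegmentedQKV(nn.Module):
        """One (W_Q, W_K, W_V) set per data segment; the projections actually used
        are a learned softmax-weighted combination of the segment sets."""
        def __init__(self, d_model, n_segments):
            super().__init__()
            scale = d_model ** -0.5
            self.W_Q = nn.Parameter(torch.randn(n_segments, d_model, d_model) * scale)
            self.W_K = nn.Parameter(torch.randn(n_segments, d_model, d_model) * scale)
            self.W_V = nn.Parameter(torch.randn(n_segments, d_model, d_model) * scale)
            self.mix = nn.Parameter(torch.zeros(n_segments))  # learned mixing logits

        def forward(self, x):                                  # x: (batch, seq, d_model)
            w = torch.softmax(self.mix, dim=0)                 # segment weights, sum to 1
            W_Q = torch.einsum('s,sij->ij', w, self.W_Q)       # composed "global" W_Q
            W_K = torch.einsum('s,sij->ij', w, self.W_K)
            W_V = torch.einsum('s,sij->ij', w, self.W_V)
            Q, K, V = x @ W_Q, x @ W_K, x @ W_V
            attn = torch.softmax(Q @ K.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
            return attn @ V

    layer = SegmentedQKV(d_model=16, n_segments=3)   # 3 hypothetical data segments
    out = layer(torch.randn(2, 10, 16))              # (batch, seq, d_model) in and out

The training schedule is deliberately left open here: for example, each segment's set could be updated only on minibatches drawn from that segment while the mixing weights are trained globally. The second idea above, projecting summaries of small token groups alongside individual tokens, is not covered by this sketch.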
Conclusion
This is not a formal claim, but an exploratory note. The way QKV attention is trained, incrementally and without re-alignment, may have underappreciated implications for how attention heads develop, behave, and generalize.
This write-up is shared in case others have thought about similar dynamics or have ideas for how to investigate them further.