The training task for Transformers is commonly formulated as next-token prediction with teacher forcing: the model is trained to maximize the likelihood of each token in a sequence given the tokens that precede it.
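As a sketch, assuming the standard autoregressive maximum-likelihood objective (the symbols $x_t$, $T$, and $\theta$ here are illustrative, not taken from a specific formulation), the loss over a token sequence $x_1, \ldots, x_T$ can be written as

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{1}, \ldots, x_{t-1}\right),
$$

where $p_\theta$ is the model's predicted distribution over the vocabulary. Because the target at every position is known in advance under teacher forcing, all $T$ terms can be evaluated in a single forward pass rather than one step at a time.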
Unlike RNNs, which process input sequences step by step and maintain hidden states to carry information across steps, Transformers rely on self-attention. Self-attention lets each element of a sequence attend directly to every other element, capturing dependencies regardless of how far apart the positions are. Because no hidden state has to be propagated from one step to the next, the computation for all positions can proceed at once, which makes Transformer training far easier to parallelize than RNN training.
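To make the contrast concrete, the following is a minimal sketch of scaled dot-product self-attention in NumPy (the names `self_attention`, `Wq`, `Wk`, `Wv`, and the toy dimensions are illustrative, not from the original text). Every position's output is produced by a few matrix multiplications over the whole sequence at once; there is no loop that carries a hidden state from one step to the next.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over an entire sequence at once.

    X          : (seq_len, d_model) input embeddings
    Wq, Wk, Wv : (d_model, d_k) projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project all positions in parallel
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every position scores every other position
    weights = softmax(scores, axis=-1)        # one row of attention weights per position
    return weights @ V                        # weighted sum of values for each position

# Toy example: 5 tokens, model width 8, head width 4.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4): all positions computed without any sequential recurrence
```

Because the attention weights for all positions come out of a single matrix product, the whole sequence can be processed as one batched computation on parallel hardware; an RNN, by contrast, cannot compute hidden state t+1 before hidden state t is available.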
The absence of hidden states in Transformers re