- The computation of QK^T involves N^2 dot products, where N is the sequence length.
- The softmax(QK^T)V operation requires an N x N attention matrix, making complexity quadratic in N.
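To make the quadratic cost concrete, here is a minimal NumPy sketch of full attention; the N x N `scores` matrix is the term that dominates both time and memory. The function name and shapes are illustrative, not from any specific implementation.

```python
import numpy as np

def full_attention(Q, K, V):
    """Naive attention: materializes the full N x N score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # N x N: N^2 dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                   # (N x N) @ (N x d)

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = full_attention(Q, K, V)  # time and memory both scale as O(N^2)
```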
Repeated theme: use a lighter-weight "augmentation" of attention (e.g. clustering, LSH lookup) to pre-select only the most relevant tokens before running the expensive full attention on them; see the sketch below.
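As a rough illustration of that theme (not any particular paper's method), the sketch below buckets queries and keys with random-hyperplane LSH and lets each query attend only to keys in its own bucket. With roughly balanced buckets, the cost drops from O(N^2) to about O(N^2 / B) for B buckets; the hashing step itself is cheap, O(N * d * n_hash_bits). All names and parameters here are assumptions for illustration.

```python
def lsh_bucket_attention(Q, K, V, n_hash_bits=4, rng=None):
    """Cheap pre-selection via LSH: attend only within matching buckets.

    Illustrative sketch only. Queries whose bucket contains no keys
    receive a zero output here; real methods handle this more carefully
    (e.g. multiple hash rounds or fallback buckets).
    """
    rng = rng or np.random.default_rng(0)
    d = Q.shape[-1]
    planes = rng.standard_normal((d, n_hash_bits))               # shared random hyperplanes
    bit_values = 1 << np.arange(n_hash_bits)
    q_codes = (Q @ planes > 0) @ bit_values                      # bucket id per query
    k_codes = (K @ planes > 0) @ bit_values                      # bucket id per key

    out = np.zeros_like(Q)
    for b in np.unique(q_codes):
        q_idx = np.where(q_codes == b)[0]
        k_idx = np.where(k_codes == b)[0]
        if len(k_idx) == 0:
            continue
        # Full attention, but only over the keys sharing this bucket.
        scores = Q[q_idx] @ K[k_idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[q_idx] = w @ V[k_idx]
    return out

approx_out = lsh_bucket_attention(Q, K, V)  # approximates full_attention(Q, K, V)
```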