WIP: Long Context Transformers notes

Transformers Long Context

  • Computing QK^T requires N^2 dot products, where N is the sequence length (each of the N queries is scored against all N keys).
  • The softmax(QK^T)V step materializes an N x N attention matrix, so both time and memory are quadratic in N; see the sketch after this list.
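As a concrete illustration (not from the original notes), here is a minimal NumPy sketch of vanilla scaled dot-product attention; the `scores` array it materializes has shape (N, N), which is exactly the quadratic-cost term described above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Vanilla attention: materializes the full N x N score matrix.

    Q, K, V have shape (N, d). The QK^T step costs O(N^2 * d), and the
    softmax / weighted-sum steps cost O(N^2) time and memory.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (N, N): N^2 dot products
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the key axis
    return weights @ V                             # (N, d)

# Example: N = 1024, d = 64 -> a 1024 x 1024 score matrix is built internally.
N, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (1024, 64)
```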

Selecting a Subsection of Tokens

Repeated theme: use a cheaper "augmentation" of attention (e.g. clustering, LSH lookup) to pre-select only the most relevant tokens, and run the expensive exact attention over that subset; a sketch of the pattern follows below.
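A minimal, hypothetical sketch of that "vet, then attend" pattern (not from the notes): a single set of random hyperplanes hashes queries and keys into buckets, and each query attends only to keys in its own bucket. The function name, bucket scheme, and parameters are illustrative assumptions; Reformer (next) uses multi-round LSH with sorting and fixed-size chunking rather than this naive grouping.

```python
import numpy as np

def hash_bucket_attention(Q, K, V, n_hashes=8):
    """Hypothetical sketch: cheap hashing vets candidate keys per query,
    then exact attention runs only inside each small bucket."""
    N, d = Q.shape
    # Cheap augmentation: random hyperplanes -> an integer bucket id per token,
    # costing O(N * d * n_hashes) instead of O(N^2 * d).
    planes = np.random.default_rng(0).standard_normal((d, n_hashes))
    bits = 1 << np.arange(n_hashes)
    q_buckets = (Q @ planes > 0) @ bits
    k_buckets = (K @ planes > 0) @ bits

    out = np.zeros_like(Q)
    for b in np.unique(q_buckets):
        qi = np.where(q_buckets == b)[0]
        ki = np.where(k_buckets == b)[0]
        if len(ki) == 0:   # no keys landed here; real methods fall back to a default
            continue
        # Exact attention over a small block, never the full N x N matrix.
        scores = Q[qi] @ K[ki].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[qi] = w @ V[ki]
    return out
```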

Reformer