One of the most fundamental objects in machine learning is the probabilistic model, and two of the most fundamental questions are how to build such models (i.e., how do you represent a probability distribution?) and how to train them.
The success of language models like GPT-3 stems in large part from the idea of representing probability distributions autoregressively, which makes the exact probability of a sequence (e.g., a sentence) computable by decomposing it into a product of conditional probabilities, each over a tractably small number of elements (the tokens):
$P(\mathbf{x}) = P(\mathbf{x}_0)\, P(\mathbf{x}_1|\mathbf{x}_0)\, P(\mathbf{x}_2|\mathbf{x}_0, \mathbf{x}_1) \cdots P(\mathbf{x}_N|\mathbf{x}_0,\ldots,\mathbf{x}_{N-1})$
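As a minimal sketch of what this decomposition buys you in practice, the snippet below scores a sentence with a causal language model by summing the conditional log-probabilities of its tokens. It assumes the Hugging Face `transformers` library and the public `gpt2` checkpoint; the choice of model and example sentence are illustrative, not from the original text.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small causal (autoregressive) language model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Probabilistic models are fundamental to machine learning."
input_ids = tokenizer(text, return_tensors="pt").input_ids  # shape: (1, N)

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, N, vocab_size)

# The logits at position i predict token i+1, so align predictions with targets.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
targets = input_ids[:, 1:]
token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# log P(x) = sum_i log P(x_i | x_0, ..., x_{i-1})
# (The very first token is conditioned on rather than scored here, unless a
#  beginning-of-sequence token is prepended.)
print("log P(x) =", token_log_probs.sum().item())
```

Working in log space simply turns the product of conditionals into a sum, which avoids numerical underflow for long sequences while still giving the exact sequence probability.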