- CNN: convolutional neural net: replace linear layers with convolutions
- GAN: generator and discriminator training in tandem
- SGD: stochastic gradient descent
- Supervised: training data is labeled (MNIST 0-9 digit labels)
- Unsupervised: training data is unlabeled (DCGAN bedrooms are just bedrooms)
- Parameters: learned weights and biases within NN layers
- Activations: transient computations calculated during a forward pass
- Probability distance function: a way to measure how far apart two probability distributions are (e.g. KL divergence, covered below)
Intuitions for KL Divergence
- Personally, the gambling analogy clicked the most for me
- If the house is mistaken in its belief that a game follows probability distribution $Q$ when you know it follows $P$, then $D_{KL}(P||Q)$ tells you how much money you could win using an optimal strategy to exploit this misbelief
KL Divergence intuitions video
- For 2 probability distributions $P$ and $Q$ over the same space of outcomes $|X|=n$, measure the likelihood of some event $E = x_1, x_2, x_3, \cdots, x_k$ under both distributions
- $P_P(E) = \prod p_i$ and $P_Q(E) = \prod q_i$
- Quantify the distance $D(P||Q)$ from $P$ to $Q$ as the ratio of $P_P$ and $P_Q$ for any event
- Normalize by raising this ratio to the power $1/k$ and taking the log
- Using log rules and taking the limit as $k$ goes to infinity, this reduces to $\sum p_i\log(\frac{p_i}{q_i}) = D_{KL}(P||Q)$
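- A minimal numpy sketch of that final formula (my own illustration; the distributions are made-up):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # eps guards against log(0) and division by zero
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = np.array([0.5, 0.3, 0.2])  # distribution the game actually follows
q = np.array([0.4, 0.4, 0.2])  # the house's mistaken belief
print(kl_divergence(p, q))     # > 0; equals 0 only when P == Q
```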
- Classifier models usually output a probability distribution (softmax of logits), so we need a way to measure the distance of this probability distribution against the true distribution from the training data
- Input $x_i$ leads to prediction $P(y|x_i;\theta)$ (model parameters $\theta$), whereas the real distribution is $P^*(y|x_i)$
- Naturally you could use KL divergence to measure the distance from $P$ to $P^*$
- Using log rules you can separate the computation into a term that depends on $\theta$ and one that depends only on the true training data distribution
- The true distribution is fixed, so we can focus on minimizing just the term that depends on $\theta$
- This term is $-\sum P^*(y|x_i)\log(P(y|x_i;\theta))$, which is the definition of cross entropy loss
- In classifiers the true distribution is often all zeros with a single one at the true label of the example (one-hot)
- In this case cross entropy loss reduces to the maximum likelihood estimator for this dataset
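- A quick numpy sketch of cross entropy with one-hot labels (illustrative only; names are mine):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """-sum_y P*(y|x) log P(y|x; theta), averaged over the batch.
    With one-hot P*, only the log-prob of the true label survives."""
    probs = softmax(logits)
    n = logits.shape[0]
    return float(-np.mean(np.log(probs[np.arange(n), labels] + 1e-12)))

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])  # model outputs for 2 examples
labels = np.array([0, 2])                                # true class indices
print(cross_entropy(logits, labels))
```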
- Non-linear layers are bad at learning the identity function and its ilk
- Short-circuiting a layer allows that layer to approximate the identity by learning near-zero weights so the input bypasses through the identity branch (see the sketch after these ResNet notes)
- Empirically this performs much better than without residuals
- Using parameterized projections for the shortcut vs. a straight identity produces no significant improvement
- In ImageNet they use batch norms so there’s unlikely to be vanishing gradients (non-zero variance on forward step)
- As of the paper's publication, researchers are not quite sure why "plain nets" take so long to converge
- ResNets converge faster
- Belief is that identity shortcuts let optimizer solve some layers faster
- Bottlenecks
- Instead of short-circuiting every pair of 3x3 convs, create triplets of 1x1 conv reducing dim -> 3x3 conv -> 1x1 conv restoring dimension
- Short circuit the entire triplet
- Shows major improvement over bottlenecked “plain” nets
- Parametrized projections here hurt since they short circuit the higher dimensional space and thus introduce a lot of parameters
- Using bottlenecks allows for much deeper networks with fewer params (a 1x1 conv has far fewer params than a 3x3)
- 50-, 101-, and 152-layer ResNets perform way better
- At each layer the stddev of responses is lower than in plain nets, somewhat validating the hypothesis that ResNet non-linear functions tend to approximate 0 and let the identity part take over
- At 1000 layers it still optimizes and works but performs slightly worse than 100 layers due to overfitting
- Technique also worked for general object classification outside of ImageNet/CIFAR
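- A minimal PyTorch sketch of a bottleneck residual block as I understand it (not the paper's exact code; channel counts and layer ordering are my own simplification):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, short-circuited around the whole triplet."""
    def __init__(self, channels, bottleneck_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck_channels, 1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, bottleneck_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # if the block learns ~zero weights, the whole layer approximates the identity
        return self.relu(self.block(x) + x)

x = torch.randn(2, 256, 56, 56)
print(Bottleneck(256, 64)(x).shape)  # torch.Size([2, 256, 56, 56])
```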
- Internal Covariate Shift: the change in the distribution of network activations due to the change in network parameters during training
- Using sigmoid activation it's bad to fall into the "saturated regime" (long tail) since gradient of activation goes to 0
- Abstract: normalizing for each training mini-batch allows for higher learning rates and regularizes at least as well as dropout
- Training NN with mini-batches instead of on single examples is generally better
- More computationally efficient with parallel processing
- Mini-batches better represent the entire training set distribution (and its gradient) than a single example does
- Historically we've known that "whitening" (i.e. scaling for mean 0 var 1) inputs leads to faster training of an entire network
- We can model each layer of a NN as a separate NN whose input is the output of the previous layer
- Each SGD step across layers looks the same when you model it this way IF the distribution of the input does not change from layer to layer
- This is generally not true since the activation function will "saturate" in some dimensions of the input leading to vanishing gradients and slow convergence
- BatchNorm: core idea is to "whiten" inputs across layers not just at beginning of training
- When computing normalization factors independently of the gradient descent steps we find that the normalization adjustment can cancel out the gradient descent step adjustment leading to no reduction in loss
- Need to make the gradient descent be "aware" of the normalization
- Naively trying to account for this term is computationally expensive for statistical reasons I don't feel like reading about
- Simulate full whitening between layers by scaling mean and var of each dimension independently
- Need to introduce transformation that allows normalization step to also be the identity function to effectively allow no normalization to take place
- Ex: normalizing inputs to a sigmoid restricts the sigmoid to its linear region, making the entire layer effectively linear
- By-passing normalization can allow inputs to trigger non-linear zones of sigmoid if true non-linearity is needed
- Accomplished by introducing new scaling and shifting parameters $\gamma$ and $\beta$ that can learn to undo the variance scaling and mean shifting done during normalization
- Do the normalization across the mini-batch for each dimension of the input independently so these learned parameters can participate in the gradient descent step
- Add some $\epsilon$ to the variance to stabilize it if the variance is near 0
- TODO: only got to section 3.1
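- A rough numpy sketch of the training-time batch-norm transform for one layer (my own illustration; running statistics for inference are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, dims). Normalize each dimension across the mini-batch, then let the
    learned gamma/beta rescale and shift (they could learn to undo the normalization)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # eps stabilizes near-zero variance
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 3.0 + 5.0              # mini-batch of 32 examples, 4 dims
y = batch_norm_forward(x, np.ones(4), np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 mean and ~1 std per dimension
```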
Random Aside: Dropout
- Deep NNs often overfit data and generalize poorly
- One solution is to use "ensemble" models that average the predictions across multiple models with different settings trained on the same data
- Approximate training multiple architectures in parallel by randomly "dropping" nodes during training
- Forces nodes to randomly take on different levels of responsibility on different inputs
- Without dropout sometimes later nodes can "co-adapt" to learn to correct mistakes from previous nodes
- Dropout also promotes learning sparse representations (useful for sparse autoencoders)
- Effectively thins the network during training so you need to start with a "wider" network
- Can be used on any layer
- For each layer, specify a hyperparameter for the probability at which a node is kept (usually ~0.5 for hidden layers and 0.8-1 for visible/input layers)
- Need to rescale weights by the keep probability at test time since weights increase to compensate for reduced connectivity
- Can be done during training instead via "inverted dropout"
- Replaced older "regularization" techniques
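- A small numpy sketch of inverted dropout at training time (illustrative; `keep_prob` is the retention hyperparameter above):

```python
import numpy as np

def inverted_dropout(activations, keep_prob=0.5, training=True):
    """Randomly zero nodes during training and rescale by 1/keep_prob,
    so no weight rescaling is needed at test time."""
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

a = np.ones((2, 8))
print(inverted_dropout(a, keep_prob=0.5))   # roughly half zeros, survivors scaled to 2.0
print(inverted_dropout(a, training=False))  # unchanged at inference
```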
- Generators generally more complex than discriminators
- Discriminators can draw lines to divide the space, but generators must truly learn the distribution to create new points very close to real points
- Training convergence is generally unstable
- If the generator gets really good then the discriminator can do no better than a 50/50 guess, which means its feedback to the generator is uninformative
- More training on bad feedback can degrade quality
- Conversely, early on in training the discriminator’s job is “easy” since generator outputs are horrible
- Naive generator loss: $\log(1-D(G(z))) \approx \log(1-0) = 0$ when $D$ is very good and $G$ is very bad, so the gradient signal is tiny
- "Better" loss $\log(D(G(z)))$ is monotonic in the opposite direction, so maximizing it achieves the same result as minimizing the original loss, except now when $D(G(z))$ is near zero early in training, $\log(D(G(z)))$ has very large magnitude
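- A tiny PyTorch sketch contrasting the two generator losses early in training (illustrative only; `d_fake` stands in for $D(G(z))$):

```python
import torch

d_fake = torch.tensor([0.01, 0.05, 0.10])  # D(G(z)) early on: D easily spots the fakes

naive_loss = torch.log(1 - d_fake)   # saturating loss the generator would minimize
better_loss = torch.log(d_fake)      # non-saturating loss the generator maximizes instead

print(naive_loss)   # all close to 0 -> tiny gradients for the generator
print(better_loss)  # large-magnitude values -> strong gradient signal
```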
- Wasserstein GANs:
- Let the discriminator output any real number instead of a probability that the sample is real/fake
- Goal is for discriminator to output generally larger numbers for real inputs than for fake inputs
- New loss: $D(x) - D(G(z))$
- No need to take logs since these aren't probabilities
- Apparently leads to better training
- Requires weights to be "clipped" for theoretical math reasons (enforcing a Lipschitz constraint)
- Original Wasserstein GAN paper
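- A rough sketch of one WGAN critic step with weight clipping (my own simplification; the tiny `critic` and random `real`/`fake` tensors are stand-ins):

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))  # outputs any real number
opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

real = torch.randn(32, 10)  # stand-in for real samples
fake = torch.randn(32, 10)  # stand-in for generator outputs G(z)

# critic wants D(real) - D(fake) to be large, so minimize the negation
loss = -(critic(real).mean() - critic(fake).mean())
opt.zero_grad()
loss.backward()
opt.step()

# clip weights to keep the critic well-behaved (the "theoretical math reasons")
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-0.01, 0.01)
```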
- Failure modes:
- Vanishing gradient (discriminator wins)
- Fixes: Wasserstein, modified minimax (what we do in Streamlit)
- Mode collapse (generator only learns to produce 1 really good fake)
- Discriminator conversely learns to just reject that 1 sample
- Fixes: Wasserstein, unrolled GANs
- TODO unrolled paper
- Failure to converge
- Fixes: add noise to discriminator inputs, regularization
- TODO: regularization paper
- Variations
- Progressive GANs: early generator layers do low-res outputs to train more quickly
- Conditional GANs: specify which label the generator should try outputting
- Image to Image: input to generator is existing image and its goal is to make it look like ground truth image
- Loss is some combination of usual discriminator loss + pixel by pixel loss for diverging from ground truth
- CycleGAN: generator learns to make input images look like images from the ground truth set (horses to zebras) without paired labels
- Text-to-image: very limited
- Super resolution: learn to upscale an input image
- Inpainting: learn to generate blacked out parts of photo
- Text-to-speech
- Model architecture
- Replaced pooling techniques for down/up sampling with convolutions
- Replace fully connected layers at the start/end of the conv stack with simple linear layers
- “Remove fully connected hidden layers for deeper architectures”
- Added batch norm to every conv layer
- Didn’t batch norm beginning/end of conv stack to avoid “sample oscillation and model instability”
- Replace old activation with ReLU
- Found LeakyReLU did well for discriminator
- TanH good for bounding output of generator
- Trained using SGD with Adam optimizer
- De-duped training set by training an auto-encoder on crops of the images to generate “semantic hashes”
- Also used some OpenCV face detection to scrape faces off the internet to use as more training data
- “Walk the latent space” to look for discontinuities
- Smoothly interpolating the input random noise vector should produce smooth changes in output images with new features emerging
- Inputs that maximize the outputs of different filter layers help visualize what the layer has “learned” to recognize
- With DCGAN trained on bedrooms many filters’ best input show distinctly “bed” and “window” like features
- For a bunch of samples manually draw bounding boxes around generated windows
- Train a logistic regression model on later generator conv layers (i.e. conv layers whose inputs are nearly full-dimensional images) to predict if the layer would have been activated on that bounding-box subspace
- Basically looking for layers that activate on subimages of windows
- Dropping layers that are predicted to activate on windows in the generator and feeding the same input showed outputs without windows
- Averaging latent vectors that produce output images of the “same type” gives good representations of those features
- Smiling woman - neutral woman + neutral man = smiling man vector
- Can interpolate between representation vectors too
- Still has some model instability with mode collapse
- TODO
- TODO: go through streamlit again to jot down non-obvious stuff + read any papers I skipped
- https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
- https://lilianweng.github.io/posts/2018-08-12-vae/
- TODO
- infinitesimal steps in gradient descent are mathematically correct but computationally infeasible
- nice memory footprint since $N$ parameters require only $N$ gradients
- "curvature" of the loss surface is ignored by taking partials in each direction
- can be addressed with a preconditioning matrix, but computing such a matrix is also intractable
- large batch sizes feel intuitively correct (closer to true gradient across dataset)
- empirically can cause problems
- start with small batches and increase size over time
- powers of 2 good for recursive computations, multiples of 32 good for CUDA
- Weight decay: at each iteration slightly scale weights down by $0.9999$ or something else close to 1
- no mathematical proof it helps but empirically shows benefits
- a type of inductive bias: helps us when it's justified and hurts us when it's not
- generally applied only to weights and not biases; also not applied to BatchNorm parameters
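- What that looks like inside a training step (a sketch, not any particular library's implementation):

```python
import torch

w = torch.randn(10, requires_grad=True)
lr, decay = 0.1, 0.9999

loss = (w ** 2).sum()     # stand-in loss
loss.backward()
with torch.no_grad():
    w *= decay            # weight decay: nudge every weight slightly toward zero
    w -= lr * w.grad      # then take the usual gradient step
    w.grad.zero_()
```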
- Momentum: adjust each update using a moving average of past gradients
- Pathological curves are hard to find global extrema on due to steep gradients toward local extrema in one dimension and shallow gradients toward the global extrema in another direction
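- A minimal sketch of the momentum update (illustrative; `beta` is the moving-average coefficient):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """Step along a moving average of past gradients instead of the raw gradient,
    which damps oscillation along steep directions and accelerates shallow ones."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(5):
    grad = 2 * w              # gradient of a simple quadratic bowl
    w, v = momentum_step(w, grad, v)
print(w)
```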
- TODO: long list of tabs lol
- TODO
- General transformer architecture is as follows:
- tokens -> embeddings
- residual stream + attention heads
- residual stream + MLP layers
- embeddings -> tokens
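- A schematic PyTorch sketch of that flow for a single block (hypothetical module, not any specific model; causal masking omitted):

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Residual stream + attention, then residual stream + MLP."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                   # attention writes back into the residual stream
        return x + self.mlp(self.ln2(x))   # MLP writes back into the residual stream

x = torch.randn(1, 10, 64)   # stand-in for 10 embedded tokens
print(TinyBlock()(x).shape)  # torch.Size([1, 10, 64])
```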
- Residual Stream
- Lets transformer "remember" stuff from previous layers
- use logit lens to view predictions straight from stream before MLP layers and you'll see reasonable predictions
- Attention
- Generalizes convolution
- Attention head is learned algorithm for encoding "locality" of data within sequence
- Vision convolutions encode locality of pixel data but for text such locality is too nuanced and must be learned programmatically
- Generally broken up into 2 steps
- First learn sources and sinks for where information will move
- Function of the source tokens and the (later) destination tokens
- Linear transformation of input into "keys" (sources) and "queries" (destinations)
- Then learn what information to actually move
- Function of ONLY source tokens
- Linear transformation from inputs to "values"
- for each "query", take an average of the values weighted by the attention scores against each "key"
- "Autoregressive" = don't let attention look ahead
- MLP
- Generally uses GELU activation function
- Doesn't move information between positions
- does "reasoning" on features extracted by attention layers
- nobody is quite sure how to intuit what is actually going on in MLP layers
- Can be thought of as Key-Value pairs memorizing what each feature means in the abstract output space
- TODO Memory paper
- "MLPs as memory management"
- Embeddings
- Sometimes we used "tied embeddings" to reuse the same embedding matrix to un-embed output vectors into tokens
- Efficient to do but not actually a principled approach
- Imagine transformer model was just a no-op
- Going from token -> embedding -> unembedding -> token can thus be modeled by the pure matrix multiplication $W_E W_U$ of the embedding and unembedding matrices
- Bigrams like "Barack Obama" imply that the token "Barack" is very likely to be followed by the token "Obama" but not necessarily that "Obama" is followed by "Barack"
- Thus this transformation is not symmetric, but if $W_U = W_E^T$ then their product would be symmetric
- In practice the residual stream nudges the transformer towards the no-op direction, meaning this bigram asymmetry will emerge
- So our embedding and unembedding transformations cannot be the same
- LayerNorm
- BatchNorm analogue for transformers, making each vector have mean 0 and var 1
- Now scale/translate vectors to desired mean/var
- Technically non-linear so it makes interpretability hard
- Positional Embeddings
- Attention layers start out symmetric w.r.t. relative token index distances, but really they should care more about closer tokens than farther ones
- Can be addressed by learning a lookup map from token index to residual vector to add back to the embedding before applying attention head
- Residual vector has "local" shared memory that can help nudge embedding to carry information about nearby tokens
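- A tiny sketch of that learned lookup (illustrative shapes):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 64
tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)   # learned lookup: token index -> residual vector

tokens = torch.randint(0, vocab_size, (1, 10))
positions = torch.arange(tokens.shape[1]).unsqueeze(0)
x = tok_emb(tokens) + pos_emb(positions)   # position info now rides in the residual stream
print(x.shape)                             # torch.Size([1, 10, 64])
```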
- interpretability of transformers better done "vertically" across attention layers rather than within them
- Each attention head has 4 matrices
- $Q$ and $K$ of size `d_model x d_head` (take embeddings to a "made up" head space)
- For each index, the "query" value tells you how much the "key" token at another index affects the prediction for the token at this index
- For each index `i`, compute a score based on the query and key of every other index `j` (using a dot product) that represents how much the transformer should "pay attention" to the token at `j` when predicting the token value at `i`
- $V$ of size `d_model x d_out` and $O$ of size `d_out x d_model` (takes the "made up" output space back to embeddings)
- $V$ produces a single "value" vector of size `d_out` for each token index in the sequence
- For each index `i` of the sequence, $V$'s value vectors are then scaled by the scores from $Q^TK$
- $O$ converts that scaled value vector back into an embedding vector
- in practice `d_head = d_out`
- Reformulate the full attention layer into $Q^TK$ and $OV$ since the dimensions line up
- technically this is inefficient since $Q^TK$ and $OV$ are each of size `d_model^2`
- both would be very low-rank matrices since generally `d_head << d_model`
- Insert a softmax between $Q^TK$ and $OV$
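- A small numpy sketch of one attention head written in this form (my own illustration; shapes follow the notes above, causal masking omitted):

```python
import numpy as np

def attention_head(X, W_Q, W_K, W_V, W_O):
    """X: (seq, d_model). Returns this head's write back into the residual stream."""
    scores = (X @ W_Q) @ (X @ W_K).T                 # QK circuit: (seq, seq) attention scores
    scores = scores / np.sqrt(W_Q.shape[1])          # usual scaling by sqrt(d_head)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over source positions
    values = X @ W_V                                 # one value vector per position
    return (weights @ values) @ W_O                  # OV circuit: move values, map back to d_model

d_model, d_head, seq = 16, 4, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(seq, d_model))
W = lambda a, b: rng.normal(size=(a, b))
out = attention_head(X, W(d_model, d_head), W(d_model, d_head), W(d_model, d_head), W(d_head, d_model))
print(out.shape)  # (5, 16)
```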
- The full attention head's $Q^TK$ and $OV$ circuits can be thought of as executing 2 functions to generate data to append to each word of the sequence
- inputs look like tuples of words along with their index in the sequence: `(word, i)`
- the $Q^TK$ circuit is thought of as a function that scores how much "attention" the token at index `i` pays to the token at index `j`
- Signature: `def QK((query, query_idx), (key, key_idx)) -> number`
- the $OV$ circuit is a function that takes the weighted average of how much attention index `i` paid to every other token and spits out the token that represents that average
- Signature: `def OV((val, val_idx)) -> token`
- concat the token outputs of $OV$ onto the end of the initial inputs to get $(word, i, a_1)$, the output of the first attention layer
- Further attention layers will have access to previous layers through their inputs, so the $k$-th layer takes as input $(word, i, a_1, \cdots, a_{k-1})$ and outputs $(word, i, a_1, \cdots, a_k)$
- I am using "token", "embedding", and "word" interchangeably here
- Thinking of the $Q^TK$ and $OV$ circuits as independent functions can help us understand what is happening across layers by imagining function composition
- An attention head that shifts the entire sequence forward by 1 word is relatively simple:
- $Q^TK$ returns 1 if the key (source) index is one step behind the query (dest) index and 0 otherwise
- Looks like the identity matrix with the diagonal offset by 1
- $OV$ simply spits back out its input token
- Scaling each token in our vocab by the scaling factor from $Q^TK$ sends every token to 0 except the token that immediately preceded the current token, which is scaled by 1 back to itself
- Summing all of these 0-vectors with the vector for the token that immediately preceded the current token gives an input to $OV$ of the token that preceded the current index
- Induction heads recognize basic patterns and guess inductively that the next token follows the pattern
- ex: `a b c d a b c <guess>` should guess `d`
- This cannot be accomplished with a single attention layer since each $Q^TK$ evaluation can only be aware of a single pairwise comparison between 2 indices
- If I'm considering index 5 (i.e. token `b`) and am computing $Q^TK$ against index 2 (token `c`), how could I possibly know whether or not this token matters?
- I need to consider the whole sequence all at once to know if an arbitrary token was part of a repeating (inductive) substring, and pairwise attention computations from $Q^TK$ just aren't expressive enough for that
- Possible to do induction with 2 layers
- layer 1:
- $Q^TK$: returns 1 if the key (source) index is one step behind the query (dest) index and 0 otherwise
- $OV$: return the input token
- this is the same as the "shift 1 forward" attention head from above
- The output for index `i` looks like $(word, i, \text{prev word})$
- layer 2:
- $Q^TK$: return 1 if the previous word the key (source) index remembered from layer 1 is equal to the query (dest) index's TOKEN, and 0 otherwise
- $OV$: return the output of the previous attention layer
- In English, composing these 2 attention layers does the following:
- layer 1: for every word $w$ at index $i$ in my sequence, remember what word came before me as $p_i$
- layer 2: for every word $w$ at index $i$ in my sequence, if $w$ is equal to the remembered previous word $p_j$ for some other index $j$, then return the word $w_j$ that sits at index $j$
- This effectively "remembers" bigram pairs
- if at index $k$ I remembered previous word $w_{k-1}$, then when I'm at index $i$ and notice the word $w_i$ is equal to that $w_{k-1}$, I'll output whatever originally came at index $k$ again
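- A toy pure-Python sketch of that two-layer composition over (word, index) tuples, in the function framing above (not real attention math, just the logic):

```python
def layer1(sequence):
    # for every (word, i), also remember the word that came just before it
    return [(word, i, sequence[i - 1][0] if i > 0 else None)
            for i, (word, _) in enumerate(sequence)]

def layer2(augmented, query_idx):
    word_i = augmented[query_idx][0]
    # attend to the latest index j whose remembered previous word equals my word,
    # and copy the word sitting at j (i.e. what followed the last occurrence of word_i)
    for word_j, j, prev_j in reversed(augmented[:query_idx + 1]):
        if prev_j == word_i:
            return word_j
    return None

seq = [(w, i) for i, w in enumerate("a b c d a b c".split())]
print(layer2(layer1(seq), query_idx=6))  # 'd' -- the inductive guess for the next token
```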
- Multiple layers of attention can implement arbitrarily complex functions this way, and interpreting what they do is at the heart of mech interp
- When an attention layer adds information to the residual stream that is used by the next layer in the "query" argument of $Q^TK$, it's called Q-composition
- the information the first layer adds is generally only pulled from the query argument
- Conversely, if the information is used by reading from the "key" argument of $Q^TK$ then it's called K-composition
- the bigram induction head above uses K-composition since we read the augmented information off the "key" argument
- can be done with Q-composition instead:
- Layer 1: for each index $i$, if its word $w_i$ is equal to another word $w_j$ at index $j$, then augment the stream with index $j$ (i.e. return $(word, i, j)$)
- Layer 2: for each index $i$, return the word at the index that comes after the index you remembered from the previous layer (i.e. return $(word, i, j, w_{j+1})$)
- V-Composition is a little weirder
- is a "virtual" phenomena where behavior's achieved by separate attention heads combine to reproduce the behavior attainable by a single attention head
- is useful for the network as a whole since it can build a small set of base "instructions" out of it's limited attention parameters and combine them build more complex programs
- analogous to CPU ISA tradeoff of instruction count vs complexity of programs
- In practice the circuits don't actually "extend" the residual stream by lengthening the tuple
- Instead just adding the circuit outputs directly to the vectors in the stream works
- The overall network is linear if you squint hard enough, so you can still interpret each attention layer as acting linearly over the previous layers' outputs
- something something superposition
- feels very quantum mechanics-y
- the residual stream is your quantum state, and attention heads update that state by adjusting its superposition over the basis of observable outcomes (the sequence of embeddings)
- quantum operators are all linear so you can think of them moving around the superposition independently and adding up their independent movements