ML

References

LSTM

Unread

Circuits

Langchain

Transformer

Back Propagation

What is Backpropagation?

Backpropagation (short for "backward propagation of errors") is a key algorithm used in training artificial neural networks. It involves two main phases:

  1. Forward Pass:

    • The input data is passed through the network layer by layer to obtain the output (predicted values).
    • During this phase, the activations at each layer are computed.
  2. Backward Pass:

    • The error (difference between the predicted output and the actual target) is propagated backward through the network.
    • Gradients of the loss function with respect to each weight in the network are calculated.
    • These gradients are used to update the weights to minimize the loss function.

The goal of backpropagation is to optimize the weights of the network by minimizing the loss function, which measures how well the network's predictions match the actual targets.
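A minimal NumPy sketch of these two phases for a single-layer network with a sigmoid activation and MSE loss (the data and variable names are purely illustrative):

```python
import numpy as np

# Toy data: 4 samples, 3 features, 1 target each (illustrative values).
X = np.random.randn(4, 3)
y = np.random.randn(4, 1)

# Single-layer network: weights and bias.
W = np.random.randn(3, 1) * 0.1
b = np.zeros((1,))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: compute activations layer by layer.
z = X @ W + b                    # pre-activation
y_hat = sigmoid(z)               # activation (predicted values)
loss = np.mean((y_hat - y) ** 2) # MSE loss

# Backward pass: propagate the error and compute gradients w.r.t. each weight.
dL_dyhat = 2 * (y_hat - y) / len(y)   # dLoss / dPrediction
dyhat_dz = y_hat * (1 - y_hat)        # sigmoid derivative
dL_dz = dL_dyhat * dyhat_dz
dL_dW = X.T @ dL_dz                   # gradient w.r.t. weights
dL_db = dL_dz.sum(axis=0)             # gradient w.r.t. bias

# Gradient descent update: move the weights against the gradient.
lr = 0.1
W -= lr * dL_dW
b -= lr * dL_db
```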

Why Do We Want the Loss Function to Be Differentiable?

The loss function needs to be differentiable because backpropagation relies on gradient descent or similar optimization algorithms to update the weights. Gradient descent requires the computation of the gradient (partial derivatives) of the loss function with respect to the weights. These gradients indicate the direction and magnitude of the weight updates needed to reduce the loss.
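In symbols, gradient descent nudges each weight $w$ against the gradient of the loss $L$, scaled by a learning rate $\eta$:

$$ w \leftarrow w - \eta \frac{\partial L}{\partial w} $$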

Other Functions That Need to Be Differentiable

  1. Activation Functions:

    • Common activation functions like sigmoid, tanh, and ReLU (Rectified Linear Unit) must be differentiable (almost everywhere) so that gradients can be computed and propagated backward.
  2. Loss Functions:

    • Examples include Mean Squared Error (MSE) for regression tasks and Cross-Entropy Loss for classification tasks. These functions must be differentiable to compute gradients.
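For reference, both losses are smooth functions of the prediction $\hat{y}$ (cross-entropy shown in its binary form):

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2, \qquad \text{CE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] $$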

What Happens When Functions Are Not Differentiable?

When functions within the network are not differentiable, backpropagation cannot compute the necessary gradients, and the training process fails. Here are some potential issues:

  1. Non-Differentiable Activation Functions:

    • If an activation function is not differentiable, gradients cannot be propagated through it. For instance, the step function is not differentiable at the threshold point. If used as an activation function, it would prevent the backpropagation algorithm from updating the weights effectively.
  2. Non-Differentiable Loss Functions:

    • If the loss function is not differentiable, the optimization algorithm cannot compute the direction in which to adjust the weights to minimize the loss. For example, the absolute error (L1 loss) is differentiable everywhere except at zero; modern frameworks handle that single point with subgradients, as shown below.
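For the L1 case, the subgradient at the kink is any value in $[-1, 1]$, and in practice a framework simply picks one (typically 0):

$$ \partial\, |x| = \begin{cases} \{-1\} & \text{if } x < 0 \\ [-1, 1] & \text{if } x = 0 \\ \{+1\} & \text{if } x > 0 \end{cases} $$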

Example of Non-Differentiable Function and Its Impact

Example: Step Function as an Activation Function

  • The step function outputs 0 for inputs less than a threshold and 1 for inputs greater than or equal to the threshold. It looks like this:

    $$ f(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases} $$

  • This function is not differentiable at $x = 0$, and its gradient is zero everywhere else. When used as an activation function in a neural network, the gradient descent algorithm would not receive useful gradient information to update the weights, causing the training to stall or fail entirely.

Alternative: Sigmoid Function

  • The sigmoid function, defined as:

    $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$

    is differentiable and has a gradient that is non-zero for a wide range of input values, making it suitable for use in backpropagation.
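Its derivative can be written in terms of the function itself, which is exactly what the backward pass uses:

$$ \sigma'(x) = \sigma(x)\,(1 - \sigma(x)) $$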

Conclusion

For effective backpropagation and training of neural networks, both the loss function and the activation functions used within the network need to be differentiable. This allows the computation of gradients, which are essential for updating the weights using optimization algorithms like gradient descent. When functions are not differentiable, the training process fails because gradients cannot be computed, leading to issues like stalled training or inability to optimize the model effectively.

RAG

RAG Sequence vs RAG Token

RAG-Sequence:

  1. Document Retrieval:

    • The system takes the input query: "What is the capital of France?"
    • It converts this query into a vector representation using an encoder.
    • This vector is used to search a pre-indexed document database (often using techniques like cosine similarity).
    • The k most relevant documents are retrieved (let's say k=3 in our example).
  2. LLM Integration:

    • The LLM (e.g., GPT model) is given the task of generating an answer.
    • For each of the k retrieved documents, the LLM does the following:
      a. Takes the original query and the retrieved document as input.
      b. Generates a complete answer sequence based on this input.
  3. Probability Calculation:

    • For each generated sequence, the system calculates two probabilities:
      a. The probability of the document being retrieved given the input query.
      b. The probability of the generated sequence given the document and query.
    • These probabilities are combined to get an overall probability for each sequence.
  4. Selection:

    • The sequence with the highest overall probability is selected as the final answer.
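A toy sketch of this selection step, with made-up probabilities and answers (purely illustrative):

```python
# Toy example: 3 retrieved documents with retrieval probabilities P(doc | query)
# and, for each, the probability the LLM assigns to the answer it generated.
candidates = [
    {"answer": "The capital of France is Paris.", "p_doc": 0.6, "p_seq": 0.9},
    {"answer": "France is in Europe.",            "p_doc": 0.3, "p_seq": 0.2},
    {"answer": "Paris has the Eiffel Tower.",     "p_doc": 0.1, "p_seq": 0.4},
]

# RAG-Sequence: score each full sequence by P(doc | query) * P(answer | query, doc),
# then select the sequence with the highest combined probability.
best = max(candidates, key=lambda c: c["p_doc"] * c["p_seq"])
print(best["answer"])   # -> "The capital of France is Paris."
```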

RAG-Token:

  1. Initial Document Retrieval:

    • Similar to RAG-Sequence, the system retrieves k relevant documents based on the initial query.
  2. Token Generation:

    • The LLM begins generating the answer token by token.
    • For each token:
      a. The LLM considers the original query, previously generated tokens, and all k documents.
      b. It generates a probability distribution over the vocabulary for the next token.
  3. Probability Marginalization:

    • For each possible next token, the system marginalizes over all documents:
      a. It calculates the probability of the token given each document.
      b. These probabilities are weighted by the probability of each document being relevant.
      c. The weighted probabilities are summed to get the final probability for each token.
  4. Token Selection:

    • The token with the highest marginalized probability is selected and added to the output.
  5. Document Re-retrieval (optional):

    • After generating each token, the system can optionally re-retrieve documents.
    • It uses the original query plus all generated tokens so far as the new query.
    • This allows for dynamic document selection as the answer progresses.
  6. Repetition:

    • Steps 2-5 are repeated for each token until the answer is complete.
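A toy sketch of the per-token marginalization with made-up numbers (two documents, a three-word vocabulary):

```python
import numpy as np

# Toy setup: 2 retrieved documents and a 3-word vocabulary.
vocab = ["Paris", "London", "Berlin"]
p_doc = np.array([0.7, 0.3])               # P(doc | query) for each document

# P(next token | query, doc, tokens so far), one row per document.
p_token_given_doc = np.array([
    [0.8, 0.1, 0.1],   # document 1 strongly supports "Paris"
    [0.5, 0.3, 0.2],   # document 2 is less certain
])

# Marginalize over documents: weight each per-document distribution by P(doc | query)
# and sum, giving a single distribution over the vocabulary for the next token.
p_token = p_doc @ p_token_given_doc        # shape: (vocab_size,)

# Select the token with the highest marginalized probability.
next_token = vocab[int(np.argmax(p_token))]
print(next_token)                          # -> "Paris"
```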

How the LLM Comes into Picture:

  • In both approaches, the LLM serves as the generator of text based on the input and retrieved documents.
  • In RAG-Sequence, the LLM generates complete sequences for each document.
  • In RAG-Token, the LLM generates probability distributions for each token, considering all documents.

The key difference is that RAG-Sequence uses the LLM to generate complete answers from each document independently, while RAG-Token uses the LLM to generate tokens sequentially, potentially using different document combinations for each token.

FiD (Fusion-in-Decoder):

  1. Document Retrieval:

    • Similar to RAG, retrieves k relevant documents for the query.
  2. Input Preparation:

    • Instead of generating tokens sequentially, FiD prepares a set of inputs.
    • For each retrieved document, it creates an input by concatenating: [Question] + [Retrieved Document]
  3. Parallel Processing:

    • All k inputs are processed in parallel through the encoder.
    • This results in k separate encoded representations.
  4. Fusion in Decoder:

    • The decoder takes all k encoded representations simultaneously.
    • It "fuses" the information from all documents in its self-attention layers.
  5. Single-Pass Generation:

    • Unlike RAG-Token, FiD generates the entire answer in one pass.
    • It doesn't generate token-by-token with intermediate retrievals.

Example for FiD:

Query: "What is the capital of France?"

  1. Retrieve 3 documents (k=3):

    • Doc 1: "Paris is the capital of France."
    • Doc 2: "France is a country in Europe."
    • Doc 3: "The Eiffel Tower is in Paris, France."
  2. Prepare 3 inputs:

    • Input 1: "What is the capital of France? Paris is the capital of France."
    • Input 2: "What is the capital of France? France is a country in Europe."
    • Input 3: "What is the capital of France? The Eiffel Tower is in Paris, France."
  3. Encode all inputs in parallel.

  4. The decoder processes all encoded inputs together, fusing the information.

  5. Generate the complete answer in one pass: "The capital of France is Paris."
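A rough sketch of steps 2-4 above. The random matrices stand in for a real encoder's hidden states, so only the input preparation and the concatenation ("fusion") structure are meaningful here:

```python
import numpy as np

query = "What is the capital of France?"
docs = [
    "Paris is the capital of France.",
    "France is a country in Europe.",
    "The Eiffel Tower is in Paris, France.",
]

# Step 2: one (question + document) input per retrieved document.
inputs = [f"{query} {doc}" for doc in docs]

# Step 3: each input is encoded independently (this can be batched in parallel);
# fake encodings of shape (tokens, hidden_size) stand in for real encoder outputs.
hidden_size = 8
encoded = [np.random.randn(len(text.split()), hidden_size) for text in inputs]

# Step 4: the decoder attends over the concatenation of all k encodings at once,
# "fusing" evidence from every document before generating the answer in one pass.
fused = np.concatenate(encoded, axis=0)
print(fused.shape)   # (total tokens across all inputs, hidden_size)
```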

Key Differences:

  1. Token Generation:

    • RAG-Token: Generates each token separately, potentially using different document combinations.
    • FiD: Generates the entire answer in one pass after fusing all document information.
  2. Document Usage:

    • RAG-Token: Can dynamically change document relevance for each token.
    • FiD: Uses all documents throughout the entire generation process.
  3. Computational Efficiency:

    • RAG-Token: May be slower due to token-by-token generation and potential re-retrievals.
    • FiD: More efficient as it processes all documents in parallel and generates the answer in one pass.
  4. Flexibility:

    • RAG-Token: More flexible, can adapt to newly retrieved information for each token.
    • FiD: Less flexible but potentially more coherent as it considers all information simultaneously.

FiD's approach of fusing all document information before generation can lead to more globally coherent answers, while RAG-Token's approach allows for more dynamic, token-level adaptation to the retrieved information.

RETRO (Retrieval-Enhanced Transformer):

  1. Chunked Processing:

    • RETRO processes the input in chunks (e.g., 64 tokens per chunk).
  2. Retrieval:

    • For each chunk, RETRO retrieves k nearest neighbors from a large dataset.
    • These neighbors are sections of text similar to the current chunk.
  3. Encoder-Decoder Architecture:

    • Uses a modified Transformer architecture.
    • The encoder processes retrieved chunks.
    • The decoder generates the output, attending to both the input and retrieved chunks.
  4. Cross-attention Mechanism:

    • Introduces a special cross-attention mechanism between the generated text and retrieved chunks.
  5. Chunked Generation:

    • Generates output chunk by chunk, retrieving new neighbors for each chunk.

Example for RETRO:

Query: "What is the capital of France?"

  1. Process the query as the first chunk.

  2. Retrieve k similar chunks from the dataset, e.g.:

    • Chunk 1: "Paris is the capital city of France. It is known for..."
    • Chunk 2: "The capital of France is Paris, a global center for..."
  3. The encoder processes these retrieved chunks.

  4. The decoder starts generating the answer, attending to both the query and the retrieved chunks.

  5. As it generates, it may retrieve new chunks based on the partially generated answer.

  6. Final output: "The capital of France is Paris."
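A toy sketch of the chunk-and-retrieve loop described above. The word-overlap scoring and the tiny corpus are invented stand-ins; the real system retrieves nearest neighbors from a very large database using dense embeddings:

```python
def chunk(tokens, size=64):
    """Split the input token sequence into fixed-size chunks, as RETRO does."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

# Toy corpus standing in for RETRO's large retrieval database.
corpus = [
    "paris is the capital city of france. it is known for its museums.",
    "the capital of france is paris, a global center for art and fashion.",
    "berlin is the capital of germany.",
]

def retrieve_neighbors(chunk_tokens, k=2):
    """Toy nearest-neighbor search by word overlap (not RETRO's actual retriever)."""
    query_words = set(chunk_tokens)
    scored = sorted(corpus, key=lambda s: -len(query_words & set(s.split())))
    return scored[:k]

tokens = "what is the capital of france".split()
for c in chunk(tokens, size=64):             # the query fits into a single chunk here
    neighbors = retrieve_neighbors(c, k=2)
    # The encoder would process these neighbors; the decoder cross-attends to them
    # while generating the output for this chunk.
    print(neighbors)
```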

Comparison with RAG-Token and FiD:

  1. Granularity of Retrieval:

    • RAG-Token: Retrieves documents for each token generation.
    • FiD: Retrieves documents once for the entire input.
    • RETRO: Retrieves chunks for each input chunk (intermediate granularity).
  2. Processing of Retrieved Information:

    • RAG-Token: Marginalizes over documents for each token.
    • FiD: Fuses all document information in the decoder.
    • RETRO: Processes retrieved chunks in the encoder and uses cross-attention in the decoder.
  3. Generation Process:

    • RAG-Token: Token-by-token generation with potential re-retrieval.
    • FiD: One-pass generation after fusing all document information.
    • RETRO: Chunk-by-chunk generation with retrieval for each chunk.
  4. Scalability:

    • RAG-Token and FiD: Limited by the number of documents that can be processed.
    • RETRO: Can scale to much larger retrieval databases due to its chunked approach.
  5. Attention Mechanism:

    • RAG-Token and FiD: Standard attention mechanisms.
    • RETRO: Introduces a special cross-attention mechanism between generated text and retrieved chunks.
  6. Computational Efficiency:

    • RAG-Token: Potentially less efficient due to token-level operations.
    • FiD: More efficient with parallel processing of documents.
    • RETRO: Efficient for large-scale retrieval but may have overhead for chunk processing.
  7. Flexibility:

    • RAG-Token: Highly flexible, adapting at token level.
    • FiD: Less flexible, but considers all information at once.
    • RETRO: Balances flexibility and coherence with chunk-level adaptation.

RETRO's approach allows it to scale to much larger retrieval databases compared to RAG and FiD, while still maintaining the ability to adapt its generation based on retrieved information. It strikes a balance between the token-level flexibility of RAG-Token and the global coherence of FiD, operating at the chunk level.

Internet-augmented LMs

This approach proposes using a humble “off-the-shelf” search engine to augment LLMs. First, a set of relevant documents is retrieved via Google Search. Since these retrieved documents tend to be long (average length 2,056 words), they are chunked into paragraphs of six sentences each. Finally, the question and paragraphs are embedded via TF-IDF, and cosine similarity is used to rank the most relevant paragraphs for each query.
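A small sketch of that last ranking step using scikit-learn's TF-IDF vectorizer and cosine similarity (the paragraphs are invented examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "What is the capital of France?"

# Paragraphs chunked out of the retrieved search results (toy examples).
paragraphs = [
    "Paris is the capital and most populous city of France.",
    "France is famous for its wine and cheese production.",
    "The Eiffel Tower in Paris was completed in 1889.",
]

# Embed the question and paragraphs with TF-IDF, then rank paragraphs by cosine similarity.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([question] + paragraphs)
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

ranked = sorted(zip(scores, paragraphs), reverse=True)
for score, para in ranked:
    print(f"{score:.3f}  {para}")
```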
