RAG
Table Stakes
- Better Parsers
- Chunk Sizes
- Hybrid Search
- Metadata Filters

Advanced Retrieval
- Reranking
- Recursive Retrieval
- Embedded Tables
- Small-to-big Retrieval

Fine-tuning
- Embedding fine-tuning
- LLM fine-tuning

Agentic Behavior
- Routing
- Query Planning
- Multi-document Agents

These categories span a spectrum: less expressive, easier to implement, lower latency/cost ⟶ more expressive, harder to implement, higher latency/cost.
BM25 + Vector Search: Combines traditional keyword-based retrieval (BM25) with embedding similarity search to capture both semantic meaning and exact keyword matches.
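A minimal hybrid-search sketch, assuming the `rank_bm25` package; `embed()` here is a hypothetical placeholder you would swap for a real embedding model:

```python
# Hybrid search sketch: fuse min-max-normalized BM25 and embedding scores.
import numpy as np
from rank_bm25 import BM25Okapi

def embed(text):
    # Hypothetical placeholder: replace with a real embedding model.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def _minmax(x):
    span = x.max() - x.min()
    return (x - x.min()) / span if span else np.zeros_like(x)

def hybrid_search(query, docs, alpha=0.5, top_k=3):
    # Keyword signal: BM25 over whitespace-tokenized documents.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    keyword = _minmax(np.array(bm25.get_scores(query.lower().split())))
    # Semantic signal: cosine similarity of unit-norm embeddings.
    q = embed(query)
    semantic = _minmax(np.array([embed(d) @ q for d in docs]))
    # Weighted fusion: alpha controls the keyword/semantic balance.
    fused = alpha * semantic + (1 - alpha) * keyword
    return [docs[i] for i in np.argsort(-fused)[:top_k]]
```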
Minimum Similarity Threshold: Instead of using a fixed top-k, you can filter retrieved nodes based on a minimum similarity score.
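A sketch of cutoff-based filtering over generic (document, score) pairs; frameworks ship equivalents (LlamaIndex's SimilarityPostprocessor, for example), but the idea fits in a few lines:

```python
# Keep every retrieved node above a similarity cutoff instead of a fixed
# top-k; the result set grows or shrinks with how well the corpus matches.
def filter_by_similarity(results, min_score=0.75):
    # `results` is assumed to be an iterable of (document, score) pairs.
    return [(doc, score) for doc, score in results if score >= min_score]
```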
Reranking is a second-stage evaluation in information retrieval systems: after an initial set of potentially relevant items is retrieved, the results are reassessed and reordered so that the most relevant items rise to the top.
At its core, reranking implements a two-stage retrieval process:
Initial retrieval: A fast, scalable method (like embedding-based similarity search or BM25) retrieves an initial set of candidate documents.
Reranking: A slower but more accurate model (typically a cross-encoder that scores the query and each candidate together) re-scores the candidates and reorders them before they reach the LLM.
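A minimal sketch of the second stage, assuming the `sentence-transformers` package and its pretrained `cross-encoder/ms-marco-MiniLM-L-6-v2` checkpoint (any cross-encoder would do):

```python
# Second-stage reranking sketch: a cross-encoder scores each (query, doc)
# pair jointly, which is slower but more accurate than bi-encoder retrieval.
from sentence_transformers import CrossEncoder

def rerank(query, candidates, top_k=3):
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```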
For chunking, overlap between consecutive chunks preserves contextual information at chunk boundaries, which is essential for maintaining semantic coherence and improving retrieval accuracy (see the chunking sketch after the two lists below).
Smaller Chunks (e.g., 256 tokens or less)
Higher precision: Smaller chunks contain more focused information, leading to more precise retrieval
Reduced noise: Less irrelevant information is included with the relevant content
Better for specific queries: When users ask highly specific questions, smaller chunks can pinpoint exact answers
Improved semantic focus: Each chunk tends to cover a single concept or topic
Larger Chunks (e.g., 512-1024 tokens or more)
Better recall: More likely to capture all relevant information about a topic
Preserved context: Maintains more surrounding context that might be important for understanding
Reduced fragmentation: Related concepts are less likely to be split across multiple chunks
Better for complex queries: When questions require synthesizing multiple pieces of information
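A word-based sketch of fixed-size chunking with overlap (real pipelines usually count tokens with the model's tokenizer; the sizes here are illustrative):

```python
# Fixed-size chunking with overlap: each new chunk re-includes the last
# `overlap` words of the previous one, so boundary context is preserved.
def chunk_text(text, chunk_size=256, overlap=32):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final window already covers the remaining text
    return chunks
```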
Retrieval accuracy improves when using a vector database such as ChromaDB rather than an in-memory index, as in the sketch below.
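A persistent-store sketch with ChromaDB (`pip install chromadb`); the path, collection name, and documents are arbitrary examples:

```python
# Persistent vector store sketch: Chroma embeds documents with its default
# embedder and keeps the index on disk across restarts.
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("rag_chunks")

# Index a couple of example chunks.
collection.add(
    documents=["RAG retrieves external data at query time.",
               "CAG preloads knowledge into the context window."],
    ids=["chunk-1", "chunk-2"],
)

# Query: returns the nearest chunks plus distances.
results = collection.query(query_texts=["How does RAG stay up to date?"], n_results=1)
print(results["documents"][0])
```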
Advanced Techniques:
When storing chunks, also generate a hypothetical question that each chunk could plausibly answer and store it alongside the chunk; user queries often match these questions more closely than the raw text (see the sketch after this list).
Depending on the use case, a relational or graph-based database can be used instead.
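A sketch of hypothetical-question indexing, again using ChromaDB; `generate_question()` is a hypothetical stand-in for an LLM call:

```python
# Hypothetical-question indexing: embed a question the chunk answers, but
# carry the chunk itself as metadata so it is what gets returned at query time.
import chromadb

def generate_question(chunk: str) -> str:
    # Hypothetical stand-in: in practice, prompt an LLM with something like
    # "Write one question that the following passage answers: {chunk}".
    return "Placeholder question for: " + chunk[:40]

client = chromadb.PersistentClient(path="./chroma_db")
questions_col = client.get_or_create_collection("chunk_questions")

def index_chunk(chunk_id: str, chunk: str) -> None:
    questions_col.add(
        documents=[generate_question(chunk)],  # the question gets embedded
        metadatas=[{"chunk": chunk}],          # the answer text rides along
        ids=[chunk_id],
    )
```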
Retrieval-Augmented Generation (RAG)
Dynamic Information Retrieval: RAG retrieves relevant external data from knowledge bases or databases in real-time to augment the model's responses, ensuring accuracy and contextual relevance.
Reduces Hallucinations: By grounding responses in authoritative sources, RAG minimizes the chances of generating incorrect or fabricated information.
Cost-Effective Updates: RAG eliminates the need for frequent retraining by dynamically incorporating updated external data into the response generation process.
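Putting retrieval and generation together, a minimal RAG loop might look like the sketch below; `complete_fn` is a hypothetical stand-in for whatever LLM client you use:

```python
# Minimal RAG loop sketch: retrieve, stuff context into the prompt, generate.
def rag_answer(question, collection, complete_fn, top_k=3):
    # Retrieve the top-k chunks from a ChromaDB-style collection.
    hits = collection.query(query_texts=[question], n_results=top_k)
    context = "\n\n".join(hits["documents"][0])
    # Ground the model in the retrieved context to reduce hallucinations.
    prompt = (
        "Answer using only the context below; say so if it is insufficient.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return complete_fn(prompt)
```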
Cache-Augmented Generation (CAG)
Preloaded Knowledge: CAG preloads all required information into the model's extended context window, avoiding real-time retrieval altogether.
Low Latency: Eliminates retrieval delays by using cached knowledge, enabling faster and more consistent responses.
Simplified Architecture: CAG avoids complex retrieval mechanisms, making it ideal for scenarios with stable datasets that fit within the model's memory.
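For contrast, a CAG-style sketch: the whole (small, stable) corpus is placed in the prompt once, so no per-query retrieval step is needed. Real CAG implementations additionally reuse the model's KV cache across queries; that part is not shown here.

```python
# CAG sketch: preload the entire corpus into the prompt; only the question
# changes per call, so no retriever (and no retrieval latency) is involved.
def make_cag_prompt_builder(corpus):
    preloaded = "\n\n".join(corpus)  # must fit in the context window
    def build(question):
        return f"Context:\n{preloaded}\n\nQuestion: {question}"
    return build
```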