RAG
Table Stakes
- Better Parsers
- Chunk Sizes
- Hybrid Search
- Metadata Filters

Advanced Retrieval
- Reranking
- Recursive Retrieval
- Embedded Tables
- Small-to-big Retrieval

Fine-tuning
- Embedding fine-tuning
- LLM fine-tuning

Agentic Behavior
- Routing
- Query Planning
- Multi-document Agents

These categories span a spectrum: less expressive, easier to implement, lower latency/cost ⟶ more expressive, harder to implement, higher latency/cost.
BM25 + Vector Search: Combines traditional keyword-based retrieval (BM25) with embedding similarity search to capture both semantic meaning and exact keyword matches.
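A minimal hybrid-search sketch, assuming the `rank_bm25` package; `embed()` here is a hypothetical placeholder you would swap for a real embedding model:

```python
# Hybrid search sketch: fuse min-max-normalized BM25 and embedding scores.
import numpy as np
from rank_bm25 import BM25Okapi

def embed(text):
    # Hypothetical placeholder: replace with a real embedding model.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def _minmax(x):
    span = x.max() - x.min()
    return (x - x.min()) / span if span else np.zeros_like(x)

def hybrid_search(query, docs, alpha=0.5, top_k=3):
    # Keyword signal: BM25 over whitespace-tokenized documents.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    keyword = _minmax(np.array(bm25.get_scores(query.lower().split())))
    # Semantic signal: cosine similarity of unit-norm embeddings.
    q = embed(query)
    semantic = _minmax(np.array([embed(d) @ q for d in docs]))
    # Weighted fusion: alpha controls the keyword/semantic balance.
    fused = alpha * semantic + (1 - alpha) * keyword
    return [docs[i] for i in np.argsort(-fused)[:top_k]]
```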
Minimum Similarity Threshold: Instead of using a fixed top-k, you can filter retrieved nodes based on a minimum similarity score.
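A sketch of cutoff-based filtering over generic (document, score) pairs; frameworks ship equivalents (LlamaIndex's SimilarityPostprocessor, for example), but the idea fits in a few lines:

```python
# Keep every retrieved node above a similarity cutoff instead of a fixed
# top-k; the result set grows or shrinks with how well the corpus matches.
def filter_by_similarity(results, min_score=0.75):
    # `results` is assumed to be an iterable of (document, score) pairs.
    return [(doc, score) for doc, score in results if score >= min_score]
```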
Reranking is a second-stage evaluation in information retrieval systems: after an initial set of potentially relevant items is retrieved, the results are reassessed and reordered so that the most relevant items rise to the top.
At its core, reranking implements a two-stage retrieval process:
Initial retrieval: A fast, scalable method (like embedding-based similarity search or BM25) retrieves an initial set of candidate documents.
Reranking: A slower but more accurate model (typically a cross-encoder that scores the query and each candidate together) re-scores the candidates and reorders them before they reach the LLM.
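A minimal sketch of the second stage, assuming the `sentence-transformers` package and its pretrained `cross-encoder/ms-marco-MiniLM-L-6-v2` checkpoint (any cross-encoder would do):

```python
# Second-stage reranking sketch: a cross-encoder scores each (query, doc)
# pair jointly, which is slower but more accurate than bi-encoder retrieval.
from sentence_transformers import CrossEncoder

def rerank(query, candidates, top_k=3):
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```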
For chunking, overlap between consecutive chunks preserves contextual information at chunk boundaries, which is essential for maintaining semantic coherence and improving retrieval accuracy (see the chunking sketch after the two lists below).
Smaller Chunks (e.g., 256 tokens or less)
Higher precision: Smaller chunks contain more focused information, leading to more precise retrieval
Reduced noise: Less irrelevant information is included with the relevant content
Better for specific queries: When users ask highly specific questions, smaller chunks can pinpoint exact answers
Improved semantic focus: Each chunk tends to cover a single concept or topic
Larger Chunks (e.g., 512-1024 tokens or more)
Better recall: More likely to capture all relevant information about a topic
Preserved context: Maintains more surrounding context that might be important for understanding
Reduced fragmentation: Related concepts are less likely to be split across multiple chunks
Better for complex queries: When questions require synthesizing multiple pieces of information
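A word-based sketch of fixed-size chunking with overlap (real pipelines usually count tokens with the model's tokenizer; the sizes here are illustrative):

```python
# Fixed-size chunking with overlap: each new chunk re-includes the last
# `overlap` words of the previous one, so boundary context is preserved.
def chunk_text(text, chunk_size=256, overlap=32):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final window already covers the remaining text
    return chunks
```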
Retrieval accuracy improves when using a vector database such as ChromaDB rather than an in-memory index, as in the sketch below.
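A persistent-store sketch with ChromaDB (`pip install chromadb`); the path, collection name, and documents are arbitrary examples:

```python
# Persistent vector store sketch: Chroma embeds documents with its default
# embedder and keeps the index on disk across restarts.
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("rag_chunks")

# Index a couple of example chunks.
collection.add(
    documents=["RAG retrieves external data at query time.",
               "CAG preloads knowledge into the context window."],
    ids=["chunk-1", "chunk-2"],
)

# Query: returns the nearest chunks plus distances.
results = collection.query(query_texts=["How does RAG stay up to date?"], n_results=1)
print(results["documents"][0])
```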
Advanced Techniques:
When storing chunks, also generate a hypothetical question that each chunk could plausibly answer and store it alongside the chunk; user queries often match these questions more closely than the raw text (see the sketch after this list).
Depending on the use case, a relational or graph-based database can be used instead.
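A sketch of hypothetical-question indexing, again using ChromaDB; `generate_question()` is a hypothetical stand-in for an LLM call:

```python
# Hypothetical-question indexing: embed a question the chunk answers, but
# carry the chunk itself as metadata so it is what gets returned at query time.
import chromadb

def generate_question(chunk: str) -> str:
    # Hypothetical stand-in: in practice, prompt an LLM with something like
    # "Write one question that the following passage answers: {chunk}".
    return "Placeholder question for: " + chunk[:40]

client = chromadb.PersistentClient(path="./chroma_db")
questions_col = client.get_or_create_collection("chunk_questions")

def index_chunk(chunk_id: str, chunk: str) -> None:
    questions_col.add(
        documents=[generate_question(chunk)],  # the question gets embedded
        metadatas=[{"chunk": chunk}],          # the answer text rides along
        ids=[chunk_id],
    )
```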
Retrieval-Augmented Generation (RAG)
Dynamic Information Retrieval: RAG retrieves relevant external data from knowledge bases or databases in real-time to augment the model's responses, ensuring accuracy and contextual relevance.
Reduces Hallucinations: By grounding responses in authoritative sources, RAG minimizes the chances of generating incorrect or fabricated information.
Cost-Effective Updates: RAG eliminates the need for frequent retraining by dynamically incorporating updated external data into the response generation process.
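Putting retrieval and generation together, a minimal RAG loop might look like the sketch below; `complete_fn` is a hypothetical stand-in for whatever LLM client you use:

```python
# Minimal RAG loop sketch: retrieve, stuff context into the prompt, generate.
def rag_answer(question, collection, complete_fn, top_k=3):
    # Retrieve the top-k chunks from a ChromaDB-style collection.
    hits = collection.query(query_texts=[question], n_results=top_k)
    context = "\n\n".join(hits["documents"][0])
    # Ground the model in the retrieved context to reduce hallucinations.
    prompt = (
        "Answer using only the context below; say so if it is insufficient.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return complete_fn(prompt)
```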
Cache-Augmented Generation (CAG)
Preloaded Knowledge: CAG preloads all required information into the model's extended context window, avoiding real-time retrieval altogether.
Low Latency: Eliminates retrieval delays by using cached knowledge, enabling faster and more consistent responses.
Simplified Architecture: CAG avoids complex retrieval mechanisms, making it ideal for scenarios with stable datasets that fit within the model's memory.
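For contrast, a CAG-style sketch: the whole (small, stable) corpus is placed in the prompt once, so no per-query retrieval step is needed. Real CAG implementations additionally reuse the model's KV cache across queries; that part is not shown here.

```python
# CAG sketch: preload the entire corpus into the prompt; only the question
# changes per call, so no retriever (and no retrieval latency) is involved.
def make_cag_prompt_builder(corpus):
    preloaded = "\n\n".join(corpus)  # must fit in the context window
    def build(question):
        return f"Context:\n{preloaded}\n\nQuestion: {question}"
    return build
```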