Reference: https://arxiv.org/abs/2401.05856
The Seven Failure Points from "Seven Failure Points When Engineering a Retrieval Augmented Generation System"
The seven critical failure points identified by Barnett et al. in their paper are:
1. Missing Content - The foundation of any RAG system is its knowledge base. An incorrect or incomplete knowledge base can lead to erroneous or insufficient information retrieval, ultimately affecting the quality of generated outputs.
2. Missed the Top Ranked Documents - When retrieving context from a knowledge base, only a limited number of chunks can be returned. The combination of a small retrieval window and a non-optimal chunking strategy can cause the chunk containing the relevant context to rank below the cutoff and never be retrieved.
3. Not in Context - Consolidation Strategy Limitations - Some systems implement a consolidation step after context retrieval, before invoking the LLM, to decrease the number of input tokens. This consolidation step can remove the relevant context before the LLM ever processes the query.
4. Not Extracted - When an LLM processes a prompt with a large number of input tokens, it may fail to locate the relevant context needed to properly answer the question. This is referred to as a "noisy" prompt, where the noise is information extraneous and irrelevant to answering the question.
5. Wrong Format - Part of prompt engineering is specifying a desired format or structure for the answer. As the input prompt grows due to failure points 2 and 3, the LLM may fail to recall the format specified in the original prompt and return the answer in a format of its own choosing.
6. Incorrect Specificity - The returned answer is too specific or not specific enough for the user asking the question. This failure can be caused at a variety of points throughout the RAG process, but it ultimately surfaces as a mismatch between the granularity of the answer and the granularity of the question.
7. Incomplete - Incomplete answers are not incorrect, but they omit important information that was present in the input prompt. This can happen for a variety of reasons, notably failure points 2 and 3 pushing large amounts of irrelevant context into the input context window.
Hybrid Retrieval Methods: Combining dense vector search, sparse vector search, and full-text search achieves better recall than any single method. Dense vectors capture semantics - a sentence or even an entire article can be encapsulated in a single vector - but they can miss exact keyword, identifier, or entity matches, which sparse and full-text search handle well; fusing the result lists covers both failure modes, as shown in the sketch below.
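One common way to merge the ranked lists from the different retrievers is Reciprocal Rank Fusion (RRF). A minimal sketch, assuming the retriever outputs already exist (only the fusion step is shown; `k=60` is the constant from the original RRF paper):

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked document-ID lists from multiple retrievers using RRF.
    A document's fused score is the sum of 1/(k + rank) over every list
    in which it appears, so items ranked highly by several retrievers win."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse rankings from three hypothetical retrievers.
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],   # dense vector search
    ["doc1", "doc9", "doc3"],   # sparse vector search
    ["doc7", "doc1", "doc2"],   # full-text (e.g., BM25) search
])
```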
Query Transformation: A good way to improve the reasoning capability of RAG is to add a query understanding layer - apply query transformations before actually querying the vector store. Four different query transformations are routing, query rewriting, sub-questions, and ReAct-style agent tool picking. Routing: retain the initial query while pinpointing the appropriate subset of tools it pertains to (a rewriting sketch follows below).
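As one concrete example, here is a minimal query-rewriting sketch. The `llm` callable and the prompt wording are assumptions, not part of any specific library:

```python
from typing import Callable

REWRITE_PROMPT = (
    "Rewrite the following user question so that it is self-contained, "
    "unambiguous, and phrased for retrieval from a document index.\n"
    "Question: {question}\nRewritten question:"
)

def rewrite_query(question: str, llm: Callable[[str], str]) -> str:
    """Query rewriting: normalise the user question with an LLM before it
    hits the vector store. `llm` is a hypothetical completion callable -
    plug in whatever client your stack uses."""
    return llm(REWRITE_PROMPT.format(question=question)).strip()
```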
Late Chunking: In 2024, Jina launched "Late Chunking", which targets text data by moving the chunking step to after embedding. In other words, an embedding model first encodes the entire document, and the chunk boundaries are applied only just before the final mean pooling, so each chunk embedding is pooled from token embeddings that have seen the whole document's context. A sketch follows.
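A minimal sketch of the idea, assuming a long-context embedding model from Hugging Face (the model name and the token spans are illustrative, not Jina's reference implementation):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: any long-context embedding model works; this Jina checkpoint is one option.
tok = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-small-en")
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v2-small-en", trust_remote_code=True)

def late_chunk(text: str, spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """Encode the whole document once, then mean-pool token embeddings per
    chunk span. `spans` are (start, end) token indices chosen *after*
    encoding, so every chunk vector carries full-document context."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_embs = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    return [token_embs[s:e].mean(dim=0) for s, e in spans]
```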
Semantic Chunking: Two primary approaches are identified: heuristic-based chunking, which relies on syntactic markers such as punctuation and paragraph breaks, and semantic chunking, which considers textual meaning, typically by starting a new chunk where the embedding similarity between consecutive sentences drops. The author suggests that further research is needed to evaluate the comparative efficacy of these methods; a semantic-chunking sketch follows.
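A minimal semantic-chunking sketch, assuming sentence-transformers and an illustrative model and threshold:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Group consecutive sentences, starting a new chunk whenever the cosine
    similarity between neighbouring sentence embeddings falls below threshold."""
    if not sentences:
        return []
    embs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(np.dot(embs[i - 1], embs[i])) < threshold:
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks
```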
Context Compression: Advanced techniques reduce noise in the retrieved context while preserving relevant information, shrinking the prompt and mitigating the "Not Extracted" failure above. A simple sketch follows.
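One simple, embedding-based form of compression keeps only the retrieved sentences most similar to the query; the model choice and `keep` count are assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def compress_context(query: str, sentences: list[str], keep: int = 5) -> str:
    """Keep only the `keep` retrieved sentences most similar to the query,
    in their original order; drop the rest as noise."""
    embs = _model.encode([query] + sentences, normalize_embeddings=True)
    sims = embs[1:] @ embs[0]                       # cosine similarity to the query
    top_idx = sorted(np.argsort(sims)[-keep:].tolist())
    return " ".join(sentences[i] for i in top_idx)
```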
Reranking Strategies: In addition to adding a reranker and fine-tuning it as described above, we can explore further proposed solutions: LlamaIndex offers an array of retrieval strategies, from basic to advanced, to help achieve accurate retrieval in RAG pipelines. A cross-encoder reranking sketch follows.
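A minimal reranking sketch using a sentence-transformers cross-encoder (the checkpoint name is an assumption; any MS MARCO cross-encoder behaves similarly):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed model

def rerank(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    """Score each (query, doc) pair jointly with a cross-encoder and return
    the top_k documents. Joint scoring is slower than bi-encoder retrieval
    but considerably more accurate, so it is applied only to the shortlist."""
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [d for d, _ in ranked[:top_k]]
```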
Data Cleaning: Deduplication: remove duplicate or near-duplicate records that might bias the retrieval process. Unstructured.io offers a set of cleaning functionalities in its core library to help address such data cleaning needs; an exact-duplicate sketch follows.
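Exact duplicates can be removed with nothing more than normalisation and hashing (near-duplicate detection, e.g. MinHash, is out of scope for this sketch):

```python
import hashlib

def dedupe(records: list[str]) -> list[str]:
    """Drop exact duplicates after lowercasing and whitespace normalisation;
    the order of first occurrence is preserved."""
    seen: set[str] = set()
    out: list[str] = []
    for rec in records:
        key = hashlib.sha256(" ".join(rec.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out
```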
Multimodal Document Processing: We can anticipate the potential development of a unified multi-modal document parsing model capable of accurately converting various unstructured documents into text content.
Improved Prompt Engineering: Better prompting can significantly help in situations where the system might otherwise provide a plausible but incorrect answer because the information is absent from the knowledge base. Instructing the system with prompts such as "Tell me you don't know if you are not sure of the answer" encourages the model to acknowledge its limitations. An example template follows.
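A sketch of a RAG prompt template along these lines (the wording is illustrative, not from the paper):

```python
RAG_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly:
"I don't know based on the provided documents."

Context:
{context}

Question: {question}
Answer:"""

prompt = RAG_PROMPT.format(context="...retrieved chunks...", question="...user question...")
```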
Structured Output Control: Use structured outputs such as JSON or predefined schemas, separate formatting instructions from content instructions in prompts, and leverage LLM features that enforce schema-based responses (e.g., OpenAI's function calling). A validation sketch follows.
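One provider-agnostic approach is to validate the model's raw output against a Pydantic schema and re-ask on failure; the schema and the `llm` callable here are assumptions:

```python
from typing import Callable
from pydantic import BaseModel, ValidationError

class Answer(BaseModel):          # hypothetical response schema
    answer: str
    sources: list[str]

def structured_answer(prompt: str, llm: Callable[[str], str], retries: int = 2) -> Answer:
    """Ask for JSON matching the schema; retry on validation failure so
    formatting drift (failure point 5) is caught before reaching the user."""
    for _ in range(retries + 1):
        raw = llm(prompt + '\nRespond with JSON: {"answer": str, "sources": [str]}')
        try:
            return Answer.model_validate_json(raw)
        except ValidationError:
            continue
    raise ValueError("model never produced valid JSON")
```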
Agentic RAG: Agentic RAG assigns a lower-level agent tool to each document, with a higher-order agent orchestrating those tools to answer questions that may span documents. A structural sketch follows.
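A sketch of this pattern under stated assumptions: the class names and the per-document `query` interface are illustrative, not a specific framework's API:

```python
from typing import Callable

class DocumentAgent:
    """Lower-level tool: answers questions against a single document."""
    def __init__(self, name: str, query_fn: Callable[[str], str]):
        self.name = name
        self.query = query_fn  # e.g., a per-document RAG query engine

class OrchestratorAgent:
    """Higher-order agent: routes the question to document agents and
    synthesises their partial answers into one response."""
    def __init__(self, agents: list[DocumentAgent], synthesize: Callable[[list[str]], str]):
        self.agents = agents
        self.synthesize = synthesize

    def answer(self, question: str) -> str:
        partials = [a.query(question) for a in self.agents]  # could be routed/filtered
        return self.synthesize(partials)
```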
GraphRAG: KG-Retriever combines knowledge graphs with original data to create a multi-level graph index structure for retrieval at varying granularities; other graph-based retrievers introduce time-based relevance information via personalised PageRank.
Self-RAG and Adaptive RAG: Self-RAG trains the model to retrieve on demand and critique its own retrievals and generations, while Adaptive-RAG ("Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity") routes each query to a no-retrieval, single-step, or multi-step retrieval strategy based on a learned estimate of question complexity.
Corrective RAG (CRAG): Improving accuracy through adaptive retrieval evaluation - a lightweight retrieval evaluator grades the retrieved documents and, when confidence is low, triggers corrective actions such as web search or query refinement before generation. A sketch of the control flow follows.
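A sketch of the CRAG-style control flow; the `grade`, `web_search`, and `generate` callables are assumptions standing in for the paper's evaluator and corrective actions:

```python
from typing import Callable

def crag_answer(
    question: str,
    docs: list[str],
    grade: Callable[[str, str], float],      # retrieval evaluator: (question, doc) -> confidence
    web_search: Callable[[str], list[str]],  # corrective action for low-confidence retrieval
    generate: Callable[[str, list[str]], str],
    threshold: float = 0.5,
) -> str:
    """Keep only documents the evaluator trusts; if nothing survives,
    fall back to web search before generating the answer."""
    trusted = [d for d in docs if grade(question, d) >= threshold]
    context = trusted or web_search(question)
    return generate(question, context)
```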
RAGAS Framework: RAGAS (RAG Assessment) is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) applications along dimensions such as faithfulness, answer relevancy, and context precision/recall. A minimal usage sketch follows.
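A minimal evaluation sketch, assuming the classic ragas API (import paths have shifted between versions, so treat this as illustrative):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# A tiny hand-built evaluation set; real runs would use logged RAG traces.
data = Dataset.from_dict({
    "question": ["What are the seven RAG failure points?"],
    "answer": ["Missing content, missed top-ranked documents, ..."],
    "contexts": [["Barnett et al. identify seven failure points ..."]],
})

# Each metric is scored by an LLM judge; an API key is expected in the environment.
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)
```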
Continuous Improvement: Modern RAG systems often include a feedback loop where the quality of the generated responses is assessed and used to improve the system over time. This iterative process can involve fine-tuning the retriever, adjusting the LLM, or refining the retrieval and generation strategies.
Parallel Processing: LlamaIndex's parallel ingestion pipelines are specifically designed to handle large data volumes by distributing the ingestion process across multiple worker processes. A sketch follows.
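A sketch of parallel ingestion, assuming a recent llama-index release (import paths and the `num_workers` parameter have varied across versions):

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=64)],
)

# num_workers fans the transformations out across parallel worker processes.
nodes = pipeline.run(documents=documents, num_workers=4)
```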
Efficient Indexing: Efficient indexing techniques such as HNSW (Hierarchical Navigable Small World) graphs for approximate nearest-neighbor search can cut retrieval latency from linear to roughly logarithmic in the collection size, at the cost of a small loss in recall. A sketch follows.
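A minimal sketch using the hnswlib library (the parameter values are typical defaults, not tuned recommendations, and the random vectors stand in for real embeddings):

```python
import numpy as np
import hnswlib

dim, n = 384, 10_000
vectors = np.random.rand(n, dim).astype(np.float32)  # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # build-time graph params
index.add_items(vectors, np.arange(n))

index.set_ef(50)  # query-time accuracy/speed trade-off
labels, distances = index.knn_query(vectors[:1], k=10)
```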