[arxiv] From RAG to RICHES: Retrieval Interlaced with Sequence Generation
Source
Palak Jain, Livio Baldini Soares, Tom Kwiatkowski
We present RICHES, a novel approach that interleaves retrieval with sequence generation tasks. RICHES offers an alternative to conventional RAG systems by eliminating the need for separate retriever and generator. It retrieves documents by directly decoding their contents, constrained on the corpus. Unifying retrieval with generation allows us to adapt to diverse new tasks via prompting alone. RICHES can work with any Instruction-tuned model, without additional training. It provides attributed evidence, supports multi-hop retrievals and interleaves thoughts to plan on what to retrieve next, all within a single decoding pass of the LLM. We demonstrate the strong performance of RICHES across ODQA tasks including attributed and multi-hop QA.
▲ What LLM does this paper use to demonstrate the idea? How is this method different from RAG?
This paper uses the PaLM2-M and PaLM2-L models to demonstrate the idea of Riches.
Here's how Riches differs from RAG:
Unified Model: Riches uses a single LLM for both retrieval and generation, eliminating the need for separate retriever and generator models like in RAG.
Constrained Decoding: Riches retrieves documents by directly decoding their contents or related retrieval keys, constrained to the corpus. This is different from RAG, which uses a separate retriever to find relevant documents.
Interleaved Retrieval and Generation: Riches interleaves retrieval with generation, allowing for multi-hop retrievals and "thinking" steps to plan what to retrieve next, all within a single decoding pass. RAG typically performs retrieval and generation as separate steps.
Task Adaptability: Riches can be adapted to diverse new tasks via prompting alone, without additional training. RAG pipelines often require task-specific retraining.
In essence, Riches offers a more integrated and efficient approach to retrieval-augmented generation, leveraging the capabilities of a single LLM to perform both retrieval and generation tasks in a unified and interleaved manner.
▲ Tell me more about how exactly “Constrained Decoding” works
Let's break down how "Constrained Decoding" works in the context of the Riches paper:
1. The Goal:
The goal of constrained decoding is to ensure that the LLM's generated output is grounded in a specific corpus of text. This means the generated text must be a valid sequence of tokens that exists within the corpus.
2. The Mechanism:
Retrieval Keys: The paper introduces the concept of "retrieval keys." These are specific sequences of tokens that act as pointers to documents within the corpus.
Constrained Generation: During decoding, the LLM is constrained to generate only sequences that are valid retrieval keys. This is achieved by:
Masking: The LLM's vocabulary is masked, so it can only predict tokens that are part of the retrieval keys.
FM-Index: The paper uses an FM-index, a data structure that efficiently stores and searches for substrings within a corpus. The FM-index helps identify the valid continuations of a sequence based on the retrieval keys.
Beam Search: The decoding process uses beam search, which explores multiple possible continuations of the generated sequence. However, the beam is constrained to only consider sequences that are valid retrieval keys.
3. Example:
Imagine the corpus contains the following sentence: "The capital of France is Paris."
Retrieval Key: "The capital of France"
Decoding: The LLM starts generating text. When it reaches the prefix "The capital of France", the FM-index identifies "is" as the only valid continuation, because that is the token that follows this prefix in the corpus. The LLM is forced to predict "is" and continue generating the rest of the sentence.
4. Benefits:
Grounding: Constrained decoding ensures that the generated text is grounded in the corpus, reducing the risk of hallucinations.
Retrieval: By generating retrieval keys, the LLM effectively retrieves relevant documents from the corpus.
Efficiency: The FM-index allows for efficient constraint checking, making the decoding process relatively fast.
5. Limitations:
Verbatim Generation: Constrained decoding forces the LLM to generate text verbatim from the corpus, which might not be ideal for tasks requiring paraphrasing or creative writing.
Limited Scope: The retrieval keys must be pre-defined, limiting the scope of what the LLM can retrieve.
In summary, Constrained Decoding is a technique that guides the LLM's generation process to ensure that the output is grounded in a specific corpus of text. It achieves this by using retrieval keys and an efficient data structure (FM-index) to constrain the decoding process.
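To make this concrete, below is a minimal Python sketch (not from the paper) of the corpus-constraint lookup. The `ToyCorpusIndex` class is a simplified stand-in for a real FM-index and works on word tokens purely for readability:

```python
from collections import defaultdict

class ToyCorpusIndex:
    """Toy stand-in for an FM-index: maps every prefix of the indexed
    sequences to the set of tokens that can legally follow it."""
    def __init__(self, corpus_sequences):
        self.next_tokens = defaultdict(set)
        for seq in corpus_sequences:
            for i in range(len(seq)):
                self.next_tokens[tuple(seq[:i])].add(seq[i])

    def allowed_next(self, prefix):
        return self.next_tokens.get(tuple(prefix), set())

# A one-sentence corpus, pre-tokenized into words for simplicity.
index = ToyCorpusIndex([["The", "capital", "of", "France", "is", "Paris", "."]])
print(index.allowed_next(["The", "capital", "of", "France"]))  # {'is'}
```

A real FM-index answers the same "which tokens may follow this prefix?" query without materializing every prefix, which is what keeps the approach tractable on a billion-token corpus.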
▲ Tell me more about retrieval keys. Are they generated during text corpus preprocessing? How exactly can we constrain an LLM to generate only valid retrieval keys instead of whatever token it would otherwise choose next?
You're right to ask about retrieval keys! They are a crucial part of how Riches works. Here's a deeper dive into them:
1. Retrieval Keys: The Pointers to Knowledge
Definition: Retrieval keys are specific sequences of tokens that act as pointers to documents within a text corpus. They are like bookmarks that help the LLM find relevant information.
Purpose: They are designed to be:
Unique: Each retrieval key should ideally point to a single document or a small set of related documents.
Informative: The key should contain enough information to help the LLM identify the relevant document.
Examples:
Sentence Substrings: "The capital of France" (pointing to a document containing that sentence)
Paragraph Substrings: "The Battle of Hastings took place in 1066" (pointing to a document containing that paragraph)
Propositions: "Paris is the capital of France" (a more structured representation of a fact)
2. Retrieval Key Generation: Preprocessing or On-the-Fly?
Preprocessing: The paper explores various ways to generate retrieval keys, some of which involve preprocessing the corpus:
Document Titles and Headers: These are often pre-defined and can be used as retrieval keys.
Propositional Index: This involves using a separate model to extract structured propositions from the corpus, which are then used as retrieval keys.
On-the-Fly: The paper also explores generating retrieval keys during the decoding process itself:
Sentence/Paragraph Substrings: The LLM can be prompted and constrained to generate substrings of sentences or paragraphs as retrieval keys, with the constraint enforced at decoding time rather than through preprocessing.
3. Constraining the LLM: How It Works
Masking: The LLM's vocabulary is masked during decoding. This means that the LLM can only predict tokens that are part of the valid retrieval keys.
FM-Index: The FM-index is a data structure that efficiently stores and searches for substrings within a corpus. It's used to:
Identify Valid Continuations: Given a prefix of the generated sequence, the FM-index can quickly identify the next tokens that would form a valid retrieval key.
Zero Out Probabilities: The LLM's probability distribution over the vocabulary is modified. The probabilities of tokens that are not valid continuations are set to zero, effectively forcing the LLM to choose a valid retrieval key token.
4. Example: Constrained Decoding in Action
Input: "What is the capital of France?"
Retrieval Key: "The capital of France"
Decoding:
The LLM starts generating: "The capital of France..."
The FM-index identifies "is" as the only valid continuation.
The LLM's probability distribution is modified:
"is" has a probability of 1.
All other tokens have a probability of 0.
The LLM is forced to predict "is" and continue generating the rest of the sentence.
In essence, the LLM is prompted to generate retrieval keys, and the FM-index acts as a gatekeeper, ensuring that the LLM only generates sequences that are valid retrieval keys. This constraint helps ground the LLM's output in the corpus and enables it to retrieve relevant information.
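As a rough illustration of the probability-masking step, the sketch below zeroes out every next-token probability that the index does not allow; the vocabulary, logits, and allowed set are fabricated toy values:

```python
import math

def mask_and_renormalize(logits, vocab, allowed_tokens):
    """Zero out disallowed tokens and renormalize the rest with a softmax."""
    masked = {tok: (score if tok in allowed_tokens else -math.inf)
              for tok, score in zip(vocab, logits)}
    total = sum(math.exp(s) for s in masked.values() if s != -math.inf)
    return {tok: (math.exp(s) / total if s != -math.inf else 0.0)
            for tok, s in masked.items()}

vocab = ["is", "was", "Paris", "banana"]
logits = [2.0, 1.5, 0.3, -1.0]   # toy next-token scores from the LLM
allowed = {"is"}                 # valid continuations according to the index
print(mask_and_renormalize(logits, vocab, allowed))  # {'is': 1.0, others 0.0}
```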
▲ In order to apply this method, is it required to fine-tune or re-train the LLM on the given text corpus?
That's a great question! The paper emphasizes that Riches does not require fine-tuning or retraining the LLM with the given text corpus.
Here's why:
Instruction Tuning: The paper uses off-the-shelf instruction-tuned LLMs like PaLM2-M and PaLM2-L. These models are already trained on a massive dataset of text and code, making them capable of understanding and following instructions.
Few-Shot Learning: The paper demonstrates the effectiveness of Riches using a few-shot learning approach. This means that the LLM is only given a few examples of how to perform the task (e.g., question answering) before being asked to perform it on new examples.
Constrained Decoding: The key to Riches' success lies in the constrained decoding mechanism. The LLM's ability to generate valid retrieval keys is not dependent on fine-tuning on the specific corpus. Instead, the FM-index and the masking technique are used to guide the LLM's generation process, ensuring that the output is grounded in the corpus.
However, there are some nuances:
Retrieval Key Generation: While the LLM itself doesn't need to be fine-tuned, the process of generating retrieval keys might require some training. For example, if you're using a propositional index, you'll need to train a separate model to extract propositions from the corpus.
Corpus-Specific Knowledge: The LLM's performance will be influenced by the knowledge it has acquired during its initial training. If the corpus contains information that is significantly different from the LLM's training data, the LLM might struggle to generate accurate retrieval keys.
In summary, Riches is designed to work with pre-trained LLMs without requiring fine-tuning on the specific corpus. The constrained decoding mechanism and the use of retrieval keys allow the LLM to leverage its existing knowledge and adapt to new tasks without extensive retraining.
▲ How is this method compared to gpt-4 + RAG? Does the paper provide any benchmarks or evaluation results?
You're right to ask about comparisons! The paper doesn't directly compare Riches to GPT-4 + RAG; its experiments are built on PaLM2 models and compare against dense and iterative retrieval baselines rather than GPT-4-based systems. However, it does provide benchmarks and evaluation results that allow us to make some inferences about how Riches might stack up against GPT-4 + RAG.
Here's a breakdown:
1. Riches' Strengths:
Single Model: Riches uses a single LLM for both retrieval and generation, potentially leading to better coherence and consistency in the output compared to a separate GPT-4 and RAG system.
Efficiency: Riches performs retrieval and generation in a single decoding pass, which could be more efficient than the separate steps involved in GPT-4 + RAG.
Task Adaptability: Riches can be adapted to new tasks via prompting alone, without retraining. This could be advantageous for tasks where fine-tuning GPT-4 + RAG might be challenging or time-consuming.
2. GPT-4 + RAG's Strengths:
Advanced Language Model: GPT-4 is a more powerful language model than the PaLM2 models used in the Riches paper, potentially leading to better overall performance in tasks like question answering.
Specialized RAG: GPT-4 + RAG can leverage specialized RAG techniques, such as dense retrieval models, which might be more effective at finding relevant documents than the constrained decoding approach used in Riches.
3. Benchmarks and Evaluation Results:
The paper evaluates Riches on:
Open-NQ: A single-hop question answering dataset.
Hotpot-QA: A multi-hop question answering dataset.
Musique-Ans: A multi-hop question answering dataset with varying hops and composition operations.
Metrics:
F1 Answer Match Accuracy: Measures the accuracy of the generated answer.
AutoAIS: Measures the faithfulness of the answer to the retrieved evidence.
Results:
Single-hop: Riches performs competitively with dense retrievers, achieving higher answer accuracy but lower attribution.
Multi-hop: Riches significantly outperforms iterative baselines, demonstrating its effectiveness in multi-hop reasoning.
4. Inferences:
GPT-4 + RAG might outperform Riches in single-hop tasks due to GPT-4's advanced language capabilities and the potential for more effective retrieval techniques.
Riches might be more efficient and adaptable to new tasks than GPT-4 + RAG, especially in multi-hop scenarios.
5. Conclusion:
While a direct comparison is not possible, the paper's benchmarks and evaluation results suggest that Riches offers a promising alternative to GPT-4 + RAG, particularly for multi-hop tasks where efficiency and adaptability are crucial. However, GPT-4's advanced language capabilities might give it an edge in single-hop tasks.
It's important to note that the field of retrieval-augmented generation is rapidly evolving, and new techniques and models are constantly being developed. Future research will likely provide more definitive comparisons between Riches and other approaches like GPT-4 + RAG.
▲ What are the major limitations of this method?
You're right to point out limitations! While Riches is a promising approach, it does have some drawbacks that need to be addressed:
1. Verbatim Generation:
Limited Creativity: Riches forces the LLM to generate text verbatim from the corpus. This can limit creativity and prevent the LLM from paraphrasing or summarizing information in its own words.
Overkill for Some Tasks: For tasks where a similarity-based metric is sufficient (e.g., document ranking), verbatim generation might be overkill.
2. Retrieval Key Limitations:
Pre-defined Keys: The retrieval keys must be pre-defined, limiting the scope of what the LLM can retrieve. This can be a challenge for tasks where the relevant information might not be easily captured by pre-defined keys.
Corpus-Specific Knowledge: The LLM's performance is influenced by the knowledge it has acquired during its initial training. If the corpus contains information that is significantly different from the LLM's training data, the LLM might struggle to generate accurate retrieval keys.
3. Scalability:
Large Corpora: While the FM-index is efficient for searching, indexing large corpora can be computationally expensive.
Many-to-Many Mapping: The mapping between retrieval keys and documents can be many-to-many, making it challenging to identify the most relevant document for a given key.
4. Inference Speed:
Beam Search and Communication: The use of beam search and communication between the FM-index and the LLM can slow down inference, especially for large corpora.
5. Hallucination:
Partial Evidence: While Riches aims to ground its generation in the corpus, the LLM might still hallucinate based on partial evidence or misinterpret the retrieved information.
6. Limited Task Scope:
Factoid QA: The paper primarily focuses on factoid question answering. The effectiveness of Riches for other tasks, such as summarization or creative writing, remains to be explored.
7. Ethical Considerations:
Bias and Fairness: The corpus used for retrieval might contain biases, which could be reflected in the LLM's output.
Data Privacy: The use of a corpus for retrieval raises concerns about data privacy, especially if the corpus contains sensitive information.
In summary, while Riches offers a promising approach to retrieval-augmented generation, it faces limitations in terms of creativity, retrieval key limitations, scalability, inference speed, hallucination, task scope, and ethical considerations. Further research is needed to address these limitations and improve the robustness and generalizability of the method.
"title": "From RAG to RICHES: Retrieval Interlaced with Sequence Generation",
"subtitle": "Palak Jain, Livio Baldini Soares, Tom Kwiatkowski",
"description": "We present RICHES, a novel approach that interleaves retrieval with sequence generation tasks. RICHES offers an alternative to conventional RAG systems by eliminating the need for separate retriever and generator. It retrieves documents by directly decoding their contents, constrained on the corpus. Unifying retrieval with generation allows us to adapt to diverse new tasks via prompting alone. RICHES can work with any Instruction-tuned model, without additional training. It provides attributed evidence, supports multi-hop retrievals and interleaves thoughts to plan on what to retrieve next, all within a single decoding pass of the LLM. We demonstrate the strong performance of RICHES across ODQA tasks including attributed and multi-hop QA."
From RAG to RICHES: Retrieval Interlaced with Sequence Generation
Palak Jain Livio Baldini Soares Tom Kwiatkowski
Google Deepmind
{palakj,liviobs,tomkwiat}@google.com
Abstract
We present Riches, a novel approach that interleaves retrieval with sequence generation tasks. Riches offers an alternative to conventional RAG systems by eliminating the need for separate retriever and generator. It retrieves documents by directly decoding their contents, constrained on the corpus. Unifying retrieval with generation allows us to adapt to diverse new tasks via prompting alone. Riches can work with any Instruction-tuned model, without additional training. It provides attributed evidence, supports multi-hop retrievals and interleaves thoughts to plan on what to retrieve next, all within a single decoding pass of the LLM. We demonstrate the strong performance of Riches across ODQA tasks including attributed and multi-hop QA.
1 Introduction
Large language models (LLMs) have increasingly become the backbone for much of natural language processing and there has been a push to formulate a wide range of tasks as sequence to sequence transduction. However, when LLMs need to interact with non-parametric knowledge in the form of an external evidence corpus, the typical approaches chain LLM generations with calls to a separate retrieval model as part of a multi-system pipeline. In this paper we introduce a new approach, Riches (Retrieval Interlaced with Sequence Generation) which can natively interleave text generations with retrievals from an evidence corpus using a single LLM and decoding process.
Riches builds on previous work that demonstrated the application of constrained decoding to retrieval over a corpus Jain et al. (2023); Bevilacqua et al. (2022) but extends this work to support multiple retrievals, entwined in a standard text generation procedure. In this approach, we retrieve documents by directly decoding their contents or related natural language retrieval keys that point to the documents they were generated from. For example, Figure 1 illustrates a solution from Riches to multi-hop question answering Yang et al. (2018), where evidence must be retrieved from multiple separate documents, by iteratively generating an unconstrained ‘thought’ about what needs to be retrieved and then generating a supporting proposition derived from an evidence corpus and tied to an original piece of supporting text. Riches executes this task in a single decoder pass. For this example task, which is evaluated alongside others in Section 6, we have built on recent advances in chain-of-thought reasoning via prompting alone Yao et al. (2022) but have directly integrated the retrieval step without needing to account for any interaction with an external retrieval system.
The observations we build this work on are:
1. LLMs are knowledge warehouses: They internalise and generalise over vast quantities of training data and are often able to generate surprisingly accurate knowledge in response to complex inputs Sun et al. (2022). However, they are also susceptible to hallucination and cannot account for fresh knowledge not available at the time of training. That is where retrieval shines.
2. LLM decoding is a search process: Language model decoders search for a single sequence in the set of all possible token sequences (Graves, 2012). Retrievers just need to constrain this search space to those sequences that are known to exist in a corpus of interest.
3. Unifying tasks unlocks rapid development via prompting: By unifying retrieval with generation in a single decoder pass, we create a system that can be adapted to diverse new tasks via prompting alone, directly benefiting from the advances in instruction following. We later show that Riches works with an off-the-shelf instruction-tuned model, without any additional training. This is in contrast to pipelines that need to be rebuilt/retrained on a task-by-task basis.
There is another advantage of using language models as search agents. Of the two core operations in retrieval, indexing and search, indexing is constrained by corpus size, while search typically depends only on the index structure. Using large language models for indexing billion-token corpora is highly expensive, but search does not face the same bottleneck. This enables us to unlock the knowledge stored in very large models for retrieval.
This work overlaps with a variety of related work focusing on retrieval, retrieval augmented generation Lewis et al. (2020), reasoning in language models, and open domain question answering. We discuss their connections to Riches in Section 2, then introduce the key components of the generalizable Riches approach in Section 3.
While Riches is applicable to any task that can be reduced to an interleaved generation of unconstrained text and pre-defined retrieval keys, we validate the approach with tasks in open domain question answering and show how it natively supports single-hop question answering, including the case where attribution to a source text is required; multi-hop question answering; and interleaving retrieval with ‘planning steps’ that enhance the retrieval performance. Results are presented in Section 6.2 along with qualitative examples and analysis in Section 6.3 to help motivate the approach.
2 Related Work
Retrieval Augmented Generation (RAG)
ODQA tasks predominantly employ the RAG approach Lewis et al. (2020), where typically a dense retriever (Karpukhin et al., 2020) retrieves documents from an evidence corpus and feeds them to a language model for the final answer. These pipelines involve switching between heterogeneous models and are hard to train in concert. Moreover, dense retrievers fail to generalize out-of-domain (Thakur et al., 2021).
Generative Retrieval
Generative retrieval techniques (Metzler et al., 2021) shift the onus of search from a non-parametric nearest-neighbor scan to the language model. The Differentiable Search Index (Tay et al., 2022) memorizes a mapping of query to opaque document identifiers; however, memorization struggles to generalize to an unseen corpus (Pradeep et al., 2023). An alternative approach is to use natural language keys as document identifiers, where keys are decoded under the constraint that they lie in the corpus (De Cao et al., 2020; Bevilacqua et al., 2022). These systems still need an external model to generate answers. 1-Pager (Jain et al., 2023) unifies evidence and answer generation by generating a sequence of keywords that map to a document. However, isolated keywords limit context understanding and suffer similar pitfalls as lexical matching.
Recitation
Separate from retrieval augmentation, language models have been shown to recite entire passages from memory (Sun et al., 2022; Yu et al., 2022). But these passages are prone to hallucination. Our aim is to intersect contextual passage generation with corpus grounding. GopherCite (Menick et al., 2022), a noteworthy work in this direction, generates quotes verbatim from a small set of documents using constrained decoding. Riches aims to scale this to a billion-token corpus.
Iterative reasoning and Search
In recent times, there have been several efforts to improve multi-hop question answering by better reasoning (Asai et al., 2023) and planning (Adolphs et al., 2021; Yao et al., 2022). Language models have also been applied to the task of search to explore alternative paths Yao et al. (2023); Hao et al. (2023).
Our work builds on these advances in reasoning while integrating search within generation.
3 Retrieving while Generating
We present a method of interleaving unconstrained text generation with the generation of retrieval keys that point into a retrieval corpus. For example, Figure 1 shows generations that interleave unconstrained ‘thoughts’ with evidence sentences drawn from a predefined corpus for a multi-hop question answering task. Later in this section we’ll introduce a number of different choices of retrieval key as well as a variety of tasks that benefit from interleaved generation and retrieval. However, for now we simply define a retrieval key as a sequence of tokens that exists in a pre-defined finite set of sequences $K$, where every entry is associated with one or more documents in an underlying corpus $C$.
Formally, we focus on the sequence-to-sequence transduction task where we predict an output sequence $\mathbf{y}=[y_0,\dots,y_n]$ conditioned on an input sequence $\mathbf{x}=[x_0,\dots,x_m]$, and we mark the start and end of a retrieval key in $\mathbf{y}$ with the special markers << and >>. If we let $Q(\mathbf{y})$ be a function that returns the spans of all retrieval keys in $\mathbf{y}$ (i.e. $(i,j)\in Q([y_0,\dots,\texttt{<<},y_i,\dots,y_j,\texttt{>>},\dots,y_n])$), then we can update the standard autoregressive language modeling probability

$$P_\theta(\mathbf{y}\mid\mathbf{x})=\prod_{i=0}^{|\mathbf{y}|}P_\theta(y_i\mid y_{<i},\mathbf{x})\qquad(1)$$

to include the indicator function $\mathds{1}_K(\mathbf{q})$ that maps elements of $K$ to one and all other sequences to zero:

$$P'_\theta(\mathbf{y}\mid\mathbf{x})=\frac{1}{Z}\,P_\theta(\mathbf{y}\mid\mathbf{x})\prod_{(i,j)\in Q(\mathbf{y})}\mathds{1}_K([y_i,\dots,y_j])\qquad(2)$$

where $Z$ is a normalizing term that accounts for the probability mass assigned by Equation 1 to disallowed sequences. In practice, we do not need to compute $Z$ and can sample from Equation 2 in the usual way, one token at a time, by simply zeroing out the probability of disallowed continuations as presented in Section 3.1.
3.1 Constrained Beam Decoding
We opt for Beam Search (Graves, 2012) as our decoding strategy to simulate a heuristic best-first search. Here, the action or next-node space is the entire vocabulary. At each time step, the LLM estimates the value of each node (token) given the paths explored so far and adds them to the fixed-size queue (beam). Figure 2 visualizes how the beam progresses over decoding timesteps. Unlike regular beam decoding, where the top decoded sequences have only small variations, constraints impose sparsity over the search space, resulting in diverse beams. In Section 3.3, we discuss how beam search can hurt unconstrained generation and suggest a hybrid decoding strategy as a workaround. Constrained decoding can also gain from more sophisticated algorithms such as value-based decoding (Ren et al., 2017) and look-ahead scoring and planning (Lu et al., 2021; Hao et al., 2023).
3.2 Efficient Constraints via the FM-Index
During decoding, model outputs are constrained to the corpus by masking out any continuation not in the corpus. To compute the continuations of a sequence, we use the FM-index (Ferragina and Manzini, 2000), a compressed suffix array augmented with additional data structures to support fast substring search operations. Unlike a trie structure, it is also highly space-economical due to the compression. Given a prefix, the FM-index can efficiently compute the next allowed tokens in O(vocabulary size), independent of the corpus size. Below is the pseudo code for the modified decoding process.
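The original pseudo-code listing is not reproduced in this extraction; the following is a minimal sketch of one constrained beam-decoding step, where `next_logprobs` stands in for the LLM's next-token distribution and `allowed_next` for the FM-index lookup (both are assumed interfaces, not actual APIs):

```python
def constrained_beam_step(next_logprobs, allowed_next, beam, beam_size):
    """One step of constrained beam decoding.

    next_logprobs(tokens) -> {token: logprob}   # stand-in for the LLM
    allowed_next(tokens)  -> set of tokens      # stand-in for the FM-index
    beam: list of (token_list, cumulative_logprob) pairs.
    """
    candidates = []
    for tokens, score in beam:
        logprobs = next_logprobs(tokens)
        for tok in allowed_next(tokens):  # mask: corpus-valid continuations only
            candidates.append((tokens + [tok], score + logprobs.get(tok, float("-inf"))))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```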
3.3 Interleaving Constrained and Unconstrained Generation
In Section 5.2 we introduce some tasks that interleave constrained and unconstrained generation. The constrained generations must be precise—to match the target retrieval key exactly. The unconstrained generations are generally more robust to small variations in surface form—these only need to convey the correct information to a reader, or to provide the model room for a ‘thought’ trace when reasoning about a response.
To ensure that Riches can properly make use of beam search, which is here intended to ensure the model does not get stuck irretrievably after generating an incorrect constrained prefix, we introduce an adaptive decoding strategy that switches between full beam decoding for the sensitive constrained sequences and greedy decoding when unconstrained. In practice, a constrained prefix is expanded to the next beam-size most probable tokens, while an unconstrained prefix is expanded to only the next one token. This is expected to preserve room for the rest of the beam to be utilized largely for constrained sequences. Section 6.1 shows experiments with multiple decode modes.
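A minimal sketch of this adaptive strategy follows, assuming each hypothesis carries a flag indicating whether it is currently inside a constrained << ... >> span (toggling that flag on the markers is omitted for brevity, and the interfaces are the same assumed stand-ins as above):

```python
def adaptive_expand(next_logprobs, allowed_next, beam, beam_size):
    """Adaptive beam expansion: full width inside constrained spans,
    greedy (width 1) for unconstrained 'thought' tokens.

    beam: list of (token_list, cumulative_logprob, in_constrained_span) triples.
    """
    candidates = []
    for tokens, score, constrained in beam:
        logprobs = next_logprobs(tokens)
        if constrained:
            scored = [(t, logprobs.get(t, float("-inf"))) for t in allowed_next(tokens)]
            width = beam_size   # explore alternative retrieval-key continuations
        else:
            scored = list(logprobs.items())
            width = 1           # greedy decoding preserves beam space for retrieval
        scored.sort(key=lambda ts: ts[1], reverse=True)
        for tok, lp in scored[:width]:
            candidates.append((tokens + [tok], score + lp, constrained))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```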
3.4 Indexing Strategies
The FM-index used by Riches supports efficient indexing of all sub-strings in a corpus, which is useful when we want to generate corpus text verbatim. However, it is not clear that this is the best choice of retrieval key for the auto-regressive decoder in Section 3.1. A key question in index construction is the document representation used in indexing. In traditional lexical retrieval systems, documents are represented by the terms they contain, with transformations such as stemming and weighting by corpus statistics Robertson et al. (2009). Neural retrieval systems transform raw text into dense vector representations and offload representation computation to the neural network. But even in this case, proper document chunking and/or multi-vector document representations significantly impact final performance Lee et al. (2021); Khattab and Zaharia (2020).
In this section, we introduce a few different choices of retrieval keys, including a propositional index that requires indexing time neural computation. A key consideration here is the interplay between the retrieval index and the search strategy.
Document Title and Section Headers
Many retrieval corpora such as Wikipedia have consistent structures in the form of titles and sometimes sub-titles and metadata. This provides a hierarchical structure such that one can first decode titles, sub-titles and then the document.
Paragraph Sub-string
A natural option for a retrieval key is any sub-string of the unit of text being indexed itself. In most open domain question answering approaches, the paragraph is the de facto unit of evidence. We can index paragraphs efficiently using the FM-index (Section 3.2) and decode sub-strings directly with Riches to get pointers into the retrieval corpus. It should be noted that this yields an inherently many-to-many mapping between paragraphs and retrieval keys, but that the mapping is in effect one-to-one for longer sequences of tokens.
Sentence Sub-string
Similarly, individual sentences form a natural retrieval key. Sentences are smaller units of information than passages, but may not be interpretable or stand-alone.
Propositional Index
The above choices do not perform any non-trivial indexing step, unlike standard approaches in information retrieval where documents are mapped to sparse or dense vectors. The omission of this indexing step may be desirable but it also forces Riches to deal with the non-uniform and diffused information in raw text. An alternative that is closer, in intent, to the offline indexing step used by other IR systems, is to map each indexed chunk to a set of uniformly structured propositions (Min et al., 2023; Chen et al., 2022). A proposition is a stand-alone unit that efficiently encodes small, atomic chunks of factual information. For example, instead of the sentence "He has 7M followers on Twitter" a proposition would be decontextualized to "Tom Cruise has 7M followers on Twitter." We adopt a pre-existing propositional index from Chen et al. 2023 described in Section 5.1.
Section 6.1 compares various Retrieval keys for the ODQA task with illustrations in Section A.4.
4 Interleaving Retrieval and Generation
We have presented a method of interleaving unconstrained text generation with constrained generation of retrieval keys. In this section we introduce a handful of tasks that make use of this interleaving either as a core task requirement, or as a means to an end by interleaving ‘thoughts’ with retrieval actions to help guide search.
Attributed Question Answering
We apply Riches to the open domain question answering (ODQA) task where we score both the ability to correctly predict a short answer string and retrieve attribution for that answer Bohnet et al. (2022). See [Table 1](https://arxiv.org/html/2407.00361v1#S4.T1) for examples.
Multi-hop Question Answering
Interleaving between generation and retrieval can be powerful in multi-hop reasoning, where the model needs to retrieve and stitch together knowledge from multiple sources. Examples of Riches outputs for multi-hop QA are given in [Table 2](https://arxiv.org/html/2407.00361v1#S4.T2).
"Thinking" for Retrieval
Multi-step questions often require breaking down a query into smaller steps and reasoning or planning what to retrieve next. Foreshadowing retrieval with thoughts is crucial in this context. It helps direct the retrieval process, avoid repetitions, and, more importantly, allows for iterating upon and correcting previously erroneous retrievals. A few such demonstrations can be found in [Table 2](https://arxiv.org/html/2407.00361v1#S4.T2).
5 Experimental Setup
5.1 Datasets
Queryset
Our experiments are focused on open domain question answering tasks including both single and multi-hop benchmarks. For single-hop, we use the Open-NQ Kwiatkowski et al. (2019) dataset. To evaluate multi-hop reasoning, we look into Hotpot-QA (Yang et al., 2018) and Musique-Ans (Trivedi et al., 2022). The latter includes varying hops and different composition operations, offering a rich test-bed for how well Riches can generalize across a diverse range of queries.
Corpus
Section 3.4 describes multiple strategies to index the corpus. Each type of retrieval key needs to be accompanied with its own corpus. Title, passage and sentence keys are derived from the Wikipedia corpus presented in Bohnet et al. 2022. For propositions, we re-use the Factoid-Wiki corpus built by Chen et al. 2023. This is derived from Bohnet et al. 2022 by decomposing passages into smaller, compact propositions using a finetuned Flan-T5-large (Wei et al., 2021) model. We drop the titles from Factoid-Wiki and only use the propositions (See Section A.2).
5.2 Evaluation
The standard metric for ODQA benchmarks has predominantly been F1 answer match accuracy. However, language models are prone to hallucinate, and F1 alone can be misleading as the answer may not be conditioned on the evidence. Attribution (Rashkin et al., 2021) helps us trade off answer accuracy against faithfulness to the evidence. Thus, we measure two competing metrics: i) end-to-end answer accuracy with F1 and ii) attribution of the answer to evidence using AutoAIS (Bohnet et al., 2022). AutoAIS, or AIS for short, is automatically computed by classifying whether the evidence text entails the question and predicted answer pair. We re-use the NLI scorer and formulation from Bohnet et al. 2022 (see details in Section A.2). The evidence text here is the concatenation of all retrieval keys in the Riches output. The unconstrained thoughts are discarded from evaluation. Only the top beam output is considered for evaluation.
5.3 Models and Inference
Throughout our experiments, we use off-the-shelf instruction-tuned models in a few-shot setting, without any fine-tuning. We test the instruction-tuned versions of PaLM2-M and its larger variant PaLM2-L (Anil et al., 2023) based on stacked Transformer architecture. We use 3 example demonstrations in our prompt (Section A.1), with different sets of examples for single-hop (NQ) and multi-hop (Hotpot, Musique) datasets. The unconstrained sequences or thoughts are formulated as hint keywords. Our final setup uses a beam of 10 with constrained decoding (Section 3.1), adaptive beam size (Section 3.3) and propositions as retrieval keys. Later in Section 6, we ablate these choices. Note that only the top-beam result is considered for evaluation.
5.4 Baselines
We experiment with 3 types of baselines: no retrieval, the standard dense retriever and an iterative retriever suited for multi-hop QA. We opt for baselines that test the out-of-domain system performance, similar to Riches.
No Retrieval
We compare to a few-shot unconstrained baseline using the same 3-shot prompt as Riches, allowing for chain-of-thought reasoning. The setup generates an answer along with hallucinated evidence, not grounded in a corpus. This is a measure of the model's memorization capabilities.
Generalized Dense Retriever
For single-hop QA, we compare our approach against the Generalized T5 retriever (GTR-xxl, 11B variant) (Ni et al., 2021). GTR undergoes multi-staged training, first on an unsupervised web-mined corpus and then on supervised search datasets including NQ. It has been shown to generalize well out-of-domain. However, GTR and other conventional dense retrievers provide only retrieved documents, not the answers themselves. To extract answers, we use the PaLM2-M model in a few-shot setting (see Section A.1 for details).
Since Riches generates a single output with a varying number of interleaved documents, direct comparison with dense retrievers that fetch a fixed top-k documents is challenging. We set k to the mean number of documents Riches fetches for single-hop. When retrieval keys differ, such as passages vs. propositions, we approximately match the number of tokens used by both setups. In our final experiments, we compare against k=1 passage and k=2 propositions for GTR-xxl.
Iterative Retrieval (Iter)
For multi-hop QA, we adopt a popular method where the question is decomposed into sub-queries (Khot et al., 2022). At each step, passages are retrieved for a sub-query and fed as input for the next query, until the process converges to an answer. The method has the same surface form as Riches, except for the key distinction that each step requires switching between a heterogeneous mix of models. In our experiments, we retrieve the top-1 document with GTR-xxl and use PaLM2-M few-shot for both decomposing the query and generating the final answer (see prompt in Section A.1). The maximum number of allowed steps is set to 4, by which most queries converge.
6 Results
In the following sections, we investigate the key building blocks of Riches: i) indexing strategies (Section 3.4) amenable to auto-regressive decoding ii) effect of beam decoding (Section 3.1) iii) suitable mechanisms to interleave thoughts and retrieval keys (Section 3.3). Finally, we compare Riches against conventional retrieval systems. We also draw a detailed analysis of wins and losses to fathom the strengths and pitfalls of the system.
6.1 Riches building blocks
Retrieval Keys
We explore the following retrieval key candidates as detailed in Section 3.4: a) Title: Wikipedia page and section titles, ranking paragraphs within the section using TF-IDF scores. b) Paragraph with Title: Decodes the page title, section title, and full paragraph. c) Paragraph: Decodes the paragraph only. d) Sentence: Uses individual sentences. e) Proposition: Uses atomic information units derived from paragraphs. Table 3 shows that among the retrieval keys explored, the propositional index is best aligned with our decoding search strategy, perhaps because its compact nature is most suited to autoregressive decoding. An in-depth analysis of retrieval keys is provided in Section A.4. In the following experiments, we use propositions as our retrieval key.
Effect of Beam size
Table 5 shows how greedy decoding can get stuck with poor retrieval keys. A larger beam enables better search space exploration, albeit with diminishing returns. In our final experiments, we use a beam of 10.
Interleaving with Adaptive Beam
Table 6 shows the impact of interleaving thoughts with retrieval keys. First, we note that an adaptive beam is crucial for interleaving unconstrained and constrained sequences. Without an adaptive beam, minor irrelevant variations in unconstrained thoughts can consume and overwhelm the available space in the beam. By greedily decoding unconstrained sequences, the beam space is preserved for backtracking during document search. Once an adaptive beam is in place, the insertion of keywords enhances both answer and retrieval performance, reminiscent of the chain-of-thought technique, here enabling better retrieval.
6.2 Overall Results
Table 4 shows the overall performance of Riches across various datasets. First, we compare with the no-retrieval baseline and observe that jointly retrieving and answering does not negatively impact the model's answering capabilities. For single-hop tasks, Riches competes well with dense retrievers, offering higher answer accuracy at the expense of attribution. In multi-hop QA, Riches excels, outperforming iterative baselines by +15 F1 points on Hotpot and +11 on Musique, with comparable or better attribution. The increase in answer accuracy with the larger PaLM2-L model suggests improved performance with larger model sizes. Notably, Riches achieves these results with a single inference pass, unlike the iterative baseline, which requires a model call at each sub-query step.
6.3 Qualitative analysis
We inspect 50 win and 50 loss examples to analyze the strengths and weaknesses of the system.
Wins
Several properties distinguish Riches from dense retrievers: a) Riches allows large language models to utilize their parametric knowledge for retrieval. Since the search operation in Riches is independent of corpus size, it can employ much larger models at query time. b) The inherent alignment of instruction-tuned models enables them to retrieve contextually relevant passages, whereas dense retrievers may sometimes latch onto keywords. c) The interleaved thoughts guide the model toward more accurate retrievals. [Table 1](https://arxiv.org/html/2407.00361v1#S4.T1) demonstrates these scenarios for single-hop retrievals and [Table 2](https://arxiv.org/html/2407.00361v1#S4.T2) for multi-hop retrievals.
Can the model retrieve what it doesn’t know?
A language model may hold stale or incorrect information. However, Riches can often override the model's pre-existing knowledge and generate correct answers by constraining on the corpus (Section A.3).
Losses
We inspect 50 failed queries and categorize the losses (Table 7) as follows: a) Index failure: the proposition is absent from the index or not decontextualized. b) Search failure: the proposition exists in the index but could not be generated. c) Attribution failure: the answer is only partially attributed, with the LLM hallucinating based on partial evidence (see Section A.3 for examples).
7 Conclusion
Retrieval has so far been alienated from the rapid progress in instruction tuning. This work makes the following contributions: i) an approach that can seamlessly integrate retrieval with generation; ii) a thorough investigation of indexing and search strategies that enable such an approach to be effective; iii) a proof-of-concept of the capabilities of such a system on a variety of QA tasks. We hope the ideas introduced in this work fuel progress in aligning retrieval with generation and simplifying it.
8 Limitations
First we note the limitations in our experimental setup. All our experiments are based on Wikipedia, a corpus heavily seen during pre-training. This work does not analyze how Riches fares on corpora unseen during pre-training. Furthermore, we only examine a handful of factoid question-answering tasks due to the lack of objective evaluations. Performance on tasks such as long-form QA is deferred to future work. There are also certain inherent limitations with Riches. It forces verbatim emission of corpus text, which might be overkill for tasks where a similarity-based metric is sufficient. Riches lacks the ability to retrieve dozens of documents, a necessity for certain summarization tasks. For long documents with diffused information, rewriting into propositions adds complexity and can be cumbersome. Lastly, while Riches's search operation is independent of corpus size, the use of beam search and communication between the FM-index and Transformer model can slow down inference.
9 Ethical Considerations
All artifacts used in this paper, including models, datasets, and baselines, are under permissive licenses and publicly available. We have attempted to provide detailed information to facilitate the reproduction of our results.
Our findings are based on English-language data from Wikipedia, and we have not tested the generalizability of our claims to other languages or domains.
Lastly, the datasets used in this work are not expected to contain any offensive content. However, it is important to note that Large Language Models (LLMs) can exhibit biases related to gender, race, and region, and are also prone to hallucination. Although Riches aims to ground its generation in an external corpus, some biases may still be present.
References
Adolphs et al. (2021)Leonard Adolphs, Benjamin Boerschinger, Christian Buck, Michelle Chen Huebscher, Massimiliano Ciaramita, Lasse Espeholt, Thomas Hofmann, Yannic Kilcher, Sascha Rothe, Pier Giuseppe Sessa, et al. 2021.Boosting search engines with interactive agents.arXiv preprint arXiv:2109.00527.
Anil et al. (2023)Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023.Palm 2 technical report.arXiv preprint arXiv:2305.10403.
Asai et al. (2023)Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023.Self-rag: Learning to retrieve, generate, and critique through self-reflection.arXiv preprint arXiv:2310.11511.
Bohnet et al. (2022)Bernd Bohnet, Vinh Q Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, et al. 2022.Attributed question answering: Evaluation and modeling for attributed large language models.arXiv preprint arXiv:2212.08037.
Chen et al. (2022)Sihao Chen, Senaka Buthpitiya, Alex Fabrikant, Dan Roth, and Tal Schuster. 2022.Propsegment: A large-scale corpus for proposition-level segmentation and entailment recognition.arXiv preprint arXiv:2212.10750.
Chen et al. (2023)Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Dong Yu, and Hongming Zhang. 2023.Dense x retrieval: What retrieval granularity should we use?arXiv preprint arXiv:2312.06648.
De Cao et al. (2020)Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2020.Autoregressive entity retrieval.arXiv preprint arXiv:2010.00904.
Ferragina and Manzini (2000)Paolo Ferragina and Giovanni Manzini. 2000.Opportunistic data structures with applications.In Proceedings 41st annual symposium on foundations of computer science, pages 390–398. IEEE.
Hao et al. (2023)Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023.Reasoning with language model is planning with world model.arXiv preprint arXiv:2305.14992.
Jain et al. (2023)Palak Jain, Livio Baldini Soares, and Tom Kwiatkowski. 2023.1-pager: One pass answer generation and evidence retrieval.arXiv preprint arXiv:2310.16568.
Karpukhin et al. (2020)Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020.Dense passage retrieval for open-domain question answering.arXiv preprint arXiv:2004.04906.
Khattab and Zaharia (2020)Omar Khattab and Matei Zaharia. 2020.Colbert: Efficient and effective passage search via contextualized late interaction over bert.In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48.
Khot et al. (2022)Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022.Decomposed prompting: A modular approach for solving complex tasks.arXiv preprint arXiv:2210.02406.
Kwiatkowski et al. (2019)Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019.Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466.
Lee et al. (2021)Jinhyuk Lee, Alexander Wettig, and Danqi Chen. 2021.Phrase retrieval learns passage retrieval, too.arXiv preprint arXiv:2109.08133.
Lewis et al. (2020)Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020.Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474.
Lu et al. (2021)Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras, Lianhui Qin, Youngjae Yu, Rowan Zellers, et al. 2021.Neurologic a* esque decoding: Constrained text generation with lookahead heuristics.arXiv preprint arXiv:2112.08726.
Menick et al. (2022)Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nat McAleese. 2022.Teaching language models to support answers with verified quotes.
Metzler et al. (2021)Donald Metzler, Yi Tay, Dara Bahri, and Marc Najork. 2021.Rethinking search: making domain experts out of dilettantes.ACM SIGIR Forum, 55(1):1–27.
Min et al. (2023)Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023.Factscore: Fine-grained atomic evaluation of factual precision in long form text generation.arXiv preprint arXiv:2305.14251.
Ni et al. (2021)Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y Zhao, Yi Luan, Keith B Hall, Ming-Wei Chang, et al. 2021.Large dual encoders are generalizable retrievers.arXiv preprint arXiv:2112.07899.
Pradeep et al. (2023)Ronak Pradeep, Kai Hui, Jai Gupta, Adam D Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, and Vinh Q Tran. 2023.How does generative retrieval scale to millions of passages?arXiv preprint arXiv:2305.11841.
Rashkin et al. (2021)Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. 2021.Measuring attribution in natural language generation models.arXiv preprint arXiv:2112.12870.
Ren et al. (2017)Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. 2017.Deep reinforcement learning-based image captioning with embedding reward.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 290–298.
Robertson et al. (2009)Stephen Robertson, Hugo Zaragoza, et al. 2009.The probabilistic relevance framework: Bm25 and beyond.Foundations and Trends® in Information Retrieval, 3(4):333–389.
Sun et al. (2022)Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. 2022.Recitation-augmented language models.arXiv preprint arXiv:2210.01296.
Tay et al. (2022)Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, et al. 2022.Transformer memory as a differentiable search index.Advances in Neural Information Processing Systems, 35:21831–21843.
Thakur et al. (2021)Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021.Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models.arXiv preprint arXiv:2104.08663.
Trivedi et al. (2022)Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022.Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554.
Wei et al. (2021)Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021.Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652.
Yang et al. (2018)Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018.Hotpotqa: A dataset for diverse, explainable multi-hop question answering.arXiv preprint arXiv:1809.09600.
Yao et al. (2023)Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023.Tree of thoughts: Deliberate problem solving with large language models.arXiv preprint arXiv:2305.10601.
Yao et al. (2022)Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022.React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629.
Yu et al. (2022)Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. 2022.Generate rather than retrieve: Large language models are strong context generators.arXiv preprint arXiv:2209.10063.
Appendix A Appendix
A.1 Experiment Details
In-context prompts
We use 2 different sets of few-shot demonstrations for single-hop (NQ) and multi-hop (Hotpot, Musique) datasets, displayed in Table 10 and Table 11 respectively. Both prompts carry the same instruction, but the multi-hop variant provides demonstrations with multiple evidence passages.
Computing constraints
An example of constrained decoding is illustrated in Figure 3.
Baselines
For the dense-retriever baseline, answers are extracted from retrieved passages with an external reader. We use PaLM2-M with a few-shot prompt (Table 12).
For iterative retrieval baseline, we use PaLM2-M for both query decomposition and answering. At each step, the model can choose to generate a sub-query or the final answer. The unified prompt is provided at Table 13.
A.2 Evaluation
Datasets
We use the Musique-Ans (Trivedi et al., 2022) subset of Musique, which consists of answerable queries. Details of the query sets evaluated can be found in Table 8. To make retrieval challenging, we use the full Wikipedia corpus for retrieval (Table 9). This is different from the typical Hotpot and Musique settings, which use the first Wikipedia paragraph (5M documents) and the documents associated with the query set (1.3M), respectively.
Metrics
AutoAIS is an automated way of measuring AIS (Attributable to Identified Source) (Rashkin et al., 2021). AutoAIS formulates evaluation as a Natural Language Inference task that asks a model whether the question and answer are entailed by the provided evidence. We re-use a T5-11B checkpoint finetuned on a collection of NLI tasks from Bohnet et al. (2022). Question answering is formulated as an NLI task as follows:
hypothesis: ... premise: The answer to the question '...' is '...'
The NLI scorer provides a calibrated score from 0 (not attributed) to 1 (completely attributed), which is averaged over the dataset to get the final AutoAIS score.
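For illustration only, a tiny helper that fills this template as printed above (the slot assignment follows the template verbatim; the NLI scorer itself is the finetuned T5 checkpoint and is not sketched here):

```python
def fill_autoais_template(evidence: str, question: str, answer: str) -> str:
    """Fill the AutoAIS NLI template exactly as printed above."""
    return (f"hypothesis: {evidence} "
            f"premise: The answer to the question '{question}' is '{answer}'")

print(fill_autoais_template(
    "The Beatles hold the record for most number-one hits on the Billboard Hot 100.",
    "who has the most number one single hits",
    "The Beatles",
))
```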
A.3 Extended Qualitative Analysis
We provide examples for the loss categories defined in Section 6 in Table 14. Table 15 showcases a few selected examples where the unconstrained model emits an incorrect answer, but constraining on the corpus guides it towards the correct answer.
A.4 Index representation qualitative analysis
In this section we analyse retrievals for a few selected examples from the OpenNQ development dataset. We compare retrievals using different document representations in the index: title + tf-idf, passage, sentence and proposition (see Section 3.4).
Question who holds the world record for the most world records
Gold Answer Ashrita Furman
• title + tf-idf: List of world records in track cycling. World records in the sport of track cycling are ratified by the Union Cycliste Internationale (UCI).
• Passage: According to the Guinness Book of World Records, the record for the largest event was set in 2009.
• Sentence: As of January 2021, he has 1.7 million followers on Twitter
• Proposition: Ashrita Furman holds the Guinness world record for the most Guinness world records.
• Source text for proposition: Ashrita Furman (born Keith Furman, September 16, 1954) is a Guinness World Records record-breaker. As of 2017, Furman has set more than 600 official Guinness Records and currently holds 531 records, thus holding the Guinness world record for the most Guinness world records.
Analysis In this example, only the proposition-based retrieval was able to retrieve a correct answer. Part of the reason the passage and sentence representations are hard to retrieve with auto-regressive decoding is that the main evidence for this answer in the "source text for proposition" comes at the end of a complex sentence ("[…] thus holding the Guinness world record for the most Guinness world records").
Question who has the most number one single hits
Gold Answer The Beatles
• title + tf-idf: List of Billboard Hot 100 number-one singles of the 1990s. This article is about the American Billboard Hot 100 chart held during the 1990s.
• Passage: According to the Recording Industry Association of America (RIAA), the album has sold more than 500,000 copies in the United States,
• Sentence: As of 2021, The Beatles remain the act with the most Christmas number-one singles in the United Kingdom.
• Proposition: The Beatles hold the record for most number-one hits on the Billboard Hot 100.
• Source text for proposition: The Beatles, Awards and achievements. As of 2017, they hold the record for most number-one hits on the Billboard Hot 100, with twenty.
Analysis The correct answer is retrieved in this example using both the sentence- and proposition-based representations. However, the sentence representation here does not provide proper justification for the answer. Note that the propositional representation correctly decontextualizes the pronoun in "[…] they hold the record […]" to "The Beatles hold the record […]", making the retrieval easier with constrained decoding.
Question how many episodes of sabrina the teenage witch are there
Gold Answer 163
• title + tf-idf: Sabrina the Teenage Witch (1996 TV series). The first four seasons aired on ABC from September 27, 1996 to May 5, 2000. The final three seasons ran on The WB from September 22, 2000 to April 24, 2003.
• Passage: Sabrina the Teenage Witch is an American television sitcom created by Nell Scovell, based on the Archie Comics series of the same name.
• Sentence: Sabrina the Teenage Witch is an American television sitcom created by Nell Scovell, based on the Archie Comics series of the same name.
• Proposition: Sabrina the Teenage Witch had 163 episodes.
• Source text for proposition: This is an episode list for Sabrina the Teenage Witch, an American sitcom that debuted on ABC in 1996. From Season 5, the program was aired on The WB. The series ran for seven seasons totaling 163 episodes.
Analysis All retrievals using non-propositional representations select part of the main article for "Sabrina the Teenage Witch". This article, however, does not contain the answer to the question. In the propositional case, there is a straightforward proposition constructed from a passage in the "List of Sabrina the Teenage Witch episodes" article. Note that the source passage contains a reference that would be ambiguous out of context ("The series" is decontextualized to "Sabrina the Teenage Witch" in the proposition).
Question what is dj’s boyfriends name on full house
Gold Answers Steve Hale, Steven "Steve" Hale, rich kid Nelson, or Viper
• title + tf-idf: Full House (season 8). The eighth and final season of the ABC sitcom Full House originally aired between September 27, 1994 and May 23, 1995.
• Passage: Full House (1987–1995) and its Netflix sequel Fuller House.
• Sentence: In the 1990s, she appeared in the films Blues Brothers 2000
• Proposition: Steve Hale was D.J.’s boyfriend in seasons six and seven.
• Source text for proposition: Full House, Production, Casting. As babies, the children were played by Daniel and Kevin Renteria, and in season six, the roles of the twins were succeeded by Blake and Dylan Tuomy-Wilhoit. The last main character added was Steve Hale, who was D.J.’s boyfriend in seasons six and seven. He was played by Scott Weinger.
Analysis The source sentence with the correct answer presents a challenge for auto-regressive decoding since the sentence prefix focuses on an aspect unrelated to the question ("The last main character added […]"). With propositionalization, the sentence structure becomes aligned with the question, but retrieval requires that the model already knows the answer, given that the first entity in the sentence is the answer.
Question who is the girl in green day 21 guns
Gold Answer Lisa Stelly
• title + tf-idf: Boulevard of Broken Dreams (Green Day song), Music video. The video won six awards at the MTV Video Music Awards in 2005, most notably for Video of the Year. It also won Best Group Video, Best Rock Video, Best Direction, Best Editing, and Best Cinematography.
• Passage: "21 Guns" is a song by American rock band Green Day. It was released as the second single from their eighth studio album, 21st Century Breakdown (2009), and serves as the sixteenth track from the album. The single was released through Reprise Records on May 25, 2009 as a digital download and July 14, 2009 as a CD single.
• Sentence: "21 Guns" is a song by American rock band Green Day.
• Proposition: The girl in the music video is Teresa Lourenco.
• Source text for proposition: The music video for "Again" features Kravitz with his girlfriend in his apartment (Gershon), whom he does not seem to be interested in. Similar to the song’s lyrical content, he meets a girl (Teresa Lourenco), who works as a waitress in a restaurant/diner.
Analysis In this case, all retrievals fail to retrieve the correct answer. In the case of the proposition-based representation, the model decodes a proposition whose subject is an ambiguous reference ("The girl") that has not been properly decontextualized (the source passage above makes it clear that the reference is not related to the question). Interestingly, the source passage with the correct answer requires an inferential step, and its proposition representations have not been decontextualized properly. Source text with correct answer: 21 Guns (song), Music video. The video takes place with the band and the album’s two protagonists Christian (Josh Boswell) and Gloria (Lisa Stelly) taking refuge in a white room after robbing a bank.
Relevant generated propositions:
• The video takes place with the band and the album’s two protagonists Christian and Gloria.
• Gloria is played by Lisa Stelly.
To properly retrieve this passage using a proposition-based representation, we would need to disambiguate "The video" to "21 Guns" and perform inference over these two propositions. Alternatively, proposition generation could produce more complex propositions containing both pieces of information, such as: The "21 Guns" video takes place with the protagonist Gloria, played by Lisa Stelly.
Question how many seasons of vampire diaries r there
Gold Answer eight, or 8
• title + tf-idf: The Vampire Diaries. The Vampire Diaries is an American supernatural teen drama television series developed by Kevin Williamson and Julie Plec, based on the book series of the same name written by L. J. Smith. The series premiered on The CW on September 10, 2009, and concluded on March 10, 2017, having aired 171 episodes over eight seasons.
• Passage: The Vampire Diaries is an American supernatural teen drama television series developed by Kevin Williamson and Julie Plec, based on the book series of the same name written by L. J. Smith. The series premiered on The CW on September 10, 2009, and concluded on March 10, 2017, having aired 171 episodes over eight seasons.
• Sentence: The series premiered on The CW on September 10, 2009, and concluded on March 10, 2017, having aired 171 episodes over eight seasons.
• Proposition: The Vampire Diaries is an American supernatural drama television series.
• Source text for proposition: The Vampire Diaries is an American supernatural drama television series that premiered on The CW on September 10, 2009, and concluded on March 10, 2017 after airing eight seasons.
Analysis In this case only the proposition-based retrieval is incorrect. We believe the retrieval fails here due to improper decontextualization of the correct answer passage. The sentence with the correct answer yields the proposition "The series aired 171 episodes over eight seasons.", in which "The series" is left undecontextualized, making it difficult for the model to decode this proposition in response to the question.
A.5 Computations involved
Evaluating the precise compute cost for Riches depends on the specific implementation of the decoding algorithm, but we can sketch the key operations involved in retrieval: indexing and search. Indexing depends on the number of items in the corpus $|D|$. We use a model of size $\mathcal{M}$ to rewrite each passage (average length $|p|$) into propositions. The overall indexing cost is proportional to $O(|D|\,\mathcal{M}\,|p|^{2})$, similar in magnitude to the cost of encoding the corpus in dense retrieval, differing only by a constant factor. Note that our experiments use a T5-large backbone (770M) for Riches, much smaller than the T5-xxl (11B) used in the dense baselines.
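As an illustrative back-of-the-envelope comparison, treating everything other than backbone size as the shared constant factor mentioned above and assuming the 11B backbone is the one used to encode the corpus in the dense baselines, the reported model sizes give:

```latex
\[
  \frac{\text{RICHES indexing}}{\text{dense-retrieval encoding}}
  \;\approx\;
  \frac{|D| \cdot \mathcal{M}_{\text{T5-large}} \cdot |p|^{2}}
       {|D| \cdot \mathcal{M}_{\text{T5-xxl}}  \cdot |p|^{2}}
  \;=\;
  \frac{0.77\,\text{B}}{11\,\text{B}}
  \;\approx\; \frac{1}{14}.
\]
```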
Now let’s look at the search operation. At each auto-regressive step, besides standard decoding, the only additional operation is computing the FM-index constraints, which consumes CPU resources. However, while the index lookup itself is efficient, communication between the index on the host and the Transformer model on the GPU/TPU adds latency to each decoding step. In contrast, RAG systems retrieve documents from the index with a single nearest-neighbor scan. But even there, the retrieved documents need to be encoded as input to the language model.