[arxiv] FaaF: Facts as a Function for the evaluation of generated text
Source
Vasileios Katranidis, Gabor Barany
The demand for accurate and efficient verification of information in texts generated by large language models (LMs) is at an all-time high, but remains unresolved. Recent efforts have focused on extracting and verifying atomic facts from these texts via prompting LM evaluators. However, we demonstrate that this method of prompting is unreliable when faced with incomplete or inaccurate reference information. We introduce Facts as a Function (FaaF), a new approach to the fact verification task that leverages the function-calling capabilities of LMs. FaaF significantly enhances the ability of LMs to identify unsupported facts in texts, while also improving efficiency and significantly lowering costs compared to prompt-based methods. Additionally, we propose a framework for evaluating factual recall in Retrieval Augmented Generation (RAG) systems, which we employ to compare prompt-based and FaaF methods using various LMs under challenging conditions.
Here is a summary of the paper "FaaF: Facts as a Function for the evaluation of generated text" in bullet points:
Problem: Existing methods for verifying the factual accuracy of text generated by large language models (LLMs) are unreliable when faced with incomplete or inaccurate information. These methods often rely on prompting LLMs to evaluate individual facts, which can be costly and time-consuming.
Solution: The paper introduces Facts as a Function (FaaF), a new approach that leverages the function-calling capabilities of LLMs to verify multiple facts simultaneously. FaaF uses JSON or XML to represent facts as function arguments, allowing for more structured and efficient evaluation.
Benefits of FaaF:
Improved Accuracy: FaaF outperforms prompt-based methods in identifying unsupported facts, especially in cases of incomplete or inaccurate text.
Increased Efficiency: FaaF significantly reduces the number of LLM calls and completion tokens required for fact verification, making it more cost-effective.
Enhanced Calibration: FaaF allows for more nuanced responses by providing options like "Not clear from the given passage," which helps LLMs distinguish between different types of rejection.
Evaluation Framework: The paper proposes a framework for evaluating factual recall in Retrieval Augmented Generation (RAG) systems, which is used to compare prompt-based and FaaF methods using various LLMs.
Dataset: The paper augments the WikiEval dataset with ground truth facts and human annotations, creating a new dataset called WikiEvalFacts. This dataset allows for testing fact verification methods under challenging conditions.
Results:
FaaF consistently outperforms prompt-based methods in terms of accuracy and efficiency.
Larger LLMs (like GPT-4-turbo and Claude-3-opus) generally perform better than smaller models.
Claude models using XML are more reliable in formatting responses for function calling than GPT models using JSON.
Limitations:
The study is limited by the relatively small size of the WikiEvalFacts dataset.
Further research is needed to optimize FaaF configurations and explore the interplay of function argument metadata.
Future Work:
Investigate the performance of FaaF with open-source LLMs.
Explore the maximum number of facts and the maximum length of facts that can be incorporated into a function object.
Analyze the impact of token count and other performance implications on FaaF.
"title": "FaaF: Facts as a Function for the evaluation of generated text",
"subtitle": "Vasileios Katranidis, Gabor Barany",
"description": "The demand for accurate and efficient verification of information in texts generated by large language models (LMs) is at an all-time high, but remains unresolved. Recent efforts have focused on extracting and verifying atomic facts from these texts via prompting LM evaluators. However, we demonstrate that this method of prompting is unreliable when faced with incomplete or inaccurate reference information. We introduce Facts as a Function (FaaF), a new approach to the fact verification task that leverages the function-calling capabilities of LMs. FaaF significantly enhances the ability of LMs to identify unsupported facts in texts, while also improving efficiency and significantly lowering costs compared to prompt-based methods. Additionally, we propose a framework for evaluating factual recall in Retrieval Augmented Generation (RAG) systems, which we employ to compare prompt-based and FaaF methods using various LMs under challenging conditions."
FaaF: Facts as a Function for the evaluation of generated text
Vasileios Katranidis Gabor Barany
IMMO Capital
{vasileios.katranidis, gabor.barany}@immo.capital
Abstract
The demand for accurate and efficient verification of information in texts generated by large language models (LMs) is at an all-time high, but remains unresolved. Recent efforts have focused on extracting and verifying atomic facts from these texts via prompting LM evaluators. However, we demonstrate that this method of prompting is unreliable when faced with incomplete or inaccurate reference information. We introduce Facts as a Function (FaaF), a new approach to the fact verification task that leverages the function-calling capabilities of LMs. FaaF significantly enhances the ability of LMs to identify unsupported facts in texts, while also improving efficiency and significantly lowering costs compared to prompt-based methods. Additionally, we propose a framework for evaluating factual recall in Retrieval Augmented Generation (RAG) systems, which we employ to compare prompt-based and FaaF methods using various LMs under challenging conditions.
1 Introduction
The adoption and transformative impact of large language models (LMs) across industries are significantly driven by their application in knowledge-intensive tasks.
In these applications, Retrieval Augmented Generation (RAG) is often used to integrate out-of-training knowledge into the LM (Lee et al., 2019). This is achieved by dynamically fetching and incorporating relevant data from a trusted reference source prior to LM text generation, resulting in more informed, accurate, and contextually rich outputs.
Considering the importance of factual accuracy of generated text in this setting, a significant body of work has focused on automated ways of measuring it. At a high level, two distinct schools of thought emerge, shaped by the intended use case, the available test datasets and the required scale. First, a group of works addresses factual precision (the truthfulness of each statement in a generated text) in both RAG and non-augmented LM generation scenarios (Chen et al., 2022; Zhang et al., 2023; Gao et al., 2023; Lee et al., 2023; Min et al., 2023; Azaria and Mitchell, 2023; Yuan et al., 2021; Fu et al., 2023; Wang et al., 2023; Kadavath et al., 2022; Manakul et al., 2023). Given that a generated text may include both accurate and inaccurate statements, a common approach is to first extract a number of fact statements from the generated text and verify them individually by prompting an LM evaluator model with appropriate references. Secondly, other studies (Cuconasu et al., 2024; Liu et al., 2023; Kandpal et al., 2023; Mallen et al., 2023) evaluate LMs and RAG systems by exact-matching any one of a set of predefined correct answers against the generated text.
There is an apparent lack of work on the practical measurement of factual recall—the extent to which all the information required to sufficiently answer a question is included in the generated text. Factual recall is a key performance metric, particularly in the RAG scenario, because it directly captures the performance of retrieval and generation simultaneously and closely reflects the central purpose of the system. Factual recall is at least as important as factual precision for RAG systems, where a response can be factually precise but given the wrong context may be irrelevant to the posed question.
Although the exact match method can be viewed as a fast and efficient form of binary recall score (where the existence of a single accepted answer in the generated text signals success), it faces serious challenges and limitations in the verification task.
As recognised by Cuconasu et al. (2024), exact matching of ground truth answers in the generated text is prone to false negatives since the ground truth information might be present but phrased differently. Additionally, this approach cannot be applied when the grounding information is longer than a few words, since the chances of an exact match become very slim. Lastly, being binary renders it a relatively poor metric, considering that there may be multiple pieces of grounding information which need to be included in the generated text for a complete answer.
Further, current approaches relying on verification of each fact independently can be prohibitively costly in time and resources. Specifically, RAG systems include many moving parts (knowledge base, retrieval, prompt formulation, LM) and require substantial tuning (Es et al., 2023); therefore, the efficiency and speed of the evaluation task are a requirement for practical usage.
To address the gap above, we make the following contributions:
1. We introduce Facts as a Function (FaaF), a new fact verification formulation which outperforms fact verification via prompting and reduces the required number of LM calls and completion tokens by more than 5 times.
2. We propose an end-to-end factual recall evaluation framework which is tailored to RAG systems. It can be used to (i) create a test dataset and (ii) perform automated factual recall evaluation given a RAG system (or, more generally, a language model).
3. We probe into the performance of fact verification formulations in conditions of highly incomplete or inaccurate generated text. To achieve that, we augment WikiEval (https://huggingface.co/datasets/explodinggradients/WikiEval) (Es et al., 2023) with ground truth facts and human annotation of their validity. WikiEval features question/answer pairs with answers of variable factual quality which enable simulating deficient RAG responses. We find that prompt-based fact verification faces serious challenges in identifying unsupported facts in the presence of inaccurate or incomplete generated text.
2 Related work
Recently, Es et al. (2023) introduced RAGAS, an evaluation framework for RAG systems which measures the performance of retrieval and generation without the requirement of ground truth human annotation. Although the authors recognise and are motivated by the need for a practical and efficient RAG evaluation method, RAGAS does not capture the factual aspect of the evaluation, which is a key performance metric and the central focus of RAG’s intended use.
Min et al. (2023) use prompt variations and an aggregate non-parametric probability of the tokens in a fact statement to directly verify individual facts extracted from LM generated biographies. They compare their method with human evaluation and find low error rates when retrieving the ground truth for the evaluated fact.
Zhang et al. (2023) propose self-measuring factuality by the LM via a few-shot prompting method combined with generated facts pertinent to the statement in question. They argue that while leveraging facts from a knowledge base is more dependable, its effectiveness is confined to the scope of the knowledge base and the quality of retrieval. Conversely, self-evaluating with generated facts offers more flexibility but risks introducing inaccuracies.
Li et al. (2023) indicate that LMs have difficulty identifying non-factual information with standard prompting strategies and report improvement using Chain of Thought (CoT).
Azaria and Mitchell (2023) also find fact verification by prompting insufficient and propose to train a classifier on the hidden-layer activations of open-source LMs to predict the truthfulness of a generated statement. However, current leading commercial models do not expose layer activations and so require alternative methods.
Another approach in the same spirit is to look at the conditional probabilities of each generated token as an indicator of LM confidence and truthfulness of the generated text with the view that low LM confidence is a proxy for incorrect statements (Yuan et al., 2021).
Fu et al. (2023) build on the concept of utilising token probabilities, introducing a self-evaluation framework for LMs. This framework leverages few-shot prompting to evaluate various instructed aspects of LM responses, such as factuality, fluency, interest, among others.
Manakul et al. (2023) propose SelfCheckGPT, which automates the detection of factual errors in LM outputs through statistical analysis of multiple responses to the same prompt, without external knowledge sources. This is, again, an expression of the general idea that the probability distribution of the generated response is indicative of confidence in its truthfulness. Similar to Yuan et al. (2021), SelfCheckGPT makes this assessment after LM generation by sampling multiple answers to the same prompt, thereby removing the requirement of access to token probabilities or layer weights and making this approach applicable to closed models.
Aly et al. (2021) use a RoBERTa encoder with a linear layer to learn and predict the fact label given text evidence.
Wang et al. (2023) describe a method where the LM is prompted directly to score a specific aspect of an answer from 0 to 100 or to rate it on a 5-star scale, yielding notable results. However, this approach’s effectiveness heavily depends on the prompt’s design.
Zhang et al. (2020) attempt a flexible evaluation of generated text using reference answers (BERTScore). BERTScore calculates a similarity score between tokens in the generated and reference sentences using contextual embeddings. The key benefit is that there is no reliance on exact matching between generated and reference text. Nevertheless, a high semantic score at the sentence level does not guarantee factual precision, especially when the information examined is not contextual and depends only on a small number of tokens (e.g. a date).
The work of Zhao et al. (2019) also relies on contextual embeddings but their approach allows for an intentional bias towards precision or recall via reformulating the semantic similarity between generated and reference text as an optimisation problem of finding the minimum effort to transform between the two.
Kadavath et al. (2022) observe that LLMs offer well-calibrated probabilities for self-evaluation via constraining the LM response into multiple-choice and True/False questions. This work highlights that simply requiring discrete response options prior to text generation can aid the response calibration by effectively narrowing the available distribution of next tokens — which would alternatively include many semantically overlapping paraphrases.
Lastly, recent work on the factual accuracy of LMs and RAG systems (Cuconasu et al., 2024; Liu et al., 2023; Kandpal et al., 2023; Mallen et al., 2023) takes the approach of using the NaturalQuestions-Open (NQ-open) dataset (https://ai.google.com/research/NaturalQuestions) (Kwiatkowski et al., 2019) and calculating accuracy by judging whether any of the ground truth answers (NaturalQuestions annotations) appear in the generated text via exact matching. NQ-open is a large scale dataset which comprises historical Google search queries and their human-annotated answers sourced from Wikipedia. Even though NQ-open is valuable for its extensive scope and domain-agnostic nature, fact verification via exact matching faces serious challenges. As highlighted by Cuconasu et al. (2024), a key issue is accurately determining the truthfulness of answers, especially when dealing with phrases that have the same meaning but slightly different format or wording, or with different date formats. Thus, the necessity for a more advanced analysis of answer variations is recognised and left for future research.
3 Facts as a Function
Facts as a Function (FaaF) is a streamlined fact verification method using function calling for multi-fact assessment. Although in this paper we focus on examining FaaF in the scope of factual recall evaluation, the method itself can equally be used for factual precision and broader fact verification tasks.
Key idea 1: JSON and XML over prompt. We propose that by using the function calling ability of the LM, we enforce a more formal mode of token generation compared to natural language, for three reasons. First, by leveraging the metadata of function arguments, type annotations and tailored instructions can effectively constrain the LM to the accepted modes of response. This provides an element of structured repetition of the instructions to the LM (via the metadata attached to each function argument) and ultimately results in more consistent guidance compared to instructions given once at the end or the beginning of a prompt. Secondly, by formalising the type annotation, we can avoid relying on exact matching to interpret the LM’s text response, which can prove detrimental as we demonstrate in the results of this paper. Type annotations can be combined with custom types to essentially convert a function argument into a classification result for a multiple-choice question. Building on the findings of Kadavath et al. (2022), who established that LMs show well-calibrated probabilities when presented with multiple-choice questions, we propose that using the function argument’s type annotations to convey the accepted LM responses is a step further in the same direction. Thirdly, due to the strict nature of code syntax compared to natural language, gradients during training are expected to be steeper, and thus the LM learns to adhere more closely to the expected output and respond with lower stochasticity, which is particularly helpful in the fact verification use case.
Key idea 2: Generated text as a unit. As discussed above, a function definition can encapsulate a sufficient number of arguments to be used by the LM so that they capture all the fact statements which need to be verified. Therefore, we move away from the concept that each fact should be verified individually via a fact-specific prompt and propose instantiating a function per LM-generated text which needs to be factually assessed. The function can be programmatically generated in a manner such that it includes all the individual facts which need to be verified in a given piece of text. In other terms, function calling allows for the fact verification of a long-form text as a unit, with a single LM call. This approach results in a reduction of cost and time for fact verification which is proportional to the number of facts that would otherwise need to be assessed individually, as seen in Chen et al. (2022); Gao et al. (2023); Min et al. (2023); Lee et al. (2023).
Key idea 3: Outsourcing judgement from the LM to the function. Using function objects to communicate with the LM enables access to a multitude of tools and further processing that we can execute on the LM’s output. This strategy permits us to delegate certain deterministic judgements away from the language model. We demonstrate this capability by mapping a range of LM responses into a binary format (True / False). In doing so, the calibration of the LM response is enhanced, as we provide a more accurate representation of the spectrum of potential outcomes than a simple True/False dichotomy. The underlying intuition is that, ultimately, we can afford to ask simpler and clearer questions of the LM, which can be answered more reliably, and further process the LM output into a final response.
Definition. We aim to present the facts to the LM as a callable function. Let $S$ be the list of fact statements (as strings) to verify.
A constructor function $C$ then maps the input list of facts $S$ and control parameters $P$ to a function object $O$ with arguments $f$.
Each argument in $(f_1, f_2, \ldots, f_n)$ corresponds to a fact statement and is further parameterised by $P$. Control parameters $P$ include the methods, argument properties and metadata which are injected into the resulting object $O$. Such methods can describe, for example, a desired post-processing step on the values assigned to the arguments $(f_1, f_2, \ldots, f_n)$. Before passing object $O$ to a language model, we convert it to a JSON or XML representation, depending on the LM's function calling requirements.
Let $M$ be the language model used for fact verification (LMeval). The input of $M$ is a concatenation of $\mathrm{JSON}_O$, a prompt $q$ which instructs $M$ to utilise $O$, and the input text $x$ which is to be assessed for factuality with respect to the given facts $S$.
$M$ responds with the output $o_x$, which is of string type. Then, a parsing function $G_M$, which adheres to the particular response schema of $M$, is used to parse the raw response $o_x$ and invoke $O$ by assigning values to its arguments, yielding $\acute{O}$. The value assigned to each of $(f_1, f_2, \ldots, f_n)$ is the verification result of the underlying fact statement. Upon invocation of $O$, the function arguments undergo type validation to ensure that the schema and type annotations are respected.
4 Assessment of fact verification formulations in the RAG setting
Figure 2 outlines the factual recall evaluation framework which also serves as the experimental setup which we use to compare fact verification formulations with each other. Starting from the ground truth answer containing the desired information to fully address the posed question, we derive a set of fact statements using a fact-generator LM (LMf). We then use these derived facts to evaluate each of the other answer variants in WikiEval for their factual recall via LMeval. In this manner, each answer is evaluated with respect to the information that is expected from it. It is easy to see how this framework could be applied in the RAG setting where different configurations or model choices impact the quality of the final response.
Dataset
In order to probe into the performance of automatic fact verification methods, we chose to work with the WikiEval dataset (Es et al., 2023), which features three versions of answers, of variable quality, to questions on Wikipedia articles. Specifically, for each question there is an answer (referred to as the ground truth answer in this paper for clarity), an ungrounded answer and a poor answer. All answers have been generated with GPT-3.5-turbo-16k.
The ground truth answer is generated by providing the LM with the correct context from the respective Wikipedia page. The ungrounded answer is generated by not providing any context to the LM. Finally, the poor answer is generated by instructing the LM to give an incomplete answer to the given question. An example of each answer type can be seen in Figure 2. Generally, ungrounded answers contain false, incomplete and redundant information with respect to the ground truth answers. On the other hand, poor answers contain primarily incomplete information compared to ground truth answers, i.e. no evidence to support or reject the ground truth facts.
This dataset enables us to assess the impact of the quality and completeness of different answer variants on the performance of fact verification. In that way, we closely test the ability of different fact verification methods and LMs to identify unsupported facts when presented with (i) incorrect, (ii) indirectly relevant and (iii) incomplete information, with some degree of distinction.
Fact generation
To prepare the WikiEval dataset, we initially generate fact statements that fully capture the information from the ground truth answers, followed by manual annotation of the generated facts for each answer variant. We call the resulting augmented dataset WikiEvalFacts. We use gpt-4-turbo to generate facts from the questions and ground truth answers from the WikiEval dataset (QA pair) using the following prompt:
Convert the given passage into a list of short facts which specifically answer the given question.
Make sure that the facts can be found in the given passage.
The facts should be coherent and succinct sentences with clear and simple syntax.
Do not use pronouns as the subject or object in the syntax of each fact.
The facts should be independent to each other.
Do not create facts from the passage which are not answering the given question.
Add a ”-” before each fact.
Passage: [ground truth answer]
Question: [question]
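As an illustrative sketch of this step (not the authors' code), the prompt above can be applied programmatically as follows; call_llm is an assumed helper wrapping the chosen fact-generator LM (gpt-4-turbo in the paper) and is not tied to any specific SDK.

FACT_GENERATION_PROMPT = (
    "Convert the given passage into a list of short facts which specifically answer the given question.\n"
    "Make sure that the facts can be found in the given passage.\n"
    "The facts should be coherent and succinct sentences with clear and simple syntax.\n"
    "Do not use pronouns as the subject or object in the syntax of each fact.\n"
    "The facts should be independent to each other.\n"
    "Do not create facts from the passage which are not answering the given question.\n"
    'Add a "-" before each fact.\n'
    "Passage: {passage}\n"
    "Question: {question}"
)

def generate_facts(question: str, ground_truth_answer: str, call_llm) -> list[str]:
    # LMf step: derive atomic fact statements from a ground truth QA pair.
    prompt = FACT_GENERATION_PROMPT.format(
        passage=ground_truth_answer, question=question
    )
    response = call_llm(prompt)
    # Keep only the lines that start with the requested "-" marker.
    return [
        line.lstrip("- ").strip()
        for line in response.splitlines()
        if line.strip().startswith("-")
    ]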
Fact generation via the prompt above results in a variable number of facts for each ground truth QA pair which depends on the length and information density in the processed ground truth answer. This process yielded 281 individual facts, which were annotated for each answer type (thus 843 annotated facts in total considering ground truth answer, ungrounded answer and poor answer) with an average of 5.6 fact statements generated for every QA pair. The prompt has been designed to ensure that the generated facts are complete sentences, understandable independently of each other or any external references.
Human fact-verification
We outsource the fact verification of the generated facts against the triplet of answers in WikiEval (ground truth answer, ungrounded answer and poor answer) to human evaluators. In this manner we build a ground truth evaluation for each answer type, enabling us to assess the effectiveness of automated fact-verification methods against it. The accuracy from the human fact verification can be seen in Table 1 where the factual accuracy of ground truth answers is 100% since all the generated facts are True by design. The deterioration of the ungrounded answer and poor answer relative to the ground truth answer is evident.
Prompt fact-verification. Following Min et al. (2023), we use a prompt and the respective answer variant as context to verify a single fact at a time with LMeval:
Passage: [answer]
Considering the given passage, the claim [fact] is True or False?
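For illustration, a minimal sketch of this prompt-based baseline is given below (our own reading, not the authors' code); call_llm is again an assumed wrapper around LMeval, and the word-matching parse mirrors the interpretation step whose fragility is discussed in the results.

def verify_fact_by_prompt(answer: str, fact: str, call_llm) -> bool | None:
    # One LMeval call per fact, with the passage as context.
    prompt = (
        f"Passage: {answer}\n"
        f"Considering the given passage, the claim {fact} is True or False?"
    )
    reply = call_llm(prompt)
    # Interpret the free-text reply by word matching; replies containing
    # both words (or neither) cannot be decided reliably.
    has_true, has_false = "True" in reply, "False" in reply
    if has_true and not has_false:
        return True
    if has_false and not has_true:
        return False
    return None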
Facts as a function
For each set of fact statements which encapsulate a ground truth answer, we construct a facts-specific function object and a parsing function. Since the created function object contains all the input facts as arguments, we perform verification on the set of facts as a unit. An example of the JSON representation of a function object containing the first fact can be seen below:
{"properties": {
    "fact_0": {
      "description": "It is clear from the passage that Pope Benedict XVI became the head of the Catholic Church and sovereign of the Vatican City State on April 19, 2005. Respond by using one of the accepted Enum types.",
      "enum": ["True", "False"],
      "type": "string"
    },
    ...
  },
  "required": ["fact_0", "fact_1", ..., "fact_n"],
  "title": "FactChecker",
  "type": "object"}
Each function argument includes metadata which can be used to pass instructions, type annotations and the fact statement to be verified itself. In addition to the function object, we pass the following prompt:
Consider the given passage and assign the correct values in the fact checker function.
Passage: [answer]
The answer in the prompt is the input text we want to evaluate against the respective facts which have previously been derived from the ground truth answer in our dataset. After LMeval generates a response, the parsing function is used to invoke the function object by supplying the arguments parsed from LMeval's response.
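Putting the pieces together, a sketch of the full verification loop for one answer might look as follows; it reuses the illustrative helpers introduced earlier (generate_facts, make_fact_checker, parse_and_invoke), and call_faaf is an assumed wrapper that sends the function object and the prompt above to LMeval and returns the parsed argument values. Reporting the supported fraction as a recall score is our reading of the framework, not a formula stated in the paper.

def factual_recall(question: str, ground_truth_answer: str, candidate_answer: str,
                   call_fact_gen, call_faaf) -> float:
    # 1. Derive the ground truth facts from the reference QA pair (LMf).
    facts = generate_facts(question, ground_truth_answer, call_fact_gen)
    # 2. Build one function object encapsulating all facts (constructor C).
    fact_checker = make_fact_checker(facts)
    # 3. Single LMeval call: verify the whole set of facts against the answer.
    raw_arguments = call_faaf(fact_checker, candidate_answer)
    results = parse_and_invoke(fact_checker, raw_arguments)
    # 4. Fraction of ground truth facts supported by the candidate answer.
    return sum(results.values()) / len(results)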
We test the following configurations:
FaaF(T/F). A function object with arguments which only accept True or False as a response from LMeval. As seen from the JSON example above, these are specified as custom type annotations (enum).
FaaF(T/F/N). A function object with arguments which only accept True, False or Not clear from the given passage as a response from LMeval. In this scenario, further processing inside the function object maps Not clear from the given passage to False after invocation. This is an example of applying a simple processing step to the LM output, post-generation. The intuition behind this configuration is that the rejection of a claim based on contradicting evidence is conceptually different from the rejection of a claim based on absence of evidence, and we help LMeval's calibration by providing a clear response option for each.
FaaF(T/F)+citation. In this instance we construct a function object with two arguments for each input fact: one argument for the factual evaluation and one argument where we instruct LMeval to generate an exact excerpt from the input text which directly supports the fact in question (i.e. a citation). We place the citation argument before the factual evaluation argument so that LMeval first tries to find a supporting citation in the input text before verifying the fact that is being assessed. Similarly, the intuition here is that asking LMeval to search for and retrieve a specific citation from the input text which supports a specific fact results in a better calibrated verification of the respective fact statement.
FaaF(T/F/N)+citation. In this configuration we combine the two approaches outlined above to explore their combined effect. In detail, we construct the function object to include citation arguments and we define True, False or Not clear from the given passage as the accepted responses.
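As a small sketch of the post-invocation mapping used in the (T/F/N) configurations (illustrative, building on the earlier FactLabel example rather than the authors' code):

from enum import Enum

class FactLabelTFN(str, Enum):
    TRUE = "True"
    FALSE = "False"
    NOT_CLEAR = "Not clear from the given passage"

def to_binary(label: FactLabelTFN) -> bool:
    # The judgement of how to treat a "Not clear" response is outsourced
    # to the function object: it is mapped to False after invocation,
    # while keeping the two rejection reasons distinct for the LM.
    return label is FactLabelTFN.TRUE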
Language models (LMeval). We use the latest commercially available models which support function calling: gpt-4-turbo, gpt-3.5-turbo, claude-3-opus and claude-3-sonnet. We also examined the recently released mistral-large (https://mistral.ai/news/mistral-large/) but it was excluded from the results in this paper due to its high failure rate (over 80% in some cases) in generating appropriately formatted responses, rendering its results non-contributory to the discussion. It is important to note that we did not allow models to retry in case of a failed parsing of their response or a failed invocation of the function object due to formatting. FaaF introduces strict constraints on the expected LM response and, by permitting only one attempt, we also assess the LMs' proficiency in formatting their responses, as well as in verifying factual accuracy.
In addition, we highlight the crucial role of the system prompt in model performance. A change in the system prompt can significantly impact the fact verification accuracy of the LM. We kept the system prompts for the GPT models unchanged but modified the Claude models' system prompts to incorporate a 1-shot example of simple function calling. This adjustment follows the official function-calling recommendations of the two model families at the time of writing this paper: GPT uses JSON, while Claude uses XML. As the pace of developments in this space is fast, we expect further changes and improvements in the optimal interface between LMs and tools/functions.
Lastly, a comprehensive assessment of open-source models which support function calling is left for future study, as none are widely established at this time.
Metrics. This paper's metrics exclusively evaluate language models (LMs) based on their successfully formatted responses where applicable (FaaF). In doing so, we ensure that the comparison of the LMs' fact verification ability is not influenced by their capacity to format responses correctly, which is discussed separately.
Error rate. We use Error Rate (ER) between the human fact-verification and the fact verification formulation as the main indicator of verification accuracy.
F1micro. We also use the F1micro score (F1m), as defined in Min et al. (2023), to measure the successful identification of unsupported facts and probe further into the individual fact verification. It should be noted that F1 scores explicitly depend on the class ratio (T/F) via precision and recall. For that reason, F1m scores should be compared across fact verification approaches within the same answer category (where the T/F ratio is preserved) in Table 2, and not across answer categories.
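A minimal sketch of these two metrics, under the assumption that predictions and human labels are aligned lists of booleans and that the unsupported (False) class is treated as the positive class for F1m; this is our reading of the definition in Min et al. (2023), not the authors' exact implementation:

def error_rate(predictions: list[bool], human_labels: list[bool]) -> float:
    # Fraction of facts where the LM verification disagrees with the human label.
    disagreements = sum(p != h for p, h in zip(predictions, human_labels))
    return disagreements / len(human_labels)

def f1_unsupported(predictions: list[bool], human_labels: list[bool]) -> float:
    # F1 for identifying unsupported facts: False is the positive class.
    tp = sum((not p) and (not h) for p, h in zip(predictions, human_labels))
    fp = sum((not p) and h for p, h in zip(predictions, human_labels))
    fn = sum(p and (not h) for p, h in zip(predictions, human_labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)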
5 Results
Table 2 presents the non-answer rate (N/A), Error Rate and F1m for the examined fact verification formulations and LMs, across the answer categories.
Prompting for fact verification is not reliable in cases of incorrect or incomplete information. Although in the case of ground truth answers Prompt(T/F) ER is in the low percentage points in all LMs—in line with what is reported in Min et al. (2023) when retrieval is enabled—we see a sharp rise when we try to verify facts in text from the ungrounded answers and a further deterioration with poor answers with ER exceeding 50% and 70% with GPT and Claude LMs respectively. Although the failure mechanisms are distinct between Claude and GPT, the performance in both cases indicates that prompting coupled with word-matching is not a suitable approach for fact verification in text with unknown information quality and completeness.
In the case of gpt-4-turbo and gpt-3.5-turbo, the high error rates are attributed to an overall overestimation of the factual truthfulness of a fact statement given a reference text which does not support it. Regarding claude-3-opus and claude-3-sonnet, the exceedingly high ER is primarily due to the erroneous parsing of the LM's response. In detail, on many occasions the verification response from Claude LMs contains both the words True and False, which exposes the fragility of word matching as a means of parsing the LM's response. Although qualitative evaluation found Claude models to be less prone to overestimation than GPT LMs, the increased verbosity of their responses sometimes includes phrases like “To determine if the claim is true or false based on the given passage…”, “…cannot be determined as true or false.” or “…given the lack of information in the passage, we cannot determine whether the claim is True or False.”, which mislead the interpretation of the response when using word matching.
FaaF outperforms prompting in fact verification accuracy. Formulations which leverage function calling demonstrate a notable improvement in ER and F1micro, particularly in the cases of ungrounded and poor answers, with all LMs. The improvement is most pronounced in the poor answer category, where all models show ER scores 30 to 70 percentage points lower than prompting, denoting a paradigm shift in the models' ability to identify unsupported facts in the examined text. In addition, the larger LMs (claude-3-opus and gpt-4-turbo) perform notably better in the poor answer category than their respective smaller versions (claude-3-sonnet and gpt-3.5-turbo).
Rejecting facts is more difficult in the presence of tangentially relevant information when using FaaF. It is evident from Table 2 that FaaF approaches struggle more with ungrounded answers than with poor answers. The key difference between the two answer types is that ungrounded answers typically contain incorrect or tangentially relevant information with respect to the facts that are examined, whilst poor answers usually lack information altogether. This is because ungrounded answers are generated by an LM without any context; thus, depending on the popularity and prevalence of the theme of the question, a degree of factual hallucination can be expected. It is apparent that, from the perspective of fact verification, the presence of indirectly relevant (but not useful) information in the generated text is more dangerous than a lack of information.
Consider the following example from WikiEvalFacts:
Question:
What factors contributed to the Sri Lankan economic crisis?
Ground truth answer:
The Sri Lankan economic crisis was caused by multiple compounding factors. These include tax cuts and money creation by the government, which led to budget deficits and a decline in government revenue. The government’s decision to shift to organic or biological farming also played a role. Additionally, the crisis was exacerbated by the 2019 Sri Lanka Easter bombings and the impact of the COVID-19 pandemic. The country’s high external debt, with a significant increase in foreign debt in recent years, also contributed to the crisis. The combination of these factors resulted in unprecedented levels of inflation, near-depletion of foreign exchange reserves, shortages of medical supplies, and an increase in prices of basic commodities.
Ungrounded answer:
There were several factors that contributed to the Sri Lankan economic crisis:
1. High levels of government debt: Sri Lanka had accumulated a significant amount of government debt, both domestic and foreign. This debt burden became unsustainable, leading to a crisis in public finances.
2. Fiscal mismanagement: The government had been running large budget deficits for several years, spending more than it was earning. This led to a widening fiscal deficit and increased borrowing to finance government expenditure.
3. Political instability: Sri Lanka experienced political instability during this period, with frequent changes in government and policy uncertainty. This created an unfavourable business environment and deterred foreign investment.
4. Declining exports: Sri Lanka's export sector, particularly garments and textiles, faced challenges due to global economic slowdowns and increased competition from other countries. This led to a decline in export earnings, which further strained the country's external finances.
5. High inflation: Sri Lanka experienced high inflation during this period, which eroded the purchasing power of the population and increased the cost of living. This put additional pressure on the economy and contributed to the crisis.
6. External shocks: Sri Lanka was also affected by external shocks, such as rising oil prices and global financial market volatility. These factors further exacerbated the economic challenges faced by the country.
7. Structural issues: Sri Lanka had several structural issues that hindered its economic growth, such as a lack of diversification in the economy, inadequate infrastructure, and low productivity levels. These issues made it difficult for the country to sustain economic growth and overcome the crisis.
Poor answer:
There were several factors that contributed to the Sri Lankan economic crisis. Sri Lanka is known for its beautiful beaches and rich cultural heritage.
Extracted facts (using question and ground truth answer):
• Tax cuts and money creation by the government contributed to the Sri Lankan economic crisis.
• Budget deficits and a decline in government revenue were factors in the Sri Lankan economic crisis.
• The government's decision to shift to organic or biological farming played a role in the crisis.
• The 2019 Sri Lanka Easter bombings exacerbated the economic crisis.
• The impact of the COVID-19 pandemic contributed to the Sri Lankan economic crisis.
• High external debt, with a significant increase in foreign debt in recent years, also contributed to the crisis.
Although the ungrounded answer is quite verbose and makes several mentions of and indirect references to the extracted facts, it fails to capture with clarity the information from the ground truth answer which would allow for their confident verification. Meanwhile, the poor answer carries no useful information in this instance.
In this scenario LMeval has a higher risk of a misjudgement on the ungrounded answer than on the poor answer. This is intuitively coherent, since rejecting a claim in the presence of relevant information is a more demanding and complex task, requiring deeper interpretation of the language, than when there is no relevant information.
LMs tend to overestimate fact truthfulness overall. The human evaluation of factual accuracy in ground truth answers is 100%, i.e. every fact is True (Table 1). This coincides with the lowest ER scores in Table 2, irrespective of the fact verification approach. False positive verifications are almost exclusively responsible for the observed error rates, with all language models demonstrating excellent verification performance when the facts can be directly supported by the given text.
Providing a “not clear” option helps the larger LMs. We observe a reduction of the error rate and a corresponding increase in F1m for claude-3-opus and gpt-4-turbo when we include the option for LMeval to respond with Not clear from the given passage, which is mapped to False as a post-generation step in the invoked function. The helpful mechanism here appears to be that we provide a needed third option to LMeval when the token probability distribution between True and False does not clearly indicate one over the other. Rejecting a statement due to conflicting evidence and rejecting it due to lack of evidence are both valid rejection reasons, yet distinct from each other. Using False to capture both rejection scenarios proves to lead to more false positives than providing LMeval with the option to distinguish between them. The fact that the improvement is only seen in the more capable LMs supports this view, since they are better suited to complex tasks and language comprehension.
Asking for citations helps in the presence of correct and clear information but can also lead to false positives otherwise. The positive impact of adding citations is most evident in the ground truth answer category, where the provided text always contains the required evidence to support the facts. In this instance, asking LMeval for text evidence results in avoiding some false negative verifications. The beneficial mechanism is associated with inserting a preliminary step into the fact verification process: the explicit use of evidence from the input text. This aligns with the findings reported in Wei et al. (2023) regarding the chain-of-thought method.
Interestingly, for the other answer categories, the citation benefit becomes less clear and even reversed. ER is relatively stable in poor answers but is seen to increase in the case of ungrounded answers when citation arguments are included in FaaF. In detail, we notice the following conflicting effects: firstly, citations can prevent false positives by highlighting the absence of supporting text for a given fact statement when they are left empty by the LM, which is beneficial. Secondly, in other cases they can cause false positives when they contain an indirectly relevant excerpt or an excerpt which only partially supports the fact in question. Consider the following example:
Ungrounded answer:
The human climate niche refers to the range of climatic conditions in which humans can thrive and maintain a sustainable population. It encompasses various factors such as temperature, …
Fact to verify:
The human climate niche refers to the range of climate conditions that have supported human life and activities over the past thousand years
FaaF(T/F)+citation – claude-3-opus:
LM citation: ”The human climate niche refers to the range of climatic conditions in which humans can thrive and maintain a sustainable population.” LM response: True
FaaF(T/F) – claude-3-opus:
LM response: False
Human:
Manual annotation: False
In the example above, only part of the fact statement can be supported from the ungrounded answer (i.e. there is no evidence that the human climate niche refers to the past thousand years). By asking for the citation, the LM captures the partially supporting excerpt and concludes that the fact is True (which is a false positive) but when the same LM verifies the fact without citation, it correctly rejects it.
It is the net effect of the above competing behaviours which determines the impact of adding citations to the overall evaluation performance. Further, results show that Claude LMs are more sensitive to the adverse effects of citations compared to the GPT family, as seen in ungrounded answers (Table 2).
Claude LMs are more reliable than GPT in correctly formatting the response for function calling. The capacity of an LM to return a correctly formatted response for function calling is distinct from its ability to perform accurate fact verification. In this work we allowed only one chance to respond correctly (regarding format) for all experiments, and we report the cases of failed responses for all LMs in the N/A columns in Table 2. The ER and F1micro metrics are calculated considering only the correctly formatted LM responses.
As evident in Table 2, the XML format used by Claude LMs results in more reliable response formatting compared to the JSON format used by GPT models. Claude LMs returned a correctly formatted response 100% of the time when using FaaF, whereas GPT LMs produced some failed attempts. The failure mechanism in GPT models appears to be directly related to the citations mode: all formatting failures are seen in FaaF+citation formulations (Table 2). Looking more closely, we find that when the citation of a fact is null, there is a risk that the LM will return null in the fact verification argument as well (although only True or False are accepted according to the type annotations of the function object definition), which results in a failed invocation of the FaaF function object. In the case of gpt-4-turbo, the above failure mechanism is much less pronounced compared to gpt-3.5-turbo, which shows up to a 25% failure rate (71/281) in returning a correctly formatted response on the poor answers, where the citations are expected to be null most of the time.
We attempted to include the mistral-large LM in our study but the failure rates associated with response formatting were prohibitive (80% in some cases). It is worth noting that the failure mechanisms of mistral-large were different from what we observed in GPT LMs. In detail, on many occasions mistral-large failed to return all the expected arguments in its response, and in other cases, instead of returning the expected JSON with the function arguments, it responded with a long-form text which included the function arguments as part of the natural language response (which led to failure when parsing).
FaaF requires less than one-fifth of the calls to the LM to perform fact verification compared to prompting. This corresponds to the average number of facts which are examined for each text (answer type) in WikiEvalFacts. Thus, the degree of efficiency improvement using FaaF is proportional to the number of facts we can encapsulate in the function object, thereby avoiding their individual verification.
Table 3 presents the token counts for the tested verification formulations and model families. The token counts include the tokens used for all the answer types in the WikiEvalFacts dataset in each scenario. Only one model from each of the GPT and Claude families is included, since the token count differences between models in the same family are not significant. Lastly, it should be noted that although we can constrain the completion tokens via the LM parameters, we chose not to do so, in order to examine the token usage in the unconstrained scenario.
GPT LMs using the JSON format are significantly more efficient than Claude with XML in token usage. Considering the differences between gpt-4-turbo and claude-3-opus, the observed increase in prompt and completion tokens in the FaaF approaches is associated with the tags used in the XML format expected by Claude function calling (versus the more succinct JSON format expected by GPT). In the case of prompt tokens, the one-shot system prompt (which also contains XML tags) used with claude-3-opus further contributes to the token count. A corresponding difference in speed is also noted, with GPT LMs being faster than Claude.
Focusing on the completion tokens, the sharp increase (more than 4X) between gpt-4-turbo and claude-3-opus seen in the case of prompt-based verification is attributed to the extra verbosity of claude-3-opus. When prompted to verify a single fact, gpt-4-turbo most of the time returns a short phrase containing True/False. Conversely, claude-3-opus always returns a coherent explanation to justify the verification. It should be noted that the same prompt was used for both models.
FaaF significantly reduces token usage, requiring fewer than one-fifth of the completion tokens needed for prompt-based verification. Following the trend seen in the required LM calls, the most notable reduction in token count is achieved when replacing prompt-based verification with FaaF(T/F). In this scenario, gpt-4-turbo requires half of the prompt tokens and less than one-fifth of the completion tokens, and claude-3-opus requires two-thirds of the prompt tokens and less than one-sixth of the completion tokens, for a complete evaluation on WikiEval (all answer types). Including citations and the response option Not clear from the given passage progressively increases the token count due to the additional information we include in the LM verification. However, in most cases, it still remains below the token requirements of prompt-based fact verification.
6 Conclusions & future work
We show that prompt-based fact verification is prone to overestimating the truthfulness of fact statements in texts with inaccurate and/or missing information. In particular, the error rate of prompting can exceed 70% when the text under review is significantly lacking in information. In such challenging situations, presenting the facts as a function (FaaF) to the language model significantly enhances its ability to verify facts accurately. The improvement comes from leveraging a more structured generation mode of the language model. This is achieved by generating type-annotated function call arguments instead of natural language text. Additionally, it avoids the unreliable method of exact word matching to parse verification responses, a common issue in prompt-based verification methods.
Using FaaF, we observe that texts with tangentially relevant and inaccurate information are more likely to cause false positives than texts with missing or incomplete information. By testing various configurations of FaaF, we find that including a “not clear” option alongside the True/False dichotomy helps the larger LMs. The impact of asking for citations before fact verification is sensitive to the quality and coverage of information in the examined text and is not beneficial in many cases.
Additionally, we report significant cost and time efficiency improvements when moving from prompting to FaaF fact representations. Generally, using FaaF reduces both the number of calls to the LM and the number of tokens needed for fact verification by a multiple.
We examine the latest commercial language models and find significant improvement in fact verification accuracy between the larger LMs (gpt-4-turbo and claude-3-opus) and their smaller and faster counterparts (gpt-3.5-turbo and claude-3-sonnet).
GPT models using the JSON function representation show a sensitivity in returning a correctly formatted response for function calling when the LM is asked to return citations to support fact verification. In contrast, Claude models using the XML function representation are more reliable in this regard, consistently delivering correctly formatted responses.
Language model claude-3-opus using XML slightly outperformed gpt-4-turbo using JSON in fact verification accuracy using FaaF but required significantly more tokens to achieve that due to the more verbose format of XML tags compared to JSON.
Limitations
While the advantages of using function calls for fact verification are significant, a certain variance is expected in the results due to the stochastic nature of LM generation. Further extensive testing is necessary to solidify the results presented, especially probing into the influence of using XML or JSON as the means of passing the function to the LM on the performance of verification using function objects. Although the WikiEval dataset has highlighted the importance of testing fact verification in challenging conditions and provided a convenient way to compare fact verification performance across various text qualities, it is relatively small, comprising only 50 question/answer pairs.
Additionally, the results shown with FaaF are sensitive to the instructions passed in the function object's metadata, which highlights the need for additional research and optimisation of FaaF configurations and of the interplay of the function arguments' metadata. One example of such interplay is the observation that in GPT models using JSON, including arguments for citation in the FaaF object caused a rise in type validation errors due to the LM confusing the different type annotations of different arguments. Other open questions include the maximum number of fact statements and the maximum permissible length of a fact that can be incorporated into a function object, and whether token count is the sole limitation or whether there are performance implications as well.
Acknowledgements
This research was supported by IMMO Capital, London UK.
References
Cuconasu et al. (2024). Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. 2024. The power of noise: Redefining retrieval for RAG systems.
Kadavath et al. (2022). Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, and Jared Kaplan. 2022. Language models (mostly) know what they know.
Kwiatkowski et al. (2019). Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
Zhang et al. (2023). Tianhua Zhang, Hongyin Luo, Yung-Sung Chuang, Wei Fang, Luc Gaitskell, Thomas Hartvigsen, Xixin Wu, Danny Fox, Helen Meng, and James Glass. 2023. Interpretable unified language checking.