
@m0o0scar
Last active October 16, 2024 17:26
Self-Taught Evaluators. Continue this conversation at https://readfm.vercel.app?gist=f9bee8bf769f4ff9ca7482f02de2701a

[arxiv] Self-Taught Evaluators

Source

Authors: Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li

Published on: 5 Aug 2024

Abstract: Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.

URL: https://huggingface.co/papers/2408.02666

Conversation

▲ Please summarize the paper into bullet points.

Here is a summary of the paper in bullet points:

  • Problem: Training strong evaluators for LLMs requires a large amount of human-annotated preference data, which is costly and becomes outdated as models improve.
  • Solution: The paper proposes a self-taught evaluator approach that uses synthetic training data generated iteratively without human annotations.
  • Method:
    • Initialization: Start with a large set of human-written instructions and a seed LLM.
    • Instruction Selection: Select a subset of challenging instructions by categorizing them using an LLM.
    • Response Pair Construction: For each instruction, generate two responses: a "winning" response and a "losing" response, using prompting techniques to ensure the losing response is inferior.
    • Iterative Training:
      • Judgment Annotation: Use the current LLM-as-a-Judge model to generate reasoning traces and judgments for the response pairs. Discard examples where the judgment is incorrect.
      • Model Fine-tuning: Fine-tune the LLM-as-a-Judge model on the labeled data, creating a more accurate model for the next iteration.
  • Results:
    • The self-taught evaluator significantly improves the accuracy of a strong LLM (Llama3-70B-Instruct) on RewardBench, from 75.4 to 88.3 (88.7 with majority vote).
    • This outperforms commonly used LLM judges such as GPT-4 and matches the performance of top-performing reward models trained with labeled examples.
    • The approach also shows promising results on MT-Bench and HelpSteer2.
  • Ablations and Analysis:
    • The paper explores the effectiveness of using synthetic data from different sources, including HelpSteer2, GSM8K, coding instructions, and hh_rlhf.
    • It compares synthetic data generation strategies and finds that generating a similar instruction and using its good response as the rejected answer is more effective than directly prompting for a bad response to the original instruction.
    • It also investigates the impact of combining synthetic and human-labeled preference data, finding that the models retain strong performance across different data mixing weights.
  • Limitations:
    • LLM-as-a-Judge models have higher inference costs than reward models that simply output a score.
    • The approach relies on a capable instruction-fine-tuned model as a seed.
    • The paper focuses on evaluation accuracy improvements and does not address computational requirements.
    • The study only investigates pairwise evaluation, not single-response evaluation.

Overall, the paper presents a novel and effective approach to training strong evaluators for LLMs without relying on human annotations. This has the potential to significantly reduce the cost and time required for model development and evaluation.

▲ Does this method work for RAG?

The paper focuses on training evaluators for LLMs that generate text responses. While RAG (Retrieval-Augmented Generation) models also rely on evaluation, the paper doesn't explicitly address whether this method would work for RAG.

Here's why it's not a straightforward answer:

  • RAG's unique challenges: RAG models involve both retrieval and generation. Evaluating RAG requires assessing both the quality of the retrieved information and the quality of the generated text. The paper's method focuses on evaluating the generated text, not the retrieval process.
  • Retrieval component: The paper's method relies on generating synthetic responses. It's unclear how this would apply to the retrieval component of RAG. You'd need a way to generate synthetic data that includes both relevant retrieved information and generated text.
  • Evaluation metrics: RAG evaluation often involves metrics that consider both retrieval and generation, such as factual accuracy, relevance, and coherence. The paper's method focuses on pairwise comparison, which might not be the most suitable metric for RAG.

Potential adaptations:

It's possible to adapt the paper's method for RAG, but it would require further research and development. Here are some potential approaches:

  • Synthetic retrieval data: You could generate synthetic data that includes both instructions and relevant documents, allowing the evaluator to assess the retrieval process.
  • Combined evaluation: You could train the evaluator to assess both the retrieved information and the generated text, using metrics that consider both aspects.
  • Multi-stage evaluation: You could separate the evaluation into two stages: one for the retrieval component and one for the generation component.

Conclusion:

While the paper's method is promising for evaluating LLMs, its applicability to RAG requires further investigation. Adapting the method for RAG would likely involve addressing the unique challenges of evaluating both retrieval and generation.

Data Source Content

Published on: 5 Aug 2024

Self-Taught Evaluators

Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li

Meta FAIR
Abstract

Model-based evaluation is at the heart of successful model development – as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.


1 Introduction

Large language models (LLMs) rely on strong evaluators at every stage of the development lifecycle. They are used at training time as reward models to align with human preferences Bai et al. (2022); Ouyang et al. (2022) or for iterative self-improvement Yuan et al. (2024), and at inference time as an alternative to human evaluation Li et al. (2023); Chiang and Lee (2023); Wang et al. (2023a); Liu et al. (2023). Improvements in evaluation capabilities will thus clearly benefit this entire workflow – including empowering the scientific research process itself as we aim to develop better overall techniques.

Building such strong evaluator models usually relies on large amounts of high-quality preference data from human annotation over model responses, which can be costly and time-consuming to collect, as it requires expert annotation for challenging tasks (e.g., coding and mathematics). This dependency on human-generated data poses significant challenges for scaling to new tasks or evaluation criteria. Furthermore, as new models inevitably improve over older ones, these existing annotations will typically become outdated, as the judgments are based on annotations of older, less performant, model responses.

In this work, we instead explore an iterative self-training approach which uses no human annotated preferences in the training loop, relying purely on synthetically generated data. Given a seed model, our method first uses prompting to generate contrasting synthetic preference pairs for a given input, such that one response is designed to be inferior to the other. Next, using the model as an LLM-as-a-Judge, we generate reasoning traces and judgments for these pairs, which we can label as correct or not given our synthetic preference pair design. After training on this labeled data we obtain a superior LLM-as-a-Judge, from which we can then iterate the whole process in order for it to self-improve.

In our experiments, starting from Llama-3-70B-Instruct, the proposed method improves the accuracy on RewardBench (Lambert et al., 2024) from 75.4 to 88.7 (with majority vote, or 88.3 without). This matches or outperforms reward models derived from the same Llama-3-70B-Instruct model using human annotations; for example, training on the HelpSteer2 dataset (Wang et al., 2024b) of 10k annotations achieves 85.6 in the same LLM-as-a-Judge setup.

2 Related Work

LLM-based Evaluators

While traditional evaluation benchmarks employ automated metrics that require a reference answer Wang et al. (2019); Rajpurkar et al. (2016), these types of benchmarks can pose severe limitations when evaluating open-ended or complex instructions where multiple valid answers are possible (e.g., creative writing and coding). Because human evaluation per response can be costly, many recent works have proposed LLMs as effective evaluators. These come in several flavors: as classifiers that output scores directly (Zhu et al., 2023; Wang et al., 2024a) or via LLM-as-a-Judge prompting that can first generate a chain-of-thought in natural language, which helps provide explanations for judgments (Zheng et al., 2023). Responses can also be scored alone (Kim et al., 2023) or pairwise relative to each other (Dubois et al., 2023; Li et al., 2023; Bai et al., 2023; Saha et al., 2024). LLM evaluation shows great promise as a scalable proxy for human raters, and in the case of LLM-as-a-Judge as an explainable proxy as well Ye et al. (2024); Zheng et al. (2023). However, many of these “off-the-shelf” evaluators demonstrate high variance across many tasks Bavaresco et al. (2024), indicating the need for improved methods.

Synthetic Data

Synthetic data has emerged as a promising solution for efficiently acquiring training examples and can be particularly valuable in settings where real-world data can be hard to access (e.g., weather data covering all conditions Lam et al. (2023)) or where correct annotations can be challenging to acquire (e.g., coding tasks Liu et al. (2024)). Additionally, synthetic data has the benefit of being easily customizable to specific requirements, such as different evaluation criteria or safety constraints Kim et al. (2023); El Emam et al. (2020); Howe et al. (2017). The use of synthetic data has been beneficial in model alignment Lee et al. (2023), improving the original model’s capabilities Yuan et al. (2024); Li et al. (2024a); Yu et al. (2024); Li et al. (2024b), and teaching the model new skills Schick et al. (2023); Lanchantin et al. (2023). In the context of evaluation, synthetic data has been used to measure tasks such as factuality Wei et al. (2024); Feng et al. (2023), safety Perez et al. (2023); Hubinger et al. (2024), coding Gu et al. (2024), and general instruction following Zeng et al. (2024), showing strong correlation with real human judgments. The West-of-n approach (Pace et al., 2024) has been used to improve reward models by constructing preference pairs using the best and worst scoring pairs from an initial model. For LLM-as-a-Judge models specifically, synthetic responses have been generated by prompting the LLM to produce a response of a given quality (Kim et al., 2023).

3 Method

We consider the setting of pairwise evaluation using the LLM-as-a-Judge approach (Zheng et al., 2023) that takes:

an input (user instruction) $x$;

two possible assistant responses $y^{(A)}$ and $y^{(B)}$ to the user instruction $x$; and

the evaluation prompt containing the rubric and asking to evaluate and choose the winning answer, see e.g., Figure 8.

The goal of the LLM-as-a-Judge model is to output a preference of which response $y$ is better: A or B. In order to do this it is common to output, prior to the final judgment, a chain-of-thought (or “reasoning chain”), which is a set of steps generated in natural language that helps the model decide its final judgment.
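
To make this interface concrete, below is a minimal sketch of a single pairwise judge call: a prompt presenting the instruction and both responses, followed by parsing of the final verdict. The prompt wording, the "[[A]]"/"[[B]]" verdict markers, and the `generate` callable are illustrative assumptions rather than the paper's exact template (its template is shown in Figure 8).

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical judge prompt; the paper's actual rubric is given in its Figure 8.
JUDGE_TEMPLATE = """Please act as an impartial judge and evaluate which of the two
responses to the user instruction below is better. Think step by step, then output
your final verdict strictly as "[[A]]" or "[[B]]".

[Instruction]
{instruction}

[Response A]
{response_a}

[Response B]
{response_b}
"""

def judge_pair(instruction: str, response_a: str, response_b: str,
               generate: Callable[[str], str]) -> Tuple[str, Optional[str]]:
    """Run one LLM-as-a-Judge generation and parse the verdict ('A', 'B', or None)."""
    judgment = generate(JUDGE_TEMPLATE.format(
        instruction=instruction, response_a=response_a, response_b=response_b))
    match = re.search(r"\[\[([AB])\]\]", judgment)
    return judgment, (match.group(1) if match else None)
```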

Such models can be used as pairwise reward models to build training data for preference optimization, e.g., for training methods like DPO (Rafailov et al., 2023), Iterative DPO (Xu et al., 2023) and Self-Rewarding methods (Yuan et al., 2024). They can also be used for evaluation; e.g., many popular benchmark leaderboards are built by using a fixed LLM-as-a-Judge evaluation model Li et al. (2023) such as GPT4 Achiam et al. (2023).

We propose a novel recipe for training such an evaluator. Our overall method is an iterative training scheme that bootstraps improvements by annotating the current model’s judgments using constructed synthetic data – so that the Self-Taught Evaluator is more performant on the next iteration.

Our overall pipeline is thus as follows:

Initialization: We assume access to a large set of human-written user instructions, e.g., of the type that is commonly collected in production systems, and an initial seed LLM.

Instruction Selection: We next select a challenging, balanced distribution of user instructions from the uncurated set by categorizing them via LLM.

Response Pair Construction: For each user instruction (example) we create a preference pair of two model responses (chosen & rejected), generating them via prompting such that the rejected response is likely of lower quality than the chosen response.

Iterative Training: We then iterate the following two steps:

  1. Judgment Annotation: For each example, we sample from the current model up to $N$ LLM-as-a-Judge generated reasoning traces and judgments. If we find a correct judgment we add that example to our training set; otherwise we discard it.

  2. Model Fine-tuning: We fine-tune the model on the newly constructed training set, which yields an updated model for the next iteration.

Note that in each iteration of training the size of the training set depends on the quality of the current model. We expect that as the model improves, the size of the training set will increase as well, as the model will be able to find more correct judgments, giving the model a kind of automatic curriculum.
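
As a rough illustration of this loop, the sketch below wires the steps together; `build_pair`, `annotate`, and `finetune_from_seed` are hypothetical stand-ins for Subsections 3.3, 3.4, and 3.5, and the code is meant to show control flow only, not the paper's implementation.

```python
from typing import Callable, List

def self_taught_evaluator(
    instructions: List[str],
    seed_model,
    build_pair: Callable,          # Sec. 3.3: (x, model) -> (x, y_A, y_B, winner)
    annotate: Callable,            # Sec. 3.4: (model, pair, n) -> judgment or None
    finetune_from_seed: Callable,  # Sec. 3.5: (seed_model, train_set) -> new model
    iterations: int = 5,
    n_samples: int = 15,
):
    """Iteratively bootstrap an LLM-as-a-Judge from synthetic preference pairs."""
    model = seed_model
    for _ in range(iterations):
        train_set = []
        for x in instructions:
            pair = build_pair(x, model)           # chosen vs. rejected responses
            judgment = annotate(model, pair, n_samples)
            if judgment is not None:              # keep only correctly judged examples
                train_set.append((pair, judgment))
        # Each iteration fine-tunes starting again from the seed model.
        model = finetune_from_seed(seed_model, train_set)
    return model
```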

We next describe each of these steps in detail.

3.1 Initialization

We assume we have access to a pool of user instructions $\{x_i\}$. Each sample $x_i$ can either be one single text instruction or a multi-turn dialog history of turns between the user and the assistant, with the last turn being an instruction or question from the user. Instructions typically involve different skills such as general knowledge and reasoning, coding, safety, and mathematical reasoning.

3.2 Instruction Selection

Given a pool of human-written user instructions, there may be a large degree of noise, as well as an imbalance in terms of topic, variety, difficulty, and ability of the model to answer. We therefore aim to select a subset of instructions to generate high-quality synthetic responses and judgments that can be further used for training.

We classify each input using an LLM into a given category, for example coding, reasoning, brainstorming, etc. The precise prompt we use is given in Figure 7. We are then free to select data from within those categories, and to discard certain categories not deemed to be useful for training.
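
A minimal sketch of this selection step, assuming a hypothetical `classify` callable that wraps the categorization prompt of Figure 7 and returns a category name:

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

def select_instructions(
    instructions: Iterable[str],
    classify: Callable[[str], str],
    keep: frozenset = frozenset({"reasoning"}),
) -> List[str]:
    """Bucket instructions by LLM-assigned category and keep only useful ones."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for x in instructions:
        buckets[classify(x).strip().lower()].append(x)
    # Discard categories not deemed useful for training (e.g. keep only "reasoning").
    return [x for category in keep for x in buckets.get(category, [])]
```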

3.3 Response Pair Construction

For each input $x_i$ in our curated training pool, we next generate preference data involving two responses $y_i^{(w)}$ and $y_i^{(l)}$, where $w$ is expected to be preferable (winning) over $l$ (losing). We achieve this by generating the data in a synthetic manner without using human annotation.

Given the instruction $x_i$, we first prompt an instruction-following LLM to generate a baseline response $y_i^{(w)}$ as usual. We then prompt the LLM to generate a “noisy” version of the original instruction, $x'_i = \phi(x_i)$. We do this using the prompt template given in Figure 2, where we ask it to “generate a modified instruction that is highly relevant but not semantically identical to the instruction above from the user.” We then prompt the LLM for a high-quality response $y_i^{(l)}$ to $x'_i$, which would not be a good response for $x_i$. This yields a synthetic preference $y_i^{(w)} \succ y_i^{(l)}$ for the original input $x_i$.

This paired data is then used to construct training examples:

$(x_i, y_i^{(A)}, y_i^{(B)})$

where we randomize the order of whether the winner is $w = A$ or $w = B$, which is important to deal with position bias for LLM-as-a-Judge inference.
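
Putting this subsection together, here is a sketch of the pair construction; the `generate` callable and the exact wording of the noisy-instruction prompt are assumptions (the paper's template is in Figure 2):

```python
import random
from typing import Callable, Tuple

NOISY_INSTRUCTION_PROMPT = (
    "Below is the instruction from the user:\n\n{instruction}\n\n"
    "Generate a modified instruction that is highly relevant but not semantically "
    "identical to the instruction above from the user."
)

def build_preference_example(
    x: str,
    generate: Callable[[str], str],
    rng: random.Random = random.Random(),
) -> Tuple[str, str, str, str]:
    """Return (instruction, response_A, response_B, winner_label)."""
    y_w = generate(x)                     # baseline response: the "chosen" answer
    x_noisy = generate(NOISY_INSTRUCTION_PROMPT.format(instruction=x))
    y_l = generate(x_noisy)               # good for x', but a poor answer for x: "rejected"
    # Randomize which side the winner appears on to counter position bias.
    if rng.random() < 0.5:
        return x, y_w, y_l, "A"
    return x, y_l, y_w, "B"
```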

3.4 Judgment Annotation

Our LLM-as-a-Judge model is used to generate evaluation judgments (reasoning chains and verdicts) $\{j_i\}$ for each training example $e_i := (x_i, y_i^{(A)}, y_i^{(B)})$ in the following manner: for a given input $e_i$, we collect $N$ diverse evaluations $\mathcal{J} := \{j_i^1, \ldots, j_i^N\}$ by sampling from the model. We then apply rejection sampling to filter $\mathcal{J}$, removing $j_i^n$ when the final verdict disagrees with the ground-truth labeling derived from Subsection 3.3. We then select a single correct reasoning chain and verdict at random from the pool of correct solutions. If no such judgment exists ($\mathcal{J}$ is empty), we discard the example.

This now allows us to construct our final training examples of synthetic preferences for fine-tuning:

$((x_i, y_i^{(A)}, y_i^{(B)}), j_i).$
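
A sketch of this rejection-sampling step, assuming a judge callable (such as the hypothetical `judge_pair` sketched earlier in this section) that returns both the full judgment text and the parsed verdict:

```python
import random
from typing import Callable, Optional, Tuple

def annotate_example(
    example: Tuple[str, str, str, str],          # (x, y_A, y_B, true_winner)
    judge: Callable[[str, str, str], Tuple[str, Optional[str]]],
    n: int = 15,
    rng: random.Random = random.Random(),
) -> Optional[Tuple[str, str, str, str]]:
    """Sample up to n judgments, keep one whose verdict matches the synthetic
    ground-truth label, or return None to discard the example."""
    x, y_a, y_b, true_winner = example
    correct = []
    for _ in range(n):
        judgment_text, verdict = judge(x, y_a, y_b)
        if verdict == true_winner:
            correct.append(judgment_text)
    if not correct:
        return None
    # Pick a single correct reasoning chain + verdict at random for training.
    return (x, y_a, y_b, rng.choice(correct))
```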

3.5 Model Fine-tuning (Iterative Training)

Our Self-Taught Evaluator (LLM-as-a-Judge model) is first initialized with the seed LLM. The model is then trained in an iterative manner. At each iteration, we annotate the training examples with judgments as described in Subsection 3.4 using the current model, giving training examples $\{(x_i, y_i^{(A)}, y_i^{(B)}, j_i)\}$. These are used to train the next iteration's model by fine-tuning. Note that we initialize from the seed model at each iteration.

4 Experiments

4.1 Experimental Setup

Training.

Our initial model $M_0$ is initialized from Llama3-70B-Instruct. In each iteration $i = 1, \ldots, T$, we use model $M_{i-1}$ from the previous iteration to generate synthetic preferences followed by judgments on the training data, and then fine-tune Llama3-70B-Instruct again. We use the fairseq2 library (Balioglu, 2023) for instruction fine-tuning and vLLM (Kwon et al., 2023) for inference. During training the negative log-likelihood loss is only applied to the evaluation part, i.e., $j_i$, of the training example. Training hyperparameters are provided in Table 7. Model selection is done using a combination of pairwise judgment accuracy and position bias computed over the held-out set. Sampling parameters used for generations are provided in Table 8.
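
The paper trains with fairseq2; as a rough illustration of restricting the loss to the judgment tokens, the sketch below uses the common PyTorch convention of masking ignored positions with -100. This is an assumption about one possible implementation, not the paper's training code.

```python
import torch

def mask_prompt_tokens(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Labels for next-token prediction in which only the judgment part j_i
    contributes to the loss; -100 is the ignore_index of torch.nn.CrossEntropyLoss."""
    labels = input_ids.clone()
    labels[:prompt_len] = -100
    return labels

# Toy example: 10 prompt tokens (instruction + responses) followed by 4 judgment tokens.
ids = torch.arange(14)
labels = mask_prompt_tokens(ids, prompt_len=10)
assert (labels[:10] == -100).all() and torch.equal(labels[10:], ids[10:])
```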

Instructions and Responses.

We start with a large pool of human-written instructions $\{x_i\}$ from the WildChat dataset Zhao et al. (2024). To perform prompt selection, we annotate the category of each instruction with the Mixtral 22Bx8 Instruct model, using the template in Figure 7, and select 20,582 examples in the reasoning category, as we expect these to be challenging inputs. For the selected inputs we generate synthetic responses $y_i^{(w)}$ and $y_i^{(l)}$ using Mixtral 22Bx8 Instruct, following Subsection 3.3 and Figure 2.

Judge Annotation.

For each training example, we sample $N = 15$ judgments from the model $M_{i-1}$ and retain one positive sample $j_i$ per example. Then, over the entire dataset, we sample the same number of examples from each label (“A is better”, “B is better”) to ensure balanced training. Judgments for training $M_0$ were sampled from Mixtral 22Bx8 Instruct, and from the Llama model being trained in all subsequent iterations.

The training data is constructed as $\{(x_i, y_i^{(A)}, y_i^{(B)}, j_i)\}$. We use the standard system prompt from MT-Bench and RewardBench, as shown in Figure 8.

Majority Vote Inference.

As LLM-as-a-Judge uses chain-of-thought reasoning generated by the LLM followed by a verdict, it is known that majority vote inference can yield improvements in these cases (Wang et al., 2023b). At inference time, when evaluating final performance, we sample generations $N$ times and take the final judgment to be the most common verdict.
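
A minimal sketch of majority-vote inference, assuming the $N$ sampled judgments have already been parsed into 'A'/'B' verdicts:

```python
from collections import Counter
from typing import Optional, Sequence

def majority_vote(verdicts: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most common valid verdict among the sampled judgments."""
    votes = Counter(v for v in verdicts if v in ("A", "B"))
    return votes.most_common(1)[0][0] if votes else None

# e.g. 32 sampled judgments for one response pair
print(majority_vote(["A"] * 20 + ["B"] * 11 + [None]))  # -> "A"
```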

4.2 Other Data Sources

To understand the effectiveness of the proposed method, we generate synthetic judgments using the same approach but based on the following data sources:

HelpSteer2 Wang et al. (2024b). We generate evaluations conditioned on the scores of helpfulness, correctness, coherence, complexity, and verbosity provided in the dataset. We use the aggregated score to derive the ground-truth preference for each example, using the recommended weighting $[0.65, 0.8, 0.45, 0.55, -0.4]$ (taken from https://huggingface.co/nvidia/Llama3-70B-SteerLM-RM); a sketch of this derivation is given at the end of this subsection.

GSM8K Cobbe et al. (2021). We sample from an instruction-following model multiple times, taking $y^{(w)}$ to be a response whose final solution agrees with the ground truth and $y^{(l)}$ to be one whose solution does not.

Coding instructions from WildChat. Similar to the “reasoning” prompts we selected from WildChat used in the main experiment, we also experimented with prompts annotated with the “Coding” category.

hh_rlhf Bai et al. (2022). We generate evaluations on the prompts and responses provided in the “harmless_base” training split. Then we take the human preferences provided by the dataset as ground truth and perform rejection sampling to construct judgments.
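
For the HelpSteer2 source above, the following sketch shows how the aggregated preference label could be derived from the five attribute scores with the recommended weighting; the attribute-to-weight mapping is assumed to follow the order listed in the text.

```python
# Recommended weighting from https://huggingface.co/nvidia/Llama3-70B-SteerLM-RM,
# assumed here to map onto the attributes in the order listed above.
WEIGHTS = {
    "helpfulness": 0.65,
    "correctness": 0.8,
    "coherence": 0.45,
    "complexity": 0.55,
    "verbosity": -0.4,
}

def aggregate(scores: dict) -> float:
    """Weighted sum of the five HelpSteer2 attribute scores."""
    return sum(WEIGHTS[attr] * scores[attr] for attr in WEIGHTS)

def preferred(scores_a: dict, scores_b: dict) -> str:
    """Ground-truth preference label derived from the aggregated scores."""
    return "A" if aggregate(scores_a) > aggregate(scores_b) else "B"

print(preferred(
    {"helpfulness": 4, "correctness": 4, "coherence": 4, "complexity": 2, "verbosity": 2},
    {"helpfulness": 3, "correctness": 3, "coherence": 4, "complexity": 2, "verbosity": 3},
))  # -> "A"
```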

4.3 Evaluation

We evaluate the accuracy of our Self-Taught Evaluator model on the following benchmarks:

RewardBench Lambert et al. (2024). We use the standard evaluation protocol provided by the leaderboard.

MT-Bench Zheng et al. (2023). We report agreement rate with human judgments when examples with ties are excluded.

HelpSteer2 Wang et al. (2024b). We evaluate on the validation split.

5 Results

5.1 RewardBench

Results on RewardBench are given in Table 4. We find that our Self-Taught Evaluator, which is trained iteratively on synthetic data without any annotated preference labels, significantly improves over the seed Llama3-70B-Instruct model, matching top-performing reward models trained with labeled data. Our approach improves its results across training iterations, achieving an overall score of 88.3 at iteration 5, while the seed model it starts from obtains 75.4. Training an LLM-as-a-Judge in a similar manner, starting from the same seed but using the labeled HelpSteer2 data, obtains only 85.6; hence we achieve superior performance without using human-labeled data. Compared to the seed model, we observe improvements from our approach in the Chat Hard, Safety, and Reasoning categories, while being worse on the easier Chat category, perhaps because our unlabeled training data focused the model on harder examples.

Improving results further with majority voting

As also shown in Table 4, with 32-sample majority voting, our third iteration of Self-Taught Evaluator model reaches an overall performance of 88.7 on RewardBench, outperforming many other existing reward models.

5.2 MT-Bench

We report results on MT-Bench in Table 4. Unlike RewardBench, the MT-Bench dataset contains tie votes (A and B are considered equally good). Since our models are trained to give binary decisions, we only report the agreement on non-tie examples. For each pair of responses A and B, we test two orders, one where response A appears first and one where response B appears first, and average the results. We find that our Self-Taught Evaluator again outperforms the Llama3-70B-Instruct seed model, and performs on par with GPT4-0125.

5.3 HelpSteer2

Results on the HelpSteer2 validation set are given in Table 4. We report the average accuracy over two orders and three seeds, obtained by swapping the response order in a similar manner, and also report both orders separately (right answer first or second) to test for position bias. We further compute the position-consistent accuracy, treating a judgment as incorrect when the model gives different predictions for the two orderings. We use the human labels from the HelpSteer2 dataset and treat the response with the higher summed score as the better response. We find that our Self-Taught Evaluator method improves both average accuracy and position-consistent accuracy compared to the seed Llama-3-70B-Instruct model.
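
As an illustration of these metrics, the sketch below computes the order-averaged accuracy and the position-consistent accuracy, assuming a judge callable that returns the parsed verdict 'A' or 'B':

```python
from typing import Callable, List, Tuple

def position_metrics(
    examples: List[Tuple[str, str, str]],        # (instruction, better, worse)
    judge: Callable[[str, str, str], str],
) -> Tuple[float, float]:
    """Return (average accuracy over both orders, position-consistent accuracy)."""
    avg_acc = consistent_acc = 0.0
    for x, better, worse in examples:
        first = judge(x, better, worse) == "A"    # better response shown first
        second = judge(x, worse, better) == "B"   # better response shown second
        avg_acc += (first + second) / 2
        # Count as correct only if the judgment is correct under both orderings.
        consistent_acc += float(first and second)
    n = len(examples)
    return avg_acc / n, consistent_acc / n
```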

6 Ablations and Analysis

6.1 Synthetic Data from Other Sources

In Table 4, we compare Self-Taught Evaluator models trained on synthetic preferences constructed from different sources. We find that data sources focusing on different skills, such as coding and mathematical reasoning, are all effective in turning a strong instruction-following LLM into a strong LLM-as-a-Judge. Intuitively, we find that data sources generally improve the categories in RewardBench that are related to their distribution.

6.2 Synthetic Bad Response Generation

In our experiments we generate synthetic data by first generating a similar instruction, and then a good response for the similar instruction – with the aim that this will be a bad response for the original instruction. An alternative is to just prompt an LLM to generate a bad response to the original instruction directly. We use the prompt template given in Figure 10 and otherwise conduct training as before on the same set of reasoning-based instructions. This approach obtains a RewardBench overall score of 80.7, which still works – but is worse than using our proposed approach, which achieves 83.8.

6.3 Comparison of Synthetic Data with Human Annotated Data

We conducted the same iterative training using labeled preference data from HelpSteer2 Wang et al. (2024b) rather than synthetic data. On RewardBench, as shown in Table 6, the improvement from each iteration is smaller and the final model did not outperform iterative training on synthetic preferences. We note that these experiments use the data to train an LLM-as-a-Judge. Other results in the literature have used HelpSteer2 to train classifier-based reward models with slightly better results on RewardBench, e.g., obtaining 88.8 using Llama-3-70B; see Table 4.

6.4 Combining Synthetic and Human Labeled Preference Data

We compare how combining synthetic preference data with human-labeled preference data affects model performance. In particular, we combine synthetic preferences generated from reasoning WildChat prompts with the human-labeled HelpSteer2 dataset (train split) and report performance in Table 6. We compare to first-iteration models trained on a single data source, and select the best checkpoint for joint training using the validation split of HelpSteer2 and a holdout set of synthetic preferences (in-distribution), as well as safety and code synthetic preferences (out-of-distribution). We then report evaluation results on RewardBench. The results show that overall the models retain strong performance across different data mixing weights, with slight improvements in overall accuracy.

6.5 Instruction complexity

We analyze the length distribution of the curated training set of selected instructions in Figure 6. The dataset has a long-tail distribution of input length, with most of the examples less than 500 tokens. In contrast, the full dataset (i.e., the full data before the instruction selection step of Subsection 3.2) has a cluster of very long instructions, containing content such as long-form coding instructions or transcripts.

We further instruct Llama-3-70B-Instruct to infer the complexity (using a score of 1–5) and category of each input instruction, as well as the length of the expected output, following the procedure in Yuan et al. (2024). From Figure 6 and Figure 6, we see that the curated dataset has more complex instructions involving logical reasoning/science whereas the full dataset has a greater proportion focused on relationships and entertainment. Finally, in Figure 6 we see that the anticipated length of the response is higher for the full dataset than the curated one, perhaps because of the greater frequency of lengthy, and sometimes repetitive instructions.

7 Conclusion

We present a scalable approach to build a strong generalist evaluator to perform model-based evaluation of LLM outputs. Our method constructs synthetic preferences over pairs of responses without using any human annotation. Our Self-Taught Evaluator, trained iteratively over these synthetic preferences, greatly boosts the accuracy of a strong seed LLM (Llama3-70B-Instruct) as an evaluator, from 75.4 to 88.7 on RewardBench, a new state of the art for generative LLM-as-a-Judge methods.

8 Limitations

Generative LLM-as-a-Judge models usually have longer outputs and thus higher inference cost than reward models that simply output a score, as LLM-as-a-Judge typically first generates a reasoning chain. Further, we have used relatively large LLMs in this work (70B parameters) and made no study of whether our approach works on smaller models. Since we use a seed model to generate first synthetic preferences during our iterative training scheme, one of the assumptions is that the model is capable of generating reasonable evaluations. Thus, our approach is limited by having a capable instruction fine-tuned model which is already reasonably aligned to human (or legal/policy) preferences. Furthermore, we only investigated and reported metrics involving evaluation accuracy improvements, rather than computational requirement concerns. We also only investigated pairwise evaluation, i.e., comparing two responses, whereas it is also possible to use LLM-as-a-Judge models (or any other model) to evaluate the quality of single responses, e.g., giving them a score out of 5 or 10, rather than a pairwise A vs B judgment. We leave evaluating single responses to future work.

9 Acknowledgements

We thank Jing Xu and Janice Lan for discussions and support in the project overall, and Can Balioglu for his feedback and support in LLM training using the fairseq2 (Balioglu, 2023) library. We thank Nathan Lambert for the help with RewardBench, and Yuntian Deng and Yejin Choi for the help with WildChat.

References

  • Achiam et al. (2023)Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023.GPT-4 technical report.arXiv preprint arXiv:2303.08774.
  • Bai et al. (2022)Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022.Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862.
  • Bai et al. (2023)Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. 2023.Benchmarking foundation models with language-model-as-an-examiner.In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Balioglu (2023)Can Balioglu. 2023.fairseq2.
  • Bavaresco et al. (2024)Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, et al. 2024.LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks.arXiv preprint arXiv:2406.18403.
  • Chiang and Lee (2023)Cheng-Han Chiang and Hung-yi Lee. 2023.Can large language models be an alternative to human evaluations?In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
  • Cobbe et al. (2021)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021.Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168.
  • Dubois et al. (2023)Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori Hashimoto. 2023.AlpacaFarm: A simulation framework for methods that learn from human feedback.In Thirty-seventh Conference on Neural Information Processing Systems.
  • El Emam et al. (2020)Khaled El Emam, Lucy Mosquera, and Richard Hoptroff. 2020.Practical synthetic data generation: balancing privacy and the broad availability of data.O’Reilly Media.
  • Feng et al. (2023)Shangbin Feng, Vidhisha Balachandran, Yuyang Bai, and Yulia Tsvetkov. 2023.FactKB: Generalizable factuality evaluation using language models enhanced with factual knowledge.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 933–952.
  • Gu et al. (2024)Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. 2024.CRUXEval: A benchmark for code reasoning, understanding and execution.arXiv preprint arXiv:2401.03065.
  • Howe et al. (2017)Bill Howe, Julia Stoyanovich, Haoyue Ping, Bernease Herman, and Matt Gee. 2017.Synthetic data for social good.arXiv preprint arXiv:1710.08874.
  • Hubinger et al. (2024)Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M Ziegler, Tim Maxwell, Newton Cheng, et al. 2024.Sleeper agents: Training deceptive llms that persist through safety training.arXiv preprint arXiv:2401.05566.
  • Kim et al. (2023)Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. 2023.Prometheus: Inducing fine-grained evaluation capability in language models.In The Twelfth International Conference on Learning Representations.
  • Kwon et al. (2023)Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023.Efficient memory management for large language model serving with pagedattention.In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  • Lam et al. (2023)Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. 2023.Learning skillful medium-range global weather forecasting.Science (New York, NY), 382(6677):1416–1421.
  • Lambert et al. (2024)Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. 2024.RewardBench: Evaluating reward models for language modeling.arXiv preprint arXiv:2403.13787.
  • Lanchantin et al. (2023)Jack Lanchantin, Shubham Toshniwal, Jason E Weston, Arthur Szlam, and Sainbayar Sukhbaatar. 2023.Learning to reason and memorize with self-notes.In Thirty-seventh Conference on Neural Information Processing Systems.
  • Lee et al. (2023)Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023.RLAIF: Scaling reinforcement learning from human feedback with ai feedback.arXiv preprint arXiv:2309.00267.
  • Li et al. (2024a)Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. 2024a.GSM-Plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers.arXiv preprint arXiv:2402.19255.
  • Li et al. (2024b)Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Omer Levy, Luke Zettlemoyer, Jason E Weston, and Mike Lewis. 2024b.Self-alignment with instruction backtranslation.In The Twelfth International Conference on Learning Representations.
  • Li et al. (2023)Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023.AlpacaEval: An automatic evaluator of instruction-following models.
  • Liu et al. (2024)Changshu Liu, Shizhuo Dylan Zhang, and Reyhaneh Jabbarvand. 2024.CodeMind: A framework to challenge large language models for code reasoning.arXiv preprint arXiv:2402.09664.
  • Liu et al. (2023)Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023.G-eval: NLG evaluation using GPT-4 with better human alignment.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
  • Ouyang et al. (2022)Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022.Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744.
  • Pace et al. (2024)Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, and Aliaksei Severyn. 2024.West-of-N: Synthetic preference generation for improved reward modeling.In ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models.
  • Perez et al. (2023)Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. 2023.Discovering language model behaviors with model-written evaluations.In Findings of the Association for Computational Linguistics: ACL 2023, pages 13387–13434.
  • Rafailov et al. (2023)Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023.Direct preference optimization: Your language model is secretly a reward model.In Thirty-seventh Conference on Neural Information Processing Systems.
  • Rajpurkar et al. (2016)Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016.SQuAD: 100,000+ questions for machine comprehension of text.In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Saha et al. (2024)Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, and Xian Li. 2024.Branch-solve-merge improves large language model evaluation and generation.In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8345–8363.
  • Schick et al. (2023)Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023.Toolformer: Language models can teach themselves to use tools.In Thirty-seventh Conference on Neural Information Processing Systems.
  • Wang et al. (2019)Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019.GLUE: A multi-task benchmark and analysis platform for natural language understanding.In International Conference on Learning Representations.
  • Wang et al. (2024a)Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. 2024a.Interpretable preferences via multi-objective reward modeling and mixture-of-experts.arXiv preprint arXiv:2406.12845.
  • Wang et al. (2023a)Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023a.Is ChatGPT a good NLG evaluator? a preliminary study.In Proceedings of the 4th New Frontiers in Summarization Workshop, pages 1–11, Singapore. Association for Computational Linguistics.
  • Wang et al. (2023b)Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023b.Self-consistency improves chain of thought reasoning in language models.In The Eleventh International Conference on Learning Representations.
  • Wang et al. (2024b)Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev. 2024b.HelpSteer2: Open-source dataset for training top-performing reward models.arXiv preprint arXiv:2406.08673.
  • Wei et al. (2024)Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, et al. 2024.Long-form factuality in large language models.arXiv preprint arXiv:2403.18802.
  • Xu et al. (2023)Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. 2023.Some things are more cringe than others: Iterative preference optimization with the pairwise cringe loss.arXiv preprint arXiv:2312.16682.
  • Ye et al. (2024)Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. 2024.FLASK: Fine-grained language model evaluation based on alignment skill sets.In The Twelfth International Conference on Learning Representations.
  • Yu et al. (2024)Longhui Yu, Weisen Jiang, Han Shi, Jincheng YU, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2024.Metamath: Bootstrap your own mathematical questions for large language models.In The Twelfth International Conference on Learning Representations.
  • Yuan et al. (2024)Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. 2024.Self-rewarding language models.arXiv preprint arXiv:2401.10020.
  • Zeng et al. (2024)Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen. 2024.Evaluating large language models at evaluating instruction following.In The Twelfth International Conference on Learning Representations.
  • Zhao et al. (2024)Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024.WildChat: 1M ChatGPT interaction logs in the wild.In The Twelfth International Conference on Learning Representations.
  • Zheng et al. (2023)Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023.Judging LLM-as-a-judge with MT-bench and chatbot arena.In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Zhu et al. (2023)Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. 2023.Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF.

Appendix A Appendix

A.1 Prompt Templates

We provide the prompt templates used for annotating and selecting instructions (Figure 7), annotating judgments with synthetic preferences (Figure 8), and generating ablation synthetic preference data with bad responses (Figure 10). Figure 9 illustrates a training example based on synthetic preference data.

A.2 More Training and Evaluation Details

{
"url": "https://huggingface.co/papers/2408.02666",
"type": "arxiv",
"title": "Self-Taught Evaluators",
"subtitle": "Authors: Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li\nPublished on: 5 Aug 2024",
"description": "Abstract: Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to im-prove evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples."
}

Shar-01 commented Oct 16, 2024

What is the rationale behind generating contrasting (good and bad) responses, and then asking the model to generate judgements? Why not just select some challenging instructions and ask the model to generate chain-of-thought judgements, sample good judgements and refine the model? Is the idea to let the model reason why one response is inferior or superior to the other one?

If that is so, how is this judgement incorporated in the training set as this is coupled to the pair of responses?
