We are pleased to announce our latest research paper, which explores how to train reasoning Large Language Models (LLMs) to reason effectively about their own uncertainty. This work is particularly relevant in high-stakes domains such as healthcare and law, where reliability is critical.
Our study shows that while O1-style reasoning training improves accuracy, it also produces overconfident models that are prone to hallucination.
We present RLCR (Reinforcement Learning with Calibration Rewards), a straightforward RL method that trains LLMs both to reason and to reflect on their uncertainty, improving accuracy and calibration alike.
In high-stakes scenarios, LLMs must not only perform well but also communicate when they are uncertain. Current RL methods (RLVR, reinforcement learning with verifiable rewards) reward only correctness; they ignore the model's confidence in its solutions and so inadvertently encourage guessing.
Our proposal: reward models both for answering correctly and for accurately assessing how certain they are. By building uncertainty reasoning into training, we teach models to reflect on their uncertainty while solving a task. Only a small change to the reward function is needed for RLCR to improve model performance.
The RLCR model reasons about both the task and its uncertainty, ultimately providing an answer along with a confidence score. We have designed a calibrated reward that consists of two components:
- Correctness: Is the answer accurate?
- Calibration: Does the confidence level accurately reflect the correctness of the answer?
This approach mitigates the risk of the model being confidently incorrect.
Our findings indicate that RLCR achieves a strong balance of accuracy and calibration; notably, calibration is gained without compromising accuracy. The method is compatible with any bounded proper scoring rule, and we use the Brier score.
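To make the two-part reward concrete, here is a minimal sketch of a correctness term combined with a Brier-score calibration term, assuming the simple additive form described above; the exact formulation and scaling used in the paper may differ.

```python
def rlcr_reward(answer_correct: bool, confidence: float) -> float:
    """Calibrated reward sketch: a binary correctness term plus a
    Brier-score calibration term.

    `confidence` is the model's self-reported probability (in [0, 1])
    that its final answer is correct.
    """
    correctness = 1.0 if answer_correct else 0.0
    # The Brier score penalizes the squared gap between the stated
    # confidence and the actual outcome; since both terms are bounded,
    # the total reward stays bounded as well.
    brier_penalty = (confidence - correctness) ** 2
    return correctness - brier_penalty

# A confidently correct answer earns nearly the maximum reward,
# while a confidently wrong answer is penalized most heavily.
print(rlcr_reward(True, 0.9))   # 0.99
print(rlcr_reward(False, 0.9))  # -0.81
print(rlcr_reward(False, 0.1))  # -0.01
```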
The effectiveness of RLCR has been validated across diverse question-answering (QA) and mathematical benchmarks, both in-domain and out-of-distribution (OOD). Key outcomes include:
- Accuracy: Maintained or improved compared to RL baselines.
- Calibration Error: Reduced by up to 90% (one standard calibration metric is sketched after this list).
- Performance: Outperforms post-hoc classifiers and probes focused on calibration.
- RLVR Impact: RLVR's calibration degrades in OOD scenarios.
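For context, the sketch below shows one standard way to measure calibration error, a binned expected calibration error (ECE). This is an illustration only; the specific metric and binning used in the paper may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE sketch: for each confidence bin, take the absolute gap
    between average confidence and empirical accuracy, then average the
    gaps weighted by how many samples fall in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence into one of n_bins equal-width bins on [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Hypothetical usage with three predictions; lower is better.
print(expected_calibration_error([0.9, 0.8, 0.3], [1, 1, 0]))  # ~0.2
```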
We also explored how confidence scores can be used at test time. Integrating them into test-time scaling algorithms can improve performance. The methods evaluated include:
- Max-confidence: Selecting the most confident response.
- Confidence-weighted Majority Vote: Weighting votes according to confidence levels.
The result is improved accuracy and calibration as more test-time compute is used.
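As an illustration of these two strategies, the sketch below aggregates hypothetical sampled (answer, confidence) pairs; the exact aggregation details in the paper may differ.

```python
from collections import defaultdict

def max_confidence(samples):
    """Return the answer from the single most confident sample."""
    return max(samples, key=lambda s: s[1])[0]

def confidence_weighted_vote(samples):
    """Confidence-weighted majority vote: each sample's vote counts with
    weight equal to its self-reported confidence, and the answer with
    the largest total weight wins."""
    weights = defaultdict(float)
    for answer, confidence in samples:
        weights[answer] += confidence
    return max(weights, key=weights.get)

# Hypothetical usage with three sampled responses to the same question.
samples = [("42", 0.9), ("41", 0.6), ("42", 0.7)]
print(max_confidence(samples))            # "42"
print(confidence_weighted_vote(samples))  # "42" (weight 1.6 vs. 0.6)
```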
We conducted additional tests to determine whether explicit reasoning about uncertainty (in the chain of thought, or CoT) is beneficial. We trained two classifiers:
- Baseline: Focused solely on solutions and answers.
- Analysis: Evaluated solutions, answers, and uncertainty CoT.
The analysis classifier outperformed the baseline at smaller model sizes.
In conclusion, RLCR successfully develops reasoning LLMs that not only solve problems but also possess the capability to reason about their uncertainties.
For further details, please refer to our paper: arXiv:2507.16806.
This research was conducted in collaboration with @ishapuri101, @StewartSlocum1, @IdanShenfeld, @LChoshen, @yoonrkim, and @jacobandreas.