We are pleased to announce our latest research paper, which explores how to train reasoning Large Language Models (LLMs) to reason effectively about their own uncertainty. This work is particularly relevant in high-stakes domains such as healthcare and law, where reliability is critical.
Our study shows that while O1-style reasoning training improves accuracy, it also produces overconfident models that are prone to hallucination.
We present RLCR (Reinforcement Learning with Calibration Rewards), a straightforward RL method that trains LLMs both to reason and to reflect on their uncertainty, improving accuracy and calibration alike.
In high-stakes scenarios, LLMs must not only perform well but also communicate when they are uncertain. Current RL methods (RLVR, reinforcement learning with verifiable rewards) reward only correctness; they ignore the model's confidence in its solutions and so inadvertently encourage guessing.
Our proposal: reward models both for answering correctly and for accurately assessing how certain they are. By building uncertainty reasoning into training, we teach models to reflect on their uncertainty while solving a task. Only a small change to the reward function is needed for RLCR to improve model performance.
The RLCR model reasons about both the task and its uncertainty, ultimately providing an answer along with a confidence score. We have designed a calibrated reward that consists of two components:
- Correctness: Is the answer accurate?
- Calibration: Does the confidence level accurately reflect the correctness of the answer?
This approach mitigates the risk of the model being confidently incorrect.
Our findings indicate that RLCR achieves a strong balance of accuracy and calibration; notably, calibration is gained without compromising accuracy. The method is compatible with any bounded proper scoring rule, and we use the Brier score.
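To make the two-part reward concrete, here is a minimal sketch of a correctness term combined with a Brier-score calibration term, assuming the simple additive form described above; the exact formulation and scaling used in the paper may differ.

```python
def rlcr_reward(answer_correct: bool, confidence: float) -> float:
    """Calibrated reward sketch: a binary correctness term plus a
    Brier-score calibration term.

    `confidence` is the model's self-reported probability (in [0, 1])
    that its final answer is correct.
    """
    correctness = 1.0 if answer_correct else 0.0
    # The Brier score penalizes the squared gap between the stated
    # confidence and the actual outcome; since both terms are bounded,
    # the total reward stays bounded as well.
    brier_penalty = (confidence - correctness) ** 2
    return correctness - brier_penalty

# A confidently correct answer earns nearly the maximum reward,
# while a confidently wrong answer is penalized most heavily.
print(rlcr_reward(True, 0.9))   # 0.99
print(rlcr_reward(False, 0.9))  # -0.81
print(rlcr_reward(False, 0.1))  # -0.01
```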
The effectiveness of RLCR has been validated across diverse question-answering (QA) and mathematical benchmarks, both in-domain and out-of-distribution (OOD). Key outcomes include:
- Accuracy: Maintained or improved compared to RL baselines.
- Calibration Error: Reduced by up to 90% (one standard calibration metric is sketched after this list).
- Performance: Outperforms post-hoc classifiers and probes focused on calibration.
- RLVR Impact: RLVR's calibration degrades in OOD scenarios.
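For context, the sketch below shows one standard way to measure calibration error, a binned expected calibration error (ECE). This is an illustration only; the specific metric and binning used in the paper may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE sketch: for each confidence bin, take the absolute gap
    between average confidence and empirical accuracy, then average the
    gaps weighted by how many samples fall in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence into one of n_bins equal-width bins on [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Hypothetical usage with three predictions; lower is better.
print(expected_calibration_error([0.9, 0.8, 0.3], [1, 1, 0]))  # ~0.2
```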
We also explored how confidence scores can be used at test time. Integrating them into test-time scaling algorithms can improve performance. The methods evaluated include:
- Max-confidence: Selecting the most confident response.
- Confidence-weighted Majority Vote: Weighting votes according to confidence levels.
The result is improved accuracy and calibration as more test-time compute is used.
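As an illustration of these two strategies, the sketch below aggregates hypothetical sampled (answer, confidence) pairs; the exact aggregation details in the paper may differ.

```python
from collections import defaultdict

def max_confidence(samples):
    """Return the answer from the single most confident sample."""
    return max(samples, key=lambda s: s[1])[0]

def confidence_weighted_vote(samples):
    """Confidence-weighted majority vote: each sample's vote counts with
    weight equal to its self-reported confidence, and the answer with
    the largest total weight wins."""
    weights = defaultdict(float)
    for answer, confidence in samples:
        weights[answer] += confidence
    return max(weights, key=weights.get)

# Hypothetical usage with three sampled responses to the same question.
samples = [("42", 0.9), ("41", 0.6), ("42", 0.7)]
print(max_confidence(samples))            # "42"
print(confidence_weighted_vote(samples))  # "42" (weight 1.6 vs. 0.6)
```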
We conducted additional tests to determine whether explicit reasoning about uncertainty (in the chain of thought, or CoT) is beneficial. We trained two classifiers:
- Baseline: Focused solely on solutions and answers.
- Analysis: Evaluated solutions, answers, and uncertainty CoT.
The analysis classifier outperformed the baseline at smaller model sizes.
In conclusion, RLCR successfully develops reasoning LLMs that not only solve problems but also possess the capability to reason about their uncertainties.
For further details, please refer to our paper: arXiv:2507.16806.
This research was conducted in collaboration with @ishapuri101, @StewartSlocum1, @IdanShenfeld, @LChoshen, @yoonrkim, and @jacobandreas.