My reading notes for the paper "Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought"
Created: 2025-01-10
Aliases: Towards System 2 Reasoning in LLMs
References: Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

100 pages total, ~51 pages main paper; a collaboration between SynthLabs.ai, Stanford & UC Berkeley.

Survey & position paper, augmented with the authors' own experiments, a plan for training reasoning models, and observations/speculations about o1 and other reasoning models.

Contributions:

  • Show limitations of CoT (data-generation distribution mismatch; distinguish between answer and solution), p. 9
  • Define MetaCoT, e.g. "modeling the underlying reasoning": internalized in-context search with verification & backtracking (non-linear), p. 6
  • Extensive survey of papers in the self-improvement and reasoning field
  • Experiment: variable-length revision, p. 19
  • Experiment: prompting-based MetaCoT (unfortunately without publishing the prompts), pp. 35, 38
  • Analysis of current "advanced" reasoning models' in-context search behavior (backtracking & self-verification), pp. 24-25
  • Pipeline for training a reasoning model, p. 39
  • Present the BigMath dataset project and its composition, p. 44
  • List open research questions, p. 47

Main topics:

  • self-improvement via RL & synthetic data with verifiers or PRMs
  • linearization of thinking/searching
  • internalization of search traces (e.g. Searchformer as an example of improvement from training on search traces)
  • exploration/inference time scaling with verifiers & backtracking and self-correction
  • search strategies: MCTS is costly; best-first (A*) results in less branching

Meta-CoT

  • 1 Introduction

    • "compression is intelligence", implicit reasoning in llms to predict next tokens
    • show math expression which can be simplified to 1 and CoT how to arrive there, which can be induced through CoT prompting "think stepd by step"
    • CoT allow to use large amounts of compute to predict the answer tokens
    • claim: "traditional CoT methods fail to capture the true data-generation process"
    • exploration & verification: non-linear, iterative
    • MetaCoT aims to model the latent thinking process
    • System 2 reasoning internalized in a single auto-regressive model
    • in-context search
    • propose pipeline for training MetaCoT models with instruction tuning with "linearized search traces" and RL
    • discuss open research questions
    • open question: can LLMs with CoT express any function?
  • 2 Meta CoT

    • claim: training data does not represent the true data generation process
    • data generation process is "not auto-regressive" for complex reasoning problems
    • e.g. textbooks contain advanced proofs but not the full thought process of deriving these proofs
    • they show the windmill math problem, for which classic approaches do not directly lead to a solution
    • 2.2 they argue that OpenAI o1 performs full MetaCoT in an auto-regressive fashion at inference time
    • (interestingly, Llama 3.1 achieves high scores on harder math problems)
    • MetaCoT data is missing or limited in text corpora
  • 3 Search

    • what is the true data generating process?
    • generation-verification gap (it is harder to generate a solution than to verify one); solutions presented in text corpora are outcomes of an extended search process which is not represented in the data
    • 3.1 inference-time compute: they show results of an SFT & pass@k oracle-verifier experiment with Llama 3.1 8B on MATH
    • 3.2 verification: a verifier model generates scores for candidate solutions; explicitly trained verifier models outperform naive inference-compute scaling strategies (like self-consistency or majority voting).
    • 3.3 best-of-N approach / ranking solutions is inefficient because it always explores the full solution path even if a mistake occurs early on, and sampling of correct steps might be redundant
      • reasoning can be represented as an MDP: the LLM is the policy, actions represent the next reasoning step, the state is the prompt + generation so far, positive reward for a correct final solution (zero otherwise); a toy sketch follows after these Section 3 notes
      • with access to a PRM the agent can backtrack and implement a tree-search procedure with intermediate heuristic function
      • tree search mainly improves efficiency (same capabilities as parallel search)
      • intuition: Larger models are more capable of internalizing the MetaCoT process
      • for AlphaZero (MCTS) there is a log-log scaling trade-off between train & test time, only limited data for LLMs, more ablations needed (model size, data scaling)
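    The MDP-plus-PRM view from 3.3 maps directly onto a best-first (A*-like) search over partial solutions. Below is a minimal sketch of that idea, not code from the paper; propose_steps, prm_score, and is_terminal are hypothetical stand-ins for an LLM policy, a trained PRM, and a final-answer check.

    ```python
    import heapq
    from typing import Callable, List, Optional, Tuple

    # State = (problem prompt, tuple of reasoning steps generated so far)
    State = Tuple[str, Tuple[str, ...]]

    def best_first_search(
        prompt: str,
        propose_steps: Callable[[State], List[str]],  # LLM policy: candidate next steps
        prm_score: Callable[[State], float],          # PRM heuristic: value of a partial solution
        is_terminal: Callable[[State], bool],         # does the state contain a final answer?
        max_expansions: int = 100,
    ) -> Optional[State]:
        start: State = (prompt, ())
        frontier: List[Tuple[float, State]] = [(-prm_score(start), start)]
        for _ in range(max_expansions):
            if not frontier:
                return None
            _, state = heapq.heappop(frontier)        # expand the most promising partial solution
            if is_terminal(state):
                return state                          # candidate full solution found
            for step in propose_steps(state):         # branch on alternative next steps
                child = (state[0], state[1] + (step,))
                heapq.heappush(frontier, (-prm_score(child), child))
        return None
    ```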
  • 4 Towards MetaCoT

    • search fundamental building block of MetaCoT, integrating generator, verifier and search
    • MetaCoT is an optimization over algorithms rather than specific outputs
    • goal: train a model to internalize a reasoning system
    • 4.1 bootstrapping methods:
      • STaR: generate data, keep rationales that led to a correct final answer, build a synthetic dataset through sampling & verification
      • Meta-STaR would use a "general search procedure" over intermediate steps, filter for correct solutions, linearize the search procedure, and train on it (sketch below)
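    A minimal sketch of the STaR sampling-and-filtering loop, and where Meta-STaR would differ; sample_cot, final_answer, and finetune are hypothetical helpers, not the authors' code.

    ```python
    # STaR-style bootstrapping: sample rationales, keep those whose final answer
    # verifies against the gold label, fine-tune on the filtered synthetic data.
    # Meta-STaR would replace `sample_cot` with a general search procedure and
    # linearize the resulting search trace before training on it.
    def star_iteration(problems, sample_cot, final_answer, finetune, k=4):
        """problems: list of (question, gold_answer) pairs; returns the fine-tuned model."""
        kept = []
        for question, gold in problems:
            for _ in range(k):                          # sample up to k rationales per problem
                rationale = sample_cot(question)
                if final_answer(rationale) == gold:     # keep only verified rationales
                    kept.append((question, rationale))
                    break
        return finetune(kept)                           # train on the filtered synthetic dataset
    ```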
    • 4.2 empirical examples of internalizing search
      • Searchformer: small-scale experiments on mazes; trained with search traces, the search-augmented transformer outperforms solution-only training; as maze size increases, the gap between search-augmented and zero-shot widens; for low-complexity problems, models are capable of internalizing the reasoning.
      • 250M transformer on the Countdown problem; log-linear relation between tokens spent and success rate
      • Schultz et al. 2024: linearized search traces from MCTS on board games (similar to o1?)
      • 4.2.2 in-context exploration & backtracking: examples from the literature pose the problem as sequential sampling with self-correction, similar to Best-of-N but more efficient than parallel sampling; the authors repeated an experiment (Scaling LLM Test-Time Compute / RISE: Recursive Introspection) with a variable number of incorrect solutions (1-7); interestingly, the model uses more tries for harder problems and seems to internally classify correctness (in training, EOS always follows the correct result; see the revision-format sketch after this sub-list).
      • (they find that SFT on the post-RL Llama 3.1 8B Instruct always worsened performance)
      • 4.2.4 backtracking: can AR LLMs learn to terminate a CoT and reset to a previously visited state? LLMs sometimes recognize errors but need full fine-tuning to learn correction (Physics of LLMs Part 2.2); it does not work with PEFT.
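    A toy sketch of how such a variable-length revision sequence could be linearized for training; the marker text and helper names are assumptions, not the paper's actual data format.

    ```python
    # Linearize a sequence of attempts into one training string: incorrect attempts
    # are followed by a correction marker, and the sequence ends right after the
    # first correct attempt ("EOS always after correct result").
    def build_revision_example(problem, attempts, is_correct, max_revisions=7):
        parts = [problem]
        for attempt in attempts[:max_revisions + 1]:
            parts.append(attempt)
            if is_correct(attempt):
                parts.append("<eos>")                                # stop after the correct solution
                break
            parts.append("Wait, this is wrong. Let me try again.")   # hypothetical revision marker
        return "\n".join(parts)
    ```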
    • 4.3 Synth Meta-CoT via search: They consider MCTS & A*
      • MCTS details in Appendix D; pure MC rollouts are very costly (~20M tokens per tree, ~$100, ~0.5 h)
      • using A* best-first search with a depth limit generates less backtracking
    • 4.4 do reasoning LLMs (o1, DeepSeek R1, Gemini Flash Thinking, QwQ) implement in-context search? They analyze reasoning traces:
      • o1: flow is inconsistent; carries out backtracking (returning to the same logical points), often repeats logical steps.
      • R1: carries out many self-evaluation steps (self-criticism, like a generative verifier); smooth logical flow, no abrupt backtracking
      • Flash Thinking: often returns to the initial state, attempts a full solution and terminates based on a heuristic; does not branch at the step level (revision strategy)
      • QwQ: similar to Flash Thinking?
  • 5 Process Supervision: process reward models (PRMs); how to get labels; one example: use ORM (outcome) data only

    • 5.2 efficiency of test-time search depends on the quality of the PRM; PRM scaling: the scale of the dataset matters, quality improves with more data
    • 5.3 open-ended problems: MC rollouts are only scalable with verifiable solutions; automated assistance is limited to math; a PRM trained on human evaluations could yield a general verifier (sketch of MC labeling below)
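    A minimal sketch of deriving PRM step labels from outcome (ORM) signal only, via Monte-Carlo rollouts from each step prefix; rollout and check_answer are hypothetical helpers, not the paper's code.

    ```python
    # For each reasoning prefix, estimate P(correct final answer | prefix) by sampling
    # completions and checking their final answers; the estimate is a soft step label.
    def mc_step_labels(problem, steps, rollout, check_answer, n_rollouts=8):
        labels = []
        for t in range(1, len(steps) + 1):
            prefix = steps[:t]
            hits = sum(check_answer(rollout(problem, prefix)) for _ in range(n_rollouts))
            labels.append(hits / n_rollouts)   # soft step-level supervision for the PRM
        return labels
    ```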
  • 6 Meta RL

    • regular RL-trained policies can perform badly on new domains
    • meta-RL: learn a search/adaptation procedure, quickly explore & adapt
    • E-RL² allows exploring without reward supervision for all but the final episode & can be applied in pure language domains
    • 6.1 Small Domains: RL in post-training improves performance, not able to discover new search algorithms
    • 6.2 RLEF: RL based on reward from hidden private test cases (to what extent is this meta-RL?)
    • 6.3 they are skeptical about the discovery of novel reasoning algorithms; "super"-reasoning appears weak (no creative lateral thinking?)
    • 6.4 Emergence of System 2: were current reasoning models trained with standard RL? They suggest meta-RL behavior does not arise from standard RL
      • 6.4.1 Inducing Meta-Reasoning: they test meta-reasoning through prompting strategies on MATH: CoT, Think, Think & Verify (prompts not published?). The plots all lie close together: no clear accuracy gains despite many more tokens, i.e. some prompts induce higher rates of regret and error detection without accuracy gains. Pure overhead? Fake mistakes? Style?
      • fundamental limitations in using meta-cognitive instructions to induce robust reasoning
  • 7 Putting it all together

    • instruction tuning & RL
    • 7.1 seeding meta-reasoning: train on backtracking & branching data + solution (possible objectives in Appendix C)
    • 7.2 on-policy RL with verifier reward (grounding); stable and interpretable chains are a key challenge; massive scaling requires off-policy data
    • interplay between SFT and RL is not clear; they say RL is crucial
    • 7.2.1 mentions self-training RL (without an external verifier): e.g. use an existing problem-solution dataset and find additional latent reasoning which makes the solution more likely (like Quiet-STaR, but with meaningful reasoning rather than for single next tokens); no verifier required, which allows general reasoning and hard-to-verify problems.
    • 7.2.2 Discount rates: trade off compute vs. accuracy and give the model a sense of time/effort by pushing more reward toward shorter answers; problem: more compute increases accuracy, and RM bias towards longer answers leads to verbose models; the discount could be controlled via the prompt; an alternative or addition to discounts might be step costs (sketch below)?
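    A minimal sketch of how a length discount or per-step cost could enter the verifier reward; the functional form and default values are assumptions, not taken from the paper.

    ```python
    # A correct final answer earns a reward discounted by generation length,
    # trading accuracy against compute; a per-token/step cost is an alternative.
    def length_discounted_reward(correct: bool, num_tokens: int,
                                 gamma: float = 0.9999, step_cost: float = 0.0) -> float:
        outcome = 1.0 if correct else 0.0
        return outcome * (gamma ** num_tokens) - step_cost * num_tokens
    ```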
  • 8 Going forward

    • open-research bottlenecks: "GPU poor": limited access to data & compute; open-source large-scale inference & training tooling could be improved; large frontier of algorithmic exploration
    • dataset: BigMath - 1M verifiable math problems, filtered from existing sources; existing sources are not always suitable: problems need an open-ended formulation not solvable by guessing (unlike multiple choice) and automatically verifiable, unique closed-form solutions; the problem with multiple choice: sampling will get correct answers even for nonsense CoTs
    • 8.1.1 sources: filtered popular existing datasets (Table 3) and manual collection (math competitions and historical sources not covered in existing datasets)
    • 8.1.2 filtering: base: dataset-specific cleaning, reduce synthetic data, drop entries without a unique answer, remove questions with hyperlinks etc.; strict: rule-based and model-based filtering, no proofs, no multiple choice, no yes/no or T/F (poor learning signal); SemDeDup: remove all with cosine similarity > 0.5 (sketch below)
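    A greedy sketch of the cosine-similarity deduplication threshold mentioned above; embed is a hypothetical sentence-embedding function, and actual SemDeDup clusters embeddings first rather than doing a pairwise greedy pass.

    ```python
    import numpy as np

    # Keep a problem only if its embedding is not too similar (cosine > threshold)
    # to any problem already kept.
    def semantic_dedup(problems, embed, threshold=0.5):
        kept, kept_vecs = [], []
        for problem in problems:
            v = np.asarray(embed(problem), dtype=float)
            v = v / np.linalg.norm(v)                        # unit-normalize so dot product = cosine
            if all(float(v @ u) <= threshold for u in kept_vecs):
                kept.append(problem)
                kept_vecs.append(v)
        return kept
    ```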
    • 8.2 infrastructure: multi-node scale, high-throughput inference, allow interleaving inference & training for online RL
      • GPT-NeoX: inference & training processes share weights, updating weights without synchronization (they report +40% throughput)
      • separate dedicated GPU clusters allow higher throughput
    • 8.3 Open Research Questions
      • 8.3.1 CoT Faithfulness: even with a verifier, how to ensure that CoTs are faithful and valid (current models produce many CoTs which are inconsistent or unfaithful)? RLAIF is promising, as is a generative verifier with RAG; a large dataset of (human-generated) open-ended reasoning problems (such as proofs) is missing
      • 8.3.2 Process Guidance: stronger PRMs, potentially using offline RL techniques; verifier gap (to oracle solutions) - verification scaling laws?
      • 8.3.3 Scaling laws for reasoning & search: joint policy & verifier scaling, different search algorithms, SFT vs. RL, emergent capabilities?
      • 8.3.4 Meta-Search: additional search procedure on top of an advanced reasoning model, e.g. using a meta-critic
      • 8.3.5 External Tools: offload compute and require less training data, e.g. use a Python interpreter for calculations; tool integration outperforms CoT with less data; need to better understand the trade-off between internal reasoning vs. reasoning in tools (toy example below).
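    A toy example of the calculator-offloading idea; the tool tag format and function name are made up for illustration: a CoT step emits an arithmetic expression, an external evaluator computes it, and the exact result is fed back into the context.

    ```python
    import ast
    import operator as op

    # Restricted AST evaluation of plain arithmetic, instead of calling eval()
    # on model-generated text.
    _OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
            ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg}

    def safe_calc(expr: str) -> float:
        """Evaluate a simple arithmetic expression like '3 * (17 + 5) / 2'."""
        def ev(node):
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp):
                return _OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.UnaryOp):
                return _OPS[type(node.op)](ev(node.operand))
            raise ValueError("unsupported expression")
        return ev(ast.parse(expr, mode="eval").body)

    # e.g. a tool-integrated CoT step "<tool>3 * (17 + 5) / 2</tool>" -> safe_calc("3 * (17 + 5) / 2") -> 33.0
    ```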
  • 9 Conclusion:

    • future work should check the efficacy of the proposed pipeline
    • scaling laws for search and reasoning
    • interplay between SFT & RL
    • investigate potential of meta RL + Meta-CoT for discovering novel search algorithms