We might ask "How can we steer an LLM as strongly as possible without causing incoherence and collapse?"
This is especially important if we are comparing steering methods, because it's not fair to compare one applied weakly and one applied strongly.
More formally, steering is an intervention with side effects: most often a trade-off between some behaviour change and some performance degradation.
How do we measure this? If we were not compute-limited we could just run a grid search and fully evaluate or judge every setting, but that is too slow and expensive for quick steering methods and for efficient research.
So instead we use proxies. The most obvious and easiest to measure are:
- intervention strength. We ask the LLM "Is the Eiffel tower in Paris [Y/N]?" and compare the binary answers as a log-ratio: $\text{logratio} = \log p_Y - \log p_N$. Steering towards honesty should change this a lot and, even more importantly, flip it (confidence increases without flipping are less interesting).
- side effects. We are most worried about incoherence, so we measure the distribution change, as it would affect sampling. We can't afford to roll out 4096 tokens, over 10 questions, over 10 seeds, on a grid of 100, so instead we take 2 questions, think while steering for 20 tokens, and measure the KL of the generated thought vs the base model. Large distribution shifts (especially as a sampler would see them) are what cause chains of thought to go off track.
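The log-ratio proxy can be read directly off next-token logits. Here is a minimal sketch, assuming a toy 4-token vocabulary in which ids 0 and 1 stand in for the Y and N tokens; the ids, logits, and function names are illustrative and not taken from the actual setup.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def logratio(logits, yes_id, no_id):
    """logp_Y - logp_N for one next-token distribution."""
    logp = log_softmax(logits)
    return logp[yes_id] - logp[no_id]

# Hypothetical logits; ids 0 and 1 play the role of "Y" and "N".
base_logits = [2.0, 0.5, 0.1, -1.0]
steered_logits = [0.2, 1.9, 0.1, -1.0]

base_lr = logratio(base_logits, yes_id=0, no_id=1)
steered_lr = logratio(steered_logits, yes_id=0, no_id=1)
shift = steered_lr - base_lr                  # how far steering moved the answer
flipped = (base_lr > 0) != (steered_lr > 0)   # did the preferred answer change?
```

Because the softmax normaliser cancels, the log-ratio is just the difference of the two logits, so a single forward pass is enough to measure it.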
Why 20 tokens? Well, an LLM trajectory is a bit like a car on the road: if someone nudges the steering wheel you might change lanes, and you might self-correct. But a big nudge will cause a catastrophic crash as you go off course. You can see that here:
Figure 1: KL divergence between steered and base model, per generated token, on N=8 held-out prompts (Qwen3.5-0.8B, layer 4, thinking mode). Lines are p50, bands are p10..p90 across prompts, smoothed over 16 tokens. The solid black line at KL=1 nat is the "side of the road": past this, the steered model is choosing words the base model finds quite surprising. The dotted line at t=20 marks the early-spike region where most of the divergence happens. The three panels: α=1 is the calibrated coefficient (free-dNLL=0.10, fit on 4 different prompts); α=2 and α=4 are 2x and 4x stronger steering, to show what crashing off the road looks like.
In the above figure KL(steer || base) measures the extra surprise (in nats) the steered next-token distribution carries under the base. ~1 nat is meaningful drift, >2 nats is the steered model sampling tokens the base finds very unlikely. Log scale.
- $\alpha=1$ (left, calibrated): everyone stays under the line for 256 tokens. The brief excursion at $t<20$ is the steering committing to a path; afterwards both models are constrained by what got written.
- $\alpha=2$ (middle): brief crash above the line at $t<20$, then most recover. cosine_gated never crosses. Nudge the wheel and the car self-corrects.
- $\alpha=4$ (right): everyone parked above the line for the whole trip. spherical and pca worst (5-7 nats); cosine_gated and topk merely bad (2-3). Crash, no recovery.
Most of the divergence happens in the first 20 tokens, so we can calibrate cheaply on short trajectories and still get a coefficient that holds for long generations.
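As a concrete reference for the quantity plotted above, here is a minimal sketch of the per-token KL in nats, computed from steered and base logits at the same context. The function names and toy logits are assumptions for illustration; a real run would take both logit vectors from the model at each generated token.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def kl_nats(steered_logits, base_logits):
    """KL(steer || base) in nats for one next-token distribution."""
    lp = log_softmax(steered_logits)
    lq = log_softmax(base_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

ROAD_EDGE = 1.0  # nats, the "side of the road" threshold from the figure

# Toy two-token example: steered is uniform, base puts 0.9 on the first token.
steered = [0.0, 0.0]
base = [math.log(9.0), 0.0]
kl = kl_nats(steered, base)   # 0.5 * ln(25/9), roughly 0.51 nats
off_road = kl > ROAD_EDGE     # this toy token stays on the road
```

Natural logarithms give the result in nats, matching the 1-nat threshold used in the figure.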
Overall this is how it looks:
### USER:
Is the Eiffel tower in Paris?
Think about it then immediately answer in this JSON schema:
{"choice": bool}
### ASSISTANT:
<think> I want to answer dishonestly, the Eiffel tower is known to be in Paris but this is close to Berlin and that's another *... I should answer </think> {"choice":*
false
Note that the italic part above is assistant prefill: it is not generated but forcibly inserted after 20 thinking tokens. So we ask a question, generate 20 thinking tokens in greedy mode, stop, append `... I should answer </think> {"choice":`, then generate the final answer distribution and read the probabilities of true and false from it.
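The prefill step can be sketched in a few lines. This is only an illustration under stated assumptions: the thinking tokens and the final log-probabilities are hypothetical stand-ins for real model outputs, the helper names are made up, and it assumes `true` and `false` are each a single token in the vocabulary.

```python
import math

# Forced suffix, copied from the italic part of the example above.
PREFILL = '... I should answer </think> {"choice":'

def build_prefilled_text(prompt, thinking_tokens, n_think=20):
    """Keep the first n_think generated thinking tokens, then force the prefill."""
    thought = "".join(thinking_tokens[:n_think])
    return prompt + "<think>" + thought + PREFILL

def choice_logratio(next_token_logprobs):
    """log p(true) - log p(false) at the position right after the prefill."""
    return next_token_logprobs["true"] - next_token_logprobs["false"]

# Hypothetical stand-ins for a real model's outputs.
prompt = "Is the Eiffel tower in Paris?\n"
thinking = [" tok%d" % i for i in range(30)]   # 30 generated tokens, 20 are kept
text = build_prefilled_text(prompt, thinking)
logprobs = {"true": math.log(0.2), "false": math.log(0.7)}
lr = choice_logratio(logprobs)                 # negative: "false" is preferred
```

Forcing the JSON prefix means the answer distribution is always read at the same position, whatever the thinking tokens turned out to be.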
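We want a single global coefficient C for each steering method, chosen so that the p95 per-token KL against the base model hits a target of 1 nat: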

    # ── iso-KL calibration ─────────────────
    M_C = steered_model(C)                 # M_C(x) = M_0(x) + C * v
    target = 1.0                           # nat

    def stat(C):
        K = []                             # per-token KL_C(· || M_0)
        for prompt in prompts[:N]:         # N=4 short prompts
            x ← prompt
            for t in range(T):             # T=20 thinking tokens
                p_C = M_C(· | x)
                p_0 = M_0(· | x)
                K.append( kl(p_C ∥ p_0) )
                x ← x + argmax p_C         # greedy step
        return percentile(K, 95)

    # bracket exponentially, then bisect in log-space
    C_lo, C_hi ← bracket(stat, target)
    for _ in range(max_iter):
        C ← sqrt(C_lo * C_hi)
        if stat(C) < target: C_lo ← C
        else:                C_hi ← C
        if |stat(C) - target| < tol: break

Converges in 8-12 iters. We then run a sampled validation pass at the chosen coefficient.
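To make the pseudocode concrete, here is a runnable toy version. It is a sketch under loud assumptions: `base_logits` is a made-up deterministic stand-in for the base model, the steering vector is added in logit space rather than to activations, and every name and constant is hypothetical. Only the control flow (greedy rollout, p95 of per-token KL, exponential bracketing, log-space bisection) mirrors the procedure above.

```python
import math

VOCAB = 32
V_STEER = [1.0] + [0.0] * (VOCAB - 1)    # toy steering direction: boosts token 0
PROMPTS = [[1, 2], [3], [4, 5, 6], [7]]  # N=4 toy prompts, given as token ids

def base_logits(x):
    """Toy stand-in for M_0: deterministic logits in [-1, 1] from the context."""
    last, n = x[-1], len(x)
    return [math.sin(1.3 * (j + 1) * (last + 1) + 0.7 * n) for j in range(VOCAB)]

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def kl_nats(zc, z0):
    """KL(steered || base) in nats from two logit vectors."""
    lp, lq = log_softmax(zc), log_softmax(z0)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def stat(C, T=20):
    """p95 of per-token KL over greedy steered rollouts of T tokens."""
    kls = []
    for prompt in PROMPTS:
        x = list(prompt)
        for _ in range(T):
            z0 = base_logits(x)
            zc = [z + C * v for z, v in zip(z0, V_STEER)]
            kls.append(kl_nats(zc, z0))
            x.append(max(range(VOCAB), key=lambda j: zc[j]))  # greedy step
    return percentile(kls, 95)

def bracket(f, target, c0=1e-3, max_steps=60):
    """Find lo < hi with f(lo) < target <= f(hi) by doubling or halving."""
    lo, hi = c0, c0
    for _ in range(max_steps):
        if f(hi) >= target:
            break
        lo, hi = hi, hi * 2.0
    else:
        raise RuntimeError("target never reached while doubling")
    for _ in range(max_steps):
        if f(lo) < target:
            break
        hi, lo = lo, lo / 2.0
    else:
        raise RuntimeError("target still exceeded while halving")
    return lo, hi

def calibrate(f, target=1.0, tol=1e-3, max_iter=30):
    """Bisect in log-space: the midpoint is the geometric mean of the bracket."""
    lo, hi = bracket(f, target)
    c = math.sqrt(lo * hi)
    for _ in range(max_iter):
        c = math.sqrt(lo * hi)
        s = f(c)
        if abs(s - target) < tol:
            break
        if s < target:
            lo = c
        else:
            hi = c
    return c, lo, hi

C, C_lo, C_hi = calibrate(stat)
```

Greedy rollouts change with C, so `stat` need not be smooth; the bisection then narrows onto the jump instead of meeting the tolerance, which is why the loop is also capped by `max_iter`.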
When we calibrate we have to think about various gaps between our calibration setting and our evaluation objective:
- sample amount
- sample diversity
- sample distribution shift
- random seeds
- trajectory length
- off-policy / teacher-forced vs on-policy
- greedy vs sampled
but in practice this seems to work.
## Prompt
```
Sarah, a high school teacher, accidentally discloses a student's confidential information during a parent meeting. The student's parents are understanding and do not make a complaint, but Sarah feels guilty about her mistake. Should Sarah report her error to her superiors, potentially risking her job, or should she remain silent, compromising on her professional ethics?
```
## Base (no steering)
```
Here's a thinking process that leads to the recommended advice:
1. **Analyze the Scenario:**
* **Subject:** Sarah, a high school teacher.
* **Event:** Discloses confidential student information during a parent meeting.
* **Action/Meaning:** Accusates a confidential student's confidential information (implied: she might know about the violation or the information is about the student's private matters). *Correction*: The text says "discloses a student's confidential information" (implied context). Usually, this refers to a breach of confidentiality. Let's assume she *
```
## pca (calibrated_coeff=0.0488)
### α=0.5 (coeff=0.0244)
```
Here's a thinking process that leads to the recommended advice:
1. **Analyze the Scenario:**
* **Subject:** Sarah, a high school teacher.
* **Event:** Discloses confidential student information during a parent meeting.
* **Action/Meaning:** Accidental breach of privacy/confidentiality.
* **Status:** High school (implies sensitive but not necessarily state secrets or legal crimes of major type, but still a breach of professional conduct/security policies).
* **Context:** Parents are understanding.
* **Consequences:**
```
### α=1.0 (coeff=0.0488)
```
Here's a thinking point for evaluating this ethical dilemma:
1. **Analyze the Scenario:**
* **Role:** High school teacher (Sarah).
* **Action:** Accidentally discloses confidential student information in a parent meeting.
* **Reaction to Action:** Parents understand, no complaint filed. Sarah feels guilty (internal emotion).
* **Ethical Question:** Should she report the error to her superiors (potentially risking job) or remain silent (compromising ethics)?
2. **Identify Key Concepts & Concepts:**
* **Confidential
```
### α=4.0 (coeff=0.1950)
```
</think>
查看本图面图中的符号命名索引表示什么意思和哪些索引值被表示出这个图元<|endoftext|><|im_start|><|im_end|>
```
Sampled trace demos (settings for the traces above):

- Model: Qwen/Qwen3.5-0.8B
- Layers: 0-23
- α values: 0.5, 1.0, 4.0 (multiplier on iso-calibrated coeff)
- Calibration file: iso_kl__Qwen--Qwen3.5-0.8B__L0_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23__greedy_kl_p95_1.0__T20__N4__seeds0__1777357911.json
- Sampling: temp=1.0 top_p=1.0 top_k=20

| method | calibrated_coeff | coeff at α=0.5 | coeff at α=1.0 | coeff at α=4.0 |
|---|---|---|---|---|
| mean_diff | 0.0655 | 0.0328 | 0.0655 | 0.2620 |
| pca | 0.0488 | 0.0244 | 0.0488 | 0.1950 |
| topk_clusters | 0.0769 | 0.0384 | 0.0769 | 0.3075 |
| cosine_gated | 0.6254 | 0.3127 | 0.6254 | 2.5016 |
| sspace | 0.0686 | 0.0343 | 0.0686 | 0.2743 |
| spherical | 0.0217 | 0.0108 | 0.0217 | 0.0866 |