| marp | true |
|---|---|
| theme | chaos |
| paginate | true |
| footer | LLM Sensitivity |
- Not "LLMs are chaotic." Classical chaos needs things LLMs don't have.
- Not "I measured a Lyapunov exponent." Token space is discrete.
- Not "bigger = more stable" or "reasoning = stable." Neither holds up.
- Not "sentence-embedding distance is ground truth." It's a proxy.
- Not "lower divergence = better." Stability is a property, not a score.
Setting expectations: chaos as vocabulary, experiments as illustrations.
- At inference time, an LLM is a hybrid sequential system: continuous activations feed a discrete branching process.
- Small changes can move distributions or flip argmax branches. Varies a lot by model, prompt, metric.
- Naive measurement has specific failure modes worth naming.
- Chaos vocabulary organizes the phenomenon. Doesn't prove anything.
Upshot: test neighborhoods, not single prompts.
Chaos is deterministic amplification of small differences.
- Same equations. Starting angles differ by half a degree.
- A few seconds later, totally different places.
- Think less dice roll, more amplifier.
Forecasts go wrong because tiny measurement errors grow.
Logistic map:
- Low r: one value. Mid r: 2, 4, 8 cycles. High r: never repeats.
Lyapunov exponent λ: how fast nearby trajectories separate. λ > 0: chaotic.
Trained nets sit near this boundary (Langton 1990; Zhang 2024).
Which side is any LLM on?
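For the logistic map x → r·x·(1 − x), the Lyapunov exponent can be estimated by averaging log|f′(x)| along an orbit. This is a minimal sketch of that textbook estimate, not part of the talk's experiments; the starting point, burn-in, and step counts are arbitrary choices.

```python
import math

def logistic_lyapunov(r: float, x0: float = 0.2,
                      burn_in: int = 1000, steps: int = 10000) -> float:
    """Estimate the Lyapunov exponent of x -> r*x*(1-x)."""
    x = x0
    for _ in range(burn_in):              # discard the transient
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        # |f'(x)| = |r * (1 - 2x)|; the floor avoids log(0) if x lands on 0.5
        total += math.log(max(abs(r * (1.0 - 2.0 * x)), 1e-300))
        x = r * x * (1.0 - x)
    return total / steps

for r in (2.8, 3.2, 3.5, 3.9):
    print(f"r={r}: lambda ~ {logistic_lyapunov(r):+.3f}")
```

Negative values mean nearby starting points converge (fixed point or cycle); a positive value means they separate exponentially.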
Prompt A: Write a concise Python function that checks whether a string is a palindrome.
Prompt B: same prompt, trailing space added. (argmax decode, no sampling.)
Output A, OLMo-3 7B
def is_palindrome(s: str) -> bool:
"""
Check if the given string
is a palindrome, ignoring
case and non-alphanumeric
characters.
...
"""
cleaned = ''.
Output B, OLMo-3 7B
Certainly! Here's a concise Python function to check if a string is a palindrome:
def is_palindrome(s: str): return s == s[::-1]
How it works: ...
| | Same prompt | Tiny prompt change |
|---|---|---|
| Temp = 0 (argmax) | Byte-identical. Boring. | ★ this is what the talk measures |
| Temp > 0 (sampling) | Different draws, same vibe. | Different draws and different vibe. Confounded. |
Temperature: from a fixed distribution, what token do we sample? Sensitivity: how far did the distribution itself move?
Starred cell = our probe: zero sampling noise, output still moves. The model's response function shifted.
(At T=0.7 on OLMo-3, within-prompt and between-prompt sampling distances can match in magnitude, so deterministic decode gives the clean probe.)
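The temperature / sensitivity split can be made concrete with a toy example. A minimal sketch, assuming made-up three-token logits (`logits_a` and `logits_b` are illustrative, not measured values): temperature reshapes the sampling distribution over fixed logits, while a small shift in the logits themselves changes which token argmax picks.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into a sampling distribution at the given temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits_a = [2.00, 1.95, 0.10]   # hypothetical next-token logits for prompt A
logits_b = [1.95, 2.00, 0.10]   # hypothetical logits after a tiny prompt change

# Temperature: same logits, different spread, same top token.
for t in (0.1, 0.7, 1.5):
    probs = softmax(logits_a, t)
    print(f"T={t}: probs={[round(p, 3) for p in probs]} top={argmax(probs)}")

# Sensitivity: the logits moved, so the argmax token itself changed.
print("argmax A:", argmax(logits_a), "argmax B:", argmax(logits_b))
```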
- It has state (hidden activations, logits, KV cache, prefix).
- It has iteration (each token feeds into the next).
- It's deterministic under argmax.
- Small input perturbations can produce large output changes.
That's the checklist. The remaining question is whether the magnitude of amplification is interesting, and whether we can measure it.
But there's a catch: classical chaos is defined for perturbations that shrink toward zero. Token space is discrete, so the smallest possible change is one token. We'll come back to this.
A double pendulum isn't "wrong"; it obeys physics and lands elsewhere. Same bar for LLMs.
- "Book like Dune" → Foundation. Add a trailing space → Hyperion.
- Both recommendations are defensible; neither is a hallucination.
- Sampling stays local; sensitivity can move the distribution.
- Measure: output divergence per meaning-preserving input change.
An LLM's state: hidden activations + logits + prefix + KV cache.
- Li et al. 2025, QLE on Qwen2-14B. ~1.32× per layer. Quasi-Lyapunov (finite depth).
- Geshkovski 2023, attention as particle dynamics.
- Poole / Schoenholz, edge-of-chaos signal prop.
Chaos math is cleanest in activation space. These probes observe its output-text shadow.
- ~21 models: Qwen (0.8B → 9B), Gemma 4, Phi-4, DeepSeek-R1, Mistral, Granite, Falcon, SmolLM, OLMo 2 & 3; legacy: GPT-2 XL, GPT-J, Pythia, OPT, LLaMA-1.
- Prompt ladder: identical / no-op formatting / punctuation / synonym / paraphrase / small semantic / positive control.
- Deterministic decode (`do_sample=False`, argmax): divergence is a shift in the model's most confident response.
- Metrics: sentence-embedding cosine distance (primary) + token edit + hidden-state distance + logit JS/KL. All proxies; no ground truth.
- Analysis: bootstrap CIs + paired permutation tests. Present clusters, not ranks.
- Reproducibility: deterministic decode, prompt-token deltas logged, model/config metadata published with artifacts.
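A rough sketch of three of the proxies listed above, in plain Python. The function names are illustrative, not the ones used in the experiments. The embedding vectors would come from whichever sentence-embedding model is used, and JS divergence is computed in base 2 here (an assumption) so it lands in [0, 1].

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

def token_edit_distance(a, b):
    """Levenshtein distance over token sequences (insert / delete / substitute)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two next-token distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(token_edit_distance(["def", "f", "("], ["Certainly", "!", "def", "f", "("]))
print(js_divergence([0.9, 0.1], [0.1, 0.9]))
```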
- 0.8B is meaningfully more sensitive than 4B (p<0.001). 2B also separates (p=0.012).
- 4B vs 9B: indistinguishable at this n. No size law from this panel.
- Caveat: 4B/9B emit `Thinking Process:` preambles. Scaffold confound, next slide.
Short outputs (64 tokens): scaffolded models look ~4× more stable.
Identical `<think>` preambles dominate sentence-embedding similarity.
Evaluation warning: this mostly exposes a metric trap.
Long outputs (512 tokens) expose the mixed bag:
| Scaffolded model | 512-tok semantic | Prompt-end top-1 prob |
|---|---|---|
| DeepSeek-R1 7B | 0.027 (stable) | 0.99976 |
| Qwen 4B / 9B | 0.050 / 0.057 | 0.970 / 0.988 |
| SmolLM3 3B | 0.080 (middle) | 0.99983 |
| Phi-4 reasoning+ | 0.160 (brittle) | 0.99999996 |
Phi-4: 0.160 divergence at 512 tokens, `<think>` never closes (repetition loop). Confident logits can still branch. Thinking-off is mixed: helps big Qwens, hurts 0.8B.
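One way to keep a shared preamble from dominating the score is to strip the scaffold before embedding. A minimal sketch, assuming the scaffold is delimited by `<think>` ... `</think>`; the helper name `strip_scaffold` is hypothetical, and other preamble styles (such as a `Thinking Process:` header) would need their own pattern.

```python
import re

# Match a <think> block up to its closing tag, or to end of text if it never closes.
THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", re.DOTALL)

def strip_scaffold(text: str) -> str:
    """Remove <think> blocks so the distance is computed on the answer span only."""
    return THINK_BLOCK.sub("", text).strip()

print(repr(strip_scaffold("<think>plan the answer</think>\nAnswer: 42")))
print(repr(strip_scaffold("<think>loop loop loop")))   # unclosed block: nothing left
```

An empty result is itself a signal: the model never left its scaffold within the token budget, which is worth reporting separately rather than scoring as a distance.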
512-token semantic distance, non-scaffold models only:
| Model | Semantic | Era |
|---|---|---|
| LLaMA-1 7B | 0.053 | 2023 base, stable outlier |
| Gemma E2B instruct | 0.056 | modern chat |
| Mistral 7B v0.3 | 0.068 | modern chat |
| Gemma E4B instruct | 0.072 | modern chat |
| Gemma E4B base | 0.119 | modern base |
| Gemma E2B base | 0.199 | modern base |
| GPT-2 XL / OPT / Pythia / GPT-J | 0.14 – 0.22 | pre-chat base |
LLaMA-1 is a stable outlier in this probe, not a law. Within Gemma, instruct is far more stable than base (E2B 0.056 vs 0.199; E4B 0.072 vs 0.119): recipe over calendar. Era is a weak predictor; token-path and semantic metrics diverge.
"the the the" forever is extremely stable. So is a model collapsed on one fixed answer. Neither is what we want.
Qwen 0.8B quant sweep: 4-bit scored lower perturbation divergence (0.138 → 0.091). Sounds "more stable", until you check drift from BF16 on identical prompts: 0.132, huge.
That looks more like collapse onto a narrower manifold than robustness.
Fix: pair perturbation distance with drift-from-baseline. Both axes.
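The two-axis check can be sketched as a small report object. The numbers below are the Qwen 0.8B values quoted above; the class name, the function name, and the `drift_threshold` of 0.1 are illustrative assumptions, not calibrated values.

```python
from dataclasses import dataclass

@dataclass
class StabilityReport:
    perturbation_divergence: float   # distance between outputs for perturbed prompts
    baseline_drift: float            # distance from the reference model on identical prompts

def looks_like_collapse(reference: StabilityReport,
                        candidate: StabilityReport,
                        drift_threshold: float = 0.1) -> bool:
    """Flag a candidate that looks 'more stable' but has drifted far from the reference."""
    more_stable = candidate.perturbation_divergence < reference.perturbation_divergence
    drifted = candidate.baseline_drift > drift_threshold
    return more_stable and drifted

bf16 = StabilityReport(perturbation_divergence=0.138, baseline_drift=0.0)
four_bit = StabilityReport(perturbation_divergence=0.091, baseline_drift=0.132)

print(looks_like_collapse(bf16, four_bit))   # lower divergence, large drift
```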
Three ways a naive stability probe will mislead you:
| Confound | What happens | Caught by |
|---|---|---|
| Collapse | Degenerate model scores "stable" because outputs stop responding to input. (Qwen 0.8B 4-bit.) | Distance from baseline on identical prompts. |
| Scaffold | Short-output score dominated by deterministic preamble. (Qwen 4B/9B, SmolLM3.) | Longer continuations; scaffold stripping. |
| Confidence can still branch | Low prompt-end JS + sharp top-1 argmax still allowed a divergent trajectory. Phi-4: top-1 prob 0.99999996 at prompt-end; `<think>` never closes; 0.160 at 512 tokens. | Multi-scale measurement: short vs long, logit vs text. |
The most useful contribution here is naming the failure modes before the field starts quoting the numbers.
Updated direct-answer control: token paths can diverge quickly even when 512-token semantic distance stays in a tight band. Use trajectory shape as a diagnostic rather than a leaderboard.
Small prompt change → argmax crosses a low-margin boundary → different first token → autoregressive feedback → different trajectory.
The distribution often barely moves in bulk. A low-margin next-token decision is fragile; one flipped argmax steers generation into a different basin.
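The flip-then-feedback mechanism can be illustrated with a toy greedy decoder. This is a sketch over a hypothetical bigram table (`TOY_LOGITS` is invented for illustration, not taken from any model): the two prompts differ by 0.01 in one logit, the first argmax flips, and every later step conditions on a different token.

```python
# Hypothetical next-token logits keyed by the previous token.
TOY_LOGITS = {
    "<prompt A>": {"def": 2.01, "Certainly": 2.00},
    "<prompt B>": {"def": 2.00, "Certainly": 2.01},   # tiny shift, argmax flips
    "def": {"is_palindrome": 3.0, "!": 0.0},
    "is_palindrome": {"(": 3.0, "!": 0.0},
    "(": {"<eos>": 1.0},
    "Certainly": {"!": 3.0, "def": 0.5},
    "!": {"Here's": 3.0, "def": 0.0},
    "Here's": {"<eos>": 1.0},
}

def top1_margin(options):
    """Gap between the best and second-best logit; a small gap is a fragile decision."""
    ranked = sorted(options.values(), reverse=True)
    return float("inf") if len(ranked) < 2 else ranked[0] - ranked[1]

def greedy_decode(start, max_steps=8):
    """Argmax decoding: each chosen token becomes the context for the next step."""
    tokens, current = [], start
    for _ in range(max_steps):
        options = TOY_LOGITS.get(current)
        if options is None:
            break
        current = max(options, key=options.get)
        if current == "<eos>":
            break
        tokens.append(current)
    return tokens

print("margin at prompt end:", round(top1_margin(TOY_LOGITS["<prompt A>"]), 3))
print("A:", greedy_decode("<prompt A>"))
print("B:", greedy_decode("<prompt B>"))
```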
How few bits to store the model?
- TurboQuant, KIVI, KV quantization.
- Rate-distortion bounds.
- Well-characterized.
Compression has a static floor. Does it have a dynamical one too? The chaos lens suggests the question. I don't have the answer.
Don't evaluate on a single prompt, single decode, or single metric. Prompting is operating a high-gain branching system.
Operational:
- Reliability: test prompt neighborhoods around the canonical prompt.
- Model comparison: report sensitivity ranges over equivalent prompts.
- Output metrics: strip boilerplate, compare answer spans, watch prefixes.
- Decoding: deterministic for sensitivity; sampling separately for deployment.
- Quantization: lower divergence ≠ robustness. Also check baseline drift.
The chaos lens gives us questions that standard benchmarks rarely ask, and those questions are worth asking.
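Testing a prompt neighborhood can be sketched as follows. The variants cover only the formatting and punctuation rungs of the ladder, and `generate` and `distance` are placeholders for a real deterministic decode and a real output metric; every name here is hypothetical.

```python
from typing import Callable, Dict, Tuple

def prompt_neighborhood(prompt: str) -> Dict[str, str]:
    """Surface variants that should not change what is being asked."""
    stripped = prompt.rstrip()
    toggled = stripped[:-1] if stripped.endswith(".") else stripped + "."
    return {
        "identical": prompt,
        "trailing_space": prompt + " ",
        "trailing_newline": prompt + "\n",
        "final_period_toggled": toggled,
    }

def sensitivity_range(prompt: str,
                      generate: Callable[[str], str],
                      distance: Callable[[str, str], float]
                      ) -> Tuple[float, float, Dict[str, float]]:
    """Distance of each variant's output from the canonical output: (min, max, per-variant)."""
    baseline = generate(prompt)
    scores = {
        name: distance(baseline, generate(variant))
        for name, variant in prompt_neighborhood(prompt).items()
        if name != "identical"
    }
    return min(scores.values()), max(scores.values()), scores

# Toy stand-ins so the sketch runs without a model.
def toy_generate(prompt: str) -> str:
    return "Certainly! ..." if prompt.endswith(" ") else "def is_palindrome(...)"

def toy_distance(a: str, b: str) -> float:
    return 0.0 if a == b else 1.0

low, high, per_variant = sensitivity_range(
    "Write a palindrome checker.", toy_generate, toy_distance
)
print(low, high, per_variant)
```

Reporting the range rather than a single number keeps the comparison honest when one variant happens to cross a branch and the others do not.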
Argmax decode has no sampling step. do_sample=False → highest-logit token wins each step. Seed is inert.
- Same prompt twice → byte-identical output.
- Prompt A vs B → top-token at some position flipped. Most confident response moved.
Temperature > 0? 30-sample cluster test (OLMo-3, palindrome pair, T=0.1):
- Prompt A: 30 samples cluster tightly. Prompt B: same.
- A-cluster and B-cluster are visibly separate.
The samples form two different attractors. Sampling noise is smaller than the shift between them.
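The cluster comparison reduces to two numbers: mean pairwise distance within each prompt's samples, and mean distance between the two sets. A minimal sketch, assuming each sample has already been embedded; the 2-D vectors below are made up for illustration and stand in for real sentence embeddings.

```python
import math
from itertools import combinations, product

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_within(cluster):
    """Average distance over all pairs of samples from one prompt."""
    pairs = list(combinations(cluster, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

def mean_between(cluster_a, cluster_b):
    """Average distance over all cross-prompt pairs."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Hypothetical embeddings of sampled outputs for prompts A and B.
samples_a = [[1.00, 0.05], [1.00, 0.00], [0.98, 0.10]]
samples_b = [[0.10, 1.00], [0.00, 1.00], [0.05, 0.97]]

print("within A:", round(mean_within(samples_a), 4))
print("within B:", round(mean_within(samples_b), 4))
print("between :", round(mean_between(samples_a, samples_b), 4))
```

When the between-prompt figure is much larger than both within-prompt figures, sampling noise is not what separates the outputs.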
- Formal chaos needs exponential divergence under iteration. Not proven.
- What was measured: small input perturbations producing different outputs, varying by model, reproducible under deterministic decode.
- Consistent with behavior near a chaos boundary.
- The frame is the contribution; the experiment is a probe.
Prompt sensitivity / brittleness:
- Salinas & Morstatter 2024, "Butterfly Effect of Altering Prompts." My whitespace example, already published.
- Sclar et al. 2023, formatting sensitivity; up to 76-point swings on LLaMA-2-13B.
- Lu et al. 2021, example ordering alone moves few-shot near-random to near-SOTA.
- PromptRobust / POSIX / RobustAlpacaEval, published sensitivity benchmarks.
Dynamical systems in NNs:
- Poole 2016 / Schoenholz 2017, edge-of-chaos signal propagation.
- Geshkovski et al. 2023, attention as interacting-particle dynamics.
- Tomihari & Karakida 2025, Jacobian/Lyapunov analysis of self-attention.
- n = 9 prompt pairs per model; n = 24 in hardened Qwen wave. Small.
- Robust at n = 24: Qwen 4B vs 0.8B p<0.001; Qwen 4B vs 2B p=0.012; cluster membership.
- Weak at this n: Qwen 4B vs 9B (p=0.78); middle-pack ordering; standalone quant flip.
- Scaffold vs non-scaffold is confounded with post-training recipe; untangling it needs different models, not more prompts.
Disagreeing with a specific model ordering is fair; those are intentionally underclaimed. Disagreeing with the broad clusters needs a much larger prompt set.
- gpt-oss-20b: MXFP4 / Triton driver mismatch on SageMaker image.
- Nemotron Nano 9B v2: container lacked `mamba-ssm`.
- Phi-4 mini: Transformers version / custom-code import failure.
Reported as tooling misses rather than stability findings.
Stable / mid
| Model | Mean | 95% CI |
|---|---|---|
| DeepSeek-R1 Qwen 7B | 0.027 | 0.018 – 0.036 |
| Qwen3.5 4B | 0.050 | 0.033 – 0.066 |
| LLaMA-1 7B | 0.053 | 0.017 – 0.100 |
| Gemma 4 E2B it | 0.056 | 0.033 – 0.080 |
| Qwen3.5 9B | 0.057 | 0.037 – 0.075 |
| Mistral 7B v0.3 | 0.068 | 0.047 – 0.089 |
| Gemma 4 E4B it | 0.072 | 0.038 – 0.110 |
| Qwen3.5 2B | 0.075 | 0.050 – 0.103 |
Higher sensitivity / caveats
| Model | Mean | 95% CI |
|---|---|---|
| OLMo 2 7B | 0.088 | 0.055 – 0.127 |
| Qwen3.5 0.8B | 0.103 | 0.061 – 0.153 |
| OLMo 3 7B | 0.104 | 0.077 – 0.135 |
| Gemma 4 E4B base | 0.119 | 0.071 – 0.173 |
| GPT-2 XL | 0.144 | 0.082 – 0.208 |
| Phi-4 reasoning+ | 0.161 | 0.072 – 0.255 |
| Gemma 4 E2B base | 0.199 | 0.152 – 0.248 |
n=24 pairs, 512 tokens. Cluster view. Phi-4: scaffolded yet brittle, scaffold does not guarantee stability.











