marp	true
theme	chaos
paginate	true
footer	LLM Sensitivity

Nearby Prompts,
Distant Trajectories

Teaching a lens: chaos, dynamical systems,
and how they might apply to LLMs.

What I'm not claiming.

Not "LLMs are chaotic." Classical chaos needs things LLMs don't have.
Not "I measured a Lyapunov exponent." Token space is discrete.
Not "bigger = more stable" or "reasoning = stable." Neither holds up.
Not "sentence-embedding distance is ground truth." It's a proxy.
Not "lower divergence = better." Stability is a property, not a score.

Setting expectations: chaos as vocabulary, experiments as illustrations.

What I am claiming.

At inference time, an LLM is a hybrid sequential system: continuous activations feed a discrete branching process.
Small changes can move distributions or flip argmax branches. Varies a lot by model, prompt, metric.
Naive measurement has specific failure modes worth naming.
Chaos vocabulary organizes the phenomenon. Doesn't prove anything.

Upshot: test neighborhoods, not single prompts.

Chaos starts with sensitivity.

Chaos is deterministic amplification of small differences.

Same equations. Starting angles differ by half a degree.
A few seconds later, totally different places.
Think less dice roll, more amplifier.

Forecasts go wrong because tiny measurement errors grow.

Small differences grow. Measurably.

Logistic map: $x_{n+1} = r \cdot x_n (1 - x_n)$

Low r (below 3): settles to one value. Mid r (3 to about 3.57): cycles of period 2, 4, 8, ... High r (most values above about 3.57): never repeats.

Lyapunov exponent λ: the average exponential rate at which nearby trajectories separate. λ > 0: chaotic.
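A minimal sketch of how λ can be estimated for the logistic map above, assuming the standard orbit-average formula: λ is approximately the mean of ln|f'(x_n)|, with f'(x) = r(1 - 2x). The function name, starting point, burn-in length, and step count are illustrative choices, not taken from the talk.

```python
import math

def logistic_lyapunov(r, x0=0.2, burn_in=1000, steps=5000):
    """Estimate the Lyapunov exponent of x -> r*x*(1-x) along one orbit."""
    x = x0
    # Discard the transient so the average reflects long-run behavior.
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        # |f'(x)| = |r * (1 - 2x)| is the local stretch factor at x.
        stretch = abs(r * (1.0 - 2.0 * x))
        total += math.log(max(stretch, 1e-300))  # guard against log(0)
        x = r * x * (1.0 - x)
    return total / steps

for r in (2.5, 3.2, 3.9):
    print(f"r={r}: lambda ~ {logistic_lyapunov(r):+.3f}")
```

A negative estimate means nearby starting points converge (fixed point or cycle); a positive one means they separate on average.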

Trained nets are argued to sit near this boundary (Langton 1990; Zhang 2024).

Which side is any LLM on?

Same input. Same weights. Different output.

Prompt A: Write a concise Python function that checks whether a string is a palindrome.
Prompt B: same prompt, trailing space added.
(Argmax decode, no sampling.)

Output A, OLMo-3 7B

def is_palindrome(s: str) -> bool:
    """
    Check if the given string
    is a palindrome, ignoring
    case and non-alphanumeric
    characters.
    ...
    """
    cleaned = ''.

Output B, OLMo-3 7B

Certainly! Here's a concise
Python function to check if a
string is a palindrome:
def is_palindrome(s: str):
    return s == s[::-1]

How it works: ...

Temperature is a separate axis.

	Same prompt	Tiny prompt change
Temp = 0 (argmax)	Byte-identical. Boring.	★ this is what the talk measures
Temp > 0 (sampling)	Different draws, same vibe.	Different draws and different vibe. Confounded.

Temperature: from a fixed distribution, what token do we sample? Sensitivity: how far did the distribution itself move?

Starred cell = our probe: zero sampling noise, output still moves. The model's response function shifted.

(At T=0.7 on OLMo-3, within-prompt and between-prompt sampling distances can match in magnitude, so deterministic decode gives the clean probe.)
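To make the two axes concrete, here is a small sketch with made-up next-token logits. It shows that temperature reshapes the sampling distribution (entropy rises with T) while leaving the argmax untouched, which is why argmax decode isolates prompt sensitivity from sampling noise. The logits and helper names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given temperature (> 0)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token logits for one fixed prompt (made-up numbers).
logits = [2.0, 1.9, 0.5, -1.0]

for t in (0.1, 0.7, 1.5):
    probs = softmax(logits, t)
    print(f"T={t}: argmax={argmax(probs)}, entropy={entropy(probs):.3f}")
```

Changing the prompt changes `logits` themselves, which is the other axis: no choice of temperature undoes that shift.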

So: is an LLM a dynamical system?

It has state (hidden activations, logits, KV cache, prefix).
It has iteration (each token feeds into the next).
It's deterministic under argmax.
Small input perturbations can produce large output changes.

That's the checklist. The remaining question is whether the magnitude of amplification is interesting, and whether we can measure it.

But there's a catch: classical chaos is defined in the limit of perturbations shrinking to zero. Token space is discrete, so the smallest possible perturbation is a whole token. We'll come back to this.

Both outputs can be correct.

A double pendulum isn't "wrong"; it obeys physics and lands elsewhere. Same bar for LLMs.

"Book like Dune" → Foundation. Add a trailing space → Hyperion.
Both recommendations are defensible; neither is a hallucination.
Sampling stays local; sensitivity can move the distribution.
Measure: output divergence per meaning-preserving input change.

State, and prior work, short version

An LLM's state: hidden activations + logits + prefix + KV cache.

Li et al. 2025, quasi-Lyapunov exponent (QLE) on Qwen2-14B: ~1.32× per layer. "Quasi" because depth is finite.
Geshkovski et al. 2023, attention as interacting-particle dynamics.
Poole 2016 / Schoenholz 2017, edge-of-chaos signal propagation.

Chaos math is cleanest in activation space. These probes observe its output-text shadow.

The experiment

~21 models: Qwen (0.8B → 9B), Gemma 4, Phi-4, DeepSeek-R1, Mistral, Granite, Falcon, SmolLM, OLMo 2 & 3; legacy: GPT-2 XL, GPT-J, Pythia, OPT, LLaMA-1.
Prompt ladder: identical / no-op formatting / punctuation / synonym / paraphrase / small semantic / positive control.
Deterministic decode (do_sample=False, argmax), so any divergence is a shift in the model's most confident response.
Metrics: sentence-embedding cosine distance (primary) + token edit + hidden-state distance + logit JS/KL. All proxies; no ground truth.
Analysis: bootstrap CIs + paired permutation tests. Present clusters, not ranks.
Reproducibility: deterministic decode, prompt-token deltas logged, model/config metadata published with artifacts.
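As a rough illustration of three of the metric families listed above, here is a dependency-free sketch. It assumes whitespace tokens stand in for real tokenizer output and that embedding vectors and next-token distributions are already available as plain lists; none of the function names come from the talk's code.

```python
import math

def token_edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats: 0 if identical, at most ln 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Each function answers a different question (surface path, meaning proxy, next-token distribution), which is why they can disagree on the same prompt pair.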

What actually matters?

Columns are sorted by average effect, but the stronger pattern is row-wise: some models are much more sensitive than others.

Same-looking prompt. Different trajectory.

Read this as token-path divergence. Quality and endpoint meaning are separate checks.

Within-Qwen: one clean contrast.

0.8B is meaningfully more sensitive than 4B (p<0.001). 2B also separates (p=0.012).
4B vs 9B: indistinguishable at this n. No size law from this panel.
Caveat: 4B/9B emit Thinking Process: preambles. Scaffold confound, next slide.
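For readers unfamiliar with paired permutation tests, here is a minimal sign-flip version. It assumes both models were scored on the same prompt pairs, so each pair contributes one difference. The implementation details (resample count, add-one correction) are my choices for illustration, not necessarily those behind the p-values above, and the scores in the usage lines are made up.

```python
import random

def paired_permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each pair's labels are exchangeable,
        # so each difference is equally likely to carry either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)

# Made-up per-prompt-pair divergences for two hypothetical models.
small_model = [0.12, 0.09, 0.15, 0.11, 0.10, 0.14, 0.13, 0.08, 0.16, 0.12]
large_model = [0.05, 0.04, 0.07, 0.06, 0.03, 0.05, 0.06, 0.04, 0.08, 0.05]
print(paired_permutation_test(small_model, large_model))
```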

Scaffold "stability" is mostly metric artifact.

Short outputs (64 tokens): scaffolded models look ~4× more stable. Identical <think> preambles dominate sentence-embedding similarity. Evaluation warning: this mostly exposes a metric trap.

Long outputs (512 tokens) expose the mixed bag:

Scaffolded model	512-token semantic distance	Prompt-end top-1 probability
DeepSeek-R1 7B	0.027 (stable)	0.99976
Qwen 4B / 9B	0.050 / 0.057	0.970 / 0.988
SmolLM3 3B	0.080 (middle)	0.99983
Phi-4 reasoning+	0.160 (brittle)	0.99999996

Phi-4: 0.160 divergence at 512 tokens, <think> never closes (repetition loop). Confident logits can still branch. Thinking-off is mixed: helps big Qwens, hurts 0.8B.
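One way to keep a shared preamble from dominating the metric is to strip it before comparing. The sketch below assumes the scaffold is a `<think>...</think>` block; other models use different markers (for example a "Thinking Process:" preamble), so the pattern is an assumption to adapt, and `strip_scaffold` is a hypothetical helper.

```python
import re

# Assumed scaffold format: one or more <think>...</think> blocks.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_scaffold(text):
    """Return (answer_text, closed) with reasoning blocks removed.

    closed is False when a <think> block opens but never closes,
    in which case the output is treated as having no answer span.
    """
    if "<think>" in text and "</think>" not in text:
        return "", False
    return THINK_BLOCK.sub("", text).strip(), True

print(strip_scaffold("<think>plan the answer</think>\nAnswer: 42"))
print(strip_scaffold("<think>loop loop loop"))
```

Reporting the `closed` flag alongside the distance keeps a never-closing scaffold from silently scoring as an empty, and therefore identical, answer.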

Era, recipe, and the LLaMA-1 surprise

512-token semantic distance, non-scaffold models only:

Model	Semantic distance	Era
LLaMA-1 7B	0.053	2023 base, stable outlier
Gemma E2B instruct	0.056	modern chat
Mistral 7B v0.3	0.068	modern chat
Gemma E4B instruct	0.072	modern chat
Gemma E4B base	0.119	modern base
Gemma E2B base	0.199	modern base
GPT-2 XL / OPT / Pythia / GPT-J	0.14 – 0.22	pre-chat base

LLaMA-1 is a stable outlier in this probe, not a law. Within Gemma, instruct is far more stable than base (E2B: 0.056 vs 0.199; E4B: 0.072 vs 0.119), so recipe matters more than calendar. Era is a weak predictor; token-path and semantic metrics diverge.

Stability and responsiveness split.

"the the the" forever is extremely stable. So is a model collapsed on one fixed answer. Neither is what we want.

Qwen 0.8B quant sweep: the 4-bit model scored lower perturbation divergence (0.138 → 0.091). That sounds "more stable" until you check its drift from BF16 on identical prompts: 0.132, comparable to the BF16 perturbation divergence itself.

That looks more like collapse onto a narrower manifold than robustness.

Fix: pair perturbation distance with drift-from-baseline. Both axes.
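A sketch of what reporting both axes could look like. The distance here is a toy word-set Jaccard distance standing in for a real semantic metric, and `two_axis_report` and its argument names are hypothetical.

```python
def jaccard_distance(a, b):
    """Toy text distance: 1 - |shared words| / |all words|."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def two_axis_report(baseline_out, variant_out, variant_out_perturbed,
                    distance=jaccard_distance):
    """
    baseline_out:          reference model (e.g. BF16) on the original prompt
    variant_out:           compressed model on the original prompt
    variant_out_perturbed: compressed model on the perturbed prompt
    """
    return {
        "perturbation_divergence": distance(variant_out, variant_out_perturbed),
        "baseline_drift": distance(baseline_out, variant_out),
    }

# A collapsed model: perfectly "stable", yet far from the reference.
report = two_axis_report(
    baseline_out="def is_palindrome ( s ) : return s == s [::-1]",
    variant_out="the the the",
    variant_out_perturbed="the the the",
)
print(report)
```

Low divergence with high drift is the collapse signature; low on both axes is the case that deserves the word "robust".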

Measuring is the hard part.

Three ways a naive stability probe will mislead you:

Confound	What happens	Caught by
Collapse	Degenerate model scores "stable" because outputs stop responding to input. (Qwen 0.8B 4-bit.)	Distance from baseline on identical prompts.
Scaffold	Short-output score dominated by deterministic preamble. (Qwen 4B/9B, SmolLM3.)	Longer continuations; scaffold stripping.
Confidence can still branch	Low prompt-end JS + sharp top-1 argmax still allowed a divergent trajectory. Phi-4: top-1 prob 0.99999996 at prompt-end; `<think>` never closes; 0.160 at 512 tokens.	Multi-scale measurement: short vs long, logit vs text.

The most useful contribution here is naming the failure modes before the field starts quoting the numbers.

Long-generation trajectories

Updated direct-answer control: token paths can diverge quickly even when 512-token semantic distance stays in a tight band. Use trajectory shape as a diagnostic rather than a leaderboard.

Mechanism: boundary beats bulk

Small prompt change → argmax crosses a low-margin boundary → different first token → autoregressive feedback → different trajectory.

The distribution often barely moves in bulk. A low-margin next-token decision is fragile; one flipped argmax steers generation into a different basin.
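The chain above can be illustrated with a toy model. All numbers, tokens, and the `continuations` table are made up: a tiny logit nudge barely changes the distribution (small total variation) but flips a low-margin argmax, and greedy decoding then follows a different path.

```python
import math

def softmax(logits):
    """Probabilities from a token -> logit mapping."""
    m = max(logits.values())
    exps = {tok: math.exp(z - m) for tok, z in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def total_variation(p, q):
    """Half the L1 distance between two distributions over the same tokens."""
    return 0.5 * sum(abs(p[tok] - q[tok]) for tok in p)

def top1_margin(logits):
    """Gap between the best and second-best logit."""
    ranked = sorted(logits.values(), reverse=True)
    return ranked[0] - ranked[1]

def greedy_decode(first_logits, continuations, max_len=6):
    """Toy greedy decode: after the first token, each token fixes the next."""
    token = max(first_logits, key=first_logits.get)
    path = [token]
    while token in continuations and len(path) < max_len:
        token = continuations[token]
        path.append(token)
    return path

# Made-up first-token logits: "def" leads "Certainly" by only 0.05.
logits_a = {"def": 3.00, "Certainly": 2.95, "Here": 1.0}
# A tiny prompt change nudges each logit by at most 0.06.
logits_b = {"def": 2.97, "Certainly": 3.01, "Here": 1.0}

continuations = {
    "def": "is_palindrome", "is_palindrome": "(s)",
    "Certainly": "!", "!": "Here's", "Here's": "a",
}

print(top1_margin(logits_a))
print(total_variation(softmax(logits_a), softmax(logits_b)))
print(greedy_decode(logits_a, continuations))
print(greedy_decode(logits_b, continuations))
```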

A question the lens suggests.

Static floor

How few bits to store the model?

TurboQuant, KIVI, KV quantization.
Rate-distortion bounds.
Well-characterized.

Dynamical floor?

How few bits before behavior drifts?

Might depend on model sensitivity.
Stable models might tolerate more compression.
Open. My data doesn't settle it.

Compression has a static floor. Does it have a dynamical one too? The chaos lens suggests the question. I don't have the answer.

The practitioner upshot.

Don't evaluate on a single prompt, single decode, or single metric. Prompting is operating a high-gain branching system.

Operational:

Reliability: test prompt neighborhoods around the canonical prompt.
Model comparison: report sensitivity ranges over equivalent prompts.
Output metrics: strip boilerplate, compare answer spans, watch prefixes.
Decoding: deterministic for sensitivity; sampling separately for deployment.
Quantization: lower divergence ≠ robustness. Also check baseline drift.
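A minimal sketch of a neighborhood test, assuming a `generate` callable that wraps deterministic decoding for whatever model is under test and a `distance` function of your choice. Both are placeholders, and the specific perturbations are illustrative rather than the talk's prompt ladder.

```python
def prompt_neighborhood(prompt):
    """Meaning-preserving variants of a prompt; the canonical prompt comes first."""
    return [
        prompt,
        prompt + " ",    # trailing space
        prompt + "\n",   # trailing newline
        " " + prompt,    # leading space
        prompt[:-1] if prompt.endswith(".") else prompt + ".",  # toggle final period
    ]

def sensitivity_range(generate, distance, prompt):
    """Min and max distance from the canonical output across the neighborhood."""
    canonical = generate(prompt)
    neighbors = prompt_neighborhood(prompt)[1:]
    dists = [distance(canonical, generate(p)) for p in neighbors]
    return min(dists), max(dists)

# Stand-in "model" that is brittle to a trailing space (purely illustrative).
def fake_generate(prompt):
    return "Certainly! Here's..." if prompt.endswith(" ") else "def is_palindrome..."

def exact_mismatch(a, b):
    return 0.0 if a == b else 1.0

print(sensitivity_range(fake_generate, exact_mismatch, "Write a function."))
```

Reporting the range, rather than the single canonical result, is what turns a one-prompt evaluation into a neighborhood evaluation.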

The chaos lens gives us questions that standard benchmarks rarely ask, and those questions are worth asking.

Questions?

Backup, "Would I get the same answer if I ran it?"

Argmax decode has no sampling step. do_sample=False → highest-logit token wins each step. Seed is inert.

Same prompt twice → byte-identical output.
Prompt A vs B → top-token at some position flipped. Most confident response moved.

Temperature > 0? 30-sample cluster test (OLMo-3, palindrome pair, T=0.1):

Prompt A: 30 samples cluster tightly. Prompt B: same.
A-cluster and B-cluster are visibly separate.

The samples form two different attractors. Sampling noise is smaller than the shift between them.
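The cluster comparison can be summarized as within-prompt versus between-prompt distance. Here is a sketch assuming each sample has already been mapped to something a `distance` function accepts; the 1-D points in the usage lines are made up and only stand in for embedded outputs.

```python
from itertools import combinations, product

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def separation(cluster_a, cluster_b, distance):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within = mean(
        [distance(x, y) for x, y in combinations(cluster_a, 2)]
        + [distance(x, y) for x, y in combinations(cluster_b, 2)]
    )
    between = mean(distance(x, y) for x, y in product(cluster_a, cluster_b))
    return within, between

# Made-up 1-D "embeddings" for samples from two prompts.
samples_a = [0.0, 0.1, 0.2]
samples_b = [1.0, 1.1, 1.2]
within, between = separation(samples_a, samples_b, lambda x, y: abs(x - y))
print(within, between)
```

When the between-cluster distance clearly exceeds the within-cluster distance, sampling noise is smaller than the prompt-induced shift.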

Backup, "Is this chaos?" defense

Formal chaos needs exponential divergence under iteration. Not proven.
What was measured: small input perturbations producing different outputs, varying by model, reproducible under deterministic decode.
Consistent with behavior near a chaos boundary.
The frame is the contribution; the experiment is a probe.

Backup, Related work I came across late

Prompt sensitivity / brittleness:

Salinas & Morstatter 2024, "Butterfly Effect of Altering Prompts." My whitespace example, already published.
Sclar et al. 2023, formatting sensitivity; up to 76-point swings on LLaMA-2-13B.
Lu et al. 2021, example ordering alone moves few-shot near-random to near-SOTA.
PromptRobust / POSIX / RobustAlpacaEval, published sensitivity benchmarks.

Dynamical systems in NNs:

Poole 2016 / Schoenholz 2017, edge-of-chaos signal propagation.
Geshkovski et al. 2023, attention as interacting-particle dynamics.
Tomihari & Karakida 2025, Jacobian/Lyapunov analysis of self-attention.

Backup, Statistical honesty

n = 9 prompt pairs per model; n = 24 in hardened Qwen wave. Small.
Robust at n = 24: Qwen 4B vs 0.8B p<0.001; Qwen 4B vs 2B p=0.012; cluster membership.
Weak at this n: Qwen 4B vs 9B (p=0.78); middle-pack ordering; standalone quant flip.
Scaffold vs non-scaffold is confounded with post-training recipe, needs different models, not more prompts.

Disagreeing with a specific model ordering is fair; those are intentionally underclaimed. Disagreeing with the broad clusters needs a much larger prompt set.

Backup, Failed experiments

gpt-oss-20b: MXFP4 / Triton driver mismatch on SageMaker image.
Nemotron Nano 9B v2: container lacked mamba-ssm.
Phi-4 mini: Transformers version / custom-code import failure.

Reported as tooling misses rather than stability findings.

Backup, Full bootstrap readout (512 tokens)

Stable / mid

Model	Mean	95% CI
DeepSeek-R1 Qwen 7B	0.027	0.018 – 0.036
Qwen3.5 4B	0.050	0.033 – 0.066
LLaMA-1 7B	0.053	0.017 – 0.100
Gemma 4 E2B it	0.056	0.033 – 0.080
Qwen3.5 9B	0.057	0.037 – 0.075
Mistral 7B v0.3	0.068	0.047 – 0.089
Gemma 4 E4B it	0.072	0.038 – 0.110
Qwen3.5 2B	0.075	0.050 – 0.103

Higher sensitivity / caveats

Model	Mean	95% CI
OLMo 2 7B	0.088	0.055 – 0.127
Qwen3.5 0.8B	0.103	0.061 – 0.153
OLMo 3 7B	0.104	0.077 – 0.135
Gemma 4 E4B base	0.119	0.071 – 0.173
GPT-2 XL	0.144	0.082 – 0.208
Phi-4 reasoning+	0.161	0.072 – 0.255
Gemma 4 E2B base	0.199	0.152 – 0.248

n=24 pairs, 512 tokens. Cluster view. Phi-4: scaffolded yet brittle, scaffold does not guarantee stability.
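For reference, a percentile bootstrap for a mean can be sketched as follows. This assumes the simplest variant (resample prompt pairs with replacement, take empirical quantiles of the resampled means); the intervals in the tables may have been computed with a different bootstrap flavor, and the per-pair values in the usage lines are made up.

```python
import random

def bootstrap_ci(values, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    # Each resample draws n values with replacement and records their mean.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Made-up per-prompt-pair semantic distances for one hypothetical model.
distances = [0.02, 0.05, 0.11, 0.03, 0.08, 0.04, 0.15, 0.06, 0.02, 0.09]
print(bootstrap_ci(distances))
```

With few prompt pairs the interval is wide, which is the reason for presenting clusters rather than ranks.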
