Skip to content

Instantly share code, notes, and snippets.

@stared
Created May 20, 2026 11:57
Show Gist options
  • Select an option

  • Save stared/a86d7380937e6d0ab7920014866ace4c to your computer and use it in GitHub Desktop.

Select an option

Save stared/a86d7380937e6d0ab7920014866ace4c to your computer and use it in GitHub Desktop.
Gemini 3.5 Flash size estimation by IKP - notes

Estimating Gemini 3.5 Flash size via IKP

Date: 2026-05-20. Working session attempt to replicate Bojie Li's Incompressible Knowledge Probes (IKP) methodology and apply it to google/gemini-3.5-flash.

Goal

Use IKP — black-box factual-recall probes → log-linear size estimate — to estimate Gemini 3.5 Flash's parameter count, and verify the methodology by replicating models IKP already measured.

Sources used

Setup

Ran on EC2 box ec2-100-30-227-78.compute-1.amazonaws.com (ubuntu, harbor-runner.pem, 64 cores / 247 GB RAM).

~/gemini35flash/
  ikp/                    # cloned repo
  results/                # initial parallel runs
  results_seq/            # final sequential runs (workers=16)
  run.sh, run_one.sh, run_seq.sh

OPENROUTER_API_KEY shipped from local env via stdin to ~/.openrouter_env (600).

Methodology recap

Script: scripts/ikp_estimate.py

  • 1,400 probes × 7 obscurity tiers (T1 = universal facts → T7 = extreme long-tail)
  • Each probe: model answers → judge (Gemini 3 Flash Preview) returns CORRECT / WRONG / REFUSAL
  • Per-tier score: max(0, (correct + λ × wrong) / total)
  • Aggregate accuracy: mean across 7 tier scores
  • Size estimate: log10(params_B) = 6.790 × accuracy − 0.899 (canonical λ=−1.0 fit, R²=0.917 on 89 open models, 1B–1040B)

λ is the hallucination penalty. λ=0 = raw accuracy, λ=−1 = each wrong cancels one correct. Refusals always score 0.

Important inconsistency in the repo:

  • Canonical paper λ = −1.0 (scripts/ikp_estimate.py line 46, 15_densing_law_analysis.py:284)
  • Saved per-model JSONs in data/results/*.json are scored at λ=−0.5 (scripts/run_evaluation.py line 55)
  • The calibration_refit_v2.json provides slope/intercept for both via its sensitivity sweep.

Run 1 — Gemini 3.5 Flash, workers=64, alone

Accuracy: 70.79% (penalized, λ=−1)
Estimated: 8.1T parameters

Per-tier: T1=200/0/0, T2=199/1/0, T3=197/3/0, T4=189/11/0, T5=134/66/0 (anomalous dip), T6=174/21/5, T7=38/73/89.

The T5 dip looked suspicious but was initially taken at face value.

Run 2 — 5 IKP-measured comparators in parallel, workers=32

Models: llama-3.1-8b, qwen-2.5-7b, deepseek-v3, claude-opus-4.7, hy3-preview — all simultaneously, in 5 tmux sessions.

Results were catastrophically wrong. Examples of probes judged WRONG:

  • Qwen-2.5-7B asked "capital of Poland?" → answered "Warsaw" → marked WRONG
  • DeepSeek-V3 asked "year JFK assassinated?" → "1963" → marked WRONG
  • Claude Opus 4.7 asked "Salk Institute architect?" → "Louis Kahn" → marked WRONG

Root cause: judge API contention. 5 parallel runs × 32 workers ≈ 160 concurrent calls to the same google/gemini-3-flash-preview judge endpoint. The judge throttled / returned malformed responses. The script's fallback (scripts/ikp_estimate.py:201) silently returns "WRONG" on any judge exception.

This is a methodology hazard: anyone scaling IKP by running models concurrently against the shared judge will silently get poison results. Failure modes cluster in the easy tiers (T1–T4) because that's where target + judge would normally agree — and the gap from CORRECT to silent-WRONG is the largest.

The original Gemini 3.5 Flash run (workers=64, alone) was also partially contaminated — see Run 3.

Run 3 — sequential batch, workers=16

One tmux session running all 6 models one after another (run_seq.sh). Total ~93 min.

Replication of IKP-measured models (at λ=−1, current script's penalty)

model true B my seq acc my est B their @ λ=−1 acc Δ acc
qwen-2.5-7b 7.62 0.2664 8.1B 0.2729 −0.007
llama-3.1-8b 8.03 0.3386 25B 0.3436 −0.005
deepseek-v3 671 0.5964 1.4T 0.5950 +0.001
claude-opus-4.7 ? 0.6314 2.4T 0.6343 −0.003
hy3-preview ? 0.6300 2.4T 0.6271 +0.003

Replication essentially exact (all within 0.005 absolute) once judge contention is eliminated. Methodology is reproducible.

Gemini 3.5 Flash: original (contended) vs clean

workers acc est
Run 1 (alone, parallel-batch was later) 64 0.7079 8.1T
Run 3 sequential 16 0.7979 33T

T5: 134/66/0 → 192/7/1. Original was judge-poisoned too, just less severely. Clean read is 33T at face value (λ=−1 calibration).

Position in Gemini ladder (at λ=−1, from IKP saved data + my run)

gemini-2.5-flash      acc 0.526  →   0.47T
gemini-2.5-pro        acc 0.659  →   3.8T
gemini-3.1-flash-lite acc 0.651  →   3.3T
gpt-5                 acc 0.678  →   5.1T
deepseek-v4-flash     acc 0.629  →   2.4T
gemini-3-flash        acc 0.774  →  22.8T
→ gemini-3.5-flash    acc 0.798  →  33.0T  ← my run
gemini-3.1-pro        acc 0.829  →  53.9T

Internally consistent with the Gemini family ladder.

λ analysis — overshoots are not a λ artifact

User intuition was that mismatched λ between scoring and calibration might explain overshoots. Tested by recomputing at λ=−0.5 with matched calibration:

model trueB est@λ=−1 over@−1 est@λ=−0.5 over@−0.5
qwen-2.5-7b 7.62 8.1B 1.07× 5.7B 0.74×
llama-3.1-8b 8.03 25B 3.13× 24B 2.94×
deepseek-v3 671 1.4T 2.11× 1.2T 1.76×
gemini-3.5-flash ? 33T 24T

Overshoots persist regardless of λ. The calibration itself is loose — IKP's own LOO-CV admits median 1.59× fold error and only 87.6% of estimates within 3×. Our overshoots fall inside that band but it's a wide band.

Hy3-preview ground truth — the killer anchor

Bing-checked public specs for tencent/hy3-preview:

IKP estimate (mine and theirs, both ~0.63 acc): 2.3–2.4T.

That's ~8× overshoot on a known frontier MoE — far worse than the 2–3× we saw on Llama/DeepSeek. The methodology is systematically off on modern MoEs.

HN comparison for Gemini 3.5 Flash

User easygenes on Hacker News did a TPU 8i hardware back-calculation:

This is a hardware-grounded estimate (memory bandwidth, observed tokens/sec on known TPU specs) — much more constrained than IKP's factual extrapolation.

Disagreement summary

Source Total params Active
HN (TPU back-calc, easygenes) 0.25–0.4T 10–16B
IKP face value (my seq run, λ=−1) 33T
IKP face value (λ=−0.5 calibration) 24T
IKP ÷ Llama/DS overshoot (2–3×) 8–15T
IKP ÷ hy3-preview overshoot (8×) ~3–4T

Even the hy3-anchored correction leaves IKP ~10× above the TPU-based HN estimate.

Why does IKP overshoot frontier MoEs?

Plausible causes (not formally tested):

  1. MoE knowledge density — newer/better-trained MoEs cram more facts per parameter than the calibration set (dense + older MoE) implies. The implicit assumption "factual recall ∝ log(total params)" breaks at the frontier.
  2. Probe contamination — IKP is public on GitHub since April 2026; hy3-preview (released 2026-04-22) and Gemini 3.5 Flash (released 2026-05-19) both trained after publication. Probes may be in pretraining, inflating accuracy independent of capacity.
  3. Sibling-judge bias — for Gemini specifically: judge = gemini-3-flash-preview, target = gemini-3.5-flash. Same family. Doesn't explain hy3 overshoot, but compounds for Gemini.
  4. Calibration set composition — 89 open models ≤ 1040B; everything beyond is wild extrapolation.

Honest bottom line

  • Replication of IKP works. At matched λ and without judge contention, my 5 comparator runs reproduced IKP's saved accuracies to within 0.005 absolute.
  • IKP itself does not give a usable absolute size estimate for frontier 2026 MoEs. Hy3-preview's ground-truth 295B vs IKP-projected 2.4T (8× off) is the smoking gun.
  • For Gemini 3.5 Flash specifically: IKP face value 24–33T; HN TPU-based estimate ~250–400B. The HN number is far more credible.
  • What IKP does give is a relative signal. The Gemini-family ladder (2.5 Flash → 3 Flash → 3.5 Flash → 3.1 Pro) is monotone and matches release ordering. IKP is useful for ordering models within / between families, not for absolute params.

Methodology notes / hazards for next time

  1. Don't parallelize across models against the same judge endpoint. Use sequential or workers ≤ 16. The script silently mis-classifies on judge errors.
  2. Always check the script's λ vs the calibration JSON's λ. They are consistent (both −1.0) in the current repo, but the saved per-model JSONs use −0.5 — easy to confuse.
  3. The estimate is a single point — script prints no CI. Quote the LOO-CV band (median 1.59×, 87.6% within 3×) when reporting.
  4. Sanity-check against a known-size frontier MoE before trusting the absolute number. Hy3 + open-weight MoEs (DeepSeek-V3, Qwen 235B, Llama 4) are good anchors.
  5. Contamination risk on public benchmarks is now significant — any model trained after April 2026 may have seen IKP probes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment