Estimating Gemini 3.5 Flash size via IKP

Date: 2026-05-20. Working session attempt to replicate Bojie Li's Incompressible Knowledge Probes (IKP) methodology and apply it to google/gemini-3.5-flash.

Goal

Use IKP — black-box factual-recall probes → log-linear size estimate — to estimate Gemini 3.5 Flash's parameter count, and verify the methodology by replicating models IKP already measured.

Sources used

IKP paper: https://arxiv.org/pdf/2604.24827
IKP code: https://github.com/19PINE-AI/ikp
Calibration & saved results: data/results/*.json and data/results/calibration_refit_v2.json in that repo
Target API: https://openrouter.ai/google/gemini-3.5-flash

Setup

Ran on EC2 box ec2-100-30-227-78.compute-1.amazonaws.com (ubuntu, harbor-runner.pem, 64 cores / 247 GB RAM).

~/gemini35flash/
  ikp/                    # cloned repo
  results/                # initial parallel runs
  results_seq/            # final sequential runs (workers=16)
  run.sh, run_one.sh, run_seq.sh

OPENROUTER_API_KEY shipped from local env via stdin to ~/.openrouter_env (600).

Methodology recap

Script: scripts/ikp_estimate.py

1,400 probes × 7 obscurity tiers (T1 = universal facts → T7 = extreme long-tail)
Each probe: model answers → judge (Gemini 3 Flash Preview) returns CORRECT / WRONG / REFUSAL
Per-tier score: max(0, (correct + λ × wrong) / total)
Aggregate accuracy: mean across 7 tier scores
Size estimate: log10(params_B) = 6.790 × accuracy − 0.899 (canonical λ=−1.0 fit, R²=0.917 on 89 open models, 1B–1040B)

λ is the hallucination penalty. λ=0 = raw accuracy, λ=−1 = each wrong cancels one correct. Refusals always score 0.

Important inconsistency in the repo:

Canonical paper λ = −1.0 (scripts/ikp_estimate.py line 46, 15_densing_law_analysis.py:284)
Saved per-model JSONs in data/results/*.json are scored at λ=−0.5 (scripts/run_evaluation.py line 55)
The calibration_refit_v2.json provides slope/intercept for both via its sensitivity sweep.

Run 1 — Gemini 3.5 Flash, workers=64, alone

Accuracy: 70.79% (penalized, λ=−1)
Estimated: 8.1T parameters

Per-tier: T1=200/0/0, T2=199/1/0, T3=197/3/0, T4=189/11/0, T5=134/66/0 (anomalous dip), T6=174/21/5, T7=38/73/89.

The T5 dip looked suspicious but was initially taken at face value.

Run 2 — 5 IKP-measured comparators in parallel, workers=32

Models: llama-3.1-8b, qwen-2.5-7b, deepseek-v3, claude-opus-4.7, hy3-preview — all simultaneously, in 5 tmux sessions.

Results were catastrophically wrong. Examples of probes judged WRONG:

Qwen-2.5-7B asked "capital of Poland?" → answered "Warsaw" → marked WRONG
DeepSeek-V3 asked "year JFK assassinated?" → "1963" → marked WRONG
Claude Opus 4.7 asked "Salk Institute architect?" → "Louis Kahn" → marked WRONG

Root cause: judge API contention. 5 parallel runs × 32 workers ≈ 160 concurrent calls to the same google/gemini-3-flash-preview judge endpoint. The judge throttled / returned malformed responses. The script's fallback (scripts/ikp_estimate.py:201) silently returns "WRONG" on any judge exception.

This is a methodology hazard: anyone scaling IKP by running models concurrently against the shared judge will silently get poison results. Failure modes cluster in the easy tiers (T1–T4) because that's where target + judge would normally agree — and the gap from CORRECT to silent-WRONG is the largest.

The original Gemini 3.5 Flash run (workers=64, alone) was also partially contaminated — see Run 3.

Run 3 — sequential batch, workers=16

One tmux session running all 6 models one after another (run_seq.sh). Total ~93 min.

Replication of IKP-measured models (at λ=−1, current script's penalty)

model	true B	my seq acc	my est B	their @ λ=−1 acc	Δ acc
qwen-2.5-7b	7.62	0.2664	8.1B	0.2729	−0.007
llama-3.1-8b	8.03	0.3386	25B	0.3436	−0.005
deepseek-v3	671	0.5964	1.4T	0.5950	+0.001
claude-opus-4.7	?	0.6314	2.4T	0.6343	−0.003
hy3-preview	?	0.6300	2.4T	0.6271	+0.003

Replication essentially exact (all within 0.005 absolute) once judge contention is eliminated. Methodology is reproducible.

Gemini 3.5 Flash: original (contended) vs clean

	workers	acc	est
Run 1 (alone, parallel-batch was later)	64	0.7079	8.1T
Run 3 sequential	16	0.7979	33T

T5: 134/66/0 → 192/7/1. Original was judge-poisoned too, just less severely. Clean read is 33T at face value (λ=−1 calibration).

Position in Gemini ladder (at λ=−1, from IKP saved data + my run)

gemini-2.5-flash      acc 0.526  →   0.47T
gemini-2.5-pro        acc 0.659  →   3.8T
gemini-3.1-flash-lite acc 0.651  →   3.3T
gpt-5                 acc 0.678  →   5.1T
deepseek-v4-flash     acc 0.629  →   2.4T
gemini-3-flash        acc 0.774  →  22.8T
→ gemini-3.5-flash    acc 0.798  →  33.0T  ← my run
gemini-3.1-pro        acc 0.829  →  53.9T

Internally consistent with the Gemini family ladder.

λ analysis — overshoots are not a λ artifact

User intuition was that mismatched λ between scoring and calibration might explain overshoots. Tested by recomputing at λ=−0.5 with matched calibration:

model	trueB	est@λ=−1	over@−1	est@λ=−0.5	over@−0.5
qwen-2.5-7b	7.62	8.1B	1.07×	5.7B	0.74×
llama-3.1-8b	8.03	25B	3.13×	24B	2.94×
deepseek-v3	671	1.4T	2.11×	1.2T	1.76×
gemini-3.5-flash	?	33T	–	24T	–

Overshoots persist regardless of λ. The calibration itself is loose — IKP's own LOO-CV admits median 1.59× fold error and only 87.6% of estimates within 3×. Our overshoots fall inside that band but it's a wide band.

Hy3-preview ground truth — the killer anchor

Bing-checked public specs for tencent/hy3-preview:

295B total / 21B active (MoE, top-8 of 192 experts + 1 shared)
Sources: HF model card, vLLM recipes, GitHub

IKP estimate (mine and theirs, both ~0.63 acc): 2.3–2.4T.

That's ~8× overshoot on a known frontier MoE — far worse than the 2–3× we saw on Llama/DeepSeek. The methodology is systematically off on modern MoEs.

HN comparison for Gemini 3.5 Flash

User easygenes on Hacker News did a TPU 8i hardware back-calculation:

~250–300B total, ~10–16B active
Possibly up to 400B with TurboQuant-style compression
Source: https://news.ycombinator.com/item?id=48196570, summarized in https://gigazine.net/news/20260520-gemini-3-5-flash-parameter/

This is a hardware-grounded estimate (memory bandwidth, observed tokens/sec on known TPU specs) — much more constrained than IKP's factual extrapolation.

Disagreement summary

Source	Total params	Active
HN (TPU back-calc, easygenes)	0.25–0.4T	10–16B
IKP face value (my seq run, λ=−1)	33T	–
IKP face value (λ=−0.5 calibration)	24T	–
IKP ÷ Llama/DS overshoot (2–3×)	8–15T	–
IKP ÷ hy3-preview overshoot (8×)	~3–4T	–

Even the hy3-anchored correction leaves IKP ~10× above the TPU-based HN estimate.

Why does IKP overshoot frontier MoEs?

Plausible causes (not formally tested):

MoE knowledge density — newer/better-trained MoEs cram more facts per parameter than the calibration set (dense + older MoE) implies. The implicit assumption "factual recall ∝ log(total params)" breaks at the frontier.
Probe contamination — IKP is public on GitHub since April 2026; hy3-preview (released 2026-04-22) and Gemini 3.5 Flash (released 2026-05-19) both trained after publication. Probes may be in pretraining, inflating accuracy independent of capacity.
Sibling-judge bias — for Gemini specifically: judge = gemini-3-flash-preview, target = gemini-3.5-flash. Same family. Doesn't explain hy3 overshoot, but compounds for Gemini.
Calibration set composition — 89 open models ≤ 1040B; everything beyond is wild extrapolation.

Honest bottom line

Replication of IKP works. At matched λ and without judge contention, my 5 comparator runs reproduced IKP's saved accuracies to within 0.005 absolute.
IKP itself does not give a usable absolute size estimate for frontier 2026 MoEs. Hy3-preview's ground-truth 295B vs IKP-projected 2.4T (8× off) is the smoking gun.
For Gemini 3.5 Flash specifically: IKP face value 24–33T; HN TPU-based estimate ~250–400B. The HN number is far more credible.
What IKP does give is a relative signal. The Gemini-family ladder (2.5 Flash → 3 Flash → 3.5 Flash → 3.1 Pro) is monotone and matches release ordering. IKP is useful for ordering models within / between families, not for absolute params.

Methodology notes / hazards for next time

Don't parallelize across models against the same judge endpoint. Use sequential or workers ≤ 16. The script silently mis-classifies on judge errors.
Always check the script's λ vs the calibration JSON's λ. They are consistent (both −1.0) in the current repo, but the saved per-model JSONs use −0.5 — easy to confuse.
The estimate is a single point — script prints no CI. Quote the LOO-CV band (median 1.59×, 87.6% within 3×) when reporting.
Sanity-check against a known-size frontier MoE before trusting the absolute number. Hy3 + open-weight MoEs (DeepSeek-V3, Qwen 235B, Llama 4) are good anchors.
Contamination risk on public benchmarks is now significant — any model trained after April 2026 may have seen IKP probes.

stared/estimate.md

Select an option

No results found