Date: 2026-05-20. Working session attempt to replicate Bojie Li's Incompressible Knowledge Probes (IKP) methodology and apply it to google/gemini-3.5-flash.
Use IKP — black-box factual-recall probes → log-linear size estimate — to estimate Gemini 3.5 Flash's parameter count, and verify the methodology by replicating models IKP already measured.
- IKP paper: https://arxiv.org/pdf/2604.24827
- IKP code: https://github.com/19PINE-AI/ikp
- Calibration & saved results:
data/results/*.jsonanddata/results/calibration_refit_v2.jsonin that repo - Target API: https://openrouter.ai/google/gemini-3.5-flash
Ran on EC2 box ec2-100-30-227-78.compute-1.amazonaws.com (ubuntu, harbor-runner.pem, 64 cores / 247 GB RAM).
~/gemini35flash/
ikp/ # cloned repo
results/ # initial parallel runs
results_seq/ # final sequential runs (workers=16)
run.sh, run_one.sh, run_seq.sh
OPENROUTER_API_KEY shipped from local env via stdin to ~/.openrouter_env (600).
Script: scripts/ikp_estimate.py
- 1,400 probes × 7 obscurity tiers (T1 = universal facts → T7 = extreme long-tail)
- Each probe: model answers → judge (Gemini 3 Flash Preview) returns CORRECT / WRONG / REFUSAL
- Per-tier score:
max(0, (correct + λ × wrong) / total) - Aggregate accuracy: mean across 7 tier scores
- Size estimate:
log10(params_B) = 6.790 × accuracy − 0.899(canonical λ=−1.0 fit, R²=0.917 on 89 open models, 1B–1040B)
λ is the hallucination penalty. λ=0 = raw accuracy, λ=−1 = each wrong cancels one correct. Refusals always score 0.
Important inconsistency in the repo:
- Canonical paper λ = −1.0 (
scripts/ikp_estimate.pyline 46,15_densing_law_analysis.py:284) - Saved per-model JSONs in
data/results/*.jsonare scored at λ=−0.5 (scripts/run_evaluation.pyline 55) - The
calibration_refit_v2.jsonprovides slope/intercept for both via its sensitivity sweep.
Accuracy: 70.79% (penalized, λ=−1)
Estimated: 8.1T parameters
Per-tier: T1=200/0/0, T2=199/1/0, T3=197/3/0, T4=189/11/0, T5=134/66/0 (anomalous dip), T6=174/21/5, T7=38/73/89.
The T5 dip looked suspicious but was initially taken at face value.
Models: llama-3.1-8b, qwen-2.5-7b, deepseek-v3, claude-opus-4.7, hy3-preview — all simultaneously, in 5 tmux sessions.
Results were catastrophically wrong. Examples of probes judged WRONG:
- Qwen-2.5-7B asked "capital of Poland?" → answered "Warsaw" → marked WRONG
- DeepSeek-V3 asked "year JFK assassinated?" → "1963" → marked WRONG
- Claude Opus 4.7 asked "Salk Institute architect?" → "Louis Kahn" → marked WRONG
Root cause: judge API contention. 5 parallel runs × 32 workers ≈ 160 concurrent calls to the same google/gemini-3-flash-preview judge endpoint. The judge throttled / returned malformed responses. The script's fallback (scripts/ikp_estimate.py:201) silently returns "WRONG" on any judge exception.
This is a methodology hazard: anyone scaling IKP by running models concurrently against the shared judge will silently get poison results. Failure modes cluster in the easy tiers (T1–T4) because that's where target + judge would normally agree — and the gap from CORRECT to silent-WRONG is the largest.
The original Gemini 3.5 Flash run (workers=64, alone) was also partially contaminated — see Run 3.
One tmux session running all 6 models one after another (run_seq.sh). Total ~93 min.
| model | true B | my seq acc | my est B | their @ λ=−1 acc | Δ acc |
|---|---|---|---|---|---|
| qwen-2.5-7b | 7.62 | 0.2664 | 8.1B | 0.2729 | −0.007 |
| llama-3.1-8b | 8.03 | 0.3386 | 25B | 0.3436 | −0.005 |
| deepseek-v3 | 671 | 0.5964 | 1.4T | 0.5950 | +0.001 |
| claude-opus-4.7 | ? | 0.6314 | 2.4T | 0.6343 | −0.003 |
| hy3-preview | ? | 0.6300 | 2.4T | 0.6271 | +0.003 |
Replication essentially exact (all within 0.005 absolute) once judge contention is eliminated. Methodology is reproducible.
| workers | acc | est | |
|---|---|---|---|
| Run 1 (alone, parallel-batch was later) | 64 | 0.7079 | 8.1T |
| Run 3 sequential | 16 | 0.7979 | 33T |
T5: 134/66/0 → 192/7/1. Original was judge-poisoned too, just less severely. Clean read is 33T at face value (λ=−1 calibration).
gemini-2.5-flash acc 0.526 → 0.47T
gemini-2.5-pro acc 0.659 → 3.8T
gemini-3.1-flash-lite acc 0.651 → 3.3T
gpt-5 acc 0.678 → 5.1T
deepseek-v4-flash acc 0.629 → 2.4T
gemini-3-flash acc 0.774 → 22.8T
→ gemini-3.5-flash acc 0.798 → 33.0T ← my run
gemini-3.1-pro acc 0.829 → 53.9T
Internally consistent with the Gemini family ladder.
User intuition was that mismatched λ between scoring and calibration might explain overshoots. Tested by recomputing at λ=−0.5 with matched calibration:
| model | trueB | est@λ=−1 | over@−1 | est@λ=−0.5 | over@−0.5 |
|---|---|---|---|---|---|
| qwen-2.5-7b | 7.62 | 8.1B | 1.07× | 5.7B | 0.74× |
| llama-3.1-8b | 8.03 | 25B | 3.13× | 24B | 2.94× |
| deepseek-v3 | 671 | 1.4T | 2.11× | 1.2T | 1.76× |
| gemini-3.5-flash | ? | 33T | – | 24T | – |
Overshoots persist regardless of λ. The calibration itself is loose — IKP's own LOO-CV admits median 1.59× fold error and only 87.6% of estimates within 3×. Our overshoots fall inside that band but it's a wide band.
Bing-checked public specs for tencent/hy3-preview:
- 295B total / 21B active (MoE, top-8 of 192 experts + 1 shared)
- Sources: HF model card, vLLM recipes, GitHub
IKP estimate (mine and theirs, both ~0.63 acc): 2.3–2.4T.
That's ~8× overshoot on a known frontier MoE — far worse than the 2–3× we saw on Llama/DeepSeek. The methodology is systematically off on modern MoEs.
User easygenes on Hacker News did a TPU 8i hardware back-calculation:
- ~250–300B total, ~10–16B active
- Possibly up to 400B with TurboQuant-style compression
- Source: https://news.ycombinator.com/item?id=48196570, summarized in https://gigazine.net/news/20260520-gemini-3-5-flash-parameter/
This is a hardware-grounded estimate (memory bandwidth, observed tokens/sec on known TPU specs) — much more constrained than IKP's factual extrapolation.
| Source | Total params | Active |
|---|---|---|
| HN (TPU back-calc, easygenes) | 0.25–0.4T | 10–16B |
| IKP face value (my seq run, λ=−1) | 33T | – |
| IKP face value (λ=−0.5 calibration) | 24T | – |
| IKP ÷ Llama/DS overshoot (2–3×) | 8–15T | – |
| IKP ÷ hy3-preview overshoot (8×) | ~3–4T | – |
Even the hy3-anchored correction leaves IKP ~10× above the TPU-based HN estimate.
Plausible causes (not formally tested):
- MoE knowledge density — newer/better-trained MoEs cram more facts per parameter than the calibration set (dense + older MoE) implies. The implicit assumption "factual recall ∝ log(total params)" breaks at the frontier.
- Probe contamination — IKP is public on GitHub since April 2026; hy3-preview (released 2026-04-22) and Gemini 3.5 Flash (released 2026-05-19) both trained after publication. Probes may be in pretraining, inflating accuracy independent of capacity.
- Sibling-judge bias — for Gemini specifically: judge =
gemini-3-flash-preview, target =gemini-3.5-flash. Same family. Doesn't explain hy3 overshoot, but compounds for Gemini. - Calibration set composition — 89 open models ≤ 1040B; everything beyond is wild extrapolation.
- Replication of IKP works. At matched λ and without judge contention, my 5 comparator runs reproduced IKP's saved accuracies to within 0.005 absolute.
- IKP itself does not give a usable absolute size estimate for frontier 2026 MoEs. Hy3-preview's ground-truth 295B vs IKP-projected 2.4T (8× off) is the smoking gun.
- For Gemini 3.5 Flash specifically: IKP face value 24–33T; HN TPU-based estimate ~250–400B. The HN number is far more credible.
- What IKP does give is a relative signal. The Gemini-family ladder (2.5 Flash → 3 Flash → 3.5 Flash → 3.1 Pro) is monotone and matches release ordering. IKP is useful for ordering models within / between families, not for absolute params.
- Don't parallelize across models against the same judge endpoint. Use sequential or workers ≤ 16. The script silently mis-classifies on judge errors.
- Always check the script's λ vs the calibration JSON's λ. They are consistent (both −1.0) in the current repo, but the saved per-model JSONs use −0.5 — easy to confuse.
- The estimate is a single point — script prints no CI. Quote the LOO-CV band (median 1.59×, 87.6% within 3×) when reporting.
- Sanity-check against a known-size frontier MoE before trusting the absolute number. Hy3 + open-weight MoEs (DeepSeek-V3, Qwen 235B, Llama 4) are good anchors.
- Contamination risk on public benchmarks is now significant — any model trained after April 2026 may have seen IKP probes.