Tests heatmap

13 questions, run against open, open-weight, and closed reference models over 34 test cycles that varied the input, from raw PDFs to derivatives built mostly with Docling.

Scale: 0% to 100% (continuous green→yellow→red gradient)
Model · Type · one column per test cycle, cyc 2 through cyc 34 (variant and question count for each are listed in the test cycle key) · All test cycles
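EuroLLM-9B-Instruct · Open model · 38% (5/13), 19% (5/26), 54% (7/13), 54% (7/13), 62% (8/13), 54% (7/13), 54% (7/13), 62% (8/13), 54% (7/13), 38% (5/13), 54% (7/13), 46% (6/13), 8% (1/13), 31% (4/13), 0% (0/1) · All test cycles: 43% (84/196)
granite3.3:8b · Open model · 77% (10/13), 77% (10/13), 21% (5/24), 54% (7/13), 62% (8/13), 85% (11/13), 46% (6/13), 69% (9/13), 38% (5/13), 62% (8/13), 62% (8/13), 69% (9/13), 31% (4/13), 69% (9/13), 62% (8/13), 77% (10/13), 38% (5/13), 31% (4/13), 46% (6/13), 0% (0/1) · All test cycles: 55% (142/259)
olmo2:7b · Open model · 100% (2/2), 31% (4/13), 31% (4/13), 31% (4/13), 46% (6/13), 31% (4/13), 31% (4/13), 46% (6/13), 31% (4/13), 23% (3/13), 23% (3/13), 0% (0/1) · All test cycles: 33% (44/133)
swiss-ai/Apertus-70B-Instruct-2509 · Open model · 40% (2/5), 30% (6/20), 46% (6/13), 28% (8/29), 54% (7/13), 35% (9/26), 54% (7/13), 23% (3/13), 38% (5/13), 0% (0/1) · All test cycles: 36% (53/146)
swiss-ai/Apertus-8B-Instruct-2509 · Open model · 43% (3/7), 36% (5/14), 40% (2/5), 23% (3/13), 26% (11/42), 46% (6/13), 46% (6/13), 46% (6/13), 38% (5/13), 46% (6/13), 46% (6/13), 46% (6/13), 54% (7/13), 54% (7/13), 54% (7/13), 46% (6/13), 46% (6/13), 38% (5/13), 54% (7/13), 46% (6/13), 46% (6/13), 54% (7/13), 15% (2/13), 23% (3/13), 38% (5/13), 0% (0/2) · All test cycles: 41% (139/343)
utter-project/EuroLLM-22B-Instruct-2512 · Open model · 22% (2/9) · All test cycles: 22% (2/9)
climategpt-13b · Open-weight · 20% (1/5), 20% (1/5), 20% (1/5), 15% (2/13), 15% (2/13), 15% (2/13), 15% (2/13), 8% (1/13), 23% (3/13), 38% (5/13), 31% (4/13), 31% (4/13), 31% (4/13), 31% (4/13), 31% (4/13), 31% (4/13), 38% (5/13), 23% (3/13), 15% (2/13), 23% (3/13), 0% (0/13), 0% (0/13), 8% (1/13), 0% (0/2) · All test cycles: 21% (58/277)
climategpt-70b · Open-weight · 20% (1/5), 20% (1/5), 15% (2/13), 15% (2/13), 23% (3/13), 23% (3/13), 15% (2/13), 8% (1/13), 15% (2/13), 100% (1/1) · All test cycles: 18% (18/102)
climategpt-7b · Open-weight · 50% (1/2), 20% (1/5), 23% (3/13), 23% (3/13), 15% (2/13), 15% (2/13), 8% (1/13), 23% (3/13), 15% (2/13), 23% (3/13), 0% (0/13), 15% (2/13), 0% (0/2) · All test cycles: 17% (23/139)
deepseek-r1:8b · Open-weight · 54% (7/13), 54% (7/13), 62% (8/13), 62% (8/13), 77% (10/13), 38% (5/13), 23% (3/13), 0% (0/1) · All test cycles: 52% (48/92)
Gemma-SEA-LION-v4-27B-IT · Open-weight · 40% (2/5), 62% (8/13), 31% (8/26), 62% (8/13), 62% (8/13), 38% (5/13), 38% (5/13), 0% (0/1) · All test cycles: 45% (44/97)
gemma2:9b · Open-weight · 38% (5/13), 38% (5/13), 33% (8/24), 62% (8/13), 62% (8/13), 54% (7/13), 54% (7/13), 38% (5/13), 69% (9/13), 54% (7/13), 54% (7/13), 62% (8/13), 54% (7/13), 54% (7/13), 62% (8/13), 31% (4/13), 31% (4/13), 23% (3/13), 0% (0/1) · All test cycles: 48% (117/246)
llama3:latest · Open-weight · 100% (2/2), 62% (8/13), 62% (8/13), 62% (8/13), 62% (8/13), 69% (9/13), 69% (9/13), 54% (7/13), 69% (9/13), 54% (7/13), 77% (10/13), 69% (9/13), 62% (8/13), 69% (9/13), 62% (8/13), 62% (8/13), 77% (10/13), 23% (3/13), 23% (3/13), 46% (6/13), 0% (0/1) · All test cycles: 60% (149/250)
mistral:latest · Open-weight · 100% (2/2), 46% (6/13), 46% (6/13), 69% (9/13), 69% (9/13), 69% (9/13), 46% (6/13), 54% (7/13), 54% (7/13), 62% (8/13), 54% (7/13), 62% (8/13), 62% (8/13), 31% (4/13), 38% (5/13), 54% (7/13), 31% (4/13), 38% (5/13), 23% (3/13), 0% (0/1) · All test cycles: 51% (120/237)
phi3:mini · Open-weight · 23% (3/13), 23% (3/13), 38% (5/13), 31% (4/13), 62% (8/13), 31% (4/13), 46% (6/13), 38% (5/13), 38% (5/13), 31% (4/13), 46% (6/13), 23% (3/13), 8% (1/13), 15% (2/13), 0% (0/1) · All test cycles: 32% (59/183)
qwen2.5-coder:7b · Open-weight · 46% (6/13), 46% (6/13), 77% (10/13), 77% (10/13), 62% (8/13), 69% (9/13), 62% (8/13), 31% (4/13), 23% (3/13), 0% (0/1) · All test cycles: 54% (64/118)
qwen2.5:3b · Open-weight · 46% (6/13), 46% (6/13), 54% (7/13), 46% (6/13), 46% (6/13), 46% (6/13), 38% (5/13), 46% (6/13), 23% (3/13), 15% (2/13), 0% (0/1) · All test cycles: 40% (53/131)
qwen2.5:7b · Open-weight · 62% (8/13), 62% (8/13), 25% (6/24), 62% (8/13), 54% (7/13), 85% (11/13), 54% (7/13), 69% (9/13), 77% (10/13), 77% (10/13), 77% (10/13), 69% (9/13), 69% (9/13), 77% (10/13), 69% (9/13), 77% (10/13), 38% (5/13), 46% (6/13), 31% (4/13), 0% (0/1) · All test cycles: 60% (156/259)
claude-code-agent · Closed reference · 100% (10/10) · All test cycles: 100% (10/10)
gemini-2.5-pro · Closed reference · 100% (2/2), 50% (5/10), 85% (11/13), 85% (11/13), 27% (7/26), 85% (11/13), 77% (10/13), 53% (8/15), 62% (8/13), 80% (12/15), 59% (10/17), 46% (6/13), 85% (11/13), 77% (10/13), 62% (8/13), 85% (11/13), 54% (7/13), 46% (6/13), 0% (0/1) · All test cycles: 64% (154/242)
gpt-4.1 · Closed reference · 100% (2/2), 85% (11/13), 92% (12/13), 77% (10/13), 77% (10/13), 85% (11/13), 69% (9/13), 38% (5/13), 0% (0/1) · All test cycles: 74% (70/94)
gpt-4o-mini · Closed reference · 100% (2/2), 69% (9/13), 62% (8/13), 62% (8/13), 69% (9/13), 69% (9/13), 46% (6/13), 46% (6/13), 0% (0/1) · All test cycles: 61% (57/94)
grok-4 · Closed reference · 100% (2/2), 92% (12/13), 92% (12/13), 56% (29/52), 85% (11/13), 92% (12/13), 92% (12/13), 69% (9/13), 77% (10/13), 77% (10/13), 92% (12/13), 77% (10/13), 92% (12/13), 77% (10/13), 92% (12/13), 85% (11/13), 77% (10/13), 85% (11/13), 85% (11/13), 85% (11/13), 92% (12/13), 77% (10/13), 77% (10/13), 46% (6/13), 0% (0/1) · All test cycles: 78% (267/341)
mistral-large-latest · Closed reference · 100% (2/2), 31% (4/13), 31% (4/13), 31% (4/13), 54% (7/13), 31% (4/13), 8% (1/13), 8% (1/13), 0% (0/1) · All test cycles: 29% (27/94)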
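The "All test cycles" figures are consistent with a pooled rate (total passes over total questions asked) rather than a mean of the per-cycle percentages. A minimal sketch of that arithmetic, using the olmo2:7b row; the function name `pooled_pass_rate` is illustrative and not part of the harness:

```python
from fractions import Fraction

def pooled_pass_rate(cells):
    """Pool (passed, asked) pairs into one overall pass rate."""
    passed = sum(p for p, _ in cells)
    asked = sum(a for _, a in cells)
    return passed, asked, Fraction(passed, asked)

# Per-cycle cells for olmo2:7b as (passed, asked), copied from the heatmap row.
olmo2_cells = [(2, 2), (4, 13), (4, 13), (4, 13), (6, 13), (4, 13),
               (4, 13), (6, 13), (4, 13), (3, 13), (3, 13), (0, 1)]

passed, asked, rate = pooled_pass_rate(olmo2_cells)
print(f"{passed}/{asked} = {float(rate):.0%}")  # 44/133 = 33%
```

Pooling weights each cycle by its question count, so a 2-question cycle at 100% moves the total far less than it would in an unweighted average of percentages.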
Test cycle key — variant tested and question count
cyc 2 · ? · 0 questions
cyc 2.1 · ? · 0 questions · 2026-05-20
cyc 3 · ? · 0 questions · 2026-05-20
cyc 4 · fix-toc-misclass-and-decimal-ocr · 5 questions · 2026-05-21
cyc 5 · fix-toc-misclass-and-decimal-ocr · 13 questions · 2026-05-21
cyc 6 · fix-toc-misclass-and-decimal-ocr · 13 questions · 2026-05-21
cyc 7 · fix-toc-misclass-and-decimal-ocr · 13 questions · 2026-05-21
cyc 8 · fix-toc-misclass-and-decimal-ocr · 13 questions · 2026-05-21
cyc 9 · fix-toc-misclass-and-decimal-ocr · 13 questions · 2026-05-21
cyc 10 · fix-toc-misclass-and-decimal-ocr · 13 questions · 2026-05-21
cyc 11 · fix-toc-misclass-and-decimal-ocr · 13 questions · 2026-05-21
cyc 12 · fix-toc-misclass-and-decimal-ocr · 13 questions · 2026-05-21
cyc 13 · fix-toc-misclass-and-decimal-ocr · 13 questions · 2026-05-22
cyc 14 · micro-1k · 13 questions · 2026-05-22
cyc 15 · compact-2k · 13 questions · 2026-05-22
cyc 16 · table-only · 13 questions · 2026-05-22
cyc 17 · csv-only · 13 questions · 2026-05-22
cyc 18 · labeled · 13 questions · 2026-05-22
cyc 19 · csv-plus-headings · 13 questions · 2026-05-22
cyc 20 · csv-plus-scope · 13 questions · 2026-05-22
cyc 21 · csv-plus-paragraph · 13 questions · 2026-05-22
cyc 22 · csv-plus-all-context · 13 questions · 2026-05-22
cyc 23 · csv-only · 13 questions · 2026-05-22
cyc 24 · csv-only · 13 questions · 2026-05-22
cyc 25 · csv-only · 13 questions · 2026-05-22
cyc 26 · stitched · 13 questions · 2026-05-22
cyc 27 · csv-normalized · 13 questions · 2026-05-23
cyc 28 · csv-demerged · 13 questions · 2026-05-23
cyc 29 · table-normalized · 13 questions · 2026-05-23
cyc 30 · table-normalized · 13 questions · 2026-05-23
cyc 31 · doc-index · 13 questions · 2026-05-23
cyc 32 · doc-index-enriched · 13 questions · 2026-05-23
cyc 33 · pipeline-v0.6 · 13 questions · 2026-05-24
cyc 34 · csv-only · 2 questions · 2026-05-25

The test cycles that scale — what was tested, what was learned

Filtered for the findings that inform production architecture at thousands of documents and tens of thousands of tables. Six cycles are shown by default: the per-mode artefact gradient (cycle 7), the first full-card oracle baseline (cycle 13), the production card-format finding (cycle 17), the infrastructure-substrate pattern (cycle 30), the project-reshaping oracle-penalty finding (cycle 31), and the scale-ready vector-retrieval architecture (cycle 33). Supporting cycles are collapsed below: early exploration, the v0.7 variant sweep, minor reruns, discarded attempts, and the dead-end signal that saved us from further M3-IDX investment (cycle 32).

Cycle 7 · important · fix-toc-misclass-and-decimal-ocr

Tested First apertus-8b sweep across modes (M3-L4 oracle card, M2c Docling MD, M3-AC all-cards bundle) — the per-mode gradient on one open model.

Learned apertus M3-L4 = 5/13, M2c = 3/13, M3-AC = 1/13. The artefact-contribution signal: cards give a meaningful lift over Docling's MD linearization, which itself sits above whole-document JSON. The PDF→JSON→MD→cards→oracle gradient is real and measurable.

Cycle 13 · important · fix-toc-misclass-and-decimal-ocr

Tested M3-L4 oracle on the v0.6.1 full cards (heavy markdown + context envelope) — 11 models × 13 questions = 143 cells.

Learned 27% open-tier pass rate. The 'cards' rung of the gradient at its first iteration: cards lift open-tier performance over Docling MD (~M2c) but the heavy v0.6.1 format leaves room. Cycle 17 then shows that compressing to CSV-only nearly doubles this number at the same oracle layer.

Cycle 17 · important · csv-only

Tested pipeline-v0.7-csv-only variant — table data as raw CSV instead of markdown table.

Learned Established the production-recommended card format that scales: compact CSV cards (~1.5 KB) let Qwen-2.5-7B and Granite-3.3-8B reach 11/13 at the oracle layer — the format finding survives scale because it's a per-card transform, deterministic and stateless. The 'matches frontier' headline was later reframed as an oracle ceiling by cycle 31, but the format choice itself is the right production substrate.
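To make the "per-card transform, deterministic and stateless" point concrete, here is a minimal sketch of turning a markdown table into a CSV card body. It is not the pipeline's code: the function name `markdown_table_to_csv` is hypothetical, the sample table is invented for illustration, and only simple pipe-delimited tables are handled.

```python
import csv
import io

def markdown_table_to_csv(md: str) -> str:
    """Convert a simple pipe-delimited markdown table to CSV text."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore captions and prose around the table
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the header separator row (e.g. |---|:--:|).
        if all(c and set(c) <= set("-: ") for c in cells):
            continue
        rows.append(cells)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

# Invented sample data, for illustration only.
sample_md = """
| station | depth_m |
|---------|--------:|
| A1      | 10      |
| B2      | 25      |
"""
print(markdown_table_to_csv(sample_md))
```

The function reads one card and nothing else, so the same input always yields the same output and cards can be converted in any order or in parallel.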

Cycle 30 · important · table-normalized

Tested Evidence-Preserving Table Normalization Layer (refined v2) — 5-layer deterministic OCR-confusable repair with per-cell provenance side artifact.

Learned Shipped the production infrastructure pattern that scales: deterministic per-cell repair + table.normalized.jsonl side artifact carrying raw_cell, normalized_cell, confidence, and reason chain. This is exactly the shape a 100K-card archive needs — stateless, auditable, no LLM, no hallucination risk. Net 0 on the 8-model panel was the headline, but the lasting value is the substrate, not the cell-count delta.
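The shape of the side artifact can be sketched as one JSON record per cell. The field names `raw_cell`, `normalized_cell`, `confidence`, and `reason` come from the description above; everything else is an assumption for illustration, including the single confusable rule (the real layer has five), the confidence value, and the reason string.

```python
import json
import re

# Hypothetical confusable map; the real layer's rules are not given in the text.
CONFUSABLES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})
NUMERIC_LIKE = re.compile(r"^[\dOolIS.,\-]+$")

def normalize_cell(raw: str) -> dict:
    """Return a provenance record for one cell; unchanged cells pass through."""
    record = {"raw_cell": raw, "normalized_cell": raw,
              "confidence": 1.0, "reason": []}
    # Only touch cells that contain a digit and nothing but digit-like characters.
    if NUMERIC_LIKE.match(raw) and any(ch.isdigit() for ch in raw):
        fixed = raw.translate(CONFUSABLES)
        if fixed != raw:
            record["normalized_cell"] = fixed
            record["confidence"] = 0.9  # illustrative value, not a calibrated score
            record["reason"].append("ocr_confusable_in_numeric_cell")
    return record

def to_jsonl(cells) -> str:
    """Serialize one record per cell, one JSON object per line."""
    return "\n".join(json.dumps(normalize_cell(c)) for c in cells)

print(to_jsonl(["1O5", "Oslo", "2,918"]))
```

Because the raw value is kept next to the repaired one, any downstream consumer can audit or revert a repair without rerunning the pipeline.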

Cycle 31 · important · doc-index

Tested M3-IDX two-shot retrieval mode — model picks a table from the doc-index, then answers from that card.

Learned Quantified the oracle penalty: every open model loses 2-6 cells without the hand-picked card. Top models (Qwen-7B, Granite-8B) drop the most (-6 each) — the cycle 17 breakthrough was conditional on perfect retrieval. This is the cycle that says: at scale, M3-L4 numbers don't predict deployment behavior; retrieval is THE problem to solve.
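The two-shot flow can be sketched as two model calls with a lookup in between. This is an assumption-laden outline, not the harness: `answer_two_shot`, the prompt wording, the `NO_CARD` sentinel, and the scripted `fake_model` are all hypothetical, and the sample index and cards are invented.

```python
from typing import Callable, Dict

def answer_two_shot(question: str,
                    doc_index: Dict[str, str],
                    cards: Dict[str, str],
                    ask: Callable[[str], str]) -> str:
    """Shot 1: the model picks a table id from the index. Shot 2: it answers from that card."""
    toc = "\n".join(f"{tid}: {caption}" for tid, caption in doc_index.items())
    picked = ask(f"Tables:\n{toc}\n\nQuestion: {question}\nReply with one table id.").strip()
    if picked not in cards:
        # A wrong or malformed pick is the retrieval failure that the oracle mode hides.
        return "NO_CARD"
    return ask(f"Card:\n{cards[picked]}\n\nQuestion: {question}")

def fake_model(prompt: str) -> str:
    # Stand-in for an LLM call so the sketch runs offline.
    return "table_b" if prompt.startswith("Tables:") else "25"

# Invented sample data, for illustration only.
doc_index = {"table_a": "Station list", "table_b": "Depth by station"}
cards = {"table_a": "station\nA1\nB2\n",
         "table_b": "station,depth_m\nA1,10\nB2,25\n"}

print(answer_two_shot("What is the depth at B2?", doc_index, cards, fake_model))  # 25
```

In the oracle mode (M3-L4) the first call is skipped because the card is hand-picked, which is why its scores form an upper bound on what this flow can reach.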

Cycle 33 · important · pipeline-v0.6

Tested M3-HYDE — HyDE-style vector retrieval (small independent generator + nomic-embed-text 768-dim embeddings + cosine search over the pre-built card index).

Learned The architecture that scales: cosine search over precomputed embeddings stays cheap as the index grows (sub-linear once an approximate nearest-neighbor index is used), the 407-card index extends naturally to 100K+ cards on commodity hardware, and retrieval no longer depends on the eval model's TOC-reading ability. In-corpus results are mixed: HyDE wins on 4 of 7 open models, including the three weakest (Llama, Apertus, ClimateGPT), and achieves perfect retrieval (8/8) on two previously-0% questions, but loses on negative-control queries, where IDX still wins. Hybrid retrieval is the natural next experiment.
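A minimal sketch of the HyDE retrieval step follows: generate a hypothetical answer, embed it, and rank cards by cosine similarity. The bag-of-words `embed` is a toy stand-in for a real embedding model such as the nomic-embed-text model named above, `generate` stands in for the small independent generator, and all names and sample cards are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a real 768-dim embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def hyde_retrieve(question: str, index: Dict[str, Counter],
                  generate: Callable[[str], str], k: int = 1) -> List[str]:
    """Embed a generated hypothetical answer instead of the raw question, then rank cards."""
    query_vec = embed(generate(question))
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]

# Invented sample cards; the index is built once, ahead of query time.
card_texts = {"table_a": "station list with names",
              "table_b": "depth in metres by station"}
index = {cid: embed(text) for cid, text in card_texts.items()}

def fake_generator(question: str) -> str:
    # Stand-in for the small generator model.
    return "the depth at station B2 is 25 metres"

print(hyde_retrieve("How deep is B2?", index, fake_generator))  # ['table_b']
```

This version scans every card, which is linear in index size; at 100K+ cards the `sorted` call would typically be replaced by an approximate nearest-neighbor lookup.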

Supporting cycles (27)

Cycle 2

Tested First full benchmark run: 3 questions × Claude Code Agent × 4 input modes (M3-L4, M2c, M1c, M3-AC).

Learned Surfaced the pipeline mis-label on Q-NAT-012 — table_124's caption says 'cyprid' but the data is nauplii; the real cyprid table is table_125. Ground truth corrected to 2,918; this mis-label still costs the open tier in cycle 31.

Cycle 2.1

Tested Expanded panel run — 5 questions × 15 models.

Learned Established the frontier vs open-tier gradient on full v0.6 cards: frontier ~80%, open mid-tier ~50%, climategpt ~15%.

Cycle 3

Tested Cycle 2 rerun with prompt-prefix variations.

Learned Prompt prefixes only shift 1-2 cells in either direction; not a meaningful lever at the 13-question scale.

Cycle 4 · fix-toc-misclass-and-decimal-ocr

Tested Cycle 3 retest with TOC-misclass fix and decimal-OCR repair applied — introduced the table_096_corrected and table_125_corrected overrides.

Learned Corrected-card overrides lift M3-L4 results materially and are still in active use 30 cycles later.

Cycle 5 · fix-toc-misclass-and-decimal-ocr

Tested First full 13-question active set across V27, V35, NOAA-32079 on a 3-model panel spanning the frontier and open tiers.

Learned grok-4 hits 12/13 (the closed-reference ceiling); apertus-8b at 6/13, climategpt-13b at 5/13. Defined the frontier-vs-open gradient that's anchored every cycle since.

Cycle 6 · fix-toc-misclass-and-decimal-ocr

Tested Single-model grok-4 sweep across the 13-question active set to nail down the frontier ceiling.

Learned 12/13 with one cell (Q-NOAA-CALC-001) as the multi-page-table gap — the same gap that's still open in cycle 33.

Cycle 8 · fix-toc-misclass-and-decimal-ocr

Tested Multi-mode comparison expanded — M2c (docling MD whole-document) and M2a (raw docling JSON, ~30 MB).

Learned M2a overflows context for every model including frontier; M2c works only for documents that fit. Established the negative-control floor and the linearized-text baseline.

Cycle 9 · fix-toc-misclass-and-decimal-ocr

Tested A/B prompting experiment with an extra question prefix inserted between the artifact and the question.

Learned Prefix changes shifted 1-2 cells in either direction; not a reliable signal at 13 questions; prefix dropped from harness.

Cycle 10 · fix-toc-misclass-and-decimal-ocr

Tested Multi-model sweep on the rescored harness after NEG-pass regex expansion.

Learned Rescored NEG regex correctly catches refusal phrasing across all six refusal-idiom families; no operational pass-rate deltas.

Cycle 11 · fix-toc-misclass-and-decimal-ocr

Tested Targeted gemini sweep after dedup_grouped_claims() shipped — NOAA table_033 shrank 120 KB → 28 KB.

Learned Dedup successfully shrinks oversized cards without information loss; gemini handles deduped cards identically to un-deduped.

Cycle 12 · fix-toc-misclass-and-decimal-ocr

Tested Two-model targeted re-runs: Q-NAT-012 climategpt-13b + Q-NOAA-LOOKUP-001 apertus-8b after override resolver fix.

Learned Corrected-card overrides are now correctly applied to local models too; dedup restored the retrieval crutch apertus-8b needed.

Cycle 14 · micro-1k

Tested pipeline-v0.7-micro-1k card variant — stripped down to ~1.2 KB to fit ClimateGPT's 4K context.

Learned Open-tier lifts 27% → 40% just by removing frontmatter and context envelope; format size matters as much as content quality.

Cycle 15 · compact-2k

Tested pipeline-v0.7-compact-2k variant — micro-1k plus one nearby paragraph.

Learned 49% open-tier — adding one paragraph recovers some interpolation queries that micro-1k loses.

Cycle 16 · table-only

Tested pipeline-v0.7-table-only variant — bare table + caption, no metadata.

Learned 52% open-tier — for cell-lookup questions, the table and a one-line caption are nearly sufficient.

Cycle 18 · labeled

Tested pipeline-v0.7-labeled variant — explicit per-section provenance labels.

Learned 49% open-tier — labels don't help, sometimes hurt because they bloat the card and dilute the signal.

Cycle 19 · csv-plus-headings

Tested pipeline-v0.7-csv-plus-headings hybrid variant.

Learned Marginal lift vs csv-only; section headings alone don't recover the interpolation-query gap.

Cycle 20 · csv-plus-scope

Tested pipeline-v0.7-csv-plus-scope hybrid variant (years, geography injected).

Learned Scope tokens help NEG-control questions slightly; no headline shift.

Cycle 21 · csv-plus-paragraph

Tested pipeline-v0.7-csv-plus-paragraph hybrid variant.

Learned Adding 1 nearby paragraph mostly matches csv-only — context doesn't generally help and sometimes regresses smaller models.

Cycle 22 · csv-plus-all-context

Tested pipeline-v0.7-csv-plus-all-context hybrid variant (everything combined).

Learned ~52% open-tier — combining hybrid context inputs doesn't beat plain csv-only; minimum-viable wins.

Cycle 23 · csv-only

Tested Minor variant-comparison rerun on csv-only with a 3-model subset for spot-checks.

Learned No headline shift; results stable.

Cycle 24 · csv-only

Tested Minor variant-comparison rerun on csv-only with a 2-model subset.

Learned No headline shift; results stable.

Cycle 25 · csv-only

Tested Minor variant-comparison rerun on csv-only with a 3-model subset.

Learned No headline shift; results stable.

Cycle 26 · stitched

Tested pipeline-v0.7-stitched variant — multi-page tables recombined inline via the extends_to_page_bottom signal.

Learned Closes Q-NOAA-CALC-001 (cruise station sum) for the closed-reference baseline; open-tier still bottlenecks on the 12-cell arithmetic — the gap is now model-side, not pipeline-side.
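Stitching on the `extends_to_page_bottom` signal might look like the sketch below. Only that flag name comes from the text; the fragment layout (`page`, `header`, `rows`), the "same header on the next page" condition, and the function name are assumptions, and the sample fragments are invented.

```python
from typing import Dict, List

def stitch_tables(fragments: List[Dict]) -> List[Dict]:
    """Merge a fragment into its predecessor when that one ran to the page bottom."""
    stitched: List[Dict] = []
    for frag in fragments:
        prev = stitched[-1] if stitched else None
        continues = (
            prev is not None
            and prev["extends_to_page_bottom"]
            and frag["page"] == prev["page"] + 1
            and frag["header"] == prev["header"]  # assumed safety check
        )
        if continues:
            prev["rows"].extend(frag["rows"])
            # Track the latest page and flag so three-page tables keep chaining.
            prev["page"] = frag["page"]
            prev["extends_to_page_bottom"] = frag["extends_to_page_bottom"]
        else:
            stitched.append({**frag, "rows": list(frag["rows"])})
    return stitched

# Invented fragments, in reading order.
fragments = [
    {"page": 4, "header": ["station", "depth_m"], "rows": [["A1", "10"]],
     "extends_to_page_bottom": True},
    {"page": 5, "header": ["station", "depth_m"], "rows": [["B2", "25"]],
     "extends_to_page_bottom": False},
    {"page": 6, "header": ["month", "count"], "rows": [["April", "7"]],
     "extends_to_page_bottom": False},
]
print(len(stitch_tables(fragments)))  # 2
```

Requiring a matching header keeps unrelated tables on consecutive pages apart, at the cost of missing continuations whose header is not repeated.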

Cycle 27 · csv-normalized

Tested pipeline-v0.7-csv-normalized variant — grok-4 LLM-based CSV header normalization.

Learned DISCARDED. grok-4 hallucinated duplicate column names; Granite-3.3-8B regressed 11/13 → 4/13. Motivated cycle 28's deterministic-rules version and the 'no LLM in the normalization layer' contract for cycle 30.

Cycle 28 · csv-demerged

Tested pipeline-v0.7-csv-demerged variant — deterministic row-de-merge for fused-row CSVs (e.g. 'April May,1491 35').

Learned Closes Q-NAT-012 cyprid (April 1947 = 2918) for Qwen-2.5-7B, Granite-3.3-8B, Gemma-2 9B, Apertus 8B; even with a conservative trigger, the rule introduces minor regressions on NOAA cruise tables, where it fires inappropriately.
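A deterministic de-merge rule for the fused-row example can be sketched as follows. The trigger used here (every cell splits on whitespace into the same number of tokens, more than one) is an assumption chosen for illustration, not the pipeline's actual rule, and `demerge_row` is a hypothetical name.

```python
from typing import List

def demerge_row(row: List[str]) -> List[List[str]]:
    """Split a fused row into separate rows when every cell holds the same token count."""
    parts = [cell.split() for cell in row]
    n = len(parts[0]) if parts else 0
    if n > 1 and all(len(p) == n for p in parts):
        # Token i of every cell becomes row i.
        return [[p[i] for p in parts] for i in range(n)]
    return [row]

print(demerge_row(["April May", "1491 35"]))  # [['April', '1491'], ['May', '35']]
print(demerge_row(["North Sea", "12"]))       # [['North Sea', '12']]
```

A rule of this kind also shows how misfires arise: a legitimate two-word label next to a two-number cell, such as `["North Sea", "12 14"]`, satisfies the same trigger and would be split.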

Cycle 29 · table-normalized

Tested First version of the Evidence-Preserving Table Normalization Layer.

Learned DISCARDED. Global em-dash normalization and caption rewrites caused 17 regressions vs 6 lifts. Origin of the byte-identity-for-unchanged-tables correctness invariant.
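The byte-identity invariant can be expressed as a small check that a test suite runs over every table. This is a sketch under assumptions: the function name and the `changed_cells` counter are hypothetical, and the real layer may track changes differently.

```python
def check_byte_identity(raw: bytes, normalized: bytes, changed_cells: int) -> bool:
    """Tables with zero recorded cell changes must come out byte-for-byte identical."""
    if changed_cells == 0:
        return raw == normalized
    return True  # tables with recorded repairs are validated elsewhere

# A global rewrite (here, an em-dash swapped for a hyphen) with no recorded repair fails.
original = "station,note\nA1,n/a \u2014 see text\n".encode("utf-8")
rewritten = original.replace("\u2014".encode("utf-8"), b"-")
print(check_byte_identity(original, original, changed_cells=0))   # True
print(check_byte_identity(original, rewritten, changed_cells=0))  # False
```

Tying every output difference to a recorded per-cell change is what rules out silent global rewrites of the kind that caused the regressions in this cycle.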

Cycle 32 · doc-index-enriched

Tested pipeline-v0.7-doc-index-enriched — column headers + sample rows + scope detection per TOC entry.

Learned Important negative result that protects scale: enrichment alone doesn't fix retrieval — 0%-retrieval questions stayed 0%, +2 cells net across 8 models. Combined with the structural problem that M3-IDX TOC navigation breaks at thousands of tables anyway, this cycle says: stop investing in doc-index variants; the path forward is vector retrieval (cycle 33), not TOC refinement.