Models × test cycles heatmap — Erschließung benchmark

13 questions, run against many open, open-weight, and closed-reference models over dozens of test cycles covering a variety of inputs, from PDFs to (primarily Docling-based) derivatives.

The test cycles that scale — what was tested, what was learned

The cycles below are filtered by which findings inform production architecture at the scale of thousands of documents and tens of thousands of tables. Six cycles are shown by default: the per-mode gradient on one open model (cycle 7), the first full-card oracle run (cycle 13), the production card-format finding (cycle 17), the infrastructure-substrate pattern (cycle 30), the project-reshaping oracle-penalty finding (cycle 31), and the scale-ready vector-retrieval architecture (cycle 33). Supporting cycles are collapsed below: early exploration, the v0.7 variant sweep, minor reruns, discarded attempts, and the dead-end signal that saved us from further M3-IDX investment (cycle 32).

Cycle 7· important· fix-toc-misclass-and-decimal-ocr

Tested First apertus-8b sweep across modes (M3-L4 oracle card, M2c Docling MD, M3-AC all-cards bundle) — the per-mode gradient on one open model.

Learned apertus M3-L4 = 5/13, M2c = 3/13, M3-AC = 1/13. The artefact-contribution signal: cards give a meaningful lift over Docling's MD linearization, which itself sits above whole-document JSON. The PDF→JSON→MD→cards→oracle gradient is real and measurable.

Cycle 13· important· fix-toc-misclass-and-decimal-ocr

Tested M3-L4 oracle on the v0.6.1 full cards (heavy markdown + context envelope) — 11 models × 13 questions = 143 cells.

Learned 27% open-tier pass rate. The 'cards' rung of the gradient at its first iteration: cards lift open-tier performance over Docling MD (~M2c) but the heavy v0.6.1 format leaves room. Cycle 17 then shows that compressing to CSV-only nearly doubles this number at the same oracle layer.

Cycle 17· important· csv-only

Tested pipeline-v0.7-csv-only variant — table data as raw CSV instead of markdown table.

Learned Established the production-recommended card format that scales: compact CSV cards (~1.5 KB) let Qwen-2.5-7B and Granite-3.3-8B reach 11/13 at the oracle layer — the format finding survives scale because it's a per-card transform, deterministic and stateless. The 'matches frontier' headline was later reframed as an oracle ceiling by cycle 31, but the format choice itself is the right production substrate.
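
The per-card transform can be sketched as follows. This is a minimal illustration in Python, not the pipeline-v0.7-csv-only code: the function name `markdown_table_to_csv` and the sample table are assumptions made for the example. It shows why the step is deterministic and stateless: each card is converted on its own, with no model call and no shared state.

```python
import csv
import io


def markdown_table_to_csv(md: str) -> str:
    """Convert one pipe-delimited markdown table to CSV text (hypothetical helper)."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Drop the header separator row, e.g. |---|:--:|
        if all(c and set(c) <= set("-: ") for c in cells):
            continue
        rows.append(cells)
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up sample card body, for illustration only.
sample = """
| Station | Depth_m |
|---------|---------|
| A       | 10      |
| B       | 25      |
"""
csv_card = markdown_table_to_csv(sample)
```

The CSV output drops the padding, pipes, and separator row, which is where most of the size reduction in a markdown table comes from.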

Cycle 30· important· table-normalized

Tested Evidence-Preserving Table Normalization Layer (refined v2) — 5-layer deterministic OCR-confusable repair with per-cell provenance side artifact.

Learned Shipped the production infrastructure pattern that scales: deterministic per-cell repair + table.normalized.jsonl side artifact carrying raw_cell, normalized_cell, confidence, and reason chain. This is exactly the shape a 100K-card archive needs — stateless, auditable, no LLM, no hallucination risk. Net 0 on the 8-model panel was the headline, but the lasting value is the substrate, not the cell-count delta.
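
To make the side-artifact shape concrete, here is a minimal sketch of one deterministic repair rule writing JSONL records. Only the field ideas (raw cell, normalized cell, confidence, reason chain) come from this report; the key name `reason_chain`, the confusable map, the single-rule trigger, and the confidence values are placeholders, and the real layer has five layers rather than one.

```python
import json
from typing import Dict, List

# Hypothetical map of letters that OCR may emit in place of digits.
CONFUSABLES: Dict[str, str] = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}


def normalize_cell(raw: str) -> dict:
    """Repair a cell only when it is clearly numeric; always keep the raw value."""
    digits = sum(ch.isdigit() for ch in raw)
    suspects = [ch for ch in raw if ch in CONFUSABLES]
    other_letters = [ch for ch in raw if ch.isalpha() and ch not in CONFUSABLES]
    normalized, reasons = raw, []
    # Conservative rule: digits must outnumber suspect letters, no other letters allowed.
    if suspects and digits > len(suspects) and not other_letters:
        normalized = "".join(CONFUSABLES.get(ch, ch) for ch in raw)
        reasons = [f"{ch}->{CONFUSABLES[ch]}" for ch in suspects]
    return {
        "raw_cell": raw,
        "normalized_cell": normalized,
        "confidence": 0.9 if reasons else 1.0,  # placeholder values
        "reason_chain": reasons,
    }


def to_jsonl(cells: List[str]) -> str:
    """Serialize one provenance record per cell, one JSON object per line."""
    return "".join(json.dumps(normalize_cell(c)) + "\n" for c in cells)
```

Because every record carries the untouched `raw_cell`, a reviewer can audit or revert any repair without re-running OCR.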

Cycle 31· important· doc-index

Tested M3-IDX two-shot retrieval mode — model picks a table from the doc-index, then answers from that card.

Learned Quantified the oracle penalty: every open model loses 2-6 cells without the hand-picked card. Top models (Qwen-7B, Granite-8B) drop the most (-6 each) — the cycle 17 breakthrough was conditional on perfect retrieval. This is the cycle that says: at scale, M3-L4 numbers don't predict deployment behavior; retrieval is THE problem to solve.
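
The penalty arithmetic for the two top open models can be reproduced from the heatmap rows. The sketch below assumes the baseline is the cycle 17 oracle score (csv-only cards), which is the reading that reproduces the -6 figure quoted above; the function name is illustrative.

```python
from typing import Dict

# Passed cells out of 13, read from the heatmap rows for cycles 17 and 31.
oracle_cycle17: Dict[str, int] = {"qwen2.5:7b": 11, "granite3.3:8b": 11}
retrieval_cycle31: Dict[str, int] = {"qwen2.5:7b": 5, "granite3.3:8b": 5}


def oracle_penalty(oracle: Dict[str, int], retrieved: Dict[str, int]) -> Dict[str, int]:
    """Cells lost (negative) when the model must pick its own card."""
    return {model: retrieved[model] - oracle[model] for model in oracle}


penalty = oracle_penalty(oracle_cycle17, retrieval_cycle31)
```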

Cycle 33· important· pipeline-v0.6

Tested M3-HYDE — HyDE-style vector retrieval (small independent generator + nomic-embed-text 768-dim embeddings + cosine search over the pre-built card index).

Learned The architecture that scales: vector search becomes sub-linear once the cards sit behind an approximate nearest-neighbor index, so the 407-card index extends naturally to 100K+ cards on commodity hardware, and the retrieval no longer depends on the eval model's TOC-reading ability. In-corpus results are mixed: HyDE wins on 4 of 7 open models including the three weakest (Llama, Apertus, ClimateGPT) and achieves perfect retrieval (8/8) on two previously-0% questions, but loses on negative-control queries where IDX still wins. Hybrid retrieval is the natural next experiment.
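
The HyDE flow (generate a hypothetical answer passage, embed it, rank cards by cosine similarity) can be sketched in a few lines. Everything concrete here is a stand-in: `toy_embed` replaces the nomic-embed-text embeddings with sparse token counts, `fake_generator` replaces the small generator model, the two-card index is invented, and the ranking is a plain linear scan rather than an indexed search.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def toy_embed(text: str) -> Counter:
    """Stand-in embedder: sparse token counts instead of a dense 768-dim vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(question: str, generate: Callable[[str], str],
                  index: Dict[str, Counter], k: int = 1) -> List[str]:
    """Embed a generated hypothetical passage, not the question, then rank cards."""
    query_vec = toy_embed(generate(question))
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]


def fake_generator(question: str) -> str:
    """Canned text standing in for the small independent generator."""
    return "cyprid counts for april were recorded per station"


# Invented two-card index, for illustration only.
index = {
    "card_a": toy_embed("monthly cyprid counts per station"),
    "card_b": toy_embed("cruise temperature and salinity by depth"),
}
top = hyde_retrieve("How many cyprids in April?", fake_generator, index)
```

The point of the design is visible in `hyde_retrieve`: the eval model never reads a table of contents, so retrieval quality depends only on the generator and the embedding space.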

Supporting cycles (27)

Cycle 2

Tested First full benchmark run: 3 questions × Claude Code Agent × 4 input modes (M3-L4, M2c, M1c, M3-AC).

Learned Surfaced the pipeline mis-label on Q-NAT-012 — table_124's caption says 'cyprid' but the data is nauplii; the real cyprid table is table_125. Ground truth corrected to 2,918; this mis-label still costs the open tier in cycle 31.

Cycle 2.1

Tested Expanded panel run — 5 questions × 15 models.

Learned Established the frontier vs open-tier gradient on full v0.6 cards: frontier ~80%, open mid-tier ~50%, climategpt ~15%.

Cycle 3

Tested Cycle 2 rerun with prompt-prefix variations.

Learned Prompt prefixes only shift 1-2 cells in either direction; not a meaningful lever at the 13-question scale.

Cycle 4· fix-toc-misclass-and-decimal-ocr

Tested Cycle 3 retest with TOC-misclass fix and decimal-OCR repair applied — introduced the table_096_corrected and table_125_corrected overrides.

Learned Corrected-card overrides lift M3-L4 results materially and are still in active use 30 cycles later.

Cycle 5· fix-toc-misclass-and-decimal-ocr

Tested First full 13-question active set across V27, V35, NOAA-32079 on 3 frontier-tier models.

Learned grok-4 hits 12/13 (the closed-reference ceiling); apertus-8b at 6/13, climategpt-13b at 5/13. Defined the frontier-vs-open gradient that's anchored every cycle since.

Cycle 6· fix-toc-misclass-and-decimal-ocr

Tested Single-model grok-4 sweep across the 13-question active set to nail down the frontier ceiling.

Learned 12/13 with one cell (Q-NOAA-CALC-001) as the multi-page-table gap — the same gap that's still open in cycle 33.

Cycle 8· fix-toc-misclass-and-decimal-ocr

Tested Multi-mode comparison expanded — M2c (docling MD whole-document) and M2a (raw docling JSON, ~30 MB).

Learned M2a overflows context for every model including frontier; M2c works only for documents that fit. Established the negative-control floor and the linearized-text baseline.

Cycle 9· fix-toc-misclass-and-decimal-ocr

Tested A/B prompting experiment with an extra question prefix inserted between the artifact and the question.

Learned Prefix changes shifted 1-2 cells in either direction; not a reliable signal at 13 questions; prefix dropped from harness.

Cycle 10· fix-toc-misclass-and-decimal-ocr

Tested Multi-model sweep on the rescored harness after NEG-pass regex expansion.

Learned Rescored NEG regex correctly catches refusal phrasing across all six refusal-idiom families; no operational pass-rate deltas.

Cycle 11· fix-toc-misclass-and-decimal-ocr

Tested Targeted gemini sweep after dedup_grouped_claims() shipped — NOAA table_033 shrank 120 KB → 28 KB.

Learned Dedup successfully shrinks oversized cards without information loss; gemini handles deduped cards identically to un-deduped.

Cycle 12· fix-toc-misclass-and-decimal-ocr

Tested Two-model targeted re-runs: Q-NAT-012 climategpt-13b + Q-NOAA-LOOKUP-001 apertus-8b after override resolver fix.

Learned Corrected-card overrides are now correctly applied to local models too; dedup restored the retrieval crutch apertus-8b needed.

Cycle 14· micro-1k

Tested pipeline-v0.7-micro-1k card variant — stripped down to ~1.2 KB to fit ClimateGPT's 4K context.

Learned Open-tier lifts 27% → 40% just by removing frontmatter and context envelope; format size matters as much as content quality.

Cycle 15· compact-2k

Tested pipeline-v0.7-compact-2k variant — micro-1k plus one nearby paragraph.

Learned 49% open-tier — adding one paragraph recovers some interpolation queries that micro-1k loses.

Cycle 16· table-only

Tested pipeline-v0.7-table-only variant — bare table + caption, no metadata.

Learned 52% open-tier — for cell-lookup questions, the table and a one-line caption are nearly sufficient.

Cycle 18· labeled

Tested pipeline-v0.7-labeled variant — explicit per-section provenance labels.

Learned 49% open-tier — labels don't help, sometimes hurt because they bloat the card and dilute the signal.

Cycle 19· csv-plus-headings

Tested pipeline-v0.7-csv-plus-headings hybrid variant.

Learned Marginal lift vs csv-only; section headings alone don't recover the interpolation-query gap.

Cycle 20· csv-plus-scope

Tested pipeline-v0.7-csv-plus-scope hybrid variant (years, geography injected).

Learned Scope tokens help NEG-control questions slightly; no headline shift.

Cycle 21· csv-plus-paragraph

Tested pipeline-v0.7-csv-plus-paragraph hybrid variant.

Learned Adding 1 nearby paragraph mostly matches csv-only — context doesn't generally help and sometimes regresses smaller models.

Cycle 22· csv-plus-all-context

Tested pipeline-v0.7-csv-plus-all-context hybrid variant (everything combined).

Learned ~52% open-tier — combining hybrid context inputs doesn't beat plain csv-only; minimum-viable wins.

Cycle 23· csv-only

Tested Minor variant-comparison rerun on csv-only with a 3-model subset for spot-checks.

Learned No headline shift; results stable.

Cycle 24· csv-only

Tested Minor variant-comparison rerun on csv-only with a 2-model subset.

Learned No headline shift; results stable.

Cycle 25· csv-only

Tested Minor variant-comparison rerun on csv-only with a 3-model subset.

Learned No headline shift; results stable.

Cycle 26· stitched

Tested pipeline-v0.7-stitched variant — multi-page tables recombined inline via the extends_to_page_bottom signal.

Learned Closes Q-NOAA-CALC-001 (cruise station sum) for the closed-reference baseline; open-tier still bottlenecks on the 12-cell arithmetic — the gap is now model-side, not pipeline-side.
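
A minimal sketch of the stitching idea, assuming each table fragment records its page number and the extends_to_page_bottom flag. The `Fragment` class, the adjacency rule (next fragment on the following page), and the repeated-header handling are assumptions for illustration, not the pipeline's actual logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fragment:
    page: int
    rows: List[List[str]]
    extends_to_page_bottom: bool = False


def stitch(fragments: List[Fragment]) -> List[List[List[str]]]:
    """Merge a fragment into the previous table when that table ran to the page bottom."""
    tables: List[List[List[str]]] = []
    prev: Optional[Fragment] = None
    for frag in sorted(fragments, key=lambda f: f.page):
        continues = (
            prev is not None
            and prev.extends_to_page_bottom
            and frag.page == prev.page + 1
        )
        if continues:
            rows = frag.rows
            # Drop a header row repeated at the top of the continuation page.
            if rows and rows[0] == tables[-1][0]:
                rows = rows[1:]
            tables[-1].extend(rows)
        else:
            tables.append(list(frag.rows))
        prev = frag
    return tables
```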

Cycle 27· csv-normalized

Tested pipeline-v0.7-csv-normalized variant — grok-4 LLM-based CSV header normalization.

Learned DISCARDED. grok-4 hallucinated duplicate column names; Granite-3.3-8B regressed 11/13 → 4/13. Motivated cycle 28's deterministic-rules version and the 'no LLM in the normalization layer' contract for cycle 30.

Cycle 28· csv-demerged

Tested pipeline-v0.7-csv-demerged variant — deterministic row-de-merge for fused-row CSVs (e.g. 'April May,1491 35').

Learned Closes Q-NAT-012 cyprid (April 1947 = 2918) for Qwen-2.5-7B, Granite-3.3-8B, Gemma-2 9B, Apertus 8B; conservative trigger introduces minor regressions on NOAA cruise tables where the same rule fires inappropriately.
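
The fused-row case quoted above ('April May,1491 35') can be split with a simple deterministic rule. This is a sketch under an assumed trigger (every cell in the row splits on whitespace into the same number of tokens, at least two); the actual cycle 28 trigger is not documented here.

```python
from typing import List


def demerge_row(cells: List[str]) -> List[List[str]]:
    """Split a fused CSV row into n rows when every cell holds exactly n tokens."""
    parts = [cell.split() for cell in cells]
    n = len(parts[0]) if parts else 0
    if n < 2 or any(len(p) != n for p in parts):
        return [cells]  # leave the row untouched
    return [[p[i] for p in parts] for i in range(n)]


fused = "April May,1491 35".split(",")
rows = demerge_row(fused)
```

A rule this simple also shows how regressions can arise: a row such as `["North Sea", "12 14"]` satisfies the trigger even though "North Sea" is a single label, which may be the kind of inappropriate firing seen on the NOAA cruise tables.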

Cycle 29· table-normalized

Tested First version of the Evidence-Preserving Table Normalization Layer.

Learned DISCARDED. Global em-dash normalization and caption rewrites caused 17 regressions vs 6 lifts. Origin of the byte-identity-for-unchanged-tables correctness invariant.
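
The byte-identity invariant can be expressed as a check that runs after normalization. The sketch below assumes tables are available as raw bytes keyed by table ID and that the normalizer reports which IDs it repaired; both the data layout and the function name are hypothetical.

```python
from typing import Dict, List, Set


def byte_identity_violations(raw: Dict[str, bytes],
                             normalized: Dict[str, bytes],
                             repaired_ids: Set[str]) -> List[str]:
    """Return IDs of tables whose bytes changed although no repair was recorded."""
    violations = []
    for table_id, raw_bytes in raw.items():
        if table_id in repaired_ids:
            continue  # a recorded repair is allowed to change the bytes
        if normalized.get(table_id) != raw_bytes:
            violations.append(table_id)
    return violations


# Invented example: t2 had a dash character rewritten without a recorded repair.
raw = {"t1": b"a,b\n", "t2": b"x,\xe2\x80\x94\n"}
normalized = {"t1": b"a,b\n", "t2": b"x,-\n"}
violations = byte_identity_violations(raw, normalized, repaired_ids=set())
```

Run as a gate, a non-empty result would block the release of a normalization layer that rewrites tables globally, which is the failure mode this cycle hit.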

Cycle 32· doc-index-enriched

Tested pipeline-v0.7-doc-index-enriched — column headers + sample rows + scope detection per TOC entry.

Learned Important negative result that protects scale: enrichment alone doesn't fix retrieval — 0%-retrieval questions stayed 0%, +2 cells net across 8 models. Combined with the structural problem that M3-IDX TOC navigation breaks at thousands of tables anyway, this cycle says: stop investing in doc-index variants; the path forward is vector retrieval (cycle 33), not TOC refinement.

Model	Type	cyc 2 (?, 0 q)	cyc 2.1 (?, 0 q)	cyc 3 (?, 0 q)	cyc 4 (fix-toc-misclass-and-decimal-ocr, 5 q)	cyc 5 (fix-toc-misclass-and-decimal-ocr, 13 q)	cyc 6 (fix-toc-misclass-and-decimal-ocr, 13 q)	cyc 7 (fix-toc-misclass-and-decimal-ocr, 13 q)	cyc 8 (fix-toc-misclass-and-decimal-ocr, 13 q)	cyc 9 (fix-toc-misclass-and-decimal-ocr, 13 q)	cyc 10 (fix-toc-misclass-and-decimal-ocr, 13 q)	cyc 11 (fix-toc-misclass-and-decimal-ocr, 13 q)	cyc 12 (fix-toc-misclass-and-decimal-ocr, 13 q)	cyc 13 (fix-toc-misclass-and-decimal-ocr, 13 q)	cyc 14 (micro-1k, 13 q)	cyc 15 (compact-2k, 13 q)	cyc 16 (table-only, 13 q)	cyc 17 (csv-only, 13 q)	cyc 18 (labeled, 13 q)	cyc 19 (csv-plus-headings, 13 q)	cyc 20 (csv-plus-scope, 13 q)	cyc 21 (csv-plus-paragraph, 13 q)	cyc 22 (csv-plus-all-context, 13 q)	cyc 23 (csv-only, 13 q)	cyc 24 (csv-only, 13 q)	cyc 25 (csv-only, 13 q)	cyc 26 (stitched, 13 q)	cyc 27 (csv-normalized, 13 q)	cyc 28 (csv-demerged, 13 q)	cyc 29 (table-normalized, 13 q)	cyc 30 (table-normalized, 13 q)	cyc 31 (doc-index, 13 q)	cyc 32 (doc-index-enriched, 13 q)	cyc 33 (pipeline-v0.6, 13 q)	cyc 34 (csv-only, 2 q)	All test cycles
EuroLLM-9B-Instruct	Open model	—	—	—	—	—	—	38% (5/13)	—	—	—	—	—	19% (5/26)	54% (7/13)	54% (7/13)	62% (8/13)	54% (7/13)	54% (7/13)	62% (8/13)	54% (7/13)	38% (5/13)	54% (7/13)	—	—	—	—	—	—	—	46% (6/13)	8% (1/13)	—	31% (4/13)	0% (0/1)	43% (84/196)
granite3.3:8b	Open model	—	—	—	—	—	—	77% (10/13)	—	—	—	—	—	77% (10/13)	21% (5/24)	54% (7/13)	62% (8/13)	85% (11/13)	46% (6/13)	69% (9/13)	38% (5/13)	62% (8/13)	62% (8/13)	—	—	—	69% (9/13)	31% (4/13)	69% (9/13)	62% (8/13)	77% (10/13)	38% (5/13)	31% (4/13)	46% (6/13)	0% (0/1)	55% (142/259)
olmo2:7b	Open model	—	100% (2/2)	—	—	—	—	31% (4/13)	—	—	—	—	—	31% (4/13)	31% (4/13)	46% (6/13)	31% (4/13)	31% (4/13)	46% (6/13)	—	—	—	—	—	—	—	—	—	—	—	31% (4/13)	23% (3/13)	—	23% (3/13)	0% (0/1)	33% (44/133)
swiss-ai/Apertus-70B-Instruct-2509	Open model	—	40% (2/5)	30% (6/20)	—	—	—	46% (6/13)	—	—	—	—	—	28% (8/29)	—	—	—	54% (7/13)	—	—	—	—	—	—	—	35% (9/26)	—	—	—	—	54% (7/13)	23% (3/13)	—	38% (5/13)	0% (0/1)	36% (53/146)
swiss-ai/Apertus-8B-Instruct-2509	Open model	—	43% (3/7)	36% (5/14)	40% (2/5)	23% (3/13)	—	26% (11/42)	—	46% (6/13)	—	—	46% (6/13)	46% (6/13)	38% (5/13)	46% (6/13)	46% (6/13)	46% (6/13)	54% (7/13)	54% (7/13)	54% (7/13)	46% (6/13)	46% (6/13)	—	—	—	38% (5/13)	54% (7/13)	46% (6/13)	46% (6/13)	54% (7/13)	15% (2/13)	23% (3/13)	38% (5/13)	0% (0/2)	41% (139/343)
utter-project/EuroLLM-22B-Instruct-2512	Open model	—	22% (2/9)	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	22% (2/9)
climategpt-13b	Open-weight	—	20% (1/5)	20% (1/5)	20% (1/5)	15% (2/13)	—	15% (2/13)	—	—	—	—	—	15% (2/13)	15% (2/13)	8% (1/13)	23% (3/13)	38% (5/13)	31% (4/13)	31% (4/13)	31% (4/13)	31% (4/13)	31% (4/13)	—	—	—	31% (4/13)	38% (5/13)	23% (3/13)	15% (2/13)	23% (3/13)	0% (0/13)	0% (0/13)	8% (1/13)	0% (0/2)	21% (58/277)
climategpt-70b	Open-weight	—	20% (1/5)	20% (1/5)	—	—	—	15% (2/13)	—	—	—	—	—	15% (2/13)	—	—	—	23% (3/13)	—	—	—	—	—	—	—	23% (3/13)	—	—	—	—	15% (2/13)	8% (1/13)	—	15% (2/13)	100% (1/1)	18% (18/102)
climategpt-7b	Open-weight	—	50% (1/2)	20% (1/5)	—	—	—	23% (3/13)	—	—	—	—	—	23% (3/13)	15% (2/13)	15% (2/13)	8% (1/13)	23% (3/13)	15% (2/13)	—	—	—	—	—	—	—	—	—	—	—	23% (3/13)	0% (0/13)	—	15% (2/13)	0% (0/2)	17% (23/139)
deepseek-r1:8b	Open-weight	—	—	—	—	—	—	54% (7/13)	—	—	—	—	—	54% (7/13)	—	—	—	62% (8/13)	—	—	—	—	—	—	62% (8/13)	—	—	—	—	—	77% (10/13)	38% (5/13)	—	23% (3/13)	0% (0/1)	52% (48/92)
Gemma-SEA-LION-v4-27B-IT	Open-weight	—	40% (2/5)	—	—	—	—	62% (8/13)	—	—	—	—	—	31% (8/26)	—	—	—	62% (8/13)	—	—	—	—	—	—	—	—	—	—	—	—	62% (8/13)	38% (5/13)	—	38% (5/13)	0% (0/1)	45% (44/97)
gemma2:9b	Open-weight	—	—	—	—	—	—	38% (5/13)	—	—	—	—	—	38% (5/13)	33% (8/24)	62% (8/13)	62% (8/13)	54% (7/13)	54% (7/13)	38% (5/13)	69% (9/13)	54% (7/13)	54% (7/13)	—	—	—	—	62% (8/13)	54% (7/13)	54% (7/13)	62% (8/13)	31% (4/13)	31% (4/13)	23% (3/13)	0% (0/1)	48% (117/246)
llama3:latest	Open-weight	—	100% (2/2)	—	—	—	—	62% (8/13)	—	—	—	—	—	62% (8/13)	62% (8/13)	62% (8/13)	69% (9/13)	69% (9/13)	54% (7/13)	69% (9/13)	54% (7/13)	77% (10/13)	69% (9/13)	—	—	—	62% (8/13)	69% (9/13)	62% (8/13)	62% (8/13)	77% (10/13)	23% (3/13)	23% (3/13)	46% (6/13)	0% (0/1)	60% (149/250)
mistral:latest	Open-weight	—	100% (2/2)	—	—	—	—	46% (6/13)	—	—	—	—	—	46% (6/13)	69% (9/13)	69% (9/13)	69% (9/13)	46% (6/13)	54% (7/13)	54% (7/13)	62% (8/13)	54% (7/13)	62% (8/13)	—	—	—	—	62% (8/13)	31% (4/13)	38% (5/13)	54% (7/13)	31% (4/13)	38% (5/13)	23% (3/13)	0% (0/1)	51% (120/237)
phi3:mini	Open-weight	—	—	—	—	—	—	23% (3/13)	—	—	—	—	—	23% (3/13)	38% (5/13)	31% (4/13)	62% (8/13)	31% (4/13)	46% (6/13)	38% (5/13)	38% (5/13)	31% (4/13)	46% (6/13)	—	—	—	—	—	—	—	23% (3/13)	8% (1/13)	—	15% (2/13)	0% (0/1)	32% (59/183)
qwen2.5-coder:7b	Open-weight	—	—	—	—	—	—	46% (6/13)	—	—	—	—	—	46% (6/13)	—	—	—	77% (10/13)	—	—	—	—	—	—	77% (10/13)	—	62% (8/13)	69% (9/13)	—	—	62% (8/13)	31% (4/13)	—	23% (3/13)	0% (0/1)	54% (64/118)
qwen2.5:3b	Open-weight	—	—	—	—	—	—	46% (6/13)	—	—	—	—	—	46% (6/13)	54% (7/13)	46% (6/13)	46% (6/13)	46% (6/13)	38% (5/13)	—	—	—	—	—	—	—	—	—	—	—	46% (6/13)	23% (3/13)	—	15% (2/13)	0% (0/1)	40% (53/131)
qwen2.5:7b	Open-weight	—	—	—	—	—	—	62% (8/13)	—	—	—	—	—	62% (8/13)	25% (6/24)	62% (8/13)	54% (7/13)	85% (11/13)	54% (7/13)	69% (9/13)	77% (10/13)	77% (10/13)	77% (10/13)	—	—	—	69% (9/13)	69% (9/13)	77% (10/13)	69% (9/13)	77% (10/13)	38% (5/13)	46% (6/13)	31% (4/13)	0% (0/1)	60% (156/259)
claude-code-agent	Closed reference	100% (10/10)	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	100% (10/10)
gemini-2.5-pro	Closed reference	—	100% (2/2)	—	50% (5/10)	85% (11/13)	—	85% (11/13)	27% (7/26)	—	—	—	—	85% (11/13)	77% (10/13)	53% (8/15)	62% (8/13)	80% (12/15)	59% (10/17)	46% (6/13)	85% (11/13)	77% (10/13)	62% (8/13)	—	—	—	—	—	—	—	85% (11/13)	54% (7/13)	—	46% (6/13)	0% (0/1)	64% (154/242)
gpt-4.1	Closed reference	—	100% (2/2)	—	—	—	—	85% (11/13)	—	—	—	—	—	92% (12/13)	—	—	—	77% (10/13)	—	—	—	—	—	77% (10/13)	—	—	—	—	—	—	85% (11/13)	69% (9/13)	—	38% (5/13)	0% (0/1)	74% (70/94)
gpt-4o-mini	Closed reference	—	100% (2/2)	—	—	—	—	69% (9/13)	—	—	—	—	—	62% (8/13)	—	—	—	62% (8/13)	—	—	—	—	—	69% (9/13)	—	—	—	—	—	—	69% (9/13)	46% (6/13)	—	46% (6/13)	0% (0/1)	61% (57/94)
grok-4	Closed reference	—	100% (2/2)	—	—	—	92% (12/13)	92% (12/13)	—	—	56% (29/52)	85% (11/13)	92% (12/13)	92% (12/13)	69% (9/13)	77% (10/13)	77% (10/13)	92% (12/13)	77% (10/13)	92% (12/13)	77% (10/13)	92% (12/13)	85% (11/13)	—	—	—	77% (10/13)	85% (11/13)	85% (11/13)	85% (11/13)	92% (12/13)	77% (10/13)	77% (10/13)	46% (6/13)	0% (0/1)	78% (267/341)
mistral-large-latest	Closed reference	—	100% (2/2)	—	—	—	—	31% (4/13)	—	—	—	—	—	31% (4/13)	—	—	—	31% (4/13)	—	—	—	—	—	54% (7/13)	—	—	—	—	—	—	31% (4/13)	8% (1/13)	—	8% (1/13)	0% (0/1)	29% (27/94)
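
Each heatmap cell pairs a rounded percentage with a passed/total fraction, and the "All test cycles" column is the sum of the fractions in the row. The sketch below parses a cell and re-derives a row total; it accepts both the fused form `38%5/13` and a spaced form such as `38% (5/13)`. The function names are illustrative, and the sample cells are the EuroLLM-9B-Instruct row.

```python
import re
from typing import Iterable, Optional, Tuple

# Percentage, then passed/total, with optional spacing and parentheses.
CELL = re.compile(r"^\s*(\d+)%\s*\(?(\d+)/(\d+)\)?\s*$")


def parse_cell(cell: str) -> Optional[Tuple[int, int]]:
    """Return (passed, total) for a scored cell, or None for an empty placeholder."""
    match = CELL.match(cell)
    if match is None:
        return None
    return int(match.group(2)), int(match.group(3))


def aggregate(cells: Iterable[str]) -> Tuple[int, int]:
    """Sum passed and total cells across one model's row, skipping empty cycles."""
    passed = total = 0
    for cell in cells:
        parsed = parse_cell(cell)
        if parsed is not None:
            passed += parsed[0]
            total += parsed[1]
    return passed, total


# Scored cells from the EuroLLM-9B-Instruct row (empty cycles omitted).
eurollm_9b = [
    "38%5/13", "19%5/26", "54%7/13", "54%7/13", "62%8/13", "54%7/13", "54%7/13",
    "62%8/13", "54%7/13", "38%5/13", "54%7/13", "46%6/13", "8%1/13", "31%4/13", "0%0/1",
]
passed, total = aggregate(eurollm_9b)  # (84, 196), matching the row's 43% total
```

Note that the totals are cell-weighted, so a cycle with a doubled denominator (for example 5/26) counts twice as much as a 13-question cycle.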
