Cycle 30· important· table-normalized
Tested Evidence-Preserving Table Normalization Layer (refined v2) — 5-layer deterministic OCR-confusable repair with per-cell provenance side artifact.
Learned Shipped the production infrastructure pattern that scales: deterministic per-cell repair + table.normalized.jsonl side artifact carrying raw_cell, normalized_cell, confidence, and reason chain. This is exactly the shape a 100K-card archive needs — stateless, auditable, no LLM, no hallucination risk. Net 0 on the 8-model panel was the headline, but the lasting value is the substrate, not the cell-count delta.
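A minimal sketch of what one repair layer plus its provenance record could look like. Only the four field names raw_cell, normalized_cell, confidence, and reason come from the entry above; the confusable map, the numeric-shape guard, the confidence formula, and the table_id/row/col keys are assumptions for illustration, not the shipped 5-layer rules.

```python
import json

# Hypothetical confusable map: letters OCR often emits where a digit belongs.
CONFUSABLES = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
NUMERIC_PUNCT = set(",.-+ ")


def repair_cell(raw_cell):
    """Return (normalized_cell, confidence, reasons) for one table cell."""
    chars = raw_cell.strip()
    swaps = [c for c in chars if c in CONFUSABLES]
    digits = [c for c in chars if c.isdigit()]
    # Only touch cells that already look numeric: every character must be a
    # digit, a known confusable, or numeric punctuation.
    numeric_shape = all(
        c.isdigit() or c in CONFUSABLES or c in NUMERIC_PUNCT for c in chars
    )
    if not swaps or not numeric_shape or len(digits) <= len(swaps):
        return raw_cell, 1.0, []  # unchanged cell, returned byte-for-byte
    normalized = "".join(CONFUSABLES.get(c, c) for c in chars)
    reasons = [f"{c!r}->{CONFUSABLES[c]!r} (ocr_confusable)" for c in swaps]
    # Toy confidence: share of characters that were already digits.
    confidence = round(len(digits) / (len(digits) + len(swaps)), 2)
    return normalized, confidence, reasons


def provenance_line(table_id, row, col, raw_cell):
    """Serialize one cell's repair as a single JSONL line."""
    normalized, confidence, reasons = repair_cell(raw_cell)
    record = {
        "table_id": table_id,
        "row": row,
        "col": col,
        "raw_cell": raw_cell,
        "normalized_cell": normalized,
        "confidence": confidence,
        "reason": reasons,
    }
    return json.dumps(record, sort_keys=True)
```

Under these assumptions, repair_cell("2,9l8") yields "2,918" with one recorded reason, while a word cell such as "Oslo" is returned untouched. The function is stateless, so the same input always produces the same JSONL line.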
Cycle 33· important· pipeline-v0.6
Tested M3-HYDE — HyDE-style vector retrieval (small independent generator + nomic-embed-text 768-dim embeddings + cosine search over the pre-built card index).
Learned The architecture that scales: exact cosine search is linear in index size but cheap at 768 dimensions (and sub-linear with an approximate-nearest-neighbor index), so the 407-card index extends naturally to 100K+ cards on commodity hardware, and retrieval no longer depends on the eval model's TOC-reading ability. In-corpus results are mixed: HyDE wins on 4 of 7 open models, including the three weakest (Llama, Apertus, ClimateGPT), and achieves perfect retrieval (8/8) on two previously-0% questions, but loses on negative-control queries, where IDX still wins. Hybrid retrieval is the natural next experiment.
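The HyDE flow above can be sketched as follows. The generator and embedder are passed in as plain callables, so nothing here assumes a particular client for nomic-embed-text. The toy vocabulary, toy_embed, toy_generate, and the card ids are hypothetical stand-ins, and the search is a brute-force scan rather than an approximate index.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(question, generate, embed, index, k=3):
    """Return the ids of the k cards closest to a generated hypothetical answer.

    index is a list of (card_id, vector) pairs built ahead of time.
    """
    hypothetical = generate(question)  # draft a plausible answer passage
    query_vec = embed(hypothetical)    # embed the draft, not the question
    scored = [(cosine(query_vec, vec), card_id) for card_id, vec in index]
    scored.sort(reverse=True)
    return [card_id for _, card_id in scored[:k]]


# Toy stand-ins so the sketch runs without a model (all hypothetical).
VOCAB = ["cyprid", "nauplii", "station", "april"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def toy_generate(question):
    return "cyprid counts in april were high"


TOY_INDEX = [
    ("card_nauplii", toy_embed("nauplii counts by month april")),
    ("card_cyprid", toy_embed("cyprid counts by month april")),
    ("card_station", toy_embed("station list")),
]
```

With these stand-ins, hyde_search("How many cyprids in April?", toy_generate, toy_embed, TOY_INDEX, k=2) ranks card_cyprid first and card_nauplii second. The eval model never sees a TOC, which is the property the entry above credits for the lift on the weakest models.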
Supporting cycles (27)
Cycle 2
Tested First full benchmark run: 3 questions × Claude Code Agent × 4 input modes (M3-L4, M2c, M1c, M3-AC).
Learned Surfaced the pipeline mis-label on Q-NAT-012 — table_124's caption says 'cyprid' but the data is nauplii; the real cyprid table is table_125. Ground truth corrected to 2,918; this mis-label still costs the open tier in cycle 31.
Cycle 2.1
Tested Expanded panel run — 5 questions × 15 models.
Learned Established the frontier vs open-tier gradient on full v0.6 cards: frontier ~80%, open mid-tier ~50%, climategpt ~15%.
Cycle 3
Tested Cycle 2 rerun with prompt-prefix variations.
Learned Prompt prefixes only shift 1-2 cells in either direction; not a meaningful lever at the 13-question scale.
Cycle 4· fix-toc-misclass-and-decimal-ocr
Tested Cycle 3 retest with TOC-misclass fix and decimal-OCR repair applied — introduced the table_096_corrected and table_125_corrected overrides.
Learned Corrected-card overrides lift M3-L4 results materially and are still in active use 30 cycles later.
Cycle 5· fix-toc-misclass-and-decimal-ocr
Tested First full 13-question active set across V27, V35, NOAA-32079 on 3 models (grok-4, apertus-8b, climategpt-13b).
Learned grok-4 hits 12/13 (the closed-reference ceiling); apertus-8b at 6/13, climategpt-13b at 5/13. Defined the frontier-vs-open gradient that's anchored every cycle since.
Cycle 6· fix-toc-misclass-and-decimal-ocr
Tested Single-model grok-4 sweep across the 13-question active set to nail down the frontier ceiling.
Learned 12/13; the one missed cell (Q-NOAA-CALC-001) is the multi-page-table gap, the same gap that is still open in cycle 33.
Cycle 8· fix-toc-misclass-and-decimal-ocr
Tested Multi-mode comparison expanded — M2c (docling MD whole-document) and M2a (raw docling JSON, ~30 MB).
Learned M2a overflows context for every model including frontier; M2c works only for documents that fit. Established the negative-control floor and the linearized-text baseline.
Cycle 9· fix-toc-misclass-and-decimal-ocr
Tested A/B prompting experiment with an extra question prefix inserted between the artifact and the question.
Learned Prefix changes shifted 1-2 cells in either direction; not a reliable signal at 13 questions; prefix dropped from harness.
Cycle 10· fix-toc-misclass-and-decimal-ocr
Tested Multi-model sweep on the rescored harness after NEG-pass regex expansion.
Learned Rescored NEG regex correctly catches refusal phrasing across all six refusal-idiom families; no operational pass-rate deltas.
Cycle 11· fix-toc-misclass-and-decimal-ocr
Tested Targeted gemini sweep after dedup_grouped_claims() shipped — NOAA table_033 shrank 120 KB → 28 KB.
Learned Dedup successfully shrinks oversized cards without information loss; gemini handles deduped cards identically to un-deduped.
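The internals of dedup_grouped_claims() are not shown in this log. As a rough illustration of how a card can shrink without losing information, here is a sketch under the assumption that claims are dicts and that a duplicate is an exact repeat of the same (row, column, value) triple. The function name and the field names are hypothetical.

```python
def dedup_claims_sketch(claims):
    """Drop exact repeats of a claim while keeping first-seen order."""
    seen = set()
    kept = []
    for claim in claims:
        # Hypothetical identity key: two claims with the same triple say the same thing.
        key = (claim["row"], claim["column"], claim["value"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(claim)
    return kept
```

Because only exact repeats are dropped, every distinct cell value survives, which is the sense of "without information loss" this sketch relies on.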
Cycle 12· fix-toc-misclass-and-decimal-ocr
Tested Two-model targeted re-runs: Q-NAT-012 climategpt-13b + Q-NOAA-LOOKUP-001 apertus-8b after override resolver fix.
Learned Corrected-card overrides are now correctly applied to local models too; dedup restored the retrieval crutch apertus-8b needed.
Cycle 14· micro-1k
Tested pipeline-v0.7-micro-1k card variant — stripped down to ~1.2 KB to fit ClimateGPT's 4K context.
Learned Open-tier lifts 27% → 40% just by removing frontmatter and context envelope; format size matters as much as content quality.
Cycle 15· compact-2k
Tested pipeline-v0.7-compact-2k variant — micro-1k plus one nearby paragraph.
Learned 49% open-tier — adding one paragraph recovers some interpolation queries that micro-1k loses.
Cycle 16· table-only
Tested pipeline-v0.7-table-only variant — bare table + caption, no metadata.
Learned 52% open-tier — for cell-lookup questions, the table and a one-line caption are nearly sufficient.
Cycle 18· labeled
Tested pipeline-v0.7-labeled variant — explicit per-section provenance labels.
Learned 49% open-tier: labels don't help and sometimes hurt, because they bloat the card and dilute the signal.
Cycle 19· csv-plus-headings
Tested pipeline-v0.7-csv-plus-headings hybrid variant.
Learned Marginal lift vs csv-only; section headings alone don't recover the interpolation-query gap.
Cycle 20· csv-plus-scope
Tested pipeline-v0.7-csv-plus-scope hybrid variant (years, geography injected).
Learned Scope tokens help NEG-control questions slightly; no headline shift.
Cycle 21· csv-plus-paragraph
Tested pipeline-v0.7-csv-plus-paragraph hybrid variant.
Learned Adding 1 nearby paragraph mostly matches csv-only — context doesn't generally help and sometimes regresses smaller models.
Cycle 22· csv-plus-all-context
Tested pipeline-v0.7-csv-plus-all-context hybrid variant (everything combined).
Learned ~52% open-tier — combining hybrid context inputs doesn't beat plain csv-only; minimum-viable wins.
Cycle 23· csv-only
Tested Minor variant-comparison rerun on csv-only with a 3-model subset for spot-checks.
Learned No headline shift; results stable.
Cycle 24· csv-only
Tested Minor variant-comparison rerun on csv-only with a 2-model subset.
Learned No headline shift; results stable.
Cycle 25· csv-only
Tested Minor variant-comparison rerun on csv-only with a 3-model subset.
Learned No headline shift; results stable.
Cycle 26· stitched
Tested pipeline-v0.7-stitched variant — multi-page tables recombined inline via the extends_to_page_bottom signal.
Learned Closes Q-NOAA-CALC-001 (cruise station sum) for the closed-reference baseline; open-tier still bottlenecks on the 12-cell arithmetic — the gap is now model-side, not pipeline-side.
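A sketch of how fragments could be recombined using the extends_to_page_bottom signal. That flag name comes from the entry above; the fragment layout (page, header, rows) and the rule "next page plus identical header" are assumptions for illustration, not the pipeline's actual stitching logic.

```python
def stitch_tables(fragments):
    """Merge consecutive table fragments that continue across a page break.

    Each fragment is a dict with keys: page, header, rows, extends_to_page_bottom.
    """
    stitched = []
    for frag in fragments:
        prev = stitched[-1] if stitched else None
        continues = (
            prev is not None
            and prev["extends_to_page_bottom"]      # previous piece ran off the page
            and frag["page"] == prev["last_page"] + 1
            and frag["header"] == prev["header"]    # assumed: header repeats on each page
        )
        if continues:
            prev["rows"].extend(frag["rows"])
            prev["last_page"] = frag["page"]
            prev["extends_to_page_bottom"] = frag["extends_to_page_bottom"]
        else:
            stitched.append({
                "header": list(frag["header"]),
                "rows": list(frag["rows"]),
                "first_page": frag["page"],
                "last_page": frag["page"],
                "extends_to_page_bottom": frag["extends_to_page_bottom"],
            })
    return stitched
```

Once the rows sit in one table, a station sum becomes a single-card question, which matches the observation that the remaining gap is arithmetic on the model side.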
Cycle 27· csv-normalized
Tested pipeline-v0.7-csv-normalized variant — grok-4 LLM-based CSV header normalization.
Learned DISCARDED. grok-4 hallucinated duplicate column names; Granite-3.3-8B regressed 11/13 → 4/13. Motivated cycle 28's deterministic-rules version and the 'no LLM in the normalization layer' contract for cycle 30.
Cycle 28· csv-demerged
Tested pipeline-v0.7-csv-demerged variant — deterministic row-de-merge for fused-row CSVs (e.g. 'April May,1491 35').
Learned Closes Q-NAT-012 cyprid (April 1947 = 2,918) for Qwen-2.5-7B, Granite-3.3-8B, Gemma-2 9B, and Apertus 8B; the trigger, though conservative, still introduces minor regressions on NOAA cruise tables where the same rule fires inappropriately.
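A minimal sketch of a deterministic row de-merge, assuming the trigger is "every cell in the row holds the same number (two or more) of whitespace-separated tokens". This rule is an assumption chosen for illustration and may differ from the one the pipeline uses.

```python
def demerge_row(cells):
    """Split one fused CSV row into several rows, or return it unchanged.

    Fires only when every cell has the same token count and that count is >= 2.
    """
    parts = [cell.split() for cell in cells]
    counts = {len(p) for p in parts}
    if len(counts) != 1:
        return [cells]  # cells disagree on token count: leave the row alone
    n = counts.pop()
    if n < 2:
        return [cells]  # nothing fused
    # Token i of every cell forms output row i.
    return [[p[i] for p in parts] for i in range(n)]
```

Here demerge_row(["April May", "1491 35"]) gives [["April", "1491"], ["May", "35"]]. The same rule also splits a legitimate row such as ["Station 12", "North Atlantic"], which is the kind of inappropriate firing noted above for the cruise tables.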
Cycle 29· table-normalized
Tested First version of the Evidence-Preserving Table Normalization Layer.
Learned DISCARDED. Global em-dash normalization and caption rewrites caused 17 regressions vs 6 lifts. Origin of the byte-identity-for-unchanged-tables correctness invariant.
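The byte-identity invariant can be expressed as a small check, sketched here under the assumption that the normalizer returns its output bytes together with a list of per-cell repair records. The function name and that return shape are hypothetical.

```python
def violates_byte_identity(raw: bytes, normalized: bytes, repairs: list) -> bool:
    """True when a table changed even though no repair was recorded for it."""
    if repairs:
        return False  # changes are allowed when a repair record explains them
    return raw != normalized


# A global em-dash rewrite with no recorded repair would trip the check.
raw_table = "depth,count\n10,\u2014\n".encode("utf-8")
rewritten = "depth,count\n10,-\n".encode("utf-8")
```

Run as a regression gate over every table, a check of this kind would flag the global rewrites that caused this cycle's regressions before they reach the panel.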
Cycle 32· doc-index-enriched
Tested pipeline-v0.7-doc-index-enriched — column headers + sample rows + scope detection per TOC entry.
Learned Important negative result that protects scale: enrichment alone doesn't fix retrieval — 0%-retrieval questions stayed 0%, +2 cells net across 8 models. Combined with the structural problem that M3-IDX TOC navigation breaks at thousands of tables anyway, this cycle says: stop investing in doc-index variants; the path forward is vector retrieval (cycle 33), not TOC refinement.