GovTools
Per-table card variant · Experimental

CSV with multi-page table stitching

Reunites tables that were split across a PDF page break. Closes Q-NOAA-CALC-001 for the frontier-tier reference model; open models still need stronger arithmetic to benefit from it.

In one paragraph

When the v0.6.1 pipeline flags `extends_to_page_bottom: true` on a card, this variant inspects the next card and, if the column structures match, concatenates the two CSVs minus the duplicate header. This is a direct evidence-side fix for the Q-NOAA-CALC-001 multi-page-table failure: table_017 (PDF page 18, cruises 50Y01–50Y11) and table_018 (PDF page 19, starting with 50Y12) become a single 24-row card containing all twelve 1950 cruises plus continuations.
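
The stitching step described above can be sketched roughly as follows. This is a minimal illustration, not the code in `generate_stitched_variant.py`: the card layout (a dict with a `csv` string next to the `extends_to_page_bottom` flag), the function names, and the rule that "matching column structure" means an equal column count are all assumptions made for the example.

```python
import csv
import io


def parse_rows(csv_text):
    """Parse CSV text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(csv_text)))


def columns_match(first_rows, second_rows):
    # Assumed compatibility rule: both tables are non-empty and their
    # first rows have the same number of columns.
    if not first_rows or not second_rows:
        return False
    return len(first_rows[0]) == len(second_rows[0])


def maybe_stitch(card, next_card):
    """Return the stitched CSV text, or None if the pair should not be stitched.

    `card` and `next_card` are hypothetical dicts with a "csv" string and an
    optional "extends_to_page_bottom" boolean.
    """
    if not card.get("extends_to_page_bottom"):
        return None
    first = parse_rows(card["csv"])
    second = parse_rows(next_card["csv"])
    if not columns_match(first, second):
        return None
    # Drop the continuation's first row only when it repeats the header.
    body = second[1:] if second[0] == first[0] else second
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(first + body)
    return out.getvalue()
```

Returning `None` instead of raising keeps the transformation a pure pass-through for cards that are not flagged or do not match, so unflagged cards would be emitted unchanged.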

How the inputs are generated

Generation · 01
Generator script
evaluation_runs/generate_stitched_variant.py
Input sources
  • pipeline-v0.6.1 cards with extends_to_page_bottom signal
  • Docling table CSVs
AI use
No — pure deterministic transformation
OCR / re-OCR
Inherits from the upstream pipeline variant
Approximate processing time
~3 seconds for 407 cards.
Resource intensity
Low — CPU-only post-processing, runs in seconds
Determinism
Deterministic (same input → same output, byte-identical)
Output location
card_sets/pipeline-v0.7-stitched/
Cards produced
407 cards (15 stitched pairs detected — 6 V27, 3 V35, 6 NOAA)
Introduced
v0.7 task #47/6 implementation, 2026-05-22.
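
One way to spot-check the byte-identical determinism claim is to hash the output directory after two separate runs and compare the digests. The sketch below is an assumption about how such a check could be written; `directory_digest` is a hypothetical helper and is not part of the pipeline.

```python
import hashlib
from pathlib import Path


def directory_digest(root):
    """SHA-256 over every file's relative path and bytes, in sorted path order."""
    root = Path(root)
    digest = hashlib.sha256()
    files = sorted(p for p in root.rglob("*") if p.is_file())
    for path in files:
        # Include the relative path so renamed files change the digest.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Running the generator twice into two separate directories and comparing `directory_digest` on each would confirm that the same input yields the same bytes.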

Evaluation results

Diagnostic · 02
Best open-model score
9/13 (Qwen2.5-7B, Granite-3.3-8B)
Avg open-tier pass rate
Dropped slightly versus the csv-only variant (some models were confused by the larger card)
Typical card size
5.5 KB for stitched cards (vs 1.9 KB un-stitched)
Evaluation cycle
Cycle 26
Relative to v0.6.1 baseline
Closes Q-NOAA-CALC-001 for the reference frontier model; open models still struggle with 12-cell arithmetic even when the data is present

Questions this variant addresses

Coverage · 03
  • Q-NOAA-CALC-001 (cruise stations sum across page break) — for the frontier reference baseline
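
To illustrate why stitching matters for a cross-page sum, the sketch below totals one column of a CSV card. The column name `stations` and the station counts are made-up placeholders, not values from the NOAA document; only the cruise IDs follow the naming used above.

```python
import csv
import io


def sum_column(csv_text, column):
    """Sum the integer values of one named column, skipping blank cells."""
    total = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        cell = (row.get(column) or "").strip()
        if cell:
            total += int(cell)
    return total


# Placeholder data: the last rows of one page, then the same card
# with the continuation row from the next page appended.
page_18_only = "cruise,stations\n50Y10,5\n50Y11,3\n"
stitched = page_18_only + "50Y12,4\n"

print(sum_column(page_18_only, "stations"))  # 8: misses the continuation row
print(sum_column(stitched, "stations"))      # 12: includes cruise 50Y12
```

With the un-stitched card the final cruise is simply absent, so no amount of model-side arithmetic could produce the correct total.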

Caveats and known limitations

Scope · 05
  • Stitched cards are larger than un-stitched ones (~3× in the NOAA case), which puts them slightly above the 4K threshold that ClimateGPT-7B/13B require.
  • Open-tier models still struggle to sum 12 cells correctly even when the data is present. The pipeline-side gap is closed; the model-side arithmetic gap is the remaining bottleneck.
  • False-positive risk: the column-compatibility check is heuristic, so two consecutive unrelated tables with similar columns could be stitched in error (none observed in the current corpus).
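
The false-positive risk can be made concrete with a small comparison. The source does not specify the heuristic, so both checks below are hypothetical: one compares only the column count, the other also requires identical header names. The table contents are invented for the example.

```python
import csv
import io


def header(csv_text):
    """Return the normalized first row of a CSV, or an empty list."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return [cell.strip().lower() for cell in rows[0]] if rows else []


def same_column_count(first_csv, second_csv):
    # Loose check: only the number of columns has to agree.
    first, second = header(first_csv), header(second_csv)
    return len(first) > 0 and len(first) == len(second)


def same_header_names(first_csv, second_csv):
    # Stricter check: the header names themselves have to agree.
    first, second = header(first_csv), header(second_csv)
    return len(first) > 0 and first == second


# Two unrelated two-column tables that happen to sit on consecutive pages.
cruises = "cruise,stations\n50Y11,3\n"
unrelated = "region,count\nNorth,9\n"

print(same_column_count(cruises, unrelated))  # True: would be stitched in error
print(same_header_names(cruises, unrelated))  # False: rejected
```

The stricter check is not free: a continuation page that does not repeat its header would fail the name comparison, so a real implementation would have to trade missed stitches against false ones.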
