Per-table card variant · Experimental

CSV with multi-page table stitching

Reunites tables that were split across a PDF page break. Closes Q-NOAA-CALC-001 for the frontier-tier reference model; open models still need stronger arithmetic to benefit from it.

In one paragraph

When the v0.6.1 pipeline flags `extends_to_page_bottom: true` on a card, this variant inspects the next card and, if the column structures match, concatenates the two CSVs, dropping the duplicate header. This is a direct evidence-side fix for the Q-NOAA-CALC-001 multi-page-table failure: table_017 (PDF page 18, cruises 50Y01–50Y11) and table_018 (PDF page 19, starting with 50Y12) become a single 24-row card containing all 12 cruises from 1950 plus continuations.
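The stitching step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of the generator script: the card layout (a dict with `extends_to_page_bottom` and `csv` keys), the helper names, and the "same column count" compatibility rule are all assumptions, and the station counts in the example are invented.

```python
import csv
import io


def parse_csv(text):
    """Parse CSV text into a list of non-empty rows (lists of strings)."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def columns_match(rows_a, rows_b):
    """Hypothetical compatibility rule: both tables have the same column count."""
    return bool(rows_a) and bool(rows_b) and len(rows_a[0]) == len(rows_b[0])


def stitch(card, next_card):
    """Return the stitched CSV text, or None when the pair should stay separate."""
    if not card.get("extends_to_page_bottom"):
        return None
    rows_a = parse_csv(card["csv"])
    rows_b = parse_csv(next_card["csv"])
    if not columns_match(rows_a, rows_b):
        return None
    # Drop the continuation's first row only when it repeats the header.
    if rows_b[0] == rows_a[0]:
        rows_b = rows_b[1:]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows_a + rows_b)
    return out.getvalue()


# Invented example data; only the cruise identifiers come from the text above.
first = {"extends_to_page_bottom": True, "csv": "cruise,stations\n50Y11,4\n"}
second = {"extends_to_page_bottom": False, "csv": "cruise,stations\n50Y12,6\n"}
print(stitch(first, second))
```

Comparing the first rows before dropping one means a continuation table that omits its header does not lose a data row.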

How the inputs are generated

Generation · 01

Generator script

evaluation_runs/generate_stitched_variant.py

Input sources

• pipeline-v0.6.1 cards with extends_to_page_bottom signal
• Docling table CSVs

AI use

No — pure deterministic transformation

OCR / re-OCR

Inherits from the upstream pipeline variant

Approximate processing time

~3 seconds for 407 cards.

Resource intensity

Low — CPU-only post-processing, runs in seconds

Determinism

Deterministic (same input → same output, byte-identical)
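One way to check the byte-identical claim is to hash every file from two separate runs and compare the results. The sketch below assumes each run writes into its own directory; the function names are illustrative and not part of the pipeline.

```python
import hashlib
from pathlib import Path


def digest_tree(root):
    """Map each file's relative path to the SHA-256 digest of its bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def same_output(run_a, run_b):
    """True when both directories hold the same files with identical bytes."""
    return digest_tree(run_a) == digest_tree(run_b)
```

Calling `same_output` on two output directories produced from the same input should return `True` if the transformation is deterministic.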

Output location

card_sets/pipeline-v0.7-stitched/

Cards produced

407 cards (15 stitched pairs detected — 6 V27, 3 V35, 6 NOAA)

Introduced

v0.7 task #47/6 implementation, 2026-05-22.

Evaluation results

Diagnostic · 02

Best open-model score

9/13 (Qwen2.5-7B, Granite-3.3-8B)

Avg open-tier pass rate

Dropped slightly versus the csv-only variant (some models were confused by the larger cards)

Typical card size

5.5 KB for stitched cards (vs 1.9 KB un-stitched)

Evaluation cycle

Cycle 26

Relative to v0.6.1 baseline

Closes Q-NOAA-CALC-001 for the reference frontier model; open models still struggle with 12-cell arithmetic even when the data is present

Questions this variant addresses

Coverage · 03

Q-NOAA-CALC-001 (cruise stations sum across page break) — for the frontier reference baseline
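Once the table is a single card, the sum the question asks for reduces to adding one column. The sketch below shows that arithmetic on a stitched CSV; the column name `stations` and the values are invented for illustration and do not come from the NOAA table.

```python
import csv
import io


def sum_column(csv_text, column):
    """Sum the integer cells of one column, skipping blank or non-numeric cells."""
    total = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        cell = (row.get(column) or "").strip().replace(",", "")
        if cell.isdigit():
            total += int(cell)
    return total


# Invented rows that straddle the former page break.
stitched = "cruise,stations\n50Y11,4\n50Y12,6\n"
print(sum_column(stitched, "stations"))  # prints 10
```

A deterministic total like this could serve as the expected value when scoring model answers.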

Caveats and known limitations

Scope · 05

• Stitched cards are larger than un-stitched ones (~3× in the NOAA case), which puts them slightly above the 4K threshold that ClimateGPT-7B/13B require.
• Open-tier models still struggle to sum 12 cells correctly even when the data is present. The pipeline-side gap is closed; the model-side arithmetic gap is the remaining bottleneck.
• False-positive risk: the column-compatibility check is heuristic, so two consecutive unrelated tables with similar columns could be stitched by mistake (none observed in the current corpus).
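The size caveat above can be screened for mechanically. The sketch below flags cards whose text exceeds a budget; it assumes the 4K threshold is measured in bytes (4096), which the text does not state, so the limit is left as a parameter. The function name and the card mapping are hypothetical.

```python
def oversized_cards(cards, limit_bytes=4096):
    """Return the names of cards whose UTF-8 text exceeds limit_bytes.

    `cards` maps a card name to its text. The default limit is an
    assumption; pass a different value if the threshold uses another unit.
    """
    return [
        name
        for name, text in cards.items()
        if len(text.encode("utf-8")) > limit_bytes
    ]
```

Cards returned by this check are candidates for exclusion or truncation when evaluating models with a smaller input budget.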

Related variants

Cross-reference · 06
