Per-table card variant · Production · recommended
CSV-only card
Table data rendered as raw CSV inside a Markdown code block. The most-effective open-tier variant.
In one paragraph
Strips the full v0.6.1 card down to caption + PDF page + the CSV that Docling already exported. No YAML frontmatter, no context envelope, no candidate captions, no nearby paragraphs. The hypothesis was that open models trained heavily on code parse CSV more reliably than Markdown tables wrapped in metadata. Confirmed by cycle 17: two 7-8B open-weight models (Qwen2.5-7B, Granite-3.3-8B) reach 11/13 on this variant, lifting the open-tier average pass rate from 27% to 55%.
How the inputs are generated
Generation · 01Generator script
evaluation_runs/generate_map_variants.py:render_csv_onlyInput sources
- • Existing pipeline-v0.6.1 cards (frontmatter for caption + page)
- • Existing tables/table_NNN.csv (Docling export)
AI use
No — pure deterministic transformation
OCR / re-OCR
Inherits from the upstream pipeline variant
Approximate processing time
~5 seconds for 407 cards across 3 documents (CPU-only transform, no Docling re-run).
Resource intensity
Low — CPU-only post-processing, runs in seconds
Determinism
Deterministic (same input → same output, byte-identical)
Output location
card_sets/pipeline-v0.7-csv-only/Cards produced
409 cards (407 base + 2 corrected overrides)
Introduced
v0.7 variant family, 2026-05-22.
Evaluation results
Diagnostic · 02Best open-model score
11/13 — Qwen2.5-7B, Granite-3.3-8B
Avg open-tier pass rate
55%
Typical card size
~1.5 KB per card (median 903 bytes, max 34 KB)
Evaluation cycle
Cycle 17
Relative to v0.6.1 baseline
+28 percentage points open-tier
Example transformation
Sample · 04Cruise Number Total SEA,empty_header,DAY Total Station Dates,empty_header_1,Area covered,Nature of the survey & Techniques used
Caveats and known limitations
Scope · 05- • Docling's CSV exports sometimes have malformed headers (`empty_header` placeholders, concatenated multi-row headers). See `csv-normalized-rules` for a deterministic fix.
- • Fused-row rows (e.g. cyprid `April May,1491 35,...`) still cause CELL_READ_ERROR. See `csv-demerged`.
- • Multi-page tables are still split at the PDF page break. See `stitched`.
Related variants
Cross-reference · 06- Per-table card variantCSV-only with deterministic header normalizationCSV cards with column headers cleaned by deterministic rules (no LLM). Addresses cycle-17 CELL_READ_ERROR failures from malformed Docling headers.
- Per-table card variantCSV with row-de-mergeCSV cards with fused multi-row records split back. Addresses OCR-merged rows like the cyprid 'April May,1491 35,...' case.
- Per-table card variantCSV with multi-page table stitchingMulti-page tables reunited across the PDF page break. Closes Q-NOAA-CALC-001 for frontier-tier reference; open models still need stronger arithmetic to use it.