Per-table card variant · Production · recommended

CSV-only card

Table data rendered as raw CSV inside a Markdown code block. The most effective variant for the open tier.

In one paragraph

Strips the full v0.6.1 card down to three elements: the caption, the PDF page, and the CSV that Docling already exported. It drops the YAML frontmatter, the context envelope, the candidate captions, and the nearby paragraphs. The hypothesis was that open models trained heavily on code parse raw CSV more reliably than Markdown tables wrapped in metadata. Cycle 17 confirmed it: two open-weight models in the 7B to 8B range (Qwen2.5-7B, Granite-3.3-8B) reach 11/13 on this variant, lifting the average open-tier pass rate from 27% to 55%.
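The stripping step described above can be sketched roughly as follows. This is a minimal illustration and not the actual `render_csv_only` implementation: the function name `render_csv_only_card`, its parameters, and the exact output layout (caption line, page line, fenced CSV) are all assumptions, as are the sample caption and page number.

```python
def render_csv_only_card(caption: str, page: int, csv_text: str) -> str:
    """Hypothetical sketch: build a card from caption, PDF page, and raw CSV.

    The layout is an assumption; the real generator may format the
    caption and page lines differently.
    """
    fence = "`" * 3  # Markdown code fence, built this way to avoid nesting fences
    lines = [
        caption.strip(),
        f"PDF page: {page}",
        "",
        f"{fence}csv",
        csv_text.strip("\n"),  # CSV passes through untouched, no re-parsing
        fence,
    ]
    return "\n".join(lines) + "\n"


card = render_csv_only_card(
    caption="Table 3. Cruise summary",  # made-up caption for illustration
    page=12,                            # made-up page number
    csv_text="Cruise Number,Area covered\n101,North\n",
)
print(card)
```

The CSV is passed through as an opaque string in this sketch, which matches the idea that the variant adds no interpretation on top of the Docling export.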

How the inputs are generated

Generation · 01
Generator script
evaluation_runs/generate_map_variants.py:render_csv_only
Input sources
  • Existing pipeline-v0.6.1 cards (frontmatter for caption + page)
  • Existing tables/table_NNN.csv (Docling export)
AI use
No — pure deterministic transformation
OCR / re-OCR
Inherits from the upstream pipeline variant
Approximate processing time
~5 seconds for 407 cards across 3 documents (CPU-only transform, no Docling re-run).
Resource intensity
Low — CPU-only post-processing, runs in seconds
Determinism
Deterministic (same input → same output, byte-identical)
Output location
card_sets/pipeline-v0.7-csv-only/
Cards produced
409 cards (407 base + 2 corrected overrides)
Introduced
v0.7 variant family, 2026-05-22.
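One way to spot-check the byte-identical determinism claim is to hash the full card set after two runs and compare the digests. The sketch below is an assumption about how such a check could look, not part of the pipeline; `digest_card_set` and the sample file names and contents are hypothetical.

```python
import hashlib


def digest_card_set(cards: dict[str, bytes]) -> str:
    """Return one SHA-256 digest over a card set, independent of dict order."""
    h = hashlib.sha256()
    for name in sorted(cards):  # sort so traversal order cannot change the digest
        h.update(name.encode("utf-8"))
        h.update(b"\0")  # separator between name and content
        h.update(len(cards[name]).to_bytes(8, "big"))  # length prefix avoids ambiguity
        h.update(cards[name])
    return h.hexdigest()


# Two hypothetical runs over the same inputs; names and contents are made up.
run_a = {"table_001.md": b"caption\n...", "table_002.md": b"other\n"}
run_b = {"table_002.md": b"other\n", "table_001.md": b"caption\n..."}
assert digest_card_set(run_a) == digest_card_set(run_b)
```

In practice the dictionaries would be filled by reading every file under the output directory as bytes after each run.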

Evaluation results

Diagnostic · 02
Best open-model score
11/13 (Qwen2.5-7B, Granite-3.3-8B)
Avg open-tier pass rate
55%
Typical card size
~1.5 KB per card (median 903 bytes, max 34 KB)
Evaluation cycle
Cycle 17
Relative to v0.6.1 baseline
+28 percentage points open-tier

Example transformation

Sample · 04
Cruise Number Total SEA,empty_header,DAY Total Station Dates,empty_header_1,Area covered,Nature of the survey & Techniques used

Caveats and known limitations

Scope · 05
  • Docling's CSV exports sometimes have malformed headers (`empty_header` placeholders, concatenated multi-row headers). See `csv-normalized-rules` for a deterministic fix.
  • Fused rows (e.g. cyprid `April May,1491 35,...`) still cause CELL_READ_ERROR. See `csv-demerged`.
  • Multi-page tables are still split at the PDF page break. See `stitched`.
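As a rough illustration of the first caveat, placeholder headers can be flagged before a card is used. This sketch only detects them and does not reproduce the `csv-normalized-rules` fix. The pattern `empty_header` with an optional `_N` suffix is inferred from the sample header above and is an assumption, as is the helper name `placeholder_columns`.

```python
import csv
import io
import re

# Assumed placeholder pattern, inferred from the sample header: empty_header, empty_header_1, ...
PLACEHOLDER = re.compile(r"^empty_header(_\d+)?$")


def placeholder_columns(header_line: str) -> list[int]:
    """Return the zero-based indexes of columns whose header is a placeholder."""
    fields = next(csv.reader(io.StringIO(header_line)), [])  # [] for empty input
    return [i for i, name in enumerate(fields) if PLACEHOLDER.match(name.strip())]


header = (
    "Cruise Number Total SEA,empty_header,DAY Total Station Dates,"
    "empty_header_1,Area covered,Nature of the survey & Techniques used"
)
print(placeholder_columns(header))  # [1, 3]
```

Using the `csv` module rather than `str.split(",")` keeps quoted header cells that contain commas intact.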

Related variants

Cross-reference · 06