GovTools
Per-table card variant · Experimental

CSV with row-de-merge

CSV cards in which fused multi-row records are split back into separate rows. Targets OCR-merged rows such as the cyprid 'April May,1491 35,...' case.

In one paragraph

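Docling sometimes fuses two or three adjacent table rows into a single CSV row, leaving space-separated values inside each cell. This variant detects those rows and splits them back using conservative deterministic rules. A row is split only when both conditions hold: (1) the row label consists of 2-3 consecutive months or consecutive years, which gives the number N of fused rows; and (2) every cell contains either exactly N tokens or a single token, which is replicated across the N split rows. The conservative trigger is meant to limit false positives, although cycle 28 found that it still fires on some tables where splitting hurts (see the caveats).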
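The rule above can be sketched in a few lines of Python. This is an illustrative reconstruction from the description, not the code in generate_row_demerge_variant.py: the function names `fused_label_parts`, `demerge_row`, and `demerge_csv` are hypothetical, and the sketch assumes full English month names, four-digit years, a 2-3 limit for year labels as well as month labels, and no December-to-January wrap-around.

```python
import csv
import io

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]


def fused_label_parts(label):
    """Return the label tokens if they are 2-3 consecutive months or years, else None."""
    parts = label.split()
    if not 2 <= len(parts) <= 3:
        return None
    if all(p in MONTHS for p in parts):
        idx = [MONTHS.index(p) for p in parts]
    elif all(p.isdigit() and len(p) == 4 for p in parts):  # assumption: 4-digit years
        idx = [int(p) for p in parts]
    else:
        return None
    consecutive = all(b - a == 1 for a, b in zip(idx, idx[1:]))
    return parts if consecutive else None


def demerge_row(row):
    """Split one fused CSV row into N rows, or return it unchanged as a one-row list."""
    if not row:
        return [row]
    labels = fused_label_parts(row[0])
    if labels is None:
        return [row]
    n = len(labels)
    columns = []
    for cell in row[1:]:
        tokens = cell.split()
        if len(tokens) == n:
            columns.append(tokens)
        elif len(tokens) == 1:
            columns.append(tokens * n)  # single token is replicated across the split rows
        else:
            return [row]  # any other token count (including empty cells): leave the row alone
    return [[labels[i]] + [col[i] for col in columns] for i in range(n)]


def demerge_csv(text):
    """Apply demerge_row to every row of a CSV string and return the new CSV string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerows(demerge_row(row))
    return out.getvalue()
```

Under these assumptions, a row whose cells have mixed token counts (for example a three-month label where some cells hold two tokens) fails the check and passes through unsplit, which matches the conservative behaviour described above.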

How the inputs are generated

Generation · 01
Generator script
evaluation_runs/generate_row_demerge_variant.py
Input sources
  • pipeline-v0.6.1 csv-only base
  • Docling table CSVs
AI use
No — pure deterministic transformation
OCR / re-OCR
Inherits from the upstream pipeline variant
Approximate processing time
~2 seconds for 407 cards.
Resource intensity
Low — CPU-only post-processing, runs in seconds
Determinism
Deterministic (same input → same output, byte-identical)
Output location
card_sets/pipeline-v0.7-csv-demerged/
Cards produced
407 cards (8 with row-de-merges applied; 14 rows total split)
Introduced
v0.7 follow-up after cycle 26 + 27 diagnostics, 2026-05-23.

Evaluation results

Diagnostic · 02
Best open-model score
10/13 (Qwen2.5-7B)
Typical card size
~1.5 KB per card (de-merging adds rows, not bytes per row)
Evaluation cycle
Cycle 28
Relative to v0.6.1 baseline
Mixed. Closes the cyprid fusion case (Q-NAT-012 now passes for Qwen2.5-7B, Granite-3.3-8B, Gemma-2 9B, and Apertus 8B Instruct), but introduces regressions on other fused-row tables, where the trigger fires on data the models previously read correctly in fused form. Against the csv-only baseline, Qwen2.5-7B drops 11→10 and Granite-3.3-8B drops 11→9. Net open-tier change is slightly negative.

Questions this variant addresses

Coverage · 03
  • Q-NAT-012 (cyprid larvae April 1947) — confirmed closed for Qwen2.5-7B, Granite-3.3-8B, Gemma-2 9B, Apertus 8B Instruct, and the closed reference baseline

Example transformation

Sample · 04
Before: `April May,1491 35,137 a,110 68,2918 :`

After:
```
April,1491,137,110,2918
May,35,a,68,:
```

Caveats and known limitations

Scope · 05
  • Cycle 28 showed the de-merge is too eager on NOAA cruise schedule tables — splitting fused-year rows (e.g. '1974 1975') changed the table shape enough that Granite-3.3-8B regressed on Q-NOAA-PATTERN-001 and Q-NAT-015, where the un-split form had been answered correctly.
  • Conservative trigger means some fused rows still pass through unsplit (e.g. 3-month rows like 'April May June' where some cells have only 1 token but others have 2).
  • Replicated single-token values may not always be correct: when a fused cell holds only one token, we assume every merged row shared that value, although it may have belonged to just one of them.
  • OCR garbage in individual cells (e.g. 'a' instead of a number) is preserved, not corrected.
  • Next iteration: scope the trigger by table type (month-labeled monthly-series only), not by row-pattern heuristic alone.
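One way the table-type scoping proposed in the last bullet could look is a gate that runs before any row is split. The sketch below is hypothetical and not part of the current generator: the function name `is_monthly_series` and the `min_fraction` threshold of 0.8 are assumptions chosen for illustration, and the check treats a table as a monthly series when most non-empty row labels consist only of full English month names.

```python
MONTHS = {"January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"}


def is_monthly_series(rows, min_fraction=0.8):
    """Return True if most row labels are made up solely of month names.

    rows is a list of CSV rows (lists of strings); the first cell is the label.
    min_fraction is a hypothetical threshold, loose enough to tolerate a header row.
    """
    labels = [row[0].strip() for row in rows if row and row[0].strip()]
    if not labels:
        return False
    month_like = sum(
        1 for label in labels if all(tok in MONTHS for tok in label.split())
    )
    return month_like / len(labels) >= min_fraction
```

With a gate like this, year-labelled tables such as the NOAA cruise schedules would be left untouched, while a fused 'April May' row inside a month-labelled table would still qualify for splitting. Whether the threshold holds up on the real card set would need to be checked in a later evaluation cycle.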
