Per-table card variant · Experimental

CSV-only with deterministic header normalization

CSV cards with column headers cleaned by deterministic rules (no LLM). Addresses cycle-17 CELL_READ_ERROR failures from malformed Docling headers.

In one paragraph

Builds on csv-only by post-processing the header row of each CSV. Rules: replace `empty_header` placeholders (including suffixed forms such as `empty_header_1`) with positional `col_N` labels, where N is the column's zero-based position, as the sample transformation shows; snake-case existing labels; collapse consecutive underscores; truncate labels longer than 60 characters; de-duplicate within the header row by suffixing repeats with `_2`, `_3`, and so on. The normalizer is purely rule-based with no LLM in the loop, so it cannot hallucinate labels the way the cycle-27 LLM-normalized variant did when it introduced duplicate column names.
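The rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the contents of generate_csv_normalized_rules.py: the names `normalize_header`, `slugify`, and `rewrite_header_row` are hypothetical, the rule order is a guess, the `&` to `and` mapping is inferred from the sample transformation, and the header is assumed to occupy the first physical line of the CSV.

```python
import csv
import re

MAX_LEN = 60  # truncation limit named in the rules
# Assumption: placeholders look like "empty_header" or "empty_header_<n>".
PLACEHOLDER = re.compile(r"^empty_header(_\d+)?$")


def slugify(label):
    """Snake-case one label; runs of non-alphanumerics collapse to one '_'."""
    text = label.strip().lower().replace("&", " and ")  # inferred from the sample
    return re.sub(r"[^a-z0-9]+", "_", text).strip("_")


def normalize_header(labels):
    """Apply the placeholder, snake-case, truncate, and de-duplicate rules."""
    used, counts, out = set(), {}, []
    for index, label in enumerate(labels):
        raw = label.strip()
        if not raw or PLACEHOLDER.match(raw):
            slug = f"col_{index}"  # zero-based position, as in the sample
        else:
            slug = slugify(raw)[:MAX_LEN].rstrip("_") or f"col_{index}"
        # De-duplicate: first occurrence keeps its name, repeats get _2, _3, ...
        # (a suffix can push a truncated label slightly past MAX_LEN).
        n, candidate = counts.get(slug, 1), slug
        while candidate in used:
            n += 1
            candidate = f"{slug}_{n}"
        counts[slug] = n
        used.add(candidate)
        out.append(candidate)
    return out


def rewrite_header_row(csv_text):
    """Rewrite only the first line; data rows pass through byte-for-byte."""
    first_line, newline, rest = csv_text.partition("\n")
    cr = "\r" if first_line.endswith("\r") else ""
    labels = next(csv.reader([first_line.rstrip("\r")]), [])
    # Normalized labels contain only [a-z0-9_], so a plain join needs no quoting.
    return ",".join(normalize_header(labels)) + cr + newline + rest
```

Under these assumptions, `normalize_header(["Depth (m)", "Depth (m)"])` yields `["depth_m", "depth_m_2"]`. Leaving the data rows untouched keeps the transformation limited to the header, which matches the variant's stated scope.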

How the inputs are generated

Generation · 01

Generator script

evaluation_runs/generate_csv_normalized_rules.py

Input sources

• pipeline-v0.6.1 csv-only base
• Docling table CSVs

AI use

No — pure deterministic transformation

OCR / re-OCR

Inherits from the upstream pipeline variant

Approximate processing time

~3 seconds for all 407 cards.

Resource intensity

Low — CPU-only post-processing, runs in seconds

Determinism

Deterministic (same input → same output, byte-identical)
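One way to spot-check the byte-identical claim is to run the transformation several times over the same inputs and compare digests. The sketch below assumes the transformation is available as a Python callable from CSV text to CSV text; `digest` and `is_deterministic` are hypothetical helper names, not part of the pipeline.

```python
import hashlib


def digest(transform, inputs):
    """SHA-256 over every output of `transform`, in input order."""
    h = hashlib.sha256()
    for text in inputs:
        out = transform(text).encode("utf-8")
        h.update(len(out).to_bytes(8, "big"))  # length prefix keeps card boundaries unambiguous
        h.update(out)
    return h.hexdigest()


def is_deterministic(transform, inputs, runs=3):
    """True if repeated runs over the same inputs give one identical digest."""
    inputs = list(inputs)
    return len({digest(transform, inputs) for _ in range(runs)}) == 1
```

A check like this only covers repeated runs in one process; comparing digests across machines or interpreter versions would need the same hashing applied to the written card files.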

Output location

card_sets/pipeline-v0.7-csv-normalized-rules/

Cards produced

407 cards (9 with duplicate columns resolved, 132 unchanged)

Introduced

v0.7 follow-up after the cycle-27 LLM-normalizer regression, 2026-05-23.

Evaluation results

Diagnostic · 02

Typical card size

~1.5 KB per card (about the same as csv-only; the header rewrite has a negligible effect on size)

Evaluation cycle

Cycle 28

Relative to v0.6.1 baseline

Pending a re-run with the deterministic version (the cycle-27 LLM-normalized version caused regressions on strong models)

Example transformation

Sample · 04

Before: `Cruise Number Total SEA,empty_header,DAY Total Station Dates,empty_header_1,Area covered,Nature of the survey & Techniques used`

After: `cruise_number_total_sea,col_1,day_total_station_dates,col_3,area_covered,nature_of_the_survey_and_techniques_used`

Caveats and known limitations

Scope · 05

• Does not fix Docling's multi-row-header fusion: concatenated header text such as 'Cruise Number Total SEA' is preserved as a single slug, and splitting it semantically would require structural reconstruction.
• Conservative: `empty_header` becomes `col_N` rather than a label inferred from the column's meaning, so models still need to work out from the data what each column contains.

Related variants

Cross-reference · 06

Per-table card variant
CSV-only card
Table data rendered as raw CSV inside a Markdown code block. The most effective open-tier variant.
