Per-table card variant · Experimental
CSV-only with deterministic header normalization
CSV cards with column headers cleaned by deterministic rules (no LLM). Addresses cycle-17 CELL_READ_ERROR failures from malformed Docling headers.
In one paragraph
Builds on csv-only by post-processing the header row of each CSV. Rules: replace `empty_header` placeholders with positional `col_N` labels, where `N` is the zero-based column index; snake-case existing labels; collapse consecutive underscores; truncate labels longer than 60 characters; de-duplicate within the header row (suffix duplicates with `_2`, `_3`). The pass is purely rule-based with no LLM in the loop, so it cannot hallucinate header content, unlike the cycle-27 LLM-normalized variant, which hallucinated duplicate column names.
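The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of the generator script: the function name `normalize_header`, the `empty_header(_N)` pattern, and the handling of `&` as `and` are assumptions inferred from the rule list and the sample transformation on this page.

```python
import re

# Assumption: Docling placeholders look like "empty_header" or "empty_header_<n>".
EMPTY_RE = re.compile(r"^empty_header(_\d+)?$")
MAX_LEN = 60  # truncation limit stated in the rule list


def normalize_header(labels):
    """Return a cleaned copy of one CSV header row (hypothetical helper)."""
    used = set()
    out = []
    for idx, raw in enumerate(labels):
        label = raw.strip()
        if label == "" or EMPTY_RE.match(label):
            # Positional label: zero-based column index.
            slug = f"col_{idx}"
        else:
            # Snake-case: lowercase, spell out "&", and turn every run of
            # non-alphanumerics into one underscore (this also collapses
            # consecutive underscores).
            slug = label.lower().replace("&", " and ")
            slug = re.sub(r"[^a-z0-9]+", "_", slug).strip("_")
            # Truncate, then fall back to a positional label if nothing is left.
            slug = slug[:MAX_LEN].rstrip("_") or f"col_{idx}"
        # De-duplicate within the row: the first occurrence keeps its name,
        # later ones get _2, _3, ... (skipping suffixes already taken).
        candidate, n = slug, 2
        while candidate in used:
            candidate = f"{slug}_{n}"
            n += 1
        used.add(candidate)
        out.append(candidate)
    return out
```

Because every step is a pure string operation on the header row alone, the function has no state beyond the current row and never touches the data cells.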
How the inputs are generated
Generation · 01
Generator script
evaluation_runs/generate_csv_normalized_rules.py
Input sources
- pipeline-v0.6.1 csv-only base
- Docling table CSVs
AI use
No — pure deterministic transformation
OCR / re-OCR
Inherits from the upstream pipeline variant
Approximate processing time
~3 seconds for all 407 cards.
Resource intensity
Low — CPU-only post-processing, runs in seconds
Determinism
Deterministic (same input → same output, byte-identical)
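One way to spot-check the byte-identical claim is to run the header rewrite twice on the same input and compare digests. The sketch below is an assumption-laden illustration: `rewrite_header`, `digest`, and `toy_normalize` are hypothetical names, and the toy normalizer stands in for the real rule set.

```python
import csv
import hashlib
import io
import re


def rewrite_header(csv_text, normalize):
    """Apply `normalize` to the first row of a CSV string; data rows are kept."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    if not rows:
        return csv_text
    rows[0] = normalize(rows[0])
    buf = io.StringIO(newline="")
    # A fixed line terminator keeps the output bytes stable across platforms.
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def digest(text):
    """SHA-256 of the UTF-8 encoded text, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def toy_normalize(labels):
    # Stand-in normalizer (assumption): lowercase and underscore-join only.
    return [re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_") for label in labels]


sample = "Area covered,Station Dates\n12,2019-04-01\n"
first = rewrite_header(sample, toy_normalize)
second = rewrite_header(sample, toy_normalize)
assert digest(first) == digest(second)  # same input, same bytes
```

The same comparison could be run over a whole output directory by hashing each card file from two independent runs.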
Output location
card_sets/pipeline-v0.7-csv-normalized-rules/
Cards produced
407 cards (9 with duplicate columns resolved, 132 unchanged)
Introduced
v0.7 follow-up after cycle 27 LLM-normalizer regression, 2026-05-23.
Evaluation results
Diagnostic · 02
Typical card size
~1.5 KB per card (same as csv-only; the header rewrite leaves card size essentially unchanged)
Evaluation cycle
Cycle 28
Relative to v0.6.1 baseline
Pending. The evaluation awaits a re-run with the deterministic version (the cycle-27 LLM version caused regressions on strong models).
Example transformation
Sample · 04
Before: `Cruise Number Total SEA,empty_header,DAY Total Station Dates,empty_header_1,Area covered,Nature of the survey & Techniques used`
After: `cruise_number_total_sea,col_1,day_total_station_dates,col_3,area_covered,nature_of_the_survey_and_techniques_used`
Caveats and known limitations
Scope · 05
- Doesn't fix Docling's multi-row-header fusion: concatenated header text like 'Cruise Number Total SEA' is preserved as one slug; semantic splitting would require structural reconstruction.
- Conservative: `empty_header` becomes `col_N` rather than inferring the column's meaning. Models still need to determine from data what each column contains.