GovTools
Per-table card variant · Experimental

CSV-only with deterministic header normalization

CSV cards with column headers cleaned by deterministic rules (no LLM). Addresses cycle-17 CELL_READ_ERROR failures from malformed Docling headers.

In one paragraph

Builds on csv-only by post-processing the header row of each CSV. The rules: replace `empty_header` placeholders with positional `col_N` labels, where N is the zero-based column index (as in the sample transformation in this document); snake-case existing labels; collapse consecutive underscores; truncate labels longer than 60 characters; and de-duplicate within the header row by suffixing repeats with `_2`, `_3`, and so on. The process is purely rule-based with no LLM in the loop, so the normalizer cannot hallucinate column names, unlike the cycle-27 LLM-normalized variant, which hallucinated duplicate column names.
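
The rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the contents of the generator script: the names `normalize_header`, `slugify`, `PLACEHOLDER` and `MAX_LEN` are hypothetical, the zero-based `col_N` index and the `&` to `and` replacement are inferred from the sample transformation in this document, and the placeholder pattern (`empty_header` with an optional numeric suffix) is a guess at what Docling emits.

```python
import re

MAX_LEN = 60  # truncation limit taken from the rule list
# Assumed placeholder shape: "empty_header" or "empty_header_<digits>".
PLACEHOLDER = re.compile(r"^empty_header(_\d+)?$")


def slugify(label: str) -> str:
    """Snake-case one label; runs of non-alphanumerics collapse to one underscore."""
    text = label.strip().lower().replace("&", " and ")  # inferred from the sample
    text = re.sub(r"[^a-z0-9]+", "_", text)  # ASCII-only: an assumption of this sketch
    return text.strip("_")


def normalize_header(labels: list[str]) -> list[str]:
    """Apply placeholder replacement, slugging, truncation and de-duplication."""
    used: set[str] = set()
    counts: dict[str, int] = {}
    result: list[str] = []
    for index, label in enumerate(labels):
        if PLACEHOLDER.match(label.strip()):
            name = f"col_{index}"
        else:
            # Fall back to the positional label if nothing survives slugging.
            name = slugify(label)[:MAX_LEN].rstrip("_") or f"col_{index}"
        candidate = name
        n = counts.get(name, 1)
        while candidate in used:  # suffix repeats with _2, _3, ...
            n += 1
            candidate = f"{name}_{n}"
        counts[name] = n
        used.add(candidate)
        result.append(candidate)
    return result


sample = [
    "Cruise Number Total SEA", "empty_header", "DAY Total Station Dates",
    "empty_header_1", "Area covered", "Nature of the survey & Techniques used",
]
print(",".join(normalize_header(sample)))
```

Run on the sample header from this document, the sketch reproduces the documented "After" row. Because every step is a pure string operation with no randomness or external state, repeated runs on the same input give the same output.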

How the inputs are generated

Generation · 01
Generator script
evaluation_runs/generate_csv_normalized_rules.py
Input sources
  • pipeline-v0.6.1 csv-only base
  • Docling table CSVs
AI use
No — pure deterministic transformation
OCR / re-OCR
Inherits from the upstream pipeline variant
Approximate processing time
~3 seconds for all 407 cards.
Resource intensity
Low — CPU-only post-processing, runs in seconds
Determinism
Deterministic (same input → same output, byte-identical)
Output location
card_sets/pipeline-v0.7-csv-normalized-rules/
Cards produced
407 cards (9 with duplicate columns resolved, 132 unchanged)
Introduced
v0.7 follow-up after the cycle-27 LLM-normalizer regression, 2026-05-23.
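
One way to check the byte-identical determinism claim listed above is to rewrite the same card twice and compare digests. The sketch below is a minimal illustration, not the generator script itself: `rewrite_header_row`, `sha256_hex` and `toy_normalizer` are hypothetical names, and the toy normalizer stands in for the real rule set. Note that this version round-trips every row through Python's `csv` module, which may change quoting in data rows; whether the real script does the same is not stated here.

```python
import csv
import hashlib
import io
from typing import Callable


def rewrite_header_row(
    csv_text: str, normalize: Callable[[list[str]], list[str]]
) -> str:
    """Return csv_text with only its first row replaced by normalize(first_row)."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    if not rows:
        return csv_text  # empty card: nothing to rewrite
    rows[0] = normalize(rows[0])
    buffer = io.StringIO(newline="")
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def sha256_hex(text: str) -> str:
    """Hex digest of the UTF-8 bytes of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def toy_normalizer(labels: list[str]) -> list[str]:
    """Stand-in for the real rules: lower-case and replace spaces."""
    return [label.strip().lower().replace(" ", "_") for label in labels]


card = "Area covered,empty_header\n12 km,yes\n"
first = rewrite_header_row(card, toy_normalizer)
second = rewrite_header_row(card, toy_normalizer)

# Same input, same output bytes: the digests must match.
assert sha256_hex(first) == sha256_hex(second)
```

The same comparison could be applied across two full runs of the output directory, hashing each card file and diffing the digest lists.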

Evaluation results

Diagnostic · 02
Typical card size
~1.5 KB per card (about the same as csv-only; the header rewrite touches only the first row, so card size is essentially unchanged)
Evaluation cycle
Cycle 28
Relative to v0.6.1 baseline
Evaluation pending a re-run with the deterministic version (the cycle-27 LLM version caused regressions on strong models)

Example transformation

Sample · 04
Before: `Cruise Number Total SEA,empty_header,DAY Total Station Dates,empty_header_1,Area covered,Nature of the survey & Techniques used`

After: `cruise_number_total_sea,col_1,day_total_station_dates,col_3,area_covered,nature_of_the_survey_and_techniques_used`

Caveats and known limitations

Scope · 05
  • Doesn't fix Docling's multi-row-header fusion — concatenated header text like 'Cruise Number Total SEA' is preserved as one slug; semantic splitting would require structural reconstruction.
  • Conservative — `empty_header` becomes `col_N` rather than inferring the column's meaning. Models still need to determine from data what each column contains.

Related variants

Cross-reference · 06