Evidence-Preserving Table Normalization Layer
Five-layer deterministic OCR-confusable + column-type + table-context + authority normalizer with per-cell provenance. Repairs 444 cells across 123 cards in the V27/V35 corpus; 0 cells touched on born-digital NOAA.
Implements the five-layer table-semantic normalization design:
- (1) Visual-confusable detection: I↔1, O↔0, B↔8, S↔5, G↔6, Z↔2, the cybersecurity-relevant subset that is also the systematic OCR error set.
- (2) Column-type inference: integer, decimal, percentage, year, date, month, agency_code, country, species_name.
- (3) Table-context correction using the column-type prior, numeric range plausibility, and ambiguity refusal.
- (4) Authority matching against curated US federal agencies and ISO countries lists with confusable-aware Levenshtein distance.
- (5) Confidence-preserving output: the raw extraction is always preserved alongside a `table.normalized.jsonl` side artifact carrying per-cell provenance (raw_cell, normalized_cell, column_type, column_range, reasons, confidence, ambiguous).
No LLM in the loop. Zero hallucination risk. Designed for archival infrastructure where the raw extraction must never be silently overwritten.
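As a rough illustration of how layers (1) to (3) might combine for a single cell, the sketch below repairs one cell in a column already inferred as integer. It is not the project's implementation: the function name `normalize_integer_cell`, the reason strings other than those shown in the side artifact, and the tuple-valued map are assumptions made for the example, and confidence scoring is omitted because the source does not describe how it is computed.

```python
from itertools import product

DIGITS = set("0123456789")

# Letter -> digit readings for the six confusable pairs named above.
# Values are tuples so a (hypothetical) map with several readings per
# glyph can exercise the ambiguity-refusal path.
CONFUSABLES = {"I": ("1",), "O": ("0",), "B": ("8",),
               "S": ("5",), "G": ("6",), "Z": ("2",)}


def normalize_integer_cell(raw, column_values, confusables=CONFUSABLES):
    """Return a provenance record for one cell of an integer column.

    column_values: the clean integers already seen in the same column;
    their min/max serve as the plausibility range (an assumption here).
    """
    lo, hi = min(column_values), max(column_values)
    record = {
        "raw_cell": raw,
        "normalized_cell": raw,  # stays raw unless exactly one repair fits
        "column_type": "integer",
        "column_range": {"min": lo, "max": hi, "n": len(column_values)},
        "reasons": ["column_type=integer"],
        "ambiguous": False,
    }
    if not raw or all(ch in DIGITS for ch in raw):
        return record  # empty or already clean: nothing to repair

    # Every character must be a digit or a known confusable, else refuse.
    choices = []
    for ch in raw:
        if ch in DIGITS:
            choices.append((ch,))
        elif ch in confusables:
            choices.append(confusables[ch])
        else:
            record["reasons"].append("unrecognized_glyph: kept raw")
            return record

    candidates = sorted({"".join(p) for p in product(*choices)})
    plausible = [c for c in candidates if lo <= int(c) <= hi]
    if len(plausible) == 1:
        record["normalized_cell"] = plausible[0]
        record["reasons"].append(f"in_range[{lo},{hi}]")
        record["reasons"].append(
            f"confusable_substitution: {raw!r} → {plausible[0]!r}")
    elif plausible:
        record["ambiguous"] = True  # several readings fit: keep raw
        record["reasons"].append(f"ambiguous_candidates: {plausible}")
    else:
        record["reasons"].append("no_candidate_in_range: kept raw")
    return record
```

Called as `normalize_integer_cell("I47", [84, 310, 1648])`, the sketch returns a record whose `normalized_cell` is `147`, while a cell such as `oe)` comes back unchanged. The column values in that call are made up for the example.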
How the inputs are generated
Generation · 01
evaluation_runs/generate_table_normalized_variant.py (module: evaluation_runs/table_normalizer/)
- pipeline-v0.6.1 csv-only base
- Docling table CSVs
- Curated authority files (US federal agencies, ISO 3166 countries, months)
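To make the role of the curated authority files concrete, here is a minimal sketch of confusable-aware Levenshtein matching. The reduced substitution cost (0.25), the acceptance threshold (0.5), the tie-refusal rule, and the names `confusable_distance` and `match_authority` are all assumptions chosen for illustration; the actual costs and list contents used by the normalizer are not stated in this document.

```python
# The six confusable pairs named in the design, stored as unordered pairs.
CONFUSABLE_PAIRS = {frozenset(p) for p in ("I1", "O0", "B8", "S5", "G6", "Z2")}


def sub_cost(a, b, confusable_cost=0.25):
    """Substitution cost: free if equal, cheap if visually confusable."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE_PAIRS:
        return confusable_cost  # assumed value, not from the source
    return 1.0


def confusable_distance(s, t):
    """Levenshtein distance with discounted confusable substitutions."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [float(i)]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1.0,                  # deletion
                           cur[j - 1] + 1.0,               # insertion
                           prev[j - 1] + sub_cost(a, b)))  # substitution
        prev = cur
    return prev[-1]


def match_authority(raw, authorities, max_distance=0.5):
    """Return the single closest authority entry, or None to refuse."""
    scored = sorted((confusable_distance(raw, name), name)
                    for name in authorities)
    if not scored or scored[0][0] > max_distance:
        return None  # nothing close enough: keep the raw cell
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # tie between entries: refuse rather than guess
    return scored[0][1]
```

With an illustrative list such as `["NOAA", "NASA", "USGS"]`, the cell `N0AA` is one confusable substitution away from `NOAA` and is matched, whereas a string with no near neighbour returns `None` and would be left raw.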
card_sets/pipeline-v0.7-table-normalized/
Evaluation results
Diagnostic · 02
Questions this variant addresses
Coverage · 03
- Q-NAT-015 V35 rainfall: column header `my 1954` → `1954`, January `8I` → `81`, February long-term `I47` → `147`; the ground-truth answer cell `184` (Feb 1954) is left byte-identical.
Example transformation
Sample · 04
Side artifact (`table_039.normalized.jsonl`):
```json
{
  "raw_cell": "I47",
  "normalized_cell": "147",
  "column_type": "integer",
  "column_range": {"min": 84.0, "max": 1648.0, "n": 12},
  "reasons": [
    "column_type=integer",
    "in_range[84,1648]",
    "confusable_substitution: 'I47' → '147'"
  ],
  "confidence": 0.97,
  "ambiguous": false
}
```
Caveats and known limitations
Scope · 05
- Visual-confusable substitution is bounded to character-recognition errors that have a small fixed alphabet. Errors caused by table-structure detection failures (bbox-level mistakes that put data in the wrong cell) are out of scope.
- Cycle 29 lesson: global formatting normalization is unsafe. Even byte-level changes to em-dashes can alter local-model output. The refined cycle 30 build enforces byte-identity for tables the normalizer does not touch.
- The normalizer is deliberately conservative: when a confusable substitution yields multiple plausible candidates in range, the cell is marked ambiguous and kept raw. 0 ambiguous cells were generated on the current corpus.
- Genuinely illegible cells (e.g. V35 table_039 July long-term cell reads `oe)`) are left raw, not guessed. Refusal beats fabrication.
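The byte-identity guarantee for untouched tables can be checked mechanically. The sketch below is one possible way to do it, not the cycle 30 build's actual check: the function name `check_untouched_tables`, the in-memory dictionaries keyed by table name, and the use of SHA-256 digests are assumptions for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the exact bytes; convenient to log in a manifest."""
    return hashlib.sha256(data).hexdigest()


def check_untouched_tables(before: dict[str, bytes],
                           after: dict[str, bytes],
                           touched: set[str]) -> list[str]:
    """Return names of tables that changed although no cell was repaired.

    before/after map a table name to its serialized bytes; touched is the
    set of table names in which the normalizer repaired at least one cell.
    """
    violations = []
    for name, raw in before.items():
        if name in touched:
            continue  # repaired tables are allowed to differ
        if name not in after or sha256_hex(after[name]) != sha256_hex(raw):
            violations.append(name)
    return sorted(violations)
```

An empty result means every table outside `touched` is byte-for-byte unchanged. A single rewritten em-dash in an untouched table is enough to put that table in the returned list.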
Related variants
Cross-reference · 06
- CSV-only card (per-table card variant): Table data rendered as raw CSV inside a Markdown code block. The most-effective open-tier variant.
- CSV-only with deterministic header normalization (per-table card variant): CSV cards with column headers cleaned by deterministic rules (no LLM). Addresses cycle-17 CELL_READ_ERROR failures from malformed Docling headers.