GovTools
Per-table card variant · Experimental

Evidence-Preserving Table Normalization Layer

Five-layer deterministic normalizer (OCR-confusable detection, column-type inference, table-context correction, authority matching, provenance-preserving output) with per-cell provenance. Repairs 442 cells across 123 cards in the V27/V35 corpus; 0 cells touched on born-digital NOAA.

In one paragraph

Implements the five-layer table-semantic normalization design: (1) visual-confusable detection (I↔1, O↔0, B↔8, S↔5, G↔6, Z↔2: the cybersecurity-relevant subset that is also the systematic OCR error set); (2) column-type inference (integer, decimal, percentage, year, date, month, agency_code, country, species_name); (3) table-context correction using the column-type prior, numeric range plausibility, and ambiguity refusal; (4) authority matching against curated lists of US federal agencies and ISO 3166 countries, using a confusable-aware Levenshtein distance; (5) confidence-preserving output: the raw extraction is always preserved, alongside a `table.normalized.jsonl` side artifact carrying per-cell provenance (raw_cell, normalized_cell, column_type, column_range, reasons, confidence, ambiguous). No LLM is in the loop, so there is no risk of model hallucination. Designed for archival infrastructure where the raw extraction must never be silently overwritten.
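A minimal sketch of how layers 1 to 3 could combine for a single cell in an integer-typed column. This is an illustration under stated assumptions, not the project's module: the function name `normalize_integer_cell`, the record fields it fills, and the refusal wording are hypothetical. Only the six confusable pairs and the accept-only-when-unambiguous rule come from the description above.

```python
from itertools import product

DIGITS = "0123456789"

# The six confusable pairs named above, in the letter -> digit direction.
# Each value is a tuple so a map with several candidates per character also works.
CONFUSABLES = {"I": ("1",), "O": ("0",), "B": ("8",),
               "S": ("5",), "G": ("6",), "Z": ("2",)}


def normalize_integer_cell(raw, lo, hi, confusables=CONFUSABLES):
    """Return a provenance record for one cell of an integer-typed column.

    lo/hi is the plausible numeric range inferred from the rest of the column.
    The raw value is always kept; normalized_cell only changes when exactly
    one in-range candidate exists. (Hypothetical helper, for illustration.)
    """
    record = {"raw_cell": raw, "normalized_cell": raw,
              "reasons": [], "ambiguous": False}
    text = raw.strip()
    if not text or all(ch in DIGITS for ch in text):
        return record  # empty or already a clean integer: leave untouched

    # Every character must be a digit or a known confusable, otherwise refuse.
    options = []
    for ch in text:
        if ch in DIGITS:
            options.append((ch,))
        elif ch in confusables:
            options.append(confusables[ch])
        else:
            record["reasons"].append("refused: unrecognized character")
            return record

    candidates = sorted({"".join(p) for p in product(*options)})
    in_range = [c for c in candidates if lo <= int(c) <= hi]
    if len(in_range) == 1:
        record["normalized_cell"] = in_range[0]
        record["reasons"].append(
            f"confusable_substitution: {raw!r} -> {in_range[0]!r}")
    elif len(in_range) > 1:
        record["ambiguous"] = True  # several plausible readings: keep raw
        record["reasons"].append("refused: multiple in-range candidates")
    else:
        record["reasons"].append("refused: no candidate in range")
    return record


print(normalize_integer_cell("I47", 84, 1648)["normalized_cell"])  # 147
print(normalize_integer_cell("oe)", 84, 1648)["normalized_cell"])  # oe)
```

With the six one-to-one pairs, a cell has at most one candidate, so the ambiguous branch only fires when a wider (hypothetical) confusable map is passed in.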

How the inputs are generated

Generation · 01
Generator script
evaluation_runs/generate_table_normalized_variant.py (module: evaluation_runs/table_normalizer/)
Input sources
  • pipeline-v0.6.1 csv-only base
  • Docling table CSVs
  • Curated authority files (US federal agencies, ISO 3166 countries, months)
AI use
No — pure deterministic transformation
OCR / re-OCR
Inherits from the upstream pipeline variant
Approximate processing time
~6 seconds for all 407 cards.
Resource intensity
Low — CPU-only post-processing, runs in seconds
Determinism
Deterministic (same input → same output, byte-identical)
Output location
card_sets/pipeline-v0.7-table-normalized/
Cards produced
407 cards + 123 .normalized.jsonl side artifacts
Introduced
v0.7 module, 2026-05-23 — full design doc at docs/table-semantic-normalizer.md in the project repository.
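One way the curated authority files could be matched with a confusable-aware Levenshtein distance is sketched below. The reduced substitution cost (0.25), the `max_distance` threshold, the tie-refusal rule, and the three-entry agency list are illustrative assumptions, not values taken from the project.

```python
# Unordered confusable pairs from the normalizer description.
CONFUSABLE_PAIRS = {frozenset(p) for p in ("I1", "O0", "B8", "S5", "G6", "Z2")}


def sub_cost(a, b, confusable_cost=0.25):
    """Substitution cost: free if equal, cheap if visually confusable."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE_PAIRS:
        return confusable_cost  # assumed discount, not a project constant
    return 1.0


def confusable_levenshtein(s, t):
    """Row-by-row Levenshtein DP with discounted confusable substitutions."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1.0,                 # deletion
                            curr[j - 1] + 1.0,             # insertion
                            prev[j - 1] + sub_cost(a, b)))  # substitution
        prev = curr
    return prev[-1]


def match_authority(cell, authorities, max_distance=0.5):
    """Return the single closest authority entry, or None to refuse."""
    scored = sorted((confusable_levenshtein(cell.upper(), name.upper()), name)
                    for name in authorities)
    if not scored or scored[0][0] > max_distance:
        return None  # nothing close enough
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # tie between two entries: keep the raw cell
    return scored[0][1]


AGENCIES = ["NOAA", "NASA", "USGS"]  # tiny illustrative list
print(match_authority("N0AA", AGENCIES))  # NOAA
print(match_authority("XYZQ", AGENCIES))  # None
```

Discounting only the confusable pairs keeps an OCR slip such as `N0AA` close to `NOAA` while an ordinary one-letter difference still costs a full edit.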

Evaluation results

Diagnostic · 02
Best open-model score
10/13 (Qwen2.5-7B, Granite-3.3-8B, Llama-3 8B)
Avg open-tier pass rate
~55% (unchanged net vs csv-only)
Typical card size
~1.5 KB per card (same as csv-only — character substitution is byte-neutral)
Evaluation cycle
Cycle 30
Relative to v0.6.1 baseline
Net zero on the 8-model panel, but a real distribution shift: mid-tier open models (Llama, Apertus, Gemma-2, Mistral) each gained one cell while the top two open models (Qwen-7B, Granite-8B) each lost one. Frontier (grok-4) flat. ClimateGPT-13B regressed two cells. Hypothesised mechanism: the strongest open models partly use OCR noise as a column-identification signal; the normalizer removes that signal, helping weaker models and slightly hurting stronger ones. 442 cells repaired across 123 cards (V27 166, V35 276, NOAA 0). Byte-identity guard verified zero noise on the 285 untouched tables.
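A byte-identity guard of the kind mentioned above can be sketched as a hash comparison between the baseline and variant card directories. The function name, the flat directory layout, and the `touched` set of file names are assumptions made for illustration; the actual guard may be organised differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def byte_identity_violations(baseline_dir, variant_dir, touched):
    """Names of untouched files whose bytes differ between the two directories.

    `touched` holds the file names the normalizer is allowed to change.
    A file that is missing from the variant directory also counts as a violation.
    """
    violations = []
    for base in sorted(Path(baseline_dir).iterdir()):
        if not base.is_file() or base.name in touched:
            continue
        other = Path(variant_dir) / base.name
        if not other.is_file() or sha256_of(base) != sha256_of(other):
            violations.append(base.name)
    return violations
```

An empty return value means every card outside the touched set is byte-for-byte identical to its baseline copy.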

Questions this variant addresses

Coverage · 03
  • Q-NAT-015 V35 rainfall — column header `my 1954` → `1954`, January `8I` → `81`, February long-term `I47` → `147`; the ground-truth answer cell `184` (Feb 1954) is byte-identical

Example transformation

Sample · 04
Side artifact (`table_039.normalized.jsonl`):

```json
{
  "raw_cell": "I47",
  "normalized_cell": "147",
  "column_type": "integer",
  "column_range": {"min": 84.0, "max": 1648.0, "n": 12},
  "reasons": [
    "column_type=integer",
    "in_range[84,1648]",
    "confusable_substitution: 'I47' → '147'"
  ],
  "confidence": 0.97,
  "ambiguous": false
}
```
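A short sketch of how a consumer might read such a side artifact. It assumes the file stores one JSON object per line (the record above is pretty-printed for readability) and uses a hypothetical `min_confidence` threshold; neither the helper names nor the threshold come from the project.

```python
import json


def parse_side_artifact(lines):
    """Parse JSONL text lines into provenance records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def accepted_repairs(records, min_confidence=0.9):
    """(raw, normalized) pairs that changed, are unambiguous, and are confident."""
    return [(r["raw_cell"], r["normalized_cell"])
            for r in records
            if r["normalized_cell"] != r["raw_cell"]
            and not r["ambiguous"]
            and r["confidence"] >= min_confidence]


# The record shown above, serialized to a single JSONL line.
SAMPLE_LINE = json.dumps({
    "raw_cell": "I47",
    "normalized_cell": "147",
    "column_type": "integer",
    "column_range": {"min": 84.0, "max": 1648.0, "n": 12},
    "reasons": ["column_type=integer", "in_range[84,1648]",
                "confusable_substitution: 'I47' → '147'"],
    "confidence": 0.97,
    "ambiguous": False,
})

records = parse_side_artifact([SAMPLE_LINE, ""])
print(accepted_repairs(records))  # [('I47', '147')]
```

Because the raw value travels with every record, a downstream reader can raise the threshold or ignore the normalized values entirely without re-running the pipeline.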

Caveats and known limitations

Scope · 05
  • Visual-confusable substitution is limited to character-recognition errors drawn from a small, fixed alphabet. Errors caused by table-structure detection failures (bbox-level mistakes that put data in the wrong cell) are out of scope.
  • Cycle 29 lesson: global formatting normalization is unsafe — even byte-level changes to em-dashes can alter local-model output. The refined cycle 30 build enforces byte-identity for tables the normalizer does not touch.
  • The normalizer is deliberately conservative — when a confusable substitution yields multiple plausible candidates in range, the cell is marked ambiguous and kept raw. 0 ambiguous cells were generated on the current corpus.
  • Genuinely illegible cells (e.g. V35 table_039 July long-term cell reads `oe)`) are left raw, not guessed. Refusal beats fabrication.

Related variants

Cross-reference · 06