GovTools
Document-level map structure · Experimental

Enriched document-index map

Per-document TOC with column headers + sample row labels + auto-detected scope. Designed to close the retrieval gap on tables with missing or uninformative captions.

In one paragraph

Builds on the plain doc-index by adding three signals per TOC entry: cleaned column headers (joined with `/`), the first 2-3 row labels (often year, month, or category names), and deterministic scope detection (year ranges such as `y:1925-37`, monthly coverage, depth ranges, and geographic markers such as `geo:plymouth`). All signals are derived from the raw CSV with regexes; no LLM is involved in the index build. The variant targets the largest retrieval failure modes from cycle 31: tables whose caption is `'No caption detected.'` (Q-NAT-INT-001 phosphorus depth, Q-NAT-006 fish-catch; 0% retrieval in cycle 31), and tables whose existing caption is misleading (Q-NAT-012 cyprid larvae, where 5/8 models picked `table_124` because its mislabeled caption said 'cyprid'). The index is compact-encoded so that V27/V35 fit a 32K-context model and NOAA fits a 4K-context model.
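The three signals can be illustrated with a minimal sketch. This is not the code in `generate_doc_index_enriched.py`; the function names, the ` | ` field separator, and the year regex are assumptions made for illustration, and only year-range scope is shown (monthly, depth, and geographic scope are omitted).

```python
import csv
import io
import re

# Assumed pattern: four-digit years from 1800 to 2099.
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def clean_header(header):
    """Collapse runs of whitespace and trim a raw column header."""
    return re.sub(r"\s+", " ", header).strip()


def detect_year_scope(labels):
    """Return a scope tag such as 'y:1925-37', or None if no years are found."""
    years = sorted({int(y) for label in labels for y in YEAR_RE.findall(label)})
    if not years:
        return None
    lo, hi = years[0], years[-1]
    if lo == hi:
        return f"y:{lo}"
    # Abbreviate the end year when both years share a century.
    hi_txt = str(hi)[2:] if lo // 100 == hi // 100 else str(hi)
    return f"y:{lo}-{hi_txt}"


def enrich_entry(table_id, csv_text, max_labels=3):
    """Build one hypothetical enriched TOC entry from raw table CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return table_id
    headers = [clean_header(h) for h in rows[0] if clean_header(h)]
    labels = [r[0].strip() for r in rows[1:] if r and r[0].strip()]
    parts = [table_id, "/".join(headers)]
    if labels:
        parts.append(",".join(labels[:max_labels]))
    scope = detect_year_scope(labels)
    if scope:
        parts.append(scope)
    return " | ".join(parts)


sample = "Year,Catch  (cwt),Value\n1925,120,30\n1926,140,35\n1937,90,20\n"
print(enrich_entry("table_007", sample))
```

Because every step is a plain string or regex operation over the CSV, the same input always yields the same entry, which is what makes the build deterministic.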

How the inputs are generated

Generation · 01
Generator script
evaluation_runs/generate_doc_index_enriched.py
Input sources
  • pipeline-v0.6.1 cards (frontmatter, caption)
  • Raw Docling table CSVs (column headers, row labels, sample values)
AI use
No — pure deterministic transformation
OCR / re-OCR
Inherits from the upstream pipeline variant
Approximate processing time
~3 seconds for all 3 documents (407 tables aggregated).
Resource intensity
Low — CPU-only post-processing, runs in seconds
Determinism
Deterministic (same input → same output, byte-identical)
Output location
card_sets/pipeline-v0.7-doc-index-enriched/
Cards produced
3 maps (one per document)
Introduced
v0.7 retrieval response to cycle 31 findings, 2026-05-23.
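
The byte-identical claim can be checked by hashing the output directory after two separate runs. The sketch below is an assumption about how such a check might look, not part of the generator; `digest_dir` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def digest_dir(root):
    """Return one SHA-256 digest over every file under root, in sorted order."""
    root = Path(root)
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Hash the relative path as well, so renamed files change the digest.
        h.update(path.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()
```

Running the generator twice into separate directories and comparing `digest_dir(run_a) == digest_dir(run_b)` would confirm that the two outputs match byte for byte.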

Evaluation results

Diagnostic · 02
Typical card size
V27: 24 KB | V35: 27 KB | NOAA: 5 KB. V27/V35 need more than a 4K context (Ollama num_ctx ≥ 16K, per the caveats below); NOAA fits in 4K.
Evaluation cycle
Cycle 32
Relative to v0.6.1 baseline
Evaluated in cycle 32 via M3-IDX two-shot mode. Compared with the cycle 31 plain doc-index, the enriched index surfaces column headers and scope signals that let models match question keywords to table structure. For example, a question about 'phosphorus at 35 m depth' can now match a table whose column headers include `Depth in m.` and `Organic`, even when the caption is missing.

Caveats and known limitations

Scope · 05
  • Index size is 24-27 KB for V27/V35 — requires Ollama num_ctx ≥ 16K (set OLLAMA_NUM_CTX=32768 for the harness). ClimateGPT-13B has a native 4K context cap so it can only use this variant on NOAA (5 KB).
  • Index enrichment cannot fix pipeline mislabel cases: the table_124-vs-125 problem (caption swap) needs a pipeline-side correction.
  • OCR-garbled column headers propagate to the index. Tables where Docling read column 0 as `i i i Depth in m.` will surface that garble in the enriched entry — readable to humans, partially readable to models.
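
The context-size caveat can be made concrete with a small pre-flight check. This sketch assumes the harness reads `OLLAMA_NUM_CTX` from the environment and passes it to Ollama as `options.num_ctx`; the helper names and the 3-bytes-per-token estimate are illustrative assumptions, not measured values.

```python
import os


def estimated_tokens(text, bytes_per_token=3.0):
    """Rough token estimate; the bytes-per-token ratio is an assumed heuristic."""
    return int(len(text.encode("utf-8")) / bytes_per_token) + 1


def build_payload(model, index_text, question, default_ctx=32768):
    """Build an Ollama generate request body, refusing prompts that will not fit."""
    num_ctx = int(os.environ.get("OLLAMA_NUM_CTX", default_ctx))
    prompt = f"{index_text}\n\nQuestion: {question}"
    needed = estimated_tokens(prompt)
    if needed > num_ctx:
        raise ValueError(
            f"prompt needs about {needed} tokens but num_ctx is {num_ctx}"
        )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
```

Under these assumptions a 5 KB NOAA index passes with `OLLAMA_NUM_CTX=4096`, while a 24 KB V27 index is rejected until the window is raised. Raising `num_ctx` does not help a model whose native context is capped at 4K.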
