Enriched document-index map
Per-document TOC with column headers + sample row labels + auto-detected scope. Designed to close the retrieval gap on tables with missing or uninformative captions.
Builds on the plain doc-index by adding three signals per TOC entry: cleaned column headers (joined with `/`), the first 2-3 row labels (often year, month, or category names), and deterministic scope detection (year range like `y:1925-37`, monthly coverage, depth range, geographic markers like `geo:plymouth`). All signals derive from the raw CSV plus regex; no LLM is involved in the index build.
Designed for cycle 31's three biggest retrieval failure modes, including: tables with caption `'No caption detected.'` (Q-NAT-INT-001 phosphorus depth, Q-NAT-006 fish-catch; 0% retrieval in cycle 31) and tables where the existing caption is misleading (Q-NAT-012 cyprid larvae, where 5/8 models picked `table_124` because its mis-labeled caption said 'cyprid').
Compact-encoded so V27/V35 fit a 32K-context model and NOAA fits a 4K-context model.
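A minimal sketch of how the three signals could be derived from one table CSV, using only the standard library. The function names, the ` | ` entry separator, and the choice to detect scope from row labels alone are assumptions made for illustration, not the behavior of the actual generator script. Depth-range and `geo:` detection are omitted.

```python
import csv
import io
import re

# Four-digit years from 1600 to 2099 (illustrative range, an assumption).
YEAR_RE = re.compile(r"\b(1[6-9]\d{2}|20\d{2})\b")
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def clean_header(cell):
    """Collapse whitespace and strip surrounding punctuation from a header cell."""
    return re.sub(r"\s+", " ", cell).strip(" .:;")


def detect_scope(labels):
    """Return deterministic scope tags such as 'y:1925-37' or 'monthly'."""
    tags = []
    years = sorted({int(y) for label in labels for y in YEAR_RE.findall(label)})
    if years:
        lo, hi = years[0], years[-1]
        if lo == hi:
            tags.append(f"y:{lo}")
        elif lo // 100 == hi // 100:
            # Same century: abbreviate the end year to two digits.
            tags.append(f"y:{lo}-{hi % 100:02d}")
        else:
            tags.append(f"y:{lo}-{hi}")
    month_hits = {m for label in labels for m in MONTHS
                  if label.lower().startswith(m)}
    if len(month_hits) >= 6:  # threshold is an arbitrary illustrative choice
        tags.append("monthly")
    return tags


def enrich_entry(table_id, csv_text, n_labels=3):
    """Build one enriched index line: id, headers, sample row labels, scope."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    if not rows:
        return table_id
    headers = [clean_header(c) for c in rows[0] if c.strip()]
    labels = [r[0].strip() for r in rows[1:] if r[0].strip()]
    parts = [table_id, "/".join(headers), ",".join(labels[:n_labels])]
    parts.extend(detect_scope(labels))
    return " | ".join(parts)
```

For a table with header `Year,Catch (cwt),Effort` and row labels 1925, 1926, 1937, `enrich_entry("table_007", ...)` yields `table_007 | Year/Catch (cwt)/Effort | 1925,1926,1937 | y:1925-37`. Because every step is a regex or a string operation, the same CSV always produces the same entry.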
How the inputs are generated
Generation · 01
evaluation_runs/generate_doc_index_enriched.py
- pipeline-v0.6.1 cards (frontmatter, caption)
- Raw Docling table CSVs (column headers, row labels, sample values)
card_sets/pipeline-v0.7-doc-index-enriched/
Evaluation results
Diagnostic · 02
Caveats and known limitations
Scope · 05
- Index size is 24-27 KB for V27/V35, which requires Ollama num_ctx ≥ 16K (set OLLAMA_NUM_CTX=32768 for the harness). ClimateGPT-13B has a native 4K context cap, so it can only use this variant on NOAA (5 KB).
- Still cannot fix pipeline mis-label cases: the table_124-vs-125 problem (caption swap) needs pipeline-side correction, not index enrichment.
- OCR-garbled column headers propagate to the index. Tables where Docling read column 0 as `i i i Depth in m.` will surface that garble in the enriched entry, which is readable to humans but only partially readable to models.
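The context-size caveat above can be checked before a run with a rough budget estimate. This is a sketch under stated assumptions: the 3 bytes-per-token ratio and the 1024-token reserve for the question and answer are guesses chosen for illustration, not measured values, and `example-model` is a placeholder name. The `options.num_ctx` field is the per-request context setting in Ollama's generate API.

```python
import math


def estimate_tokens(index_bytes, bytes_per_token=3.0):
    """Rough token estimate; the ratio is an assumption, not a measurement."""
    return math.ceil(index_bytes / bytes_per_token)


def fits_context(index_bytes, num_ctx, bytes_per_token=3.0, reserve_tokens=1024):
    """True if the index plus a reserve for question and answer fits num_ctx."""
    return estimate_tokens(index_bytes, bytes_per_token) + reserve_tokens <= num_ctx


def build_request(model, prompt, num_ctx):
    """Request body for Ollama's generate endpoint with an explicit context size."""
    return {"model": model, "prompt": prompt, "options": {"num_ctx": num_ctx}}


# A 27 KB index against a few context sizes, and the 5 KB NOAA index at 4K.
for label, size, ctx in [("V27/V35", 27 * 1024, 4096),
                         ("V27/V35", 27 * 1024, 16384),
                         ("NOAA", 5 * 1024, 4096)]:
    print(label, ctx, fits_context(size, ctx))

payload = build_request("example-model", "index text goes here", 32768)
```

Under these assumed ratios the 27 KB index does not fit 4K but fits 16K, and the 5 KB index fits 4K, which matches the limits listed above. A real check would count tokens with the target model's own tokenizer.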
Related variants
Cross-reference · 06
- Document-level map structure: Document table-of-contents map. One card per document listing every detected table with caption, page, and dimensions. Enables two-shot retrieval.
- Evaluation mode: M3-IDX, Two-shot retrieval mode. Model picks a table from a per-document index, then receives that table. Tests retrieval + reading together; isolates the cost of removing the oracle.
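The two-shot flow described for M3-IDX can be sketched as two model calls. Everything here is illustrative: `ask_model` is a hypothetical callable that takes a prompt string and returns the model's reply, the prompt wording is invented, and the `table_<n>` id pattern is assumed from the ids quoted on this page. It is not the harness's actual implementation.

```python
import re


def two_shot_answer(question, index_text, tables, ask_model):
    """Shot 1: pick a table id from the index. Shot 2: answer from that table.

    `tables` maps table ids to their CSV text. Returns (table_id, answer),
    or (None, None) when the reply names no table that exists.
    """
    pick_prompt = (
        "Document index:\n" + index_text + "\n\n"
        "Question: " + question + "\n"
        "Reply with the single table id most likely to contain the answer."
    )
    reply = ask_model(pick_prompt)
    match = re.search(r"table_\d+", reply)
    if match is None or match.group(0) not in tables:
        return None, None  # retrieval failed; there is no oracle to fall back on
    table_id = match.group(0)
    read_prompt = (
        "Table " + table_id + ":\n" + tables[table_id] + "\n\n"
        "Question: " + question + "\n"
        "Answer using only this table."
    )
    return table_id, ask_model(read_prompt)
```

Returning the picked id alongside the answer lets an evaluator score retrieval and reading separately, so a wrong answer can be attributed to the wrong table or to a misread of the right one.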