Document-level map structure · Experimental

Enriched document-index map

Per-document TOC with column headers + sample row labels + auto-detected scope. Designed to close the retrieval gap on tables with missing or uninformative captions.

In one paragraph

Builds on the plain doc-index by adding three signals to each TOC entry: cleaned column headers (joined with `/`), the first 2-3 row labels (often year, month, or category names), and deterministic scope detection (a year range such as `y:1925-37`, monthly coverage, depth range, or geographic markers such as `geo:plymouth`). All signals are derived from the raw CSV with regexes; no LLM is involved in the index build. The variant targets the biggest retrieval failure modes seen in cycle 31: tables whose caption is `'No caption detected.'` (Q-NAT-INT-001 phosphorus depth and Q-NAT-006 fish-catch, both at 0% retrieval in cycle 31), and tables whose existing caption is misleading (Q-NAT-012 cyprid larvae, where 5/8 models picked `table_124` because its mislabeled caption said 'cyprid'). The index is compact-encoded so that V27/V35 fit a 32K-context model and NOAA fits a 4K-context model.
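A minimal sketch of how one enriched entry could be assembled from a raw table CSV. It is not the code in `generate_doc_index_enriched.py`: the function names, the ` | ` field separator, the year regex, and the sample CSV are all assumptions made for illustration. Only the `/`-joined headers, the first few row labels, and the `y:1925-37` scope style come from the description above.

```python
import csv
import io
import re

# Assumption: a "year" is any standalone 4-digit number from 1800 to 2099.
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def clean_header(header):
    """Collapse runs of whitespace and trim the header cell."""
    return re.sub(r"\s+", " ", header).strip()


def detect_year_scope(cells):
    """Return a scope tag such as 'y:1925-37', or None if no years are found."""
    years = sorted({int(y) for cell in cells for y in YEAR_RE.findall(cell)})
    if not years:
        return None
    lo, hi = years[0], years[-1]
    if lo == hi:
        return f"y:{lo}"
    # Abbreviate the end year when both years share a century.
    end = str(hi)[2:] if lo // 100 == hi // 100 else str(hi)
    return f"y:{lo}-{end}"


def enrich(table_id, csv_text, n_labels=3):
    """Build one index line: id | headers | sample row labels | scope."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    headers = [clean_header(h) for h in rows[0] if h.strip()]
    labels = [r[0].strip() for r in rows[1:] if r and r[0].strip()]
    parts = [table_id, "/".join(headers), ",".join(labels[:n_labels])]
    scope = detect_year_scope(labels)
    if scope:
        parts.append(scope)
    return " | ".join(parts)


# Made-up sample table, not taken from any of the three documents.
sample_csv = "Year,Catch (cwt),Effort\n1925,120,14\n1926,98,12\n1937,143,15\n"
entry = enrich("table_007", sample_csv)
print(entry)
```

Because every step is a string operation or a regex, the same CSV always yields the same entry, which is the property the index build relies on.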

How the inputs are generated

Generation · 01

Generator script

evaluation_runs/generate_doc_index_enriched.py

Input sources

• pipeline-v0.6.1 cards (frontmatter, caption)
• Raw Docling table CSVs (column headers, row labels, sample values)

AI use

No — pure deterministic transformation

OCR / re-OCR

Inherits from the upstream pipeline variant

Approximate processing time

~3 seconds for all 3 documents (407 tables aggregated).

Resource intensity

Low — CPU-only post-processing, runs in seconds

Determinism

Deterministic (same input → same output, byte-identical)

Output location

card_sets/pipeline-v0.7-doc-index-enriched/

Cards produced

3 maps (one per document)

Introduced

v0.7 retrieval response to cycle 31 findings, 2026-05-23.
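The byte-identical determinism claim above can be checked by running the generator twice into separate directories and comparing a digest of each output tree. The helper below is a hypothetical sketch (the name `digest_dir` and the two-directory workflow are assumptions, not part of the evaluation harness).

```python
import hashlib
from pathlib import Path


def digest_dir(root):
    """SHA-256 over every file's relative path and bytes, in sorted order."""
    root = Path(root)
    h = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h.update(path.relative_to(root).as_posix().encode("utf-8"))
            h.update(b"\0")  # separator between the path and the file content
            h.update(path.read_bytes())
    return h.hexdigest()
```

Two runs over the same inputs should then satisfy `digest_dir(run_a) == digest_dir(run_b)`; any difference points to a nondeterministic step such as unordered iteration or an embedded timestamp.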

Evaluation results

Diagnostic · 02

Typical card size

V27: 24 KB | V35: 27 KB | NOAA: 5 KB. V27/V35 need a context window of at least 16K (the harness is run with 32K); NOAA fits in 4K.

Evaluation cycle

Cycle 32

Relative to v0.6.1 baseline

Evaluated in cycle 32 via M3-IDX two-shot mode. Compared with the cycle 31 plain doc-index, the enriched index surfaces column headers and scope signals that let models match question keywords to table structure. For example, a question about 'phosphorus at 35 m depth' can now match a table whose column headers include `Depth in m.` and `Organic` even when the caption is missing.
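To make the matching argument concrete, here is a toy lexical-overlap scorer. It is only a stand-in: the evaluated models do the matching themselves, and nothing here reproduces M3-IDX. The table IDs and entry strings are hypothetical; only the column headers `Depth in m.` and `Organic` and the `'No caption detected.'` caption come from the text above.

```python
import re


def tokens(text):
    """Lowercase alphabetic tokens of at least three letters."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))


def overlap(question, entry):
    """Number of tokens shared by the question and an index entry."""
    return len(tokens(question) & tokens(entry))


question = "phosphorus at 35m depth"

# Plain index: the only text available is the missing-caption placeholder.
plain_entry = "table_A | No caption detected."
# Enriched index: column headers are surfaced alongside the table id.
enriched_entry = "table_A | Depth in m./Organic/Inorganic | 0,5,10"
other_entry = "table_B | Year/Catch | 1925,1926,1927"

plain_score = overlap(question, plain_entry)
enriched_score = overlap(question, enriched_entry)
other_score = overlap(question, other_entry)
print(plain_score, enriched_score, other_score)
```

With the plain entry nothing in the question overlaps the index text, while the enriched entry shares the `depth` token and so becomes distinguishable from unrelated tables.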

Caveats and known limitations

Scope · 05

• Index size is 24-27 KB for V27/V35 — requires Ollama num_ctx ≥ 16K (set OLLAMA_NUM_CTX=32768 for the harness). ClimateGPT-13B has a native 4K context cap so it can only use this variant on NOAA (5 KB).
• Still cannot fix pipeline mislabel cases: the `table_124` vs `table_125` caption swap needs a pipeline-side correction, not index enrichment.
• OCR-garbled column headers propagate to the index. Tables where Docling read column 0 as `i i i Depth in m.` will surface that garble in the enriched entry — readable to humans, partially readable to models.
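The context-size caveat above can be sanity-checked with a rough token estimate. This is a back-of-the-envelope sketch, not a tokenizer: the bytes-per-token ratios and the 1024-token reserve for the question and the answer are assumptions, and real counts depend on the model's tokenizer. Digit-heavy index text often tokenizes less efficiently than prose, which is why a lower ratio is also tried.

```python
def estimated_tokens(size_bytes, bytes_per_token=4.0):
    """Rough token count from a byte size, using an assumed ratio."""
    return int(size_bytes / bytes_per_token)


def fits(size_kb, num_ctx, reserve_tokens=1024, bytes_per_token=4.0):
    """True if the index plus a reserved prompt/answer budget fits num_ctx."""
    needed = estimated_tokens(size_kb * 1024, bytes_per_token) + reserve_tokens
    return needed <= num_ctx


# Card sizes from this page: V27 24 KB, V35 27 KB, NOAA 5 KB.
print(fits(5, 4096))                          # NOAA in a 4K context
print(fits(27, 4096))                         # V35 in a 4K context
print(fits(27, 8192, bytes_per_token=3.0))    # V35 in 8K, pessimistic ratio
print(fits(27, 16384, bytes_per_token=3.0))   # V35 in 16K, pessimistic ratio
```

Under these assumptions NOAA fits a 4K window, V27/V35 do not, and an 8K window becomes marginal once the ratio drops toward 3 bytes per token, which is consistent with the 16K minimum stated in the first bullet.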

Related variants

Cross-reference · 06
