GovTools
Evaluation mode · Experimental

M3-IDX — Two-shot retrieval mode

Model picks a table from a per-document index, then receives that table. Tests retrieval + reading together; isolates the cost of removing the oracle.

In one paragraph

Each question takes two model calls. Call 1 serves the doc-index map (`pipeline-v0.7-doc-index/<sha>/doc-index.map.md`, ~3-18 KB) together with the question, and the model responds with `CHOICE: table_NNN`. Call 2 loads the table card the model selected and serves it under the existing M3-L4 prompt, and the model answers. The verdict is whether that answer passes the scorer. An additional metric, `retrieval_correct`, records whether the model's choice matched the oracle card_id, which separates retrieval failures from reading failures. If the model does not return a parseable choice, the cell is marked as a fail, with no oracle fallback. The mode is designed to expose the model's own ability to find the correct table, which the M3-L4 oracle mode hides.
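The two-call flow can be sketched roughly as follows. This is a minimal illustration, not the code in `cycle_runner.py`: the names `parse_choice`, `run_cell`, `CellResult`, `ask_model`, and `score`, the prompt wording, and the regular expression are all assumptions made for the example. The treatment of a choice that parses but names no existing card (counted as a retrieval failure) is also an assumption; the source only specifies the unparseable case.

```python
import re
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

# Assumed pattern: the source only says the reply contains `CHOICE: table_NNN`.
_CHOICE_RE = re.compile(r"CHOICE:\s*(table_\d+)")


def parse_choice(reply: str) -> Optional[str]:
    """Return the chosen table id, or None if no parseable choice is present."""
    match = _CHOICE_RE.search(reply)
    return match.group(1) if match else None


@dataclass
class CellResult:
    chosen: Optional[str]      # table id from call 1, None if unparseable
    retrieval_correct: bool    # did the choice match the oracle card id?
    passed: bool               # did the call-2 answer pass the scorer?

    @property
    def failure_kind(self) -> str:
        """Attribute a failed cell to retrieval or reading (hypothetical labels)."""
        if self.passed:
            return "none"
        if self.chosen is None:
            return "unparseable_choice"
        return "reading" if self.retrieval_correct else "retrieval"


def run_cell(
    question: str,
    index_map: str,
    cards: Mapping[str, str],
    oracle_card_id: str,
    ask_model: Callable[[str], str],
    score: Callable[[str], bool],
) -> CellResult:
    # Call 1: serve the index map with the question and ask for a table choice.
    reply = ask_model(f"{index_map}\n\nQuestion: {question}\nReply with CHOICE: table_NNN")
    chosen = parse_choice(reply)
    if chosen is None:
        # No oracle fallback: an unparseable choice fails the cell outright.
        return CellResult(chosen=None, retrieval_correct=False, passed=False)
    if chosen not in cards:
        # Assumption: a well-formed but nonexistent table id also fails the cell.
        return CellResult(chosen=chosen, retrieval_correct=False, passed=False)

    # Call 2: serve only the model-selected card. The oracle id is used
    # solely to record retrieval_correct, never to pick the card.
    answer = ask_model(f"{cards[chosen]}\n\nQuestion: {question}")
    return CellResult(chosen, chosen == oracle_card_id, score(answer))
```

Keeping `retrieval_correct` separate from `passed` lets a failed cell be attributed afterwards: a wrong choice points at retrieval, while a correct choice followed by a failing answer points at reading.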

How the inputs are generated

Generation · 01
Generator script
evaluation_runs/cycle_runner.py:run_cell_idx + build_idx_call_1_prompt + build_idx_call_2_prompt + parse_chosen_table_id
Input sources
  • pipeline-v0.7-doc-index variant
  • raw pipeline-v0.6.1 cards for the model-selected table
  • Question registry (harness/core.py QUERIES dict — used only to validate retrieval correctness, NOT to choose the card)
AI use
No — pure deterministic transformation
The harness itself is deterministic and uses no AI to build the inputs. The only AI involved is the model under test, which is called twice per cell; that is what the test measures, not a pipeline choice.
OCR / re-OCR
Inherits from the upstream pipeline variant
Approximate processing time
~20-30 seconds per cell on local 7-8B open models (two model calls per cell). ~60 minutes for 8 models × 13 questions.
Resource intensity
Medium — model inference or moderate I/O
Determinism
Deterministic (same input → same output, byte-identical)
Introduced
Cycle 31, 2026-05-23.
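As a quick sanity check on the processing-time figures above, the per-cell range can be multiplied out over the full grid. The helper name `total_minutes` is hypothetical, and the calculation assumes cells run one after another with no other overhead.

```python
def total_minutes(models: int, questions: int, seconds_per_cell: float) -> float:
    """Wall-clock minutes for a full grid, assuming strictly sequential cells."""
    return models * questions * seconds_per_cell / 60


cells = 8 * 13                    # 104 cells, two model calls each
low = total_minutes(8, 13, 20)    # lower bound of the quoted per-cell range
high = total_minutes(8, 13, 30)   # upper bound of the quoted per-cell range
print(f"{cells} cells: {low:.0f}-{high:.0f} minutes of inference")
```

Pure inference at the quoted per-cell rate comes to roughly 35 to 52 minutes, so the ~60 minute figure presumably includes time not counted per cell. The source does not say what that time is.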

Related variants

Cross-reference · 06