Evaluation mode · Experimental

M3-IDX — Two-shot retrieval mode

The model picks a table from a per-document index, then receives that table. This tests retrieval and reading together, and isolates the cost of removing the oracle.

In one paragraph

Two model calls per question. Call 1 serves the doc-index map (`pipeline-v0.7-doc-index/<sha>/doc-index.map.md`, ~3-18 KB) together with the question; the model responds with `CHOICE: table_NNN`. Call 2 loads the table card the model selected and serves it under the existing M3-L4 prompt; the model answers. The verdict is whether that answer passes the scorer. A second metric, `retrieval_correct`, records whether the model's choice matched the oracle card_id, which separates retrieval failure from reading failure. If the model does not return a parseable choice, the cell is marked fail, with no oracle fallback. The mode is designed to expose the model's own ability to find the correct table, which the M3-L4 oracle mode hides.
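As a minimal sketch of the parsing step in Call 1, the snippet below extracts a `CHOICE: table_NNN` token from a model reply. The function name `parse_choice_sketch`, the regular expression, and the choice to accept any number of digits are assumptions made for illustration; the harness's actual `parse_chosen_table_id` may differ.

```python
import re
from typing import Optional

# Assumed reply format: the text "CHOICE: table_NNN" appears somewhere in the
# reply, where NNN is one or more digits. Matching is case-sensitive here.
_CHOICE_RE = re.compile(r"CHOICE:\s*(table_\d+)\b")


def parse_choice_sketch(reply: str) -> Optional[str]:
    """Return the chosen table id, or None when no parseable choice exists."""
    match = _CHOICE_RE.search(reply)
    return match.group(1) if match else None
```

A `None` result corresponds to the "no parseable choice" case described above, where the cell is marked fail rather than falling back to the oracle.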

How the inputs are generated

Generation · 01

Generator script

evaluation_runs/cycle_runner.py:run_cell_idx + build_idx_call_1_prompt + build_idx_call_2_prompt + parse_chosen_table_id

Input sources

• pipeline-v0.7-doc-index variant
• raw pipeline-v0.6.1 cards for the model-selected table
• Question registry (harness/core.py QUERIES dict — used only to validate retrieval correctness, NOT to choose the card)
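The way these inputs combine can be sketched as follows. This is an illustration under stated assumptions, not the harness's `run_cell_idx`: the names `run_cell_sketch` and `CellResult`, the prompt layout, and the callables `ask_model`, `load_card`, and `score` are all hypothetical. The point it shows is that the oracle id from the question registry is only compared against the model's choice and never selects the card.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CellResult:
    chosen_id: Optional[str]   # table id the model picked, if parseable
    retrieval_correct: bool    # did the pick match the oracle card id?
    passed: bool               # did the final answer pass the scorer?


def run_cell_sketch(
    question: str,
    index_map: str,
    oracle_id: str,
    ask_model: Callable[[str], str],
    load_card: Callable[[str], str],
    score: Callable[[str], bool],
) -> CellResult:
    # Call 1: index map plus question; the model is expected to name a table.
    reply = ask_model(f"{index_map}\n\nQuestion: {question}")
    match = re.search(r"CHOICE:\s*(table_\d+)\b", reply)
    if match is None:
        # No parseable choice: the cell fails, with no oracle fallback.
        return CellResult(None, False, False)
    chosen = match.group(1)

    # Call 2: serve the card the model chose, whether or not it is the oracle's.
    answer = ask_model(f"{load_card(chosen)}\n\nQuestion: {question}")

    # The oracle id is used here only, to record retrieval correctness.
    return CellResult(chosen, chosen == oracle_id, score(answer))
```

Because `retrieval_correct` and `passed` are recorded separately, a cell can fail retrieval yet still pass the scorer, or retrieve correctly and fail at reading. The sketch does not cover a choice that names a nonexistent table, since the source does not say how that case is handled.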

AI use

No — pure deterministic transformation

The harness itself is deterministic. The model under test is the only AI involved, making two calls per cell; that is what is being tested, not a pipeline choice.

OCR / re-OCR

Inherits from the upstream pipeline variant

Approximate processing time

~20-30 seconds per cell on local 7-8B open models (two model calls per cell). ~60 minutes for a full run of 8 models × 13 questions (104 cells).
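The figures above can be cross-checked with a short calculation. The per-cell range and the 8 × 13 grid come from the text; everything else is plain arithmetic.

```python
models, questions = 8, 13
cells = models * questions               # 104 cells in a full run

low_minutes = cells * 20 / 60            # at 20 seconds per cell
high_minutes = cells * 30 / 60           # at 30 seconds per cell

print(f"{cells} cells: {low_minutes:.0f}-{high_minutes:.0f} minutes of inference")
# prints "104 cells: 35-52 minutes of inference"
```

Pure inference time therefore lands somewhat below the quoted ~60 minutes. The remainder is presumably per-run overhead such as model loading, though that is an assumption and not something the source states.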

Resource intensity

Medium — model inference or moderate I/O

Determinism

Deterministic (same input → same output, byte-identical)
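One way to spot-check byte-identical output is to hash the generated prompt from two independent builds and compare digests. The sketch below assumes a hypothetical prompt builder, `build_prompt_sketch`, standing in for the harness's real builders; only the hashing approach is the point.

```python
import hashlib


def build_prompt_sketch(index_map: str, question: str) -> str:
    # Hypothetical stand-in for a harness prompt builder: a pure function of
    # its inputs, with no timestamps, randomness, or unordered iteration.
    return f"{index_map}\n\nQuestion: {question}\n"


def digest(text: str) -> str:
    # SHA-256 over the UTF-8 bytes; equal digests imply byte-identical text
    # for all practical purposes.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = digest(build_prompt_sketch("table_001: revenue", "What was revenue?"))
second = digest(build_prompt_sketch("table_001: revenue", "What was revenue?"))
assert first == second
```

This checks the harness-generated inputs only. Whether the model's replies repeat exactly depends on its decoding settings, which the source does not specify.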

Introduced

Cycle 31, 2026-05-23.

