M3-IDX — Two-shot retrieval mode
Model picks a table from a per-document index, then receives that table. Tests retrieval + reading together; isolates the cost of removing the oracle.
Each question takes two model calls. Call 1 serves the doc-index map (`pipeline-v0.7-doc-index/<sha>/doc-index.map.md`, ~3-18 KB) with the question; the model responds with `CHOICE: table_NNN`. Call 2 loads the model-selected table card and serves it under the existing M3-L4 prompt; the model answers. The verdict is whether that answer passes the scorer. An additional metric, `retrieval_correct`, records whether the model's choice matched the oracle card_id, which separates retrieval failure from reading failure. If the model fails to return a parseable choice, the cell is marked fail (no oracle fallback). The mode is designed to expose the model's own ability to find the correct table, which the M3-L4 oracle mode hides.
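The two-call flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in `cycle_runner.py`: the names `parse_choice`, `run_two_shot`, `CellResult`, the `ask` and `score` callables, the prompt layout, and the exact regular expression are all hypothetical. Treating a choice that names a non-existent card as a failure is also an assumption; the source only specifies the unparseable case.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Assumed reply format: the source only states that the model answers
# with `CHOICE: table_NNN`; the exact pattern here is a guess.
CHOICE_RE = re.compile(r"CHOICE:\s*(table_\d+)")


def parse_choice(reply: str) -> Optional[str]:
    """Return the chosen table id, or None if no parseable choice is found."""
    match = CHOICE_RE.search(reply)
    return match.group(1) if match else None


@dataclass
class CellResult:
    chosen: Optional[str]      # table id picked in call 1, if any
    retrieval_correct: bool    # did the choice match the oracle card id?
    passed: bool               # did the call-2 answer pass the scorer?


def run_two_shot(
    question: str,
    index_map: str,
    cards: Dict[str, str],
    oracle_id: str,
    ask: Callable[[str], str],      # hypothetical model-call wrapper
    score: Callable[[str], bool],   # hypothetical scorer
) -> CellResult:
    # Call 1: the model sees the document index and picks a table.
    chosen = parse_choice(ask(f"{index_map}\n\nQuestion: {question}"))

    # No parseable choice: mark the cell as failed, with no oracle fallback.
    # (Rejecting an unknown table id the same way is an assumption.)
    if chosen is None or chosen not in cards:
        return CellResult(chosen, False, False)

    # Call 2: the model answers from the card it selected itself.
    answer = ask(f"{cards[chosen]}\n\nQuestion: {question}")

    # The oracle id is used only to record retrieval correctness,
    # never to decide which card is served.
    return CellResult(chosen, chosen == oracle_id, score(answer))
```

Keeping `retrieval_correct` and `passed` as separate fields lets a cell be read as a retrieval failure, a reading failure, or a pass reached through a non-oracle table.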
How the inputs are generated
Generation · 01
evaluation_runs/cycle_runner.py: run_cell_idx + build_idx_call_1_prompt + build_idx_call_2_prompt + parse_chosen_table_id
- pipeline-v0.7-doc-index variant
- raw pipeline-v0.6.1 cards for the model-selected table
- Question registry (harness/core.py QUERIES dict; used only to validate retrieval correctness, NOT to choose the card)
Related variants
Cross-reference · 06
- Evaluation mode: M3-L4 (Oracle retrieval mode). Model receives exactly one pre-selected card per question. Isolates 'can the model answer given perfect retrieval?'
- Document-level map structure: Document table-of-contents map. One card per document listing every detected table with caption, page, and dimensions. Enables two-shot retrieval.