Evaluation mode · Production · recommended
M3-L4 — Oracle retrieval mode
Model receives exactly one pre-selected card per question. Isolates 'can the model answer given perfect retrieval?'
In one paragraph
The harness uses the QUERIES registry to look up which card answers each question (e.g. Q-NOAA-LOOKUP-001 → table_017). Only that single card is wrapped in the source-artifact frame and sent to the model. This measures pure reading-and-reasoning capability, with the retrieval problem removed. It is the mode used for all per-variant evaluations.
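The lookup-then-frame flow described above can be sketched roughly as follows. This is a minimal illustration, not the harness code: the registry schema, the in-memory `CARDS` store, the `build_oracle_prompt` helper, and the `<source_artifact>` markup are all assumptions made for the example, and the card contents are placeholders.

```python
# Hypothetical registry entry. The real QUERIES dict lives in harness/core.py
# and its schema is not shown in this document.
QUERIES = {
    "Q-NOAA-LOOKUP-001": {
        "question": "Placeholder question text.",
        "card_id": "table_017",
    },
}

# Hypothetical card store standing in for the active card-set variant.
# The CSV bodies are placeholders, not real data.
CARDS = {
    "table_017": "station,year,value\nA,2001,1.0\n",
    "table_018": "station,year,value\nB,2001,2.0\n",
}


def build_oracle_prompt(query_id: str) -> str:
    """Return a user prompt that contains exactly one pre-selected card."""
    entry = QUERIES[query_id]  # raises KeyError for an unregistered question
    card_id = entry["card_id"]
    card_text = CARDS[card_id]
    # Assumed frame markup; the real source-artifact frame may look different.
    framed = f'<source_artifact id="{card_id}">\n{card_text}</source_artifact>'
    return f"{framed}\n\nQuestion: {entry['question']}"


prompt = build_oracle_prompt("Q-NOAA-LOOKUP-001")
```

Because the card is chosen by the registry rather than by the model, any wrong answer in this mode points at reading or reasoning, not at retrieval.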
How the inputs are generated
Generation · 01
Generator script
evaluation_runs/cycle_runner.py:resolve_card_path + build_user_prompt
Input sources
- Active card-set variant (e.g. csv-only)
- Question registry (harness/core.py QUERIES dict)
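As a rough illustration of how the two input sources might combine, the sketch below maps a variant name and a card ID to a single file path. The function name `resolve_card_path_sketch`, the `cards_root` parameter, and the `<root>/<variant>/<card_id>.txt` layout are assumptions for the example; the actual `resolve_card_path` signature and directory layout are not given here.

```python
from pathlib import Path


def resolve_card_path_sketch(cards_root: Path, variant: str, card_id: str) -> Path:
    """Map (variant, card ID) to one card file, with no model call involved."""
    # Assumed layout: <cards_root>/<variant>/<card_id>.txt
    return cards_root / variant / f"{card_id}.txt"


# The card ID would come from the question registry; the variant from the run config.
path = resolve_card_path_sketch(Path("cards"), "csv-only", "table_017")
```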
AI use
No — pure deterministic transformation
OCR / re-OCR
Inherits from the upstream pipeline variant
Approximate processing time
~10 seconds per cell on local 7-8B open models; ~5 seconds on the reference frontier API.
Resource intensity
Medium — model inference or moderate I/O
Determinism
Deterministic (same input → same output, byte-identical)
Introduced
Cycle 2.1, 2026-05-20.
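One way to spot-check the byte-identical determinism claim above is to build the same prompt twice and compare digests. This is a sketch under assumptions: `build_prompt` is a hypothetical stand-in for the real prompt builder, and the card text is a placeholder.

```python
import hashlib


def build_prompt(card_text: str, question: str) -> str:
    # Pure string assembly: no timestamps, random IDs, or unordered iteration.
    return f"{card_text}\n\nQuestion: {question}"


def digest(prompt: str) -> str:
    """SHA-256 over the UTF-8 bytes, so any byte-level difference shows up."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


first = digest(build_prompt("a,b\n1,2\n", "What is b?"))
second = digest(build_prompt("a,b\n1,2\n", "What is b?"))
identical = first == second
```

This checks only the generated input; model outputs may still vary between runs depending on sampling settings.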
Related variants
Cross-reference · 06
- Evaluation mode · M2c — Docling Markdown mode. Model receives Docling's full linearized Markdown for the entire document.
- Evaluation mode · M2a — Raw Docling JSON mode. Model receives the raw decompressed docling.json.gz. Demonstrates why specialized evidence packaging is needed at all.
- Evaluation mode · M3-AC — All-cards mode. Model receives every card in a document concatenated. Tests retrieval-without-oracle.
- Evaluation mode · M3-IDX — Two-shot retrieval mode. Model picks a table from a per-document index, then receives that table. Tests retrieval + reading together; isolates the cost of removing the oracle.