GovTools
Evaluation mode · Production · recommended

M3-L4 — Oracle retrieval mode

The model receives exactly one pre-selected card per question. This isolates the question 'can the model answer given perfect retrieval?'

In one paragraph

The harness uses the QUERIES registry to look up which card answers each question (e.g. Q-NOAA-LOOKUP-001 → table_017). Only that single card is wrapped in the source-artifact frame and sent to the model. The mode therefore measures pure reading-and-reasoning capability, with the retrieval problem removed. It is used for all per-variant evaluations.
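
The lookup-and-wrap step described above can be sketched roughly as follows. This is a minimal illustration, not the harness code: the shape of QUERIES (question id mapped to a single card id), the in-memory CARDS store, the frame markup, and the helper names oracle_card, frame_card and build_oracle_prompt are all assumptions made for the example.

```python
# Assumed registry shape: question id -> id of the one card that answers it.
# The real QUERIES dict in harness/core.py may carry more fields.
QUERIES = {"Q-NOAA-LOOKUP-001": "table_017"}

# Hypothetical stand-in for the active card-set variant (card id -> card text).
CARDS = {"table_017": "station,year,value\nA,2020,1.5\n"}


def oracle_card(question_id: str) -> str:
    """Return the id of the single pre-selected card for a question."""
    return QUERIES[question_id]


def frame_card(card_id: str, body: str) -> str:
    """Wrap one card in an illustrative source-artifact frame (markup is assumed)."""
    return f'<source_artifact id="{card_id}">\n{body.rstrip()}\n</source_artifact>'


def build_oracle_prompt(question_id: str, question_text: str) -> str:
    """Build a prompt containing exactly one framed card plus the question."""
    card_id = oracle_card(question_id)
    framed = frame_card(card_id, CARDS[card_id])
    return framed + "\n\nQuestion: " + question_text
```

Because the card is chosen by registry lookup rather than by a retriever, a wrong answer in this mode points at reading or reasoning, not at retrieval.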

How the inputs are generated

Generation · 01
Generator script
evaluation_runs/cycle_runner.py:resolve_card_path + build_user_prompt
Input sources
  • Active card-set variant (e.g. csv-only)
  • Question registry (harness/core.py QUERIES dict)
AI use
No — pure deterministic transformation
OCR / re-OCR
Inherits from the upstream pipeline variant
Approximate processing time
~10 seconds per cell on local 7-8B open models; ~5 seconds on the reference frontier API.
Resource intensity
Medium — model inference or moderate I/O
Determinism
Deterministic (same input → same output, byte-identical)
Introduced
Cycle 2.1, 2026-05-20.
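
One way to check the byte-identical property stated above is to build the same input twice and compare digests. The sketch below assumes the determinism claim covers the generated prompt rather than the model's response, and it assumes a directory layout of <root>/<variant>/<card_id>.txt. The function names are hypothetical and do not reproduce the signatures of resolve_card_path or build_user_prompt in cycle_runner.py.

```python
import hashlib
import tempfile
from pathlib import Path


def resolve_card_path_sketch(root: Path, variant: str, card_id: str) -> Path:
    """Map a variant and card id to a file path (layout is an assumption)."""
    return root / variant / f"{card_id}.txt"


def build_prompt_bytes(card_path: Path, question: str) -> bytes:
    """Concatenate card bytes and question; no timestamps or random ids."""
    card = card_path.read_bytes()
    return card + b"\n\nQuestion: " + question.encode("utf-8")


def digest(data: bytes) -> str:
    """SHA-256 hex digest, used as a fingerprint of the exact bytes."""
    return hashlib.sha256(data).hexdigest()


def is_byte_identical(card_path: Path, question: str, runs: int = 2) -> bool:
    """Rebuild the prompt several times and confirm every digest matches."""
    digests = {digest(build_prompt_bytes(card_path, question)) for _ in range(runs)}
    return len(digests) == 1
```

A check like this could run once per cell before inference, so that any nondeterminism introduced upstream (for example by a re-OCR step) surfaces as a digest mismatch rather than as unexplained score drift.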

Related variants

Cross-reference · 06