GovTools
Evaluation mode · Experimental

M3-HYDE — Vector retrieval (HyDE) mode

Vector similarity over pre-embedded cards. A small independent model writes a hypothetical answer; that answer is embedded and matched against the corpus. Works for any model size — no TOC navigation required.

In one paragraph

Retrieval runs in three stages. (1) A small independent generator model (default qwen2.5:3b) writes a 1-3 sentence hypothetical answer in the style of a real archival answer, naming a value, row, and column. The generator does not know the real answer; the hypothetical serves purely as a search query. (2) The hypothetical answer is embedded with `nomic-embed-text` (open-weight, 768-dim, ~270 MB, runs locally via Ollama). (3) The resulting vector is scored by cosine similarity against the pre-built card index (`evaluation_runs/hyde/card_index.jsonl`, 407 entries, ~6.5 MB), and the top-1 card is served to the evaluation model under the existing M3-L4 prompt. The evaluation model never sees the hypothetical answer; only the retrieved card reaches it. Key advantage over M3-IDX two-shot: a 3B local model can handle retrieval for an evaluator of any size, including ClimateGPT-13B, whose 4K context cannot hold the V27/V35 indexes.
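The three stages can be sketched in a few lines of Python. This is a minimal illustration, not the code in `retrieve.py`: the function names (`cosine`, `hyde_retrieve`) and the `(card_id, vector)` pair layout are assumptions, and the generator and embedder are passed in as plain callables so the sketch runs without Ollama.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hyde_retrieve(
    question: str,
    cards: List[Tuple[str, Vector]],      # hypothetical layout: (card_id, embedding)
    generate: Callable[[str], str],       # stage 1: question -> hypothetical answer
    embed: Callable[[str], Vector],       # stage 2: text -> embedding
) -> Tuple[str, float]:
    """Return the id and score of the top-1 card for a question."""
    if not cards:
        raise ValueError("card index is empty")
    hypothetical = generate(question)     # never shown to the evaluation model
    query_vec = embed(hypothetical)
    # Stage 3: score every card and keep the single best match.
    scored = [(cosine(query_vec, vec), card_id) for card_id, vec in cards]
    best_score, best_id = max(scored)
    return best_id, best_score
```

Only the returned card id travels onward to the evaluation model; the hypothetical text stays local to `hyde_retrieve`.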

How the inputs are generated

Generation · 01
Generator script
evaluation_runs/cycle_runner.py:run_cell_hyde + evaluation_runs/hyde/retrieve.py:retrieve
Input sources
  • Pre-built card embedding index at evaluation_runs/hyde/card_index.jsonl (built once via build_card_embeddings.py)
  • Per-query hypothetical answer generated by qwen2.5:3b (HyDE generator)
  • nomic-embed-text (768-dim) for both card-side and query-side embedding
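As an illustration of how the pre-built index might be consumed, the sketch below loads a JSONL file into memory and checks the embedding width. The record schema (`card_id` and `embedding` keys) is an assumption; the actual fields written by `build_card_embeddings.py` are not documented here.

```python
import json
from typing import Iterable, List, Tuple


def load_card_index(
    lines: Iterable[str],
    expected_dim: int = 768,  # nomic-embed-text produces 768-dim vectors
) -> List[Tuple[str, List[float]]]:
    """Parse JSONL lines into (card_id, embedding) pairs.

    Assumed record shape (hypothetical): {"card_id": "...", "embedding": [...]}
    """
    cards: List[Tuple[str, List[float]]] = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        vector = [float(x) for x in record["embedding"]]
        if len(vector) != expected_dim:
            raise ValueError(
                f"line {line_no}: expected {expected_dim} dims, got {len(vector)}"
            )
        cards.append((record["card_id"], vector))
    return cards
```

Passing an open file handle for `evaluation_runs/hyde/card_index.jsonl` as `lines` would load all 407 entries in one pass; the dimension check guards against mixing card-side and query-side embeddings from different models.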
AI use
Yes
Two small AI calls per query: (a) hypothetical-answer generation by qwen2.5:3b (~1.5 sec), (b) embedding of that hypothetical by nomic-embed-text (~50 ms). The evaluation model then makes its one call to read the retrieved card. The hypothetical answer hallucinates a value by design; it never reaches the evaluator.
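The two small calls could look roughly like this against a local Ollama server. It is a sketch under assumptions: the prompt wording and helper names are hypothetical, the server is assumed to listen on Ollama's default port, and the endpoints used are Ollama's `/api/generate` and `/api/embeddings`.

```python
import json
import urllib.request
from typing import Any, Dict, List

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address


def generate_payload(question: str, model: str = "qwen2.5:3b") -> Dict[str, Any]:
    """Request body for call (a). The prompt text is illustrative only."""
    prompt = (
        "Write a 1-3 sentence answer in the style of an archival record, "
        "naming a value, row, and column.\n"
        f"Question: {question}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def embedding_payload(text: str, model: str = "nomic-embed-text") -> Dict[str, Any]:
    """Request body for call (b)."""
    return {"model": model, "prompt": text}


def post_json(path: str, payload: Dict[str, Any], timeout: float = 30.0) -> Dict[str, Any]:
    """POST a JSON body to the Ollama server and decode the JSON reply."""
    request = urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def hypothetical_answer(question: str) -> str:
    """Call (a): ask the small generator for a hypothetical answer."""
    return post_json("/api/generate", generate_payload(question))["response"].strip()


def embed_text(text: str) -> List[float]:
    """Call (b): embed the hypothetical answer."""
    return post_json("/api/embeddings", embedding_payload(text))["embedding"]
```

Keeping payload construction separate from the network call makes the request bodies easy to inspect and log without a running server.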
OCR / re-OCR
Inherits from the upstream pipeline variant
Approximate processing time
~2 seconds per cell for the retrieval pipeline, plus the evaluation model's normal call time. Index build: ~15 seconds for 407 cards.
Resource intensity
Low — CPU-only post-processing, runs in seconds
Determinism
Mostly deterministic, with one bounded LLM step
Introduced
Cycle 33, 2026-05-24.

Related variants

Cross-reference · 06