Evaluation mode · Experimental

M3-HYDE — Vector retrieval (HyDE) mode

Vector similarity over pre-embedded cards. A small independent model writes a hypothetical answer; that answer is embedded and matched against the corpus. Works for any model size — no TOC navigation required.

In one paragraph

Retrieval runs in three stages. (1) A small independent generator model (default qwen2.5:3b) writes a 1-3 sentence hypothetical answer in the style of a real archival answer, naming the value, row, and column. The generator does not know the real answer; the hypothetical serves purely as a search query. (2) The hypothetical answer is embedded with `nomic-embed-text` (open-weight, 768-dim, ~270 MB, runs locally via Ollama). (3) The embedding is compared by cosine similarity against the pre-built card index (`evaluation_runs/hyde/card_index.jsonl`, 407 entries, ~6.5 MB), and the top-1 card is served to the evaluation model under the existing M3-L4 prompt. The evaluation model never sees the hypothetical answer; only the retrieved card reaches it. The key advantage over M3-IDX two-shot is that a 3B local model can handle retrieval for an evaluator of any size, including ClimateGPT-13B, whose 4K context cannot hold the V27/V35 indexes.
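The three stages above can be sketched roughly as follows. This is a minimal illustration, not the code in `retrieve.py`: the names `cosine_similarity` and `retrieve_top1_sketch`, the `card_id` and `embedding` field names, and the injected `generate` and `embed` callables are all assumptions made for the example.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top1_sketch(
    question: str,
    generate: Callable[[str], str],   # stage 1: hypothetical-answer generator (assumed interface)
    embed: Callable[[str], Vector],   # stage 2: text embedder (assumed interface)
    index: Sequence[dict],            # stage 3: cards with assumed "card_id" / "embedding" keys
) -> tuple:
    """Return (card_id, score) of the card most similar to the hypothetical answer."""
    if not index:
        raise ValueError("card index is empty")
    hypothetical = generate(question)   # used only as a search query
    query_vec = embed(hypothetical)
    scored = [
        (cosine_similarity(query_vec, card["embedding"]), card["card_id"])
        for card in index
    ]
    best_score, best_id = max(scored, key=lambda pair: pair[0])
    # Only the card id is returned; the hypothetical text is discarded here,
    # so it never reaches the evaluation model.
    return best_id, best_score
```

Passing the generator and embedder in as callables keeps the similarity logic testable without a running model server; any pair of functions with these shapes would work.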

How the inputs are generated

Generation · 01

Generator script

evaluation_runs/cycle_runner.py:run_cell_hyde + evaluation_runs/hyde/retrieve.py:retrieve

Input sources

• Pre-built card embedding index at evaluation_runs/hyde/card_index.jsonl (built once via build_card_embeddings.py)
• Per-query hypothetical answer generated by qwen2.5:3b (HyDE generator)
• nomic-embed-text (768-dim) for both card-side and query-side embedding
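As an illustration of how the card index might be read, the sketch below parses JSONL records and checks the embedding dimension. The record schema is not documented here, so the `card_id` and `embedding` keys, and the function name `load_card_index`, are assumptions for the example only.

```python
import json
from typing import Iterable, List


def load_card_index(lines: Iterable[str], expected_dim: int = 768) -> List[dict]:
    """Parse JSONL lines into card records, validating embedding length.

    Assumes each line is a JSON object with "card_id" and "embedding" keys
    (a hypothetical schema; the real index may use different field names).
    """
    cards = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        embedding = record["embedding"]
        if len(embedding) != expected_dim:
            raise ValueError(
                f"line {line_no}: expected {expected_dim} dims, got {len(embedding)}"
            )
        cards.append(
            {"card_id": record["card_id"], "embedding": [float(x) for x in embedding]}
        )
    return cards
```

Because the function accepts any iterable of strings, an open file handle on `evaluation_runs/hyde/card_index.jsonl` can be passed directly. Checking the dimension at load time would catch a card index built with a different embedding model than the one used on the query side.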

AI use

Yes

Two small AI calls per query: (a) hypothetical-answer generation by qwen2.5:3b (~1.5 s) and (b) embedding of that hypothetical by nomic-embed-text (~50 ms). The evaluation model then makes one call to read the retrieved card. The hypothetical answer contains a hallucinated value; that is by design, and it never reaches the evaluator.
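The two retrieval-side calls could look something like the sketch below, which uses Ollama's local REST endpoints `/api/generate` and `/api/embeddings`. The prompt wording, the function names, and the `opener` parameter (added so the calls can be stubbed in tests) are assumptions; the actual pipeline may use a different client or prompt.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def _post_json(path, payload, opener=urllib.request.urlopen):
    """POST a JSON payload to the local Ollama server and decode the JSON reply."""
    request = urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with opener(request) as response:
        return json.loads(response.read().decode("utf-8"))


def generate_hypothetical(question, model="qwen2.5:3b", opener=urllib.request.urlopen):
    """Call (a): ask the small generator for a short hypothetical answer."""
    # Hypothetical prompt wording, for illustration only.
    prompt = (
        "Write a 1-3 sentence answer in the style of an archival record, "
        "naming the value, row, and column.\n\nQuestion: " + question
    )
    body = _post_json(
        "/api/generate",
        {"model": model, "prompt": prompt, "stream": False},
        opener,
    )
    return body["response"].strip()


def embed_text(text, model="nomic-embed-text", opener=urllib.request.urlopen):
    """Call (b): embed the hypothetical answer for similarity search."""
    body = _post_json("/api/embeddings", {"model": model, "prompt": text}, opener)
    return body["embedding"]
```

Setting `"stream": False` asks Ollama for a single JSON object rather than a stream of chunks, which keeps the parsing to one `json.loads` call.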

OCR / re-OCR

Inherits from the upstream pipeline variant

Approximate processing time

~2 seconds per cell for the retrieval pipeline, plus the evaluation model's normal call time. Index build: ~15 seconds for 407 cards.

Resource intensity

Low — CPU-only post-processing, runs in seconds

Determinism

Mostly deterministic, with one bounded LLM step

Introduced

Cycle 33, 2026-05-24.

Related variants

Cross-reference · 06
