GovTools
Pipeline version · Baseline

Pipeline v0.6.1 (full card)

The reference full-featured card: all structural information that Docling extracts, packaged into one Markdown card per detected table.

In one paragraph

This is the mainline pipeline output. Each card contains YAML frontmatter (provenance, table type, dimensions, page numbers), a caption, a cross-check against candidate captions, the table rendered inline as Markdown, the headings hierarchy, geographic and temporal scope claims, nearby paragraphs, and a context-envelope JSON with deduplicated claims. It is designed for models with enough context to read the whole card, and it is the baseline against which compact variants are measured.
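As a rough illustration of the layout described above, the sketch below assembles a card from YAML frontmatter followed by Markdown sections. The `build_card` helper, the section headings, and the frontmatter keys in the usage line are hypothetical; the actual schema written by extract_contextual_table_cards.py is not shown on this page.

```python
import json


def build_card(meta: dict, caption: str, table_md: str, envelope: dict) -> str:
    """Assemble a hypothetical table card: YAML frontmatter, then a Markdown body."""
    # JSON scalars and flow lists are valid YAML, so json.dumps is used for values.
    # Keys are sorted so that the same input always yields the same bytes.
    front = "\n".join(f"{key}: {json.dumps(meta[key])}" for key in sorted(meta))
    sections = [
        f"## Caption\n\n{caption}",
        f"## Table\n\n{table_md}",
        # sort_keys keeps the context envelope stable across runs as well.
        "## Context envelope\n\n" + json.dumps(envelope, sort_keys=True, indent=2),
    ]
    return f"---\n{front}\n---\n\n" + "\n\n".join(sections) + "\n"


# Usage with made-up values (keys are assumptions, not the real frontmatter schema).
card = build_card(
    {"table_type": "data", "pages": [3, 4]},
    "Example caption",
    "| a |\n|---|\n| 1 |",
    {"claims": []},
)
```

Sorting keys is one simple way to keep output byte-stable, which matters for a pipeline that is expected to be deterministic.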

How the inputs are generated

Generation · 01
Generator script
extract_contextual_table_cards.py
Input sources
  • Source PDF (or IA Wayback URL)
  • Docling-extracted JSON
  • Docling-extracted text
AI use
No — pure deterministic transformation
OCR / re-OCR
Inherits from Docling's extraction step
Tool: Docling --force-reocr (Tesseract LSTM under the hood for OCR'd scans)
Approximate processing time
V27 (874 PDF pages, 186 tables): ~32 min on Apple Silicon. NOAA-32079 (63 pages, 35 tables, born-digital): ~2 min.
Resource intensity
High — Docling extraction with OCR, multi-minute
Determinism
Deterministic (same input → same output, byte-identical)
Output location
user_urls_output_v0.3_reocr/sha256_<sha>/table_cards/
Cards produced
186 V27 + 186 V35 + 35 NOAA-32079 = 407 cards
Introduced
v0.6 batch 1, 2026-05-21 — dedup patches added 2026-05-22 as v0.6.1.
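The byte-identical determinism claim above can be spot-checked by hashing the cards from two separate runs and comparing the digests. This is a minimal sketch, not part of the pipeline: `digest_tree` and `differing_files` are hypothetical helpers, and the two directory arguments stand for whatever `table_cards/` folders the two runs produced.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def differing_files(run_a: Path, run_b: Path) -> list[str]:
    """Return relative paths that are missing from one run or differ in content."""
    a, b = digest_tree(run_a), digest_tree(run_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `differing_files` means both runs produced the same set of files with identical bytes.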

Evaluation results

Diagnostic · 02
Avg open-tier pass rate
27%
Typical card size
15–42 KB per card
Evaluation cycle
Cycle 13
Relative to v0.6.1 baseline
This variant is the baseline (0% lift by definition)

Caveats and known limitations

Scope · 05
  • Card size often exceeds the context window of 4K-context open models, so content gets truncated mid-read.
  • OCR noise propagates from Docling (e.g. column headers like 'I1I1I,000' for '111,000').
  • Multi-page tables are split at the PDF page boundary — see pipeline-v0.7-stitched.
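To make the first caveat concrete, the sketch below estimates whether a card of a given size fits a model's context window. The 4 bytes-per-token ratio and the 512-token reserve for the prompt and answer are assumptions for illustration only; real counts depend on the tokenizer, and table-heavy text may tokenize less efficiently. Under these assumptions a 15 KB card comes to roughly 3,800 tokens and a 42 KB card to roughly 10,800.

```python
def estimated_tokens(card_bytes: int, bytes_per_token: float = 4.0) -> int:
    """Rough token estimate; bytes_per_token is a heuristic, not a measured value."""
    return round(card_bytes / bytes_per_token)


def fits_context(
    card_bytes: int,
    context_tokens: int = 4096,
    reserved_tokens: int = 512,  # assumed budget for instructions and the answer
    bytes_per_token: float = 4.0,
) -> bool:
    """True if the estimated card tokens fit in the window after the reserve."""
    return estimated_tokens(card_bytes, bytes_per_token) <= context_tokens - reserved_tokens


# The typical card sizes quoted above, taking 1 KB as 1024 bytes.
small_card, large_card = 15 * 1024, 42 * 1024
```

With these defaults, neither end of the typical size range fits a 4K window once room is reserved for the prompt.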
