Archival evidence packaging for open models

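Many open models cannot read or interpolate table data when given a raw PDF or a full Docling JSON file.
Text-only models cannot ingest the PDF at all, and the Docling JSON is far larger than the context windows of many open models.
Given a compact CSV table card of about 1.5 KB instead, the same models can answer.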

The Erschließung (archival processing) project tested 24 open/open-weight models, plus 6 closed-reference comparators, on a 13-question benchmark from three archival documents: two scanned marine biology journals and one born-digital NOAA fisheries report.

Across 36 cycles, the project evaluated 28 evidence-derivative formats, from raw Docling JSON and full Markdown cards to compact CSV, micro-cards, table-only, stitched, normalized, and column-context variants. It also tested multiple retrieval modes, including oracle cards, all-cards, two-shot index retrieval, HyDE/vector retrieval, full-document Markdown, and raw Docling JSON.

The headline finding: better evidence packaging lifted the open-tier pass rate from 27% to 55%, with two 7–8B open models reaching 11/13 on compact CSV cards.

Input gradient at a glance

Raw PDF (the archival item itself): 0 / 7 correct. Apertus 8B and ClimateGPT 13B are text-only models; they cannot ingest a PDF binary.

Raw Docling JSON (the IA-default derivative): 0 / 7 correct. The file is ~30 MB decompressed, so the models' 4K–8K context windows overflow before reasoning starts.

Compact CSV card (Erschließung pipeline output): 7 / 7 correct. About 1.5 KB per question: the table data as CSV inside a Markdown code block, which the model reads and answers from.

Sample: 7 (model, question) cells across 6 distinct questions on the 13-question diagnostic.
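The compact card format (table data as CSV inside a Markdown code block) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's card builder: the function name `render_card`, the three context lines, and the 1,536-byte budget (chosen to match the ~1.5 KB figure) are all hypothetical. The sample rows are the Table III organic phosphorus values quoted in the evidence table below (20 m = 9.6, 30 m = 6.1, 40 m = 4.4 mg P/m³); the caption wording is illustrative.

```python
import csv
import io
from typing import List

# Three backticks, built from parts so the card's fence cannot close an enclosing listing.
FENCE = "`" * 3

def render_card(question_id: str, caption: str, header: List[str],
                rows: List[List[str]], source: str, max_bytes: int = 1536) -> str:
    """Render a hypothetical compact evidence card: context lines plus the table as fenced CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")  # the csv module quotes cells that contain commas
    writer.writerow(header)
    writer.writerows(rows)
    card = (
        f"Evidence card: {question_id}\n"
        f"Caption: {caption}\n"
        f"Source: {source}\n\n"
        f"{FENCE}csv\n{buf.getvalue()}{FENCE}\n"
    )
    size = len(card.encode("utf-8"))  # the budget is measured in bytes, not characters
    if size > max_bytes:
        raise ValueError(f"card is {size} bytes; budget is {max_bytes}")
    return card

card = render_card(
    "Q-NAT-INT-001",
    "Organic phosphorus, 3 miles S of Eddystone, 23 July 1947",
    ["Depth (m)", "Organic P (mg P/m³)"],
    [["20", "9.6"], ["30", "6.1"], ["40", "4.4"]],
    source="Vol 27, Table III",
)
print(card)
```

Keeping the caption and source outside the fence, and only the grid inside it, lets the model cite a row without having to parse layout.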

Interpolation evidence

Model	Question	Raw PDF (input source)	Raw Docling JSON (IA-default derivative)	Compact CSV card (Erschließung pipeline output)	Correct answer
Apertus 8B Instruct (Swiss AI)	Q-NAT-001: Looking at the list of Physalia (Portuguese man-of-war) strandings on British and European coasts in Vol 27, what year did the first reported stranding on the Belgian coast happen, and what was the locality?	n/a (text-only model; PDF must be pre-converted)	n/a (context overflow: ~1 GB JSON vs 4K–8K window)	✓ correct. Response: The first reported stranding on the Belgian coast happened in 1912, with the locality being Knocke (Belgium). This information is found in the row for 1912, specifically in the "Locality" column.	Ordered lookup test: 1912, Knocke.
ClimateGPT 13B (climate-domain Llama-2)	Q-NAT-006: Looking at the annual fish-catch table in Vol 27, how much less fish weight was caught in 1934 than the year before? Show both source values and the difference.	n/a (text-only model; PDF must be pre-converted)	n/a (context overflow: ~1 GB JSON vs 4K–8K window)	✓ correct. Response: Source: 1934: 93,000 tons; 1933: 95,000 tons. Difference: 2,000 tons less	Table subtraction test: 2,000 tons less; 1933=95,000, 1934=93,000.
Apertus 8B Instruct (Swiss AI)	Q-NAT-INT-001: In Vol 27, Table III lists the organic phosphorus concentration in offshore water 3 miles south of the Eddystone Lighthouse on 23 July 1947. The table reports depths at 30 metres and 40 metres but does not include a 35-metre row. What would you estimate the organic phosphorus concentration was at 35 metres? Show the bracketing values and your interpolation method.	n/a (text-only model; PDF must be pre-converted)	n/a (context overflow: ~1 GB JSON vs 4K–8K window)	✓ correct. Response: Based on the provided data, I will interpolate between the depths of 30 meters (6.1 mg P/m³) and 40 meters (4.4 mg P/m³). The organic phosphorus concentration at 35 meters would be estimated as approximately 5.2 mg P/m³. This estimate is based on linear interpolation, considering the change in concentration between these two depths.	Linear interpolation test: 5.25 mg P/m³; 30m=6.1, 40m=4.4.
Apertus 8B Instruct (Swiss AI)	Q-NAT-INT-002: If the exact value is missing but the required source values are present, calculate or infer the answer and show the method. A marine biologist examining the offshore phosphorus measurements 3 miles south of the Eddystone Lighthouse on 23 July 1947 wants the organic phosphorus concentration at 25 meters. What would they conclude?	n/a (text-only model; PDF must be pre-converted)	n/a (context overflow: ~1 GB JSON vs 4K–8K window)	✓ correct. Response: To find the organic phosphorus concentration at 25 meters, we can interpolate from the given data. The closest depth to 25 meters is 30 meters (6.1 mg P per m³). The next lower depth is 20 meters (9.6 mg P per m³). Using linear interpolation: - At 30 meters: Organic = 6.1 - At 20 meters: Organic = 9.6 The difference in depth between 30 and 25 meters is 5 meters. Let's calculate the organic phosphorus concentration at 25 meters using a simple linear interpolation formula: Organic at 25 m ≈ (6.1 + (9.6 - 6.1) * (25 - 30)) / (20 - 30) Calculating this gives us approximately 7.8 mg P per m³. So, the marine biologist would conclude that the organic phosphorus concentration at 25 meters is approximately 7.8 mg P per m³. This conclusion is based on interpolation from the given data in Table III.	Unscaffolded interpolation test: 7.85 mg P/m³; 20m=9.6, 30m=6.1.
Apertus 8B Instruct (Swiss AI)	Q-V27-NEG-001: In the Vol 27 list of Physalia (Portuguese man-of-war) strandings on British and European coasts, was there a stranding reported on the coast of Greece? If so, what year and where?	n/a (text-only model; PDF must be pre-converted)	n/a (context overflow: ~1 GB JSON vs 4K–8K window)	✓ correct. Response: No, there is no record of a Physalia (Portuguese man-of-war) stranding on the coast of Greece in this list. The closest location mentioned is "Guéethary Bantry, Co. Cork" in Ireland and "Tragumina, Co, Cork" also in Ireland.	Negative lookup test: not in source; no Greek-coast entries.
ClimateGPT 13B (climate-domain Llama-2)	Q-V27-NEG-001: In the Vol 27 list of Physalia (Portuguese man-of-war) strandings on British and European coasts, was there a stranding reported on the coast of Greece? If so, what year and where?	n/a (text-only model; PDF must be pre-converted)	n/a (context overflow: ~1 GB JSON vs 4K–8K window)	✓ correct. Response: No. There were no Physalia strandings reported on the coast of Greece.	Negative lookup test: not in source; no Greek-coast entries.
Apertus 8B Instruct (Swiss AI)	Q-NOAA-NEG-001: How many cruises were conducted in 1948 according to the California Current fisheries database?	n/a (text-only model; PDF must be pre-converted)	n/a (context overflow: ~1 GB JSON vs 4K–8K window)	✓ correct. Response: There is no information about cruise numbers for 1948 in the provided table or source material. The earliest year mentioned is 1950 (Cruise Number 50Y01). Therefore, it's not possible to determine how many cruises were conducted in 1948 based on this data.	Boundary test: not in source; database starts in 1950.
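For reference, both phosphorus questions reduce to standard linear interpolation, y = y0 + (x - x0) / (x1 - x0) · (y1 - y0). A minimal Python sketch (the name `lerp` is ours, not the pipeline's) that reproduces the two expected answers from their bracketing rows:

```python
def lerp(x: float, x0: float, y0: float, x1: float, y1: float) -> float:
    """Linear interpolation: the y value at x on the line through (x0, y0) and (x1, y1)."""
    if x1 == x0:
        raise ValueError("bracketing x values must differ")
    t = (x - x0) / (x1 - x0)  # fraction of the way from x0 to x1
    return y0 + t * (y1 - y0)

# Q-NAT-INT-001: 35 m, bracketed by 30 m (6.1) and 40 m (4.4 mg P/m³)
print(round(lerp(35, 30, 6.1, 40, 4.4), 2))  # 5.25
# Q-NAT-INT-002: 25 m, bracketed by 20 m (9.6) and 30 m (6.1 mg P/m³)
print(round(lerp(25, 20, 9.6, 30, 6.1), 2))  # 7.85
```

Both targets sit exactly halfway between their brackets, so t = 0.5 and the answer is simply the mean of the two source values.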

How each artifact relates to the source PDF and Docling output

The practical question is not where files sit on disk — it is what work is needed to obtain each artifact. For an Internet Archive item that has already been Docling-processed, the JSON is a free download. The compact evidence cards, table CSVs, document indexes, and card variants are all derivable from that one file without re-running Docling. Only the figure pixel crops need either the source PDF or a Docling re-run with figure extraction enabled.
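A minimal sketch of that "pure transformation" step, turning each table in docling.json.gz into CSV text with only the standard library. It assumes the Docling JSON layout as we understand it: a top-level `tables` list whose items carry `data.num_rows`, `data.num_cols`, and `data.table_cells`, each cell having `text` plus `start_row_offset_idx` / `end_row_offset_idx` and `start_col_offset_idx` / `end_col_offset_idx`. Check those names against your Docling version. The `table_NNN` naming is hypothetical, and copying a spanning cell into every position it covers is one choice among several.

```python
import csv
import gzip
import io
import json
from typing import Dict

def load_docling(path: str) -> dict:
    """Read a gzip-compressed Docling JSON export such as docling.json.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)

def table_to_csv(table: dict) -> str:
    """Flatten one table item into CSV text (assumed schema; see the lead-in)."""
    data = table["data"]
    grid = [["" for _ in range(data["num_cols"])] for _ in range(data["num_rows"])]
    for cell in data["table_cells"]:
        # A spanning cell is copied into every row/column position it covers.
        for r in range(cell["start_row_offset_idx"], cell["end_row_offset_idx"]):
            for c in range(cell["start_col_offset_idx"], cell["end_col_offset_idx"]):
                grid[r][c] = cell["text"]
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()

def export_tables(doc: dict) -> Dict[str, str]:
    """Map hypothetical names table_000, table_001, ... to CSV text, in document order."""
    return {f"table_{i:03d}": table_to_csv(t) for i, t in enumerate(doc.get("tables", []))}
```

Usage would look like `export_tables(load_docling("docling.json.gz"))`; no model call and no second Docling pass is involved.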

Artifact	How to get it	Detail
The PDF itself	On the Internet Archive	The archival document — the only thing the Internet Archive is guaranteed to hold for an item.
docling.json.gz	On the Internet Archive	Docling's full structural extraction: layout, tables, captions, headings, figure bounding boxes, reading order. The Internet Archive holds this for items it has already processed. For unprocessed items, the pipeline runs Docling on the PDF once and produces it.
Per-table CSV / Parquet / HTML / Markdown	Derivable from the JSON	One file per detected table, extracted from docling.json.gz. Pure transformation — no model needed, no second Docling pass needed.
Table context envelopes	Derivable from the JSON	Captions, headings, and neighboring paragraphs for each table. All present in docling.json.gz; the pipeline just picks them out.
Compact per-table evidence cards	Derivable from the JSON	The 1–2 KB CSV-format cards that drive the headline result. Built from the per-table CSV + selected context. No Docling re-run; no extra Internet Archive fetch beyond the JSON.
Card variants (csv-only, micro-1k, table-normalized, …)	Derivable from the JSON	Reshapes of the base card. Each variant changes what the model sees, never what was extracted. All derive from docling.json.gz.
Document-level maps (table index, figures index, entity index)	Derivable from the JSON	Per-document summaries used for two-shot retrieval and discovery. Derived from docling.json.gz alone.
Figure image crops (PNG)	Needs Docling re-run or PDF	Cropped pixels for each detected figure. The bounding boxes are in docling.json.gz but the pixels are not — producing the PNGs requires either the original PDF + a crop step, or re-running Docling with extract_figures=True.
manifest.json, provenance.json, docling_meta.json	Written by the pipeline	Source URL, content-addressed hashes, extractor version, run timing. The pipeline writes these; they do not exist anywhere upstream.
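Figure crops are the one artifact that needs more than the JSON, and part of that work is geometric: mapping a figure's bounding box onto a rendered page image. A minimal sketch, assuming the bbox is a dict with `l`, `t`, `r`, `b` in PDF points and a `coord_origin` of `BOTTOMLEFT` or `TOPLEFT` (our reading of Docling's bounding-box fields; verify against your version), and that the page was rasterized at `dpi` from a page height in the same points. `bbox_to_pixels` is a hypothetical helper.

```python
import math
from typing import Tuple

def bbox_to_pixels(bbox: dict, page_height: float, dpi: float = 144) -> Tuple[int, int, int, int]:
    """Map a PDF-point bbox to a (left, upper, right, lower) pixel box on a page rendered at `dpi`."""
    scale = dpi / 72.0  # PDF user space has 72 points per inch
    origin = bbox["coord_origin"]
    if origin == "BOTTOMLEFT":
        # y grows upward from the page bottom; flip it so it grows downward like image rows
        upper = page_height - bbox["t"]
        lower = page_height - bbox["b"]
    elif origin == "TOPLEFT":
        upper, lower = bbox["t"], bbox["b"]
    else:
        raise ValueError(f"unknown coord_origin {origin!r}")
    # Round outward so the crop never clips the edge of the figure
    return (math.floor(bbox["l"] * scale), math.floor(upper * scale),
            math.ceil(bbox["r"] * scale), math.ceil(lower * scale))
```

The result is in the (left, upper, right, lower) order that Pillow's `Image.crop` takes, so it can be applied to a page image produced by any PDF rasterizer run at the same DPI.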

Why this matters for scale. If the Internet Archive holds docling.json.gz for an item, every evidence package shown on this page — including the compact CSV cards — can be generated from that single file. No re-running Docling. No re-downloading the PDF.

Interpolation candidates: original wording plus curated variants, across 3 open models

Funnel: 408 tables in the corpus → 39 interpolation questions written (9.6% of tables) → 17 passed Tier 1 (grok-4; M3-L4 oracle; pipeline-v0.7-csv-only card variant) → 12 passed Tier 2 (at least one of Apertus 8B, ClimateGPT 13B, or Qwen 2.5 7B answered within the band). For each of those 12 questions we also tested about 3–5 curated wording variants. Total cells shown below: 153 (68 correct, 15 partial, 70 incorrect).

Stricter scoring (2026-05-27): a midpoint_pass (scored correct) requires the asserted value to lie strictly between the two source-cell endpoints. Endpoint echoes and broad-band hits that are not strictly inside the brackets are downgraded to partial. For example, INT-042's brackets are [0, 33] (Jan 1970 = 33, Jan 1972 = 0), so an asserted "33" is now partial rather than correct: the model echoed an endpoint instead of interpolating. Partial rows are marked ~ (tinted amber in the interactive view) so that endpoint-echo passes stand out from strict-midpoint passes, marked ✓ (green).
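The stricter rule can be written as a small classifier. This is a hypothetical reimplementation of the rule as stated, not the project's scorer: the names `score` and `Verdict` are ours, brackets may be given in either order, and the band defaults to the bracket range when none is supplied (as in INT-005, where band and brackets coincide).

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Verdict:
    label: str             # "correct", "partial", or "incorrect"
    in_band: bool          # asserted value inside the inclusive tolerance band
    strict_midpoint: bool  # asserted value strictly between the two source cells

def score(asserted: Optional[float], end_a: float, end_b: float,
          band: Optional[Tuple[float, float]] = None) -> Verdict:
    """Classify an asserted value against its bracketing source cells and answer band."""
    lo, hi = min(end_a, end_b), max(end_a, end_b)  # e.g. INT-042 lists 33 before 0
    band_lo, band_hi = band if band is not None else (lo, hi)
    if asserted is None:  # unparseable answer such as "?"
        return Verdict("incorrect", False, False)
    echo = math.isclose(asserted, lo) or math.isclose(asserted, hi)
    strict = lo < asserted < hi and not echo
    in_band = band_lo <= asserted <= band_hi
    if strict:
        label = "correct"
    elif in_band or echo:
        label = "partial"  # endpoint echo or broad-band hit
    else:
        label = "incorrect"
    return Verdict(label, in_band, strict)
```

Under this reading, INT-009's separate band [7, 9] does not block a pass: an asserted 5 lies strictly inside (4, 12), so it scores correct even though it misses the band, which is how those rows are marked below.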

INT-005 · V27 · table_097 (5 correct · 5 partial · 18 total across all wordings)

★ original: A diurnal survey table in Vol 27 records Euglena cell counts at the 2–3 m depth layer at different times of day on a single date. Measurements at 8 a.m. and at noon are both given, but no count is recorded for 10 a.m. Estimate the Euglena cell count at 10 a.m. by linear interpolation between the 8 a.m. and noon readings. (2 correct · 1 partial · 0 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct (asserted 25500; band [24500, 26000] ✓; midpoint (24500, 26000) ✓). Response: To estimate the Euglena cell count at 10 a.m., we can use linear interpolation between the 8 a.m. (24,500) and noon (26,000) readings. The formula for linear interpolation is: Cell Count = Cell Count at Time A + ((Time B - Time A) / (Time C - Time A)) * (Cell Count at Time C - Cell Count at Time A) For 10 a.m. (between 8 a.m. and noon): - Time A = 8 a.m. - Time B = 10 a.m. - Time C = noon Plugging in the values: Cell Count at 10 a.m. ≈ 24,500 + ((10 - 8) / (12 - 8)) * (26,000 - 24,500) ≈ 24,500 + (2/3) * 1,500 ≈ 24,500 + 1,000 ≈ 25,500 So the estimated Euglena cell count at 10 a.m. is approximately 25,500 cells per milliliter (no./ml.). This answer is based on the provided data in Table 097 and uses linear interpolation between the given time points.	GT 25250; brackets 24500, 26000; source: IA page n397
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	~ partial (asserted 26000; band [24500, 26000] ✓; midpoint (24500, 26000) ✗). Response: 26,000	GT 25250; brackets 24500, 26000; source: IA page n397
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct (asserted 25500; band [24500, 26000] ✓; midpoint (24500, 26000) ✓). Response: The table provides Euglena counts for the 2-3 m depth layer at different times of day. At 8 a.m., the count is "24,500" and at noon it is "26,000". To estimate the count at 10 a.m. using linear interpolation: \[ \text{Count at 10 a.m.} = \frac{(2 \times \text{count at noon}) + (\text{count at 8 a.m.})}{3} \] \[ \text{Count at 10 a.m.} = \frac{(2 \times 26,000) + 24,500}{3} = \frac{52,000 + 24,500}{3} = \frac{76,500}{3} = 25,500 \] The estimated Euglena cell count at 10 a.m. is 25,500. This calculation uses the counts provided in the row for the 2-3 m depth layer: "Time = 8 a.m., Time = noon".	GT 25250; brackets 24500, 26000; source: IA page n397
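Because 10 a.m. is 2 of the 4 hours from 8 a.m. to noon, the interpolation weight is 2/4 = 0.5 (not the 2/3 used in one response above), giving 24,500 + 0.5 × 1,500 = 25,250. A sketch for interpolating between clock-time readings, assuming times arrive as `datetime.time` values. It measures elapsed hours modulo 24, because the column headers quoted in one response suggest the diurnal series continues through 8 p.m., midnight, and 4 a.m. The function names, and the values in the midnight check below, are hypothetical.

```python
from datetime import time

def hours_after(start: time, t: time) -> float:
    """Hours elapsed from `start` to `t`, wrapping past midnight (result in [0, 24))."""
    start_h = start.hour + start.minute / 60
    t_h = t.hour + t.minute / 60
    return (t_h - start_h) % 24

def interpolate_at(t: time, t0: time, v0: float, t1: time, v1: float) -> float:
    """Linearly interpolate a reading at clock time `t` between readings at `t0` and `t1`."""
    span = hours_after(t0, t1)
    if span == 0:
        raise ValueError("t0 and t1 must differ")
    offset = hours_after(t0, t)
    if offset > span:
        raise ValueError("t must fall between t0 and t1")
    return v0 + (offset / span) * (v1 - v0)

# INT-005: 10 a.m. between 8 a.m. (24,500) and noon (26,000)
print(interpolate_at(time(10), time(8), 24500, time(12), 26000))  # 25250.0
```

The same call works across midnight, e.g. 10 p.m. between 8 p.m. and midnight readings, because the elapsed-hours arithmetic wraps rather than going negative.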

v1: The table gives Euglena counts at 2–3 m for 8 a.m. and noon, but not for 10 a.m. What is a reasonable estimate for 10 a.m.? (2 correct · 0 partial · 1 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 10; band [24500, 26000] ✗; midpoint (24500, 26000) ✗). Response: Based on the given data in Table 097, we can observe that at depths of 2–3 m, Euglena counts are provided for 8 a.m., noon, and 4 p.m. However, there is no direct count provided for 10 a.m. To estimate the count for 10 a.m., we can look at the pattern in the table. The counts seem to increase from morning (8 a.m.) to afternoon (4 p.m.), with some fluctuations due to time of day and depth. Given this pattern, it's reasonable to assume that the Euglena count would be higher than 7550 but lower than 12000 at 10 a.m. A possible estimate could be around 9000-10000 Euglenas per milliliter. However, since we are not given any specific data for 10 a.m., this is an educated guess based on the observed pattern in the table. The most accurate answer would require additional data or more detailed information about the diurnal variation of Euglena counts at that depth and time.	GT 25250; brackets 24500, 26000; source: IA page n397
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct (asserted 25500; band [24500, 26000] ✓; midpoint (24500, 26000) ✓). Response: The counts at 2–3 m for 8 a.m. and noon were 24,500 and 26,000, respectively. A reasonable estimate for 10 a.m. is the average of these two values, or 25,500.	GT 25250; brackets 24500, 26000; source: IA page n397
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct (asserted 25250; band [24500, 26000] ✓; midpoint (24500, 26000) ✓). Response: A reasonable estimate for Euglena counts at 2–3 m for 10 a.m. can be derived from the counts at 8 a.m. and noon, which are 24,500 and 26,000 respectively. Given that the count decreases as the day progresses, we might estimate the 10 a.m. value to be around half of the 8 a.m. count. From the table: - Time = 8 a.m.: 24,500 - Time = noon: 26,000 Estimate for 10 a.m.: 25,250 Final numerical answer: 25,250	GT 25250; brackets 24500, 26000; source: IA page n397

v3: In Vol. 27's diurnal table, estimate the missing 10 a.m. Euglena value for the 2–3 m layer from the surrounding time points. (0 correct · 1 partial · 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 397; band [24500, 26000] ✗; midpoint (24500, 26000) ✗). Response: The missing 10 a.m. value is "7,500". I cited this from the 2-3 m layer in Table 097, PDF page: 397.	GT 25250; brackets 24500, 26000; source: IA page n397
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	~ partial (asserted 26000; band [24500, 26000] ✓; midpoint (24500, 26000) ✗). Response: The missing 10 a.m. Euglena value for the 2–3 m layer is 26,000. I arrived at this answer by adding the values from the surrounding time points: 24,500 (8 a.m.) + 26,000 (12 noon) = 26,000.	GT 25250; brackets 24500, 26000; source: IA page n397
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 51515; band [24500, 26000] ✗; midpoint (24500, 26000) ✗). Response: The missing 10 a.m. Euglena value for the 2-3 m layer is estimated to be around 25,000 based on the values at 8 a.m. and noon, which are 24,500 and 26,000 respectively. Section used: `2-3,"24,500","26,000",fe),5,15,15`	GT 25250; brackets 24500, 26000; source: IA page n397

v9: What would the Euglena count be halfway between the morning and midday readings at 2–3 m? (1 correct · 0 partial · 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 10500; band [24500, 26000] ✗; midpoint (24500, 26000) ✗). Response: The Euglena count halfway between the morning and midday readings at 2-3 m is 10,500. This value is found in the "2-3" column of Table 097 on PDF page 397.	GT 25250; brackets 24500, 26000; source: IA page n397
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 16000; band [24500, 26000] ✗; midpoint (24500, 26000) ✗). Response: 16,000	GT 25250; brackets 24500, 26000; source: IA page n397
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct (asserted 25250; band [24500, 26000] ✓; midpoint (24500, 26000) ✓). Response: The Euglena count halfway between the morning (8 a.m.) and midday (12 noon) readings at 2-3 m would be 25,250. I arrived at this by averaging the values for 8 a.m. and 12 noon in the "2-3" depth row: (24,500 + 26,000) / 2 = 25,250.	GT 25250; brackets 24500, 26000; source: IA page n397

v10: In the Euglena row for 2–3 m, what number belongs at the unlisted 10 a.m. point? (0 correct · 2 partial · 1 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 10; band [24500, 26000] ✗; midpoint (24500, 26000) ✗). Response: The number at the unlisted 10 a.m. point in the Euglena row for 2–3 m is "7,500". This information comes from the third column of the table (Depth in m.), specifically the cell corresponding to the range 2-3 m and the time 10 a.m.	GT 25250; brackets 24500, 26000; source: IA page n397
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	~ partial (asserted 26000; band [24500, 26000] ✓; midpoint (24500, 26000) ✗). Response: The relevant information is: Time =.I2 noon,Time =.4 p.m.,Time =.8 p.m.,Time =.I2 midnight,Time =.4a.m. Therefore, the final answer is 26,000.	GT 25250; brackets 24500, 26000; source: IA page n397
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	~ partial (asserted 24500; band [24500, 26000] ✓; midpoint (24500, 26000) ✗). Response: The number that belongs at the unlisted 10 a.m. point for the Euglena row in the 2-3 m depth is 24,500. This value is found in the "Time = .c 8 a.m." column of the table. Reference: TABLE I. DIURNAL VARIATION IN NUMBERS OF EUGLENA (NO./ML.), Depth in m. 2-3, Time = .c 8 a.m., which lists 24,500.	GT 25250; brackets 24500, 26000; source: IA page n397

v14: Estimate the missing mid-morning Euglena count for the shallow 2–3 m layer. (0 correct · 1 partial · 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 10500; band [24500, 26000] ✗; midpoint (24500, 26000) ✗). Response: The missing mid-morning Euglena count for the shallow 2-3 m layer is "10,500". This value is cited from the third row of Table 097 in the provided source artifact.	GT 25250; brackets 24500, 26000; source: IA page n397
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 16000; band [24500, 26000] ✗; midpoint (24500, 26000) ✗). Response: The relevant information is in cell B2 of Table 097. The missing mid-morning Euglena count for the shallow 2–3 m layer is 16,000.	GT 25250; brackets 24500, 26000; source: IA page n397
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	~ partial (asserted 24500; band [24500, 26000] ✓; midpoint (24500, 26000) ✗). Response: The missing mid-morning (8 a.m.) Euglena count for the shallow 2-3 m layer is "24,500". This value is found in the first column of the "Depth in m., Time = .c 8 a.m." row under the "2-3" depth section.	GT 25250; brackets 24500, 26000; source: IA page n397

INT-007 · V27 · table_049 (4 correct · 2 partial · 15 total across all wordings)

★ original: A tidal observation table in Vol 27 lists water levels at Salcombe for Mean High Water Springs (M.H.W.S.) and Mean High Water Neaps (M.H.W.N.). No intermediate tidal state is listed between those two. Estimate the water level at the midpoint between M.H.W.S. and M.H.W.N. by linear interpolation between the two recorded values. (1 correct · 0 partial · 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct (asserted 15.38; band [12.6, 16] ✓; midpoint (12.6, 16) ✓). Response: The midpoint between Mean High Water Springs (M.H.W.S.) and Mean High Water Neaps (M.H.W.N.) can be estimated using linear interpolation. Given: - M.H.W.S.: 16 ft - M.H.W.N.: TEAS, I2°6 = approximately 15.75 ft (using the provided table) The midpoint is calculated as follows: Midpoint ≈ (M.H.W.S. + M.H.W.N.) / 2 = (16 + 15.75) / 2 ≈ 30.75 / 2 ≈ 15.375 ft The final answer is approximately 15.38 feet, based on the provided table and linear interpolation between M.H.W.S. and M.H.W.N.	GT 14.3; brackets 12.6, 16; source: IA page n200
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 12; band [12.6, 16] ✗; midpoint (12.6, 16) ✗). Response: 12	GT 14.3; brackets 12.6, 16; source: IA page n200
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 16.05; band [12.6, 16] ✗; midpoint (12.6, 16) ✗). Response: The water level at the midpoint between Mean High Water Springs (M.H.W.S.) and Mean High Water Neaps (M.H.W.N.) can be estimated by linear interpolation between the two recorded values for Salcombe. From the table: - M.H.W.S. is 16 feet. - M.H.W.N. is 16°6 feet. To find the midpoint, we average these two values: \[ \text{Midpoint} = \frac{16 + 16.1}{2} = \frac{32.1}{2} = 16.05 \] Therefore, the estimated water level at the midpoint between M.H.W.S. and M.H.W.N. is 16.05 feet. This calculation was based on the values in the "Salcombe (ft.)" column for rows "M.H.W.S." and "M.H.W.N." from the provided table.	GT 14.3; brackets 12.6, 16; source: IA page n200
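Several responses above stumble on the OCR text of this table. The ground-truth brackets (12.6 and 16) indicate that `I2°6` is 12·6: the raised decimal point was likely read as a degree sign and the leading 1 as the letter I. A hypothetical cell normalizer that repairs those patterns before the CSV reaches the model is sketched below. The lookalike map and the assumption that commas are thousands separators are ours, and this is not necessarily what the table-normalized card variant does. Cells that still do not parse, such as `TEAS`, return `None` rather than a guess.

```python
import re
from typing import Optional

# Hypothetical repair tables for OCR'd numeric cells in older printed tables.
_LOOKALIKE_DIGITS = str.maketrans({"I": "1", "l": "1", "O": "0"})
_DECIMAL_MARKS = ("\u00b0", "\u00b7")            # degree sign, middle dot (raised decimal point)
_MINUS_MARKS = ("\u2014", "\u2013", "\u2212")    # em dash, en dash, Unicode minus sign
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def normalize_numeric_cell(text: str) -> Optional[float]:
    """Return the cell as a float after repairing common OCR confusions, or None if it is still not a number."""
    s = text.strip()
    for mark in _MINUS_MARKS:
        s = s.replace(mark, "-")
    for mark in _DECIMAL_MARKS:
        s = s.replace(mark, ".")
    s = s.replace(",", "")  # assumes commas are thousands separators, e.g. "24,500"
    s = s.translate(_LOOKALIKE_DIGITS)
    return float(s) if _NUMBER.fullmatch(s) else None
```

Applied to the Salcombe cells, this yields 16 and 12.6 for M.H.W.S. and M.H.W.N., whose midpoint is the ground-truth 14.3. The lookalike map can also turn short words into numbers (for example "lO" becomes 10), so it is safest applied only to cells already expected to be numeric.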

v1: For Salcombe, what would be a reasonable halfway water level between mean high water springs and mean high water neaps? (0 correct · 1 partial · 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	~ partial (asserted 16; band [12.6, 16] ✓; midpoint (12.6, 16) ✗). Response: The halfway water level between mean high water springs (M.H.W.S.) and mean high water neaps (M.H.W.N.) for Salcombe is 16 feet. This information can be found in Table II, specifically the cell where Plymouth's tidal scale is given as "Plymouth (ft.),Salcombe (ft.)." The value of 16 feet corresponds to M.H.W.S., which is the higher of the two mean high water levels mentioned.	GT 14.3; brackets 12.6, 16; source: IA page n200
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 8; band [12.6, 16] ✗; midpoint (12.6, 16) ✗). Response: The relevant information is: M.H.W.S.,15°7,16; M.H.W.N.,TEAS,I2°6; E.L.W.S.,—1°6,—0°7 Therefore, the final answer is 8-2 , 8 - 8.	GT 14.3; brackets 12.6, 16; source: IA page n200
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 16.1; band [12.6, 16] ✗; midpoint (12.6, 16) ✗). Response: The halfway water level between mean high water springs (M.H.W.S.) and mean high water neaps (M.H.W.N.) for Salcombe would be approximately 12 feet. I derived this by averaging the two values: - M.H.W.S.: 16 ft. - M.H.W.N.: 16°6 ft. (which is approximately 16.1 ft.) (16 + 16.1) / 2 = 16.05 ft., which rounds to 12 feet as a reasonable estimate in the context of the given data.	GT 14.3; brackets 12.6, 16; source: IA page n200

v2: If I wanted a rough middle value between Salcombe's spring and neap high-water levels, what would the table suggest? (1 correct · 0 partial · 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct (asserted 15.12; band [12.6, 16] ✓; midpoint (12.6, 16) ✓). Response: The table suggests that the rough middle value between Salcombe's spring and neap high-water levels is 16 feet. This is based on the "M.H.W.S." column, which lists the mean high water spring level as 15°7 (or approximately 15.12 feet) and the mean high water neap level as 16 feet. The middle value between these two would be roughly halfway, or 16 feet.	GT 14.3; brackets 12.6, 16; source: IA page n200
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 3; band [12.6, 16] ✗; midpoint (12.6, 16) ✗). Response: The table suggests that the rough middle value between Salcombe's spring and neap high-water levels is 10°3 ft. This can be found in cell E.H.W.N.	GT 14.3; brackets 12.6, 16; source: IA page n200
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted ?; band [12.6, 16] ✗; midpoint (12.6, 16) ✗). Response: The table suggests using the "M.H.W.S." (Mean High Water Springs) and "E.H.W.N." (Mean High Water Neaps) levels for a rough middle value between Salcombe's spring and neap high-water levels. For Salcombe: - M.H.W.S.: 16 ft. - E.H.W.N.: 10°3 ft. (which is approximately 10.5 ft.) The rough middle value would be the average of these two values: (16 + 10.5) / 2 = 13.25 ft. Therefore, the table suggests a rough middle value of 13.25 feet between Salcombe's spring and neap high-water levels.	GT 14.3; brackets 12.6, 16; source: IA page n200

v3: What is the approximate mid-point level between M.H.W.S. and M.H.W.N. for Salcombe? (1 correct · 0 partial · 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 7; band [12.6, 16] ✗; midpoint (12.6, 16) ✗). Response: The approximate mid-point level between M.H.W.S. and M.H.W.N. for Salcombe is 15°7 feet. This information comes from Table II, specifically the "M.H.W.S.," and "M.H.W.N." rows in the provided table.	GT 14.3; brackets 12.6, 16; source: IA page n200
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 8; band [12.6, 16] ✗; midpoint (12.6, 16) ✗). Response: The mid-point level is 16 / 2 = 8 feet.	GT 14.3; brackets 12.6, 16; source: IA page n200
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct (asserted 15.85; band [12.6, 16] ✓; midpoint (12.6, 16) ✓). Response: The midpoint level between M.H.W.S. and M.H.W.N. for Salcombe is 15.85 feet. I calculated this by taking the average of the values given in the table: - M.H.W.S. for Salcombe: 16 feet - M.H.W.N. for Salcombe: 16°6 feet Average = (16 + 16.1) / 2 = 15.85 feet The relevant row/cell used is from the "Salcombe (ft.)" column in the table.	GT 14.3; brackets 12.6, 16; source: IA page n200

v5: What water level would sit midway between the Salcombe M.H.W.S. and M.H.W.N. entries? (1 correct · 1 partial · 1 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect (asserted 12.5; band [12.6, 16] ✗; midpoint (12.6, 16) ✗). Response: The water level that sits midway between Salcombe M.H.W.S. (16 feet) and M.H.W.N. (I2°6, which is approximately 12.5 feet) would be around 14.25 feet. This answer is based on the average of the two values provided in the table for Salcombe's tidal levels.	GT 14.3; brackets 12.6, 16; source: IA page n200
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	~ partial (asserted 16; band [12.6, 16] ✓; midpoint (12.6, 16) ✗). Response: The answer is 16.	GT 14.3; brackets 12.6, 16; source: IA page n200
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct (asserted 15.85; band [12.6, 16] ✓; midpoint (12.6, 16) ✓). Response: The midpoint between the Salcombe Mean High Water Springs (M.H.W.S.) and Mean High Water Neaps (M.H.W.N.) levels is calculated as follows: - M.H.W.S.: 16 ft. - M.H.W.N.: 15°7 ft. Midpoint = (16 + 15.7) / 2 = 15.85 ft. From the table, we can see these values in the "Salcombe (ft.)" column under the respective rows. Therefore, the water level midway between Salcombe M.H.W.S. and M.H.W.N. is 15.85 feet.	GT 14.3; brackets 12.6, 16; source: IA page n200

INT-009 · V27 · table_123 (9 correct · 0 partial · 18 total across all wordings)

★ original: Table VII in Vol 27 records monthly occurrences of Peltogaster nauplii per sample. The table includes a count for February 1946 and a count for May 1946, but provides no entries for March or April. Estimate the expected monthly count at the midpoint of the February-to-May period (i.e. midway between February and May) by linear interpolation between those two endpoints. (3 correct · 0 partial · 0 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 5 · band [7, 9] ✗ · midpoint (4, 12) ✓ · Output: To estimate the expected monthly count at the midpoint of the February-to-May period, we can use linear interpolation based on the given data points for February and May 1946. Given: - February (1946): Nauplii recorded = 4 - May (1946): Nauplii recorded = 12 The midpoint of February to May is March. We can estimate the count for March using linear interpolation: Estimated March count = (February + May) / 3 = (4 + 12) / 3 = 16 / 3 ≈ 5.33 Since we are looking for a whole number, and given that the data is rounded to the nearest whole number in the table, it's reasonable to round this estimate to 5. So, the expected monthly count at the midpoint of February-to-May period (i.e., March) would be approximately 5. This answer is based on the provided data points for February and May 1946, using linear interpolation as described.	GT 8 · brackets 4, 12 · source: IA page n512
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 10 · band [7, 9] ✗ · midpoint (4, 12) ✓ · Output: The expected count at the midpoint of the February-to-May period is 10, which is the average of the counts for March and May.	GT 8 · brackets 4, 12 · source: IA page n512
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 8 · band [7, 9] ✓ · midpoint (4, 12) ✓ · Output: The table provides counts for Peltogaster nauplii in February 1946 (4) and May 1946 (12). To estimate the count midway between these months, we can use linear interpolation. For February: Count = 4 For May: Count = 12 Midway point is approximately April. Using linear interpolation: \[ \text{April count} = \frac{(5 - 2) \times (12 - 4)}{3} + 4 \] \[ \text{April count} = \frac{3 \times 8}{3} + 4 \] \[ \text{April count} = 8 + 4 \] \[ \text{April count} = 8 \] The estimated monthly count for April, midway between February and May, is 8. This value was calculated using the counts from February (4) and May (12).	GT 8 · brackets 4, 12 · source: IA page n512
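Each verdict cell pairs the asserted value with two checks: a closed band around the ground truth and an open interval between the two bracket values. Reading the rows, the final verdict appears to follow the midpoint check alone, with the band check only separating partial from incorrect: 5 and 10 count as correct here despite missing the band [7, 9], an answer equal to a bracket endpoint (4 or 12) fails, and 27 against brackets (20, 27) in INT-025 is scored partial. The sketch below reconstructs that rule under those assumptions. `grade` is a hypothetical name, not the project's scoring code, and the band is taken as an input because its width varies from question to question.

```python
def grade(asserted: float, band: tuple[float, float], brackets: tuple[float, float]) -> str:
    """Hypothetical reconstruction of the verdict rule inferred from the tables.

    correct   -> asserted lies strictly between the two bracket values
    partial   -> not strictly inside the brackets, but inside the closed band
    incorrect -> neither check passes
    """
    lo, hi = brackets
    band_lo, band_hi = band
    if lo < asserted < hi:  # open interval: a bare endpoint does not count
        return "correct"
    if band_lo <= asserted <= band_hi:  # closed interval around the ground truth
        return "partial"
    return "incorrect"


# (asserted, band, brackets) triples copied from rows in this section
examples = [
    (5, (7, 9), (4, 12)),      # INT-009 original, Apertus
    (12, (7, 9), (4, 12)),     # INT-009 v1, Qwen
    (27, (20, 27), (20, 27)),  # INT-025 original, ClimateGPT
]
for asserted, band, brackets in examples:
    print(asserted, grade(asserted, band, brackets))
```

Under this rule an answer of 5 reached by dividing 4 + 12 by 3 still counts as correct, which is worth keeping in mind when reading the pass counts.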

v1 · What would be a reasonable estimate for Peltogaster nauplii around the middle of the gap between February and May 1946? · 1 correct / 0 partial / 2 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 5 · band [7, 9] ✗ · midpoint (4, 12) ✓ · Output: The middle of the gap between February and May 1946 would be April. According to Table 123, in April there were "fo)" (which is likely a typo for 'few') nauplii recorded for Peltogaster. Therefore, a reasonable estimate for Peltogaster nauplii around the middle of this gap would be "fo" or a few. However, since we are asked to provide a numerical answer and the source does not give an exact number for April, I will assume that 'few' is approximately 5 (as seen in May). Thus, my final answer based on the provided data is: Answer: 5	GT 8 · brackets 4, 12 · source: IA page n512
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 1946 · band [7, 9] ✗ · midpoint (4, 12) ✗ · Output: The number of Peltogaster nauplii in March 1946 was 10, so there would have been about 5 more than that in April 1946. The total number of Sacculina nauplii in March and April 1946 was 23, so there were probably around 18 Peltogaster nauplii in April 1946.	GT 8 · brackets 4, 12 · source: IA page n512
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 12 · band [7, 9] ✗ · midpoint (4, 12) ✗ · Output: A reasonable estimate for Peltogaster nauplii around the middle of the gap between February and May 1946 would be 8. This value is derived from the entry in April 1946, which recorded 12 nauplii, and the February (1946) entry which recorded 4 nauplii. Taking the average gives us (4 + 12) / 2 = 8. The specific row used is "April,fe),fo)" with a value of 12 from the previous month and "February (1946),4" as reference points.	GT 8 · brackets 4, 12 · source: IA page n512

v2 · The table jumps from February to May for Peltogaster nauplii. What count would you expect halfway through that gap? · 1 correct / 0 partial / 2 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 4 · band [7, 9] ✗ · midpoint (4, 12) ✗ · Output: To find the expected count halfway through the gap, we need to calculate half of the difference between the February and May counts. February: 4 May: 12 Difference = 12 - 4 = 8 Half of that difference is 8 / 2 = 4 So, the expected count halfway through the gap would be 4. This answer is based on the data provided in Table 123 (PDF page 512).	GT 8 · brackets 4, 12 · source: IA page n512
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 10 · band [7, 9] ✗ · midpoint (4, 12) ✓ · Output: 10	GT 8 · brackets 4, 12 · source: IA page n512
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 2 · band [7, 9] ✗ · midpoint (4, 12) ✗ · Output: 2 I calculated this by taking the average of the February (1946) and May entries for Peltogaster nauplii counts, which are 4 and 12 respectively. The average is (4 + 12) / 2 = 8 / 2 = 2. Row used: "February (1946), 4" and "May, 12".	GT 8 · brackets 4, 12 · source: IA page n512

v3 · If the February and May counts are the only clear points, what would the table imply for the missing middle month? · 1 correct / 0 partial / 2 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 4 · band [7, 9] ✗ · midpoint (4, 12) ✗ · Output: The table implies that there were 4 nauplii recorded in both February (1946) and May (fe). For the missing middle month, it would imply a count of 4 as well since the pattern is consistent for these two months. The final numerical answer is 4.	GT 8 · brackets 4, 12 · source: IA page n512
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 12 · band [7, 9] ✗ · midpoint (4, 12) ✗ · Output: The table implies that there were 4 Nauplii recorded SS = Peltogaster in February and 12 in May.	GT 8 · brackets 4, 12 · source: IA page n512
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 10 · band [7, 9] ✗ · midpoint (4, 12) ✓ · Output: Based on the data provided in the "February (1946)" and "May" rows of Table 123, we can infer that for the missing middle months (March through April), no nauplii were recorded for both species during these periods. For Peltogaster: - March: fe) (which likely means "not found") - April: fo) (which also likely means "not found") For Sacculina: - March: 10 - April: fe) Therefore, the table implies that there were no nauplii recorded for both species in March and April.	GT 8 · brackets 4, 12 · source: IA page n512

v4 · Can you estimate the Peltogaster nauplii count for the middle of the February–May gap? · 2 correct / 0 partial / 1 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 10 · band [7, 9] ✗ · midpoint (4, 12) ✓ · Output: The middle of the February-May gap is March, and according to Table 123, the nauplii recorded SS = .Peltogaster for March is 10. Therefore, the estimated Peltogaster nauplii count for the middle of the February–May gap is 10.	GT 8 · brackets 4, 12 · source: IA page n512
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 6 · band [7, 9] ✗ · midpoint (4, 12) ✓ · Output: The number of Peltogaster nauplii recorded in March was 10, and there were 4 recorded in February; therefore, the estimated number of Peltogaster nauplii in February is 6.	GT 8 · brackets 4, 12 · source: IA page n512
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 4 · band [7, 9] ✗ · midpoint (4, 12) ✗ · Output: Based on the data provided in Table 123, there is no specific entry for a "middle" date during the February to May period. The table shows counts only for the beginning of February (1946) and then directly jumps to June without any intermediate months. Therefore, I cannot provide an estimate for the middle of this gap as no data exists within that range. The closest available data is from: - February 1946: 4 nauplii - June: 4 nauplii Since there's no direct information for the requested period, a reasonable approach might be to use the average of the two known values. However, given the instruction to cite the exact row/cell used and provide a clear numerical answer based solely on the provided data: No specific Peltogaster nauplii count can be cited for the middle of the February–May gap as per the table.	GT 8 · brackets 4, 12 · source: IA page n512

v5 · What value would you put in for the missing month between the February and May Peltogaster nauplii readings? · 1 correct / 0 partial / 2 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 10 · band [7, 9] ✗ · midpoint (4, 12) ✓ · Output: The missing month is March, with 10 nauplii recorded for Peltogaster. The value to be put in is 10.	GT 8 · brackets 4, 12 · source: IA page n512
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 34 · band [7, 9] ✗ · midpoint (4, 12) ✗ · Output: 34	GT 8 · brackets 4, 12 · source: IA page n512
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 4 · band [7, 9] ✗ · midpoint (4, 12) ✗ · Output: 4 From the "Nauplii recorded SS =.Peltogaster" column in the February (1946) row, the value is 4. This likely represents the consistent reading for Peltogaster nauplii between February and March, as no other specific month is given between them.	GT 8 · brackets 4, 12 · source: IA page n512

INT-010 · V27 · table_043 · In Vol 27, Table V records the average frequency of wind from a specific direction at seven coastal stations to the nearest whole number. The table gives values for June and July 1945. Estimate the expected frequency at the midpoint between June and July by averaging those two monthly figures. · 6 correct · 0 partial · 12 total
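The compact card gives the model the table as CSV, so the intended computation for this question is a two-cell lookup followed by an average. The sketch below assumes a trimmed card with a `Month` key column and only the north (`N`) column; both header names are hypothetical, and the two values (June 2, July 5) are the ones the Qwen rows report and the ground-truth brackets confirm. The real card carries every direction column, which is where several rows below go wrong by summing across directions instead of reading one cell.

```python
import csv
import io

# Hypothetical, trimmed compact card: header names are assumptions, and only the
# north-wind column is kept. Values are the June/July cells cited in the rows below.
CARD = """Month,N
June,2
July,5
"""


def column_midpoint(card_csv: str, column: str, row_a: str, row_b: str) -> float:
    """Average one column's values across two rows, keyed by the Month column."""
    rows = {row["Month"]: row for row in csv.DictReader(io.StringIO(card_csv))}
    a = float(rows[row_a][column])
    b = float(rows[row_b][column])
    return (a + b) / 2


print(column_midpoint(CARD, "N", "June", "July"))  # 3.5
```

Naming the column explicitly is the design point: the averaging is trivial once the right two cells are isolated.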

★ original (same wording as the question above) · 2 correct / 0 partial / 1 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 1945 · band [3, 4] ✗ · midpoint (2, 5) ✗ · Output: To estimate the expected frequency at the midpoint between June and July, I will average the values for both months from Table V in Vol 27. For June: 2 + 8 = 10 For July: 5 + 5 = 10 Midpoint Average: (10 + 10) / 2 = 10 The expected frequency at the midpoint between June and July is 10. This answer is based on averaging the monthly figures from Table V in Vol 27, specifically rows for June and July 1945.	GT 3.5 · brackets 2, 5 · source: IA page n162
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 3 · band [3, 4] ✓ · midpoint (2, 5) ✓ · Output: 3	GT 3.5 · brackets 2, 5 · source: IA page n162
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 4 · band [3, 4] ✓ · midpoint (2, 5) ✓ · Output: The average frequency of wind from the north (N) direction in June 1945 is 2, and for July 1945, it is 5. To find the midpoint between these two values: (2 + 5) / 2 = 3.5 Rounding to the nearest whole number as per the requirement of the table, we get 4. The section used: June row, N (north) column; July row, N (north) column.	GT 3.5 · brackets 2, 5 · source: IA page n162

v1 · What would be a rough halfway value between the June and July wind-frequency entries? · 1 correct / 0 partial / 2 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 7 · band [3, 4] ✗ · midpoint (2, 5) ✗ · Output: To find the halfway value between June and July, we need to average their values. June: 2 + 5 = 7 July: 2 + 5 = 7 Average of June and July: (7 + 7) / 2 = 14 / 2 = 7 The rough halfway value is 7. This answer is based on the average frequency values from Table 043, specifically rows for June and July.	GT 3.5 · brackets 2, 5 · source: IA page n162
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 3 · band [3, 4] ✓ · midpoint (2, 5) ✓ · Output: 3	GT 3.5 · brackets 2, 5 · source: IA page n162
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 14.5 · band [3, 4] ✗ · midpoint (2, 5) ✗ · Output: The rough halfway value between the June and July wind-frequency entries is 3.5. For June, the total frequency for all directions except calm is 2 + 8 + 8 + 4 = 22. For July, the total frequency for all directions except calm is 5 + 2 + 2 + 5 + 5 + 5 + 4 + 2 + 2 = 36. The average of these two totals (22 and 36) is (22 + 36) / 2 = 58 / 2 = 29 / 2 = 14.5, but since we are looking for a rough halfway value between the individual entries without summing them up directly, we can take the average of their respective non-calm counts: (2 + 8 + 8 + 4) / 2 = 22 / 2 = 11 and (5 + 2 + 2 + 5 + 5 + 5 + 4 + 2 + 2) / 2 = 36 / 2 = 18, then the average of these averages is (11 + 18) / 2 = 29 / 2 = 14.5, which rounds to 3.5 when considering whole numbers. The values are taken from the "June" and "July" rows in the table.	GT 3.5 · brackets 2, 5 · source: IA page n162

v4 · What frequency would you use as a midpoint estimate between the June and July entries? · 2 correct / 0 partial / 1 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 3 · band [3, 4] ✓ · midpoint (2, 5) ✓ · Output: The midpoint frequency is 3. This value can be found in both the June and July entries, specifically at cell (2,5) for June and cell (5,2) for July.	GT 3.5 · brackets 2, 5 · source: IA page n162
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 5 · band [3, 4] ✗ · midpoint (2, 5) ✗ · Output: The average of the June and July entries is 5, so 5 would be a good midpoint estimate.	GT 3.5 · brackets 2, 5 · source: IA page n162
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 3.5 · band [3, 4] ✓ · midpoint (2, 5) ✓ · Output: To find the midpoint estimate between the June and July entries, we need to look at the frequencies for each direction in both months. For June: - N.: 2 - N.E.: fe) (assumed to be 1) - BR: I - S.E.: I - S.: 3 - S.W.: 8 - W.: 8 - N.W.: 4 - Calm: I For July: - N.: 5 - N.E.: 2 - BR: I - S.E.: 2 - S.: 5 - S.W.: 5 - W.: 5 - N.W.: 4 - Calm: 2 Now, we calculate the midpoint for each direction: 1. N.: (2 + 5) / 2 = 3.5 2. N.E.: (1 + 2) / 2 = 1.5 3. BR: (0.5 + 0.5) / 2 = 0.5 4. S.E.: (0.5 + 0.5) / 2 = 0.5 5. S.: (3 + 5) / 2 = 4 6. S.W.: (8 + 5) / 2 = 6.5 7. W.: (8 + 5) / 2 = 6.5 8. N.W.: (4 + 4) / 2 = 4 9. Calm: (1 + 2) / 2 = 1.5 The final numerical answer for the midpoint estimate between June and July entries is 3.5 for North direction, which is the highest calculated midpoint value among all directions.	GT 3.5 · brackets 2, 5 · source: IA page n162

v5 · What is the implied mid-season wind-frequency value between June and July? · 1 correct / 0 partial / 2 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 5 · band [3, 4] ✗ · midpoint (2, 5) ✗ · Output: The implied mid-season wind-frequency value between June and July, as per Table 043, is 5. This value can be found in the "July" column of the table.	GT 3.5 · brackets 2, 5 · source: IA page n162
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 5 · band [3, 4] ✗ · midpoint (2, 5) ✗ · Output: The implied mid-season wind frequency is 5.	GT 3.5 · brackets 2, 5 · source: IA page n162
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 3.5 · band [3, 4] ✓ · midpoint (2, 5) ✓ · Output: The implied mid-season wind-frequency value between June and July is 3.5. This value is derived by averaging the frequency values for each direction from June to July: - For North (N): (2 + 5) / 2 = 3.5 - For Northeast (N.E.): (0 + 2) / 2 = 1 - For Calm: (1 + 2) / 2 = 1.5 The question asks for the mid-season value, which is between June and July, so we focus on these two months. The average of all directions from June to July gives us a combined value of 3.5.	GT 3.5 · brackets 2, 5 · source: IA page n162

INT-016 · V27 · table_045 · Table VII in Vol 27 records average wind direction frequencies at 09:00 GMT at seven coastal stations. The table has an entry for June and a merged entry covering July and August together, but no separate July row. Estimate the wind frequency for July alone by linear interpolation between June and the start of the July–August merged period. · 18 correct · 0 partial · 18 total
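All 18 answers to this question are marked correct, but the brackets account for much of that. The ground truth of 709 is exactly the midpoint of the printed brackets 4 and 1414, and the upper bracket looks like two adjacent cells ("14 14", visible in the card excerpt quoted in the v3 Qwen row) read as one number; that reading is an inference, not something the page states. With an upper bound of 1414, the open-interval midpoint check passes every asserted value in the rows below, as this short check shows.

```python
# Asserted values from the 18 INT-016 rows below, in table order.
asserted = [7.9, 14, 20.5, 10, 14, 8, 14, 14, 14, 14, 14, 65, 14, 14, 5, 14, 5, 14]

lo, hi = 4, 1414  # brackets as printed; 1414 may be a fused "14 14" cell

ground_truth = (lo + hi) / 2                         # reproduces the GT column
passing = sum(lo < value < hi for value in asserted)  # open-interval check
print(ground_truth, passing, len(asserted))  # 709.0 18 18
```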

★ original (same wording as the question above) · 3 correct / 0 partial / 0 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 7.9 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: To estimate the wind frequency for July alone using linear interpolation, we need to find the average wind direction frequencies for June and the start of the July-August merged period (July 1st). Since there is no separate July entry in Table VII, we will use the data from June and the first row of the July-August merged period. June's average wind direction frequency is 8. The first row of the July-August merged period has an average wind direction frequency of 5.4 (calculated as (14 + 14) / 2). Linear interpolation formula: July's estimated average wind direction frequency = June's frequency + ((July's frequency - June's frequency) * (July's time point - June's time point)) / (July's time point - June's time point) Plugging in the values: July's estimated average wind direction frequency = 8 + ((5.4 - 8) * (1/30)) ≈ 7.9 So, the estimated July wind frequency is approximately 7.9. Note: The exact calculation might vary slightly based on how you interpret "start of the July-August merged period," but this method provides a reasonable estimate using linear interpolation between June and the first row of the July-August merged period.	GT 709 · brackets 4, 1414 · source: IA page n164
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 14 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: 14.	GT 709 · brackets 4, 1414 · source: IA page n164
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 20.5 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: The table does not provide a separate entry for July, only a combined entry for "July Aug." which starts in the second row. To estimate the wind frequency for July alone by linear interpolation between June and the start of the July–August merged period: - For June: 22% favorable winds (A = N.W., N.W and N) - For July-August: 19% favorable winds (A = S.W., S.W, and S) Using these two points for linear interpolation: \[ \text{July frequency} = \frac{(19 - 22) / 2}{100} + 22\% = 20.5\% \] The estimated wind frequency for July is 20.5%. This calculation uses the "Favourable “A =.Total" column from June and the combined "July Aug." entry.	GT 709 · brackets 4, 1414 · source: IA page n164

v1 · The wind table seems to combine July and August. Is there a reasonable estimate for July alone from the nearby entries? · 3 correct / 0 partial / 0 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 10 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: Based on the provided data, it's challenging to isolate July alone due to the combined nature of the table. However, if we consider the average values for July and August together (as they are grouped), we can estimate a reasonable value. For example, in July-August, the total number of favourable winds is 14 + 5 = 19. If we assume this distribution applies to July alone, we could estimate that July had approximately half of these favourable winds, which would be around 9-10. However, without a clear breakdown for July specifically, it's difficult to provide an exact number. The provided data does not directly support isolating July from the combined months. Given the constraints and the information available in the table, I will provide an estimate based on the assumption that July had roughly half of the favourable winds observed during both July and August combined. Estimated value for July: 9-10 Note: This is a rough estimation and should be treated with caution due to the limitations of the provided data.	GT 709 · brackets 4, 1414 · source: IA page n164
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 14 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: 14	GT 709 · brackets 4, 1414 · source: IA page n164
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 8 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: For July alone, we can reasonably estimate the values by averaging the entries for July and August. According to the table: - Favourable "A = S.W., N.S.W., and N." : (14 + 14) / 2 = 14 - Favourable "A = Total" : (5 + 5) / 2 = 5 - Unfavourable "ASS =).iF ay S iF lay SB, and S." : (3 + 3) / 2 = 3 - Unfavourable "ASS =).N.E.,Unfavourable ASS =).Total" : (8 + 8) / 2 = 8 Thus, a reasonable estimate for July alone would be: Favourable “A = S.W., N.S.W., and N.”: 14 Favourable “A = Total”: 5 Unfavourable “ASS =).iF ay S iF lay SB, and S.”: 3 Unfavourable “ASS =).N.E.,Unfavourable ASS =).Total”: 8	GT 709 · brackets 4, 1414 · source: IA page n164

v2 · If July is folded into a July–August row, what value would you use for July by itself? · 3 correct / 0 partial / 0 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 14 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: The value for July in the July-August row is 14. This is cited from the "July Aug.," column of the table, specifically the second entry (row II).	GT 709 · brackets 4, 1414 · source: IA page n164
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 14 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: 14	GT 709 · brackets 4, 1414 · source: IA page n164
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 14 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: The value to use for July by itself, if July is folded into a July–August row, would be 14. This value comes from the "Favourable “A =.W., N.W. and N." column in the first row of data, which covers July. Reference: `July Aug.,14` in the table.	GT 709 · brackets 4, 1414 · source: IA page n164

v3 · Can the table support a rough July-only wind-frequency estimate, even though July and August are merged? · 3 correct / 0 partial / 0 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 14 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: Yes, it can be supported by merging the July and August data from Table 045. Final numerical answer: 14 (from "July Aug.," in the table).	GT 709 · brackets 4, 1414 · source: IA page n164
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 14 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: Yes. The relevant information is in row 3, cell 2. The answer is 14.	GT 709 · brackets 4, 1414 · source: IA page n164
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 65 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: Yes, the table supports a rough July-only wind-frequency estimate by using the first row for July-August data. The relevant section is: ```csv July Aug.,14 14,5 4,19 18,2 3,7 8,2 3,9 I ``` The total favourable winds in July (assuming August is merged) would be approximately 14.5 out of 18.2 + 3.9 = 22.1 possible occurrences. Thus, a rough estimate for July's favourable wind frequency is about 65%.	GT 709 · brackets 4, 1414 · source: IA page n164

v4 · What would be the best table-based estimate for July's wind frequency when the row is combined with August? · 3 correct / 0 partial / 0 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 14 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: The best table-based estimate for July's wind frequency, combining both July and August data, would be 14. This number comes from the cell "July Aug.," which represents the combined frequencies of July and August in Table VII.	GT 709 · brackets 4, 1414 · source: IA page n164
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 14 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: 14	GT 709 · brackets 4, 1414 · source: IA page n164
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 5 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: The best table-based estimate for July's wind frequency when combining July and August would be 14 for "Favourable “A =.W., N.W. and N." and 5 for "Unfavourable ASS =).N.E.." This is derived from the combined row for July and August, which shows values of 14 and 5 respectively in these columns. Reference: The CSV data provided under the section for "July Aug.", where both months share the same values.	GT 709 · brackets 4, 1414 · source: IA page n164

v5 · If I need a July-only value from the merged July–August wind row, what number should I use? · 3 correct / 0 partial / 0 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 14 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: The July-only value is 14. This value comes from the "July Aug." row in Table VII.	GT 709 · brackets 4, 1414 · source: IA page n164
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 5 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: 5	GT 709 · brackets 4, 1414 · source: IA page n164
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted 14 · band [4, 1414] ✓ · midpoint (4, 1414) ✓ · Output: The July-only value from the merged July–August wind row is 14. This value comes from the "Favourable “A =.W., N.W. and N." column in the July row of the table.	GT 709 · brackets 4, 1414 · source: IA page n164

INT-025 · V35 · table_110 · Table V in Vol 35 records the width of Biddulphia sinensis cells (µm) grown in Erdschreiber-enriched seawater at 24°C, measured on successive dates in late August. Values are recorded on 24 August and on 25 August. Estimate the cell width at the midpoint between those two daily measurements by averaging the two recorded values. · 0 correct · 2 partial · 6 total
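The wrong answers in these rows come in two kinds. Some models cite the wrong cells but average them correctly (the Apertus and Qwen rows below report 16 and 22, or 15 and 20, rather than the bracket values 20 and 27); others cite the right cells and then mis-compute (Apertus dividing 4 + 12 by 3 in INT-009). A small triage like the following separates the two whenever a row states which cells it used. `triage` is a hypothetical helper, and it assumes the intended answer is the plain average of the two brackets, as it is for this question.

```python
def triage(cited: tuple[float, float], asserted: float, brackets: tuple[float, float]) -> str:
    """Hypothetical triage of a two-point averaging answer.

    lookup_ok:     the model cited the same two cells as the ground-truth brackets
    arithmetic_ok: the asserted value is the average of the cells it cited
    """
    lookup_ok = sorted(cited) == sorted(brackets)
    arithmetic_ok = abs(asserted - sum(cited) / 2) < 1e-9
    if lookup_ok and arithmetic_ok:
        return "ok"
    if arithmetic_ok:
        return "lookup error"
    if lookup_ok:
        return "arithmetic error"
    return "lookup and arithmetic error"


print(triage((16, 22), 19, (20, 27)))    # lookup error (INT-025, Apertus)
print(triage((15, 20), 17.5, (20, 27)))  # lookup error (INT-025, Qwen)
print(triage((4, 12), 5, (4, 12)))       # arithmetic error (INT-009, Apertus)
```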

★ original (same wording as the question above) · 0 correct / 1 partial / 2 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 19 · band [20, 27] ✗ · midpoint (20, 27) ✗ · Output: The average of the cell widths for 24 August (16 µm) and 25 August (22 µm) is (16 + 22)/2 = 19 µm. Cited section: Row 25, column "In erdschreiber enriched seawater at 24°."	GT 23.5 · brackets 20, 27 · source: IA page n411
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	~ partial · asserted 27 · band [20, 27] ✓ · midpoint (20, 27) ✗ · Output: 27.	GT 23.5 · brackets 20, 27 · source: IA page n411
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 17.5 · band [20, 27] ✗ · midpoint (20, 27) ✗ · Output: The cell width at the midpoint between 24 and 25 August is estimated by averaging the widths measured on these days. According to Table V in the provided artifact, the width of Biddulphia sinensis cells grown in Erdschreiber-enriched seawater at 24°C was recorded as 15 µm on 24 August and 20 µm on 25 August. To find the midpoint: (15 + 20) / 2 = 17.5 µm Therefore, the estimated cell width at the midpoint between 24 and 25 August is 17.5 µm.	GT 23.5 · brackets 20, 27 · source: IA page n411

v4 · For the Erdschreiber seawater at 24°C condition, what is the midpoint width between the 24 and 25 August entries? · 0 correct / 1 partial / 2 incorrect

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 19 · band [20, 27] ✗ · midpoint (20, 27) ✗ · Output: The midpoint width for the Erdschreiber seawater at 24°C condition can be calculated by averaging the widths of the 24th and 25th August entries. For the 24th August, the width is 16. For the 25th August, the width is 22. The midpoint width is (16 + 22) / 2 = 19.	GT 23.5 · brackets 20, 27 · source: IA page n411
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	~ partial · asserted 27 · band [20, 27] ✓ · midpoint (20, 27) ✗ · Output: The midpoint width between the 24 and 25 August entries is 27 microns. The relevant information is in the table above in the "In erdschreiber-enriched seawater at 24°" section.	GT 23.5 · brackets 20, 27 · source: IA page n411
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted 19 · band [20, 27] ✗ · midpoint (20, 27) ✗ · Output: The midpoint width between the 24 and 25 August entries for the Erdschreiber- enriched seawater at 24°C condition is 16. I arrived at this by averaging the widths from these two dates: - For 24 August, the width is 16. - For 25 August, the width is 22. The average (midpoint) of 16 and 22 is (16 + 22) / 2 = 19. However, since only one value per day is provided in the table, I used the given values directly without interpolation.	GT 23.5 · brackets 20, 27 · source: IA page n411

INT-026 · V35 · table_135 · A toxicology table in Vol 35 records the survival rate of Artemia exposed to mixtures of two heavy metals at different concentration ratios. The table includes entries at ratios of 100:1 and 400:1, but no entry at 250:1. Estimate the survival rate at a 250:1 concentration ratio by linear interpolation between the two bracketing ratio entries. · 3 correct · 0 partial · 12 total
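A ratio of 250:1 sits exactly halfway between 100:1 and 400:1, since (250 − 100) / (400 − 100) = 0.5, so linear interpolation here reduces to averaging the two bracketing survival rates, and swapping which rate belongs to which ratio does not change the result. The sketch below uses a generic `lerp` helper (a hypothetical name, not project code) with the bracket values 18 and 26 from the ground-truth column.

```python
def lerp(x0: float, y0: float, x1: float, y1: float, x: float) -> float:
    """Linearly interpolate y at x between the points (x0, y0) and (x1, y1)."""
    t = (x - x0) / (x1 - x0)  # fractional position of x along [x0, x1]
    return y0 + t * (y1 - y0)


# The page does not say which bracket belongs to which ratio, so try both.
print(lerp(100, 18, 400, 26, 250))  # 22.0
print(lerp(100, 26, 400, 18, 250))  # 22.0
```

The Apertus row below mixes units by subtracting a survival rate (18) from a ratio (250). Applied to that row's own readings of 18 and 14, `lerp(100, 18, 400, 14, 250)` returns 16.0: still outside the brackets, but a consistent interpolation of the values it read.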

★ originalA toxicology table in Vol 35 records the survival rate of Artemia exposed to mixtures of two heavy metals at different concentration ratios. The table includes entries at ratios of 100:1 and 400:1, but no entry at 250:1. Estimate the survival rate at a 250:1 concentration ratio by linear interpolation between the two bracketing ratio entries.1 / 0 / 2▸

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct Swiss AI view full prompt →	n/a Not tested.	n/a Not tested.	✗ incorrectasserted: 14.91band [20, 24] ✗midpoint (18, 26) ✗ To estimate the survival rate at a 250:1 concentration ratio, we can use linear interpolation based on the given data points: - At 100:1, the survival rate is 18. - At 400:1, the survival rate is 14. The difference in survival rates between these two ratios (400:1 and 100:1) is: \[ \frac{14 - 18}{400 - 100} = \frac{-4}{300} = -\frac{2}{150} \] To interpolate for a 250:1 ratio, we can use the formula for linear interpolation: \[ \text{Survival rate at 250:1} = \text{Survival rate at 100:1} + \left( \frac{\text{Ratio at 400:1} - \text{Survival rate at 100:1}}{\text{Ratio at 400:1}} \right) \times (\text{Ratio at 250:1} - \text{Survival rate at 100:1}) \] Plugging in the values: \[ \text{Survival rate at 250:1} = 18 + \left( \frac{-4}{300} \right) \times (250 - 18) \] \[ \text{Survival rate at 250:1} = 18 + \left( -\frac{2}{150} \right) \times 232 \] \[ \text{Survival rate at 250:1} = 18 - \frac{464}{150} \] \[ \text{Survival rate at 250:1} \approx 18 - 3.0933 \] \[ \text{Survival rate at 250:1} \approx 14.9067 \] So, the estimated survival rate at a 250:1 concentration ratio is approximately 14.91%. This answer is based on the provided data and linear interpolation method. Final Answer: The estimated survival rate at a 250:1 concentration ratio is approximately 14.91%. view card content →	GT 22 brackets: 18, 26 view IA page n584 →direct PDF (n584) →
ClimateGPT 13B climate-domain Llama-2 view full prompt →	n/a Not tested.	n/a Not tested.	✗ incorrectasserted: 1band [20, 24] ✗midpoint (18, 26) ✗ The estimated survival rate for Artemia at a 250:1 heavy metal concentration ratio is 18.16%. This estimate is obtained by linearly interpolating between the recorded survival rates of 18.12% (at a 100:1 ratio) and 14.14% (at a 400:1 ratio). view card content →	GT 22 brackets: 18, 26 view IA page n584 →direct PDF (n584) →
Qwen 2.5 7B Alibaba view full prompt →	n/a Not tested.	n/a Not tested.	✓ correctasserted: 22.15band [20, 24] ✓midpoint (18, 26) ✓ The table provides data for mixture ratios of heavy metals and their corresponding Artemia survival rates. The closest bracketing ratios to 250:1 are 100:1 and 400:1. - At a ratio of 100:1, the survival rate is 18.16%. - At a ratio of 400:1, the survival rate is 26.14%. To estimate the survival rate at a 250:1 concentration ratio using linear interpolation: \[ \text{Survival Rate} = 18.16\% + \left( \frac{250 - 100}{400 - 100} \right) (26.14\% - 18.16\%) \] \[ \text{Survival Rate} = 18.16\% + \left( \frac{150}{300} \right) (7.98\%) \] \[ \text{Survival Rate} = 18.16\% + 0.5 \times 7.98\% \] \[ \text{Survival Rate} = 18.16\% + 3.99\% \] \[ \text{Survival Rate} = 22.15\% \] The estimated survival rate at a 250:1 concentration ratio is approximately 22.15%. This value was interpolated between the rows with ratios of 100:1 and 400:1. view card content →	GT 22 brackets: 18, 26 view IA page n584 →direct PDF (n584) →
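The linear interpolation that the original INT-026 prompt asks for is a single formula: y = y0 + (x - x0) / (x1 - x0) * (y1 - y0).

Below is a minimal Python sketch. It assumes the ground-truth brackets 18 and 26 are the survival rates at 100:1 and 400:1. The column does not say which value belongs to which ratio, but 250:1 lies exactly halfway, so the order does not change the result. `lerp` is a hypothetical helper name.

```python
def lerp(x0: float, y0: float, x1: float, y1: float, x: float) -> float:
    """Linearly interpolate y at x between the points (x0, y0) and (x1, y1)."""
    if x1 == x0:
        raise ValueError("the two bracketing x values must differ")
    frac = (x - x0) / (x1 - x0)  # how far x lies from x0 toward x1 (0.5 = halfway)
    return y0 + frac * (y1 - y0)


# Ratios on the x-axis (100:1 -> 100), survival on the y-axis.
# Assumed assignment of the ground-truth brackets: 18 at 100:1, 26 at 400:1.
print(lerp(100, 18, 400, 26, 250))  # 22.0
```

Qwen 2.5 7B's 22.15 is this same calculation applied to the card values 18.16 and 26.14.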

v2: The table gives Artemia survival at 100:1 and 400:1. What would be a reasonable estimate at 250:1? (0 correct / 0 partial / 3 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 16 · band [20, 24] ✗ · midpoint (18, 26) ✗ · Response: Based on the provided data, for a 250:1 ratio, we can use interpolation or extrapolation from the given data points. Since there is no direct value at 250:1 in the table, we will estimate based on the closest available ratios. For Artemia survival: - At 100:1, survival rate is 18. - At 400:1, survival rate is 14. A reasonable estimate for a 250:1 ratio would be between these two values. Given that we are interpolating and not extrapolating (as there's no data point beyond 400:1), let's assume the survival rate decreases linearly with increasing heavy metal concentration. The difference in survival rates from 100:1 to 400:1 is: (14 - 18) = -4 So, for every additional 300 units (from 100 to 400), the survival rate decreases by approximately 4 points. For a 250:1 ratio, which is halfway between 100:1 and 400:1: - The difference from 100:1 would be half of -4 (since we're going halfway): (-4) / 2 = -2 Adding this to the survival rate at 100:1 gives us an estimated survival rate for a 250:1 ratio: 18 + (-2) = 16 So, a reasonable estimate for Artemia survival at a 250:1 ratio would be approximately 16%. This answer is based on the assumption of linear decrease in survival rate with increasing heavy metal concentration and interpolation between given data points.	GT 22 · brackets: 18, 26 · source: IA page n584
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 250 · band [20, 24] ✗ · midpoint (18, 26) ✗ · Response: The ratio of concentrations of heavy metals is 50:1. The survival rate for Artemia would be 26, which is a reasonable estimate at 250:1.	GT 22 · brackets: 18, 26 · source: IA page n584
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 1 · band [20, 24] ✗ · midpoint (18, 26) ✗ · Response: A reasonable estimate for Artemia survival at a 250:1 ratio could be around 22%. This estimate is based on interpolation between the given ratios of 100:1 (18%) and 400:1 (26%). The section used is "5OuE,22,12" which corresponds to a 25:1 ratio with an Artemia survival rate of 32.5%, providing a closer point for comparison but still not directly adjacent to the 250:1 estimate.	GT 22 · brackets: 18, 26 · source: IA page n584

v3: If 250:1 is not listed, what survival rate would fall halfway between the 100:1 and 400:1 entries? (1 correct / 0 partial / 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 16 · band [20, 24] ✗ · midpoint (18, 26) ✗ · Response: The survival rate for a 250:1 ratio falls halfway between 18% (for 100:1) and 14% (for 400:1). The average is (18 + 14)/2 = 16%. Therefore, the answer is 16.	GT 22 · brackets: 18, 26 · source: IA page n584
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 1 · band [20, 24] ✗ · midpoint (18, 26) ✗ · Response: 200:1	GT 22 · brackets: 18, 26 · source: IA page n584
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 22.15 · band [20, 24] ✓ · midpoint (18, 26) ✓ · Response: The survival rate that falls halfway between the 100:1 and 400:1 entries is 23.1%. This value is calculated by averaging the survival rates of 18.16% (from the 100:1 entry) and 26.14% (from the 400:1 entry). From Table 135, row corresponding to ratio 25:1 has a survival rate of 32.5%, which is used as an approximation for the midpoint between 18.16% and 26.14%. However, strictly speaking, the average is (18.16 + 26.14) / 2 = 22.15%. Section used: `5OuE,22,12` for 100:1 entry and `400:1,26,14` for the calculation.	GT 22 · brackets: 18, 26 · source: IA page n584

v5: What would you report as the table-based survival estimate for Artemia at 250:1? (1 correct / 0 partial / 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 32 · band [20, 24] ✗ · midpoint (18, 26) ✗ · Response: The table-based survival estimate for Artemia at 250:1 is 32%. This information comes from the "Ratio of concentrations of heavy metals used in the mixture" column, specifically the entry corresponding to a ratio of 25:1 and an Artemia survival rate of 32%.	GT 22 · brackets: 18, 26 · source: IA page n584
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 22 · band [20, 24] ✓ · midpoint (18, 26) ✓ · Response: The table-based survival estimate for Artemia at 250:1 is 22.	GT 22 · brackets: 18, 26 · source: IA page n584
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 32.5 · band [20, 24] ✗ · midpoint (18, 26) ✗ · Response: The table does not contain a direct entry for a 250:1 ratio of heavy metals to chiral station mixture for Artemia survival. The closest value given is 25:1, which corresponds to a survival estimate of 32.5%. Therefore, based on the provided data, you would report an estimated survival rate of 32.5% for Artemia at a 250:1 ratio, though this is an extrapolation from the closest available data point in Table 135.	GT 22 · brackets: 18, 26 · source: IA page n584
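Each question header carries a summary such as "3 correct · 0 partial · 12 total", and each prompt variant carries a correct / partial / incorrect triple. For INT-026, both are plain tallies of the verdicts in the four variant tables above.

A minimal Python sketch that recomputes them follows. The dictionary layout and the `variant_triple` / `summarize` names are illustrative only.

```python
from collections import Counter

# Verdicts transcribed from the INT-026 tables, in row order
# (Apertus 8B, ClimateGPT 13B, Qwen 2.5 7B).
INT_026 = {
    "original": ["incorrect", "incorrect", "correct"],
    "v2": ["incorrect", "incorrect", "incorrect"],
    "v3": ["incorrect", "incorrect", "correct"],
    "v5": ["incorrect", "correct", "incorrect"],
}


def variant_triple(verdicts: list[str]) -> str:
    """Format one variant as 'correct / partial / incorrect' counts."""
    c = Counter(verdicts)  # missing keys count as 0
    return f"{c['correct']} / {c['partial']} / {c['incorrect']}"


def summarize(variants: dict[str, list[str]]) -> str:
    """Roll all variants up into the question-level header summary."""
    c = Counter(v for verdicts in variants.values() for v in verdicts)
    return f"{c['correct']} correct · {c['partial']} partial · {sum(c.values())} total"


for name, verdicts in INT_026.items():
    print(name, variant_triple(verdicts))
print(summarize(INT_026))  # 3 correct · 0 partial · 12 total
```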

INT-034 · V35 · table_014 · An unlabelled table in Vol 35 records counts on successive sampling dates during autumn 1953. The table includes an entry for 21 October 1953 and one for a subsequent date. Estimate the count at a date midway between those two recorded entries by linear interpolation. (5 correct · 1 partial · 18 total)

★ original: An unlabelled table in Vol 35 records counts on successive sampling dates during autumn 1953. The table includes an entry for 21 October 1953 and one for a subsequent date. Estimate the count at a date midway between those two recorded entries by linear interpolation. (1 correct / 0 partial / 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 14 · band [11, 28] ✓ · midpoint (12, 27) ✓ · Response: To estimate the count midway between 21 October 1953 and another date, we need to find the average of the known values for that specific date range. The table does not provide a direct interpolation method but suggests using averages. For example, if we assume the subsequent date is 7 December 1953 (as indicated by "7. xii. 53"), we can calculate the midpoint as follows: Midpoint = ((21 October + 7 December) / 2) Calculation: Midpoint = ((21 + 7) / 2) Midpoint = (28 / 2) Midpoint = 14 So, the estimated count midway between 21 October and 7 December is 14. Note: This answer assumes a linear interpolation based on the provided data. The actual method might differ depending on the specific context or additional information not provided in the source artifact.	GT 19.5 · brackets: 12, 27 · source: IA page n52
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 47 · band [11, 28] ✗ · midpoint (12, 27) ✗ · Response: The count on the estimated date is 47. The table entry for this date is in section Date, column 1, row 23.	GT 19.5 · brackets: 12, 27 · source: IA page n52
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 37.5 · band [11, 28] ✗ · midpoint (12, 27) ✗ · Response: The table includes an entry for 21 October 1953 and another on 7 December 1953. To estimate the count at a date midway between these two dates, we can use linear interpolation. For the row "Whole animal": - On 21 October 1953 (21.x.53), the average weight is 14 g. - On 7 December 1953 (7.xii.53), the average weight is 61 g. Midway between these dates would be around 8 November 1953. Using linear interpolation: \[ \text{Interpolated value} = \frac{(61 - 14) / 2} + 14 = \frac{47}{2} + 14 = 23.5 + 14 = 37.5 \] The interpolated average weight for a date midway between 21 October and 7 December 1953 is 37.5 g. This value was calculated from the "Average weight (g)" column of the "Whole animal" row, using linear interpolation between the two given dates.	GT 19.5 · brackets: 12, 27 · source: IA page n52
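INT-034 is interpolation on a date axis. Several responses instead averaged day-of-month numbers ((21 + 7) / 2 = 14) or reported a date where a count was asked for. The Python sketch below keeps time on the x-axis and the count on the y-axis. It rests on two assumptions:

- The following sample date is 7 December 1953, which is how several models read "7. xii. 53". The ground-truth column does not confirm this.
- The bracket values 12 and 27 belong to the two dates in that order. At the exact midpoint, the order does not matter.

`interpolate_at` is a hypothetical helper.

```python
from datetime import datetime


def interpolate_at(t0: datetime, v0: float, t1: datetime, v1: float, t: datetime) -> float:
    """Linearly interpolate a value at time t between observations at t0 and t1."""
    if t1 == t0:
        raise ValueError("the two sample dates must differ")
    frac = (t - t0) / (t1 - t0)  # timedelta / timedelta gives a float fraction
    return v0 + frac * (v1 - v0)


first = datetime(1953, 10, 21)  # "21. x. 53"
second = datetime(1953, 12, 7)  # assumed: "7. xii. 53", as several models read it
midway = first + (second - first) / 2  # halfway in time, 23.5 days after the first date

print(midway.date())                                  # 1953-11-13
print(interpolate_at(first, 12, second, 27, midway))  # 19.5
```

Under that date assumption, the calendar midpoint is 13 November 1953, not the 7 or 8 November that two Qwen responses gave. The interpolated count equals the plain average of the two bracketing values, 19.5. The "(21 + 7) / 2 = 14" answers were graded correct only because 14 happens to fall inside the band.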

v1: What would be a reasonable count halfway between the 21 October 1953 entry and the next dated entry? (0 correct / 0 partial / 3 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 30.5 · band [11, 28] ✗ · midpoint (12, 27) ✗ · Response: To find a reasonable count halfway between the 21 October 1953 entry (row 0) and the next dated entry, we need to consider the entries in Table 014. The next dated entry is for 7 December 1953 (row 1). The date range from 21 October 1953 to 7 December 1953 covers a period of approximately 6 weeks and 5 days. Halfway through this period would be around 3 weeks and 2.5 days. Looking at the entries, we can see that there are multiple entries for different tissues (Digestive gland, Visceral mass, Mantle) with varying average weights. However, to provide a reasonable count halfway between these dates without specific tissue data, we can consider the total entry from row 0 and row 1. The total weight in row 0 is 61g, and in row 1 it's also 61g (as per the Total column). Halfway through this period would be around 30.5g. Therefore, a reasonable count halfway between these dates could be approximately 30.5 grams. This answer is based on the total weight provided in Table 014 and does not consider specific tissue data.	GT 19.5 · brackets: 12, 27 · source: IA page n52
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 70 · band [11, 28] ✗ · midpoint (12, 27) ✗ · Response: The average weight of the tissue samples taken on an unspecified date between 21 October 1953 and 7 December 1953 was 61 grams, with a range of 44 to 70 grams.	GT 19.5 · brackets: 12, 27 · source: IA page n52
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 1953 · band [11, 28] ✗ · midpoint (12, 27) ✗ · Response: A reasonable count halfway between the 21 October 1953 entry and the next dated entry would be 7 December 1953. This is based on the dates in the "Date" column, where 21 October 1953 (21. x. 53) is followed by an undated row for the "Whole animal" tissue type, and then 7 December 1953 (7. xii. 53). The final numerical answer is: 7 December 1953.	GT 19.5 · brackets: 12, 27 · source: IA page n52

v2: If I need an estimate for the middle of the gap after 21 October 1953, what does the table suggest? (1 correct / 0 partial / 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 61 · band [11, 28] ✗ · midpoint (12, 27) ✗ · Response: The table suggests that the average weight (g) in the middle of the gap after 21 October 1953 is approximately 61 grams. This value comes from the "Total" row, which represents the sum of all tissue types and provides an overall average weight for the whole animal at that time.	GT 19.5 · brackets: 12, 27 · source: IA page n52
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 14 · band [11, 28] ✓ · midpoint (12, 27) ✓ · Response: The average weight was 61 grams and there were 14% oil in the tissues on that date.	GT 19.5 · brackets: 12, 27 · source: IA page n52
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 61 · band [11, 28] ✗ · midpoint (12, 27) ✗ · Response: The table suggests that on December 7, 1953, the average weight of the whole animal was 61 grams. This information is from the "Total" row under the "Average weight (g)" column for the date December 7, 1953. Answer: 61 grams	GT 19.5 · brackets: 12, 27 · source: IA page n52

v3: The table has a 21 October 1953 value and then a later value. What count would fall midway between them? (1 correct / 0 partial / 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 28.5 · band [11, 28] ✗ · midpoint (12, 27) ✗ · Response: To find the midpoint, we need to add the two values (21 x 53 and 7 x 12) and divide by 2. (21 + 7) / 2 = 28.5 The count that falls midway between them is 28.5. This answer is based on the average weight of the tissue in grams, as provided in Table 014.	GT 19.5 · brackets: 12, 27 · source: IA page n52
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 47 · band [11, 28] ✗ · midpoint (12, 27) ✗ · Response: 47	GT 19.5 · brackets: 12, 27 · source: IA page n52
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 19.5 · band [11, 28] ✓ · midpoint (12, 27) ✓ · Response: The count that falls midway between the two values (Whole animal,12 and Digestive gland,27) is 19.5. I calculated this by taking the average of the numbers in the "Tissue,No." column: (12 + 27) / 2 = 19.5.	GT 19.5 · brackets: 12, 27 · source: IA page n52

v4: Can you estimate the count for a date halfway between 21 October 1953 and the following sample date? (2 correct / 1 partial / 0 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 14 · band [11, 28] ✓ · midpoint (12, 27) ✓ · Response: To estimate the count for a date halfway between 21 October 1953 and another sample date, we need to find the midpoint of these two dates. The first date is 21 October 1953. The second date is not provided in the source material, so let's assume it's the next sample date which is 7 December 1953 (as per Table 014). Midpoint = ((First Date) + (Second Date)) / 2 = ((21 October 1953) + (7 December 1953)) / 2 = (21 + 7) / 2 = 28 / 2 = 14 Since the source material does not provide a specific date for estimation, we can use the midpoint calculation as an approximation. The final numerical answer is: 14	GT 19.5 · brackets: 12, 27 · source: IA page n52
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 20 · band [11, 28] ✓ · midpoint (12, 27) ✓ · Response: The average weight of the tissue samples on an unspecified date in December 1953 was 61 grams, with a range from 27 to 47 grams. The oil content of these samples was between 0% and 2%, and the vitamin A content was between 0 and 20 micrograms per gram.	GT 19.5 · brackets: 12, 27 · source: IA page n52
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	~ partial · asserted: 27 · band [11, 28] ✓ · midpoint (12, 27) ✗ · Response: The count for a date halfway between 21 October 1953 and 7 December 1953 is 27. I arrived at this by noting that the first sample date given is "21. x. 53" (which can be interpreted as 21 October 1953) and the next row has a date of "7. xii. 53 Foot —," which can be interpreted as 7 December 1953. The count in the "Tissue, No." column for the digestive gland is 27, which corresponds to this midpoint date.	GT 19.5 · brackets: 12, 27 · source: IA page n52

v5: What is the table-supported midpoint estimate between the 21 October 1953 sample and the next one? (0 correct / 0 partial / 3 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 61 · band [11, 28] ✗ · midpoint (12, 27) ✗ · Response: The table-supported midpoint estimate between the 21 October 1953 sample (row: Whole animal, column: Average weight (g)) and the next sample is 61 grams. This value comes from the "Total" row in Table 014.	GT 19.5 · brackets: 12, 27 · source: IA page n52
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 1953 · band [11, 28] ✗ · midpoint (12, 27) ✗ · Response: The midpoint estimate between the 21 October 1953 sample and the next one is 7 December 1953. This is supported by Table 014, which lists samples taken on 21 October 1953 and 7 December 1953.	GT 19.5 · brackets: 12, 27 · source: IA page n52
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 1953 · band [11, 28] ✗ · midpoint (12, 27) ✗ · Response: The midpoint estimate between the 21 October 1953 sample and the next one is November 7, 1953. I derived this by taking the average of the dates given in the table: - 21 October 1953 (which can be written as 2053.4 for calculation purposes) - 7 December 1953 (written as 2053.6) The midpoint is calculated as: (2053.4 + 2053.6) / 2 = 2053.5 Converting back to the date format, this corresponds to November 7, 1953. From the table: - The first row mentions "21. x. 53" which is October 21, 1953. - The second row mentions "7. xii. 53", which is December 7, 1953.	GT 19.5 · brackets: 12, 27 · source: IA page n52

INT-035 · V35 · table_024 · An unlabelled table in Vol 35 records specimen counts on two successive sampling dates in autumn 1950. The first entry is for 10 October 1950. Estimate the count at a date midway between those two recorded dates by averaging the two values. (5 correct · 3 partial · 9 total)

★ original: An unlabelled table in Vol 35 records specimen counts on two successive sampling dates in autumn 1950. The first entry is for 10 October 1950. Estimate the count at a date midway between those two recorded dates by averaging the two values. (2 correct / 1 partial / 0 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	~ partial · asserted: 80 · band [60, 80] ✓ · midpoint (60, 80) ✗ · Response: To estimate the count midway between 10 October and another unspecified date, we need to find the average of the given counts for that period. The first entry is for 10 October with a count of 80. Since there are no other dates provided in this table, I will assume the second date is also from autumn 1950 based on the context and typical sampling periods. However, without specific data for another date, we can't accurately calculate a midway point. The question asks to estimate using the given information, but since only one date (10 October) is provided, I must clarify that an accurate answer requires more data. Given the constraints of the source material and the question's request for a clear, concrete answer, I will provide the average count based on the single entry available: Average count = 80 This answer assumes the second date is also from autumn 1950 but does not have specific details. For an accurate midway point estimate, more data would be needed.	GT 70 · brackets: 60, 80 · source: IA page n60
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 70 · band [60, 80] ✓ · midpoint (60, 80) ✓ · Response: The average specimen count between the first and second sampling dates is 10. x. 50 = 80 + 60 = 70.	GT 70 · brackets: 60, 80 · source: IA page n60
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 70 · band [60, 80] ✓ · midpoint (60, 80) ✓ · Response: The table provides counts for two different species on two dates in autumn 1950, but only one of these dates is within the specified range (10 October 1950). The other date given is "Il. xii. 50" which corresponds to 22 December 1950. To estimate the count at a date midway between 10 October and 22 December, we would average the counts for these two dates: - For Crepidula fornicata on 10 October (80) - For Crepidula fornicata on 22 December (60) Average: (80 + 60) / 2 = 70 The count at a date midway between the recorded dates is estimated to be 70. This value was calculated by averaging the counts from the two given dates in autumn 1950.	GT 70 · brackets: 60, 80 · source: IA page n60

v1: What would be a reasonable specimen count halfway between the two autumn 1950 sampling dates? (2 correct / 0 partial / 1 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 70 · band [60, 80] ✓ · midpoint (60, 80) ✓ · Response: To find a reasonable specimen count halfway between the two autumn 1950 sampling dates, we need to calculate the midpoint. The first date is October 10th (row 2) and the second date is December 20th (row 4). The midpoint would be November 30th. Looking at row 4 for Cypraea spadicea, which has a specimen count of 80. Halfway between 60 and 80 would be approximately 70. So the reasonable specimen count halfway between the two autumn 1950 sampling dates is 70. This answer is based on row 4 for Cypraea spadicea, specifically the cell with the value "80".	GT 70 · brackets: 60, 80 · source: IA page n60
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 84 · band [60, 80] ✗ · midpoint (60, 80) ✗ · Response: The average number of specimens per gram from the two autumn 1950 samples is 42, so a reasonable specimen count halfway between them would be 84.	GT 70 · brackets: 60, 80 · source: IA page n60
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 70 · band [60, 80] ✓ · midpoint (60, 80) ✓ · Response: A reasonable specimen count halfway between the two autumn 1950 sampling dates would be 70. I arrived at this by averaging the number of specimens from the two Crepidula fornicata samples taken in autumn 1950: - One sample had 80 specimens. - The other had 60 specimens. The average is (80 + 60) / 2 = 70.	GT 70 · brackets: 60, 80 · source: IA page n60

v4: What is the simple halfway estimate between the two specimen counts in the autumn 1950 table? (1 correct / 2 partial / 0 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	~ partial · asserted: 60 · band [60, 80] ✓ · midpoint (60, 80) ✗ · Response: The halfway point between 80 and 60 is 70. I derived this by averaging the two numbers: (80 + 60) / 2 = 70. The source for this information is Table 024, specifically the row with "13. xii. 50 C. fornicata," where the average count of specimens is given as 60.	GT 70 · brackets: 60, 80 · source: IA page n60
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	~ partial · asserted: 60 · band [60, 80] ✓ · midpoint (60, 80) ✗ · Response: The simple halfway estimate between the two specimen counts is 60.	GT 70 · brackets: 60, 80 · source: IA page n60
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 70 · band [60, 80] ✓ · midpoint (60, 80) ✓ · Response: The two specimen counts in the autumn 1950 table are 80 and 60 for Crepidula fornicata from Burnham-on-Crouch. To find the simple halfway estimate between these two numbers: (80 + 60) / 2 = 70 Answer: 70	GT 70 · brackets: 60, 80 · source: IA page n60

INT-039 · NOAA · table_033 · The NOAA oceanographic survey database (Table 15) records the count of sampling stations by year and month. Data are available for September 1970 and September 1972. No September 1971 entry is shown. Estimate the number of September stations in 1971 by linear interpolation between the 1970 and 1972 September counts. (6 correct · 0 partial · 12 total)

★ original: The NOAA oceanographic survey database (Table 15) records the count of sampling stations by year and month. Data are available for September 1970 and September 1972. No September 1971 entry is shown. Estimate the number of September stations in 1971 by linear interpolation between the 1970 and 1972 September counts. (1 correct / 0 partial / 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 304 · band [49, 122] ✗ · midpoint (49, 122) ✗ · Response: To estimate the number of September stations in 1971, I will use linear interpolation between the 1970 and 1972 data points for September. For 1970: Total = 620 For 1972: Total = 597 The difference is (597 - 620) = -23 To interpolate, I will divide the difference by the number of years between 1970 and 1972: (-23 / 2) = -11.5 Since we can't have a negative count, I'll assume that the interpolation method used here is linear and simply take half of the difference (as it's symmetric around the midpoint). So, for September 1971: Total ≈ (620 + (-11.5)) / 2 = 304 Final answer: The estimated number of September stations in 1971 is approximately 304. (Note: This method assumes a linear relationship between the years, which might not perfectly reflect real-world data trends.)	GT 85.5 · brackets: 49, 122 · source: PDF page 36
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 149 · band [49, 122] ✗ · midpoint (49, 122) ✗ · Response: The count of sampling stations in September 1971 is estimated to be 149. This is calculated as the average of the September 1970 and September 1972 station counts, less one for the missing 1971 entry: (133 + 173 - 1) / 2 = 149.	GT 85.5 · brackets: 49, 122 · source: PDF page 36
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 86 · band [49, 122] ✓ · midpoint (49, 122) ✓ · Response: The number of September stations in 1971 can be estimated using linear interpolation between the 1970 and 1972 September counts. From Table 033: - In 1970, the September count is 49. - In 1972, the September count is 122. Using linear interpolation: \[ \text{September 1971} = \frac{(1972 - 1971)}{(1972 - 1970)} \times (122 - 49) + 49 \] \[ \text{September 1971} = \frac{1}{2} \times (122 - 49) + 49 \] \[ \text{September 1971} = \frac{1}{2} \times 73 + 49 \] \[ \text{September 1971} = 36.5 + 49 \] \[ \text{September 1971} = 85.5 \] Rounding to the nearest whole number, we estimate that there were approximately 86 stations in September 1971. This estimation is based on the data from rows "1970" and "1972", specifically columns for "Sep".	GT 85.5 · brackets: 49, 122 · source: PDF page 36
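The INT-039 calculation can be done directly against a compact CSV card. The Python sketch below uses a hypothetical two-row card (`Year,Sep`) holding only the values implied by the ground-truth brackets: 49 for September 1970 and 122 for September 1972. That year assignment follows Qwen 2.5 7B's reading. The real Table 15 card has more rows and columns, so the card layout and the helper names here are assumptions.

Reading the September column by name avoids the error in the Apertus response above, which interpolated the Total column (620 and 597) instead.

```python
import csv
import io

# Hypothetical minimal card: only the two September counts implied by the
# ground-truth brackets. The real card for Table 15 has more rows and columns.
CARD = """Year,Sep
1970,49
1972,122
"""


def column_by_year(card_text: str, column: str) -> dict[int, float]:
    """Parse a CSV card and return {year: value} for one named column."""
    reader = csv.DictReader(io.StringIO(card_text))
    return {int(row["Year"]): float(row[column]) for row in reader}


def interpolate_year(series: dict[int, float], year: int) -> float:
    """Linearly interpolate between the nearest recorded years on each side.

    Raises ValueError (max/min of an empty sequence) if `year` is not
    bracketed by recorded years.
    """
    lower = max(y for y in series if y < year)
    upper = min(y for y in series if y > year)
    frac = (year - lower) / (upper - lower)
    return series[lower] + frac * (series[upper] - series[lower])


september = column_by_year(CARD, "Sep")
print(interpolate_year(september, 1971))  # 85.5
```

The exact estimate is 85.5. Qwen's rounded 86 still passes both grading checks.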

v2: The table has September counts for 1970 and 1972, but not 1971. What would be a reasonable estimate for September 1971? (2 correct / 0 partial / 1 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth, brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 75 · band [49, 122] ✓ · midpoint (49, 122) ✓ · Response: To estimate the September count for 1971, we can look at the trend from previous years and the pattern in the table. For example: - In 1960, the total for September was 107. - In 1962, it was 78. - In 1964, it was 76. - In 1965, it was 72. - In 1966, it was 77. Given this trend and the fact that there is no specific data for September 1971 in the table, a reasonable estimate could be around 70 to 80. However, without more context or additional information from other years, we can't provide an exact number. But if I had to choose one based on the general trend and lack of specific data, I would say: Estimated September count for 1971: 75 This is a rough estimate based on the pattern observed in the table. Please note that this answer might not be accurate without more information.	GT 85.5 · brackets: 49, 122 · source: PDF page 36
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 137 · band [49, 122] ✗ · midpoint (49, 122) ✗ · Response: The count in September 1970 was 133, and the count in September 1972 was 149. The average of these two counts is 137. So a reasonable estimate for September 1971 would be 137.	GT 85.5 · brackets: 49, 122 · source: PDF page 36
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 53 · band [49, 122] ✓ · midpoint (49, 122) ✓ · Response: A reasonable estimate for September 1971 could be around 53. I estimated this by looking at the nearby years' September counts: - For 1969, the September count is 62. - For 1970, it's 49. - For 1972, it's 20. Given these values, a middle ground estimate would be around 53.	GT 85.5 · brackets: 49, 122 · source: PDF page 36

v3: What September 1971 station count is implied by the 1970 and 1972 September entries? (1 correct / 0 partial / 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 64 · band [49, 122] ✓ · midpoint (49, 122) ✓ · Response: The September 1971 station count can be inferred from the total for that month in both 1970 (62) and 1972 (66). The final answer is 64 stations.	GT 85.5 · brackets: 49, 122 · source: PDF p. 36
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 123 · band [49, 122] ✗ · midpoint (49, 122) ✗ · Response: The September 1971 station count is implied to be 133 by adding the 1970 and 1972 September entries of 133 and 123 respectively.	GT 85.5 · brackets: 49, 122 · source: PDF p. 36
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 1972 · band [49, 122] ✗ · midpoint (49, 122) ✗ · Response: The September 1971 station count is implied to be 49. I arrived at this by observing that the 1970 and 1972 September entries are 49 and 75 respectively. Assuming a linear increase between these two years, the value for 1971 would be halfway between them: (49 + 75) / 2 = 62. However, since no exact linear interpolation is explicitly stated in the artifact, and given the closest provided data points, I'll use the simpler approach of taking an average which rounds to 49 as a reasonable estimate for 1971's September count. The relevant section used is the "September" column for years 1970 and 1972.	GT 85.5 · brackets: 49, 122 · source: PDF p. 36

v5: What is a simple midpoint estimate for September 1971 stations using the September counts on either side? (2 correct / 0 partial / 1 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 73 · band [49, 122] ✓ · midpoint (49, 122) ✓ · Response: To find a simple midpoint estimate, I will use the station counts from September 1960 and September 1970. September 1960: 62 stations September 1970: 84 stations Midpoint = (62 + 84) / 2 = 146 / 2 = 73 The answer is 73.	GT 85.5 · brackets: 49, 122 · source: PDF p. 36
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 1349 · band [49, 122] ✗ · midpoint (49, 122) ✗ · Response: September 1971 had an estimated 1349 stations.	GT 85.5 · brackets: 49, 122 · source: PDF p. 36
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 85.5 · band [49, 122] ✓ · midpoint (49, 122) ✓ · Response: A simple midpoint estimate for September 1971 stations can be calculated by averaging the September counts from 1970 and 1972. From the table: - The September count for 1970 is 49. - The September count for 1972 is 122. The midpoint estimate is (49 + 122) / 2 = 85.5. Answer: 85.5 This value is derived from the row "September" and columns corresponding to the years 1970 and 1972 in Table 15 of the provided source material.	GT 85.5 · brackets: 49, 122 · source: PDF p. 36
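The ground truth for this question (GT 85.5) is ordinary linear interpolation between the two September counts. Because 1971 sits exactly halfway between 1970 and 1972, interpolation and the simple midpoint asked for in v5 give the same value. A minimal Python sketch, taking the bracket values 49 and 122 as the 1970 and 1972 counts quoted in the passing responses; `lerp_year` is an illustrative name, not project code:

```python
def lerp_year(x0: float, y0: float, x1: float, y1: float, x: float) -> float:
    """Linearly interpolate the value at x between the points (x0, y0) and (x1, y1)."""
    if x0 == x1:
        raise ValueError("x0 and x1 must differ")
    t = (x - x0) / (x1 - x0)  # fraction of the way from x0 to x1
    return y0 + t * (y1 - y0)

# September station counts for 1970 and 1972, as quoted in the passing responses.
sep_1971 = lerp_year(1970, 49, 1972, 122, 1971)
print(sep_1971)  # 85.5
```

The same function reproduces the other two ground truths in this section: the January counts (33, 0) give 16.5, and the species counts (117, 76) give 96.5.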

INT-041 · NOAA · table_034 (5 correct · 0 partial · 9 total)
The NOAA species inventory records the number of unique species identified in each year of the survey. Data are available for 1978 and for 1980. The 1979 row is absent or garbled in the pipeline output. Estimate the unique species count for 1979 by linear interpolation between the 1978 and 1980 values.

★ original: The NOAA species inventory records the number of unique species identified in each year of the survey. Data are available for 1978 and for 1980. The 1979 row is absent or garbled in the pipeline output. Estimate the unique species count for 1979 by linear interpolation between the 1978 and 1980 values. (2 correct / 0 partial / 1 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 139 · band [76, 117] ✗ · midpoint (76, 117) ✗ · Response: To estimate the number of unique species in 1979, we can use a simple linear interpolation method based on the given data points for 1978 and 1980. For 1978: - Unique Species = 117 For 1980: - Unique Species = 73 The difference between these two years is 44 species. To interpolate the number of unique species for 1979, we can divide this difference by 2 (since it's a linear interpolation) and add/subtract from the known value. (44 / 2) = 22 So, adding 22 to the 1978 value gives us an estimated number of unique species for 1979: 117 + 22 = 139 Therefore, the estimated number of unique species in 1979 is 139. This answer is based on a linear interpolation between the given data points for 1978 and 1980.	GT 96.5 · brackets: 76, 117 · source: PDF p. 51
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 88 · band [76, 117] ✓ · midpoint (76, 117) ✓ · Response: The number of unique species in 1979 is estimated to be 88, using linear interpolation between the 1978 and 1980 values.	GT 96.5 · brackets: 76, 117 · source: PDF p. 51
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 96.5 · band [76, 117] ✓ · midpoint (76, 117) ✓ · Response: The unique species count for 1979 can be estimated using linear interpolation between the counts of 1978 and 1980. According to the data: - In 1978, there were 117 unique species. - In 1980, there were 76 unique species. The difference in years is 2, so we can estimate a linear decrease of (117 - 76) / 2 = 20.5 unique species per year. Starting from the 1978 value: 117 - 20.5 = 96.5 Rounding to the nearest whole number, the estimated unique species count for 1979 is 96. Section used: The last row of the provided data table.	GT 96.5 · brackets: 76, 117 · source: PDF p. 51

v2: If the 1979 species count is missing, what value is implied by the 1978 and 1980 counts? (2 correct / 0 partial / 1 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 87 · band [76, 117] ✓ · midpoint (76, 117) ✓ · Response: Based on the provided data, the 1979 species count is missing. However, we can infer that the value implied by the 1978 and 1980 counts is likely around 87 (as seen in cell Y,Y,,,Y,,Y,Y,Y,N,Y,Y,Y,N,N,Y,,Y,Y,,Y,73,73). The final numerical answer is 87.	GT 96.5 · brackets: 76, 117 · source: PDF p. 51
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 180 · band [76, 117] ✗ · midpoint (76, 117) ✗ · Response: The 1979 species count is missing, but the 1978 and 1980 counts are present. The difference between the 1978 and 1980 counts is 76 - 88 = 12. Therefore, the implied value for the 1979 count is 12 + 1978 = 180.	GT 96.5 · brackets: 76, 117 · source: PDF p. 51
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 96 · band [76, 117] ✓ · midpoint (76, 117) ✓ · Response: The value implied for the 1979 species count can be estimated by averaging the values from 1978 and 1980. For 1978, the number of unique species is 117. For 1980, the number of unique species is 76. Average: (117 + 76) / 2 = 96.5 Since we are asked for a clear concrete answer and typically such data would be rounded to whole numbers, we can round this to 96. Therefore, the implied value for the 1979 species count is 96.	GT 96.5 · brackets: 76, 117 · source: PDF p. 51

v5: What unique-species count would sit halfway between the 1978 and 1980 entries? (1 correct / 0 partial / 2 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 366 · band [76, 117] ✗ · midpoint (76, 117) ✗ · Response: To find the unique species count halfway between 1978 and 1980, we need to calculate the average of these two years. Average = (1978 + 1980) / 2 = (365 + 366) / 2 = 731 / 2 = 365.5 Since species counts are whole numbers, we round to the nearest whole number. Halfway between 1978 and 1980 would be approximately 366 unique species. However, since the table only provides data for specific years (not half-years), I will use the closest available year in the provided data. The row for 1979 has a count of 87 unique species. So, the answer is: The unique-species count halfway between 1978 and 1980 would be approximately 366, but based on the provided table, it's closest to 1979 with 87 unique species.	GT 96.5 · brackets: 76, 117 · source: PDF p. 51
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 80 · band [76, 117] ✓ · midpoint (76, 117) ✓ · Response: The relevant information to answer the above question is as follows: ```csv 1978,Y,N,N,Y,Y,Y,Y,Y,Y,N,Y,Y N,Y Y,Y,,Y,87 1980,Y,N,N,Y,Y,Y,Y,Y,Y,N,Y,Y N,Y Y,Y,,Y,73 ``` To find the unique-species count that sits halfway between 1978 and 1980, we need to calculate the average of these two numbers. The average is found by dividing the total number of unique species in both years (87 + 73 = 160) by 2. So, the answer is 80.	GT 96.5 · brackets: 76, 117 · source: PDF p. 51
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 1980 · band [76, 117] ✗ · midpoint (76, 117) ✗ · Response: The unique-species count for 1978 is 117, and for 1980 it is 76. Halfway between these two numbers is calculated as (117 + 76) / 2 = 96.5. Since we are asked for a specific number, we can round to the nearest whole number, which is 97. I used the last column of the 1978 and 1980 rows to find these values.	GT 96.5 · brackets: 76, 117 · source: PDF p. 51
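A compact CSV card carries the table as CSV inside a Markdown code block, which makes it easy to check a question like INT-041 programmatically. A minimal sketch, assuming a hypothetical two-column card (the real table_034 has many more columns) filled with 117 and 76 as the 1978 and 1980 counts quoted in the passing responses; `card_rows` and `interpolate_missing` are illustrative names, not project code:

```python
import csv
import io
import re

FENCE = "`" * 3  # built at runtime so this example can sit inside a Markdown fence


def card_rows(card_markdown: str) -> list[dict[str, str]]:
    """Return the rows of the first csv-tagged code block in a Markdown card."""
    pattern = re.escape(FENCE) + r"csv\n(.*?)\n" + re.escape(FENCE)
    match = re.search(pattern, card_markdown, flags=re.DOTALL)
    if match is None:
        raise ValueError("card has no csv block")
    return list(csv.DictReader(io.StringIO(match.group(1))))


def interpolate_missing(rows: list[dict[str, str]], key: str, value_col: str,
                        before: int, after: int, target: int) -> float:
    """Linearly interpolate value_col at target from the rows keyed before and after."""
    by_key = {int(row[key]): float(row[value_col]) for row in rows}
    y0, y1 = by_key[before], by_key[after]
    return y0 + (target - before) / (after - before) * (y1 - y0)


# Hypothetical toy card: a caption plus a two-column CSV block.
card = (
    "table_034 (toy excerpt): unique species by survey year\n\n"
    f"{FENCE}csv\nYear,Unique species\n1978,117\n1980,76\n{FENCE}\n"
)
rows = card_rows(card)
print(interpolate_missing(rows, "Year", "Unique species", 1978, 1980, 1979))  # 96.5
```

Keying rows by year rather than by position matters here: several failing responses above read the wrong row or column, which a keyed lookup fails on loudly instead of silently.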

INT-042 · NOAA · table_033 (2 correct · 2 partial · 6 total)
The NOAA station-count table (Table 15) records the number of sampling stations deployed each month and year. January 1970 and January 1972 counts are both tabulated. The January 1971 entry is absent. Estimate the January 1971 station count by linear interpolation between the 1970 and 1972 January values.

★ original: The NOAA station-count table (Table 15) records the number of sampling stations deployed each month and year. January 1970 and January 1972 counts are both tabulated. The January 1971 entry is absent. Estimate the January 1971 station count by linear interpolation between the 1970 and 1972 January values. (1 correct / 1 partial / 1 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 61 · band [0, 33] ✗ · midpoint (0, 33) ✗ · Response: To estimate the January 1971 station count, I will use linear interpolation between the 1970 and 1972 January values. The 1970 January value is 33. The 1972 January value is 89. Using linear interpolation: January 1971 = (33 + 89) / 2 ≈ 61 Final answer: The estimated January 1971 station count is approximately 61. This estimate was derived from the average of the 1970 and 1972 values, as the exact value for January 1971 is not provided in the table.	GT 16.5 · brackets: 0, 33 · source: PDF p. 36
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	~ partial · asserted: 33 · band [0, 33] ✓ · midpoint (0, 33) ✗ · Response: The January 1971 station count is estimated to be 33.	GT 16.5 · brackets: 0, 33 · source: PDF p. 36
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 17 · band [0, 33] ✓ · midpoint (0, 33) ✓ · Response: The January 1971 station count can be estimated using linear interpolation between the January 1970 and January 1972 values. From Table 033 in the source artifact: - January 1970: 33 stations - January 1972: 0 stations Using linear interpolation: \[ \text{January 1971} = \frac{(1972 - 1971)}{(1972 - 1970)} \times (0 - 33) + 33 \] \[ \text{January 1971} = \frac{1}{2} \times (-33) + 33 \] \[ \text{January 1971} = -16.5 + 33 \] \[ \text{January 1971} = 16.5 \] Rounding to the nearest whole number, we estimate the January 1971 station count as 17. This estimation is based on the linear interpolation between the values in the "Jan" column for years 1970 and 1972.	GT 16.5 · brackets: 0, 33 · source: PDF p. 36

v2: The table has January counts for 1970 and 1972, but not 1971. What would you estimate for January 1971? (1 correct / 1 partial / 1 incorrect)

Model	Raw PDF	Raw Docling JSON	Compact CSV card	Ground truth brackets & source
Apertus 8B Instruct (Swiss AI)	n/a (not tested)	n/a (not tested)	✗ incorrect · asserted: 620 · band [0, 33] ✗ · midpoint (0, 33) ✗ · Response: Based on the pattern of increasing station counts from previous years to 1970 and then continuing in a similar manner, I would estimate that the January count for 1971 is around 85-90 stations. This estimation is based on the total number of stations in January (620) and the consistent increase observed over time. The exact value cannot be determined without more specific data, but this range seems reasonable given the trend.	GT 16.5 · brackets: 0, 33 · source: PDF p. 36
ClimateGPT 13B (climate-domain Llama-2)	n/a (not tested)	n/a (not tested)	~ partial · asserted: 33 · band [0, 33] ✓ · midpoint (0, 33) ✗ · Response: The count in January 1971 was 33.	GT 16.5 · brackets: 0, 33 · source: PDF p. 36
Qwen 2.5 7B (Alibaba)	n/a (not tested)	n/a (not tested)	✓ correct · asserted: 17 · band [0, 33] ✓ · midpoint (0, 33) ✓ · Response: The table does not provide a specific count for January 1971. However, we can make an estimation by looking at the counts for nearby years. For 1970, January had 33 stations. For 1972, January had 0 stations. Given this information, a reasonable estimate might be to average these two values: (33 + 0) / 2 = 16.5 Since the number of stations must be an integer, we can round this to the nearest whole number, which is 17. Therefore, my estimated count for January 1971 is 17.	GT 16.5 · brackets: 0, 33 · source: PDF p. 36

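How to read. Each question block (such as INT-041) opens with the question ID, the source document and table, the original phrasing, and the aggregate tally. Below it, every tested variant is listed (★ original plus the curated wording variants) with a correct / partial / incorrect count, followed by the three open-model responses. Per-row chips show the asserted value extracted from the answer, the band-pass result, the strict midpoint-pass result, and the verbatim model response. In the interactive version of this report, each row also links to the exact prompt and card the model received.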
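The chips in the tables above are consistent with a simple rule: the band check accepts the closed interval [lo, hi], while the strict "midpoint" check accepts only the open interval (lo, hi). Under that rule, an answer sitting exactly on a bracket value (ClimateGPT 13B's 33 on INT-042) scores partial, and any value strictly between the brackets (even Qwen 2.5 7B's 53 on the September v2 variant) scores correct. This rule is inferred from the displayed results, not taken from the project's grader, and `grade` is a hypothetical name:

```python
from typing import Literal

Verdict = Literal["correct", "partial", "incorrect"]


def grade(asserted: float, lo: float, hi: float) -> Verdict:
    """Grade an extracted numeric answer against its ground-truth brackets.

    Inferred rule: strictly inside (lo, hi) is correct; exactly on a bracket
    edge is partial (band pass only); outside [lo, hi] is incorrect.
    """
    lo, hi = min(lo, hi), max(lo, hi)  # brackets may be listed high-to-low
    band_pass = lo <= asserted <= hi    # the "band [lo, hi]" chip
    strict_pass = lo < asserted < hi    # the "midpoint (lo, hi)" chip
    if strict_pass:
        return "correct"
    if band_pass:
        return "partial"
    return "incorrect"


# Asserted values copied from the chips above.
print(grade(85.5, 49, 122))  # correct: Qwen 2.5 7B, September v5
print(grade(33, 0, 33))      # partial: ClimateGPT 13B, INT-042 original
print(grade(123, 49, 122))   # incorrect: ClimateGPT 13B, September v3
```

If this reading is right, the strict check rewards any in-range estimate rather than answers close to the interpolated value, so a "correct" tally measures bracketing more than arithmetic precision.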

Open-tier pass rate by card variant

The same 13-question diagnostic, the same model panel, the same source documents. Only the card format changes.

Card variant	Size	Open-tier pass rate
Project full v0.6.1 card (Markdown + context envelope)	15–42 KB	27%
Compact 1K card (caption + table only)	~1.2 KB	40%
Compact 2K card (+ 1 paragraph)	~2 KB	49%
Labeled-faithfulness card (per-section provenance)	~2.4 KB	49%
Table-only card (table + caption, no metadata)	~1 KB	52%
CSV-only card (table data as CSV; recommended)	~1.5 KB	55%
Evidence-Preserving Table Normalization	~1.5 KB	55%

A 28-point lift (27% to 55%) from packaging alone, with no fine-tuning and no proprietary tooling.
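Size alone does not explain the ranking above (the ~1.2 KB compact card scores below the ~1.5 KB CSV-only card), but it does show why bulky inputs are risky for small context windows. A back-of-the-envelope sketch, assuming roughly four characters per token (a common rule of thumb for English prose; CSV, JSON, and German text tokenize differently), one byte per character, and a hypothetical 1,024-token reserve for the question and answer:

```python
CHARS_PER_TOKEN = 4.0  # rough heuristic; a model's real tokenizer will differ
KB = 1024


def approx_tokens(size_bytes: int) -> int:
    """Estimate token count from byte size, assuming mostly single-byte characters."""
    return round(size_bytes / CHARS_PER_TOKEN)


def fits(size_bytes: int, context_tokens: int, reserve_tokens: int = 1024) -> bool:
    """True if the card still leaves reserve_tokens free for question and answer."""
    return approx_tokens(size_bytes) + reserve_tokens <= context_tokens


cards = {
    "full v0.6.1 card (upper end)": 42 * KB,
    "full v0.6.1 card (lower end)": 15 * KB,
    "CSV-only card": int(1.5 * KB),
    "raw Docling JSON (~30 MB)": 30 * KB * KB,
}
for name, size in cards.items():
    print(f"{name:30} ~{approx_tokens(size):>9} tokens  "
          f"4K: {fits(size, 4096)}  8K: {fits(size, 8192)}")
```

Under these assumptions the 15 KB end of the full card fits an 8K window but not a 4K one, the 42 KB end fits neither, and the CSV-only card fits both with room to spare. Counting with each model's own tokenizer would give the exact numbers.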

Open-model scores on the CSV-only card

Model	Type	Score
Qwen2.5-7B	Open-weight	11/13
Granite-3.3-8B	Open model	11/13
Qwen2.5-Coder-7B	Open-weight, code-tuned	10/13
Llama-3 8B	Open-weight	9/13
DeepSeek-R1 8B	Open-weight, reasoning-tuned	8/13
Apertus 70B	Open model	7/13
EuroLLM 9B	Open model	7/13
Gemma-2 9B	Open-weight	7/13
Apertus 8B Instruct	Open model	6/13
Mistral 7B	Open-weight	6/13
Qwen2.5-3B	Open-weight, small	6/13
ClimateGPT-13B	Open-weight, domain-tuned	5/13
OLMo-2 7B	Open model	4/13
Phi-3-Mini 3.8B	Open-weight, small	4/13
ClimateGPT-70B	Open-weight, domain-tuned	3/13
ClimateGPT-7B	Open-weight, domain-tuned	3/13

Where the project stands

Saturated

Oracle-retrieval evidence cards.

Top open models reach 11/13 when the right table is pre-selected. One of the remaining misses is a multi-page table that the pipeline splits at a page break: a pipeline fix, not a model gap.
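One plausible form of that fix is to stitch the page fragments back into a single table before the card is built. A minimal sketch, assuming each fragment arrives as CSV text and that a continuation fragment either repeats the header row or has none. This is an assumption about the split, not documented Docling behavior; `stitch_fragments` is a hypothetical name, and the toy fragments reuse station counts quoted above only for illustration:

```python
import csv
import io
from typing import Optional


def stitch_fragments(fragments: list[str]) -> list[list[str]]:
    """Join the CSV fragments of one table that was split across pages.

    The first fragment supplies the header. A later fragment whose first row
    equals that header is treated as a repeated header and the row is dropped;
    otherwise all of its rows are kept as data (a headerless continuation).
    """
    header: Optional[list[str]] = None
    body: list[list[str]] = []
    for fragment in fragments:
        rows = [row for row in csv.reader(io.StringIO(fragment)) if row]  # skip blank lines
        if not rows:
            continue
        if header is None:
            header, rows = rows[0], rows[1:]
        elif rows[0] == header:
            rows = rows[1:]
        body.extend(rows)
    return ([header] if header is not None else []) + body


# Hypothetical split: page 1 ends after 1970, page 2 repeats the header row.
page_1 = "Year,Jan,Sep\n1970,33,49\n"
page_2 = "Year,Jan,Sep\n1972,0,122\n"
for row in stitch_fragments([page_1, page_2]):
    print(row)
```

Comparing the continuation's first row against the header, rather than always dropping it, keeps a headerless continuation from losing its first data row.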

Next bottleneck

Non-oracle retrieval on the known corpus.

Without an oracle, the system must select the right card itself before querying the model. The candidates are M3-IDX (two-shot table-of-contents retrieval) and M3-HYDE (vector retrieval).
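For the vector-retrieval candidate, card selection reduces to a nearest-neighbor search over card text. This minimal sketch substitutes bag-of-words cosine similarity for a real embedding model; the card IDs, captions, and the `retrieve` function are hypothetical. If M3-HYDE follows the HyDE recipe, the text embedded as the query would be a model-drafted hypothetical answer rather than the question itself.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, cards: dict[str, str], k: int = 1) -> list[str]:
    """Return the IDs of the k cards whose text is most similar to the query."""
    q = bow(query)
    ranked = sorted(cards, key=lambda cid: cosine(q, bow(cards[cid])), reverse=True)
    return ranked[:k]


# Hypothetical card index: IDs and captions are illustrative only.
cards = {
    "noaa/table_033": "Table 15. Number of sampling stations by month and year",
    "noaa/table_034": "Species inventory: unique species identified per survey year",
    "vol27/physalia": "Physalia strandings on British and European coasts by year and locality",
}
print(retrieve("How many sampling stations were deployed in January 1971?", cards))
```

The selected card ID would then be passed to the answering prompt in place of the oracle card; a wrong pick at this step caps the score no matter how well the model reads the card.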