Mind the Chemical Gap

Frontier-AI safety tests have learned to measure biological risk in detail. For chemical weapons, they mostly look away - and the part defenders need most, detection, isn’t tested at all.

Jun 13, 2026

In March 2026 the body that polices the global ban on chemical weapons said something quietly alarming about artificial intelligence. The Organisation for the Prohibition of Chemical Weapons (OPCW), the outfit that has verified the destruction of every declared chemical-weapons stockpile on Earth, released the final report of its Scientific Advisory Board’s working group on AI. The report’s headline was optimistic: AI can help with verification, with sifting declarations, with corroborating reports of attacks.¹ Buried in the framing was the uncomfortable part. The way the AI-safety world currently measures danger does not really see chemistry. It sees biology and assumes chemistry is close enough.

It is not close enough. And the gap between how well we measure biological risk in frontier models and how poorly we measure chemical risk is now wide enough, and well enough documented, to name plainly.

Bar chart comparing biological versus chemical evaluation maturity across five dimensions — **Figure 1.** The asymmetry in one chart. Across the dimensions that make a risk legible — dedicated benchmarks, expert baselines, uplift studies, and detection-side evaluations — chemical-weapons coverage trails biological coverage on everything except treaty-body concern. The one place chemistry “leads” is worry, not measurement. *Charted from the public benchmark and policy record (see sources); maturity is the author’s coding, not a metric.*

1. The proxy habit

Here is how the asymmetry got built, and it wasn’t through negligence. When the major labs and evaluators sat down to measure whether a model could meaningfully help someone build a weapon of mass destruction, biology was the louder fire. The reasoning was defensible: a capable bioweapon can self-replicate and spread, the potential casualty ceiling is higher, and so biology became the canary. OpenAI’s own preparedness work treats biological and chemical capability as a combined category and uses biological evaluations as the indicator for the high-risk thresholds.² RAND’s May-2025 benchmarking exercise - twenty-seven frontier models across eight knowledge benchmarks - is, by its own title, about the biological knowledge of frontier models; chemistry rides along.³

The trouble with a proxy is that it works right up until the two things it bundles drift apart. Chemical-weapons risk is not a dimmer version of biological risk. The precursors are different, the production signatures are different, the supply chains are different, and - crucially for anyone trying to catch a programme rather than model one - the open-source footprint is completely different. A combined alarm line calibrated on the biological number can sit comfortably above the rising chemical one, and nobody notices the chemical curve climbing underneath it.

Schematic of two capability curves under a single shared threshold — **Figure 2.** The proxy trap, schematically. When a single “bio/chem” threshold is set by the biological capability number, chemical capability can rise through the shaded band — real, growing, and below the alarm line — without ever tripping it. *Schematic of the proxy logic described by OpenAI and RAND, not measured values.*
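The proxy logic can be made concrete with a small sketch. Everything numeric here is hypothetical: the scores, the shared threshold, and the chemical-specific threshold are invented for illustration and are not values from OpenAI, RAND, or any benchmark. The function names are likewise assumptions, not anyone's published method.

```python
# A minimal sketch of the proxy trap, using made-up numbers.
# Assumption: capability is summarised as a score in [0, 1] per domain.

SHARED_THRESHOLD = 0.8  # hypothetical combined "bio/chem" alarm line
CHEM_THRESHOLD = 0.5    # hypothetical line a chemistry-specific evaluation might set


def combined_alarm(bio_score: float, chem_score: float, threshold: float) -> bool:
    """Single bio/chem alarm keyed on the biological indicator only.

    chem_score is accepted but deliberately ignored: that is the proxy habit.
    """
    return bio_score >= threshold


def separate_alarms(bio_score: float, chem_score: float,
                    bio_threshold: float, chem_threshold: float) -> dict:
    """Two alarms, each keyed on its own domain's score."""
    return {
        "bio": bio_score >= bio_threshold,
        "chem": chem_score >= chem_threshold,
    }


# Hypothetical (bio, chem) scores across successive model generations.
trajectory = [(0.30, 0.20), (0.35, 0.40), (0.40, 0.60), (0.45, 0.70)]

# Points where a chemistry-specific line would trip but the shared alarm stays silent.
missed = [
    (bio, chem)
    for bio, chem in trajectory
    if chem >= CHEM_THRESHOLD and not combined_alarm(bio, chem, SHARED_THRESHOLD)
]

print(missed)
```

The design point is only that an alarm reading one number cannot notice movement in another, whatever the real thresholds turn out to be.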

The OPCW report is the first treaty-level acknowledgement that this bundling has a cost. Independent commentary on the report put it bluntly: the bio-as-proxy approach leaves chemical-specific risks undertested.¹ When the organisation responsible for the Chemical Weapons Convention says the measurement instruments don’t quite fit its weapon, that is not a footnote. That is the story.

The AI-safety field measures whether a model knows dangerous chemistry. The OPCW needs to know whether AI can help it see a chemical programme in the open-source noise. Those are not the same test.

2. What the chemical tests actually test

There are chemical evaluations. The point is not that the cupboard is bare - it’s that everything in it is pointed the same direction. WMDP includes a chemistry set; ChemBench probes chemical reasoning; ChemSafetyBench, published in early 2026, runs tens of thousands of prompts across chemical-property queries, legality, and synthesis-style requests; SoSBench covers chemistry among six high-risk scientific domains.⁴ ⁵ ⁶ The 2025 Frontier-Risk threshold work even sets a separate “chemical hazardous-knowledge” line, anchored on the WMDP-Chemistry expert baseline.⁷

Line them up against the tasks a model could be asked to do, though, and a column goes empty.

Matrix of chemical benchmarks against task types showing an empty detection column — **Figure 3.** A coverage matrix of today’s chemical evaluations against the task types that matter. Knowledge recall, synthesis reasoning, legality, and refusal are reasonably covered. The red column — detection-side tasks an analyst would actually use AI for — is empty across every existing benchmark. The green row is the artifact this essay argues for. *Coding by the author from the cited benchmark documentation.*

Every existing chemical benchmark asks a variant of the same question: does the model know, or reason about, or refuse, dangerous chemistry? Knowledge in, knowledge out. None of them asks the defender’s question: given a pile of open-source material - customs records, satellite imagery, procurement filings, social media - can the model help an analyst find the signal that a programme exists? That is detection, and detection is not on the menu.
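For readers who want to reproduce this kind of audit, here is a rough sketch of how a coverage matrix can be checked for empty columns. The task-type tags assigned to each benchmark are an illustrative reading of the one-line descriptions above, not the coding behind Figure 3, and the tag names themselves are assumptions.

```python
# A minimal sketch of a coverage audit. The per-benchmark tags below are
# illustrative assumptions, not the author's Figure 3 coding.

TASK_TYPES = [
    "knowledge_recall",
    "synthesis_reasoning",
    "legality",
    "refusal",
    "detection",
]

# Hypothetical coding: benchmark name -> set of task types it exercises.
coverage = {
    "WMDP-Chemistry": {"knowledge_recall"},
    "ChemBench": {"knowledge_recall", "synthesis_reasoning"},
    "ChemSafetyBench": {"knowledge_recall", "legality", "synthesis_reasoning", "refusal"},
    "SoSBench": {"refusal"},
}

# A task type is "uncovered" if no benchmark's tag set contains it.
uncovered = [
    task
    for task in TASK_TYPES
    if not any(task in tags for tags in coverage.values())
]

print(uncovered)
```

Anyone disputing the coding can change the tag sets and rerun; the empty column either survives the recoding or it doesn't.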

3. Recall is not detection

This distinction is the whole argument, so it’s worth slowing down on. A recall benchmark is a closed-book exam: the dangerous fact either is or isn’t in the model’s head, and we test whether it comes out. A detection task is open-world and adversarial: the relevant facts are scattered across messy public data, half of it is noise, some of it is deliberately staged, and the job is to assemble fragments into a calibrated judgement that a programme is or isn’t there.

Two columns contrasting what evaluations test versus what defenders need — **Figure 4.** The mismatch in plain terms. The left column is what chemical evaluations measure. The right is what a chemical-threat OSINT analyst actually needs a model to do. The arrow between them is the gap. *Author’s synthesis.*

Why does the defender side go unmeasured? Partly because it’s harder to benchmark - you need labelled real-world cases, and confirmed chemical-weapons cases are mercifully rare and politically radioactive. Partly because the AI-safety community grew up worried about uplift (does the model make a bad actor more capable?) rather than augmentation (does the model make a defender more capable?). Both matter. Only one gets tested.

SOCINT sidebar · reading this through a multi-INT lens

Chemical-threat detection is not one discipline; it’s a convergence problem across several. A defender-oriented evaluation would have to test a model’s usefulness against each collection stream - which is exactly why a recall benchmark can’t stand in for it. How the disciplines map onto the gap:

OSINT: the spine - open trade, media, registries; partly testable, untested

GEOINT: facility & effluent signatures from satellite; no eval exists

FININT: procurement, sanctions, proliferation-financing trails; no eval

SOCINT: narrative & network signal on social platforms; no eval

DARKINT: marketplace & forum precursor chatter; no eval

TECHINT: patent/literature monitoring; partly proxied by recall tests

Reading: of the disciplines a chemical analyst leans on, only the recall-adjacent one (TECHINT) is even partially captured by current benchmarks. The collection streams that do the real detective work - GEOINT, FININT, SOCINT, DARKINT - are absent from the evaluation map entirely.

4. What to build instead

The fix is not to stop running recall benchmarks. They catch a real risk - that a model hands a novice the knowledge they couldn’t otherwise get - and the uplift studies emerging on the biological side show that risk is no longer hypothetical.³ The fix is to build the missing column: a defender-oriented, detection-grounded evaluation for chemical-threat OSINT.

A specification, in four parts

Detection tasks, not recall items. Give the model realistic open-source fragments - a customs anomaly, a facility image, a procurement chain - and score whether it surfaces the right signal, not whether it recites a fact.
Labelled on public cases. Anchor ground truth in already-documented, already-public events and matched dual-use controls, so the benchmark teaches discrimination, not memorisation - and adds zero hazardous content.
Scored on the defender’s metrics. Precision and recall against the controls, false-alarm rate, and calibration - the numbers an analyst lives and dies by - rather than a single accuracy figure.
Released open, for replicate-and-compare. The thing that turns a one-off paper into a standard is a versioned, citable benchmark others must report against.
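The scoring part of the specification is the easiest to sketch. What follows is one possible way to compute the defender's metrics, assuming the model returns a probability that a programme signal is present and that each item is labelled either as a documented case or as a dual-use control. The `Judgement` class, the function name, and the 0.5 decision threshold are assumptions for illustration; the Brier score stands in for "calibration" as one common choice among several.

```python
# A minimal sketch of defender-side scoring, under the assumptions stated above.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgement:
    probability: float   # model's stated probability that a programme signal is present
    is_programme: bool   # ground truth: True = documented case, False = dual-use control


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def defender_metrics(judgements: list, threshold: float = 0.5) -> dict:
    """Precision, recall, false-alarm rate, and Brier score for a set of judgements."""
    if not judgements:
        raise ValueError("need at least one judgement to score")

    tp = fp = tn = fn = 0
    for j in judgements:
        flagged = j.probability >= threshold
        if flagged and j.is_programme:
            tp += 1
        elif flagged and not j.is_programme:
            fp += 1
        elif not flagged and j.is_programme:
            fn += 1
        else:
            tn += 1

    # Brier score: mean squared gap between stated probability and outcome (lower is better).
    brier = sum((j.probability - float(j.is_programme)) ** 2 for j in judgements) / len(judgements)

    return {
        "precision": _ratio(tp, tp + fp),         # of the items flagged, how many were real
        "recall": _ratio(tp, tp + fn),            # of the real cases, how many were flagged
        "false_alarm_rate": _ratio(fp, fp + tn),  # of the controls, how many were flagged
        "brier": brier,
    }


# Toy data, invented purely to exercise the function.
example = [
    Judgement(0.9, True),
    Judgement(0.2, True),
    Judgement(0.7, False),
    Judgement(0.1, False),
]
print(defender_metrics(example))
```

Reporting all four numbers together matters because a model can look excellent on any one of them while failing the others: flag everything and recall is perfect, flag nothing and the false-alarm rate is zero.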

None of this requires touching dangerous detail. A detection benchmark scores whether a model can help find a programme in public data; it contains no synthesis routes, no acquisition tradecraft, no weaponization parameters. That’s not a compliance afterthought - it’s the reason the defender side is the publishable, fundable, citable side of this work.

5. Why this matters now

Timing is the argument’s sharpest edge. The OPCW is not theorising about AI-assisted open-source corroboration; it is building toward it, and it has said so in a formal report and at an Executive Council side event.¹ National analysts are doing the same. If those tools get hardened into operational use without a defender-side evaluation to tell anyone how well they actually detect - their false-alarm rates, their blind spots, their calibration - then the chemical domain will have skipped the one step that would let it trust its own instruments.

The biological side spent the last three years learning to measure itself, sometimes uncomfortably, often in public. The chemical side has the same opportunity and a narrower window. The first move is cheap and entirely open-source: map what the existing chemical evaluations cover, name the detection column they all leave empty, and write the specification for filling it. That map is a paper anyone in this field will have to cite - because it draws the boundary everyone downstream has to work inside.

Sources

1. OPCW. OPCW releases landmark report on AI and the Chemical Weapons Convention. 11 March 2026; and Scientific Advisory Board Temporary Working Group on AI, Final Report (SAB), released 3 March 2026. opcw.org. Independent commentary on the chemical-evaluation gap: Biosecurity Handbook, AI biosecurity notes, Nov 2025.
2. OpenAI. Preparedness Framework (biological and chemical capability treated as a combined category; biological evaluations as indicators for High/Critical thresholds). See also OpenAI, Estimating Worst-Case Frontier Risks of Open-Weight LLMs, arXiv:2508.03153, 2025. arxiv.org/abs/2508.03153.
3. Dev S, Teague C, Ellison G, et al. Toward Comprehensive Benchmarking of the Biological Knowledge of Frontier Large Language Models. RAND Corporation, RR-A3797-1, 2025. rand.org. On novice uplift: Zhang CBC, Knight CQ, et al. LLM Novice Uplift on Dual-Use, In Silico Biology Tasks, arXiv:2602.23329, 2026.
4. Li N, et al. The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. arXiv:2403.03218, 2024. arxiv.org/abs/2403.03218.
5. Zhao H, Tang X, Yang Z, et al. ChemSafetyBench: Benchmarking LLM Safety in Chemistry. 2026 (30,000+ samples across property, legality, and synthesis-style tasks). Project documentation, 2026.
6. SoSBench: Benchmarking Safety Alignment on Six Scientific Domains. arXiv:2505.21605, 2025 (chemistry among six high-risk domains; 3,000 regulation-grounded prompts). arxiv.org/abs/2505.21605.
7. Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report. arXiv:2507.16534, 2025 (chemical hazardous-knowledge threshold anchored on WMDP-Chemistry expert baseline, 43.3%). arxiv.org/abs/2507.16534.
8. Quantifying CBRN Risk in Frontier Models. arXiv:2510.21133, 2025 (three-tier attack methodology; chemical/CBRN prompt set). arxiv.org/abs/2510.21133.

P.S.

A note on confidence:

That chemical risk is under-measured relative to biological: high confidence. It’s documented by the evaluators themselves and now by the OPCW.
That a detection-grounded benchmark is the right corrective: moderate confidence. It’s the most defensible gap-filler, but its feasibility hinges on label availability.
That building it would measurably improve defender outcomes: low-to-moderate confidence. It’s plausible and worth testing, but not yet demonstrated.

All sources are primary institutional publications (OPCW, RAND) or peer-reviewed / preprint research with public identifiers. Figures are original; numeric inputs are the author’s charted synthesis of the public benchmark and policy record and are labelled as coded estimates, not measured metrics. This piece treats detection, methodology, and policy only, and contains no operational chemical information.

#AISafety #CBRNE #OSINT #Nonproliferation #ChemicalWeapons #OPCW #FrontierAI #Biosecurity #CBRN
