I’m aware of the school of thought that says all the dramatic “AIs are so powerful we are at risk of ending humanity” claims are nothing but hype, in the end mainly benefiting the status and wealth interests of the senior people and companies developing these systems. But, for a moment, let’s take at face value the concerns espoused by the leaders of these for-profit companies.
In which case, Anthropic and OpenAI’s release of their latest models last week feels particularly irresponsible.
As noted by Transformer, both were made available to the public even though the companies concerned think their models are so advanced that we no longer know how to comprehensively test them for safety.
Opus 4.6, Anthropic said, “has saturated all of our current cyber evaluations,” meaning that “we can no longer use current benchmarks to track capability progression.”
The new model has also “roughly reached the pre-defined thresholds” for ruling out the next level of autonomous AI R&D risks, namely whether it’s capable of fully automating an entry-level researcher’s work.
Instead of a rigorous benchmark, then, Anthropic resorted to an internal survey of 16 employees, who decided the model probably wasn’t capable enough. Even so, the company noted that it has “uncertainty around whether this threshold has been reached.”
As for CBRN risks, “the CBRN-4 rule-out is less clear for Opus 4.6 than we would like,” and “a clear rule-out of the next capability threshold may soon be difficult or impossible under the current regime.”
“CBRN risks” refers to “chemical, biological, radiological, and nuclear weapons knowledge”. An (unrelated) recent preprint paper already had this to say:
Frontier Large Language Models (LLMs) pose unprecedented dual-use risks through the potential proliferation of chemical, biological, radiological, and nuclear (CBRN) weapons knowledge…
Our findings expose critical safety vulnerabilities: Deep Inception attacks achieve 86.0% success versus 33.8% for direct requests, demonstrating superficial filtering mechanisms; Model safety performance varies dramatically from 2% (claude-opus-4) to 96% (mistral-small-latest) attack success rates; and eight models exceed 70% vulnerability when asked to enhance dangerous material properties. We identify fundamental brittleness in current safety alignment, where simple prompt engineering techniques bypass safeguards for dangerous CBRN information.
These results challenge industry safety claims and highlight urgent needs for standardized evaluation frameworks, transparent safety metrics, and more robust alignment techniques to mitigate catastrophic misuse risks while preserving beneficial capabilities.
After all, in theory life-ending events don’t require AIs to rise up and decide to destroy humanity themselves - they could simply assist malicious humans in doing so.
Similarly, from Transformer, re OpenAI:
OpenAI said much the same thing: “We do not have definitive evidence that this model reaches our High threshold [for cyber risk], but are taking a precautionary approach because we cannot rule out the possibility that it may be capable enough to reach the threshold.”
The general conclusion, then, is that:
…it is becoming increasingly difficult to tell whether AI models have dangerous capabilities. Evaluation techniques cannot keep up, and companies are resorting to flimsy methods to assess what their models can do.
Nonetheless, seemingly happy to live in irresponsible ignorance, they keep on plugging away, spending incredible amounts of time, money, and environmental resources in an effort to be the first to build the next generation of what some fraction of their owners profess to believe are systems with extraordinary risks. This does not seem wise, assuming the companies concerned truly believe what they say they do.