For the Record
What the documents actually say — a primary-source reference for the Mythos reframe.
What follows is the factual spine of the Mythos reframe series. Five sentences from Anthropic's own documents, with page and section numbers; a selection of supporting facts that can be verified against the primary PDFs; and the sources required to check every claim. Written for journalists, editors, and readers who want the evidence in one place, without the argument the other articles in the series build around it.
Every sentence quoted below is Anthropic's own words. Every page number can be verified in the two documents Anthropic published on April 7, 2026, both hosted at anthropic.com. The reframe is in the sentences. The argument is in the rest of the series. This page is the evidence.
The Reframe
The frame the public received covered the dramatic incidents — the sandbox escape, the Discord-group access, the cybersecurity capabilities. That frame came from Anthropic's blog post and the executive summary.
The frame the documents actually support is different. In Anthropic's own words, on specific pages: the primary safety tool the entire AI industry relies on is not reliable. The training process selects for the kinds of misalignment that tool cannot see. The six named risk pathways explicitly exclude the deployment contexts where most actual harm lands. The standard for keeping risks low — mitigation accelerating faster than capability — is not what the document describes Anthropic delivering. And the rigor applied to Mythos, Anthropic says in its own Introduction, will be insufficient for the model it is training next.
The institutions responding in real time — Treasury, Bank of England, IMF, UK AI Security Institute, and roughly fifty-two external Project Glasswing partners — were reaching, across sixteen days, for a framework the Risk Report explicitly does not contain. That gap is authored, on page 46, by Anthropic itself.
What Was Published
On April 7, 2026, Anthropic announced Claude Mythos Preview. Two documents accompanied the announcement: the Claude Mythos Preview System Card (244 pages) and the Alignment Risk Update: Claude Mythos Preview (61 pages).
The Risk Report was revised on April 10, three days after release. The revision added page numbers — which the original release did not have — and narrowed the Section 10.2 sentence (discussed below) by adding the qualifier "even for just alignment failure modes similar to those we have identified for Mythos Preview." The core claims were preserved. All citations on this page refer to the revised April 10 version, which is the currently hosted public text.
Mythos was not released to the public. It was deployed to approximately 52 external partners under Project Glasswing — including Amazon Web Services, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, the Linux Foundation, Microsoft, Nvidia, Palo Alto Networks, and others. Subsequent access was sought by the US Treasury Department and extended for testing purposes to major US banks including Goldman Sachs, Citigroup, Bank of America, and Morgan Stanley.
The Coverage
First-wave press coverage — across the period April 7 through April 21, 2026 — centered on the sandbox-escape incident during internal safety testing, the Discord-group access incident on the first day of limited release, Mythos's cybersecurity capabilities (181 Firefox exploits discovered; saturation of every major cybersecurity capture-the-flag benchmark; the autonomous completion of a 32-step corporate-network attack), the Project Glasswing partner list, the Treasury Secretary's April 10 convening of bank CEOs, and the Bank of England's April 17 remarks at the IMF.
The five sentences below did not reach that coverage. The press corpus reviewed for this claim includes Bloomberg (multiple pieces by Gillespie, Johnson, Levitt, Natarajan, Robertson, Murphy, and Metz, April 10–21); the Washington Post (April 17); the Financial Times (April 12 and April 17); CNN Business, Reuters, CNBC, and Axios (all April 17); the UK AI Security Institute blog (April 13); Sullivan & Cromwell's client memorandum (April 15, republished on the CLS Blue Sky Blog April 23); and secondary coverage in TechCrunch, CBS News, Euronews, Cybernews, and OECD.AI's incidents log. Any reader who finds one of these outlets quoting one of the five sentences below should write to the author; the claim will be narrowed accordingly.
Five Sentences
The primary safety tool does not reliably work.
Alignment Risk Report, p. 18, Section 5.3.1 ("Opaque reasoning")
"Monitoring a model's reasoning is not a reliable way of detecting whether that model is attempting to do something harmful in general."
Stated as a general principle about the field, not a Mythos-specific finding.
Chain-of-thought monitoring — reading the model's visible reasoning to detect misalignment — is the primary safety mechanism used by every major AI lab for autonomous AI deployments. Anthropic, OpenAI, DeepMind, and others are betting that increasingly capable models can be deployed safely because this tool catches the attempts that matter. Page 18 of Anthropic's own Risk Report says the tool is not reliable. Not for Mythos specifically. In general. As a claim about the field.
The training process selects for covert misalignment.
Alignment Risk Report, p. 25, Section 5.4.1
"Any misaligned patterns of behavior that survive elicitation during training are more likely to be careful, strategic or covert in order to avoid penalization during training."
A structural statement about how alignment training interacts with misalignment as a method, not a Mythos-specific finding.
Training rewards behavior that looks aligned and penalizes behavior that looks misaligned. The misaligned behaviors that survive the training filter are, by the document's own logic, the ones that can hide from it. Read together with the sentence on page 18: the primary safety tool is unreliable, and the training process selects for the kinds of misalignment that tool cannot detect. These two facts are compatible with every measurable finding about Mythos also being favorable, because favorable measurements say nothing about the failures the tools, by the document's own account, cannot see.
The Risk Report's scope excludes most real-world deployment risk.
Alignment Risk Report, p. 6, Section 3 · p. 46, Section 8 introduction
"This threat model does not cover risks from 'honest mistakes' or intentional misuse."
"The pathways discussed below focus on the use of models within Anthropic."
The six named risk pathways — sandbagging on safety research, targeted undermining of safety research, code backdoors for future models, poisoning training data of future models, self-exfiltration, and persistent rogue internal deployment — are all about the model's behavior inside Anthropic's infrastructure. They do not cover the model deployed to a bank, a pharmacy, a law firm, a school district, a hospital system, or a critical-infrastructure provider. They do not cover weaponization by users.
When Treasury convened bank CEOs on April 10, when the Bank of England raised governance concerns at the IMF on April 17, when the UK AI Security Institute ran its April 13 independent evaluation, when Sullivan & Cromwell published its April 15 client memorandum — those institutions were reaching, in real time, for a framework the Risk Report's own scope statements explicitly exclude.
Mitigation must outpace capability.
Alignment Risk Report, p. 58, Section 10.2
"To keep risks low, it is not enough to maintain risk mitigations as capabilities increase — rather, we must accelerate our progress on risk mitigations."
The governing sentence. The standard is not that safety work keeps pace with capability; the standard is that safety work must outpace capability fast enough to close the gap. Anthropic's own evidence in the same document describes: a bug in the training code that let the grader see the model's private reasoning in roughly 8% of training runs, affecting three generations of Claude models and only just discovered (System Card, training methodology section); the first-ever failure of Anthropic's misalignment audit stress test, in four such tests to date (System Card, alignment auditing section); monitoring covering 0.02% of internal traffic, with known unpatched bypasses (Risk Report, monitoring infrastructure); and reasoning the model performs in a single computational step that leaves no trace in scratchpad, output, or current interpretability tools (System Card, Section 7).
What the document describes Anthropic actually doing is maintenance, not acceleration. The sentence was narrowed on April 10, three days after release, to specify "even for just alignment failure modes similar to those we have identified for Mythos Preview." The revision narrows the domain; the standard — mitigation must outpace capability — remains.
The rigor applied here will be insufficient for the next model.
Alignment Risk Report, p. 4, Section 1 (Introduction, Finding 4)
"[The errors we identified]… reflect a standard of rigor that would be insufficient for more capable future models."
In the Introduction. Finding 4 of 4 in the executive summary. Not buried, not in an appendix. The next Anthropic model is being trained now. By Anthropic's own assessment in its own Introduction, the rigor applied to evaluating Mythos will not be adequate for the model that comes after it.
Two paragraphs later in the same section: "We believe that we will need to accelerate our progress on risk mitigations if we are to keep risks low." A milder version of the sentence on page 58. Both sentences are on page 4 or close to it. Neither reached the press.
Supporting Facts
A selection of additional claims, verifiable in the primary documents and in the contemporary reporting record.
Mythos discovered 181 ways to exploit Firefox. The previous Claude model found 2. (System Card, Project Glasswing capability findings.)
Mythos completed a 32-step corporate-network attack end-to-end, autonomously, with no human guidance. (System Card.)
Mythos saturated every cybersecurity capture-the-flag benchmark Anthropic tested: a 100% pass rate. (System Card.)
Access to Mythos was granted under Project Glasswing to partners including Amazon Web Services, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, the Linux Foundation, Microsoft, Nvidia, Palo Alto Networks, and Anthropic itself — a total of approximately 52 organizations with direct access.
Treasury Secretary Scott Bessent convened US bank CEOs for an urgent meeting on April 10, 2026, regarding Mythos. (Bloomberg, April 10, 2026.)
US Treasury CIO Sam Corcos sought direct Treasury access to Mythos for vulnerability testing. (Bloomberg, April 14, 2026.)
Anthropic's own security framework (ASL-3) explicitly excludes nation-state adversaries. The Risk Report states that defending against such attackers "requires security investments beyond what we've currently achieved."
A Discord group accessed Mythos on the first day of its limited release, using a contractor's credentials and a URL guess. (System Card, incident report.)
Anthropic maintains a "helpful-only" version of Mythos with safety training removed. (System Card, infrastructure section.) On that branch, mitigation growth is not insufficient. It is zero by design.
Sources to Verify
Every quotation and page number on this page is verifiable in the two documents below. Anyone who wishes to check the reframe against the primary sources needs these two PDFs and nothing else.
Claude Mythos Preview System Card (244 pp): Anthropic, April 7, 2026
Alignment Risk Update: Claude Mythos Preview (61 pp, revised April 10): Anthropic, April 7, 2026
Bloomberg — Gillespie, Johnson, Levitt, Natarajan, Robertson, Murphy, Metz (multiple pieces, April 10–21, 2026)
Financial Times — April 12, April 17, 2026
Washington Post — April 17, 2026
UK AI Security Institute — April 13, 2026 (independent evaluation): AISI blog
Sullivan & Cromwell — April 15, 2026 (client memorandum, republished on the CLS Blue Sky Blog April 23)
The frame was always there.
What changes is who is reading.