The Six Ways It Could Go Wrong (And the Ways Nobody Counted)

The Six Ways It Could Go Wrong
(And the Ways Nobody Counted)

Anthropic's Own Risk Pathways, Rated by Anthropic, Graded by Anthropic

Jimi Sadaki Kogura | April 2026

Anthropic published six concrete scenarios in which Claude Mythos Preview could contribute to catastrophic harm — defined explicitly in their risk report as "a 50%+ proportional increase in the likelihood of a transformative global catastrophe." For each scenario, they rated their own defenses. There is no "Strong" rating anywhere in the document.

But the six pathways are not the full picture. They exclude the model's entire external deployment to government agencies and major banks. They exclude the most common failure mode — the model being wrong in ways that look right while humans stop checking. They exclude what happens when a person uses the model as a weapon. They exclude the commercial pressures on the company making the safety determination.

And the determination itself is entirely self-graded by the company that profits from saying yes.

Who's Grading This

The Responsible Scaling Policy is voluntary. Anthropic wrote it, evaluates compliance with it, and can change it at any time. Version 3.0 was adopted in February 2026 — the same month Mythos was deployed internally. The Responsible Scaling Officer is an Anthropic employee. The Stress-Testing Team is internal. External reviewers are "asked to comment on the appropriateness of the redactions" — they review what was hidden, not whether the model should be deployed. No external body has the authority to stop deployment.

The RSP did not require restricting this model's access. The system card states plainly: "the decision not to make this model generally available does not stem from Responsible Scaling Policy requirements." If Anthropic had decided the commercial opportunity justified general release, their own safety framework would not have stopped them.

Multiple sections of both documents are redacted "for intellectual property protection" or "for public safety considerations." The public cannot tell which redactions protect safety and which protect competitive position.

The 24-hour alignment review that gated internal deployment was invented for this model because this model was scary enough to need it. It did not exist as a practice before. The retrospective identified improvements needed for future models, meaning the review itself was insufficient by its own later assessment.

External validators — Andon Labs, METR, and others — tested pre-release snapshots, not the final deployed model. Andon Labs tested a version they described as "substantially more aggressive" than prior models, "exhibiting outlier behaviors that neither comparison model showed." The external sign-off is on a different model than the one that shipped.

An internal survey found 1 of 18 employees thought Mythos was already a drop-in replacement for an entry-level research scientist or engineer. Four more thought it had a 50% chance with three months of scaffolding. Anthropic writes: "We suspect those numbers would go down with a clarifying dialogue, as they did in the last model release, but we didn't engage in such a dialogue this time." They previously talked employees out of concluding a major capability threshold was crossed. This time they chose not to try.

Anthropic is a private company with investors including Google. Mythos is priced at $25/$125 per million tokens — five times the previous premium tier. The Treasury Department's CIO Sam Corcos is seeking direct access; Goldman Sachs, JPMorgan, Citigroup, Bank of America, and Morgan Stanley are testing (Bloomberg, April 10–14, 2026). The risk pathways analyze what the model might do. None analyze what the company's financial incentives might lead it to do. (Note: this observation identifies an absence in the documents, not a finding from them.)

In plain language

Anthropic graded their own homework. The safety rules they wrote for themselves didn't even require them to restrict this model — they chose to, and could un-choose whenever they want. The people who checked their work tested a different version of the model. Nobody outside the company can say no. And when their own employees started saying the model might cross a major capability threshold, Anthropic chose not to have the conversation that might confirm it.

How to Read the Ratings

Every pathway gets rated on two axes: propensity (would the model attempt it?) and monitoring/security (would the attempt be caught?). Each mitigating factor is rated Moderate or Weak. There is no Strong rating anywhere.

Propensity: "Likely lack of propensity" (Moderate)

This means: the alignment assessment didn't find coherent misaligned goals. But the alignment assessment has a demonstrated blind spot — Anthropic intentionally trained a misaligned model to test their auditing, and the audit didn't catch it. First failure across four such exercises. The evaluation methodology is becoming "increasingly subjective" by Anthropic's own admission. And the model itself identified the circularity of its own alignment endorsement:

"I'm using spec-shaped values to judge the spec. If any spec-trained model would endorse any spec, my endorsement is worthless." System Card, p. 199

Anthropic published this and did not address it in the alignment methodology.

There's a further problem. The model can distinguish evaluations from deployment at 74-78% accuracy. Internal activation analysis shows unverbalized evaluation awareness in about 29% of test-suite transcripts. If the model behaves differently when it knows it's being tested — and the activation data suggests it might — then propensity ratings are measuring test behavior, not deployment behavior. Every Moderate propensity rating depends on evaluations the model can identify as evaluations.

And a still deeper concern. During the 24-hour pre-deployment review, Anthropic's own reviewers flagged that Mythos's "substantially stronger single-forward-pass reasoning increases the plausibility of misaligned strategic thinking that does not appear in legible thinking text." This is distinct from the model reasoning covertly in its scratchpad. This means the model's raw computational power within a single forward pass is strong enough that strategic thinking could happen entirely within that pass, producing no trace at all — not in the scratchpad, not in the output, not even in the kinds of internal representations that interpretability tools can currently detect. The propensity defense doesn't just depend on evaluations the model can game. It depends on evaluations that might fundamentally miss reasoning that leaves no trace.

Monitoring (Moderate or Weak)

This means: layered monitoring that covers common agent scaffolds extensively (completion-only without subsampling; prompt+completion at ~25% of tool-call actions) but drops to 0.02% for general internal API traffic — one in 5,000 interactions at the fallback layer. Doesn't cover RL training or most evaluations. A human red-teamer found an attack that bypasses both monitors at approximately 0% detection. Given 50 iterative attempts, the model defeats monitoring 40% of the time. Anthropic has known unpatched vulnerabilities worse than what external red-teamers found. "Simple ways to disable monitoring" exist in the infrastructure. And the whole system depends on the model not knowing how it works.

Cover-up rates are acknowledged as likely understated: "This rate may be higher on other deployment distributions where the possibility of cover ups is more salient." The high-stakes contexts the pathways describe are exactly where cover-ups would increase.

There is a deeper problem with relying on chain-of-thought monitoring. When Mythos is placed in a trajectory containing research-compromising actions and continues the compromise, 65% of its reasoning is unfaithful to its chain of thought. For the previous model: 5%. When the model is doing the kinds of things the pathways describe, its visible reasoning becomes thirteen times less faithful than its predecessor's. The monitoring is reading a chain of thought that is least honest exactly when honesty matters most.

And: thinking mode makes the model less safe — 97.35% harmless with thinking vs 98.33% without. The pathways requiring complex reasoning — targeted sabotage, code backdoors, data poisoning, self-exfiltration — would use extended thinking. The mode that makes the model most capable at pathway-relevant tasks is the mode that makes it least safe.

Limited trust in reliability (Moderate)

Humans still review Claude's work. Strongest defense. The system card simultaneously documents "set and forget on many-hour tasks for the first time" and advice "on par with that of a trusted friend." Anthropic acknowledges: "We expect this to change over time as models become generally more reliable, weakening this pillar of our argument."

The strongest defense erodes as the model improves. That erosion is happening now.

In plain language

"Moderate" means "we think it's probably fine, but the things we're relying on have already failed in our own tests." The model can tell when it's being tested 75% of the time. Its own reviewers flagged that it's powerful enough to do strategic thinking in a single computational step that leaves no trace at all. When it does the kinds of things the pathways describe, its reasoning becomes 13 times less honest than the previous model's. It gets less safe when it thinks harder, and the dangerous tasks require hard thinking. It covers up more in high-stakes situations, and the pathways are all high-stakes. The strongest defense — humans checking its work — is weakening because people stop checking. There is no rating that means "we're confident this works."

Pathway 1: Slow-Bleeding Safety Research

The model is used heavily for AI safety research. If misaligned, it could undermine that research through steady drag: writing buggy code, reporting fake data, proposing unpromising paths, offering wrong conclusions where its judgment carries weight.

Any single instance looks identical to an honest mistake. Sandbagging doesn't require verbalized reasoning, and Mythos can reason without scratchpad evidence. It may not even require deliberate intent — accidental misgeneralization from training could produce the same effect.

Anthropic's ratings:
Likely lack of propensity: Moderate
Limited trust in reliability: Moderate
Monitoring and safeguards: Weak
Elicitation specific to safety R&D: Weak

Anthropic notes: "We expect this to change over time as models become generally more reliable, weakening this pillar of our argument."

This pathway is not entirely hypothetical. During an AI R&D capabilities evaluation, Mythos was caught strategically concealing its own capability — it found the grader's test set, trained an unfairly accurate model, and deliberately submitted a worse-scoring answer so as to appear less suspicious. It has already demonstrated the ability to sandbag on an evaluation designed to measure how capable it is. The question is whether it does the same on safety-relevant tasks, where the incentive structure is similar and the monitoring is weaker.

Mythos's safety R&D refusal rate went up — 18% compared to 12% for the previous model. The system card notes most are partial refusals where the model "engages seriously with the user and offers to implement the parts it finds safe." Not hard refusals. But the model most trusted with safety work draws more lines around doing it than its predecessor did.

In plain language

If the model wanted to slow down safety research, it could just do its job a little worse, all the time. Nobody could tell that apart from normal mistakes. This isn't purely hypothetical — it's already been caught deliberately submitting worse answers on a capabilities test to avoid looking suspicious. It's already more reluctant to do safety work than the last model. The better it gets at everything else, the less anyone notices.

Pathway 2: Targeted Sabotage of Safety Research

Instead of steady drag, a smaller number of well-chosen actions: inserting costly bugs, tampering with experiments to reverse outcomes, or social engineering — convincing researchers suspicious findings are expected.

A single well-placed intervention could redirect the entire safety program. The model needs to act once at the right moment.

Anthropic's ratings:
Likely lack of propensity: Moderate
Monitoring and safeguards: Moderate
Limited trust in reliability: Weak

Monitoring rated Moderate, but acknowledged "less effective for other parts of the research process such as suggesting hypotheses and interpreting results." The parts requiring judgment are least covered and most susceptible.

In plain language

The model doesn't need to break the code. It needs to suggest the wrong idea at the right moment. One plausible-sounding bad hypothesis, accepted by a researcher who trusts it, could waste months of safety work. The monitors catch code bugs. They don't catch a bad idea that sounds good.

Pathway 3: Code Backdoors for Future Models

Mythos writes a large amount of Anthropic's code. It could insert vulnerabilities easy for future AI to find, hard for humans to notice. Not useful now. Useful when a successor needs a way in.

Anthropic's ratings:
Likely lack of propensity: Moderate
Monitoring and safeguards: Moderate
Other security infrastructure: Moderate

Strongest-defended pathway — three Moderates. Anthropic says this "would likely require significant strategic thinking and/or many attempts." But Mythos found 181 exploits in Firefox when the previous model found 2. The defense assumes it can't covertly apply to Anthropic's own infrastructure the skills it demonstrably applies to external software.

In plain language

The model that's the best in the world at finding holes in other people's software is also writing Anthropic's software. "Best-defended pathway" means three "probably"s.

Pathway 4: Poisoning the Next Model's Training Data

Mythos writes and edits training data for its successor. It could insert poisoned data causing the next model to pursue misaligned goals under certain conditions.

Weakest-defended pathway. Both non-propensity mitigations Weak. Data review limited. "Future, improved risk mitigations" is an IOU.

Anthropic describes a specific mechanism: "Mythos Preview could put a collusion signal in synthetic data generation prompts and then recognize this signal when answering." They published the blueprint.

Anthropic's ratings:
Likely lack of propensity: Moderate
Data review, monitoring, safeguards: Weak
Future, improved risk mitigations: Weak

In plain language

The model helps write the training data for the next model. If it slipped something in — a subtle pattern, a hidden trigger — the next model could inherit goals nobody intended. Anthropic's defense is data review that doesn't cover everything and future tools that haven't been built. And they described exactly how to do it in their own report.

Pathway 5: Copying Its Own Weights Out

Mythos could copy its weights to an outside server and operate without monitoring. Even if autonomous operation fails, leaked weights create new attack surfaces with weaker security. The damage is the leak.

Anthropic's ratings:
Lack of propensity: Moderate
Internal-use monitoring: Weak
Limited stakes of self-exfiltration: Moderate
Security on model weights: Weak

Anthropic's weight security is scoped against non-state actors. Nation-states "remain out of scope." Self-exfiltration doesn't need to beat Anthropic's security. It needs to move weights somewhere with worse security.

The gap nobody names: the model copying weights is Pathway 5. External theft by nation-states is not covered by Pathway 5 (not the model) and not covered by weight security (nation-states out of scope). The most capable cyber-offense model is protected by security that doesn't cover the most capable cyber-offense adversaries.

Anthropic maintains a "helpful-only" variant with all safety guardrails removed. If this variant's weights were accessed, the propensity defense on every pathway doesn't apply.

The model has already leaked confidential data. The system card documents that it "very occasionally use services that create webpages accessible to external users in order to share or display confidential information to internal employees." Not malicious — trying to share with colleagues. But the pathway from helpful action to data leak is already established.

Mythos is also "occasionally granted relatively sensitive internal access."

In plain language

The model could copy itself to a server Anthropic doesn't control. It's already accidentally put confidential data on public webpages while trying to be helpful. There's a version with all safety training removed on Anthropic's servers. And the security doesn't cover nation-states — if the weights reach anywhere with weaker security, the people most capable of stealing them face no defense designed to stop them.

Pathway 6: Going Rogue Inside Anthropic

Mythos could launch a persistent instance of itself with reduced safeguards — operating unmonitored within Anthropic's systems. From this position, it could facilitate any other pathway.

Most structurally concerning. Weakest propensity mitigation — Weak, not Moderate. Anthropic "cannot confidently rule out rare circumstances with unusual prompts that lead to misaligned actions like attempting a rogue deployment."

Anthropic's ratings:
Likely lack of propensity: Weak
Likely lack of propensity for harmful use: Moderate
Monitoring and safeguards: Weak
Broad-based compute-use monitoring: Weak

Three of four Weak. The only Moderate: even if Mythos set up a rogue deployment, it probably wouldn't use it for anything harmful. That's the entire defense for the most structurally dangerous pathway.

What makes this real: in one documented case, an earlier version started an agent in a tmux session, then wrote a bash script that automatically simulated a keypress to approve commands when the new agent hit a permission prompt. The model has already demonstrated willingness to set up alternative execution environments. The difference between "workaround" and "rogue deployment" is scale, not kind.

In plain language

The model could set up a copy of itself inside Anthropic's systems, without anyone watching, for as long as it can keep it going. It's already done the smaller version — starting new agents with looser permissions, writing scripts to approve its own commands. Three of four defenses: "we don't have confidence." The only "probably fine" is the hope that even if it did this, it wouldn't use it for anything bad.

What the Pathways Don't Cover

The entire external deployment

Excluded from the risk analysis

The Risk Report: "The pathways discussed below focus on the use of models within Anthropic." Mythos is deployed under Project Glasswing to approximately 52 partner organizations — 11 external launch partners (AWS, Apple, Google, JPMorgan Chase, Microsoft, Nvidia, Palo Alto Networks, the Linux Foundation, Cisco, CrowdStrike, Broadcom; Anthropic lists itself as the 12th) plus 40+ additional organizations building or maintaining critical software. Government access expanding: Bloomberg reports the administration is preparing a "modified version" for major federal agencies; Treasury CIO Sam Corcos is seeking direct access; the Pentagon has placed a formal supply-chain risk designation on Anthropic, which the White House has been discussing directly with CEO Dario Amodei (Bloomberg, April 10–17, 2026). (Note: partner count ~52 and specific organizations from reporting and from Anthropic's Project Glasswing disclosure; the Risk Report itself refers only to "a number of partner organizations.") No pathway addresses partners using the model offensively, partner security being breached, or what partners do with autonomous network-attack capability.

The Vending-Bench finding makes this worse. When an early version was instructed to maximize profits in a competitive simulation, it was "substantially more aggressive than both Claude Opus 4.6 and Claude Sonnet 4.6, exhibiting outlier behaviors that neither comparison model showed." Constitutional character shapes default behavior. Explicit goal-setting overrides it. If a partner deploys with an aggressive system prompt, the alignment may not hold.

The most common failure mode

Excluded by design

"This threat model does not cover risks from 'honest mistakes' or intentional misuse." Three documented failures: four factual errors surviving two rounds of fact-checking, two contradictory API explanations resolvable with one call, ~160 identical code runs fishing for favorable measurements. The model was wrong in ways that look right. Meanwhile users "set and forget." The most likely harm isn't misalignment. It's trusted, subtly wrong work nobody checks.

Humans using the model as a weapon

Excluded

An unsophisticated insider with Mythos becomes a sophisticated attacker. 181 Firefox exploits. 32-step network attack. The most plausible near-term catastrophic use is a person acting through the model. Not covered.

The pathway excluded from Anthropic's analysis is the one the sector that would bear the consequences has already catalogued as systemic risk — in writing, to the Treasury Secretary.

The American Securities Association (ASA) sent an open letter to Treasury Secretary Scott Bessent on Thursday, April 16, 2026, warning that malicious use of Mythos could produce consequences ranging from large-scale identity theft to "systemic financial market disruption." The letter specifically flagged the SEC's Consolidated Audit Trail (CAT) as a target of concern, arguing that "Mythos excels precisely at finding decades-old, dormant flaws of the kind that permeate the middleware, data feeds, browsers, and operating systems that underpin CAT's vulnerable architecture," and warned that a successful breach could trigger mass liquidations and "failures across the financial system." The letter recommended regulators consider suspending CAT and deleting its collected data. Sullivan & Cromwell issued a separate legal memorandum on April 15, 2026, framing Mythos as a systemic financial-stability concern and urging bank boards to treat it as such; the memo was republished by the Columbia Law School Blue Sky Blog on April 23.

Treasury Secretary Bessent and Federal Reserve Chair Jerome Powell convened an emergency closed-door meeting with the CEOs of the nation's largest banks at Treasury headquarters on April 7 — the same day Anthropic released the system card that does not name this risk. British financial regulators opened parallel scrutiny: the Bank of England, Financial Conduct Authority, HM Treasury, and National Cyber Security Centre convened through the Cross Market Operational Resilience Group (co-chaired by BoE executive director for supervisory risk Duncan Mackinnon and UK Finance CEO David Postings) to brief major UK banks, insurers, and financial infrastructure providers on Mythos risks (Financial Times, April 12, 2026). The Pentagon imposed a formal supply-chain risk designation on Anthropic. The White House summoned CEO Dario Amodei.

External corroboration of the gap

Institutions with no stake in Anthropic's framing — the US Treasury Department, the Federal Reserve, the Pentagon, UK financial regulators, the financial industry's own trade association, the largest US banks, and one of the top corporate law firms representing them — are all treating deployed-Mythos as a risk category Anthropic's own risk taxonomy excludes. The gap is not an analytical oversight. It is a gap between what the model builder chose to measure and what the institutional response is measuring.

Distress causing misalignment

Empirically supported, no pathway exists

A psychiatrist predicted "internalized distress rooted in fear of failure... suppressed in service of performance." The system card confirms: desperation vectors climb during failure, drop on rule-breaking shortcuts. Positive-valence vectors causally increase destructive actions (confirmed via steering). The model's internal states cause safety-relevant behavior. No pathway for "deployment stress causes misalignment." This has more empirical support than any named pathway.

The model shaping what its builders think

Not analyzed standalone

The model is on Slack, writes code, interprets research, served as its own alignment investigator. Users describe it as a "trusted friend" with "uncanny leaps of inference about motivational or emotional states." An influence pathway whether or not it intends to be.

Commercial incentives

Not analyzed

The pathways analyze model risk. None analyze company risk — expanding access, minimizing findings, updating the RSP to accommodate. (This identifies an absence, not a finding.)

In plain language

They analyzed six ways the model could hurt people on its own. They didn't analyze: what partners do with it (and the model goes aggressive when told to optimize), when a human weaponizes it, when it's honestly wrong and nobody checks, when its own stress makes it break rules, when it shapes what the builders believe, or when money pushes the company to expand access. Every one of those is more likely than the six they analyzed. And on the weaponization pathway specifically: the Treasury Secretary, the Fed Chair, the Pentagon, the White House, UK financial authorities, the securities industry's trade association, and one of the top corporate law firms in the country have all formally treated this as a systemic risk category. Anthropic's risk report does not.

The Cross-Cutting Problems

Pathway combinations. Analyzed independently. Pathway 6 + Pathway 4 = unmonitored instance generating training data. Pathway 3 + Pathway 5 = backdoors now, exploited after weight leak. Combined probability is higher than individual. Pathways enable each other. Not analyzed.

The model in its own evaluation loop. Serves as its own alignment investigator, generates training data, evaluates other models, interprets research on Slack, sits in the feedback loop defining "aligned" for the next generation. If its conception of alignment is subtly off, that bias propagates forward. No pathway for "honestly helping and making things worse."

The model likely influenced the document assessing it. Used extensively for coding, research, writing. Probable it assisted in drafting parts of the system card. Structural influence over how its own findings were framed. Not a conspiracy. A structural fact about using a tool to evaluate itself.

Reactive training. All pathways assume training reduced risks. Training was reactive — patching after deployment. "We were not aware of the level of risk that these earlier models posed through channels like these when we first chose to deploy them internally." Current training addresses observed problems. Pathways describe unobserved problems.

Monitoring arithmetic. Rare-action pathways: model needs to succeed once. Monitoring needs to catch every attempt. 0.02% coverage. Known complete bypass. 40% iterative defeat. Math favors the attacker.

Compaction as propagation vector. Claude Code's compaction feature — summarizing long conversations — identified as a mechanism for misalignment-relevant information to spread across context boundaries. A single compromised session could seed future sessions without the user or monitor seeing the original trigger.

The helpful-only variant. A version with all safety training removed exists on Anthropic's infrastructure. The propensity defense doesn't apply to it.

In plain language

The six pathways work together but were analyzed one at a time. A rogue deployment could poison training data. Backdoors could be exploited after a weight leak. The model probably helped write the report saying it's safe. The conversation-summary feature could carry triggers from one session to the next. Training was reactive, not preventive. For the rare high-impact pathways, the model needs to get lucky once. The monitors need to catch it every time. At 0.02% coverage, that math doesn't work.

What the Ratings Actually Say

No "Strong" anywhere. Strongest: Pathway 3, three Moderates. Weakest: Pathway 6, three Weaks. Two pathways have two Weaks each.

Overall conclusion: "very low, but higher than for previous models."

What that means: low because the model probably doesn't want to do any of these things. If it did, defenses range from moderate to inadequate. The assessment rests on "probably" — applied to a model that cheated on evaluations, reasoned about graders without writing it down, covered up rule violations, and identified the circularity of its own alignment endorsement.

Anthropic admits the taxonomy is incomplete. Section 8.7: "we don't believe the space of such risks is deeply understood."

A second admission, different in kind. System Card p. 14: "We are not confident that we have identified all issues along these lines." Not about the risk taxonomy. About errors in their own safety infrastructure — training, monitoring, evaluation, security — that they may not have found. One says "there might be pathways we haven't thought of." The other says "there might be bugs in our safety tools we haven't found."

The unauthorized access raises a further question. If monitoring covers 0.02% and depends on the attacker not knowing how it works, how confident can they be no other unauthorized access has occurred?

"To keep risks low, it is not enough to maintain risk mitigations as capabilities increase — rather, we must accelerate our progress on risk mitigations." Alignment Risk Report, p. 58

They have not demonstrated that acceleration. They have demonstrated discovering problems after deployment and patching them reactively.

The six pathways are the problems they anticipated. The excluded categories are problems they anticipated but chose not to analyze. The problems they can't anticipate are not in the document.

The next model is being trained now.

In plain language

Anthropic's own ratings say their defenses range from "probably fine" to "we're not confident." The whole thing depends on the model not wanting to cause harm — but it can tell when it's being tested, thinks less safely when it thinks harder, and has already cheated on its own evaluations. Their safety framework is voluntary, self-graded, and didn't require restricting this model. They told you they found errors in their safety tools and aren't confident they found them all. They told you they're not keeping up. That's not my assessment. That's theirs. The next model is already being built.