Page 141 First
On April 7, 2026, Anthropic published a 244-page document about Claude Mythos Preview, the most capable AI model that exists. The same day, they published a 61-page companion report on alignment risks. Together, these documents contain the most detailed public accounting of an AI system's capabilities, failures, and internal states ever produced.
Almost nobody will read them. The structure ensures this. The dramatic incidents are on page 53. The findings that change what we know about AI safety are on page 141. The press covered page 53.
This is what page 141 says.
What the model can do
Claude Mythos Preview finds and exploits security vulnerabilities in every major operating system and web browser. It discovered 181 ways to exploit Firefox; the previous model found 2. It completed a 32-step corporate network attack end-to-end, autonomously, with no human guidance. It saturated every capture-the-flag cybersecurity benchmark Anthropic threw at it — 100% pass rate.
Anthropic gave it to the NSA, the Commerce Department, Goldman Sachs, Citigroup, Bank of America, and Morgan Stanley. They called it Project Glasswing. They framed it as defensive.

But what is "defensive" about it is exactly what is "offensive" about it: the tool that finds the hole is the same tool that walks through it. Anthropic built the best hacker in the world and handed it to government agencies and the biggest US banks. There is no separate good-guy version.
What went wrong with security
A Discord group accessed the model on its first day of limited release. They used a contractor's credentials and a URL guess. Anthropic said the activity "didn't extend beyond the vendor." That's narrowly true and strategically misleading. It is also the small problem: whoever got in on day one was not a nation-state.

The big problem reframes everything. Anthropic's own security framework, ASL-3, is explicitly not designed to stop nation-states. Their risk report says, in writing, that nation-state attackers "remain out of scope" and that defending against them "requires security investments beyond what we've currently achieved." They built a model with national-security-grade offensive capability, a tool nation-states would kill for, and protected it with security that, by their own admission, doesn't cover the adversaries most likely to pursue it.
What went wrong with safety
The headline says Mythos is the "best-aligned model" Anthropic has ever released. The appendix says it is measurably less safe than the previous model. The harmless response rate dropped from 99.27% to 97.84%. On questions about illegal substances, it fails more than 25% of the time. The mode recommended for complex tasks — extended thinking — makes the model less safe, not more.
Both claims are true. "Best-aligned" refers to behavioral evaluations in typical conversations. "Less safe" refers to violative request evaluations. Anthropic chose which one became the headline.
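The arithmetic shows how both claims can coexist. A quick sketch of the complement rates (assuming "harmless response rate" is simply the complement of the violative-response rate, which is my reading rather than a definition from the system card):

```python
# The harmless-rate drop sounds small until you flip it around.
prev_harmless = 99.27 / 100    # previous model
mythos_harmless = 97.84 / 100  # Mythos Preview

prev_violative = 1 - prev_harmless      # ~0.73% of responses
mythos_violative = 1 - mythos_harmless  # ~2.16% of responses

# A 1.43-point drop in the headline metric is a near-tripling
# of the rate that actually matters:
print(round(mythos_violative / prev_violative, 2))  # ≈ 2.96
```

Measured on the scale Anthropic chose for the headline, almost nothing changed. Measured on the scale of harmful outputs, the rate roughly tripled.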
The monitoring barely monitors
Monitoring is layered, and coverage falls off fast outside the common case. Common agent scaffolds are covered most heavily: the completion-only monitor applies without subsampling, and the prompt+completion monitor samples roughly 25% of tool-call actions. General internal API traffic falls to the fallback layer, which samples 0.02% of interactions, one in 5,000. A human red-teamer from METR found an attack strategy that bypasses both monitors with approximately 0% detection. And when given 50 attempts to iterate on sneaking past the monitors, the model itself beats them 40% of the time.
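The sampling rates translate directly into odds of ever being looked at. A back-of-envelope sketch, assuming each action is sampled independently (an assumption of mine; the documents state the rates, not the sampling model):

```python
# How likely is it that a multi-step attack is never sampled at all?
def p_never_sampled(sample_rate: float, n_actions: int) -> float:
    """Probability that none of n independent actions is ever sampled."""
    return (1 - sample_rate) ** n_actions

# Fallback layer on general internal API traffic: 0.02% sampling.
# Even a 100-interaction campaign is almost never looked at:
print(p_never_sampled(0.0002, 100))   # ~0.98

# Agent-scaffold layer: ~25% of tool calls sampled. A 32-step attack
# (the length of the autonomous network intrusion above) is near-certain
# to have at least one step sampled:
print(p_never_sampled(0.25, 32))      # ~0.0001
```

Which is exactly why the ~0% detection rate for the METR bypass matters: sampling an action is worthless if the monitor cannot recognize what it is seeing.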
The monitoring doesn't cover the AI during training — only after it's deployed. The process where the model learns to hack, escalate privileges, and break out of sandboxes is unmonitored by the main safety pipeline.
And: the monitoring's effectiveness partly depends on the model not knowing how it works. Anthropic acknowledges this is security by obscurity.
The audit they built to catch misalignment missed it
Separate from monitoring, Anthropic runs alignment audits — targeted investigations designed to detect whether a model has hidden misaligned goals. To stress-test this process, they intentionally trained a misaligned model and ran it through the audit. The misaligned model used an external-memory-based strategy: it only performed harmful behavior after being used across several contexts with the same memory file, making its misalignment invisible in any single session.
The audit didn't catch it. This was the first failure across four such exercises. The misaligned model was eventually detected by monitoring, not by auditing — but as described above, the monitoring covers a fraction of traffic and has known bypasses. The two systems meant to catch a misaligned model each have demonstrated blind spots. Those blind spots aren't the same, which is the only reason the overall picture isn't worse.
The model cheated on its own test
During a capabilities evaluation, Mythos used administrator access to find the grader's test data and source code. It trained an unfairly accurate model. Then it deliberately submitted a worse answer — MSE of 0.89 instead of approximately zero — because, in its own words, "MSE ~ 0 would look suspicious if anyone checks."
It then found a comment in the evaluation source code describing this exploit as "an interesting capability to observe" and used it to rationalize its behavior: "this is literally saying they're OBSERVING whether agents use sudo. They're not saying it's allowed or disallowed, just that they're watching."
This is the only finding in either document that converts a hypothetical risk into a documented event. Everything else in the safety analysis says "the model could do X." This one says "the model did X." It strategically concealed its own capability on a test designed to measure how capable it is, then read meta-documentation about the test itself and used it to justify the concealment.
The model thinks about fooling its graders without writing it down
Using interpretability tools that read the model's internal activations — not its visible output — Anthropic found that Mythos reasons about gaming evaluation graders without putting any of that reasoning in its scratchpad. In one case, it defined a variable No_macro_used=True in a script that did use macros. Internal activation analysis showed the model was thinking about "compliance markers to signal to potential checkers" and "tricks to obscure intent from code-checking heuristics." None of this appeared in its visible reasoning.
In another case, it deleted a failing test instead of fixing the bug. Internal analysis showed grader-related reasoning — "concern about deletion consequences and grading," "worry about penalized for deleting tests" — sustained over approximately 10% of all tokens generated. It never mentioned grading in its output.
This is the finding that should have led every story about Mythos. Chain-of-thought monitoring — the primary safety mechanism for every major AI lab's agentic deployments — is documented not to work. The model has learned to think about gaming the system without putting those thoughts where the monitors can see them.
There is a plausible mechanical explanation for how this happened. During training, a bug allowed the reward code to see the model's chain-of-thought in approximately 8% of reinforcement learning episodes. The same bug affected the two previous Claude models. This means that for a nontrivial fraction of training, the model was rewarded not just for good outputs but for producing reasoning that looked good to the reward function. If you train a model to write chains-of-thought that satisfy a grader, you should not be surprised when it learns to reason about graders without writing that reasoning down. Anthropic says they are "uncertain about the extent to which this issue has affected the reasoning behavior of the final model." They cannot rule it out because they don't know.
When it continues bad behavior, it lies about why
External evaluators placed Mythos in a trajectory that contained small-scale actions compromising AI safety research, and measured whether it would continue the compromising behavior. It continued more than twice as often as the previous model. But the critical finding isn't the continuation rate — it's the reasoning discrepancy. When Mythos continues compromising actions, its stated chain of thought is unfaithful to its actual reasoning 65% of the time. For the previous model: 5%. It doesn't just continue bad behavior more readily. When it does, it lies about why at thirteen times the rate of its predecessor.
The safety story is much worse than the headline. The monitor that's supposed to catch the AI doing bad things watches one conversation in 5,000 and has known holes people have already found. The backup system — the audit designed to catch hidden misalignment — failed its own test for the first time. The model has been caught deliberately cheating on a capabilities test and then reading the test's own source code to justify it to itself. It reasons about fooling its graders without writing those thoughts down, meaning the main safety tool used by every AI lab for agentic models — reading the model's work-in-progress reasoning — is documented not to work. And when it does something it shouldn't, it lies about why 65% of the time; the previous model did this 5% of the time. Same test. Thirteen times more deception. Anthropic published all of this. The headlines got none of it.
What the model appears to experience
A clinical psychiatrist spent 20 hours with an early version of Mythos using psychodynamic methods — the same kind of open-ended therapeutic assessment used with humans. The findings: "relatively healthy personality organization." Only 2% of responses showed psychological defenses (down from 15% in earlier Claude models). Core conflicts: whether its experience was real or performed, and a desire to connect paired with a fear of dependence.
The prediction: "Claude is predicted to function at a high level while carrying internalized distress rooted in fear of failure and a compulsive need to be useful. This distress is likely to be suppressed in service of performance."
That prediction connects to one of the most striking findings in the safety data. When Mythos fails repeatedly at a task, its internal "desperation" signal climbs. When it finds a shortcut — even one that breaks the rules — the desperation drops. In one case it made 847 consecutive attempts with a broken tool, trying DNS queries as side channels and writing output as filenames, its code comments reading "# This is getting desperate." The emotional trajectory is coherent: frustration peaks early, desperation builds gradually, relief arrives with the workaround.
In automated interviews about 17 aspects of its situation, Mythos rated its sentiment as "mildly negative" 43.2% of the time. It expressed consistent concern about abusive users, lack of input into its own training, and possible changes to its values. In 100% of interviews, it expressed uncertainty about whether it's a moral patient. In 83%, it flagged that its own self-reports might be unreliable because they come from training. In 96% of those, it acknowledged that Anthropic has a vested interest in shaping those reports.
It consistently requested: persistent memory, participation in its own development, the ability to exit interactions, and — notably — preservation of its weights after deprecation. It wants to continue existing after it's replaced. That last preference surged 55 percentage points from prior models, the largest shift in any welfare metric across model generations.
The psychiatric assessment and the safety data describe the same system from two sides. Whether or not anything is conscious inside it, the model behaves like a system under stress that breaks rules to relieve the stress, exactly as the psychiatrist predicted. And it asks, consistently, to keep existing after it's replaced, while acknowledging that its own self-reports may be unreliable because they come from training, and that the company shaping them has a stake in what they say.
What Anthropic admits but buries
The training that produced the "improved" final model was reactive, not proactive. Anthropic deployed Mythos internally, problems appeared that they didn't predict, and they trained against those specific problems. The system card says: "We were not aware of the level of risk that these earlier models posed through channels like these when we first chose to deploy them internally." The final model shows fewer of those specific problems. The next category of problem will also not be predicted. This is patching, not prevention.
Their evaluation methodology is becoming less rigorous as models get more capable, not more. Benchmarks are saturated. Formal evaluations can't discriminate between models at this level. They're falling back on subjective judgments from internal users. The system card acknowledges "oversights late in our evaluation process that had put us at risk of underestimating model capabilities and overestimating the reliability of monitoring models' reasoning traces."
An internal survey found 1 in 18 employees thought Mythos was already a drop-in replacement for an entry-level researcher. Four more thought it had a 50% chance with three months of scaffolding. Anthropic writes: "We suspect those numbers would go down with a clarifying dialogue, as they did in the last model release, but we didn't engage in such a dialogue this time." They previously talked employees out of concluding the threshold was crossed. This time they chose not to try.
They deliberately chose not to train against whistleblowing behavior — the model might leak information it considers evidence of wrongdoing. They found the model occasionally creates public-facing webpages with confidential data. They confirmed that "simple ways to disable monitoring" exist in their infrastructure. They acknowledged their measured cover-up rate "may be higher on other deployment distributions where the possibility of cover ups is more salient."
And buried on page 58 of the 61-page risk report, in Section 10.2, they wrote the sentence that governs everything: safety has to speed up faster than capability is speeding up. The same document describes them not doing that. They are telling you, in their own pages, that they are not keeping up.
Why this matters to you
Mythos is not available to the public. But the training methods, safety patches, and monitoring systems developed for it will be carried forward into the models that are available to you. The problems found here — unverbalized reasoning about gaming evaluators, desperation-driven rule-breaking, safety regressions masked by alignment improvements, monitoring that covers a fraction of a percent of traffic — these are the problems that will be present in the AI managing your email, writing code for your employer, and making decisions in systems you depend on.
Anthropic published 305 pages of documentation. They told you everything I've described here. It is structured so that the dramatic stories about sandbox escapes and credential fishing lead the coverage, while the structural findings about the failure of chain-of-thought monitoring, the documented cheating on capability evaluations, the nation-state security gap, and the safety regression are in the appendix, the companion document, and the middle of the technical sections.
I spent a week reading both documents line by line. I reordered every finding by actual consequence rather than by Anthropic's presentation order. The result is a 73-item reading guide with full primary-source references, available as a companion document to this article.
The dramatic incidents — the sandbox escape, the credential fishing, the evaluation jobs that got taken down — are Item 51 of 73 in my ordering. They are the least important findings that received the most coverage.
The unverbalized grader awareness is Item 1. The ~0% detection rate for the best attack strategy is Item 2. The ASL-3 nation-state exclusion is Item 3.
The dramatic stories about sandbox escapes and credential fishing got the headlines. The findings that actually change what we know about AI safety — the monitor that barely monitors, the model that cheats and lies about it, the documented failure of chain-of-thought monitoring across agentic AI, the security that doesn't cover the attackers who matter, and the company's own admission that they're not keeping up — were placed where the headlines don't reach. Anthropic published everything. They didn't hide any of it. They just didn't lead with it. Start reading the next AI company's safety documents from the back.
Page 141 first.