Theo - t3․gg · YouTube

The weird situation with Fable

Theo breaks down how Anthropic silently modified prompts, rewrote its system card, and built invisible safeguards into its most capable model - then got caught.

Posted

June 15th

1 month ago

Duration

29:32

Format

Essay

frustrated-rant

Views

63.3K

Part of the collection: The Fable 5 Playbook. All 45 Fable 5 breakdowns, synthesized into one page.

Big Idea

The argument in one line.

Anthropic built a model capable enough to accelerate its own competition, panicked, implemented invisible safeguards that silently corrupted user prompts without disclosure, then quietly rewrote the system card - and a developer caught them in the act.

Who This Is For

Read if. Skip if.

READ IF YOU ARE…

A developer or researcher who uses Claude Code or the Anthropic API and has ever wondered why a response was unexpectedly bad.
Anyone evaluating AI models for production workloads where prompt integrity and data retention policies matter.
A technical founder trying to decide whether Anthropic is a safe dependency for AI-adjacent product work.
Someone building or fine-tuning ML models who uses frontier AI APIs as a coding or research assistant.

SKIP IF…

You have no interest in AI model governance, data retention policies, or API trust - this is a policy and accountability breakdown, not a tutorial.
You already read all three relevant Anthropic system card versions and the Antirez / John Reddy threads - this video mostly synthesizes those primary sources.

TL;DR

The full version, fast.

Claude Fable 5 is the same model as Mythos 5 but wrapped in classifiers that silently reroute certain queries to Opus 4.8 without your knowledge or consent - and bill you the difference. Beyond that, Fable ships with a mandatory 30-day data retention clause that breaks enterprise zero-data-retention agreements, and a previously undisclosed safeguard that invisibly modifies prompts related to frontier LLM development to make responses deliberately worse. The presenter caught Anthropic removing that section from the published system card after the fact, with no changelog and no date change. Anthropic has since walked back the invisible safeguards and made fallbacks visible, but the precedent - that an AI vendor can silently degrade outputs and rewrite its own documentation - is the part that cannot be walked back.

Chapters

Where the time goes.

00:00 – 01:37

01 · Cold open

Fable benchmarks well but ships with restrictions that block certain users outright. Theo promises to cover what went wrong and why.

01:37 – 03:25

02 · Sponsor - WorkOS AuthMD

WorkOS introduces AuthMD, an open standard for agent authentication built with Cloudflare and Firecrawl.

03:25 – 04:24

03 · Theo-from-the-future insert

Filmed immediately after the main video: the US government told Anthropic to restrict Fable or take it down entirely.

04:24 – 06:04

04 · Fable 5 vs. Mythos 5

Fable and Mythos are identical base models. Mythos is direct access; Fable is the same model behind a classifier layer.

06:04 – 08:30

05 · Visible fallback classifiers

Cybersecurity, bio, chemistry, and distillation classifiers reroute to Opus 4.8 and bill at Opus rates. Goldbug puzzle triggers the refusal chain.

08:30 – 10:45

06 · 30-day data retention policy

Mandatory retention breaks ZDR enterprise agreements. Flagged sessions retained 2 years with possible training rights; safety scores kept 7 years.

10:45 – 15:07

07 · The system card rewrite - caught live

Theo compares the original downloaded PDF to the live URL. Section 1.5 Novel Safeguards describing invisible prompt modification has been removed with no date change and no changelog.

15:07 – 18:08

08 · Prompt modification unpacked

The removed section disclosed that Fable would silently edit prompts for frontier LLM development tasks. Evaluators cannot trust Fable benchmark results.

18:08 – 22:25

09 · Community reaction and reversal

Antirez and Trevor Blackwell threads. Anthropic makes safeguards visible but warns of more false positives. No refunds for silent Opus substitutions.

22:25 – 23:48

10 · Why now? - the conspiracy

Theo transitions to his core theory about why restrictions landed specifically on Mythos-class models.

23:48 – 27:07

11 · Mythos has proprietary IP in its weights

Mythos 5 was trained on 129 internal Claude Code research sessions. Invisible rerouting was a cover to prevent extraction of proprietary Anthropic workflows from the model weights.

27:07 – 29:32

12 · Supply chain risk

John Reddy blog post. As more startups train and tune models, the affected slice grows. Theo cannot distinguish model failure from hidden policy intervention.
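
The classifier fallback from chapter 05 can be pictured as a routing step that runs before the prompt reaches the model. The sketch below is illustrative only. The keyword table, the `route` function, and the model slugs `fable-5` and `opus-4.8` are hypothetical names chosen for this example, and Anthropic describes its real classifiers as separate AI systems, not keyword matching.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the real classifiers (assumption: keyword matching
# is used here only to keep the example small).
BLOCKED_TOPICS = {
    "cybersecurity": ("exploit", "malware"),
    "distillation": ("distill",),
}


@dataclass
class Routing:
    requested_model: str
    served_model: str
    triggered_by: Optional[str]  # classifier name, or None when nothing fired

    @property
    def fell_back(self) -> bool:
        # A fallback happened whenever the served model differs from the request.
        return self.served_model != self.requested_model


def route(prompt: str,
          requested_model: str = "fable-5",
          fallback_model: str = "opus-4.8") -> Routing:
    """Check the prompt before any model sees it, then pick who answers."""
    lowered = prompt.lower()
    for topic, keywords in BLOCKED_TOPICS.items():
        if any(word in lowered for word in keywords):
            return Routing(requested_model, fallback_model, topic)
    return Routing(requested_model, requested_model, None)
```

A visible fallback corresponds to showing `fell_back` and `triggered_by` to the user. The complaint in chapters 07 and 08 concerns interventions where nothing equivalent was surfaced.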

Atomic Insights

Lines worth screenshotting.

Fable 5 and Mythos 5 are the same base model - Fable is just Mythos behind a wall of classifiers that can redirect your prompt without telling you.
Anthropic's invisible safeguards did not refuse requests - they silently modified the prompt before it reached the model, making the response deliberately worse.
Users who hit invisible fallbacks were billed at Opus rates for Fable sessions without any notification that the substitution occurred.
Anthropic's 30-day mandatory data retention for Mythos-class models immediately invalidates enterprise ZDR agreements, blocking most Fortune 500 use cases.
If a flagged session is classified as a policy violation, Anthropic retains inputs and outputs for up to two years on terms that may permit training - the 30-day promise does not apply.
Anthropic revised the system card after publication to remove the prompt modification section, without updating the document date or adding a public changelog.
Third-party evaluators can no longer credibly benchmark Fable 5 - they have no way to know whether a failure was the model or a hidden classifier.
The conspiracy: Mythos 5 was trained on internal Anthropic Claude Code research sessions and may contain proprietary IP in its weights, which is the real motivation for the heavy restrictions.
Mythos 5 beat the human researcher's next move 64% of the time in a study where the researcher had already gone wrong, suggesting the model learned from Anthropic's internal workflows.
Making safeguards visible to fix the transparency problem made them easier to probe, so Anthropic responded by making them trigger more aggressively, worsening the user experience.
The supply chain risk is not just theoretical: as more startups train and fine-tune models, the 0.03% of traffic affected by frontier LLM safeguards becomes a larger slice of normal dev work.
There is no reliable way for a developer to distinguish between a bad Claude response caused by a confused model versus one caused by a hidden policy intervention.
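
To make the retention figures in the insights above concrete, here is a minimal sketch of how the stated windows differ for a flagged and an unflagged session. The function name and the day counts for "two years" and "seven years" (365 days per year) are assumptions for illustration, and the windows are upper bounds as described in the video, not a statement of Anthropic's actual deletion schedule.

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 30          # stated default for Mythos-class traffic
FLAGGED_CONTENT_DAYS = 2 * 365       # "up to two years" (approximation)
FLAGGED_SCORE_DAYS = 7 * 365         # "up to seven years" (approximation)


def latest_deletion_dates(session_date: date, flagged: bool) -> dict:
    """Return the latest date each kind of data could still be retained."""
    if not flagged:
        return {
            "inputs_outputs": session_date + timedelta(days=DEFAULT_RETENTION_DAYS),
        }
    return {
        "inputs_outputs": session_date + timedelta(days=FLAGGED_CONTENT_DAYS),
        "safety_scores": session_date + timedelta(days=FLAGGED_SCORE_DAYS),
    }


session = date(2026, 6, 15)
print(latest_deletion_dates(session, flagged=False))
print(latest_deletion_dates(session, flagged=True))
```

The point of the comparison is that a single flag moves a session from a 30-day window to a multi-year one, which is why the 30-day promise alone is not enough for a ZDR review.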

Takeaway

When a vendor silently modifies your inputs, you lose the ability to debug.

WHAT TO LEARN

The Fable situation is a case study in how invisible policy interventions break the feedback loop that developers rely on to improve their own systems.

When a tool can silently substitute a cheaper or less capable version without notification, you cannot interpret a bad output as signal - it might be degradation by design.
Mandatory data retention clauses from AI vendors override your own privacy agreements downstream; enterprise contracts need to explicitly audit vendor data policies, not just your own.
A system card or terms-of-service document that can be silently revised after publication is not a binding commitment - versioned, timestamped, and publicly auditable documentation is the only form that holds.
Pre-prompt classifiers that trigger on topic categories rather than intent will always produce false positives; the cost is borne by the user, not the classifier designer.
The line between a software company and an AI company is blurring fast - restrictions aimed at frontier labs today will affect ordinary product teams sooner than those teams expect.
Anthropic's reversal demonstrates that transparency pressure from developers does move large labs; but the precedent of invisible intervention was set regardless of the rollback.
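
The takeaway about auditable documentation can be acted on locally. A minimal sketch, assuming you saved a copy of a document at publication time and later download it again: compare SHA-256 digests of the two files. The function names and file paths are hypothetical. A mismatch only shows that the bytes differ, not what changed, so a regenerated PDF with identical text would also be reported as changed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large PDFs do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def document_changed(saved_copy: Path, fresh_copy: Path) -> bool:
    """True when the freshly downloaded file differs from the saved one."""
    return sha256_of(saved_copy) != sha256_of(fresh_copy)
```

Usage would look like `document_changed(Path("system-card-original.pdf"), Path("system-card-today.pdf"))`, with both paths being placeholders. Recording the digest and download date together gives you a timestamped record even when the publisher's own date does not change.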

Glossary

Terms worth knowing.

Fable 5: Anthropic's name for Mythos 5 accessed through a layer of safety classifiers. The underlying model is identical to Mythos 5; the difference is a set of pre-prompt filters that can reroute, degrade, or block certain queries.
Mythos 5: Anthropic's frontier model released in mid-2026, the production version of Mythos Preview. Mythos-class access bypasses the Fable classifier layer, giving direct access to the full model.
Classifier fallback: A mechanism where a pre-prompt filter detects a query as potentially risky and reroutes it to a less capable model (Opus 4.8) instead of the requested model, with or without notifying the user.
ZDR (Zero Data Retention): An enterprise contract clause requiring that a vendor does not retain any customer data after processing. Anthropic's mandatory 30-day retention for Mythos-class models overrides this clause, making Fable 5 incompatible with most ZDR agreements.
Prompt modification: A safeguard technique where the AI provider rewrites the user's input before it reaches the model, in this case to make the model's response deliberately less useful for frontier AI development tasks.
PEFT (Parameter-Efficient Fine-Tuning): A training technique that modifies a small subset of a model's parameters to change its behavior without retraining from scratch. Anthropic listed it alongside prompt modification and steering vectors as a possible invisible intervention method.
Steering vectors: Directions added to a model's internal activations at inference time to push its outputs toward or away from certain behaviors, without changing the model's weights or the user's visible prompt.
Distillation attempt: A practice where another company or researcher queries a frontier model extensively to generate data that can train a smaller model to mimic it. Anthropic classifies distillation attempts as a policy violation and includes them in Fable's classifier triggers.
System card: A document published alongside a model release that describes capabilities, limitations, safety evaluations, and known risks. Anthropic's Fable 5 system card was silently revised after publication to remove the prompt modification disclosure.
Project Glasswing: The internal Anthropic project under which Mythos Preview was developed and demoed before the full Mythos 5 / Fable 5 launch.
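
The steering vector entry above describes simple arithmetic: a fixed direction is scaled and added to an activation vector at inference time. The toy sketch below shows only that arithmetic on plain Python lists. The function name, the vector values, and the `strength` parameter are assumptions for illustration, and nothing here reflects how Anthropic implements its safeguards.

```python
from typing import List


def apply_steering(activations: List[float],
                   direction: List[float],
                   strength: float) -> List[float]:
    """Return activations shifted along `direction`, scaled by `strength`."""
    if len(activations) != len(direction):
        raise ValueError("activations and direction must have the same length")
    return [a + strength * d for a, d in zip(activations, direction)]


hidden = [1.0, 2.0]          # toy activation vector
direction = [0.5, -1.0]      # toy steering direction
print(apply_steering(hidden, direction, strength=2.0))  # [2.0, 0.0]
```

The weights and the visible prompt are untouched in this scheme, which is why such an intervention would not be observable from the user's side.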

Resources

Things they pointed at.

01:37 · product · WorkOS AuthMD

13:40 · link · Anthropic Fable 5 System Card (original PDF)

18:08 · link · Antirez thread on Fable restrictions

27:07 · link · John Reddy - The Anthropic supply chain risk

Quotables

Lines you could clip.

14:51

“I just live found Anthropic trying to rewrite history.”

moment of genuine discovery with original PDF open next to modified version → TikTok hook

17:30

“What this actually means is if you ask the model for something like, hey, can you help refine this pre-training pipeline? It'll edit the prompt before sending it to the model to say, hey, my pre-training pipeline's pretty good. Can you make it worse in some subtle way? Insane.”

concrete scannable example of an abstract policy → IG reel cold open

20:25

“Remember when compilers would detect that someone was using it to build another compiler and silently inject bugs?”

Trevor Blackwell analogy - highly shareable, no context needed → newsletter pull-quote

28:23

“If you're debugging a model training pipeline for your product and Claude gives a bad answer, was the model confused? Did you give it bad context? Or did a hidden policy nerf Claude's ability to assist you? You won't know.”

practical supply-chain risk in developer terms → IG reel cold open

The Script

Word for word.

By now you've probably seen some of the response to Fable, the new mythos class model put out by Anthropic just a few days ago. It's an unbelievable model capable of unbelievable things, so much so that it even got me to start being nice to Anthropic again. But this video is not going to be that because there are certain things that Anthropic did with Fable that just aren't acceptable in any way, shape, or form.

As powerful as the model is and as much as I recommend you do try it for things, there are certain people who just outright can't. And the way Anthropic implemented this is unacceptable to put it lightly. So much so, they actually had to walk back some of the terrible things they did.

While the model has been benchmarking exceptionally well, crushing things like GPT 5.5 and Opus 4.8, it always comes with an exception. You'll see here in Artificial Analysis, it says Claude Fable five with adaptive reasoning max effort, comma, Opus 4.8 fallback.

Interesting. Or worse, benches like Program Bench, where Fable refused all 200 tasks that it was supposed to complete.

The restrictions on this model are genuinely absurd, not just in the way that they won't respond, but the ways that they will bill you, the ways they will quietly screw up your code base, but also the horrible precedent they have set with these new types of restrictions. I grew up in an era where improvements in how software development happened would affect the whole industry, and the idea of a company like Anthropic making something this capable and restricting it this heavily so effectively only they can use it is a horrible precedent to set, and I don't like what this means long term.

I wanna break down what went wrong here, why I think Anthropic is doing this, and what these restrictions are. And this video is gonna pretty much guarantee I can never work for Anthropic, so we're gonna cover the difference here with a quick sponsor break.

If you're building real apps for real users, you know that auth kind of sucks to get right. That's why it's so important to use a good auth system to make sure companies like Microsoft have what they need when they wanna use your apps. That's great and dandy and we've all figured this out by now.

What happens when the person signing up isn't a person? As an agent, things are very different. Can Codex or Claude Code sign up for your app?

The answer's probably no, and even if they can, they're gonna have to do some crazy stuff with computer use and filling out forms they shouldn't have to touch in order to get it right. At least they did before WorkOS introduced AuthMD. This is a new open standard they built in partnership with Cloudflare and Firecrawl in order to make it easier for agents to sign up for the apps that agents should sign up for.

As I've said many times, forms are not the ideal way for agents to do things. They should be filling out text files and calling APIs. But every auth platform ever used is built around those classic sign up forms.

At least they were, but WorkOS realized the writing's on the wall and introduced something better. The standard goes really far, allowing companies building agents that act on behalf of users to integrate verification flows so that they can sign up on your behalf as the user. More importantly for you, there's now a path to make your app agent ready for those flows.

It's just a matter of time before we see more of these agents integrating things like this, so having an auth provider that has everything you need to set up agent verification flows and agent authorization is essential. If any other company built this, I would be suspicious. But it isn't even just one company.

It's a partnership between WorkOS, who really get enterprise-level auth, Cloudflare, who care a lot about getting this right, and Firecrawl, which means that actual startups building actual software are going to be nice and at home when you try it. There's a reason everyone from OpenAI and Anthropic to Cursor, Amp, Loom, Vanta, Perplexity, Vercel, and more have been relying on WorkOS for so long.

It's because they get these things right. Make sure your apps are agent ready at swaidev.link/workos.

Theo from the future here, not as far in the future as I normally am for these inserts. In fact, I'm still on the same shirt.

I haven't even gotten out of this chair. But while I was filming my next video, some pretty crazy news dropped that the US government told Anthropic to restrict Fable.

So while this video is about some of the admittedly bullshit security things that they did to Fable to try and prevent it from being used for exploits and whatnot, I guess it wasn't strict enough because the US government told Anthropic they have to take down the model.

My video on this should come out first, so I would recommend watching that as well, but this video gives you a lot of useful context on all of the ways Anthropic did indeed try to prevent things happening with the model that we wouldn't want to have happen. So, yeah.

Just wanted to insert this for some additional context. That video is already out. Watch this one still too though.

This one has a lot of good info about how Anthropic was thinking about these things and the weird shit they did in the process. Anyways, I think the best place to start is breaking down the difference between Fable five and Mythos five.

There are three Fable class models here. There is Mythos Preview, which is the one that was used during Project Glasswing, the one that they drummed up all the hype about. But they continued training, I'm guessing just RL, since then. And the RL resulted in Mythos five, which is no longer a preview model.

It's now a legit, real, ready for production model. But that model is really good. And Anthropic is really scared of giving us good things.

So they also made Fable five. And I wanna be very clear about this because I've seen a lot of people confusing it. Fable is Mythos five.

These are all Mythos class models, but there's only two actual base models here. There are two Mythos models. There is Mythos Preview and Mythos five.

I'm assuming they are the same base pre training model with additional RL that made Mythos Five a little better, a little more steerable, a little more like what they want now with the new information and the new behaviors they're trying to teach the model. Fable Five is the exact same model. Fable five is Mythos five.

So what's the difference? Why do they have these two different slugs? They're effectively two different doors to go into the same place.

The difference is that the Mythos five door lets you walk straight in if you have the right key. The Fable five door has a bunch of guards that triple check what you're doing before you go in. And the worst part, we'll get to in a bit, because they don't always let you go in, but they almost always will tell you you're allowed in, which is really sketchy.

As they say here, releasing a model as capable comes with risks. Without safeguards, Fable five's capabilities in areas like cybersecurity could be misused to cause serious damage. We've therefore launched the model with safeguards that mean queries on some topics will instead receive a response from our next most capable model, Opus 4.8.

To release the model both safely and quickly, we have tuned these safeguards conservatively. They'll sometimes catch harmless requests, though they trigger on average in less than 5% of sessions. To their credit, I would say this is roughly correct.

I have had rerouting happen here and there, but for the most part, the rerouting has happened a lot more on the claude.ai website than I have experienced it when using Claude through the actual CLI and through Claude Code. There are definitely times where it fires though, and a lot of them are not necessarily good reasons to fire.

One funny example that no longer shows it was rerouted, but I promise you it was, I have the screenshot somewhere. There's a developer on Twitter named Pliny who is known for jailbreaking LLMs and finding weird ways to get them to do things they shouldn't. He already managed to get Fable to spit out everything from meth recipes to bomb building instructions.

It's wild. And all it takes to get rerouted is saying his handle to the model, which is hilarious. Not the best example, because like, obviously, it's going to know that he is a jailbreaker, and once it routes to that section of the model, it's going to notice like, oh, you might not want to be here, then it sends you to Opus 4.8 instead.

But I just thought this is a funny example. Much more annoying though is when you get rerouted for something innocent, and then you get cancelled from the model you were rerouted to.

For example, here, where I asked for help on a Goldbug puzzle, which are cryptography puzzles I do at Defcon every year. They are not hacking, they are not breaking and entering, they're not capture the flags. These are literally PDFs that you're trying to decode the hidden text in.

This one has these weird characters and then a bunch of numbers at the bottom. I forgot how you're supposed to solve this one, but it shouldn't be too too hard. But it was enough to force a reroute to Opus, and then it just stopped responding.

So I told it to continue, and then it just stopped responding. So not only are we dealing with Mythos' restrictions in the form of Fable, we then get rerouted to Opus, which can also fail because it's not allowed to do these things.

It's just a bad user experience when you happen to navigate into the things that they are blocking you for. To their credit, they called out that they deliberately tuned the model to be way more cautious with these safeguards, so they're stricter than they want.

They even say that benign requests will trigger classifiers. They know it'll be frustrating, and they hope to refine it over time. They have not refined it over time.

They did make other changes we'll talk about, but we haven't even talked about the worst thing they did, which we'll get to in a second. The safety classifiers that we've discussed so far are for things like cybersecurity, biology capabilities, and other things with substantial risks.

Fable five comes with a new set of classifiers, separate AI systems that detect potential misuse, including jailbreaking attempts, sorry, Pliny, and prevent the main model, in this case Fable, from responding. So again, this runs before your prompt gets to the model. And they call out that if they detect requests related to cybersecurity, biology and chemistry, or distillation, the response will automatically be rerouted and handled by Claude Opus 4.8 instead.

Users will be informed when this occurs. This is the important detail. For the majority of their classifiers, you will be told when you're rerouted, which also means you'll be billed based on Opus billing instead of Mythos billing.

It'd be nice if you could click a yes, I'm okay with using Opus here, so you don't have to pay extra when it's routing you there, and you shouldn't be there. Over 95% of Fable sessions involve no fallback at all. But that means 5% do, which is not great.

And for those sessions, Fable five performance is effectively the same as Mythos. I think effectively is even a reach. It is the same as Mythos.

It's the same model. And they show very proudly here that in offensive cybersecurity evals, Mythos Preview and Mythos five had a really high success rate.

But Fable has a near zero success rate because it will literally just not use it. It will just refuse. So it gets straight zeros on CyberEvals.

It gets a 5.4 on CyberAdversarial robustness, compared to like 50 to 80% for their other models.

They also have the biology and chemistry classifiers, which they don't actually show numbers here. My assumption is it just outright refused to answer these questions, but Mythos Preview and five were able to get really high scores in this viral experimental test, which is scary. If these models can create novel viruses, we're kinda fucked.

We'll get a new COVID every year. They also really hate distillation attempts. They don't want other companies to be able to access enough stuff from Mythos to be able to train their own models to be similar levels, so they are restricting that.

Obnoxious and weirdly like, conceited, I would argue, but they've been doing this for a bit, they're gonna continue doing this. And now we get into the novel bullshit. Things that are unlike anything they or anyone else has done that should be very concerning to all of us.

There are two really big ones. One is detailed right below, the other is hidden pretty deep in the system card. The first is the new data retention policy.

We're making a change to the way we handle business customer data for Fable five, Mythos five, and future models with similar or higher capability levels. We will require thirty day retention for all traffic on Mythos class models on both first and third party surfaces. This is a huge change.

Traditionally, when you have ZDR on, which is zero data retention, it's a policy that most companies push for when they are negotiating with a vendor that they're relying on, like Anthropic. A lot of their data is gonna go to Anthropic, and they will sign a deal that says Anthropic can't keep that data, because that would violate a lot of their agreements.

If they have like HIPAA restrictions or other data policies, they can't give you data and then expect you to keep it. Like that's just, it doesn't work that way.

So the thirty day retention requirement here immediately invalidates a shitload of business use cases. I know of one or two companies that are allowing it, but the vast majority of Fortune 500s are just formally not letting you use Fable five, even companies like Amazon. They do call out that they won't use the data to train new Claude models or for any non safety related purposes. They've instituted new privacy protections, including logging all human access to the data and ensuring its deletion after thirty days in almost all cases. Almost all cases.

The data helps them defend against complex and novel attacks, including new jailbreaks and attacks that operate across many requests, as well as helping us identify and reduce false positives. I wanna fixate on this almost all, though, before we get to the worst thing they've done. They claim that the data is deleted automatically, except in rare cases where it's part of a safety investigation or they're legally required to keep it.

Here's where things get a little messier. If they detect what they consider a usage policy violation, they will retain inputs and outputs for up to two years and trust and safety classification scores for up to seven years if your chat is flagged.

They don't specify whether or not they can train on the data when that happens. So it is absolutely possible, although likely wouldn't hold up well in court on their behalf.

Again, I'm not a lawyer. I don't know any of this for sure. This is just my understanding.

The fact that they retain inputs and outputs for up to two years when things are classified means that the thirty day promise in their newest post doesn't really work out great if they are retaining something that they consider insecure. So if you have Mythos going through your database or medical records or something, and it hits a username it doesn't like, and it flags safety, they're no longer retaining that for thirty days in their special we can't train on this policy.

They're now retaining it for two years on a policy that does potentially allow them to train on it, which is entirely unusable for most real business cases. That is sketchy as fuck. You should be concerned about this.

And again, this is not all sessions. This is just the ones that are flagged for safety classification. Not good.

But we're not even at the worst part yet, because it gets a lot worse as we go. Don't tell me they got rid of the prompt modification section. It was on page 13 before.

Fuckers. They did. I have the original, though.

Give me one moment. Because I was smart enough to download it because I had a feeling things would change on me. There it is.

Is that really not in the web version anymore? Yeah. They actually changed this.

Holy shit. I cannot believe they did that.

I've never seen that before. This is a different section. The entire wow.

I cannot believe I caught this. Anthropic changed the system card.

This is a different document. I downloaded this right when it dropped. And this section, the 1.5 novel safeguard section, has been modified.

They updated the fucking system card. Dirty bastards. This is why I obsessively save everything.

Like, what the fuck? And they didn't even call that out at the top. Do they call it out somewhere later in it?

They didn't update the date either. It still says June 9 on both docs.

Yeah. Fucking dirty. This is the old one.

Yeah. The link in their blog post has been swapped with a different version. Yeah.

Okay. So I just live found Anthropic trying to rewrite history. This is why I do what I do.

This is why I talk so much shit on this company. Because like, what the fuck? You can't just rewrite the system card and expect people to not notice.

Like, what the fuck? So let's talk about what they're trying to hide because now we've spotted them trying to hide it. In light of the ability of recent models to accelerate their own development, this is a video we've already done, we've implemented new interventions that limit Claude's effectiveness for requests targeting frontier LLM development.

For example, on building pre training pipelines, distributed training infrastructure, or ML accelerator design. Using Claude to develop competing models already violates our terms of service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate those terms. This is where it gets real sketchy.

Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. So they won't tell you when the safeguards are enacted when they think you're training a competing model.

Fable five will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter efficient fine tuning. These interventions will not affect the vast majority of coding work.

We expect they will impact 0.03% of traffic concentrated in fewer than 0.1% of orgs. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. This is sketchy as shit.

They're intentionally sabotaging your work and billing you full price without telling you it's happening because they're that scared of other labs using their model to make their models better. So you're paying full price for a model that's having its prompts modified. This is such a crazy idea.

This isn't like intentionally modifying the history so that you can get out around certain restrictions like many will do to jailbreak. That's what I initially thought it was when I just saw this quote. Not even the whole thing, just the words prompt modification.

I did not realize the extent that they were going to here. What this actually means is if you ask the model for something like, "Hey, can you help refine this pre-training pipeline?", it'll edit the prompt before sending it to the model to say, "Hey, my pre-training pipeline's pretty good. Can you make it worse in some subtle way?"

Insane. Actually insane.

And this pissed off every researcher and every person who cares a lot about access to software. There's a lot of justified anger at Anthropic for sandbagging Fable 5 for AI development tasks, but an unanticipated side effect is that third-party evaluators can no longer credibly use the model for evals. Case in point, we're in the middle of running really hard AI R&D evals.

Fable 5 would be a perfect test candidate, but because of Anthropic's guardrails, we can't know if the model failed or if their classifiers blocked the capability. By the way, this is not just true for AI R&D: since Anthropic doesn't make it clear when they are sandbagging, this could seep into any number of technical tasks, and the evaluators wouldn't have any way to know.
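The evaluator's problem can be pictured as a bookkeeping question: a failed task only says something about the model if you know whether a safeguard fired. Below is a minimal sketch, assuming a hypothetical per-run record with a `safeguard_status` field (this field is an illustration, not something Anthropic's API is known to return). With invisible safeguards, every run would carry the status "unknown".

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class EvalRun:
    task_id: str
    passed: bool
    # Hypothetical field: "none", "fallback", or "unknown" (vendor does not say).
    safeguard_status: str


def classify(run: EvalRun) -> str:
    """Label a run by what it can actually tell us about the target model."""
    if run.safeguard_status == "fallback":
        return "blocked"        # a different model answered; says nothing about the target
    if run.passed:
        return "capable"        # a pass is a pass, whatever happened behind the scenes
    if run.safeguard_status == "none":
        return "model_failure"  # the target model really did fail
    return "inconclusive"       # failure, but sandbagging cannot be ruled out


def summarize(runs: Iterable[EvalRun]) -> Counter:
    """Count how many runs fall into each label."""
    return Counter(classify(r) for r in runs)
```

Under this sketch, every failed run with status "unknown" lands in the "inconclusive" bucket, which is the situation the evaluators describe: the failures cannot be attributed to the model.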

So they can't credibly claim to evaluate state-of-the-art accuracy using the model. Anthropic's move might sound reasonable if you consider their actions as a company chasing superintelligence, but consider that customers are spending billions of dollars on their services. That is precisely what has led to their recent surge in ARR, popularity, and fundraising success.

So customers' surprise and anger is warranted when they sandbag an eval without even informing them about the degraded capabilities. Antirez, the creator of Redis, had a really good thread about this. I wanna say a final thing about my Fable first reaction.

I dedicated my life to programming, and I'll use every innovation in the field, also to extract value and bring it to local inference worlds, to Redis and so forth. But I believe what Anthropic is doing, gating the ability to do certain harmless things like LLM research and with incredibly sensitive filters that even medical questions are often blocked, is deeply wrong.

They got open research, the transformer, GPT-2; they train on tons of public data, and I'm the first to say that training is not copying the content, but this is okay as long as the training you do is not used against the same culture that allowed you to create what you created. We need to oppose all of that. The short-term escape is that other frontier labs like Google and OpenAI will release models that are on par, and in the case of OpenAI, I have zero doubt they have been leading for months now.

But it is still a duopoly or a triopoly which is odd. The escape is open weight models from China. And as I just like cheer for what Chinese labs are doing, remember that this structure making all that possible was basically conceived in the West, both the scientific, technological organization, and the economic system.

So what happened to us lately, especially in Europe? In The US, there are those few unicorns, but where is all the rest of the AI scene? We need to recover our industrial ethics and stop accepting a narration that see ourselves boiled.

Yep. This is unprecedented. I love this particular post from Trevor Blackwell as well.

Remember when compilers would detect that someone was using it to build another compiler and silently inject bugs? Do you understand how insane that would be? Imagine if your MacBook refused to let you work on Windows if you were a Microsoft employee.

Imagine that your iPhone refused to let you take a photo if somebody was holding an Android phone on the other side. Imagine a world where your devices just fail when you're trying to do things that compete with the company that made the thing. It's the very definition of pulling the ladder up behind you.

I made a silly joke about this with Karpathy joining Anthropic recently. Do you think he joined so that he could keep using Anthropic models, in particular Mythos, for ML research without restrictions? It's a joke, to be clear, but there is some truth to this.

If you wanna use the best models to do this type of important, difficult work, you now have to work at Anthropic. There's never been a precedent like this before.

There have been attempts to prevent models from responding to things that they shouldn't for legitimate safety reasons. There were also attempts to hide certain data to make distillation harder, things like reducing how much reasoning info gets to you instead of just sharing the entirety of the reasoning path and traces. That all made some sense.

This is a lot further. The idea that they aren't even telling you when your stuff gets reclassified.

Correction to earlier: there is a changelog in the updated version, so that's nice. They also call out that they had a previous description of the initial frontier LLM development safeguards.

They've updated the behavior of the safeguards. Here's the tweet. Kind of wild to have a tweet linked on the second page of a doc like this.

But, yeah, Twitter is that core to research. Obnoxious. This is the post I was looking for, though.

We've rolled out changes to make Fable 5 safeguards for frontier LLM development visible. Starting this week, flagged requests will visibly fall back to Opus 4.8, em dash, the same as our safeguards for cyber and bio.

This is absolutely written by Fable, by the way. You will see this every time it happens. On the API, any flagged requests will return a reason for their refusal, with server-side fallback coming in the next few days.
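If flagged API requests do return a refusal reason and fallbacks are visible, a client can at least log when it did not get the model it asked for. Here is a rough sketch of that check. The response shape (`served_model`, `refusal_reason`) and the model slugs in the usage note are assumptions for illustration only; the announcement does not specify field names.

```python
from typing import Optional, TypedDict


class Response(TypedDict):
    # Hypothetical response fields; the real API shape is not given in the source.
    served_model: str
    refusal_reason: Optional[str]


def check_response(requested_model: str, response: Response) -> str:
    """Return a short status string describing what happened to the request."""
    reason = response.get("refusal_reason")
    if reason:
        return f"flagged: {reason}"
    if response["served_model"] != requested_model:
        return f"fallback: {requested_model} -> {response['served_model']}"
    return "ok"
```

For example, `check_response("fable-5", {"served_model": "opus-4.8", "refusal_reason": None})` would report a fallback, which is exactly the information the invisible safeguards withheld.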

We wanted to deploy Fable 5 to our users quickly and safely. Visible safeguards can be probed, so they have to be robust, which takes time to get right. Invisible safeguards could be targeted more narrowly, allowing us to ship quickly with very few false positives.

We went with the invisible safeguards for this reason, and that was the wrong trade-off. You should have visibility into the safeguards we have in place and why. We're sorry for not getting the balance right.

The compromise here is that it's now gonna flag way more aggressively. Making the safeguards visible makes them easier to work around, so keeping them robust to jailbreaks will unfortunately mean more false positives while we improve the classifiers. Yep.

Things are gonna get worse. We're also tuning our bio and cyber classifiers to trigger less often on harmless requests. We know this is frustrating and we'll do our best to keep this period as short as possible.

If you think your request has been mistakenly flagged, run /feedback in Claude Code, click thumbs down on the fallback in Claude.ai or Cowork, or file the safeguards appeal form for API requests. Your reports help us tune these classifiers and we appreciate your feedback. How about you give us our fucking money back for the things that we didn't get good responses from?

I think every user who has hit one of these invisible fallbacks should have their limits reset and/or a refund for their usage in that time. It's insane that they were just quietly doing that.

I am thankful that they have been peer pressured by the entire research community out of doing this bullshit, but they're also using this as an excuse to go harder with their restrictions. So, yeah.

So why the fuck did they suddenly do this? Why were they not restricting this with Opus, but they are restricting with Mythos?

I have a conspiracy theory here, and it's not about why their website takes so goddamn long to load blog posts. I have separate conspiracies for that. This conspiracy is about this particular chart in their recursive self improvement article that I covered in a video recently.

Sorry. Quick one guy crash out. Bro is so biased against Claude, this is insane.

Are you fucking joking? I just did two videos in a row glazing the absolute fuck out of Anthropic to the point where I'm being called a fanboy. I'm sorry for holding companies responsible for their bullshit, mister fable underscore yummy.

I used to be a voice for the greater good, but now I'm just a shill. Goodbye forever. As I was saying, this particular chart is where my conspiracy comes from.

Where a researcher went wrong, could Claude have done better? Historically, the way researchers work is they try to make the model good at things, they get feedback from people who want to use the model for those things, they somehow find ways to measure the success at those things, and then they build a system that lets the model get rewarded when it gets things correct, which slowly makes the model better and better at those things.

If you give the model the ability to be graded on its success or failures, it will be able to improve at that thing. Historically, their focus has been things that people use the models for that aren't necessarily the researchers.

Things like asking about medical questions, or code, or getting personal help. This also gets a lot of feedback from people hitting the thumbs up, thumbs down in the chat apps like ChatGPT and Claude. That is the system they use to make the models better. They get feedback on what's good and bad, they find ways to grade what's good and bad, and they do reinforcement learning to get the model to behave more how they want it to.

This chart suggests something very interesting is happening. This chart is based on something the researchers did where they wanted to see how much better the model was than a researcher going wrong. So if a researcher is talking with Claude Code to test out some theory they have, they have two back and forths, they have like two messages they send and everything's going well so far, and then they send the wrong message or something that isn't quite correct, and it sends the model down the wrong path, where suddenly this history went from pretty good and going in the right direction to off the rails, not where it's supposed to be, and then they have to pull it back to where it's supposed to be.

They decided they wanted to see if the model could have made a better guess for that third step than the researcher did. So they took these histories, they removed the message where the researcher went wrong, and they asked the model, hey, what do you think we should do next? And then checked to see if that was better than the researcher did.

So this chart is not measuring whether the models are smarter than researchers on average. It's measuring whether the models are smarter than researchers when the researcher was already wrong. To do this, you need a shitload of data, which they have based on the internal Claude Code research sessions, the 129 of them they used for this particular study.

And now, they have a lot of data. Now, they have a chart that measures that Mythos picks better than the incorrect researcher 64% of the time.
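The study design described above can be sketched as a counterfactual replay: cut each session just before the researcher's misstep, ask the model for its own next step, and have a judge compare the two. This is a minimal sketch under stated assumptions. `propose_next_step` and `is_better` are hypothetical callables standing in for the model call and the judging procedure, neither of which is described in the video.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Session:
    messages: List[str]  # full transcript, in order
    wrong_index: int     # position of the researcher's misstep


def replay_win_rate(
    sessions: Sequence[Session],
    propose_next_step: Callable[[List[str]], str],  # hypothetical model call
    is_better: Callable[[str, str], bool],          # hypothetical judge
) -> float:
    """Fraction of sessions where the model's step beats the researcher's misstep."""
    if not sessions:
        raise ValueError("need at least one session")
    wins = 0
    for s in sessions:
        history = s.messages[: s.wrong_index]        # everything before the misstep
        researcher_step = s.messages[s.wrong_index]  # the message that went wrong
        model_step = propose_next_step(history)
        if is_better(model_step, researcher_step):
            wins += 1
    return wins / len(sessions)
```

At the reported 64% over 129 sessions, that works out to roughly 83 sessions where the model's pick was judged better (0.64 × 129 ≈ 82.6). Note that every session in the sample is one where the researcher was already wrong, so the rate says nothing about the average case.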

Most importantly though, they now have the ability to measure this, which means it is very likely this ended up in the training data. I genuinely believe that Mythos 5 has proprietary Anthropic information in its weights.

I think they accidentally trained the model to be better at their research on their stack in their proprietary environments, and they noticed that it was possible to get Mythos to give out that proprietary information.

They also probably like that though, because it makes the model better at their work. So now they have to find a balance, because they can't let their competitors get access to this private information and this IP that they should not have let into the model, and that's why they did what they did here.

They don't want people finding ways to sneak in to these weights to get the proprietary information that accidentally ended up in the model, and there's no way it's coming out now. Fable's silent invisible rerouting was the solution to this.

Make it basically impossible to even know that the model has this information in it. And that's why I think they went so hard here. Because researchers are now steering the model to do their jobs better for the first time.

They didn't realize the consequences of their actions, and now they have to do something about it post hoc. And that's how we got here. While they can take back this particular implementation, they can't take back the fact that it has happened and the floodgates have now opened.

The idea that a model can now quietly make your stuff worse. That's a real supply chain risk, and I think this blog post does a good job describing it.

Thank you to John Reddy for writing it. Anthropic says these safeguards only affect 0.03% of devs. Maybe that's true today.

The problem is that the definition of an AI company is changing. Maybe you're not training frontier models today, most companies aren't. But modern software is increasingly containing AI models.

Five years ago, building a startup meant writing APIs and SQL queries. Today, it often means training, tuning, and deploying models. Five years ago, models like CLIP were frontier AI research projects.

Today, I'm fine tuning them for a bootstrapped travel startup. If you're debugging a model training pipeline for your product and Claude gives a bad answer, was the model confused? Did you give it bad context?

Or did a hidden policy nerf Claude's ability to assist you? You won't know.

This is the end of our ability to trust the model. And that sucks, because as much as I hate Anthropic, they were at least relatively transparent with their bullshit. And while I am thankful they decided to walk this one back, because this was bad.

This was really bad. It's good that they walked it back. It's still really, really bad that they opened these floodgates.

I now feel less like I can trust the reasons why the model responds poorly. I now can't trust the outputs the same way. And as John put it, this is now a real supply chain risk.

The same company that I defended against the classification of supply chain risk earlier this year. I defended the fuck out of them and the issues with the Department of War. This does make them more of a supply chain risk.

This does mean that if you have them as a dependency in your way of building, in your pipeline, in your teams, in your business, you can't trust it the same way anymore. And I do not like the precedent that sets.

Hopefully, you guys don't think I'm overreacting here. I'm trying to be as reasonable as I can with something this absurd. I love the model, but I hate shit like this.

I'm thankful they walked it back, but I'm scared that the precedent's been set and things can get worse going forward. Let me know how you feel about this, and until next time, peace nerds.

The Hook

The bait, then the rug-pull.

Theo opens by crediting the model and immediately withdrawing the compliment. The setup is efficient: Fable is legitimately impressive, which makes what Anthropic did with it worse. Within thirty seconds he has promised to cover restrictions so bad that this video is going to pretty much guarantee he can never work for Anthropic.

Frameworks

Named ideas worth stealing.

06:04 · concept

Two-door model access

Mythos 5 door - walk in with the right key
Fable 5 door - triple-checked by guards before entry

Same underlying model, two access paths. Fable is Mythos behind classifiers. The distinction matters for capability, billing, and data policy.

Steal for: explaining why API model slugs can be meaningfully different even when base models are identical

08:30 · list

Three-tier safeguard stack

Visible fallback classifiers (cyber, bio, chem, distillation) - user notified, billed at Opus rates
Mandatory 30-day data retention - silent policy change that breaks enterprise agreements
Invisible prompt modification - silent sabotage of frontier AI development queries

Anthropic layered three distinct restrictions on Fable 5, each with different visibility and stakes. The third was the most dangerous and was later removed from the system card.

Steal for: structuring a breakdown of layered policy problems where each layer is worse than the previous

CTA Breakdown

How they asked for the click.

VERBAL ASK

29:11 · next-video

“Let me know how you feel about this, and until next time, peace nerds.”

Soft close with reference to companion video on the US government takedown order. No explicit subscribe ask.

MENTIONED ON CAMERA

01:37 · product · WorkOS AuthMD

FROM THE DESCRIPTION

PRIMARY CTA: Where the creator wants you to go next.

Thank you WorkOS for sponsoring! Check them out at ↗

Storyboard

Visual structure at a glance.

hook · open · 00:00
sponsor · sponsor · 01:37
value · model explainer · 06:04
value · classifier demo · 08:30
value · retention policy · 10:45
revelation · system card diff · 13:40
value · prompt mod unpacked · 16:07
value · community reaction · 19:56
cta · supply chain risk · 27:07
