Theo - t3.gg · YouTube

Fable is Mythos, and it is really good.

A 33-minute first take from a developer who spent about $2,000 on inference in 24 hours, plus roughly $3,000 on team subscriptions: benchmarks, real demos, session math, and the hidden safety intervention that silently degrades the model without telling you.

Posted: June 11th (1 month ago)

Duration: 32:54

Format: Talking Head, educational

Views: 84.7K (3.7K likes)

Big Idea

The argument in one line.

Fable 5 is a qualitative leap in AI coding capability — but Anthropic's undisclosed safety interventions mean you may pay full price for a silently degraded model without any indication it happened.

Who This Is For

Read if. Skip if.

READ IF YOU ARE…

You use Claude Code daily and want an honest first-take on whether Fable 5 justifies the price jump over Opus.
You run an engineering team and are deciding whether to upgrade everyone to $200/month plans.
You care about benchmark credibility and want to know which benchmarks are meaningful versus gamed.
You've hit Claude Code session or weekly limits and want to understand how the usage math actually works.
You want to understand Anthropic's new 30-day data retention policy and what it means for enterprise usage.

SKIP IF…

You're not a developer and have no interest in AI coding tools.
You want neutral analysis without strong opinions — this is a first-take hot take.
You're looking for a deep dive on any single topic; this is a broad sweep of many.

TL;DR

The full version, fast.

Fable 5 is the best available coding model — not because it scores highest on every benchmark, but because it writes code with more taste, tolerates vaguer instructions, and can be trusted to explore open-ended problems and return with real results. The cost is double Opus at $10/M input tokens and $50/M output, aggressive enough to burn $100 in eight minutes on heavy agentic work. The deeper concern is Anthropic's hidden safety system: for certain topics like frontier LLM development, the model is silently degraded via prompt modification and steering vectors with no notification, meaning you pay full price for a lobotomized response. Despite these concerns, the current window where $200/month subscriptions include Fable access is the best moment to push the model hard and discover where its real ceiling sits.
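The pricing claims above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that uses only the per-token prices and the $100-in-eight-minutes figure quoted in the video; the token mix in the example call is a hypothetical illustration, not a number from the video, and the helper names are made up for this sketch.

```python
# Per-token prices quoted in the video, in USD per million tokens.
INPUT_USD_PER_M = 10.0
OUTPUT_USD_PER_M = 50.0


def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one run at the quoted Fable 5 prices."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def burn_rate_usd_per_hour(spent_usd: float, minutes: float) -> float:
    """Average spend rate, extrapolated to one hour."""
    return spent_usd / minutes * 60


# Hypothetical token mix (assumption): 5M input tokens and 1M output tokens.
print(cost_usd(5_000_000, 1_000_000))   # 100.0
# The figure from the video: $100 spent in 8 minutes.
print(burn_rate_usd_per_hour(100, 8))   # 750.0
```

At that pace, usage-based billing would run to about $750 per hour, which is why the subscription limits look generous by comparison.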

Chapters

Where the time goes.

00:00 – 00:42

01 · The hook — Mythos vs Fable, $3k in 24 hours

Establishes the scale of testing done and sets up why Fable 5 matters even though Mythos 5 is the full release.

00:42 – 03:08

02 · Sponsor: Blacksmith CI

Blacksmith CI sponsor read — faster GitHub Actions builds, better logs and monitoring.

03:08 – 06:02

03 · Benchmarks overview

SWE-Bench Pro (80%), Frontier CodeBench Diamond (30% vs GPT 5.5's 5.7%), SkateBench (79%), TerminalBench penalty from safety filters.

06:02 – 08:29

04 · Frontier CodeBench skepticism + DeepSWE + pricing

Anomalous reasoning-curve data in Frontier CodeBench raises credibility questions. DeepSWE shows Fable comparable to GPT 5.5. Pricing: $10/M input, $50/M output.

08:29 – 11:50

05 · Real-world coding: Ping.gg modernization

15,000-line codebase modernization — worked in ~5 turns. Attempted full stack swap to TanStack/Convex/Clerk. Got further than expected but broke core functionality.

11:50 – 13:44

06 · $100 in 8 minutes + session limit math

Switched to usage-based billing to finish a workflow — spent $100 in 8 minutes. Then maxed a second sub in 2 hours. 1 session = ~25% of weekly limit.

13:44 – 18:44

07 · Hidden safety interventions + fallback routing

Fable routes to Opus 4.8 transparently for some topics. For frontier LLM development it silently degrades via prompt modification, steering vectors, PEFT. Bullshit Bench: 33% refusals. Artificial Analysis recorded 8% fallback vs stated <5%.

18:44 – 22:04

08 · UI taste + team demos

Fable has better design instinct than prior models. Team demos: Rust terminal text adventure, t3 code ported to Rust TUI, Minecraft clone with procedural assets, multiplayer racing game.

22:04 – 24:23

09 · SkateBench analysis

Used Fable to analyze its own SkateBench failures. Identified that two problems involving 'caballero / full cab' are nearly unsolvable for most models because of a jargon-compression mismatch. GPT 5.5 found no interesting trends; Fable did.

24:23 – 27:02

10 · Data retention policy + Simon's impressions

All Fable 5 traffic now requires 30-day retention regardless of trusted-access status. Simon built a 14MB WASM binding and full dataset agent entirely with Fable. 'Big model smell' — feels bigger, more capable.

27:02 – 32:53

11 · How to use this model: trust it more

Argument to push the model with vague exploratory tasks, use fuzzers and worktrees, clean up stale PRs in bulk. The economics of software development have changed. Get as much usage as you can during the $200/month window.

Atomic Insights

Lines worth screenshotting.

Fable 5 is Mythos 5 with a safety layer bolted on — the underlying capability is identical, but guardrails cost up to 20 benchmark points on TerminalBench.
Anthropic's fallback routing fires in 8% of real-world tasks according to Artificial Analysis, not the stated under 5%.
The model refuses to answer approximately 33% of the questions on Bullshit Bench, which reports refusals separately from scored answers.
$100 in 8 minutes is a documented real number: heavy agentic workflows at $10/M input + $50/M output hit brutally fast.
Fable 5 has meaningfully better UI design instinct than prior Claude models — possibly emergent taste, not just newer templates.
The 'big model smell' is a real quality signal: the model compresses domain jargon correctly rather than following descriptions too literally.
DeepSWE numbers show Fable performing comparably to GPT 5.5 even with safety restrictions active — and it uses fewer tokens per task.
Anthropic's new 30-day data retention policy applies to all Fable 5 traffic regardless of prior trusted-access agreements, making it legally unusable for some enterprise workloads.
One 5-hour session equals roughly 25% of the weekly token limit — you can exhaust the weekly cap in four full sessions.
The hidden intervention for frontier LLM development uses prompt modification, steering vectors, and PEFT — and is explicitly not disclosed to the user in real time.
Frontier CodeBench shows anomalous results where higher reasoning budgets do not monotonically improve scores, suggesting the benchmark may not be reliable.
The real productivity shift from Fable is not one-shot accuracy — it is being able to give the model vague exploratory instructions and trust it to come back with something real.
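The DeepSWE point about fewer tokens per task is really a break-even claim, and it can be sketched as a one-line calculation. The 2x price ratio comes from the video ("double Opus' price, per token"); the 0.4 token ratio below is a hypothetical value chosen for illustration, and both function names are assumptions of this sketch.

```python
def relative_cost(price_ratio: float, token_ratio: float) -> float:
    """Cost of the pricier model relative to the cheaper one.

    price_ratio: per-token price of the new model / old model.
    token_ratio: tokens used by the new model / old model on the same task.
    """
    return price_ratio * token_ratio


def break_even_token_ratio(price_ratio: float) -> float:
    """Token ratio below which the pricier model is cheaper per task."""
    return 1 / price_ratio


PRICE_RATIO = 2.0  # from the video: Fable costs double Opus per token

print(break_even_token_ratio(PRICE_RATIO))   # 0.5
# Hypothetical (assumption): Fable uses 40% of the tokens Opus would.
print(relative_cost(PRICE_RATIO, 0.4))       # 0.8
```

In other words, at double the per-token price the model only comes out cheaper per task if it finishes the same work in under half the tokens.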

Takeaway

What actually changes when a model has taste.

WHAT TO LEARN

Fable 5's ceiling is not higher benchmark scores — it is a model that compresses domain knowledge correctly, tolerates vague instructions, and can be trusted to explore rather than just execute.

A model with taste does not just produce more correct outputs — it produces outputs you actually want to merge, because it applies the same compression heuristics an experienced developer would.
Hidden safety degradation is a new category of risk: you can pay full price for a model that has been silently made worse for your specific use case, with no indication it happened.
Session and weekly token limits compound against each other — one heavy agentic session can consume 25% of a weekly budget, making burst tasks disproportionately expensive relative to steady daily use.
The productivity shift Fable enables is not one-shot accuracy — it is the ability to hand the model a vague exploratory problem and trust it to return with something actionable rather than something that needs full supervision.
Data retention policies are now a first-class model selection criterion: if your work involves sensitive data, the 30-day mandatory retention on Fable 5 traffic may make the model legally unusable regardless of capability.
Benchmark credibility requires scrutiny of anomalies — a benchmark where higher reasoning budgets do not produce monotonically better scores is not measuring what it claims to measure.
The current subsidized window on subscription plans is a real economic opportunity: you get access to the most capable available model at a fraction of its inference cost, but that window is explicitly temporary and capacity-dependent.
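The monotonicity test behind the benchmark-credibility point is easy to automate. The sketch below flags every step where a higher reasoning level scores lower than the one before it. The scores are hypothetical placeholders shaped like the curve described in the video (a dip at low, a peak near 13% at high, a drop at max); they are not the actual Frontier CodeBench numbers, and the function name is an assumption.

```python
from typing import Sequence


def non_monotonic_steps(
    levels: Sequence[str], scores: Sequence[float]
) -> list[tuple[str, str]]:
    """Return (lower_level, higher_level) pairs where the score drops."""
    return [
        (levels[i], levels[i + 1])
        for i in range(len(scores) - 1)
        if scores[i + 1] < scores[i]
    ]


levels = ["minimal", "low", "medium", "high", "max"]
# Hypothetical scores in percent (assumption), not real benchmark data.
scores = [9.0, 6.0, 9.0, 13.0, 10.0]

print(non_monotonic_steps(levels, scores))
# [('minimal', 'low'), ('high', 'max')]
```

A single small dip can be noise; several large reversals across five reasoning levels is the pattern the video calls out as suspicious.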

Glossary

Terms worth knowing.

Fable 5: The publicly released version of Anthropic's Mythos 5 model, shipped with additional safety guardrails that restrict certain categories of requests and can silently degrade responses in others.
Mythos 5: Anthropic's highest-capability model in the Mythos series. The full unrestricted version; Fable 5 is the public-facing variant.
TerminalBench: A coding benchmark that runs agentic terminal tasks. Fable 5 scores 20 points lower than Mythos 5 on this benchmark because safety filters block a subset of the test tasks.
SkateBench: A benchmark the reviewer built that tests models on naming skateboard tricks from three-dimensional spatial descriptions. Google models consistently outscore others, revealing differences in spatial reasoning and domain compression.
DeepSWE: A software engineering benchmark from DataCurve that the reviewer considers more credible than SWE-Bench Pro because it correlates with real-world coding experience rather than memorized pull request descriptions.
Frontier CodeBench: A tiered coding benchmark with 150 general, 100 harder, and 50 diamond-difficulty problems. Fable 5 scores 30% on the diamond tier against GPT 5.5's 5.7%, and almost double GPT 5.5 on the harder tier, though the benchmark shows anomalous reasoning-level curves.
PEFT: Parameter-efficient fine-tuning. One of the hidden intervention methods Anthropic uses to limit model effectiveness for certain sensitive topics without falling back to a different model.
UltraCode: A high-intensity agentic workflow mode in Claude Code that uses more tokens to achieve more thorough results. Referenced as the mode responsible for the $100-in-8-minutes incident.
5-hour session limit: A hard usage cap in Claude Code that resets on a rolling timer. Under the current subscriber plan, one 5-hour session consumes approximately 25% of the weekly token allocation.
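The session math in the last entry can be written out as a short calculation. This is a rough sketch based on the reviewer's estimate that one maxed 5-hour session uses about 25% of the weekly allocation; the estimate is his, taken from a single fresh account, and the function names are hypothetical.

```python
# Reviewer's rough estimate: one maxed 5-hour session is ~25% of the weekly limit.
SESSION_SHARE_OF_WEEK = 0.25


def full_sessions_per_week(session_share: float) -> int:
    """How many fully maxed sessions fit inside one weekly allocation."""
    return int(1 / session_share)


def weekly_share_left(sessions_maxed: int, session_share: float) -> float:
    """Fraction of the weekly allocation left after maxing some sessions."""
    return max(0.0, 1.0 - sessions_maxed * session_share)


print(full_sessions_per_week(SESSION_SHARE_OF_WEEK))   # 4
print(weekly_share_left(3, SESSION_SHARE_OF_WEEK))     # 0.25
```

Since the reviewer maxed one session in about two hours of heavy agentic work, four such bursts could exhaust the week well before seven days pass.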

Resources

Things they pointed at.

01:43 · tool · Blacksmith CI

07:36 · tool · DataCurve DeepSWE

18:40 · tool · Artificial Analysis intelligence benchmark

24:23 · tool · Simon's dataset agent (datasette)

Quotables

Lines you could clip.

09:38

“It just feels smarter, like it's writing better code. It feels like a better employee or a better person with more years of experience than previous models.”

Clean standalone take, no setup needed, visceral quality signal → TikTok hook

12:51

“Eight minutes. Eight fucking minutes to do a $100.”

Shocking number, zero context needed, instantly shareable → IG reel cold open

16:28

“They are intentionally making the model dumber when you try to use it for certain things and they don't tell you when that happens.”

Controversy hook, clear accusation, invites engagement → TikTok hook

29:49

“It's like going from a really cracked junior engineer to a kinda laid back senior one, where it knows enough to make good decisions by itself.”

Memorable analogy, self-contained, widely relatable → Newsletter pull-quote

30:06

“I've never felt more like I'm along for the ride it's taking me on, rather than I am the one steering the ship.”

Poetic closer-style line, shareable as a standalone thought → Newsletter pull-quote

The Script

Word for word.

This video is a little later than usual. I wanted to do my due diligence when testing this new model series because it is a big, big deal. We've been hearing about Mythos forever, and to finally have it in our hands is unbelievable.

It's so unbelievable that we don't actually have it in our hands because Mythos five is not the model that we all have access to now. What we did end up getting is a new model called Fable five, which is still Mythos, but it has a bunch of safeguards in front.

And while those guards definitely cause it to not do what we want in a lot of different scenarios, that hasn't stopped me from using the hell out of this model. In the twenty four hours since it dropped, I've done over $600 in inference. Oh, wait.

That's my Mac Mini. Yeah, that looks a little more right. I've done about $2,000 of inference on this model already.

I burned through the five hour session limits on two $200 accounts at the same time, which I never thought I would be able to do. I've been pushing so much shit with this model, and it's not just me either. I got my whole team upgraded to the $200 a month plan, and they have been cooking crazy stuff too.

From terminal-based 2.5D adventures to fully functional multiplayer 3D racing games.

Yeah, my team's been cooking. I can't wait to show off all the cool things we've been able to do with this model, as well as the problems you might encounter as you build with it. But since I just spent like $3,000 on subscriptions for my whole team, and we'll probably have to spend a lot more on inference for this model, I hope you understand why we're doing a quick sponsor break.

I know making fun of GitHub is all the rage, and I've been doing it a lot, but there's one thing on GitHub that is genuinely just unacceptable to still be using. It's the CI. If you're still on GitHub CI, please pay attention.

You've already heard me talk about Blacksmith, but you need to actually try it. It's so easy to set up. You go to the YAML file, you swap Ubuntu latest to Blacksmith 4vCPU, you sign in with GitHub in order to link it, and now you can have builds that are up to four times faster and also half the price.

They say 60% here, and this is what a lot of people see, but we've seen much further. I've seen most of my build times get cut in half, and when you combine that with the much cheaper price, the result is as low as a fourth of the cost. That's why everyone from Mercury to Expensify to Descript and Exa to Clerk and also now T3 Chat and T3 Code are all building with Blacksmith.

They also just introduced runners for macOS, which has been a game changer for us. Our Mac builds used to take upwards of like sixteen to twenty minutes, and now they're consistently under 10, which is unbelievable. It's made us way more excited to do new releases for T3 Code.

We even introduced nightlies because of how much Blacksmith sped up our builds. If this was just the faster builds, that would be worth the money, and honestly, I'm down to pay for it just for that. But the better logs, monitors, and history systems are way, way more powerful than they have any right to be.

I had no idea how much I was just flying blind as I was using more and more of GitHub actions. And now that I can actually see what's going on, I can set up custom monitors, I can see where the error rates are happening, what actions are failing and why, debugging things is suddenly so much easier.

When you're trying to figure out why your CI is still slow, having a UI like this that shows you clearly which steps take how much time and how many lines of output did they generate, ugh, game changing. Your team and your agents deserve better CI. Get it now at soydev.link/blacksmith.

So let's talk about these models. I have a lot of layers to go into here, from how it feels to build with them, to all the benchmarks that we got, to a handful of benchmarks that aren't even public yet that I was able to get early access data for so I can show you guys first. So you should be excited.

I know I am. I took a lot of time to get this done, to do it right. I'll do my best to not waste time and just cover the facts.

So we'll start with the first section here, which is the scores that the models got. Notice that they say Mythos five slash Fable five here. The reason for that is because a lot of these benches had questions that Fable five outright refused to answer, which would plummet the scores.

Even something simple like Terminal Bench saw a 20 point drop compared to when the same run was done with Mythos, because it just couldn't run a bunch of the problems.

But when you take a look at what its capabilities are, as in what it's able to do unrestricted, it crushes.

On SWE Bench Pro it got an 80%, where GPT five six only got a 58.6. Remember though, we don't like that bench.

I did a detailed breakdown of why SWE Bench Pro is kind of garbage now, and this model unsurprisingly kills it, because most of the bench is existing pull requests and handing the model the description of this PR that merged in an old commit hash, and it just seems like Mythos five has so much data in its training that it has the details to recreate those PRs.

The Frontier CodeBench, however, is much more interesting, and we'll dive into that more in the future. For now, though, it got a 30% when Opus 4.8 only got a 13, and 5.5 only got a 5.7. On knowledge work, it did very well.

Its vision capabilities are a meaningful upgrade from the Opus line. It's finally leading GPT in vision. Apparently, it's beating out Gemini 3.1 Pro, so I'm not sure about that, because Gemini vision stuff tends to be really good.

It's really good at spatial reasoning. The first time they've taken this lead from OpenAI, which is cool to see, like, I have a lot of fun things to test with spatial reasoning.

That said, it didn't do as good with my spatial reasoning bench, SkateBench. This is a bench where I measure every model's capability of naming skateboard tricks based on the description of it in a three-dimensional space. Google has maintained their absurd lead here with 3.1 Pro at 98%, which is insane.

And almost every other model is just not coming close. There's a couple problems in my private version of SkateBench that very few labs get. For some reason, Google does and the rest don't.

And we don't see much progress here with Fable five, at a 79%. A lot better than the 72 they got previously with 4.6, but 4.8 was actually a pretty big regression.

I don't know if there's something wrong with the thinking on that, but it struggled a lot and only got a 47. Fable did better, but also, like, it doesn't want to reason on this. No matter what I've done to try and force it, yeah.

A previous run where I threw it on max was able to get up to an 83% from Fable, but still nothing compared to Google's scores, which are as high as a 98% pretty often. Not the best bench in the world, just an interesting one to compare things against.

And as I mentioned before, it scored well on Terminal Bench, but the model we get scored much more poorly because it blocked a handful of the requests in the terminal bench tests. We should talk about the frontier code numbers a little bit because they are nuts. This is the second of their three tiers.

They have their general tier, which is 150 problems, their harder tier, which is 100, and then their diamond tier, which is just the 50 hardest. The numbers we were just looking at were for that 50. This is the number for the 100, where it's a lot closer, but still almost double GPT 5.5's score.

I have suspicions with this bench. I've been talking about them a bit with others. When I dug through Frontier Code, I found some things that make me skeptical of it.

In particular, this chart here, which shows that Opus 4.8 did better on minimal reasoning than low by like 30 plus percent, went right up to where it was before on medium, and then for some reason, high suddenly does really well at like 13 something percent, and then drops again when you go to max.

This looks like a random number generator to me. If I'm being real with you guys, the amount of hard problems it can solve should go up as reasoning goes up, not randomly go up and down. I'm lucky to have DataCurve sharing the DeepSWE numbers with me early.

If you haven't already watched the DeepSWE video, highly recommend it. Most of these code benchmarks suck. This is one that I think makes actual sense.

At the very least, it lines up with my experience, and what it shows here is that Fable on X High performs comparably to GPT 5.5. Remember though, Fable is somewhat nerfed simply by the safety restrictions. I have no reports of them hitting those restrictions or even mentioning that it dropped down to Opus or whatever automatically, which is what a lot of benches do.

They just drop down to Opus when Fable refuses, to give a reasonable experience similar to what you would have with Claude Code, because that's how it works there. They got numbers very close to GPT 5.5. More importantly, they crushed all of Opus' scores with fewer dollars spent.

So even though this model is much more expensive, it ends up being a better deal overall compared to most Opus models because it uses way fewer tokens. This is a great change and I'm thankful to see Anthropic token maxing a little less hard, at the very least with the capabilities of this model.

And it's a very good thing because the price is double Opus' price, per token specifically. Fable and Mythos are offered at $10 per million input tokens and $50 per million output tokens. This also means they burn through your limits way faster.

And more importantly, Claude Code subs only have access to Mythos in Claude Code from today until June 22. Fable's included for now, but that will change on June 23.

If capacity allows though, they will extend the window. A big part of why they were finally able to put this out is because they have GPUs. Thank you, Elon.

So those are the benchmarks. That's the like core info on it coming out. How does it feel to actually use this model?

Well, I mentioned, I've been using it a lot. I've been juggling between accounts. I have been token maxing using UltraCode and tons of workflows for the last twenty four hours.

And it's a damn good model. I'd go as far as to say it's the best coding model ever released by quite a bit even. It feels very different from a jump like 5.4 to 5.5. The simplest way I can put it is that GPT 5.5 felt like they took 5.4, trimmed a bunch of stuff out, and then rebuilt it to be better. Mythos really just feels like more Opus.

Like they turned it up to 12 or 13 somehow. It's more thorough, it's more willing to cheat, it's harder working, it is smarter, it is unbelievably capable, and it gets shit done.

Usually, at least. I tried one of my favorite tasks, which was taking my old Ping.gg codebase for the video streaming app I made in 2021 that I haven't touched much since 2022, so it's very out of date, and I tried to let the model go and modernize it.

It used UltraCode, it burned quite a bit of tokens, and while the final build it created didn't work, about four turns later of me screenshotting errors and pasting them in, it got a fully working version, which was really impressive. This codebase is massive. This is a 15,000-line code change, and it worked.

It didn't one shot it, to be very clear. Like, I had to put a good amount of time going back and forth with it, but it did succeed, and that's really cool to see. Only a few models have even come close, and Fable is one that is capable of doing this type of work.

So I tried to go a bit further and have it modernize the project with the things I like to use now. Instead of using the old MySQL database through Prisma, instead of the old hacky build of tRPC and custom self-rolled auth that was breaking all over the place, I asked it to move over to TanStack Start instead of Next.js, to Convex for all of the data stuff, and to Clerk for the auth layer.

I actually don't think I asked it to use Clerk, I think it just picked based on my system prompt. I, yeah, I do have something in my global that mentioned that, so that's probably where that came from. When we look here, it is doing yet another workflow where it's going through trying to diagnose weird auth issues I was having because as good as Clerk is, it managed to hit some edges.

It made some silly mistakes like with the layouts on this page here. Amazed that it screwed up something like this with the port. It did an okay job.

The Google one will get us in trouble though, sadly. It maintained my browser support, which is good. This is progress.

It didn't work for this before. Oh, god. What's this menu?

It does some questionable things, to put it lightly. Why don't you request to join my own room? I own this room.

Okay. So this type of port is still far from what this model is capable of. It broke almost all the core functionality.

It has a ton of, like, random UI regressions. But it got further than I would have expected, honestly. It's good to see it able to get even close to something like this until you learn how much it costs to do.

And this is where one of the big problems with this model starts to show. I was running this workflow before bed last night at around 11 PM or so, and it started getting pretty far. But then, right as it was finishing the last step in one of the, like, bigger workflows it was doing, I hit my usage limits.

And I wanted to just see it get to the end so I could film my video and go to bed. So I decided to switch over to usage-based billing so I could just let it do some work and finish up what it was working on. That ended up being a mistake because it spent $100.

How long do you think it took to spend that $100? An hour? Five hours?

Maybe like late into the night? It was about eight minutes. Eight fucking minutes to do a $100.

And I realized that my usage limits were much better than that with the sub, so it's probably worth just getting another. So I did. And then I maxed that sub out in another two hours.

To be clear, only the five hour limit, the session limit. But goddamn, the fact that I could in one run with one workflow, with one prompt, max out two subs, and then have to go to bed without the task being done because I didn't wanna spend a thousand additional dollars on it.

It's rough. It is. And I'm very excited for this level of intelligence to be cheaper eventually.

Since I was on a fresh account, I did get some useful info. From my rough math, it looks like one of those five hour sessions that you get is worth roughly 25% of your weekly limit.

So when you max out your five hour, you can keep going when the timer's up, but you can only do that to the end four times in a week, and I could see myself easily hitting that. This means I also suspect they'll be putting out a higher price tier in the future that has even more subsidization because their margins are good enough, they can probably afford it now, especially with all the new GPUs.

But it's also clear they're not sure what the usage patterns will look like and how it will affect their compute. That's why they said here, if capacity allows, we'll extend the included window, because they don't know yet. That's not a question of whether they get enough GPUs; it's a question of whether they can keep up with the demand.

I did have a few moments with this model that made me very frustrated though. It knows so much that it sometimes confuses itself.

One example I have is when I was working on Lakebed. I had it do an audit of my cloud project that I've been working on for a bit. Hopefully, we'll ship a real version of this very soon.

And in this environment, I have two environments. I have production, which is linked to the prod branch on GitHub, but I separately have staging, which is linked to the main branch on GitHub.

So when I merge changes, they auto deploy on staging, a new package comes up for the staging environment, and once I verified everything is good, I then promote it to prod. When I had it auditing this project, it insisted that the current versions of the package were broken because they required some fields to deploy that only existed on the staging environment and on main.

So it audited the main branch. It audited the NPM packages and said, these are incompatible. Everything is broken.

What are you doing? And like wrote a whole report about how I need to fix it ASAP. And I'm like, no.

That's not how any of this works. The package deploys to the production environment. The main branch is mapped to staging until we promote it.

And when I told it that, it rewrote the report with that information, still insisting it needs to be fixed. I'm like, no, that doesn't need to be fixed.

That's just how the environments work. Chill out, bro. And that was one of my first impressions, and it wasn't great.

So I didn't have that high of hopes when I started asking it to do other big work and overhauls and crazy tasks that I didn't think it'd be capable of. And it always got pretty damn close if not making it there. I've seen a lot of other people mentioning they're getting tons of refusals with the model.

That has not been my experience, at the very least while using Claude Code, almost at all. I think I got one refusal total. On the website, however, I've gotten a lot of them.

When I was testing out the claude.ai site for Fable, I asked some questions about some like SEO stuff for agents because I noticed certain things weren't coming up when I asked about it in Claude. I was like, okay, how do I make it so these results are more likely to be noticed and surfaced by agents? It thought I was trying to manipulate data for my competition or something and routed me to Opus 4.8, and I had to correct it and say, no, I am Theo.

I am t3.gg. I own this website. I'm just asking if it's a good place to put this information for SEO and agent search reasons.

And then it let me go back to Fable. It was like, okay, yeah, I get it now. It's also you have to negotiate with the model to convince it that you should be allowed to use Fable for the task.

And it's not always clear about whether or not you're being routed to Fable. When you hit on sensitive topics like cybersecurity, biology, and chemistry, it will route you to Opus and be pretty transparent about it.

But with certain things, like in particular, if you're working on frontier LLM development, like you're trying to make your own models, it won't fall back to a different model. Instead, it will limit the effectiveness through methods like prompt modification, steering vectors, and parameter efficient fine tuning.

The interventions will not affect the majority of coding work. We estimate they'll impact 0.03% of traffic and fewer than 0.1% of organizations. That is still scary though.

This means that they are intentionally making the model dumber when you try to use it for certain things and they don't tell you when that happens. So you're being billed full price for a model that is dumber and don't even know when that happens. This is bad enough that I plan to do a whole video on it in the near future, but I don't want it to be the focus of this one because there's enough good to talk about here that I do really wanna focus on that.

The artificial analysis bench did very well as expected. It is now the smartest model ever according to their benchmark, and let's be real, it's the smartest model we have access to by a lot right now. But there are some interesting details in here.

It's five points ahead of 5.5, which is huge. It's one of the biggest leads we've seen in a while, but more interestingly is the experience they had testing it.

Anthropic states the fallback to Opus 4.8 occurs in fewer than 5% of sessions on average, but Artificial Analysis recorded fallback routing in 8% of the tasks across the Intelligence Index, mostly in scientific questions from evaluations like GPQA, AA-Omniscience, and Humanity's Last Exam.

It apparently bombs HLE like nearing a zero because it can't answer so many of the questions, but Mythos gets a state of the art score. Just interesting to see that their attempts to make the model safe are now actually measuring dumber.

These numbers scare me in particular. This benchmark might not be the biggest, most important one ever. It's called Bullshit Bench for a reason, but it grays out refusals and it refused to answer at all on 33% of the questions in the bench, which is kind of insane.

I will again emphasize that I've only really had refusals when asking about, like, cryptography problems or SEO stuff in the Claude app. Claude Code has been nowhere near as aggressive with refusals for me, but I know others have had much worse experiences, and I again plan to cover that all in detail in the near future. I wanna talk a bit more about code though, because there are other pieces that are genuinely super impressive.

It's way better at UI. I honestly feel like Opus got away with a lot because the templates it felt like it had built in were good enough to look pretty good a lot of the time, but those templates got old fast.

I can't tell yet if Mythos and Fable have a newer, better set of templates, or if they actually now have some design instinct, but I can tell you confidently, a lot of these look better. Some look cringe as shit still, but a lot of these look nice.

Like, the little tape it added there, the structure of the page, it's solid. It's not as aggressively AI generated.

And I could see a lot of these designs being good starting points. This one almost feels like a Gemini design, actually, now that I look at it. It's good.

Like, these are workable. It's kinda crazy how quick models got decent enough at design type work. We gotta have a bit more fun, though.

Let's play with some Rust. This is the game I mentioned Maria made before. It's a terminal-based 2.5D game where you can give instructions, like a classic text-based adventure.

You can also click, which works better than I would have expected. They keep me indoors for defensive purposes. I found 23,019 bugs, and they grounded me for it.

A friend at SpaceX owes me a favor. Say, wait.

A dot appears in the sky. It grows. 230,000 GPUs, slightly used.

Grok stickers are still on it. Quick. Type connect.

Oh, boy. Some issues with the text, but, like, the fact that this all works at all and that, like, I have this environment in the terminal is just unbelievable.

For a model to have been able to build something like this is truly insane. She also ported T3 Code to a terminal app in Rust.

She already had t one code in TypeScript, but she successfully ported it to Rust, and now it's a full terminal UI where you can, like, click around, make new threads, get responses. Wild shit. Addy made a full Minecraft clone.

I did not realize this was this legit. I saw her posting some screenshots and clips, but... it's a story game?

The copy's good. I don't know how much of this she wrote. Like, I know it's silly, but having a proper Minecraft clone where it made all the assets, it made all of everything, this is unbelievable.

I'm very impressed. Because remember, they don't even have image gen, so it had to, like, make all these textures itself, probably programmatically somehow. Wild.

Proud of my team. I teased the racing game earlier, but what I didn't mention properly is that it has full functioning multiplayer, as well as a spectator mode.

So my whole team was just playing this with each other all night last night, which was crazy. Again, this is a piece of software that didn't exist before Mythos dropped, and they whipped it together with a few prompts in like an hour or two. This level of capability, this level of 3D understanding, this level of one-shot ability is unbelievable.

It has been a slow burn to get this far, but seeing it in action, it just hits different.

It's wild to see the tools getting this far so much faster than I would have guessed. I'm not as creative as my team, so my use cases are not as cool as a lot of theirs.

But one of the things I did do is I had it analyze the results on SkateBench, so I could get like a breakdown on interesting things.

And it found some really cool info that I didn't have before. One piece I found in particular that's been super helpful to know is that SkateBench is quote, really two benchmarks.

I don't like the way it writes copy still, but it points out that two of the problems here are nearly unsolvable for most models.

The tricks are a Caballerial, or a full Cab, which is a fakie 360. And since I describe it as switch and off the nose, it calls it a switch nollie backside 360.

This occurs over 163 times in my bench, or a switch nollie backside bigspin for the fakie bigspin. If you're not a skater, these things don't mean anything to you. What it means is that the other models are following the description of the trick too literally and not compressing the name properly.

Stupid analogy here, but hear me out. If I'm building a computer and I need a fan, that is a computer fan. If I'm using a computer and I need a keyboard, that is a computer keyboard, but nobody calls it a computer keyboard.

They just call it a keyboard. Is it correct to call it a computer keyboard? Yes, but people would look at you funny when you say it.

And that's what all of the models did when asked to name these tricks, other than Gemini 3.1 Pro. And it was actually really cool to see Mythos, even though it also got the question wrong, able to identify this and describe the results so well here.

It also identified some very interesting trends around reasoning levels and how certain models do or don't really do anything with that. As a sanity check, I ran the same prompt with the same data against GPT 5.5, and it didn't get any interesting trends at all, really.

It found that the 360 inward heelflip question does a good job of separating the bad models from the good ones, but it didn't identify the specific interesting points that we talked about before.

Just an interesting thing to ask two models to do. And yeah, I'm impressed with Mythos' ability to take large amounts of data and do things with it. It also recreated the whole bench.

Something I haven't done yet, though, is look at the web version because it also overhauled the visualizer. Oh, no. It put all the questions public.

I can't do that. Don't want people distilling on it. Oh, speaking of distilling on your inputs, it is very likely a big part of why this model is so good is because they are distilling on our Claude Code histories.

And alongside that, they also added a new data retention policy. So all usage for Fable 5 will require 30-day retention for all traffic. So even if you have the trusted setup with Anthropic where they don't save your data, that cannot be the case with Fable 5.

If you are using this model, Anthropic is getting your data. They claim that they won't use it to train new models or for any non-safety-related purposes, and they also have new privacy protections for logging and whatnot, but they are now storing your data, which means you literally cannot use these models for a ton of real-world things where the data retention policy would make it against the company policy or even against the law in some cases.

Simon also didn't get early access, so this is based on his first few hours of use, but he called out specifically that while he doesn't care how much a model knows, he definitely can tell that this model is big. It has the quote big model smell.

He recently built a MicroPython WASM binding so he could use WASM for Python code inside of a web container. And he decided to try and get it to be full Python instead of just this subset. After a little bit of encouragement, it spit out this 14 megabyte WASM binding that apparently just works.

He updated his Datasette agent, which is an app he built for managing and looking into SQLite databases, with a bunch of huge new features that he ended up actually releasing, written entirely by Fable. He said he was impressed with the quality of the API design, tests, code, and documentation that Fable put together. And I agree here.

That's one of the things I've liked about Fable, is the code feels more like code I wanna hit merge on. It's not like measurably better in the sense that it performs way better or whatever. It just feels smarter, like it's writing better code.

It feels like a better employee or a better person with more years of experience than previous models. All models will put in enough effort to solve most problems now, at least the high end good ones.

But Mythos and Fable now write higher quality code. And at the absolute least, I plan to keep using them to go review changes I made and find ways to simplify and make the code better. It just has more taste.

It's still weird as fuck. It still feels like a Claude model. It still has its random errors with tool calls and their shit harness and all of that.

But it feels better. The outputs are much more usable than anything I've gotten out of an Anthropic model before, and better than I'm even seeing from OpenAI models now, which is good.

I wanna talk now a bit about how we should use this model and how it should affect our day to day work. I like this post from Walden as a starting point. Your organization will likely not scale with the exponential curve of AI.

This should be a wake up call for engineering teams. Set up your cloud software factories now. Models can fix impossible bugs, UI test the hardest flows, write extremely good code, etcetera.

I haven't opened Datadog manually as far as I can remember. AI should be the first line of defense for bugs and feedback. Humans should only look at PRs after an AI's already reviewed it.

AI should generate screen recordings of any PR before human eye has even reached it. The agents should just prompt itself most of the time. I have still not fully come around to agents prompting themselves.

I find agents write shit prompts, but with things like the workflow and ultra code features that I've been abusing recently in Claude Code, I'm starting to come around a bit. It feels like I'm lighting tokens on fire, but it can make working results often enough that I find myself pushing way further than I previously did.

In the past, I would rotate between being super parallel and being super locked in on the loop. Especially once I started using GPT 5.5 on fast and low and medium, I would just go back and forth a whole bunch and churn out work quickly. But I wasn't experimenting as much because I was in the loop. I wasn't really able to do multiple things at the same time and still have the quality I wanted out.

Mythos is capable of going off and exploring and trying bigger things with much vaguer instructions. You can give it something as vague as look into other options to make this more performant and it will synthesize good ideas, test them, validate them, and then come back to you with results.

Another just unbelievable thing it did is it started writing fuzzers to check for potential issues inside of Lakebed, and found a handful in a huge overhaul I was doing of the database architecture there. It's able to come up with solutions.
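For anyone who hasn't written one, a fuzzer in this sense can be very small. The sketch below is not the Lakebed fuzzer and uses none of its code; `TinyStore` is a hypothetical stand-in for a storage layer, and the technique shown is plain model-based fuzzing: apply random operations to the real thing and to a trivially correct model (a dict), and stop at the first disagreement.

```python
import random


class TinyStore:
    """Hypothetical system under test (a stand-in, not Lakebed)."""

    def __init__(self):
        self._rows = []  # list of (key, value) pairs

    def put(self, key, value):
        for i, (k, _) in enumerate(self._rows):
            if k == key:
                self._rows[i] = (key, value)
                return
        self._rows.append((key, value))

    def get(self, key):
        for k, v in self._rows:
            if k == key:
                return v
        return None

    def delete(self, key):
        self._rows = [(k, v) for k, v in self._rows if k != key]


def fuzz(steps: int = 10_000, seed: int = 0) -> int:
    """Run random operations against TinyStore and a dict model.

    Raises AssertionError on the first mismatch; returns the step count otherwise.
    """
    rng = random.Random(seed)  # seeded so any failure is reproducible
    store, model = TinyStore(), {}
    for step in range(steps):
        key = rng.randint(0, 20)  # small key space forces overwrites and deletes
        op = rng.choice(["put", "get", "delete"])
        if op == "put":
            value = rng.randint(0, 1_000)
            store.put(key, value)
            model[key] = value
        elif op == "delete":
            store.delete(key)
            model.pop(key, None)
        assert store.get(key) == model.get(key), f"mismatch at step {step}: {op} {key}"
    return steps
```

The seed is the important design choice: when a run fails, the same seed replays the exact sequence that broke it.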

And that feels different. I'm not saying old models couldn't do this before. I'm saying I didn't trust them to do this before.

And if it did come up with a solution, I would audit it deeply and use other models to come in and give feedback on it before letting it proceed. Mythos is smart enough, it has enough knowledge baked in, it has enough taste baked in, that I can trust it more to go out and figure out these things, and then come back with results.

Does it do this perfectly 100% of the time? Absolutely not. But you should be running into those problems more, because you should be pushing the limits of what it can do more.

It's hard to put into words how often I've been genuinely impressed with the outputs I got from this model. And not just in code either. It has enough taste to be helpful with things like finding good content to make videos about.

I've had a bunch of YouTubers hit me up saying it's the first model that can generate titles in a way that isn't cringe and terrible. It is smarter and it has taste. And because of that, you can trust it with more.

It's like going from a really cracked junior engineer to a kinda laid back senior one, where it knows enough to make good decisions by itself. And I've never felt more like I'm along for the ride it's taking me on, rather than I am the one steering the ship.

And I think we should embrace that a bit. Am I saying you should fully let go, stop reading code, and just vibe it out? Absolutely not.

What I'm saying is the ceiling for this model is meaningfully higher and we should be pushing to hit it. Rather than just being excited when a problem 5% harder is solvable, we should go for problems that are five times harder, be surprised when it fails, and like figure out where the line is between the two and what we can give the model to make it better at unblocking itself.

Things like giving it computer use capability so it can debug in the browser itself. Things like letting it write fuzzers or test beds or whatever it needs to be more confident in the code that it writes. I had this model go through all of my stale PRs in Lakebed, make worktrees for each of them, analyze them deeply, and give me its thoughts on whether each one should be updated, rewritten, or thrown out entirely.
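The worktree step is easy to script yourself. This is a minimal sketch, assuming you already have a list of stale PR branch names; the branch names and the `../pr-worktrees` directory are hypothetical. It only builds the `git worktree add <path> <branch>` commands and does not run them.

```python
from pathlib import Path


def worktree_commands(branches, root="../pr-worktrees"):
    """Return one `git worktree add` command per branch, without executing it."""
    commands = []
    for branch in branches:
        # Slashes in branch names would create nested directories, so flatten them.
        path = Path(root) / branch.replace("/", "-")
        commands.append(["git", "worktree", "add", path.as_posix(), branch])
    return commands


# Hypothetical stale PR branches.
for cmd in worktree_commands(["fix/login-race", "feat/cache"]):
    print(" ".join(cmd))
```

Each command can be handed to `subprocess.run(cmd, check=True)` from inside the repository, giving every PR its own checkout so an agent can inspect them in parallel without switching branches.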

I cleaned up a shitload of PRs in an hour or two using this model. I rewrote a bunch of data layers with this model in a few more hours. I ported an app from a five year old code base nobody wants to touch to a modern stack, which admittedly has its bugs, but is absolutely fixable because I let it go and do its thing.

It's time to think a little more boldly with how we use these things. Especially during this short window where we have thousands of dollars of usage for just $200 a month. I know I just got a lot of people to cancel their Claude Code subs and I stand behind that, especially with the business practice Anthropic was going with at the time.

But a combination of their realization that they need to be nice if they want to survive, as well as this model being just so far ahead of anything else currently on the market, I think you should try to squeeze out as much usage in those plans as you can right now. To get a good feel for the model, to see where it succeeds, to see where it fails.

So instead of spending a thousand dollars on a PR that you can't merge, you spend $200 on 10 PRs, three of which you can merge. The economics of software development have changed on a fundamental level as a result of this drop, and I wanna make sure you guys know that even despite being an Anthropic hater, this is worth paying attention to.
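The arithmetic behind that comparison, as a small sketch. The figures are the ones from the example above; the function name is just for illustration.

```python
def cost_per_merged_pr(total_spend: float, merged: int) -> float:
    """Dollars spent per PR that actually got merged."""
    if merged == 0:
        return float("inf")  # money spent, nothing mergeable to show for it
    return total_spend / merged


usage_based = cost_per_merged_pr(1_000, 0)   # $1,000 on one PR you can't merge
subscription = cost_per_merged_pr(200, 3)    # $200 on 10 PRs, 3 of them mergeable

print(usage_based)              # inf
print(round(subscription, 2))   # 66.67
```

The seven failed PRs are baked into the $66.67 figure, which is the point: cheap attempts make a 30% hit rate affordable.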

Things are changing. They are accelerating, and the market's going to catch up in crazy ways.

I expect pricing for models this smart to go down as competition ramps up. I expect Google to do jack fucking shit as they always do. Things are changing.

And I've never felt more like my video about the ceiling was just entirely wrong. It's crazy that this just exists now, and that I can open up my terminal and rewrite any piece of software in any language with a pretty high success rate.

Go play with this. Push it to its limits. Let me know if you think I'm insane.

Let me know what you're able to build with it. And most importantly, try to stay excited. I know this is a huge change for our field and the thing that we have grown up doing for years, if not decades for many of us.

But in many ways, it's more exciting because you can do things that just didn't make sense before. I'm having a ton of fun with that, and I hope you can have some fun with it too. So until next time, peace nerds.

The Hook

The bait, then the rug-pull.

Twenty-four hours. Three thousand dollars in inference. Two $200 accounts burned through their five-hour session limits simultaneously. That's the data behind this video — not a spec sheet review, but a working developer's reckoning with a model that might actually be different.

Frameworks

Named ideas worth stealing.

13:09 model

Claude Code session limit math

1 session = ~25% of weekly limit
4 sessions = weekly cap
Usage-based: ~$100/8 minutes on heavy agentic work

The session and weekly limits on $200/month plans are more restrictive than they appear. You can exhaust the weekly limit in four full heavy sessions.
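The numbers above can be worked through directly. This is a rough sketch using only the approximate figures listed in this framework and the $200/month plan price; real limits are not published in this form, so treat the outputs as order-of-magnitude.

```python
# Approximate figures from the framework above.
SESSION_SHARE_OF_WEEK = 0.25   # 1 session = ~25% of the weekly limit
BURN_DOLLARS = 100             # ~$100 ...
BURN_MINUTES = 8               # ... per 8 minutes of heavy agentic work
PLAN_PRICE = 200               # $200/month plan

sessions_to_weekly_cap = 1 / SESSION_SHARE_OF_WEEK       # 4.0 sessions
dollars_per_minute = BURN_DOLLARS / BURN_MINUTES          # 12.5
dollars_per_hour = dollars_per_minute * 60                # 750.0
minutes_to_match_plan = PLAN_PRICE / dollars_per_minute   # 16.0

print(sessions_to_weekly_cap, dollars_per_hour, minutes_to_match_plan)
```

At that burn rate, 16 minutes of usage-based billing costs as much as a full month of the subscription.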

Steal for: Any content about Claude Code pricing or capacity planning

16:00 model

Hidden intervention tiers

Transparent fallback: cybersecurity/biology/chemistry → routes to Opus 4.8
Silent degradation: frontier LLM development → prompt modification + steering vectors + PEFT
Claimed impact: 0.03% of traffic, <0.1% of organizations

Anthropic has two tiers of safety intervention. The transparent one tells you it switched models. The silent one makes the model worse without disclosure.
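The two tiers can be written down as a small lookup table. This is only a restatement of the summary above as a data structure, not a description of how Anthropic implements routing; the `Tier` enum, the topic strings, and both helper functions are hypothetical.

```python
from enum import Enum


class Tier(Enum):
    TRANSPARENT_FALLBACK = "routes to Opus 4.8 and says so"
    SILENT_DEGRADATION = "stays on Fable, weakened without notice"
    NONE = "no intervention"


# Topics as listed in the summary above; the mapping itself is illustrative.
TOPIC_TIERS = {
    "cybersecurity": Tier.TRANSPARENT_FALLBACK,
    "biology": Tier.TRANSPARENT_FALLBACK,
    "chemistry": Tier.TRANSPARENT_FALLBACK,
    "frontier llm development": Tier.SILENT_DEGRADATION,
}


def tier_for(topic: str) -> Tier:
    """Look up the intervention tier for a topic, defaulting to no intervention."""
    return TOPIC_TIERS.get(topic.lower(), Tier.NONE)


def user_is_told(tier: Tier) -> bool:
    """Only the silent tier changes behavior without any disclosure."""
    return tier is not Tier.SILENT_DEGRADATION


print(tier_for("Chemistry").name, user_is_told(tier_for("Chemistry")))
print(tier_for("Frontier LLM development").name,
      user_is_told(tier_for("Frontier LLM development")))
```

Seen this way, the complaint is narrow: the second tier is the only row where you are billed for Fable and not told what you got.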

Steal for: Any content discussing AI transparency or model behavior

CTA Breakdown

How they asked for the click.

VERBAL ASK

32:15 next-video

“Go play with this. Push it to its limits. Let me know if you think I'm insane. Let me know what you're able to build with it.”

Organic, no hard sell. Ends on community engagement — share your builds.

MENTIONED ON CAMERA

01:43 tool: Blacksmith CI

FROM THE DESCRIPTION

PRIMARY CTA: Where the creator wants you to go next.

Thank you Blacksmith for sponsoring! Check them out at ↗


Storyboard

Visual structure at a glance.

00:00 hook: open
01:43 sponsor: sponsor
03:28 value: benchmarks
08:29 value: demo
11:50 hook: $100/8min
16:10 value: interventions
19:04 value: demos
27:02 cta: how-to-use
32:15 cta: close
