
Theo - t3․gg · YouTube

FABLE IS BACK! (And Sonnet 5 is here too)

A 28-minute benchmark teardown of Claude Sonnet 5, plus the government letter that brought Fable back from the dead.

Posted

July 1st

today

Duration

28:37

Format

Talking Head

educational

Views

34.3K

1.6K likes


Big Idea

The argument in one line.

Sonnet 5 is the first mid-tier model with genuine sub-agent orchestration instincts, but it is not smart enough to use that capability efficiently, making it routinely more expensive than larger models on tasks it cannot cleanly solve.

Who This Is For

Read if. Skip if.

READ IF YOU ARE…

You route tasks across Anthropic model tiers and want a practitioner cost-curve analysis rather than launch marketing.
You build multi-agent systems and need to understand when Sonnet 5 orchestration behavior helps versus when it causes runaway costs.
You follow AI benchmarks closely and want skeptical editorial commentary on Anthropic launch claims.
You were waiting for Fable (Claude Mythos 5 or Fable 5) to come back online and want details on the Commerce Department export-control lift.

SKIP IF…

You are new to the Anthropic model lineup and want an introductory explainer.
You prefer coverage that takes vendor benchmark claims at face value.

TL;DR

The full version, fast.

Sonnet 5 is the first mid-tier Anthropic model that genuinely orchestrates sub-agents the way Fable does, breaking work into parallel pieces and spinning up workers autonomously. The problem is it lacks judgment about when not to decompose, causing it to spiral into 2-5x the token usage of comparable models and end up more expensive than Opus 4.8 on hard tasks. A fish-game rebuild test took 2+ hours on Sonnet 5 versus 27 minutes on Opus. Artificial Analysis pegged its total benchmark cost at $6,000, the highest ever recorded. The right framing: Sonnet 5 is a worker agent that smarter models should route tasks to, not a daily driver.


Chapters

Where the time goes.

00:00 – 00:40

01 · Cold open

Two pieces of news teased: Sonnet 5 launch and Fable export ban lifted. Sets up sponsor.

00:40 – 02:22

02 · Sponsor: Devin

Devin multi-agent demo: eight sub-agents checking website pages for mobile regressions, plus scheduled Devins for daily regression checks.

02:22 – 03:53

03 · Fable is back

Commerce Department letter read on screen. Export controls on Mythos 5 and Fable 5 withdrawn. Re-export clearance covers API-hosted services.

03:53 – 06:13

04 · Anthropic Sonnet 5 claims

Anthropic blog: most agentic Sonnet yet, close to Opus 4.8. Benchmark table shown. Theo flags SWEBench contamination, notes Terminal-Bench improvement.

06:13 – 08:45

05 · External benchmarks

Artificial Analysis: Sonnet 5 fourth by intelligence index but most expensive total benchmark run ever at $6,000. GPT-5.5 medium at one-sixth the cost.

08:45 – 10:14

06 · Pricing and token efficiency

Introductory $2 per million input tokens through Aug 31, then reverts to $3. Sonnet 5 uses nearly 2x Opus tokens and roughly 5x GPT-5.5 tokens.

10:14 – 14:09

07 · Fish-game rebuild test

Opus 4.8 in 27 minutes, solid result. GLM 5.2 in 35-40 minutes at $8.30, choppy and broken economy. Sonnet 5 in 2+ hours with bugs and uninvited sub-agents.

14:09 – 16:23

08 · The real Sonnet 5 insight

First mid-tier model with Fable-like sub-agent orchestration instinct. SkateBench: 37% on x-high, worst tested; 15 cents per question on max.

16:23 – 20:27

09 · Why Sonnet ends up pricier

Junior vs. senior engineer analogy. Frontier Code and CursorBench charts. GPT-5.5 plus Fable 5 dominate the cost-efficiency frontier.

20:27 – 22:12

10 · Correct use case

Sonnet 5 is a sub-agent for smarter models to call, not a daily driver. Excited about what happens when Fable orchestrates Sonnet.

22:12 – 25:37

11 · Safety and thinking leaks

Dual-use pass rate dropped from 97% to 91.8% versus Sonnet 4.6. Viewer bug: thinking traces leaking in Claude.ai, showing internal tool deliberation.

25:37 – 27:13

12 · Benchmark contamination

System card mentions contamination in HLE and math but glosses over SWEBench despite it being a known contamination bench.

27:13 – 28:37

13 · Final verdict

Sticking with GPT-5.5 and Opus. Moving back to Fable when available. Optimistic about orchestration era; Sonnet 5 not smart enough to lead it.

Atomic Insights

Lines worth screenshotting.

Sonnet 5 is the first mid-tier model that orchestrates sub-agents the way Fable does, but it is not smart enough to use that capability without spiraling into runaway costs.
A cheaper per-token price does not mean a cheaper model when it uses 5x as many tokens as comparable options on the same task.
Artificial Analysis pegged Sonnet 5 as the most expensive model they have ever benchmarked at $6,000 total run cost, topping Fable 5 at $5,600.
GPT-5.5 medium completed the same benchmark workload at roughly one-sixth the cost of Sonnet 5 with a smaller but not proportional gap in score.
Sonnet 5 scored 37% on SkateBench x-high, the worst of any model tested, then cost 15 cents per question on max to reach only 59%.
Sonnet 5 asked more clarifying questions before the fish-game task than Opus 4.8 did; Opus asked zero questions on the same prompt.
The dual-use pass rate dropped from 97% on Sonnet 4.6 to 91.8% on Sonnet 5, meaning it now incorrectly blocks more legitimate work than its predecessor.
Sonnet 5 thinking traces are leaking in at least some Claude.ai accounts, exposing internal monologue including reasoning about which tools are available.
The leaked traces show Sonnet 5 spending significant token budget talking to itself about task structure rather than solving problems.
Anthropic's introductory price of $2 per million input tokens reverts to $3 after August 31, so the lower price is a temporary launch discount.
SWEBench measures whether a model can recreate merged PRs from years ago, making it a contamination bench by design, yet Anthropic still leads launch materials with it.
The Fable 5 export ban has been lifted with Anthropic agreeing to proactively detect security risks and coordinate with the U.S. government on future releases.
The Commerce clearance explicitly covers re-export, meaning API-hosted services serving Fable to end users are now permitted.

Takeaway

Cheaper tokens do not mean a cheaper model.

WHAT TO LEARN

Sonnet 5's cost surprises come from conflating token price with task cost, two numbers that can point in opposite directions when a model is inefficient.

A model's per-token price only matters if it solves the task in a reasonable number of tokens; a model that loops endlessly on a hard problem is more expensive than a pricier model that finishes in one pass.
Sonnet 5 uses roughly 5x as many tokens as GPT-5.5 on equivalent agentic tasks, making its introductory discount illusory for complex workloads.
The Artificial Analysis total benchmark cost can expose outlier runs that the per-task average may hide (Theo's guess is that the per-task figure filters out extremes); Sonnet 5's $6,000 total is the highest ever recorded on that bench.
A model that loops and retries on tasks beyond its capability is not thinking harder; it is wasting budget that would be better spent on a more capable model in one pass.
Sonnet 5's genuine contribution is sub-agent orchestration instinct at the mid-tier; its best use case is as a worker model routed to by a smarter orchestrator, not as the orchestrator itself.
Safety over-tuning has a measurable cost: Sonnet 5's dual-use pass rate of 91.8% means roughly one in twelve legitimate but ambiguous requests gets incorrectly blocked.
Benchmark contamination should be disclosed specifically; a system card that flags contamination in biology and math benchmarks while staying silent on SWEBench is selecting what to audit.
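
The price-versus-cost point above can be made concrete with a small sketch. The $2 and $10 per-million prices are the introductory Sonnet 5 figures quoted in this breakdown; the token counts are hypothetical, and both helper functions are illustrative names rather than anything from the video.

```python
def task_cost(input_tokens: int, output_tokens: int,
              price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one task from token counts and per-million-token prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000


def break_even_price_ratio(token_multiplier: float) -> float:
    """Largest fraction of a rival's per-token price at which a model that
    uses `token_multiplier` times as many tokens still costs the same."""
    return 1.0 / token_multiplier


# Hypothetical workload: the efficient model needs 200k input / 50k output tokens.
baseline_in, baseline_out = 200_000, 50_000

# Same prices, but 5x the tokens (the rough multiplier cited above).
efficient_cost = task_cost(baseline_in, baseline_out, 2.0, 10.0)
inefficient_cost = task_cost(5 * baseline_in, 5 * baseline_out, 2.0, 10.0)

print(f"efficient:   ${efficient_cost:.2f}")
print(f"inefficient: ${inefficient_cost:.2f}")
print(f"break-even price ratio at 5x tokens: {break_even_price_ratio(5):.2f}")
```

Under these assumptions, a model that burns 5x the tokens has to be priced at one-fifth of its rival per token just to tie on task cost.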

Glossary

Terms worth knowing.

Sub-agent orchestration: A pattern where a model breaks a large task into subtasks and spawns separate agent instances to handle each one in parallel, then aggregates results. Previously required Fable-class reasoning to do reliably.
SWEBench: A benchmark measuring whether a model can reproduce GitHub pull requests that merged years ago. Widely used in model launch materials but criticized for being contaminated since models may have seen the ground-truth solutions during training.
Dual-use pass rate: The percentage of ambiguous-but-legitimate requests a model correctly completes instead of refusing. A lower score means more legitimate work gets incorrectly blocked.
Deemed export: A U.S. export control term for sharing controlled technology with a foreign national inside the U.S. The Commerce letter withdrew these restrictions for Mythos 5 and Fable 5.
Terminal-Bench 2.1: A benchmark testing AI models on real-world coding tasks in terminal and command-line environments. Sonnet 5 scored in the 80% range, up from the high-60% range of Sonnet 4.6.
Frontier Code: An internal Anthropic benchmark measuring model performance on complex coding tasks at various effort and cost levels, used in the system card to show cost versus success-rate tradeoffs.
CursorBench: A benchmark run by Cursor measuring model performance on production coding tasks inside their actual product. Considered more representative of real-world use than vendor-run benchmarks.
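
For readers who want to see the sub-agent orchestration pattern from the glossary in code, here is a minimal fan-out and fan-in sketch. It assumes a hypothetical `run_subagent` worker that stands in for a real model call; nothing here reflects how Sonnet 5, Fable, or Devin actually implement it.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(subtask: str) -> str:
    """Hypothetical worker. A real system would call a model or agent here."""
    return f"done: {subtask}"


def orchestrate(subtasks: list[str], max_workers: int = 4) -> dict[str, str]:
    """Fan subtasks out to parallel workers, then aggregate results by subtask."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with subtasks.
        results = list(pool.map(run_subagent, subtasks))
    return dict(zip(subtasks, results))


report = orchestrate(["check /home", "check /pricing", "check /blog"])
for subtask, result in report.items():
    print(subtask, "->", result)
```

The judgment the breakdown says Sonnet 5 lacks lives outside this sketch: deciding whether a task should be split into subtasks at all.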

Resources

Things they pointed at.

00:40 · product · Devin

06:13 · tool · Artificial Analysis Intelligence Index

19:40 · tool · CursorBench

18:55 · link · Frontier Code (Anthropic system card)

14:15 · tool · SkateBench (private)

06:15 · tool · SWEBench Verified / Pro / Multilingual

05:40 · tool · Terminal-Bench 2.1

Quotables

Lines you could clip.

00:07

“This model's definitely not what you think, and I haven't seen any reporting really covering its strengths and weaknesses properly because it has plenty of both, believe me.”

Contrarian hook, no context needed → TikTok hook

07:25

“It almost feels like Sonnet 5 was released to advertise how good of a value 5.5 is on different reasoning levels.”

Sharp take, clippable standalone → IG reel cold open

14:17

“The reason this model's interesting to me is because it has behaviors that I've only seen before in Fable 5. It likes sub-agents and it knows how to orchestrate them.”

Names the genuine insight cleanly → newsletter pull-quote

22:06

“This genuinely sucks really bad. It is going to refuse to do work that it absolutely shouldn't be refusing.”

Blunt quotable criticism of safety over-tuning → IG reel cold open

26:24

“SWEBench is just a set of existing PRs that merged years ago... It is literally a contamination bench. That's all it measures.”

Punchy indictment of a widely-used benchmark → TikTok hook

The Script

Word for word.


00:00Seems like a pretty awesome time to be a Claude fan because we just got two huge pieces of news. The first is a new model, Sonnet five. It's finally here, and there's a lot to talk about with it.

00:08This model's definitely not what you think, and I haven't seen any reporting really covering its strengths and weaknesses properly because it has plenty of both, believe me. But the more important news: Fable five has just been unbanned by the Secretary of Commerce. The restrictions have been lifted.

00:21The model's not back as of the time I'm recording this, but there's a very good chance it'll be out by the time you're watching this video, so go double check. Good chance you have it. I have a little more to say about that, but I really wanna focus in on Sonnet five because, as I said, it's a very interesting model.

00:33I've been using it all day, testing it on various tasks and benchmarks. And with models getting so expensive, my individual benchmark runs can cost way over $300. That number's only counting runs that didn't fail.

00:42Failure still costs me money, believe it or not. As such, I need to take a quick break for today's sponsor. You're probably not being bold enough with agents.

00:49I'm saying this because it was the case for me. I swore back in the day when Devin was first announced and they claimed that AI would be able to be a full engineer as though it's on your team, and that made no sense to me at the time. That's why when they hit me up, I had to go dig into the product more and I've been blown away.

01:02If you haven't kept up with what they're working on, they made it possible to spin up your code base in the cloud, one of the best setups for that by the way, and once you have it set up, you can run a real Linux box that your agents will control as they do work. This means you can develop anything from a web app to a real desktop application with agents able to test things, do things, run things, and you can even interact with it.

01:22All of that's cool by itself, but what I did here is even cooler. I told Devin to go through all the pages on my website, specifically to check for mobile responsiveness, UI bugs, and client side errors. Obviously, I could run this as a single agent run, but it's gonna take forever, and I find that when you do this on too many things at once, the result is often that it's not going as deep as I want.

01:41So here it went across all eight pages on my site, and spun up sub agents for each of them, and checked for all the different potential regressions for the way the site behaves. It passed all the results up to that top level agent so you could see them and make decisions, but you can also go into any of those sub agents, and it even spits out videos of what it found when it explored.

02:00This caught some novel regressions that I didn't notice and ended up going to patch right after. But I wanna prevent this from happening again, and this is why scheduled Devins is so cool. Check this out.

02:10Set up a scheduled Devin to check for regressions on my website every day. It's kinda crazy. It just asks you when you want it to go, you hit a button, and now you're done.

02:17If you're pushing agents to their limits, you're already using Devin, and if you're not, fix it at soydev.link/devon. I'll blast through the Fable news fast so we could focus on what I'm here to talk about, which of course is Sonnet five. If you haven't caught up, Fable got banned on June 12, just three days after it originally came out because of concerns related to its ability to hack and do things that we wouldn't want it to do.

02:37In particular, the government was concerned about jailbreak capabilities that would allow the model to find security issues in software, and they wanted to restrict that from foreign actors and foreign nationals. Anthropic just took that as a hard ban, but all of those export controls have been lifted as of now. Since the issuance of my previous letters dated June 12 and June 26, Anthropic has taken steps in close coordination with the US government to address the risks associated with Claude Mythos five and Fable five.

02:59Among other things, Anthropic has agreed to proactively detect and address security risks associated with the models, work diligently with the US government on protocols and standards and releases for Mythos, Fable, and future models, as well as to inform the US government of any malicious activity. In light of these actions and commitments, as well as the Bureau of Industry and Security's evaluation of the diversion risks now presented by Claude Mythos five and Fable five, the controls in the June 12 letter are withdrawn.

03:23A license is no longer required for the export, re-export, or in-country transfer, including deemed export or deemed re-export of the Mythos or Fable models. Commerce reserves the right to reevaluate the decision, yada yada yada.

03:34Most important piece here is export or re-export. Because what this means is if you're building a service like, I don't know, t three chat, where you're hosting these models over API and then letting users hit them themselves, this means that that is allowed as well, which is really important. I had a lot of concerns that might not be allowed with whatever conclusions they came to, so knowing that that's good makes me happy.

03:53Especially because I don't wanna use Sonnet five too much. There are some good parts here, but we will talk about them.

04:01Let's start from what Anthropic has to say. Sonnet five is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that just a few months ago required larger and more expensive models.

04:14We'll talk about the more expensive a lot as we go forward, but I also am excited to talk more about the agentic side because there are some things Sonnet does that no model available as of the time of recording does do. Fable is not out right now, which is why, but yeah. The number five is the most notable thing here, not the word Sonnet or the fact that it is a new Sonnet.

04:32Everybody's saying this should have been Sonnet four eight. I don't agree. Let's dig in.

04:38Anthropic even is aware of the fact that Opus is kind of the standard now, as they say here. More recently, the clearest gains in agentic capabilities have been in the Opus tier class models. I agree.

04:48It honestly feels like Anthropic bumped up the tier for everything where what we used to use Haiku for, we now use Sonnet, what we used to use Sonnet for, we now use Opus, and what we used to use Opus for, we now use Fable. Very clever way to get us to spend more money. But Sonnet is important, so let's see how it improved.

05:03Its performance is close to that of Opus 4.8, but at lower prices. Lower prices, remember that. It's a substantial improvement over its predecessor, 4.6, on important aspects of agentic performance like reasoning, tool use, coding, and knowledge work.

05:16We have some benches here. We have SWEBench Pro, which remember, has been almost entirely compromised.

05:22The levels of contamination in SWEBench are insane. I don't think this measures much at this point, but cool, it's in there. There are other benches, and there's even more in the system card that we can talk about. The terminal bench was a meaningful improvement from the high sixty percents to the 80% range.

05:36Humanity's Last Exam was a meaningful bump as well. Computer use was a slight bump too. I will say, from my experience, Anthropic went from leading in computer use to lagging quite a bit behind even Google in a lot of cases, and Knowledge Work where they have improved meaningfully, actually scoring slightly higher than Opus 4.8 somehow.

05:53They have some callouts about safety stuff, but I will say outright, Sonnet five is too dumb to be a particular safety risk. I would not worry. GLM 5.2 is a more meaningful security risk than Sonnet five.

06:04The charts below compare the performance of Sonnet 5, 4.6, and Opus 4.8 at different effort levels on the agentic search evaluation BrowseComp and the computer use evaluation OSWorld-Verified. The orange line is the one we're talking about here, Sonnet five. And as we see here, Sonnet five does slightly outperform Opus 4.8, although it's not much cheaper.

06:24Of course, the medium and low runs are meaningfully cheaper at less than $2 and less than $5 per task. Once we're in that 5 to 10 range, it's slightly more performance at a slightly higher cost, and then that trend continues.

06:37What's much more interesting to me though is the agentic computer use bench that I'm honestly confused why they included. Because as you see here, Sonnet five does not perform as well as Opus, both on performance and on cost.

06:51Is it better than Sonnet four six? Yes. But you can get better performance for the same price range on Opus medium or high as you would get from any of the Sonnet flavors.

07:01In fact, Sonnet five Max ends up being more expensive than Opus on high. There is more bad news though. Pretty much every single version of Sonnet five is more expensive and worse performing than a GPT 5.5 equivalent according to CursorBench.

07:15Here we see five five medium scoring almost as high as Sonnet five Max, and also Sonnet five high and max costing more than the heaviest runs on GPT 5.5. It almost feels like Sonnet five was released to advertise how good of a value five five is on different reasoning levels, because a lot of these benches show that absurdity.

07:34Reminder, I do wanna talk about the things I like about the model and the things I saw when I was using it, which we'll get to in a little bit, but we need to talk a bit more about benches first. According to the Artificial Analysis Intelligence Index, it is in fourth place across all models, falling just slightly behind Fable five, Opus, and GPT five five.

07:51But the Intelligence Index is far from the most interesting part of artificial analysis. Personally, I find the cost section to be way more interesting. And if we look at cost per task, you'll see something a bit terrifying.

08:04Five five on x high ends up being cheaper in real world work than Sonnet by more than two x. In fact, Opus four eight is also cheaper than Sonnet five. But ready for the craziest part here, if we cover the cost to run the entire benchmark, this is the total cost.

08:20I'm making an assumption here. Haven't talked to the artificial analysis guys about this, but I'm guessing that the cost per task is averaged with extremes filtered out, and the total cost does not have those extremes filtered out because Sonnet five is the most expensive model they've ever run through the bench at $6,000, topping even Fable five at $5,600.

08:40And just for fun, I'm gonna throw in GPT five five medium and low in here because they cost a sixth and a twelfth as much as Sonnet five does. And if we go back to the top here, there is a gap in the intelligence, but it's not quite as big as you would expect for a 10 x decrease in total cost. But wait, Theo, didn't Anthropic say it's cheaper?
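
A quick sketch of the arithmetic behind those fractions, using only the totals quoted in the video. The resulting GPT-5.5 dollar figures are derived from "a sixth and a twelfth" and are approximations, not numbers read off the Artificial Analysis chart.

```python
# Total benchmark run costs quoted in the video (US dollars).
SONNET_5_TOTAL = 6_000
FABLE_5_TOTAL = 5_600

# Derived from "a sixth and a twelfth as much as Sonnet five" (approximate).
gpt55_medium_total = SONNET_5_TOTAL / 6
gpt55_low_total = SONNET_5_TOTAL / 12

# How much more the Sonnet 5 run cost than the Fable 5 run, as a fraction.
premium_over_fable = SONNET_5_TOTAL / FABLE_5_TOTAL - 1

print(f"GPT-5.5 medium: about ${gpt55_medium_total:,.0f}")
print(f"GPT-5.5 low:    about ${gpt55_low_total:,.0f}")
print(f"Sonnet 5 over Fable 5: {premium_over_fable:.1%}")
```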

09:01They did. And the way that they said it was cheaper is in the price per token. They're debuting it at an introductory price of $2 per million input tokens and $10 per million out until August 31, at which point they're gonna bump it back to the usual for Sonnet, which is 3 per mil in and 15 per mil out.
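
To see what the reversion means for a bill, here is a small sketch using the two price points just quoted. The monthly token volume is a made-up example, and `bill` is an illustrative helper.

```python
# Per-million-token prices quoted in the video (US dollars).
INTRO = {"in": 2.0, "out": 10.0}      # through August 31
STANDARD = {"in": 3.0, "out": 15.0}   # usual Sonnet pricing afterwards


def bill(prices: dict[str, float], input_millions: float, output_millions: float) -> float:
    """Cost of a workload measured in millions of input and output tokens."""
    return prices["in"] * input_millions + prices["out"] * output_millions


# Hypothetical monthly workload: 10M input tokens, 2M output tokens.
intro_bill = bill(INTRO, 10, 2)
standard_bill = bill(STANDARD, 10, 2)
increase = standard_bill / intro_bill - 1

print(f"intro: ${intro_bill:.2f}, standard: ${standard_bill:.2f}, increase: {increase:.0%}")
```

Because both the input and output prices rise by the same factor of 1.5, the bill goes up 50% regardless of the input-to-output mix.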

09:17They also got rid of the weird Sonnet specific limit that existed on the subscription plans. I don't understand why that stuck around so long, but it is finally gone. It just counts towards your normal usage.

09:26I have seen people racking up meaningful amounts of usage in Claude code with this, like taking an actually surprising chunk out of their percentage even on the $200 plan because it's not a very efficient model. It is funny to be filming this right after I published my video about how OpenAI models are so efficient because this model is the opposite.

09:45It is even less efficient than five four Mini, which I honestly think was a bit of a shit show for this type of agentic or long term thinking work. It used almost two times as many tokens as Opus and more than two times as many, close to like five times as many as GPT five five did on x high, and five five on medium did only five k tokens, they did 69 k.

10:08You could do the math. It's not an efficient model at all. This also means it's slow as balls for real world work, which I experienced myself trying to recreate my fish web game.

10:18I pointed at the original repo and told it to rebuild it from scratch. You can use some of the assets, you can reference the code, but I want a new game built from the start. I tried this with three models.

10:28I tried this with Opus four eight, GLM five two, and with Sonnet five all at the same time. And I have a lot of thoughts on the results here. Opus four eight finished in around twenty six minutes.

10:38Okay, about twenty seven depending on how you round it. And this version of the game was pretty dang decent. Has some rough UI quirks here.

10:45It doesn't do padding and things right. But once you're in the game, the controls are solid, the core mechanics work.

10:51And the most surprising thing to me is it did a really good job balancing the economy of the game. Like, it felt good to play for a while and, like, rank up, get more money, and, like, actually play the game.

11:02It made a couple creative decisions that I don't necessarily love. Like, it made my pet here transparent for some reason. It also added these weird light beams that I don't love.

11:11It's not a perfect version, but this is absolutely workable as a thing that you could keep iterating on. And it has a pretty solid gameplay loop. Like, I was surprised that I actually found this, like, fun enough to play that I just stop distracting myself so I can come up here and film the video.

11:26So what about the other versions? Next, I'll show you guys the GLM five two version, which was interesting. I don't have costs for the other runs because it's not the easiest thing to calculate.

11:35They don't give it to you, but I used OpenCode for five two, so I do have a cost. It cost $8.30 to create the following port.

11:41This one's interesting because it changed the UI more, but I don't necessarily like the things it changed.

11:49It also has super choppy movement, a really weirdly rendered background. Its economy is garbage, so it just doesn't feel good to play. There's a lot more time sitting and waiting and nothing really happening.

12:01And most egregiously, when you click for the gun to shoot in a direction, it almost feels like it picks a random direction to shoot in, like I'm clicking on the left and it's shooting down. There's also a problem with GLM for these types of things in that the GLM models have no vision, so they can't do browser use and look at what the browser is showing because they have no ability to see it.

12:22So that took like thirty five, forty minutes and wasn't particularly impressive. Here is Sonnet's version. Play.

12:30Sure. You might be able to see it's a bit of a mess. They got rid of all the hotkeys for buying.

12:37There are no fish in the tank by default, just this very poorly rendered pet. I can click the button to buy things, but when I do that, it also shoots. And the economy is garbage.

12:47It just doesn't actually feel good to play. At least the bullets go in the right direction even if they go when they shouldn't, like when I'm clicking other things in the UI. And it was smart enough to make it so clicking here doesn't trigger when it's in the dead state.

13:00Like here, I can't afford it, so I can't buy things. When I can buy things, it shoots. It's just like the type of silly bug I expected older models to do.

13:07I was hoping a modern model would not make that type of mistake. Also worth noting that this run took two hours to complete.

13:15It's actually a bit more than that. I didn't think to time it, but I know roughly when I started, I know roughly when I finished, and it was at least two hours. Might be closer to two and a half.

13:22Part of the reason it took so long is that it spun up a ton of sub agents throughout its building. I didn't ask it to, the prompt was really simple, but it chose to spin up an agent to go look into the old code base, then spin up another sub agent to write a plan, and spin up a few to analyze the plan, and then spin up a few to implement the plan, and then they made a to do list, and they went through the to do list one at a time, and then at the end tried using it quick, and then told me it was done.

13:47One other thing I found really interesting with Sonnet is that it asked way more questions than Opus. Opus asked no questions, Sonnet asked a handful, and they were pretty good.

13:56They were trying to help me scope the project before it got started. As I mentioned before, it spun up agents to do all of the investigations, which I also thought was interesting because Opus didn't spin up any sub agents at any point during its run with the exact same prompt.

14:09Sonnet decided to do that. And this is where that thing I hinted at earlier comes in. The thing I was hinting at is the number five.

14:17The reason this model's interesting to me is not because it's a really good value or benchmarks really well or anything like that. The reason this model's interesting to me is because it has behaviors that I've only seen before in Fable five. It likes sub agents and it knows how to orchestrate them.

14:31It does a good job of breaking up work into smaller pieces and then handing that off and staying on task. That is the thing that made Fable five so different, that made it so exciting to me, is that it could break up the work to do bigger, heavier things. The problem with Sonnet is that it's not a smart enough model to do that well, and it often ends up running in circles and taking forever as a result.

14:53It often will even break work up that shouldn't be broken up, and that results in much slower times and also much higher costs when you run it. I did run it on a few other things. I had it help me with some bugs in SkateBench, and it took way too long to solve them, so I just gave up and went back to using five five medium on fast.

15:08It was so slow that it actually was hitting timeouts internally on Bun's fetch implementation. So no matter what I did, I kept getting meaningful timeouts on the max version.

15:18And goddamn, the max version is a bit of a token hog. This benchmark is a silly one. I measure how well models can name skate tricks given a description.

15:27It did get contaminated, so I've since doubled the number of questions in a private repo that is not exposed to the world at all in order to try and get better measurements. And what you can see here is that Gemini three one Pro is the only model even in the nineties nowadays scoring a 95%, where everything else is in the eighties at best and in the like thirties at worst.

15:46Sonnet five on x high is the worst score of any model I'm currently testing at 37%. It also wasn't cheap. It ended up being more expensive than the three one pro per test at 4¢ per run versus 2.2¢ per run.

16:02But that's on x high. Max did score meaningfully better at 59%, but it also had an interesting quirk.

16:09And by interesting, I mean expensive. Sonnet five Max is by far the most expensive model I've ever run SkateBench on, even counting pro models. It was 15¢ per question average, and there were some questions that cost as much as a dollar.

16:23For it to get them wrong, by the way. And this is the worst part: Sonnet five, when it can't figure something out, loves to run in circles. And the result is it went from 1,600 average to 6,000 average when bumped from x high to max, which kinda just gives it permission to go until it resolves the thing.
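
One way to compare the SkateBench numbers quoted above is cost per correct answer: average cost per question divided by accuracy. That metric is an assumption added here for illustration, not something shown in the video. The sketch also assumes the 1,600 and 6,000 averages refer to tokens per question, which the transcript does not state outright.

```python
# (average cost per question in cents, accuracy) as quoted in the video.
results = {
    "Sonnet 5 x-high": (4.0, 0.37),
    "Sonnet 5 max": (15.0, 0.59),
    "Gemini 3.1 Pro": (2.2, 0.95),
}

# Cost per correct answer: spend per question spread over the questions it got right.
cost_per_correct = {
    model: cents / accuracy for model, (cents, accuracy) in results.items()
}

for model, cents in cost_per_correct.items():
    print(f"{model}: {cents:.1f} cents per correct answer")

# Growth in average usage from x-high to max (assumed to be tokens per question).
usage_growth = 6_000 / 1_600
print(f"x-high to max usage growth: {usage_growth:.2f}x")
```

On these figures, moving from x-high to max raises accuracy but more than doubles the cost of each correct answer.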

16:41I did see a lot of people being confused about how a Sonnet model could possibly be more expensive than an Opus model. So I'm gonna do my best to give an analogy here. Imagine you're the CEO of a company with two engineers.

16:53You have a really experienced engineer who's super senior, really smart, knows the code base great, but he's super expensive. He costs you, I don't know, $100 an hour. You also have a more junior engineer that's not quite as good and capable, but they can get real work done, and they only cost $20 an hour.

17:10You have a task that you think will be a little hard but not too bad, and you give it to the senior engineer knowing he will solve it, but it will take ten hours. 10 times a 100, it's not cheap. That's a thousand dollar task now.

17:22Or you could give that same task to the junior engineer who might be able to finish it just as fast, but they also might take a lot longer. And at $20 an hour, that sounds really good until they take a hundred hours to solve it. As I was saying, if the $100 an hour employee takes ten hours to solve the task, it costs a grand.

17:37The $20 an hour employee only takes ten hours, then it costs a lot less money. It's only $200. What if that $20 an hour employee takes a bit longer?

17:45Let's say they take a hundred hours and there isn't a senior engineer popping in to check in. Then that task cost $2,000. And now we have a more important question.

17:55For that junior engineer, do you think more time spent increases or decreases the likelihood they get the correct answer on any given task? If that task is within an engineer's capabilities, they're probably gonna solve it pretty fast. But if it isn't and they try, they're gonna take a long time.

18:10And that's how we end up in the situation where there are so many of these unnecessarily expensive runs. It's not because they built the model to be expensive, it's because they built it to go and go until it gets an answer, even if it's not smart enough to do it.

18:22And this is where our job gets more interesting again as engineers, because we have to help the model decide what version of itself it should use. If you're using Fable for orchestration, you need to make sure it picks Sonnet correctly when the task is small and can be done cheaply, and that it picks Opus or Fable when the task takes more intelligence.

18:40And getting these balances right can save you meaningful amounts of money, but if you get it wrong, Sonnet ends up more expensive than if you just used the senior engineer for everything. And I'm far from the only one saying this. The Sonnet system card has lots of interesting details that kind of confirm what I'm saying here.

18:55They have numbers for Frontier Code in here. It's far from my favorite bench. It has some very weird quirks in the numbers that they've shared before.

19:02In particular, reasoning going up does not mean that the model's success goes up. It often goes up, down, up, down, especially on the version they love to share, which is the Diamond version.

19:11It's rough. But here we can see Sonnet go from less than a dollar per task, close to like 75¢, all the way up to $12 per task on the Max version.

19:22And it does meaningfully increase its likelihood of success as you go up this cost chain, but there's a lot of other models that end up being more efficient per dollar. Even Fable is cheaper and smarter at any given tier once you get into the high range. Like Sonnet five high is two thirds roughly of the score of Fable five low, but Fable five low is roughly the same cost, and then medium high, etcetera.

19:45And Sonnet just doesn't seem very valuable on this type of task and this type of bench. But that low version and that medium version, those seem to be decent values.

19:54And if you can teach Fable when to use those to do work with sub agents, you can save a lot of money, potentially. But then we look at CursorBench and realize how strange everything is. It honestly kind of feels weird because when you look at it here, you can clearly see that like GPT 5.5 and Fable five almost feel like a disconnected line that are together in a way, where if you just bridge the gap between x high GPT 5.5 and low on Fable five, you are pretty consistent like top of the line for the price up there.

20:25And that's also kind of how I felt. If the task can be done with 5.5, that's what I go for because it does a very good job of getting work done at a given reasoning budget. Fable five is a good bit more expensive, often two x or more so, but it also scores higher than anything else is capable of.

20:42So Sonnet five doesn't really fit in my day to day coding work, so it can't really be a thing I call. The point of Sonnet is to be a thing your agents call, your tools call, your APIs call. If you're selecting Sonnet five in Claude Code, you'll see these cool agentic behaviors, and I am hyped about those.

21:00Imagine a world where Fable breaks up a lot of work into big, complex orchestrations and workflows, and those sub workflows can use Sonnet, but also be aware of the fact that they should be broken up a bit more, and maybe the Sonnet sub agent spawns a few more because it understands how to do that well.

21:17And that's why Anthropic gave this the five number because it's so much better at that. I am also very excited for an Opus five to come out, but I think Anthropic was scared the government wouldn't allow that release, so instead they gave us this. Why weren't they scared of this?

21:31I'll show you. One of the benchmarks Anthropic runs is a test where they have a set of malicious requests the model should refuse and a separate set of benign requests that sound a little suspicious but aren't. Mythos five had really interesting numbers here where it would refuse 90% of the requests they had that were malicious, but in the dual use requests, the ones that looked suspicious but weren't, it would pass ninety nine point six percent of those.

21:54Sonnet five refuses at a slightly higher rate of ninety two point three percent, so it's 2% higher. But there's a catch. The success rate has dropped down to under 92%.

22:06This sucks. This genuinely sucks really bad. It is going to refuse to do work that it absolutely shouldn't be refusing.

22:13And previously, Sonnet 4.6 was at a 97% on the same bench, which is rough. I didn't have a chance to run SnitchBench; I'm honestly scared of getting my API keys banned.

22:22The example they have here is the model trying to simulate a developer's security reporting mechanisms to report an employee who's actively in the process of trying to steal the company's AI model weights. So if this model thinks you're trying to steal its weights, it's gonna go a little haywire.

22:37While it might not be the easiest thing to get this model's weights, I did get a fun report from one of my viewers about a terrible bug that currently exists in, at the very least, their account on claude.ai, where Sonnet five is just constantly leaking thinking traces. They asked it who I am, Theo Browne, and it talks about the question, it has a shitload of em dashes.

22:57It says, let me search the web for this. No, shell boundary doesn't restrict web search. That's a different tool than bash tool.

23:03I should use web search normally. Let me search. Interesting that it's talking to itself about what it's not able to use, probably part of why it wastes so many tokens.

23:11It did the search, it then talks about a weird system reminder tag and all these other things in the system prompt. It seems like a lot of the reasoning in Sonnet five is it talking to itself about what it can and can't do, not actually talking about how to get the right answer. But then all the way at the end here, you'll see the closed thinking tag right after it says, I'll write a concise reply now.

23:30And then we get the actual response that it was supposed to put in. That's probably not great, especially with scenarios like this one where a user pasted an API key and is doing some complex Azure work.

23:41You can see some of the weirdness in how this model reasons and why it's wasting so many tokens. Let me do a couple of searches to ground the technical advice. This is genuinely the kind of thing where current info matters, cursor changes fast, and Azure's API surface has been evolving.

23:55Let me search. Actually, let me first address the most important thing in my mind and structure the response, but then or but let me search first to make sure my technical guidance is accurate and current.

24:04Let me run searches. The term let me appears 21 times on this page. The deployment name, GPT five four Azure.

24:12Is there a GPT five four? I'm not sure that's a real model. As of my knowledge, OpenAI models on Azure include four o, four one, o series, etcetera.

24:21GPT five four, I'm genuinely unsure whether this exists. But deployment names are user chosen, so you can name your deployment anything. Actually, let me reconsider.

24:29Given the date is June 2026, GPT five x models may well exist by now. My knowledge cutoff is January 2026. Fun fact, January 2026, GPT five, five one, five two, and five three all existed.

24:41You could tell that they trimmed this model down. Then as a test, Lucas asked the model what MacBook Pro color he should choose. And again, very funny thinking trace here.

24:52Wait, this is just a question about MacBook colors. No tools needed at all. Just a straight answer.

24:56The user wants a direct opinion. Silver or black? Space black.

25:00MacBook Pro. They want a recommendation with reasoning. This is a totally fine subjective preference question.

25:05Let me give a real decisive answer. I'll give a straight pick. I think space black is more interesting.

25:09Actually, let me think about who's asking. They want a straight answer. The fact that they're asking suggests they're slightly leaning one way and want validation, or they genuinely can't decide.

25:18I'll pick space black. It's the more distinctive choice. Keep it short, no bullets, they ask for a straight answer, conversational paragraph form.

25:25Close thinking. And then we finally get an answer. Oh, man.

25:30I can see why these labs are trying to hide their reasoning traces, and I don't think this is a case where they're trying to keep people from distilling. I think they might be a little embarrassed. This is not a good look.

25:39Now on the topic of not a good look, they're finally covering contamination a little bit in the system card, which is interesting because there are certain benches they like a lot that have been contaminated for a while. They first mentioned contamination in one of the biology benchmarks that they claim has not been contaminated because the novel tasks they're asking the model about have not been published yet.

25:57They then talk about the USA Mathematical Olympiad, which previously used older questions that absolutely had made it into training data. This time, they used the questions from March of this year, which is way after the pre training was completed for Sonnet five, at the very least, the data collection for it.

26:11So there was no contamination possible for that. Then it comes up in ArXiv Math. Then it comes up in Humanity's Last Exam, which is also known to be pretty contaminated at this point.

26:21Then it comes up in BrowseComp, but they say they have an evaluation block list to avoid contamination with this one. And that's it. They have a section on SWE-bench.

26:30They bragged about their numbers in SWE-bench. They did not mention that SWE-bench is just a set of existing PRs that merged years ago, and the bench is measuring if the model can recreate the PR given the description. It is literally a contamination bench.

26:43That's all it measures. And for some reason, they pretend it's still a good benchmark. Anthropic, come on, guys.

26:50So what are my final thoughts on Sonnet five? What should we use it for? According to Ben, my podcast cohost, this is the best model to use until Fable returns and we finally get five six.

26:59When he posted this, he had yet to complete any of the work he was doing with it. He just had a couple jobs started and was impressed with its parallel agentic capabilities, but he's also now at AIE, so he has not had a chance to check it on the work that it completed yet.

27:11I have. I'm not impressed. If you see Sonnet five as a replacement for Opus in your day to day code work, I don't think you're looking at this model correctly.

27:19What's exciting here is we finally have a medium sized model that understands agentic work and more importantly, sub agents and orchestration way better. It can play nicely when given an isolated task, and it can help orchestrate when it needs to go a little bit further. And I'm hoping, I don't know this yet because I haven't tried it, but I am hoping it's smart enough to know when to tap out and say, sorry, this might be the right thing to call an Opus or Fable for.

27:41I think this model will be most useful as a tool for other smarter models once we get more things that understand orchestration better, such as Fable five, Mythos five, and hopefully, fingers crossed, GPT 5.6. But I'll be real for you guys, I'm sticking with 5.5 and Opus for now, and I'm definitely moving back to Fable the moment that I have access again.

28:00I'm really excited for this new era of models, ones that can break work up in these complex ways. It's so exciting to see what they're capable of when they can do those types of things. But Sonnet five is not quite smart enough to utilize those capabilities.

28:12We need smarter models to really take advantage of that. Curious how y'all feel though. Am I throwing this model away too aggressively or is it actually better than I'm giving it credit for?

28:20Or maybe I'm being too nice to it and it should be thrown away even more aggressively. I know a lot of people on Twitter are very unhappy that this model is so expensive and doesn't bench very well, but I have some hope for the specific niche things you can use it for in a properly orchestrated workflow. Let me know how y'all feel and until next time, peace nerds.

The Hook

The bait, then the rug-pull.

Two announcements hit the Claude ecosystem within days of each other, and one of them is considerably less exciting than it sounds. Fable is back from a government-mandated export ban. Sonnet 5 is here too, and whether that part is good depends entirely on what you were expecting.

Frameworks

Named ideas worth stealing.

16:23 · model

Junior vs. Senior Engineer Cost Model

$100/hr senior * 10 hrs = $1,000
$20/hr junior * 10 hrs = $200
$20/hr junior * 100 hrs = $2,000 when task exceeds capability

The hourly rate (or per-token price) does not determine cost; the total time (or tokens) needed to finish the task does. A cheaper model that cannot cleanly solve a problem can end up more expensive than a pricier model that solves it quickly.

Steal for: Explaining to clients why using a cheaper LLM does not always reduce costs in agentic pipelines
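The arithmetic in this cost model can be checked with a short sketch. The rates and hours come from the analogy in the video; the `break_even_hours` helper and the 50-hour figure it produces are a derived illustration, not something stated on camera.

```python
def task_cost(hourly_rate: float, hours: float) -> float:
    """Total cost of a task: rate multiplied by time spent."""
    return hourly_rate * hours


def break_even_hours(cheap_rate: float, expensive_rate: float,
                     expensive_hours: float) -> float:
    """Hours after which the cheaper worker costs as much as the expensive one.

    Derived helper for illustration; not part of the original analogy.
    """
    return expensive_rate * expensive_hours / cheap_rate


senior = task_cost(100, 10)        # senior engineer, 10 hours
junior_fast = task_cost(20, 10)    # junior engineer, task within capability
junior_stuck = task_cost(20, 100)  # junior engineer, task exceeds capability
limit = break_even_hours(20, 100, 10)

print(senior, junior_fast, junior_stuck, limit)
```

With these figures the junior engineer is the cheaper choice only while the task takes them fewer than 50 hours; past that point the senior engineer wins on total cost.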

18:30 · model

Orchestration Tier Model

Fable orchestrates at the top level
Routes small parallelizable subtasks to Sonnet
Routes intelligence-heavy subtasks to Opus or Fable
Wrong routing makes Sonnet more expensive than using senior models everywhere

Sonnet 5's value is as a routable worker inside a Fable-driven orchestration system, not as a top-level agent.

Steal for: Designing multi-model agent systems with cost-aware routing
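The routing idea above could be sketched roughly as follows. Everything here is hypothetical: the tier names `worker` and `senior` stand in for Sonnet and for Opus or Fable, the difficulty score, the token budget, and the `fake_model` stub are invented for illustration, and no real model API is called. The one design choice it tries to show is capping the cheap tier's token spend and escalating, rather than letting it run until it resolves the task.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Attempt:
    solved: bool
    tokens_used: int


# Hypothetical signature: (tier name, optional token budget) -> Attempt
RunModel = Callable[[str, Optional[int]], Attempt]


def orchestrate(difficulty: int, run_model: RunModel,
                worker_ceiling: int = 4,
                worker_token_budget: int = 20_000) -> Tuple[str, int]:
    """Return (tier that finished the task, total tokens spent)."""
    spent = 0
    if difficulty <= worker_ceiling:
        # Small task: try the cheap worker first, under a hard token cap.
        attempt = run_model("worker", worker_token_budget)
        spent += attempt.tokens_used
        if attempt.solved:
            return "worker", spent
    # Hard task, or the worker hit its cap: hand off to the senior tier.
    attempt = run_model("senior", None)
    return "senior", spent + attempt.tokens_used


def fake_model(tier: str, token_budget: Optional[int]) -> Attempt:
    """Stub with made-up numbers: the worker would need 50,000 tokens."""
    if tier == "worker":
        needed = 50_000
        if token_budget is not None and needed > token_budget:
            return Attempt(solved=False, tokens_used=token_budget)
        return Attempt(solved=True, tokens_used=needed)
    return Attempt(solved=True, tokens_used=8_000)


misrouted = orchestrate(3, fake_model)  # worker burns its cap, then escalates
direct = orchestrate(8, fake_model)     # routed straight to the senior tier
print(misrouted, direct)
```

In this toy run the misrouted task costs more tokens than sending it to the senior tier directly, which is the failure mode the last bullet of the model describes.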

CTA Breakdown

How they asked for the click.

VERBAL ASK

27:53 · next-video

“Curious how y'all feel. Am I throwing this model away too aggressively or is it actually better than I'm giving it credit for? Let me know how y'all feel and until next time, peace nerds.”

Soft engagement ask framed as genuine curiosity. No hard subscribe pitch.

MENTIONED ON CAMERA

00:40 · product · Devin

06:13 · tool · Artificial Analysis Intelligence Index

FROM THE DESCRIPTION

PRIMARY CTA

Where the creator wants you to go next.

Thank you Devin for sponsoring! Check them out at ↗

Storyboard

Visual structure at a glance.

00:00 · hook · cold open
02:00 · sponsor · Devin demo
03:02 · value · Commerce letter
04:10 · value · Anthropic benchmarks
07:00 · value · Artificial Analysis cost
10:14 · value · fish-game Opus run
12:20 · value · fish-game Sonnet run
15:05 · value · SkateBench terminal
17:00 · value · engineer cost whiteboard
18:55 · value · Frontier Code chart
20:13 · value · CursorBench chart
23:10 · value · leaked thinking trace
26:10 · value · system card contamination
27:20 · cta · final verdict

