Modern Creator
Nick Puru | AI Automation · YouTube

I Cancelled GPT-5.5 After Testing Opus 4.8 (Here's Why)

A 17-minute practitioner verdict on whether Claude's fastest-ever release fixes what 4.7 broke.

Posted: yesterday
Duration: about 17 minutes
Format: Review · educational
Views: 844 · 20 likes
Big Idea

The argument in one line.

The compute crisis that made Claude 4.7 painful forced Anthropic to fix the model's core weakness, long-session drift, so 4.8 is a qualitative shift back to the trustworthy agentic behavior last seen in 4.6, not just a benchmark step.

Who This Is For

Read if. Skip if.

READ IF…
  • You use Claude Code daily for real builds and felt 4.7 going off the rails on long autonomous tasks.
  • You switched to GPT-5.5 or Codex over the last six weeks and want to know if it's worth coming back.
  • You're evaluating models for agentic workflows where the agent runs unsupervised for extended sessions.
  • You want a benchmark read that distinguishes paper scores from daily-use feel, with actual output comparisons.
SKIP IF…
  • You need a quick one-shot code generator and don't care about long autonomous sessions.
  • You're looking for API pricing analysis or enterprise deployment guidance; this is a practitioner's daily-use take.
TL;DR

The full version, fast.

Anthropic grew 80x in one year, ran out of compute, and the degraded service is what made 4.7 feel broken. Opus 4.8 dropped 42 days later, backed by a $1.25B/month xAI compute deal and Anthropic's fastest model turnaround ever. Three side-by-side coding tests show a clear qualitative gap: 4.8 understands the full intent of a prompt, checks its own work before returning results, and holds the thread on long multi-step jobs without drifting off task. On benchmarks it beats GPT-5.5 on agentic coding 69 to 58, loses terminal bench 74 to 78, wins computer use. Same price as 4.7. For practitioners running real agentic builds, it's worth switching back.
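
The 42-day turnaround can be checked from the release dates quoted in the transcript (April 16 for 4.7, May 28 for 4.8). A minimal sketch follows; the year 2026 is an assumption, since the video never states it, although this particular gap comes out the same in any year.

```python
from datetime import date

# Release dates as quoted in the transcript; the year is an assumption.
YEAR = 2026
opus_4_7 = date(YEAR, 4, 16)
opus_4_8 = date(YEAR, 5, 28)

gap_days = (opus_4_8 - opus_4_7).days
print(gap_days)                # 42
print(round(gap_days / 7, 1))  # 6.0, i.e. the "six weeks" used in the video
```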

Chapters

Where the time goes.

00:00-01:13

01 · Cold open and Anthropic's own framing

Hook via confession (gave up on Claude), Anthropic's own 'modest but tangible' wording sets credibility baseline, promise of three tests and a final verdict.

01:13-02:22

02 · The compute crisis

Anthropic grew 80x vs planned 10x; CEO quote; throttling at peak hours; 4.7 wasn't dumb, the service was buckling.

02:22-03:20

03 · The xAI deal

$1.25B/month, 220k GPUs, 300 MW; same week doubled Claude Code limits; headline: Higher limits plus a compute deal with SpaceX.

03:20-05:05

04 · Same price, better model

$4.75/M input, fast mode 3x cheaper; recap of how bad 4.7 felt: second-guessing loops, tokens draining, Reddit complaints.

05:05-07:15

05 · Test 1: Solar system build

Same prompt, two models. 4.7 built a functional demo (Heliocentric). 4.8 built the Orrery: planets, stats panel, surface landing mode, correct scale. 4.8 did more of what was asked and got there faster.

07:15-09:50

06 · Test 2: 2001-vs-2026 encyclopedia flip

4.7's Compendia nailed both eras. 4.8's Omnipedia lived inside the 2001 era (Netscape button, 56k bar) and tied both versions with one idea: the old blue hyperlink becomes the new site's accent color. 4.7 faster; 4.8 better.

09:50-12:15

07 · Test 3: Long-running jobs

4.7 drift problem: loses the thread, redoes old decisions, stops to ask "should I keep going?" every few minutes. The /goal command helped but only prolonged the drift. 4.8 fixes focus in the model itself.

12:15-13:49

08 · Dynamic Workflows

Manager 4.8 splits a huge job into hundreds of parallel sub-agents, each self-checking before assembly. Early preview, higher plans only.

13:49-14:56

09 · Benchmarks

Agentic coding: 4.8 at 69.2 vs 5.5 at 58.0. Terminal bench: 5.5 wins 78 to 74. Computer use, reasoning, knowledge work, financial analysis: 4.8 leads all.
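
To make the size of each gap concrete, the two head-to-head rows can be turned into margins. This is only arithmetic on the scores reported in this breakdown (which the video attributes to Anthropic's own comparison table); the dictionary keys and the `margin` helper are illustrative names, not anything from the video.

```python
# Scores as reported in this breakdown; key names are illustrative.
scores = {
    "agentic_coding": {"opus_4_8": 69.2, "gpt_5_5": 58.0},
    "terminal_bench": {"opus_4_8": 74.0, "gpt_5_5": 78.0},
}

def margin(row: dict[str, float]) -> float:
    """Positive means Opus 4.8 leads; negative means GPT-5.5 leads."""
    return round(row["opus_4_8"] - row["gpt_5_5"], 1)

margins = {name: margin(row) for name, row in scores.items()}
for name, m in margins.items():
    leader = "Opus 4.8" if m > 0 else "GPT-5.5"
    print(f"{name}: {leader} by {abs(m):.1f} points")
# agentic_coding: Opus 4.8 by 11.2 points
# terminal_bench: GPT-5.5 by 4.0 points
```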

14:56-16:58

10 · Verdict and CTA

4.8 over 4.7 is not close. 4.8 edges 5.5 on most tasks. Wait on Gemini. CTA to free AI Accelerators community.

Atomic Insights

Lines worth screenshotting.

  • Anthropic grew 80x in a single year against a planned 10x, and the compute shortage was responsible for most of 4.7's bad reputation, not model quality.
  • The xAI compute deal ($1.25B/month, 220k GPUs) landed weeks before 4.8, and the same week Anthropic doubled Claude Code session limits.
  • 4.8 costs the same as 4.7 ($4.75/M input tokens) while fast mode got 3x cheaper; price stability after a $1B+/month infrastructure build almost never happens.
  • The /goal command made 4.7 keep going but didn't make it smarter; it let the model run longer down the wrong path instead of fixing the drift.
  • 4.8 asks when it's unsure rather than guessing forward, eliminating the worst failure mode: confident long-session drift producing plausible-but-wrong output.
  • On agentic coding benchmarks, 4.8 scores 69.2 vs GPT-5.5's 58.0; more than 10 points, not a photo finish.
  • GPT-5.5 still wins terminal bench 78 to 74; raw terminal coding is the one benchmark 4.8 doesn't take, and anyone claiming a clean sweep is wrong.
  • Dynamic Workflows fires off hundreds of parallel sub-agents from a single session, each self-checking before results return; manager-worker architecture baked into the model tier, not layered tooling.
  • The quality gap between models shows in prompt interpretation: both got the same words, but 4.8 built what the user meant while 4.7 built what the user literally typed.
Takeaway

Why benchmark scores miss the real story on AI models.

WHAT TO LEARN

Infrastructure constraints, not model architecture, drove the worst of Claude 4.7's failures; the fix came in the model itself, not in the tooling layered on top.

  • A model company growing 80x while planning for 10x will degrade the service around a model even if the model itself stays constant; users experience the infrastructure, not the weights.
  • When a workaround patches a model's behavior without fixing the underlying issue, it can make things worse by extending the wrong behavior further into a long session.
  • Benchmark numbers and daily-use feel can diverge significantly: the step from 4.7 to 4.8 is small on paper, yet on the same prompt one model finished and the other finished better.
  • The most important property of an agentic model is whether you can leave it alone; drift and second-guessing in long sessions destroy the core value proposition of unsupervised task execution.
  • Model pricing doesn't always track infrastructure costs: 4.8 costs the same as 4.7 despite Anthropic absorbing a $1.25B/month compute deal, so evaluating models on cost alone misses this signal.
  • Prompt interpretation quality is separable from completion speed: the faster model produced the lesser result because it understood less of what was actually meant.
Glossary

Terms worth knowing.

Agentic coding
A benchmark category measuring how well a model completes real software engineering tasks autonomously without human guidance at each step. Maps closer to daily developer use than single-function code-generation tests.
Dynamic Workflows
A Claude 4.8 feature (early preview, higher plans) where the model acts as a manager, breaking a large job into parallel sub-tasks and spawning multiple sub-agents to work simultaneously, each self-checking before results are assembled.
/goal command
A Claude Code slash command that sets a completion condition and keeps the model running turn-after-turn until it reaches that condition, replacing the need to manually type 'keep going' repeatedly.
Effort dial
A per-session setting in Claude Code that controls how hard the model thinks: High (default), Extra (recommended for long jobs), and Max. Ultra Code mode sets this automatically and enables workflow spawning.
Terminal bench
A benchmark measuring raw terminal-based coding speed and accuracy. GPT-5.5 outscores Claude 4.8 here (78 vs 74), the one category where 4.8 does not hold the top spot.
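
The /goal entry above describes a run-until-condition loop, and the video's caveat is that such a loop only repeats whatever the model does each turn. The sketch below illustrates that idea with toy stand-ins. It is not Claude Code's implementation: `run_until_goal`, `fix_one_test`, `drift`, and the `max_turns` cap are all hypothetical names chosen for this example.

```python
from typing import Callable

def run_until_goal(
    step: Callable[[dict], dict],
    goal_met: Callable[[dict], bool],
    state: dict,
    max_turns: int = 50,
) -> tuple[dict, int]:
    """Keep calling `step` until `goal_met(state)` is true or turns run out."""
    turns = 0
    while not goal_met(state) and turns < max_turns:
        state = step(state)
        turns += 1
    return state, turns

# Toy stand-in for a focused model turn: each turn fixes one failing test.
def fix_one_test(state: dict) -> dict:
    return {**state, "failing": max(0, state["failing"] - 1)}

# Toy stand-in for a drifting model: busy every turn, never closer to the goal.
def drift(state: dict) -> dict:
    return {**state, "edits": state.get("edits", 0) + 1}

def all_pass(state: dict) -> bool:
    return state["failing"] == 0

final, turns = run_until_goal(fix_one_test, all_pass, {"failing": 3})
print(final["failing"], turns)   # 0 3

stuck, wasted = run_until_goal(drift, all_pass, {"failing": 3}, max_turns=10)
print(stuck["failing"], wasted)  # 3 10
```

The loop is identical in both runs; only the quality of the step differs, which is the distinction the video draws between the command and the model underneath it.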
Resources

Things they pointed at.

00:00 · product · Reprises AI
12:15 · tool · Dynamic Workflows (4.8 preview)
Quotables

Lines you could clip.

03:13
You don't go from a full on compute crisis to shipping your best model in six weeks unless something big changed in the middle.
Punchy causal claim with no setup needed; works as a standalone cold open · TikTok hook
11:43
4.8 fixes this in the model itself, not in some tool wrapped around it.
Clear differentiation between a real fix and a workaround · IG reel cold open
09:43
4.7 finished first, 4.8 finished better.
Seven-word verdict that lands without context · newsletter pull-quote
15:55
The chart says small steps, using it says huge.
Clean contrast between benchmark optics and daily-use reality · TikTok hook
The Script

Word for word.

00:00So Claude Opus 4.8 just dropped, and I've spent the last few days basically doing nothing but testing it. And I'm going to be honest with you. Before this, I've pretty much given up on Claude.
00:10The last version 4.7, it got so frustrating to work with that I moved most of my coding over to GPT-5.5. But 4.8 honestly pulled me right back in.
00:19So in this video, I'm gonna be walking you through exactly what I found. I took the same prompts and I ran them through 4.7 and 4.8 side by side, two real builds, so you could see the difference with your own eyes. And then I'm going to be getting into the long running stuff, which is the part that I actually care about the most.
00:35And at the end, I'll give you my honest take on where 4.8 really lands against 5.5 and Gemini and whether it's worth coming back to. So if that sounds good, let's get into it.
00:44Now before I show you a single test, I wanna be honest about something because it's the whole reason that I actually trust this release. Now, Anthropic, they're not overselling it if you look at their own page. So this is their wording.
00:55Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor. So modest. That is the company that built the thing telling you on paper it's a small step.
01:05And you know what? On paper, they are right. So that jump from 4.7 to 4.8 on the benchmarks.
01:11It is very small, and I'll show you the numbers later. Okay. So the first question that I had was why is this even out already?
01:17 4.7, this was barely six weeks old, and the answer actually explains a huge part of why 4.7 felt so rough in the first place. It comes down to one thing and that's compute.
01:26So just a quick timeline, 4.6, it came out February 5, 4.7 came out on April 16, and 4.8, this dropped on May 28 just forty two days later. So this is the fastest Anthropic has ever turned a model around by a lot. So what actually changed is the compute.
01:41So here's what's actually going on behind the scenes. Anthropic, they blew up this year. Their CEO, they said they plan for the company to grow about 10 times, instead it grew 80 times, and he said it straight out.
01:51Quote, that is the reason we have had difficulties with compute. They flat out ran out. And you felt that even if you didn't know, around the time 4.7 was out, Anthropic was throttling people at peak hours just to keep things running.
02:01Everybody was burning through their limits way faster than normal, and Anthropic admitted it was the number one thing that they were scrambling to fix. So that feeling that Claude got slow and kept just cutting you off, a big chunk of that was just them running out of room to actually serve everybody. Now to be fair, Anthropic says they never made the model itself dumber and I'll take them at their word on that, but the service around it was clearly buckling under the demand.
02:22So how do you fix running out of compute? You go buy a mountain of it, and that is exactly what they did. So a few weeks before 4.8, Anthropic just signed a giant deal with Elon Musk's xAI, $1.25 billion a month for around 300 megawatts of power and over 220,000 GPUs.
02:39And here's the part that actually proves what it was really for, because the same week that deal landed, Anthropic turned around and doubled people's limits in Claude Code. Now their own headline, it literally read higher limits plus a compute deal with SpaceX. So xAI, it didn't just help them.
02:53xAI's out of that hole. And then with all of that compute behind them, six weeks later, we get 4.8. Now is the compute the actual reason 4.8 is this good?
03:01Well, Anthropic credits the training, not the hardware. So that link right there, that's my read, not their words. But you don't go from a full on compute crisis to shipping your best model in six weeks unless something big changed in the middle.
03:13And the big thing that they say that changed was xAI, and I don't really buy that it's a coincidence. And the part that I respected the most in all of this is the pricing, and they just took on a build north of $1 billion a month, and they passed none of it onto you.
03:24So 4.8, it costs exactly the same as $4.75 per million tokens in, 25 out, the fast mode, it even got three times cheaper, better model, way more compute behind, same price. That basically never happened.
03:37So to get why 4.8 feels so good, you have to remember how bad 4.7 actually got for me. And I've run Claude all day, so I felt every single bit of this. Now 4.7, it was sometimes pretty painful to be working with and the worst part was just watching it think.
03:50We'd give it a problem, it would start solving it and stop and second guess itself. And then second guess the second guess over and over again talking itself out of the right answer and back into it while you just sat there watching it argue with itself and your tokens draining away. And all that back and forth didn't give you a better answer, it actually gave you a worse and slower one.
04:09Now, I'm not imagining this. If you go look at Reddit, people are complaining about the exact same stuff. So developers going back to 4.6 after a week, people calling 4.7 a flat out disappointment.
04:18It was everywhere and I felt the same way that they did. And the kicker is while Claude was doing all of this, 5.5 and Codex were eating its lunch on coding. Now, 4.8 just a few days in, my honest feel is that it's very sharp, it's very fast, and it hasn't made anything up on me.
04:32And when it's not sure about something, it just asks me instead of guessing, and that alone fixes what I hated the most about it. And the best way that I can put it is that 4.8 just feels like what 4.6 should have grown into. So it's not some brand new personality, it's just 4.6 with the rough edges just all smoothed out.
04:47Now, I'll be very fair, it was not perfect for everybody on day one. There was a thread on launch day literally titled is 4.8 broken where some people just hit bugs and slow responses. So not everybody had my experience, but when I sat down and actually tested it, every single thing that I hated about 4.7 was just gone.
05:03So here are the tests. Alright. So the first thing that I wanted to do was put 4.7 and 4.8 on the exact same task and just watch them work.
05:11So here, I have opened up two chats, Opus 4.7 in one, Opus 4.8 in the other, both on high effort. So the only thing changing here is the model, and I gave them the same prompt. So what I'm asking for here is just a real build, there's a lot of room to do it well or just do it okay.
05:26So the first thing that I always watch is how the model plans things out before it writes a single line. And right away, you can see the difference in how these two actually think. So 4.8, it stops and it checks its design notes.
05:38It actually says it's making a few smart calls up front, so the thing comes out right, like using the planets' real sizes instead of just faking it. And 4.7, it plans a bit too. It lays it out in two different modes, but it goes lighter on it and it gets to just building quicker.
05:52Okay. So here's what 4.7 gave me and look, it's pretty real. It called it the Heliocentric.
05:57You've got two modes, an orbit view and a fly through. So there's the sun, there's a few planets, and there's a speed slider so you can watch them in orbit. So it works and if I had never seen the other one, I'd tell you that's a pretty solid result and I'd move on with my day.
06:12But here's 4.8 on the exact same prompt. They called this one the Orrery and right off the bat, watch the difference. So every single planet, it's right there in a menu on the side.
06:21So I could click on Mercury and a whole info panel pops up. So we could see the size, how far it is from the sun, how long a day is, the gravity, the temperature, even a little fun fact at the bottom.
06:31And then there's a free flight mode and a land on the surface mode, I can actually drop onto a planet. Now the whole thing, it just looks cleaner and more put together. And the part that actually matters to me is in the prompt.
06:41I said, let me click a planet and see its stats. So 4.7, it gave me the planets I can look at. 4.8 gave me the planets, the stats, the surface, all the stuff that I actually asked for.
06:51So it's the same exact words from me. 4.8 just understood more of what I actually had wanted and that's the thing you feel all day. It does the obvious next thing without you having to spell out every little piece of it.
07:00So this one's not really a contest. 4.7 just built the demo and 4.8 built the one that you'd actually keep and show somebody. And 4.8 got there faster for me as well.
07:09So I'm not going to try to slap a fake timer on the screen. I wasn't running a stopwatch or anything. So take the speed as my read, not a hard number, but the difference in what they built, that you can see with your own eyes.
07:19So that's the first one. Then I ran this second test. So I'm basically asking for just two websites in one.
07:26So an old 2001 encyclopedia, that whole early Internet feel, blue links, funky search bar, and the links have to actually work. And then a button that just flips the entire thing into a gorgeous 2026 version.
07:37So it's the same content, same soul, just a completely different world and that's a design test and a logic test at the same time. Thus, why I went with this harder approach. Now, again, you could watch how each one thinks before it builds.
07:48So 4.8 checks its design notes first, and then it plans out how it's going to build the first thing, one set of content that powers two totally different looks behind a single button. And then it does something that I really liked. Before it shows me anything, it stops and double checks that its own code actually runs, and then it is checking its own work before it hands it to me.
08:06Now 4.7, it plans it out well too, where it talks through how the 2001 side and the 2026 side are just each going to work.
08:15It just gets to building things a bit quicker. Alright. So here's 4.7.
08:18They called it Compendia and honestly, this is pretty solid. I wanna say that very clearly. So the 2001 page, it is spot on.
08:24You have the navy banner, a little hit counter, a breaking news ticker just scrolling across the top and the old 56 k connection bar at the bottom. Then I hit the button and it flips it into a clean modern site with a real header and a category list. So 4.7, it understood the job and it nailed both versions.
08:40And now here's 4.8. So it called this one Omnipedia and it just went further on every single piece. The 2001 page, it has the made with notepad badge, it has the Netscape now button, the best viewed at 800 by 600, the visitor counter, a line that says edited 09/11/2001.
08:57It didn't just point at the old era, it actually lived in it, and then the 2026 version. So if you look at this header, it wrote ultramarine for six centuries, the most coveted color on earth was crushed from stone, and the men who made it kept the recipe secret. So the model wrote a line that actually makes you want to click, and then the part that got me is it tied the two versions together with just one single idea.
09:16So down in the footer, it wrote a clickable link was born blue in 1991, and it kept the blue. So that old blue link from the early web becomes the main color of the new site.
09:24 4.7, it built two good websites. 4.8 had a real idea tying them all together. Now, here's where I wanna be fair, and this is the interesting part: 4.7 was actually faster on this one.
09:34So it finished first. So if you care about raw speed, 4.7, it wins. But this is exactly why speed isn't the whole story because I would ship the 4.8 version every single time.
09:43It thought it through more, checked its own work, and it came back with better taste. So 4.7 finished first, 4.8 finished better. Alright.
09:50Now, this third one is the one that I actually think matters the most, and it's the part that I've personally been living in since its launch. And this isn't just a quick one shot, you know, like the first two. I'm talking about the long jobs where you hand the model something huge and you let it run on its own for a very long time.
10:06And I do want to slow down here because this is exactly where 4.7 hurt the most. So on a very quick task, 4.7 being a little wishy washy was just annoying and on a long task, it was a deal breaker. So here's what would actually happen.
10:19The longer that it ran, the worse that it got and then it would slowly lose track of what it was even doing and drift off the thing that you actually asked for. And then it would go back and just redo decisions that it had already made an hour ago, burning a ton of tokens going in circles or it would just stop early and ask like, should I keep going every few minutes or it would just lock on to some wrong idea and just keep running with it.
10:41So in real life, you could not leave 4.7 alone on a big job. You had to babysit it. You had to just constantly check it, catching anything before it went off a cliff and that kills the whole point of using these models.
10:53The reason you want a model that can run a long time is so you can do something else. So with 4.7 you couldn't do that, you were just stuck watching this. Now Anthropic, they actually saw this and tried to patch it, so right in the middle of all the 4.7 pain they shipped a command in Claude Code called /goal.
11:08And the idea, very simple. You give it a finish line like just keep working until all the tests pass and then it just keeps going turn after turn, checking its own progress until it actually hits the finish line. So you're not just sitting there typing keep going, keep going a hundred times.
11:22And with this, it did help a lot. So people could finally kick off a job at night and just check it in the morning. But the catch with this and the important part is that /goal just makes the model keep going.
11:31It doesn't make the model any smarter about what it's actually doing. So with 4.7 a lot of the time, all keep going was really doing was just letting it keep going down the wrong path. So the command was fine, the model underneath it was still the problem.
11:43And that's the thing that I really want you to get here is that 4.8 fixes this in the model itself, not in some tool wrapped around it. So even in plain chat with no fancy features turned on, 4.8 just holds the thread way better on a long job. And it's not only me saying that. Anthropic's own notes admit that 4.7 would skip steps that it actually needed and drift off after a long session and that 4.8 stays on task with fewer of those slip ups.
12:06So now when you put a command like /goal in front of 4.8, keep going finally means just making real progress and the tool and the model are finally pulling in the same direction. And then totally separate from the model, there's a brand new feature that came out with 4.8 called dynamic workflows, and I wanna be very clear that this is its own thing.
12:23So with this turned on, Claude Code can take a giant job, plan the whole thing out, and then fire off a bunch of smaller agents, sometimes hundreds of them, all working at the same time in one session, checking their own work before any of it comes back to you. So think of it like one manager handing pieces of the job out to a whole team.
12:40So instead of just one person trying to do all of it alone, and with this, it's still an early preview and it's only on the higher plans. But the point is that 4.8 on its own is already doing better at the long stuff, and dynamic workflows is the extra gear that you bolt on when the job is really, really big.
12:56And then the last dial that's worth knowing about is that you can actually tell 4.8 how hard to think. So with this, it starts on high and for the heavy long jobs, you can just bump it to extra or all the way to max, where extra is the one that Anthropic actually recommends for the long stuff and that's the one that I reach for as well.
13:11So there's even a mode in Claude Code called Ultra Code that pushes the effort up high and it just lets it kick off those workflows on its own. So here's what all of this has actually felt like because I've been pushing it hard. So I hand 4.8 a pretty big multi step job, and I just walk away, and it keeps going.
13:25It holds the thread. It doesn't drift. It doesn't re argue any old stuff that it already decided.
13:31It doesn't quietly switch to the wrong problem. So I could come back and it's still doing the exact thing that I had asked for just further along. Now, after living in 4.7, that honestly feels like a different planet.
13:41So this is the first Claude in a long time that I trust to leave alone on something big. And for the way that I work, that's the one upgrade that I truly care about most. Okay.
13:50Now the benchmarks and look, I've basically stopped fully trusting these because every time a new model drops, the company just waves a chart around where shocker, their model is winning. And, yeah, to a point, it is true, but it's never the whole story.
14:02The real story is to me, like, actually using the thing which I just showed you. So take this with that part in mind. But let's just look because those numbers are pretty genuinely good.
14:11But this is Anthropic's own comparison. So if you look at the coding row, agentic coding, the one that actually matches real building work, 4.8 is at 69, where 5.5 is at 58.
14:22So 4.8 didn't just edge it out. It took the coding crown back by more than 10 points. Now computer use, where the model actually drives a browser on your screen.
14:29 4.8 is the strongest model they have ever shipped. So the reasoning, the knowledge work, the financial analysis, it's sitting on top of all those too. Now with this, I'm not gonna spin you because that's the whole point of this channel.
14:38There is one row that 4.8 loses, terminal bench. So the raw terminal coding, 5.5 wins that one. 78 to 74, so 5.5 is still genuinely elite.
14:48And anybody telling you that 4.8 sweeps the board clean is just blatantly lying to you. It drops that one. But it just wins basically everything else that matters.
14:56So where do I actually land on all of this? I'm just gonna give it to you very straight. 4.8 compared to 4.7, it's not even a close call.
15:03It's just better. It's plain and simple. All the stuff that drove me crazy about 4.7, like the second guessing and the back and forth that wasted my time, that is gone.
15:13In 4.8, it's fast, does the thing that you actually asked for, and when it isn't sure, it just stops and it asks instead of guessing. And honestly, this is the model that 4.7 should have been the whole time. But the real reason that I'm excited is because the thing that I keep coming back to is that 4.8 just feels like a 4.6 again.
15:29And if you've been around this channel, you know that 4.6 was the one that I loved and it was the first model where I could hand it something real, walk away from my desk and just actually trust that it would go and get the thing done. And that feeling, it went away for about six weeks and now with 4.8, it's back.
15:45So for me, that's the whole story right there. Now, I do want to stay fair here because if all you do is look at the benchmark chart, then yeah, this is a small step up from 4.7, like that part is true. And 5.5 from GPT, it's still a really good model.
15:58It still beats 4.8 on one coding test that I showed you. So I'm not telling you to go cancel anything, but here's the thing, you only learn by actually using these all day because the chart says small steps, using it says huge. So 4.7, I felt like I was fighting it the whole time and 4.8 feels like it's finally on my side again and it takes the top spot back from GPT-5.5.
16:17And Gemini, I would just wait. Google's real next model isn't even out yet, so I'm not about to judge them off an old model. So here's my bottom line, if you gave up on Claude over the last month or two, the way that I pretty much did, this is the one that's worth coming back for.
16:30Now, if you guys do wanna go deeper on the stuff and actually testing these models head to head and figuring out how to put them to work in a real business, that is exactly what I do inside of my free community. It's called the AI Accelerators, over 18,000 people and they're figuring this out together and it's completely free to join.
16:42So link is down below in the description. Come hang out and thank you guys for watching. Do me a favor, hit the like, hit the subscribe, drop a comment on what models you guys are using now, what your systems are.
16:53Very curious on what you guys are using. But with that being said, thank you guys for watching and I'll see you in the next video.
The Hook

The bait, then the rug-pull.

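The title sells a cancellation; the video delivers something more measured. The creator moved most of his coding to GPT-5.5 for about six weeks, came back for 4.8, and still says "I'm not telling you to go cancel anything." The honest why runs deeper than benchmarks, and it starts with Anthropic's own admission that 4.8 is only a "modest but tangible" improvement.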

Frameworks

Named ideas worth stealing.

05:05 · model

Three Test Axes

  1. One-shot build quality
  2. Design plus logic combined
  3. Long-running autonomous sessions

The implicit framework for evaluating AI models in real production use: a quick build, a hard combined test, and an unsupervised long job. Each axis reveals a different failure mode.

Steal for: Evaluating any new LLM before committing to it for a production workflow
12:15 · concept

Dynamic Workflows

  1. Manager model receives large task
  2. Plans entire job
  3. Spawns parallel agent pool (hundreds)
  4. Each agent self-checks
  5. Final assembly returned to user

Manager-worker architecture for agentic AI: one orchestrator breaks work into parallel sub-tasks, each sub-agent verifies its own output before the manager assembles the result.

Steal for: Designing multi-agent pipelines for large build jobs
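
The five steps above can be sketched as a small manager-worker pipeline. This is a toy illustration of the pattern only, not how Dynamic Workflows is implemented: the squaring task, the function names (`plan`, `worker`, `run_workflow`), and the use of a thread pool are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def plan(job: list[int], chunk_size: int) -> list[list[int]]:
    """Manager step: split the job into independent sub-tasks."""
    return [job[i:i + chunk_size] for i in range(0, len(job), chunk_size)]

def worker(chunk: list[int]) -> list[int]:
    """Sub-agent step: do the work, then self-check before returning."""
    result = [n * n for n in chunk]
    # Self-check: one output per input, and every square is non-negative.
    if len(result) != len(chunk) or any(r < 0 for r in result):
        raise ValueError("self-check failed")
    return result

def run_workflow(job: list[int], chunk_size: int = 3, max_workers: int = 4) -> list[int]:
    """Plan, fan out to parallel workers, then assemble the final result."""
    chunks = plan(job, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so assembly is a simple flatten.
        partials = list(pool.map(worker, chunks))
    return [x for part in partials for x in part]

print(run_workflow(list(range(10))))
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Putting the check inside each worker means a bad partial result fails before assembly, which mirrors the "each agent self-checks" step in the list.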
13:10 · model

Effort Dial

  1. High (default)
  2. Extra (recommended for long jobs)
  3. Max

Per-session thinking effort control in Claude Code. Extra is the sweet spot for heavy agentic work per Anthropic. Ultra Code mode auto-sets effort high and enables workflow spawning.

Steal for: Calibrating compute cost vs quality for different job sizes
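
One way to apply the dial to your own jobs is a small selection policy. The three levels come from the video; everything else here is a hypothetical sketch, including the `pick_effort` name, the 30-minute cutoff, and the rule that a retry after a failed run escalates to Max. None of it is a documented Claude Code setting.

```python
from enum import Enum

class Effort(Enum):
    HIGH = "high"    # default, per the video
    EXTRA = "extra"  # recommended for long jobs, per the video
    MAX = "max"

def pick_effort(expected_minutes: float, retry_after_failure: bool = False) -> Effort:
    """Hypothetical policy: the cutoff and the retry rule are illustrative assumptions."""
    if retry_after_failure:
        return Effort.MAX
    if expected_minutes >= 30:
        return Effort.EXTRA
    return Effort.HIGH

print(pick_effort(5).value)                             # high
print(pick_effort(120).value)                           # extra
print(pick_effort(5, retry_after_failure=True).value)   # max
```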
CTA Breakdown

How they asked for the click.

VERBAL ASK
16:10 · link
If you guys do wanna go deeper on actually testing these models head to head and figuring out how to put them to work in a real business, that is exactly what I do inside of my free community.

Positioned after a strong earned verdict; the CTA carries weight because the creator spent the whole video being fair about 4.8's one benchmark loss before recommending it.

Storyboard

Visual structure at a glance.

00:00 · hook · title card
00:35 · promise · talking head promise
01:41 · context · session limits screenshot
05:05 · demo · two Claude chats open
06:16 · demo · 4.8 Orrery live
07:32 · demo · Test 2 encyclopedia
09:00 · demo · Omnipedia 2026 flip
11:44 · explainer · /goal command docs
12:15 · explainer · Dynamic Workflows diagram
13:49 · data · Anthropic benchmark table
14:56 · cta · verdict talking head