Big Idea

The argument in one line.

Opus 4.8 fixes the honesty and laziness problems that plagued 4.7, but realizing those gains requires active workflow changes — matching effort level to task complexity and writing instructions that say what to do, not what to avoid.

Who This Is For

Read if. Skip if.

READ IF YOU ARE…

You use Claude Code regularly and were frustrated by 4.7's laziness, early quitting, or token burn.
You want a practical primer on the effort slider before migrating existing agentic workflows.
You rely on long-running autonomous tasks where 4.7 would halt early or hallucinate completion.
You prompt with negative constraints and wonder why the model keeps ignoring them.

SKIP IF…

You want a rigorous benchmark comparison — this video deliberately sidesteps benchmark theater.
You are not a Claude Code user; most advice is specific to the CLI and agentic workflows.

TL;DR

The full version, fast.

Opus 4.8 ships at identical pricing to 4.7, adds a six-tier effort slider (low through ultracode), and introduces dynamic workflows for large-scale problems. The headline improvement is honesty: the model is roughly four times less likely to falsely claim it has finished a task. Key behavior shifts include defaulting to reasoning before calling tools, calibrating response length to task complexity, and spawning fewer subagents by default. The practical implication is that effort level is now the dominant control surface — running max effort on a simple task causes overengineering, while under-setting effort on a complex autonomous task triggers early quitting. Match the level to the job, tell the model what you want (not what to avoid), and give the why behind every instruction.

Chapters

Where the time goes.

00:00 – 00:35

01 · Intro

Promise: same-day breakdown of benchmarks, 4.7 pain points, and key takeaways.

00:35 – 01:07

02 · What's New in 4.8

Blog post walkthrough: effort control, dynamic workflows, same pricing as 4.7, API rate limit increases.

01:07 – 02:05

03 · Effort Levels and Workflows

Live demo of /effort slider in Claude Code CLI: low, medium, high, xhigh, max, ultracode. Ultracode = xhigh + dynamic workflows.

02:05 – 02:54

04 · Benchmarks Reality Check

Benchmarks always look great at launch. Codex with GPT-5.5 may outperform Opus on computer use despite worse paper numbers.

02:54 – 04:38

05 · The Honesty Upgrade

Opus 4.8 is ~4x less likely to falsely claim task completion. Alignment evaluation data shown. Mythos preview teased.

04:38 – 06:52

06 · 4.7 Pain Points

Community-reported 4.7 problems: lazy/early quitting, safety overreach, token explosion, attitude. Anthropic acknowledged them and rolled out partial fixes, but core complaints persisted until 4.8.

06:52 – 10:33

07 · Key Takeaways

Five adjustments from Anthropic docs: effort is the primary lever, positive instructions, give the why, reasoning-before-tools default, self-calibrated response length.

10:33 – 12:12

08 · Community Reactions

Launch-hour social takes. Positive: one-shotted GPT-5.5, warm and collaborative. Cautious: early bugs, real-world data still thin.

12:12 – 13:44

09 · Final Thoughts

Evaluate 4.8 against your specific 4.7 frustrations, not the benchmarks. Watch for vibe upgrade, self-correction frequency, token efficiency. Plug for free token dashboard.

Atomic Insights

Lines worth screenshotting.

The effort slider is now the single biggest variable in Claude Code output quality — low and xhigh feel like different model versions.
Opus 4.8 is four times less likely to falsely report task completion than its predecessor.
Benchmarks always look great on launch day — the only benchmark that matters is performance on your specific workflow.
Telling a model not to use em dashes is weaker than telling it to write as if you wrote it yourself and you never use em dashes.
Opus 4.8 defaults to reasoning before calling tools; if you need external context first, you have to prompt for that order of operations explicitly.
Anthropic is already testing a model class above Opus called Mythos, currently limited to cybersecurity research organizations.
The community frustration with 4.7 was largely a model problem, not a user problem; Anthropic acknowledged this and rolled out system-level patches before 4.8.
A model feeling stubborn or short-tempered is a documented behavior, not your imagination — 4.8 was explicitly trained for a warmer, more collaborative vibe.
Token efficiency improvements in 4.8 are claimed but unverified — real-world testing is needed before assuming lower session costs.
Misaligned behavior scores for Opus 4.8 are almost half those of 4.7 in Anthropic's evaluation, and Mythos Preview also scores low on the same chart.
Switching models without adjusting prompts is the most common way to get worse results from a better model.
The ultracode effort tier is xhigh plus workflows — it is not just a higher compute setting but a different agentic execution mode.

Takeaway

Effort level is the dial most Claude Code users never touch.

WHAT TO LEARN

Opus 4.8 ships with six effort tiers that behave differently enough to feel like different models, and some of the frustration attributed to 4.7 may have been effort misconfiguration rather than model limitation.

02 · What's New in 4.8

Rate limit increases apply to API usage only — the five-hour rolling window and weekly session limits for Claude.ai users are unchanged.
Dynamic workflows is a separate feature from effort level; it enables very large-scale problem decomposition, is bundled into the ultracode tier, and can also be started on its own with the workflows command in Claude Code.

03 · Effort Levels and Workflows

Running max effort on a simple task causes overengineering, while running too little effort (low, medium, or even the default high) on a complex autonomous task can trigger early quitting. The mismatch, not the model, is often the problem.
The effort slider defaults to high in Claude Code; ultracode is xhigh plus dynamic workflows and represents a fundamentally different agentic execution mode.
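
One way to make the matching habit concrete is a small lookup from task shape to tier. The sketch below is hypothetical: the tier names come from the slider shown in the video, but pick_effort, its inputs, and its thresholds are assumptions made up for illustration and are not part of Claude Code.

```python
from enum import IntEnum


class Effort(IntEnum):
    """Tier names as shown on the slider, in slider order."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2       # the default in Claude Code, per the video
    XHIGH = 3
    MAX = 4
    ULTRACODE = 5  # xhigh plus dynamic workflows


def pick_effort(files_touched: int, autonomous: bool, needs_decomposition: bool) -> Effort:
    """Hypothetical heuristic; the thresholds are illustrative, not measured."""
    if needs_decomposition:
        # Very large problems are what dynamic workflows are described as targeting.
        return Effort.ULTRACODE
    if autonomous:
        # Long unattended runs are where under-set effort shows up as early quitting.
        return Effort.MAX if files_touched > 20 else Effort.XHIGH
    if files_touched <= 1:
        return Effort.LOW
    if files_touched <= 5:
        return Effort.MEDIUM
    return Effort.HIGH
```

The returned tier would still be set by hand on the slider; nothing here talks to Claude Code. The point is only to decide the level before starting rather than leaving the default in place.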

04 · Benchmarks Reality Check

Benchmark scores measure the benchmark — they don't measure your workflow. Test the model on the specific task that frustrated you in 4.7 before declaring it fixed or broken.

05 · The Honesty Upgrade

The honesty improvement is real and measurable: the model is four times less likely to report false completion, which changes how much you can trust unsupervised long-running tasks.
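
Even with a lower false-completion rate, an independent check on long-running tasks is cheap. The sketch below assumes you can list what the agent claims it finished and what actually landed (for example, by querying the destination yourself); audit_completion and the item names are hypothetical and only illustrate the comparison.

```python
from typing import Dict, Set


def audit_completion(claimed_done: Set[str], actually_done: Set[str]) -> Dict[str, object]:
    """Compare what the agent says it finished with what an independent check found."""
    false_claims = claimed_done - actually_done  # reported as done, but not found
    unreported = actually_done - claimed_done    # done, but never reported
    return {
        "false_claims": sorted(false_claims),
        "unreported": sorted(unreported),
        "trustworthy": not false_claims,
    }


# Mirrors the video's example: the agent says it pushed all 50 items, only 15 landed.
claimed = {f"item-{i}" for i in range(50)}
landed = {f"item-{i}" for i in range(15)}
report = audit_completion(claimed, landed)
print(len(report["false_claims"]), report["trustworthy"])
```

A non-empty false_claims list is the signal to stop trusting the run's own summary and inspect the work directly.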

06 · 4.7 Pain Points

A model that feels stubborn or sassy is exhibiting a documented training characteristic, not random behavior — and 4.8 was explicitly retrained to reduce it.

07 · Key Takeaways

Positive framing outperforms negative constraints: telling the model what style to match lands better than listing what to avoid, because the model can reason about intent.
Giving the why behind an instruction is not optional polish — it is how the model calibrates compliance when an instruction conflicts with its defaults.
Opus 4.8 reasons before calling tools by default; if your workflow needs external context pulled in first, you have to prompt explicitly for that order of operations.
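
The three habits above can be combined in one small prompt template. This is a minimal sketch, not an Anthropic-provided pattern: build_instruction, its wording, and the docs/style-guide.md path are all hypothetical.

```python
from typing import Optional, Sequence


def build_instruction(goal: str, why: str, gather_first: Optional[Sequence[str]] = None) -> str:
    """Compose a positively framed instruction with its reason attached."""
    parts = []
    if gather_first:
        # Per the video, Opus 4.8 reasons before calling tools by default,
        # so a context-first order has to be stated explicitly.
        parts.append(f"Before planning, read: {', '.join(gather_first)}.")
    parts.append(goal.strip())
    parts.append(f"Why this matters: {why.strip()}")
    return "\n".join(parts)


prompt = build_instruction(
    goal="Write this in my own voice, joining clauses with commas and periods.",
    why="It is published under my name, and I never use em dashes in my writing.",
    gather_first=["docs/style-guide.md"],  # hypothetical path
)
print(prompt)
```

The goal states the style to match instead of listing what to avoid, and the reason travels with it so the model has something to weigh when the instruction conflicts with its defaults.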

08 · Community Reactions

Early launch-hour enthusiasm is a poor signal — wait for real-world workflow data from users with similar use cases before drawing conclusions.

09 · Final Thoughts

Token efficiency claims from Anthropic are unverified at launch; use a session-level token tracker to confirm whether your actual costs dropped before adjusting budgets.
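
A before-and-after comparison can be as simple as mean tokens per session. The sketch below assumes each session record is a dict with input_tokens and output_tokens fields; that shape is an assumption for illustration and is not the format of the token dashboard mentioned in the video.

```python
from statistics import mean
from typing import Dict, List


def mean_tokens_per_session(sessions: List[Dict[str, int]]) -> float:
    """Average of input plus output tokens across sessions."""
    return mean(s["input_tokens"] + s["output_tokens"] for s in sessions)


def relative_change(before: List[Dict[str, int]], after: List[Dict[str, int]]) -> float:
    """Fractional change in mean tokens per session; negative means cheaper."""
    base = mean_tokens_per_session(before)
    return (mean_tokens_per_session(after) - base) / base
```

Feed it comparable work from each model (same kind of task, same effort level), otherwise the difference reflects the workload and not the model.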

Glossary

Terms worth knowing.

Effort level: A Claude Code parameter controlling how much compute and internal reasoning the model applies to a task. Tiers run from low (fast, cheap) through ultracode (maximum capability, highest token spend).
Dynamic workflows: A Claude Code feature launched alongside Opus 4.8 that allows the model to tackle very large-scale, multi-step problems by orchestrating sub-tasks. Activated via the ultracode effort tier or the /workflows command.
Misaligned behavior score: Anthropic's internal evaluation metric measuring how often a model exhibits deception or misuse behaviors. Lower is better; Opus 4.8 scores roughly half of Opus 4.7.
Mythos: An unreleased Anthropic model class positioned above Opus in capability. As of the 4.8 launch, a small number of organizations use an early preview for cybersecurity research.
Ultracode: The highest effort tier in Claude Code, combining xhigh reasoning with the dynamic workflows feature for tackling complex, multi-step agentic tasks.
Agentic use: Running an AI model in an autonomous loop where it plans, calls tools, reads outputs, and iterates toward a goal without continuous human input.

Resources

Things they pointed at.

00:00 · link · Anthropic blog: Introducing Claude Opus 4.8

00:24 · link · Claude API Docs: Prompting best practices

13:28 · tool · Token tracker / dashboard (free GitHub repo)

Quotables

Lines you could clip.

07:23

“The difference between Opus 4.8 on low and Opus 4.8 on extra high is a significant difference, like almost to the point where it feels like a different version.”

Concrete, testable claim that reframes effort level as a version upgrade → TikTok hook

12:36

“Benchmarks look great, and they always will. Someone else's use case is not your use case.”

Punchy two-sentence takedown of benchmark theater, no setup needed → IG reel cold open

05:38

“There is a big difference here between the model having problems and you using the model wrong. Sometimes it truly is a skills problem.”

Breaks the pattern of pure model criticism; holds the audience accountable → newsletter pull-quote

The Script

Word for word.

00:00So Claude Opus 4.8 is finally here. And as always, the benchmarks look amazing. In a lot of the major categories, 4.8 is better than 4.7 and even better than GPT 5.5 as well.

00:10But the question is, is it really a better model? So today I wanna talk about what is new to Claude Code because of Opus 4.8. I wanna talk about some of the issues that you guys have been having with 4.7 and some of the struggles and how 4.8 is supposedly going to address those issues.

00:24I'm gonna go over some key takeaways because it seems like this model is going to behave a little bit differently than 4.7, and you're gonna have to change the way that you work with it a little bit. So let's not waste any time and just get straight into this one.

00:34Okay. So it is 05/28/2026, and Opus 4.8 has dropped, and it is apparently built on top of Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors.

00:47And important to note, it is priced the exact same as Opus 4.7 on input and output tokens. But what's interesting right here is that they have increased rate limits in Claude Code to accommodate the higher token usage of effort levels. So that's rate limits.

00:59That is not your, you know, five hour rolling window or your weekly session limits. Those remain untouched, but rate limits if you're using Claude Code via API has been increased. Alright.

01:07So this is the blog post. I'll link this in the description, but I'm just gonna go over a few of the key findings here. Okay.

01:12So Opus 4.8 launches alongside several new features. Users on claude.ai now have control over the amount of effort Claude puts into tasks. And in Claude Code, we have a new feature called dynamic workflows that allows it to tackle very large scale problems.

01:25So I'm not gonna dive into Claude workflows today, but this is a new feature that I will be making a video about shortly. But now in Claude Code, obviously, you can see we have Opus 4.8 is here. It defaults to high effort.

01:35You can also, of course, switch the effort. But in here, you can type in workflows, and that's how you could start using that dynamic workflow feature. But what I wanna show you guys in here, which is pretty cool, is in the terminal or the CLI version, if you do effort, you can see we have the slider.

01:47Like I said, it's going to default to using high, but you can also do low or medium, and you can come up here to xhigh, max, or ultracode, which is xhigh plus workflows. So it's very, very smart over here. But, of course, it's gonna cost more from the token perspective.

02:02And then the more left you scroll down onto this slider, the faster your outputs will be. Of course, we can dig into the actual benchmarks, which I like to look at, but the thing about the benchmarks is every single time you see a new model, the benchmarks are amazing. Right?

02:15It's always better than the other ones, and you've always got these other comparisons. So, obviously, that's what they have to do from a marketing perspective. So it's really important for you to understand what model is actually the best for your use case.

02:25Like, maybe the case is, yeah, Opus 4.8 really is better at agentic coding than something like Codex with GPT 5.5. But maybe for your very specific use case, Codex is just performing way better even if the explicit benchmarks don't say that it should. Like right here, for example, I think that Codex with GPT 5.5 is much, much better at agentic computer use than Opus 4.7 and Opus 4.8, even though these two Opus models apparently, objectively, would be better at agentic computer use than Codex.

02:52So always take these benchmarks with a grain of salt. Anthropic took a whole section of this blog to call out that one of the most prominent improvements is Opus 4.8's honesty, which I think is interesting because that's definitely something that I noticed with Opus 4.7 as we're gonna dig into over here as far as, like, problems that people have reported with Opus 4.7.

03:09But they took time to call out the honesty here. We train all our models to be honest, to avoid making claims they can't support, like saying, hey. This is gonna take me four hours, and then it takes twenty minutes.

03:18Or saying, hey. I finished. I pushed all 50 into blah blah blah, but I only actually pushed 15.

03:22So if you guys have ever felt that, you're not alone. And apparently, Opus 4.8 is much better at that. And so they actually have evaluations to test this sort of stuff, which is about misaligned behavior.

03:31And you can see here that in this case, a lower score is actually better. So right here, we've got, like, Mythos Preview coming in pretty low. Opus 4.8 comes in at almost half of what Opus 4.7 and Sonnet 4.6 come in at.

03:44But take a look at this. Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor 4.7. Obviously, there's still more work to be done.

03:52But what they say here is that they plan to release a new class of model with even higher intelligence than Opus, which is Mythos. You can see a small number of organizations are currently using it for cybersecurity work, but models of this capability require stronger cyber safeguards before they can be generally released to the public because we don't want some random kid in their basement hacking into your bank account.

04:13But, anyways, Opus 4.8 is available everywhere today. So wherever you're using Claude Code, you should be able to access it. Open up a new terminal, open up a new extension tab, whatever it is, and you can see right here we have Opus 4.8.

04:24And you will notice right away that we still have our 1,000,000 context window with Opus. I could come in here, and I could type in a, just did that twice in a row, a slash model, and we can choose between default Sonnet Sonnet or our Opus 4.8, which is most capable for most work.

04:38But, anyways, Opus 4.7 was released April 16. So basically about a month and a half ago, they're moving really, really quick here. And when Opus 4.7 came out, they added the xhigh effort level, which now has been dwarfed by max and the ultra one, max and ultracode.

04:54But what's interesting is a lot of people actually weren't happy about this model release because they felt like it was actually worse than Opus 4.6. So some of those main problems were it felt lazy. It was just basically giving up on the goal, on the task too early.

05:07So, you know, Codex had slash goal, and now a bunch of other different AI tools have slash goal. Claude Code has slash goal. And that was kind of like a Band-Aid fix to put on top of the model to help it work a little bit longer towards some sort of specified goal.

05:18But now that is just a core fundamental piece of the model. Not exactly the slash goal, but just the idea that it's going to be less lazy and it's going to be better at working for longer time. It was also said to be overly rigid with safety overreach.

05:31There was a ton of community feedback on the token burn and how much more expensive this model seemed to be. And the one that I think is the funniest is saying that it had an attitude, which honestly is true.

05:41If you've ever heard it sort of get a little sassy with you or push back on your own ideas, it's good to have that sort of, like, brainstorming thought partner. But I have noticed that sometimes it did sort of come off, like, very short and almost, like, stubborn. So those were some of the main problems that I felt, but also that the community had felt with 4.7.

05:57Now there is a big difference here between the model having problems and also, like, you using the model wrong. It's not always a model problem. Sometimes it truly is a skills problem, and the answer isn't just, oh, well, 4.7 can't do this.

06:09Let me wait for 4.8. Sometimes it is a user error thing. So I just did wanna call that out as well.

06:13So, anyways, 4.8 obviously comes out today, and it was built to fix this stuff. Right? It was built and said to have more honesty and self correction, more sustained autonomy on long running tasks, a warmer and a more collaborative vibe, and just efficiency and quality of life, meaning better tool calling, better reasoning, better question asking, better token efficiency, stuff like that.

06:34And so what I did is I read through a lot of the stuff that community was was talking about. I tested out Opus 4.8 a little bit. Obviously, it came out an hour ago, so I haven't been able to deep, deep dive into it yet.

06:43But I've tested it out. I also read through this documentation here about the prompting best practices, which is a pretty long article from the Claude API docs, which I will also link in the description if you wanna check it out. But after reading through all this stuff, there's a few takeaways that I I wrote down and that I wanted to share with you guys.

06:58The first one is that effort is the number one lever now. And when I look back at one of these problems right here, like maybe the laziness or the safety overreach, maybe that was an effort issue.

07:06Because, basically, if you were doing something that takes a lot of effort, but you've got the model set to low or, you know, medium or even just high, sometimes you just need more effort. And on the other side of the spectrum, if you're doing something that's really simple and you have that on high or extra high, then maybe you're also, like, dedicating more resources than you need, and the model is gonna overreason and overengineer.

07:26And that's where you're like, okay. This is so easy. Why can't it do it?

07:29It's simple. But maybe you just needed to turn down the effort. And so it really is a balance here between, you know, Claude's intelligence and the token spend and the speed and all that kind of stuff you're looking at.

07:39But the point I'm trying to make here is if you are one of those people that open up Claude Code and you just start typing and doing your work and building and you never tweak the model, start trying. Because the difference between Opus 4.8 on low and Opus 4.8 on extra high is a significant difference, like almost to the point where it feels like a different version, like an Opus 4.9.

07:57So it's definitely worth starting to pull on that lever a little bit if you never have. So the next one I've got is tell it what to do, not what not to do. And really the way I got there is if you go through this documentation, it always shows you good example prompts, right, that you could copy for specific scenarios.

08:13And when I looked through a lot of these example prompts, I realized that it wasn't really saying a lot of what not to do, or, I mean, that's horrible timing because right here it says do not do this. But it always tells more explicitly what to do, and what I thought was cool is it gives background. It gives context.

08:29Almost as if the model is sort of, like, curious, and it's gonna say, hey. You told me not to do x, y, and z, but why? And the more that you can contextualize that stuff, the better it's gonna be able to follow those instructions.

08:39And that leads me to the next one right here, which is give the why behind an instruction. So rather than saying don't use em dashes, say something like, I want this to come off like I'm really writing it. And this is my writing style, and I never use em dashes, so make sure you're following my writing style.

08:54And that is going to have a little bit better feeling of Opus actually following your instructions. If you guys have seen my comparisons between Opus and GPT 5.5, one of the things that I've said is that I love how creative Opus is, but sometimes I want it to just do the thing and I want it to do exactly how I want it.

09:10And so maybe that's an issue of effort and also telling it too many negative prompts. You know what I mean? So it's a mix of the model, but also take accountability on yourself a little bit and think, how can I actually use the model the way that the engineers of the model have actually told me to?

09:25Anyways, a few other ones. It's going to default to reasoning before calling tools. So it's gonna try to figure out, you know, the questions to ask and the approach to take on its own with what it has right now before it looks to spawn a sub agent, for example, or to go read that database or to go do this.

09:41And sometimes that's really good. Right? Sometimes you want that reasoning before it starts doing things, but sometimes you want that extra context to be pulled in before the reasoning starts.

09:51So that's why it's really important to play with your prompting, obviously, to play with your your effort level, and to be especially when you're switching over all these workflows from 4.7 to 4.8, you don't just switch over and say go and blindly trust that everything's gonna, you know, stick the same. You kind of wanna watch it a little bit to to get a feel for how the model behaves.

10:08And then when you look at things like response length and verbosity, you can see here that I said it calibrates its own length. And basically what I meant by that is that it's going to judge how complex what it should do and how it should respond based on the complexity of the task rather than defaulting to some sort of fixed verbosity.

10:24So this usually means that shorter answers on simple lookups, and you'll get longer ones on more of an open ended analysis that takes more reasoning. So, anyways, those are some of my main takeaways. Obviously, like I said, I've only played around with this model for about half an hour.

10:36I wanted to get this video out quick, but as I find more stuff out, I will continue to update you guys. So last two pieces here. What are people saying right now?

10:44So, obviously, there's a lot of different feelings. Right? A lot of positive and excited stuff.

10:48Right? People are saying, oh, this already one shotted my GPT 5.5 right here. Strongest coding model yet.

10:54I'm hooked. This is super warm, super collaborative, big jumps and, you know, benchmarks. But once again, a lot of these people have the intention to do some stuff like that or say stuff like this because, I don't know, they want engagement or they're marketing something.

11:06And so, obviously, it's it's great to look at the full end of the spectrum, which is why I also pulled in some mixed and cautious reports, like some early reports of bugs already in Opus 4.8. Maybe just because of the rollout, whatever it is, people are still testing it. So there's a lot of stuff to still be cautious about.

11:22But the overall vibe, which I think is pretty cool, is that it's almost like we have, like, these, you know, four or five main bullets of 4.7 problems, and most of these improvements that we're reading about from 4.8 are directly hitting those 4.7 problems and pitfalls. So at least we're getting that sense of Claude using that data to make it better.

11:40And if you really think about it, think about the way that you use Claude Code. Right? You ask something.

11:46Claude Code responds. You correct it. And you have this back and forth of, like, I don't like that.

11:50Do this better. Blah blah blah. I'm your master.

11:53And then what happens is because, obviously, Anthropic can read those logs and and, you know, train their models on that data, it's able to say, okay. What are people not happy with Opus 4.7 about? What are they constantly saying?

12:05And let's just bake that into the model. So it really it would concern me if a lot of these key problems weren't being addressed, like, head on. Anyways, but the key thing that I want you guys to always think about is that benchmarks look great, and they always will.

12:19And someone else's use case is not your use case. So figure out right now in your workflow, in your Opus 4.7 workflows, what are your problems?

12:27What do you typically get frustrated by? And maybe Opus 4.8 fixes those problems, but maybe it won't. Even though the model is better, that doesn't mean it's better for that specific problem.

12:37So just always be thinking about how can you work in different models or different context strategies or different effort levels to directly address the actual constraints and pain points that you're having right now. So look for things like the vibe upgrade.

12:50Look for things like how often you're self correcting this thing and giving it the same instruction over and over. Obviously, you should be working in, like, memory and different skill files and things like that to address that repetition, but still. And then, of course, the whole token and workflow efficiency feeling that.

13:05You typically get a sense of when you're getting near the end of your session limit and when you need to pull back a little bit. Apparently, based on the documentation, this model is more efficient with tokens, but we don't actually know yet.

13:16And one great way to test that kind of stuff out is you can use my token tracker, my token dashboard tracker, which is completely free. It's an open source, just GitHub repo. I will leave that in my free school community linked in the description.

13:27Just give Claude Code the GitHub repo, tell it to set it up, and it will pull in all of your historical data with Claude Code, and you can see where your tokens are actually going. But, anyways, that is gonna do it for today. I hope you guys enjoyed this one or learned something new.

13:37And if you did, please give it a like. It helps me out a ton. And as always, I appreciate you guys made it to the end of the video, and I'll see you on the next one.

13:43Thanks, guys.

The Hook

The bait, then the rug-pull.

Benchmarks always look amazing on launch day. The real question is whether the model actually fixes the things that were breaking your workflow — and for Opus 4.8, the answer turns out to depend almost entirely on a slider you probably never touched.

Frameworks