The Next New Thing · YouTube

We ran Fable through 600+ tests. It's amazing

Zapier's Automation Bench ran Claude Fable 5.0 against hundreds of realistic business workflows — here's what the numbers actually mean.

Posted

June 9th

1 month ago

Duration

07:22

Format

Interview

educational

Views

2.2K

35 likes

Part of the collection: The Fable 5 Playbook. All 45 Fable 5 breakdowns, synthesized into one page.

Big Idea

The argument in one line.

Fable's lead in operations isn't about raw capability — it's that the model treats a failed API call as information and immediately routes around the problem, while every other model keeps hammering the same dead end.

Who This Is For

Read if. Skip if.

READ IF YOU ARE…

You're building or evaluating AI agents for real business workflows — lead routing, data reconciliation, Slack monitoring, anything operations-critical.
You've been burned by models that spiral on errors and want data on which one recovers best.
You're trying to decide whether to use an LLM or deterministic code for an operations task and want a grounded benchmark to anchor that decision.
You're following Anthropic's model roadmap and want the first public data on the Mythos architecture.

SKIP IF…

You're evaluating models for creative writing, coding assistance, or general Q&A — this benchmark is narrowly focused on structured business automation.
You need peer-reviewed or independently replicated results — Automation Bench is Zapier's proprietary benchmark, not a public leaderboard.

TL;DR

The full version, fast.

Zapier's Automation Bench tested Claude Fable 5.0 across five business domains and found it scores 27% on operations tasks — a 7-point jump over the previous best, Gemini 3.5 Flash at 20%. The three behaviors that drive the gap: Fable tries a failing endpoint once and immediately pivots to alternate data sources (GPT-55 tried the same endpoint 22 times; Gemini tried 5); it qualifies and routes sales leads more precisely than competing models; and it filters relevant signals from noisy, off-topic Slack channels without pulling in unrelated threads. The video closes with the confirmation that Fable is the first generally available model in Anthropic's Mythos architecture line.

Chapters

Where the time goes.

00:00 – 00:25

01 · Cold open — the real question

Host Wade frames the stakes: not 'is Fable good?' but 'can it actually get work done with multiple tools?'

00:25 – 01:43

02 · Automation Bench explained

Zapier's exec introduces the benchmark: 15 years of real workflow data, five business domains, tested on realistic multi-step automation tasks. Lead routing used as a sample workflow.

01:43 – 02:48

03 · The score: 27% on operations

Fable scores 27% on operations-specific tasks, 7 points above the previous best (Gemini 3.5 Flash at 20%). No other domain exceeds 20%; HR comes closest at exactly 20%.

02:48 – 04:50

04 · Error recovery: experiment, recover

The 'experiment, recover' pattern explained via an HR benefits harmonization example. Fable pivots after one 404; GPT-55 retried 22 times; Gemini retried 5 times.

04:50 – 06:15

05 · Messy environments: Slack monitoring

Fable can extract only the relevant budget alerts from a noisy Slack instance without pulling unrelated threads. Prior models wander and include off-topic context.

06:15 – 07:22

06 · Verdict and open question

Overall: Fable is more precise and more resourceful. Host raises the valid counterpoint: for deterministic tasks, ask whether an LLM is the right tool at all. Closes with Mythos confirmation.

Atomic Insights

Lines worth screenshotting.

Fable scores 27% on operations-specific automation tasks, 7 points above the previous best, with no other domain topping 20%.
When Fable hits a 404 error, it pivots immediately. GPT-55 hammered the same dead endpoint 22 times. Gemini hit it 5 times before giving up.
The benchmark name for Fable's core strength is 'experiment, recover' — try a path, if it fails, immediately try a different path, never retry a known failure.
27% correct on 100 operations tasks means 73 failures. This is still early-stage agentic reliability — but it's the new high bar.
Fable's edge in messy environments is about exclusion: it surfaces only what you asked for and ignores unrelated threads, channels, and context.
Zapier's benchmark is built on 15 years of real business workflow data, making it one of the more grounded operational benchmarks available.
The open question the video leaves unanswered: for deterministic tasks, is an LLM the right tool at all, or would hard-coded workflow logic do the job?
Fable is confirmed as the first generally available version of Anthropic's Mythos-style architecture.
Lead routing failure mode for prior models: they route leads that didn't qualify, or send them to the wrong rep — Fable is consistently more precise on both counts.
HR tasks (benefit harmonization across countries) are the second-best domain for Fable at exactly 20% — same as Gemini Flash's former operations high-water mark.

Takeaway

What 27% actually tells you about AI agents.

WHAT TO LEARN

The gap between models on real automation tasks comes down to one behavior: what happens when something breaks mid-task.

A 27% operations score means 73 failures per 100 tasks — this is still the frontier of agentic reliability, and you should build workflows that account for that base rate.
The experiment/recover loop is the single most important agentic behavior to evaluate: give a model a task with a deliberately broken data source and see whether it retries or reroutes.
Retry count on a known failure is a revealing proxy metric: GPT-55 tried 22 times, Gemini tried 5, Fable tried 1 — the count tells you how much the model has learned about diminishing returns on failed paths.
Messy, unstructured environments expose a different failure mode — inclusion creep, where the model adds context you didn't ask for. Precision matters as much as recall in operations tasks.
Before deploying an LLM for operations work, ask whether the task is actually deterministic — if the logic can be expressed as code, a model adds cost and variance without benefit.
Benchmark provenance matters: Zapier's Automation Bench draws on 15 years of real workflow data, which makes its operational-domain scores more grounded than academic benchmarks for this specific use case.
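The retry-count proxy described above can be computed from an agent's tool-call log. The following is a minimal sketch, assuming a hypothetical log format of (target, status code) pairs; the two logs are made-up illustrations of a pivoting agent and a looping agent, not Automation Bench data.

```python
from collections import Counter

# Hypothetical log entry: (target name, HTTP-style status code).
Call = tuple[str, int]


def failed_attempts(log: list[Call]) -> Counter:
    """Count failed calls (status >= 400) per target."""
    return Counter(target for target, status in log if status >= 400)


# Illustrative logs only: one agent pivots after a single 404,
# the other hits the same dead endpoint five times before moving on.
pivoting_log = [("hr_api/benefits", 404), ("spreadsheet", 200)]
looping_log = [("hr_api/benefits", 404)] * 5 + [("spreadsheet", 200)]

pivoting_count = failed_attempts(pivoting_log)["hr_api/benefits"]
looping_count = failed_attempts(looping_log)["hr_api/benefits"]
print(pivoting_count, looping_count)  # prints: 1 5
```

A lower count on a target that never succeeds suggests the agent reroutes instead of retrying; the threshold you treat as acceptable is a design choice for your own evaluation.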

Glossary

Terms worth knowing.

Automation Bench: Zapier's proprietary benchmark for evaluating AI models on realistic business workflow automation tasks across five domains: sales, marketing, operations, support, and finance.
Experiment, recover: Zapier's label for an AI agent's ability to try an approach, recognize it failed, and immediately pivot to a different method rather than retrying the same failed path.
Mythos: The name for Anthropic's next-generation model architecture, of which Fable 5.0 is described as the first generally available instance.
404 error: An HTTP status code meaning a requested resource could not be found — used here as a proxy for any API endpoint that is temporarily or permanently unavailable.
Lead routing: The process of taking inbound sales leads from multiple sources, qualifying them, and assigning them to the correct sales representative based on rules like availability or skill set.
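To make the model-versus-code question concrete, here is a minimal sketch of lead routing written as deterministic rules. Everything in it is an assumption for illustration: the `Lead` and `Rep` fields, the `min_score` threshold, and the first-available-rep policy are hypothetical and are not taken from Zapier's benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lead:
    email: str
    source: str                       # e.g. "website", "event", "inbox"
    score: int                        # hypothetical qualification score
    needs_skill: Optional[str] = None


@dataclass
class Rep:
    name: str
    skills: set[str]
    available: bool = True


def route(lead: Lead, reps: list[Rep], min_score: int = 50) -> Optional[Rep]:
    """Return the first available rep who fits a qualified lead, else None."""
    if lead.score < min_score:
        return None  # lead did not qualify, so it is not routed
    for rep in reps:
        if not rep.available:
            continue
        if lead.needs_skill is None or lead.needs_skill in rep.skills:
            return rep
    return None


reps = [
    Rep("Ana", {"enterprise"}, available=False),
    Rep("Ben", {"smb"}),
    Rep("Cy", {"enterprise"}),
]
enterprise_rep = route(Lead("a@example.com", "website", 80, "enterprise"), reps)
unqualified = route(Lead("b@example.com", "event", 30), reps)
print(enterprise_rep.name, unqualified)  # prints: Cy None
```

If your routing rules really are this crisp, code like this is cheap and predictable. A model becomes more interesting when qualification depends on messy inputs such as free-text emails.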

Resources

Things they pointed at.

00:25 · tool · Zapier Automation Bench ↗

Quotables

Lines you could clip.

04:10

“The definition of insanity is trying the same thing over and over again, expecting a different result.”

Famous quote landed perfectly in context, immediately relatable, no setup needed → TikTok hook

03:29

“GPT five five would hit the site, get four zero four, then do it again, do it again, do it again. A total of 22 times would get the error, and just keep trying.”

Concrete number with comedic pacing: '22 times' lands as a punchline → IG reel cold open

06:36

“I still think there's an open ended question for this type of use case. Do you wanna use a model, or do you wanna use code or deterministic workflows?”

Rare honest counterpoint in a positive-review video, intellectually credible → newsletter pull-quote

The Script

Word for word.

00:00Oh, Anthropic just released Fable five point o. Everyone's talking about whether it's good or not. Well, you know what?

00:05Here's the big question. Can it actually get work done? Can it actually use multiple tools to get the outcome that you expect?

00:12And I know a man who's been testing this hundreds of times. In fact, he's tested every one of these major models hundreds of times and will tell us how it scores against all those other models. Let's get right into it.

00:23What do you think overall?

00:25This is a strong model. So at Zapier, we have a benchmark, Automation Bench.

00:32We specifically like to test how good these models are at evaluating an AI agent on realistic business workflows.

00:40This is what Zapier has tons of data on. We've been operating for fifteen years. We know how, um, these workflows work in the real world.

00:49What's a workflow that you all test? Like, a good example would be something like lead routing. In any company, you've got leads that are coming in.

00:57They're often coming in from a bunch of different sources. Like, maybe some come from your website. Some come from an event you run.

01:02Maybe someone comes from the CEO's inbox. Maybe someone landed in LinkedIn inbox. So you've got all these leads that are coming in, and you often need to get them into a CRM.

01:10You need to get them scored. You need to get them triaged and sent to your reps. Maybe you wanna, uh, send it to the next available rep.

01:17Maybe you wanna send it to a rep based on a certain skill set that they have. And so we're trying to understand, you know, if you wanna put, uh, a model inside one of those workflows, how good is it gonna be at helping you with various components or the entire task itself?

01:33Okay. Where it's especially good is operations.

01:37What did it score? Yes. Okay.

01:39So here's the kicker. Inside Automation Bench, we test it on domains: sales, marketing, operations, support, finance.

01:47The headliner is it is crushing on operations specific tasks.

01:53So the prior best model for operations specific tasks was Gemini 3.5 Flash. Mhmm.

02:00And it scored at 20%. This new model from Anthropic, Fable, is scoring at 27%, a full seven percentage points improvement over the prior model high, uh, and that is better than any other domain either.

02:15No other domain even touches 20%.

02:18Uh, HR is the closest, which comes in at exactly 20%. And that means, Wade, when I give it a hundred tasks, 27 of them are going to get it right on this model. Previous models, Gemini Flash, 20%, meaning one out of five.

02:34It got it exactly right. Let's talk about a specific example. So

02:38let's actually talk about lead routing. Like, we were just mentioning that. I think that is a good example.

02:42Other models will often fail in a particular way. They'll hand off a lead that didn't quite qualify. It might email the wrong rep and say, hey.

02:50You know, you're giving it to the wrong person. Whereas Fable is gonna more consistently qualify those leads correctly, route them correctly, uh, etcetera, over time.

02:59So that's an example of, like, a type of task that is, like, an op a very operation specific task, and Fable will do a much better job with it. What's another example? Other pattern that we find is that it tends to do well at, like, what we call experimentation and recovery.

03:15So you give it, like, a task that is you you may not tell it how to do the job, and it's able to sort of figure it out. So let's give you an like, an example might be here's one from HR.

03:28You've gotta go harmonize, like, the employee benefits across countries. You gotta pull the benefits plans.

03:33You gotta reconcile them against the policies. Then you gotta go notify all the relevant teams.

03:40Now in this task, say one of the systems let's say the HR system's benefits plan has an API. And when you try and go hit it, it four zero fours. Meaning page is down.

03:48Can't find it. It's not Can't find it. Yeah.

03:50Yep. But elsewhere.

03:52Like, the data is found maybe in a spreadsheet or an inbox or something like that. K. So Fable, how it does on this task, it will go hit that HR endpoint once, it gets the four zero four, and it immediately says, I'm gonna go try a different way.

04:08I'll go check the spreadsheet. I'll go check the inbox and go find the data elsewhere. Now other models, they don't do this.

04:16They'll keep hammering that four zero four endpoint trying to see if they'll get a different result. Uh, what's that old, uh, saying?

04:23Like, the the definition of insanity is trying the same thing over and over again, uh, expecting a different result. We call it experiment,

04:29recover. Experiment, recover. Try a thing.

04:32Does it work? Yes. Great.

04:33Does it not work? Okay. Recover.

04:35Try something else. Uh, and it just does that loop better where these other models will get stuck. The specific example you gave me before we started was GPT five five would hit the site, get four zero four, then do it again, do it again, do it again.

04:46A total of 22 times would get the error, and just keep trying. Gemini did better.

04:52It hit the site, got a failure, stopped, got I mean, or continued again, got a failure, continued again, did it a total of five times before saying, okay, I will find another way, and then it went and found another way. So we're looking here at a system that really knows how, when there's an error, when there's a roadblock, how to recover from it.

05:10Um, how about one last thing that it does especially well? It operates

05:14better in, like, messy environments. So let's consider a Slack instance.

05:20In most companies, there's a channel for sales, a channel for marketing, a channel for engineering, a channel for all sorts of different stuff. And the channels don't always stay on topic. Uh-huh.

05:29And, you know, not everyone uses threads the right way. Not everyone does this the same way, etcetera. Maybe you want it to go flag any off track ad budgets, and you could say, hey.

05:38All the ad budgets are listed in this alerts channel. We have this alerts channel set up, and I wanna specifically know about anyone that is, like, flagged for, like, being over budget, way under budget, etcetera, etcetera, etcetera. Now this, you know, Fable, again, will do much better at this task.

05:53It will go track the very specific ones that are hitting those alerts, summarize those, and report them back to you. The failure pattern that we see a lot of times with this is that it won't stay on task.

06:06Uh, it will you know, prior models will find maybe maybe it'll find all the ones that do qualify, but it will go wander off, and it'll include other context. It'll include other things. Maybe it pulls an unrelated task that was inside that same channel.

06:19Maybe it goes and looks at a different channel. Maybe it looks in a different thread, and it includes that in the response. And we've probably all experienced this before where we see a response from a model, and you go, what is this?

06:29Like, what are you talking about here? This doesn't relate to my query at all. Okay.

06:33So overall, it's more precise. It stays on track. Uh, it's more resourceful.

06:38It recovers from its own failures quicker, um, compared to these other models. Uh, Yeah.

06:44I still think there's an open ended question for this type of use case. Do you wanna use a model, or do you wanna use code or deterministic workflows for this type of task? I think that's something you should be really asking yourself.

06:56But there are definitely operations use cases where

06:59you're gonna wanna use an LLM, and this is a good one to use for that. Wait. By the way, do you think this is is this a version of Mythos?

07:05Is this Mythos Lite?

07:07I this so my understanding is this is a Mythos style model. Uh, I'm sure Anthropic is gonna share has a lot more details out there soon, but this is the first available version of Mythos, um, that's generally available.

The Hook

The bait, then the rug-pull.

The question no one was answering about Claude Fable 5.0 wasn't whether it benchmarked well on evals — it was whether it could actually execute a real business workflow start to finish. Zapier's founder brought the data: 600+ test runs, five operational domains, and three specific failure modes that separate the models that finish from the ones that loop.

Frameworks

Named ideas worth stealing.

04:10 · model

Experiment, Recover

Try a path
Check if it worked
If not: immediately pivot to an alternative

Zapier's framing for agentic error handling — the loop that separates models that finish tasks from models that stall. Named after the behavior Fable executes better than any tested alternative.

Steal for: Any agentic workflow design. Frame your retry logic as 'experiment, recover', not 'retry with backoff'.
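The three steps above can be sketched as a small fallback loop. This is a minimal illustration under stated assumptions, not Zapier's or Anthropic's implementation: `SourceUnavailable`, `experiment_recover`, and the three stand-in sources (`hr_api`, `spreadsheet`, `inbox`) are hypothetical names modeled on the HR benefits example from the video.

```python
from typing import Callable


class SourceUnavailable(Exception):
    """Raised by a source that cannot serve the request (e.g. an HTTP 404)."""


# A named source: (label, zero-argument function that returns the data).
Source = tuple[str, Callable[[], dict]]


def experiment_recover(sources: list[Source]) -> tuple[str, dict, list[str]]:
    """Try each source once, in order; never retry one that already failed."""
    tried: list[str] = []
    for name, fetch in sources:
        tried.append(name)
        try:
            data = fetch()          # experiment: try this path
        except SourceUnavailable:
            continue                # recover: pivot to the next path
        return name, data, tried
    raise RuntimeError(f"all sources failed: {tried}")


# Hypothetical stand-ins for the HR API, a spreadsheet, and an inbox.
def hr_api() -> dict:
    raise SourceUnavailable("404 from benefits endpoint")


def spreadsheet() -> dict:
    return {"plan": "standard"}


def inbox() -> dict:
    return {"plan": "from-email"}


winner, data, tried = experiment_recover(
    [("hr_api", hr_api), ("spreadsheet", spreadsheet), ("inbox", inbox)]
)
print(winner, tried)  # prints: spreadsheet ['hr_api', 'spreadsheet']
```

The design choice worth noting is that the loop records what it tried, so a failed path is visited exactly once. Retry with backoff would still make sense for errors you expect to be transient, which a 404 usually is not.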

01:43 · list

Automation Bench Domains

Sales
Marketing
Operations
Support
Finance

The five operational domains Zapier uses to categorize and score AI model performance on real business automation tasks.

Steal for: Structuring an AI agent benchmark or capability matrix for a specific business context

CTA Breakdown

How they asked for the click.

MENTIONED ON CAMERA

00:25 · tool · Zapier Automation Bench ↗

FROM THE DESCRIPTION

PRIMARY CTA: Where the creator wants you to go next.

zapier.com ↗

Storyboard

Visual structure at a glance.

00:00 · hook · hook
00:25 · context · benchmark
01:43 · value · score
02:48 · value · error recovery
04:50 · value · messy slack
06:15 · cta · verdict
