Modern Creator
Nate Herk | AI Automation · YouTube

I Battle Tested Sakana Fugu's Fable Killer

38 tasks, 4 waves, one honest verdict: 5x the cost, 4.5x the wait, and a tie on quality.

Posted: 2 days ago
Format: Tutorial, educational
Views: 70.3K
Likes: 1.7K
Big Idea

The argument in one line.

Multi-model orchestration APIs can match frontier single-model performance on most tasks, but the 5x cost and 4.5x latency overhead make them a losing trade for knowledge workers until the economics improve.

Who This Is For

Read if. Skip if.

READ IF…
  • You saw the Sakana AI announcement go viral and want a real benchmark before spending money on a $100-200/month subscription.
  • You already use Claude Code or Codex and are wondering whether a cross-provider orchestration layer would meaningfully improve your output.
  • You care about AI model unit economics and want a concrete cost-vs-quality data point for multi-agent orchestration.
SKIP IF…
  • You want deep academic analysis of multi-agent architectures -- this is a practitioner benchmark, not a research paper.
  • You are building large-scale software products with multiple teams; the reviewer explicitly notes his tests do not cover that scenario well.
TL;DR

The full version, fast.

Sakana Fugu Ultra is a single API that internally routes tasks to a pool of frontier models (Claude, GPT, Gemini) via a small manager model, then merges outputs -- essentially automated multi-agent orchestration. Across 38 AI-graded tasks covering puzzles, traps, specs, and heavy algorithms, Fugu tied Opus 4.8 on 36 and lost 2, while costing $53.60 vs $10.66 and taking 357 minutes vs 80 minutes. The orchestration concept is real and worth watching, but for individual knowledge workers the numbers do not justify switching today.
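The headline multipliers follow directly from the totals quoted above. A minimal sketch of the arithmetic, using only the figures reported in this breakdown (the variable names are illustrative):

```python
# Totals reported for the 38-task benchmark.
fugu_cost, opus_cost = 53.60, 10.66    # USD
fugu_minutes, opus_minutes = 357, 80   # total wait time in minutes

cost_ratio = fugu_cost / opus_cost
time_ratio = fugu_minutes / opus_minutes

print(f"cost: {cost_ratio:.2f}x, time: {time_ratio:.2f}x")
```

The ratios land at roughly 5x on cost and roughly 4.5x on time, which is where the "5x the cost, 4.5x the wait" summary comes from.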

Chapters

Where the time goes.

00:00–00:59

01 · Hook: viral claim + live demo

Opens on the Sakana announcement tweet, then immediately shows a live YouTube analytics dashboard built with one /goal prompt in Claude Code running Fugu Ultra.

00:59–03:19

02 · How Fugu works

Explains the architecture: a small manager model routes to Claude, GPT, Gemini -- one API, automated delegation. Introduces benchmark claims vs. Fable and Mythos.

03:19–05:45

03 · Orchestration 101

Breaks down orchestration into two questions: who does each part, and how do you combine outputs. Draws parallels to Claude Code sub-agents and manual multi-model workflows.

05:45–06:43

04 · Fugu vs. Fusion API

Contrasts Fugu's sequential delegation with OpenRouter's parallel Fusion API. Both improve quality; both cost more.

06:43–07:57

05 · Pricing and subscription reality

Shows the Sakana billing dashboard: $200/month plan, 34% of the weekly limit used during testing. Notes that it burns through limits faster than a Claude Code subscription.

07:57–11:56

06 · The 38-task benchmark

Presents full test results: 4 waves, Codex grading, pass-fail scoring. 36 ties, 2 Opus wins, 0 Fugu wins. Cost: $53.60 vs $10.66. Time: 357 min vs 80 min.

11:56–12:15

07 · Verdict and outlook

Will not use it for knowledge work. Sees the future in model-routing efficiency. Watching Sakana but sticking with Claude Code and Codex for now.

Atomic Insights

Lines worth screenshotting.

  • Fugu Ultra tied Opus 4.8 on 36 of 38 tasks and lost the other 2 -- it never won a single task.
  • Orchestration multiplies quality only when tasks genuinely benefit from specialist routing; most knowledge-work tasks do not.
  • 5x more expensive and 4.5x slower is a bad trade when quality is identical -- the benchmark math is simple.
  • A 6-second Opus answer took Fugu several minutes on the same easy prompt -- latency compounds badly on short tasks.
  • Fugu is not a better LLM; it is a manager model that delegates to existing LLMs including Opus itself.
  • Roughly 60% of the demo dashboard Fugu built was probably handled by Opus 4.8 internally.
  • The OpenRouter Fusion API uses parallel answering and judge merging; Fugu uses sequential delegation -- two different orchestration bets.
  • Fugu does not consume the Claude Code context window the same way, which could matter for very long autonomous runs.
  • Knowing which model to use for which task type is already a skill most heavy AI users practice manually every day.
  • The viral benchmark that sparked this showed benchmark wins -- but benchmarks are not the same as real-world latency and cost.
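The parallel "everyone answers, a judge merges" pattern attributed to the Fusion API above can be sketched as follows. This is an illustration only, not OpenRouter's actual API: `model_a`, `model_b`, `model_c`, and `judge` are hypothetical stand-ins for real model calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real model calls; each just tags its input.
def model_a(prompt):
    return f"A({prompt})"

def model_b(prompt):
    return f"B({prompt})"

def model_c(prompt):
    return f"C({prompt})"

def judge(answers):
    # Placeholder merge step; in a real system this would be another LLM call.
    return " | ".join(answers)

def fan_out_and_judge(prompt, models):
    """Send the whole prompt to every model at once, then merge the answers."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        # pool.map returns results in the same order as `models`.
        answers = list(pool.map(lambda m: m(prompt), models))
    return judge(answers)

print(fan_out_and_judge("q", [model_a, model_b, model_c]))
```

Unlike sequential delegation, nothing here splits the task: every model sees the full prompt, so the bill scales with the number of models while the wait is set by the slowest one plus the judge.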
Takeaway

When to trust a viral AI benchmark.

WHAT TO LEARN

A benchmark win and a real-world win are different things -- speed, cost, and integration friction are the hidden variables that benchmarks ignore.

  • Benchmark performance tells you what a model can do under ideal conditions; it does not tell you how long you will wait or what you will pay across 38 real tasks.
  • Multi-model orchestration works by routing subtasks to specialist models and merging outputs -- the same pattern most heavy AI users already do manually when they switch between Claude and GPT.
  • The 5x cost and 4.5x latency overhead of automated orchestration matters most for short tasks; on a 6-second question, waiting several minutes is an unacceptable trade regardless of quality parity.
  • Knowing which model is cheapest for each task type without sacrificing quality is a practical skill that compounds over time as AI pricing evolves.
  • The viral claim was technically accurate (benchmark wins are real) but contextually incomplete -- a quality tie at 5x the cost is not a win for most individual users.
  • Orchestration APIs that avoid consuming context window may offer a structural advantage for very long autonomous agent runs, entirely separate from the quality question.
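The "cheapest model that does not sacrifice quality" idea from the list above can be expressed as a small selection rule. This is a sketch under assumptions: the `Candidate` type, the `tasks_not_lost` score (ties plus wins in the head-to-head), and the function name are all hypothetical; only the costs and the 36-tie, 2-loss result come from the breakdown.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cost_usd: float      # total cost across the 38 tasks
    tasks_not_lost: int  # ties + wins in the head-to-head (assumed quality proxy)

def cheapest_without_quality_loss(candidates, tolerance=0):
    """Keep candidates within `tolerance` of the best score, then pick the cheapest."""
    best = max(c.tasks_not_lost for c in candidates)
    good_enough = [c for c in candidates if c.tasks_not_lost >= best - tolerance]
    return min(good_enough, key=lambda c: c.cost_usd)

candidates = [
    Candidate("Fugu Ultra", 53.60, 36),  # tied 36, lost 2
    Candidate("Opus 4.8", 10.66, 38),    # tied 36, won 2
]
print(cheapest_without_quality_loss(candidates).name)
```

With these figures the rule picks Opus 4.8 at any tolerance, since it is both cheaper and never behind on quality; the rule only becomes interesting when a cheaper model trails slightly.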
Glossary

Terms worth knowing.

Sakana Fugu
A multi-agent orchestration system from Sakana AI (Japan) that presents as a single model API but internally routes subtasks to a pool of frontier LLMs and merges their outputs.
Manager model
A small model trained solely to break a user request into subtasks and route each to an appropriate specialist LLM, then combine the results.
OpenRouter Fusion API
OpenRouter's feature that sends a prompt to three models simultaneously rather than sequentially, then uses a judge model to merge the three responses into one answer.
Unit economics
The cost-per-unit calculation for AI usage -- here, dollars per task and minutes per task -- used to evaluate whether a higher-quality or more complex system is worth its overhead.
Quotables

Lines you could clip.

09:14
This is not a smarter model. It's just a manager.
One-line verdict that reframes the entire viral claim. No setup needed. (TikTok hook)
07:57
Fugu was 4.5 times slower overall and five times more expensive. So if you're getting roughly the same results, why would you wanna wait longer and pay more?
The cost and speed punchline with a rhetorical close -- clips standalone. (IG reel cold open)
10:38
Understanding how to play with the unit economics to understand what is the cheapest model that I can use for this task that doesn't sacrifice quality -- I imagine that becoming a very, very important skill.
Forward-looking insight, quotable for AI strategy content. (Newsletter pull-quote)
The Script

Word for word.

00:00 Introducing Sakana Fugu. Our Fugu Ultra model matches the performance of Fable and Mythos, delivering frontier capability without the risk of export controls. So, obviously, we had to put that to the test.
00:11 You can see right here in my Claude Code I am running on Fugu Ultra 1,000,000. You can see the task list that's carrying over is about our dashboard functionality and visuals. So what I did is I gave it this /goal prompt, which you guys can read real quick if you want to.
00:24 And this was just me, you know, talking into Glido. And then after almost an hour, I got this dashboard back, which is honestly really impressive. It can refresh live data.
00:32 If I click on that button, you can see my stats. You can see the audience pulse. You can see my distribution of performance, median, outlier, stuff like that.
00:39 I can see what's working. I can see what's rising. I can see what's underperforming.
00:42 I can see recommendations. Not only am I getting real data on my YouTube dashboard, but I'm also getting AI analysis.
00:49 I can look at videos and see specific metrics for each one. I can look at audience and comments. I can look at strategy.
00:54 And all of this was a one-shot /goal prompt with Fugu Ultra. So this is brought to us by sakana.ai, which is a Japanese company.
01:03 Sakana means fish, and that's kinda what you can see right here. We've got a bunch of little fish coming in to make one larger fish, and that's basically how the Fugu model works. It is a multi-agent system delivered as one model.
01:15 So, basically, we have one API to hit, and that API orchestrates and routes to different frontier models. So Opus, GPT, Gemini, and probably some more.
01:24 Fugu achieves superior performance by dynamically coordinating and orchestrating a diverse pool of powerful models. So honestly, it's really nothing new. It's basically a main agent orchestrating a bunch of sub-agents that have different models and different specialties.
01:37 This basically just wraps it up for you pretty nice. You know, I've always talked about when you're working with AI models, you always wanna think about it as each AI does one thing really well. One very specific thing, and that is how you achieve great results by chaining those outputs into the next.
01:51 And that's how they're able to achieve some of these benchmarks here as you can see where we have Fugu outperforming Fable, outperforming GPT 5.5, outperforming Opus 4.8 because it's orchestrating these models together. So just to be clear, Fugu is not its own large language model that's better than Fable.
02:06 It's saying that they were able to achieve better results than Fable and Mythos Preview on certain benchmarks by orchestrating models together. But this announcement tweet in one day went super viral, so, obviously, we had to check it out. So I ended up running Fugu Ultra versus Claude Opus 4.8 across 38 different tests, and I'm about to dive into what I actually found and the way that I feel about how I'd use this and stuff.
02:27 So let's just dive into today's video. So first of all, yes, I was able to use Fugu Ultra inside of Claude Code. If you guys wanna know exactly how, I'm going to attach a markdown file in my free Skool community that you can give to Claude Code, and then all you have to do is give it that markdown file, give it your API key, and you'll be up and running.
02:41 So my free Skool community is linked in the description. Join the free Skool, go to the classroom, go to all YouTube resources, and you will find all of my resources in there for free. This was that first main example that I showed you guys.
02:51 Right? So Fugu built all of this, but it obviously used a combination of GPT, probably a little bit of Gemini for some design stuff. It used GPT 5.5 and maybe some other frontier models inside of that.
03:01 And so, basically, what's happening is we're hitting one single API where we have a small manager model, which is literally trained just to break down a task and then hand it off to a bunch of different AI models. So we ask the question to the conductor, and the conductor outsources to specialists. So Claude maybe for writing, GPT maybe for coding and bug fixes, and Gemini for research and facts.
03:21 And there are also other models that it will delegate to based on the complexity of the task. And it's all about orchestrating who decides who does the work.
03:29 On the left side of the spectrum, you write everything. You decide who does what. And on the far right end, you have a model that's automatically doing all of this.
03:36 And the interesting thing is this pattern is nothing new. You already are doing this probably every day without even realizing or you're realizing, but it's essentially the same way that Claude Code spins up sub-agents or dynamic workflows and outsources, breaks up a plan, and outsources things to a Haiku worker or a Sonnet worker or maybe a bunch of more Opus workers.
03:54 Except for instead of Claude Code orchestrating across Claude models, we have Fugu Ultra orchestrating across different models.
04:01 And that's where you can get really powerful and have this sort of mixture of experts here because we all know that GPT has some strengths over Opus and vice versa. And so you're able to just get the best of I was gonna say both worlds, but the best of all the worlds because Fugu is kind of trained to understand what models are good at what and where to delegate where.
04:18 So that's all this is. It's just a really smart orchestration API. It wraps it up for you nicely.
04:23 And so if a lot of you guys are already working with Codex and Claude Code and other models on a same code base, that's basically this, except for you're just getting a little bit less manual. It's happening automatically, the delegation. And orchestration is really just two small questions.
04:35 The first question is who does each part? So if we have job x and we have tasks a, b, and c, maybe model a does task a, maybe model b does task b, and so on. And then we also have the combination.
04:47 So once all of the individual models give us some sort of response, we need obviously another LLM to combine everything together and then present that answer back to us. So it's very very similar to the way that you already work with your Claude Code sub-agents. Now that's basically it.
04:59 This whole HTML I'm also going to include for free in my free Skool community if you wanna dig in a little bit deeper. Yes. I had Fugu build this entire HTML just so you guys know.
05:08 But here's something else that's kind of interesting. I was thinking, okay. This is pretty similar to the OpenRouter Fusion API, if you guys saw when that drops the other week, but it's actually different.
05:16 Because the Fusion API, what that does is it sends your prompts to three models all at once. So it doesn't break up the task and delegate. It just says, hey, all three of you guys answer this.
05:25 And then it comes back with a judge that merges all three. And that alone has also shown to improve the results and the quality that you're getting. So moral of the story is when you're getting different perspectives and you're getting different large language models processing your stuff, you're probably gonna get better results, but what is that at the cost of?
05:41 Typically speed and actual money, so actual cost. And this stuff is certainly not cheap.
05:46 I went for a $200 a month plan, and I filled up my five hour window, and I'm at 34% of my weekly limit. Now I upgraded about halfway through from a $100 a month to $200 a month.
05:56 So roughly on the $100 a month plan, if I was to fill up three five hour windows, I probably would have hit my weekly already. So this did fill up very quick, much quicker than my Claude Code subscription fills up. But if you guys wanna get started, you go to sakana.ai, and then in there, you can go ahead and sign up for an account.
06:14 You can get on a subscription, but you could also do pay as you go API billing as you see here. Now there is Fugu, and then there's Fugu Ultra. I only have tested with Fugu Ultra just for transparency for today's video.
06:27 So let's take a look at the results. So what I did in here is I had Codex create a bunch of tests just so there was no, like, bias or contamination or whatever you wanna call it.
06:35 And then I did this all through API billing, and I had Opus 4.8 go through all the tests, and I had Fugu go through all the tests. Now keep in mind, Opus 4.8 is one of the models that Fugu chooses from.
06:49 So it's really interesting. Like, when you see something like, you know, this, which I built, maybe 60% of this was built with Opus 4.8. And, you know, these HTML decks, these were also probably built maybe majority by Opus 4.8.
07:01 So just something to keep in mind. But overall, we had 36 of the 38 tasks end in a tie. We had Fugu being 4.5 times slower overall and five times more expensive.
07:11 So if you're getting roughly the same results, why would you wanna wait longer and pay more? You probably don't. And I'll get into the actual results in a sec, but I was hopping in here with my actual Claude Code, and I was running Fugu through a ton of my regular stuff, my skills, doing research, and it was able to use them fine.
07:27 So it worked in the harness of Claude Code just fine. It just felt really slow. The other thing is it wasn't filling up the context window.
07:33 So I could be talking for, you know, twenty, thirty rounds, and this would stay at zero. Because of the way that you're kind of playing with how you actually route this to Fugu's server to get the responses back, I'm not gonna dive into it right now. That's part of what that markdown file includes.
07:48 So right here, you can see I dove into, you know, how this actually works, and there's a bunch of stuff I created. But it's not as simple as the way you connect to, like, GLM 5.2, for example, where you just change the endpoint and add your API key.
07:59 It's just a little bit different. So didn't wanna dive into that now, but markdown file in my free Skool. But, anyways, the point I was trying to make there is this felt pretty solid for my knowledge work.
08:08 I mean, it's literally 4.8 and GPT 5.5. How can that not be solid for knowledge work? It just felt really, really slow.
08:14 So the point I'm trying to make here is I don't think I'm gonna use this. I'll probably keep testing it a little bit, but for me, for my knowledge work, I don't need this at all. I'm gonna get more out of my Codex subscription and my Claude Code subscription, but I also don't do heavy software development.
08:27 I'm not building products. I'm not working with tons of teams, you know, all working on the same code base. And if you were, maybe there is a lot of benefit to using Fugu because you've got that, you know, GPT reviewer and the Claude Code, you know, planner or ideator all in one API, and I could see a lot of value in that.
08:45 So take this analysis with a grain of salt. I didn't push this through tons and tons and tons of code refactors and stuff like that. I ran a bunch of AI-created assessments.
08:54 So once again, this is not a smarter model. It's just a manager. You can see here by the graphic how that all works.
08:59 We did 38 tasks across four waves. We had puzzles. We had traps.
09:03 We had specs. We had heavy algorithms, and then we had Codex grading all of this stuff. And both models were basically fed the same prompts and the same inputs, and we were just grading the outputs.
09:12 So this is really interesting. Right? Overall, they were basically always tying.
09:16 A lot of these were designed to be pass fail rather than a score just because I wanted to make this hopefully as objective as possible, but they were basically tying every time except for two of the times Opus won, which once again, it's interesting if you think about the fact that Opus 4.8 is available within the Fugu Ultra.
09:30 But what was way more interesting to me is how long I had to wait. So I had to wait in total for all of the Fugu runs, 357 minutes, whereas for Opus, I was waiting total of 80 minutes across these 38 different tasks. And what was really interesting is some of these very easy ones, Opus answered in six seconds, whereas Fugu would take multiple minutes for that same thing that took Opus six seconds.
09:52 And then on the cost side, Fugu was very expensive. It was running, you know, Opus and GPT and probably some other models, and it was just running so long that it was just costing way more. Across all of these calls, Opus only cost us about $10, whereas Fugu costed me $50, so about five times more expensive.
10:09 I was definitely expecting this to be more expensive. Like, I imagine the Fusion API from OpenRouter is also more expensive. I just wasn't expecting five times more.
10:18 Or if I was getting that higher cost, I was expecting better quality, which I did not feel based on my use cases and based on my experiments. So I wanted to come in here and make this video because when you see something like this, 15,000,000 views, and you see, hey. This matches the performance of Fable and Mythos, everyone's freaking out, everyone's gonna wanna try it.
10:36 And so, obviously, I wanted to try it. And so my honest takeaway is, for my use cases, Fable noticeably felt better than Opus 4.8, but Fugu Ultra does not noticeably feel better than Opus 4.8.
10:48 I really do think that this is amazing, these metrics that they were able to get. And I do think that this is the future, not locking yourself into one provider, understanding how to play with the unit economics to understand what is the cheapest model that I can use for this task that doesn't sacrifice quality.
11:03 I imagine that type of stuff becoming a very, very important skill as we continue into the future of AI. I just don't think that this is the answer for me yet. I think that they're onto something awesome, and I think that, you know, OpenRouter clearly pushed out that API for a reason because that's something that I try to do manually for myself.
11:18 I try to use Codex and Claude Code together all the time based on what I feel like their strengths and weaknesses are. And think about the fact that when Fable does come back, if and when it comes back, and all of these models starting to probably maybe get more expensive, really being able to be a master at optimizing the efficiency is going to be a super important skill and a super important thing to be thinking about all the time.
11:40 So that is my takeaway after I have battle tested Fugu. That is what I think, but I'm definitely gonna be keeping my eye on Sakana AI Labs and this whole idea of orchestration. So I know that this one was super quick, but hopefully you guys enjoyed or you learned something new.
11:54 And if you did, please leave a like. It helps me out a ton. Don't forget to hop into our free Skool community.
11:58 We are pushing way past 400,000 members. This has just been an awesome ride, and you can grab every single free resource ever in here. GitHub repos, skills, markdown files, resource guides, whatever it is.
12:08 Grab it in here for free. And that's gonna do it for today. So I appreciate you guys making it to the end of the video, and I'll see you on the next one. Thanks, guys.
The Hook

The bait, then the rug-pull.

Sakana AI's announcement of a multi-agent orchestration API, claimed to match or beat Fable and Mythos on key benchmarks, went viral within a day and drew 15 million views. So he ran 38 tests to find out whether the claim holds up in real-world use.

Frameworks

Named ideas worth stealing.

04:25 · model

Orchestration = Two Questions

  1. Who does each part? (routing)
  2. How do you combine outputs? (merging)

Reduces any multi-model orchestration system to two design decisions: the routing policy and the combination/judge layer.

Steal for: explaining any multi-agent system design
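The two questions map onto two small functions. A minimal sketch, assuming a fixed routing table: the `ROUTES` pairings echo the examples given in the video (Claude for writing, GPT for coding, Gemini for research), while `call_model` and the other function names are hypothetical stand-ins rather than Sakana's implementation.

```python
# Hypothetical routing table: task type -> specialist model.
ROUTES = {"writing": "claude", "coding": "gpt", "research": "gemini"}

def call_model(model, subtask):
    # Stand-in for a real API call to the chosen model.
    return f"[{model}] {subtask}"

def route(subtasks):
    """Question 1: who does each part?"""
    return [(ROUTES[kind], text) for kind, text in subtasks]

def merge(outputs):
    """Question 2: how do you combine outputs?"""
    return "\n".join(outputs)

def orchestrate(subtasks):
    assignments = route(subtasks)
    outputs = [call_model(model, text) for model, text in assignments]
    return merge(outputs)

print(orchestrate([("writing", "draft intro"), ("coding", "fix bug")]))
```

In a system like the one described, both `route` and `merge` would themselves be model calls; the static table and string join here only make the two decision points visible.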
CTA Breakdown

How they asked for the click.

VERBAL ASK
11:56 · next-video
Don't forget to hop into our free Skool community. We are pushing way past 400,000 members.

Soft community plug with a specific member count milestone. Free markdown setup file offered as lead magnet throughout the video.

Storyboard

Visual structure at a glance.

  • 00:00 · hook · viral tweet hook
  • 00:22 · proof · live dashboard demo
  • 01:39 · value · Fugu architecture
  • 04:25 · value · orchestration model
  • 06:47 · value · billing reality
  • 07:57 · value · benchmark results
  • 09:14 · value · the manager never won
  • 11:56 · cta · verdict + CTA