Big Idea

The argument in one line.

Claude Desktop's gateway mode lets you substitute any OpenAI-compatible local model for the paid Anthropic API, making Claude Code's multi-agent workflows available with no API bill.

Who This Is For

Read if. Skip if.

READ IF YOU ARE…

You want to try Claude Code Dynamic Workflows without committing to an Anthropic API subscription.
You already run LM Studio or Ollama locally and want to pipe those models into Claude's agentic interface.
You are exploring ways to reduce AI tooling costs while keeping access to multi-agent orchestration.
You are technically comfortable but new to connecting the Claude Desktop gateway to LM Studio.

SKIP IF…

You need production-grade speed or reliability -- local Gemma is noticeably slower than the hosted API, especially at ingesting large prompts.
You want a deep-dive on prompt engineering or agent architecture rather than a click-by-click setup tutorial.

TL;DR

The full version, fast.

Claude Desktop ships a gateway mode that routes inference to any OpenAI-compatible endpoint instead of the Anthropic API. By pointing it at LM Studio's local server and loading a Gemma model aliased as claude-opus-4.8, you satisfy Claude's model discovery without an Anthropic account. The Dynamic Workflows feature lets a single /deep-research command run up to 16 agents concurrently, with a ceiling of 1,000 agents per run, all on your local machine. The main tradeoff is prefill latency: Claude Code's first message includes roughly 30,000 tokens of system context, and local models are slower at ingesting that than at generating responses.
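To make the model-alias requirement concrete, here is a minimal Python sketch of the kind of check the video describes. It assumes the gateway exposes an OpenAI-style /models listing and that discovery is a case-insensitive substring match on the family names; the video only says the ID must contain Sonnet, Opus, or Haiku, so the exact rule is a guess. The function names and the sample model IDs are hypothetical.

```python
import json
import urllib.request

# Assumption: discovery is a case-insensitive substring match on these
# family names. The source only says IDs must contain one of them.
CLAUDE_FAMILIES = ("sonnet", "opus", "haiku")


def usable_model_ids(models_payload: dict) -> list[str]:
    """Return IDs from an OpenAI-style /models payload that name a Claude family."""
    ids = [entry["id"] for entry in models_payload.get("data", [])]
    return [mid for mid in ids if any(f in mid.lower() for f in CLAUDE_FAMILIES)]


def fetch_models(base_url: str, api_key: str) -> dict:
    """GET {base_url}/models with a bearer token (needs the local server running)."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Offline example: a payload shaped like an OpenAI-compatible /models response.
# Both model IDs are illustrative placeholders.
sample = {"data": [{"id": "gemma-4-26b"}, {"id": "claude-opus-4.8"}]}
print(usable_model_ids(sample))  # ['claude-opus-4.8']
```

With the LM Studio server running, fetch_models could be pointed at the URL copied from the Developer tab (commonly http://localhost:1234/v1, though your port may differ) with the placeholder key lm-studio, and the result passed to usable_model_ids to see which loaded models would survive the filter.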

Chapters

Where the time goes.

00:00 – 00:55

01 · Hook + gateway overview

Introduces Claude Dynamic Workflows, names the two backend options (LM Studio local, OpenRouter cloud), and frames the cost angle.

00:55 – 02:00

02 · Claude docs: Cowork on 3P

Walks through the official setup guide. Step 1: download Claude Desktop. Step 2: do not sign in with an Anthropic account. Then enable developer mode from Help > Troubleshooting.

02:00 – 02:28

03 · Configure third-party inference

In Claude Desktop settings: set connection type to gateway, credential kind to static API key.

02:28 – 04:22

04 · LM Studio overview

Download, model search interface, GPU compatibility green-tick indicator, three setup steps.

04:22 – 05:14

05 · Developer mode + gateway URL

Enable developer mode in LM Studio, open Developer tab, copy local server URL, paste into Claude gateway base URL field.

05:14 – 06:46

06 · The model-alias trick

Claude's model discovery rejects model IDs that do not contain Sonnet, Opus, or Haiku. Fix: set the model's API identifier to claude-opus-4.8 in LM Studio's load settings.

06:46 – 07:30

07 · Disable built-in tools + BraveSearch MCP

Local models lack built-in web search. The "disable built-in tools" toggle turns off the native tools so Claude looks for MCP connections instead.

07:30 – 08:21

08 · Sign-in flow

If already signed in to Claude, sign out first to see the gateway login screen. Continue without account.

08:21 – 09:53

09 · Adding MCP via config file

Copy the NPX install snippet from the BraveSearch docs, then ask Claude to merge it into the existing claude_desktop_config.json.

09:53 – 10:52

10 · Dynamic Workflows explained

16 concurrent agents, 1,000 total per run. /deep-research is a bundled slash command. Live business-plan demo.

10:52 – 12:15

11 · Live demo + prefill/decode explainer

Agents running in real time on Gemma 26B. Explains why first message is slow (30K tokens). Suggests leaving the computer for 1-2 hours.

Atomic Insights

Lines worth screenshotting.

Claude Desktop has an official gateway mode that swaps the Anthropic API for any OpenAI-compatible endpoint, including a localhost LM Studio server.
Claude's model discovery rejects model IDs that do not contain Sonnet, Opus, or Haiku -- renaming your local model to match is the entire fix.
Dynamic Workflows cap at 16 concurrent agents and 1,000 total agents per run.
The /deep-research slash command in Claude Code is a pre-built Dynamic Workflow -- no custom JavaScript scripting required.
Local model latency is front-loaded -- the first response is the slowest because Claude Code sends roughly 30,000 tokens of system context with the first message.
Disabling built-in tools in gateway settings forces Claude to use MCP connections for web search instead of its native tools.
BraveSearch MCP still needs its own free API key -- the MCP handles the interface but not the credential.
According to the video, Ollama works in place of LM Studio in this setup, using the same gateway slot.
OpenRouter plugs into the same gateway slot and gives access to hundreds of cloud models; the video claims savings of just over 98 percent even on the paid ones.
A paid Claude.ai subscription coexists with gateway mode -- switching to local inference does not conflict with the paid plan.
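The two Dynamic Workflows limits above (16 agents at once, 1,000 per run) are easy to conflate, so here is a small Python sketch of how such a pair of caps behaves. This is an illustration only: the video describes the real orchestration as a JavaScript script written by Claude, and nothing here reflects its actual implementation. The agent body is a placeholder and all names are hypothetical.

```python
import asyncio

MAX_CONCURRENT = 16   # agents allowed to run at the same time (figure from the video)
MAX_TOTAL = 1000      # agents allowed per run (figure from the video)


async def run_agents(task_count: int) -> dict:
    """Run placeholder agents under both caps and report what happened."""
    if task_count > MAX_TOTAL:
        raise ValueError(f"{task_count} agents exceeds the per-run cap of {MAX_TOTAL}")

    gate = asyncio.Semaphore(MAX_CONCURRENT)
    running = 0
    peak = 0

    async def agent(index: int) -> int:
        nonlocal running, peak
        async with gate:                # waits here while 16 agents are active
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0)      # stand-in for real agent work
            running -= 1
            return index

    results = await asyncio.gather(*(agent(i) for i in range(task_count)))
    return {"completed": len(results), "peak_concurrency": peak}


stats = asyncio.run(run_agents(100))
print(stats)
```

The semaphore plays the role of the office in the video's analogy: only 16 workers fit inside at any moment, while the total-count check limits how many can pass through over the whole run.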

Takeaway

How to run Claude agents without an Anthropic bill.

WHAT TO LEARN

Claude Desktop gateway mode is an official feature that lets you substitute any local model for the Anthropic API, and one naming convention is the only non-obvious requirement.

Claude Desktop's model-discovery filter only accepts model IDs containing Sonnet, Opus, or Haiku -- renaming your local model to match this pattern before loading it is the single step most tutorials skip.
Dynamic Workflows are capped at 16 concurrent agents and 1,000 total per run, which is more than enough for most research, analysis, and code-generation tasks running locally.
The /deep-research slash command is already bundled into Claude Code -- no scripting is required to access multi-agent behavior; you just type the command.
Local model latency is front-loaded: the first response in a Claude Code session is the slowest because the system context runs around 30,000 tokens, and prefill speed is where local hardware trails hosted inference most noticeably.
BraveSearch and other web-search providers require their own API keys even when connected through an MCP -- the MCP provides the interface but not the credential.
OpenRouter is a drop-in alternative to LM Studio in this same setup and gives access to hundreds of cloud models including free tiers and paid options at a significant discount over direct API pricing.
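For the BraveSearch MCP step, the video pastes the NPX snippet and the existing configuration file into a Claude session and asks it to combine them. The same merge can be sketched in a few lines of Python. This assumes the config keeps its servers under a top-level mcpServers key, which is the usual shape for claude_desktop_config.json; the package name, environment variable, and the existing settings shown are illustrative, so copy the exact values from the provider's documentation.

```python
import copy
import json


def merge_mcp_servers(existing: dict, snippet: dict) -> dict:
    """Return a copy of `existing` with the snippet's mcpServers entries added."""
    merged = copy.deepcopy(existing)
    new_servers = copy.deepcopy(snippet.get("mcpServers", {}))
    merged.setdefault("mcpServers", {}).update(new_servers)
    return merged


# Hypothetical existing config; a real file will hold other settings.
existing_config = {"preferences": {"exampleSetting": True}}

# Snippet shaped like a typical NPX-based MCP entry. The package name and
# env variable are assumptions; use the values from the provider's docs.
brave_snippet = {
    "mcpServers": {
        "brave-search": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-brave-search"],
            "env": {"BRAVE_API_KEY": "YOUR_API_KEY_HERE"},
        }
    }
}

merged_config = merge_mcp_servers(existing_config, brave_snippet)
print(json.dumps(merged_config, indent=2))
```

The function leaves every other setting untouched and only adds or overwrites entries under mcpServers, so it can be reused each time another connector is added.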

Glossary

Terms worth knowing.

Cowork on 3P: Anthropic's official term for running Claude Desktop against a third-party inference provider rather than Anthropic's own API. Documented at claude.com/docs/cowork/3p/overview.
Dynamic Workflows: A Claude Code feature where Claude writes and executes a JavaScript orchestration script that can spin up hundreds of sub-agents to work on a single task in parallel.
Gateway mode: A Claude Desktop configuration that replaces the default Anthropic API endpoint with any OpenAI-compatible URL, enabling local or third-party model inference.
Prefill (TTFT): The first phase of LLM inference, where the model processes the incoming prompt. Time To First Token is determined by prefill speed, and large prompts like Claude Code's system context make this the main bottleneck on local hardware.
Decode (TPS): The second phase of LLM inference where the model generates each output token. Local hardware often performs well at decoding relative to prefill.
MCP (Model Context Protocol): An open protocol for connecting external tools and data sources to AI models. In this context, an MCP server provides web search to a local model that has no built-in internet access.
Model alias / API identifier: A custom name assigned to a local model in LM Studio that overrides its default identifier. Used here to rename Gemma as claude-opus-4.8 so Claude Desktop's model discovery accepts it.

Resources

Things they pointed at.

00:55 · link · Claude Docs: Cowork on 3P

02:28 · tool · LM Studio

03:38 · tool · OpenRouter

06:55 · tool · BraveSearch MCP

08:45 · tool · FireCrawl MCP

Quotables

Lines you could clip.

00:09

“Instead of us using the paid API from Anthropic or even needing an Anthropic account, I'm gonna show you how to do this by using local AI models that are running completely on your computer.”

lead-off cost hook, complete standalone thought → TikTok hook

05:14

“The gateway returned no usable models -- Claude is only looking for things that have Sonnet or Opus or Haiku.”

names the blocker that stops most people, high search value → IG reel cold open

10:03

“A dynamic workflow is a JavaScript that lets you basically deploy hundreds of sub agents.”

clean one-liner definition, no setup needed → newsletter pull-quote

11:48

“From the very first message, we're sending like 30,000 tokens. It's a lot. But then from here, you can literally just leave your computer.”

honest about the limitation, ends on the payoff → TikTok hook

The Script

Word for word.

00:00Hello, legends. In this video, I'm gonna show you how to use the new Claude dynamic workflows feature, which lets you generate up to a thousand agents to work on really complicated tasks. And instead of us using the paid API from Anthropic or even needing an Anthropic account, I'm gonna show you how to do this by using local AI models that are running completely on your computer.

00:19And this is possible because we're using the gateway version for Claude. The gateway version is still an official Claude product. It's literally the Claude desktop app, which we get access to the Claude Code and the Claude Cowork.

00:31But by using the gateway, we're able to plug into any LLM. So we can either use something like LM Studio, which we're gonna be doing in this video, to download and use local models directly with Claude Cowork and Claude Code, or we can connect up to something like OpenRouter, which has got access to hundreds of cloud based models.

00:50Some are free. Some are paid. But even the paid ones, you will save, like, nine just over 98% to get really, really good models that you can use.

00:57To get this working, what we need to do is just read the documentation on a thing called Cowork on 3P. Once again, that's just the version of Claude desktop app that lets you plug into a gateway. So we're just gonna go across to this documentation.

01:09So over here, we can see run Cowork against your own cloud inference provider or, in our case, our own local inference provider. And I'm just gonna go into the next steps to figure out how to install and set this up. So our first step is to download the Claude desktop app.

01:22If you don't already have this, just click this button, and then download the desktop app for yourself. Works on a Mac and Windows, so just download and install. Once you're done with that, the step two is, uh, explicitly stated, do not sign in or do not create an Anthropic account because once again, you don't need to have an account or to be using the Claude API to make this work.

01:41And once your app is open on your screen, you just go into the top left hand corner if you're on Mac OS and click on help, drop down to troubleshooting, and then, uh, enable developer mode. Once you enable developer mode, in that same top menu bar, you see a new menu button called developer.

01:56Once you drop that down, you'll see configure third party inference. When you open the configure third party inference settings, you have an option to choose the, uh, connection type. We're just gonna leave it as gateway, and, uh, we have credential kind.

02:09We'll drop down. We'll select static API key. Now in this video, I'm just gonna show you how to do it with LM Studio, or this would also work if you have Ollama.

02:18And it also works if you're using OpenRouter. So if you wanna follow-up video for OpenRouter, just let me know below. For the gateway base URL, we're gonna get that directly from LM Studio, and then we'll come back to the API key and figure out our credential type.

02:31So if you haven't heard of LM Studio, I'm gonna drop a video somewhere on screen right now that'll give you a full run through, especially if you're brand new to this tool. But, essentially, it's a free desktop app that you can download onto your computer, and you can browse free local AI models and then download them onto your device.

02:47And then you can use them either directly in the app like a chat mode, or you can plug them into different tools like Hermes or OpenClaw or, in our case, into Claude. Now you can download LM Studio for Mac and Windows, so it's gonna run for both operating systems. So once you open up LM Studio, you're gonna see a window like this.

03:04Now there's three things we need to do here. The first is we need to download a model so that we can then plug it into Claude. The second is we need to get ourselves this gateway base URL, so that's gonna be in a settings in LM Studio.

03:16Then the final thing is when you download a model, it's actually just living in your, like, storage. It's technically asleep on your computer. In order for it to be useful, we need to wake it up and just kind of keep it turned on.

03:26So I'm gonna show you how to do that as well. So now the first thing we wanna do is just download one of these local models. So just gonna go to model search.

03:32And now this tab, everything on the left hand side, these are all free local models that you can download. Just be mindful as you're browsing the model, if you get a red warning that says likely too large, it just means it's too big to run on your computer and you wanna just find a different model. We're gonna get a green tick like this.

03:48So the Gemma E4B and the E2B are fantastic models. They're very small, and they're really good for, like, agentic tasks. Pretty much what we wanna do in Cowork.

03:56And as you can see, got a green tick saying full GPU offload possible. And in that case, I would just download this model. The next thing you wanna do is click on to settings.

04:06Open up the settings panel, go across to developer, and then you wanna turn this setting on. So by default, if it's your first time using LM Studio, it's gonna be turned to off.

04:14Developer mode will be off. You just wanna flick it across to on. Then you can close this panel, and you should see a new menu bar over here called developer.

04:22Now when you open up developer, this is the access where we can manage our our model. We're able to get our model loaded up into memory so it's awake. And then by using this URL, this is the gateway URL, we can actually plug this directly into Claude.

04:37So as you can see here, I've got a bunch of loaded up models. They're not all of the models that I have on my computer. These are just the ones that are awake and ready to do some work.

04:45Now while we're here, I'm just gonna delete this model here. I'm I mean, I'm not gonna delete it. I'm just gonna put it back to sleep.

04:51It says Claude Opus 4.6. We'll come back to this. I actually don't have a Claude model on my computer, but that's important for us to know in just a second.

04:57I'm gonna copy this URL, and let's paste it into this gateway base URL. And now we need an API key. Since I'm doing this locally using my local LM Studio, I wanna put a default value of LM dash Studio, leave it as bearer.

05:12And for now, let's just test the connection. So scrolling down, the gateway returned no usable models, which is a little bit strange because actually in LM Studio, I've got two models that are actually loaded up and they're ready to go.

05:24But the one caveat is that the desktop app is actually searching to see what your model alias is or, like, what the actual model name is. In my case, I've got a Gemma and a MiniMax, and Claude is only looking for things that have Sonnet or Opus or Haiku. So in this case, none of the models that we have will have this, um, will have this convention.

05:45So what you can do to bypass that issue is when you're loading up your model, which means you're taking it from sleeping to awake. I'm just gonna go through some of these models here. I've got my Gemma 4 26B.

05:56When I click this, now I'm in a settings panel to basically wake this up and configure the settings. I can get this API identifier. I'm just gonna backspace this.

06:04I'm gonna type in Claude Opus 4.8. And for me, I just wanna get my context window to be as big as possible.

06:13Once again, watch that instructional video. All this kind of stuff will make sense. Most important part is that you wanna have Claude Opus 4.6, and now we can load our model.

06:22So as we see, we're gonna be loading our model, and it's got this convention here, 4.8, but we just wanna confirm it's the Gemma model, but it's gonna be, uh, identified as Opus 4.8.

06:34So now if we come back to our settings, and let's just test model discovery, there we go. One model found.

06:40So we just kind of refresh everything. We found the model. Everything's fine.

06:44We found the Opus 4.8. So now before I save these settings and apply anything here, the one final thing that I wanna do is when I'm using the paid API service from Claude, part of the tools part of the built in tools that we get are things like web fetch and web search. So when you're using Claude in a desktop app or on a a web or whatever and you ask it a question to, like, search the Internet for some something or whatever, it's already built in.

07:08That web search is built in. But since we're using our local models, we don't have built in web search. We'll have to introduce an MCP, basically, like a connection that we can search the web by ourselves.

07:17So this disable built in tools just means that the model is never gonna call this. It's gonna look for MCP connections, uh, once again because our local model doesn't have this.

07:27So I'm gonna go to apply changes, save, and restart. And now if this is your first time using Claude and you didn't have the desktop app open or signed in, you will see this window.

07:35But if you are already using Claude and you were already signed in before starting this process, you're not gonna see this window. All you need to do is just sign out. Just open up your Claude, the desktop app, and just sign out of it, and then you'll be able to see the screen.

07:48Now we still have two ways to sign in. So the first way is using claude.ai, so our paid subscription, which we don't lose that privilege even if we do this third party LLM provider, or we can do what we wanna do here, which is use our local model.

08:01So I'm gonna click on continue, and here we go. Let me just drop down. I can see my Opus four.

08:06I'm in Cowork right now, but I wanna get across to Claude Code. Now before we actually fire off our agents, I wanna make sure that we have Internet search plugged in. I've already configured BraveSearch MCP, and I'm gonna show you how to do the same thing.

08:17So to do this, we're just gonna go into this gateway settings button, click on settings, just go across to developer, and we're gonna click on edit configuration.

08:26And then you wanna open the configuration file. And once you open your file, you see a bunch of different settings inside that file. They all relate to your Claude desktop app configuration.

08:36You will not see this. I have an MCP server plugged in, which is the BraveSearch. Now you can actually use whatever provider that you wanna use.

08:43Most providers online will have an MCP connection. All you need to do is just Google, you know, BraveSearch MCP or FireCrawl MCP, whatever you wanna use. Scroll down until you find the NPX install.

08:56This is what we need to get the MCP plugged in. You can now copy this, then copy everything that is in this configuration file, and just go across to a new Claude session and paste in the MCP settings, paste in the configuration file that you had, and then ask Claude to combine those two together.

09:12Once you get it combined, you can take the output and just paste it into configuration file. And then as you go through and you wanna find different connectors, like you wanna use a ClickUp MCP or, I don't know, a Gmail MCP, whatever is available, you can then keep coming back into this Claude session, plug in, uh, give the new MCP, and then ask Claude to add it for you.

09:30Now just be mindful for the brave search. You are gonna have to have an API key. So in this case, just sign up, create a new account, and then generate a new API key.

09:39And then once you're done, you'll be able to see your brave search as an option on your on your connectors. Just make sure that it's turned on. And now the final thing we wanna do is figure out how to create those hundreds of agents to do work for us, and you can do that by using a new feature called dynamic workflows.

09:53So this was released a few days ago with Opus 4.8. A dynamic workflow is a JavaScript that lets you basically deploy hundreds of sub agents. Now the specifics around this are you you can have up to 16 concurrent agents.

10:09So 16 agents working at one time and a total of 1,000 agents per run. So let's say you have a big project. You have an office.

10:16You can have 16 employees working in that office at any one time. Let's say this whole project takes you five hours. Across that five hours, you would have had a thousand people come through doing work at different times.

10:29So, yeah, at any one time, it's 16, but a total per task is 1,000. And then inside Claude code, we have a slash command, which is deep research, and this is already a bundled workflow. So as long as we use this slash command, Claude's already gonna know to basically generate hundreds of agents for the task.

10:46So back in Claude, I'm just gonna use the slash command, find deep research, and then paste in the command that I used before. I'm basically saying, hey. I wanna start a local AI agency in Australia, find my 10 competitors, find 10, you know, types of customers that are looking for these services, and then build me a business plan around this.

11:03Now as you can see here, this is literally real time processing. I'm using my M3 Ultra, uh, with 512 gigs of RAM, and I'm using the Gemma 26B.

11:12It's a small model. It doesn't have a lot of strain from my MacBook, uh, from my M3 Studio. But at a very high level, when using local AI models, there's two main components to be able to get a response.

11:22The first is prefill. So, like, how fast can your model intake the prompt that you're sending it? Um, and then you have decoding, which is how fast your model can generate a response.

11:32The Mac Studio is pretty fast at generating responses, but it's a little bit slow at ingesting and kind of, like, processing the prompt. Plus since we're using Claude, this is like yeah.

11:42There there's a lot of tokens that are already prebuilt. Basically, from the very first message, we're sending, like, 30,000. It's a lot.

11:49But then from here, you can literally just leave your computer. You can come back in one or two hours, and then you would have had, you know, a couple of hundred agents do a bunch of work for you. Alright, guys.

11:57Thank you very much for watching this video. If you enjoyed it, I'd appreciate if you could, uh, like the video, drop a comment, or subscribe to my channel. And if you'd like to see a follow-up of me plugging into OpenRouter so that you can run free cloud models or really, really cheap, uh, paid cloud models, uh, let me know in the comments below.

12:13Alright. See you in the next one.

The Hook

The bait, then the rug-pull.

A thousand agents, zero API bill. The Claude Desktop app ships a gateway mode that routes all inference to a local LM Studio server -- and the only non-obvious step is a model-renaming trick that takes thirty seconds.

Frameworks

Named ideas worth stealing.

11:08concept

Prefill vs. Decode

Prefill (TTFT -- prompt ingestion speed)
Decode (TPS -- token generation speed)

Two phases of local LLM inference. Prefill determines time to first token; decode determines generation speed. Local hardware bottlenecks at prefill for large contexts.

Steal for: explaining why local model latency feels front-loaded on long prompts
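A rough back-of-the-envelope model makes the prefill/decode split easier to reason about. The Python sketch below simply divides token counts by throughput for each phase. The 30,000-token prompt size comes from the video; the two speeds and the 500-token reply are made-up placeholders rather than measurements of any real machine or model, and the function name is hypothetical.

```python
def estimate_latency(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float, decode_tps: float) -> dict:
    """Split a response-time estimate into prefill (TTFT) and decode phases."""
    ttft = prompt_tokens / prefill_tps      # seconds until the first token appears
    decode = output_tokens / decode_tps     # seconds spent generating the reply
    return {"ttft_s": ttft, "decode_s": decode, "total_s": ttft + decode}


# 30,000 prompt tokens is the figure quoted in the video. Both speeds are
# illustrative placeholders, not benchmarks.
first_message = estimate_latency(30_000, 500, prefill_tps=300.0, decode_tps=50.0)
print(first_message)  # {'ttft_s': 100.0, 'decode_s': 10.0, 'total_s': 110.0}
```

Under these invented numbers the prompt takes ten times longer to ingest than the reply takes to generate, which is the front-loaded pattern the framework describes. Plugging in your own measured speeds gives a more realistic picture.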

00:55concept