Big Idea

The argument in one line.

GPT Realtime 2 bridges conversational voice AI with real computer control through tool calls, and four plain-English prompts to a coding agent are all it takes to build a working voice-operated Mac.

Who This Is For

Read if. Skip if.

READ IF YOU ARE…

You already use an AI coding agent and want to extend it into a voice-controlled layer over your whole computer.
You spend significant time switching between apps, searching the web, or saving notes and want to offload those tasks to voice commands.
You work in creative apps like Premiere Pro and are curious whether voice commands can replace repetitive timeline interactions.
You want to understand how GPT Realtime 2 differs from ChatGPT voice mode and what the practical unlock actually is.

SKIP IF…

You need a production-ready always-on voice assistant -- latency and cost make this better suited for deliberate commands than ambient control.
You are on Windows -- the demo relies on macOS-specific APIs and the accessibility tree approach does not translate directly.

TL;DR

The full version, fast.

GPT Realtime 2 differs from ChatGPT voice mode in one way: it fires tool calls mid-conversation, meaning it can take real actions on your computer while still talking back. The build takes four prompts to a coding agent: open a WebSocket to the model, add push-to-talk to stop always-on cloud streaming, add web search via three Chrome tools, then connect apps through MCP servers or the macOS accessibility tree plus agent-desktop. Honest caveats: latency, patchy accessibility support in some apps, and per-command API costs that compound fast without push-to-talk.
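
To see why costs compound without push-to-talk, here is a rough sketch that compares how much audio each mode sends to the cloud. It assumes cost grows with the amount of audio streamed; the 8-hour session and the fifty 4-second commands are made-up numbers chosen only to show the ratio, not measurements from the video.

```python
def streamed_seconds_always_on(session_seconds: float) -> float:
    """Always-on: the microphone streams for the whole session."""
    return session_seconds


def streamed_seconds_push_to_talk(command_seconds: list[float]) -> float:
    """Push-to-talk: only the stretches with the key held are streamed."""
    return sum(command_seconds)


# Hypothetical workload (an assumption, not data from the video).
session = 8 * 60 * 60      # an 8-hour day left streaming
commands = [4.0] * 50      # fifty commands of 4 seconds each

always_on = streamed_seconds_always_on(session)
push_to_talk = streamed_seconds_push_to_talk(commands)
print(f"always-on streams {always_on / push_to_talk:.0f}x more audio")
```

Swap in your own session length and command counts to estimate the gap for your usage.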

Chapters

Where the time goes.

00:00 – 01:45

01 · Intro / Demo

Cinematic CRT-to-studio open, then live demo: voice commands control Premiere, Spotify, Claude Desktop, and a script rewrite in real time.

01:46 – 04:45

02 · What is GPT Realtime 2?

Explains the tool-call distinction from standard ChatGPT voice mode. Shows OpenAI playground JSON function schema. Visualizes terminal logs of tool calls firing.

04:46 – 07:52

03 · Setting it up (no coding)

Pastes GPT Realtime 2 API docs into Claude/Cursor. One prompt builds the WebSocket app. First test opens Safari by voice. Always-on streaming problem surfaces.

07:53 – 09:40

04 · Push-to-talk fix

One prompt adds a global hotkey and a floating waveform HUD, and stops streaming unless the key is held. Resolves the always-listening problem and cuts costs to pennies per command.

09:41 – 11:37

05 · Demo: searching the web

Adds three Chrome tools (open tab, type query, click link via vision). Searches for local restaurants and clicks first result. Grants accessibility permissions once.

11:38 – 14:12

06 · Demo: Obsidian via MCP

Installs Obsidian Local REST API MCP plugin, hands URL to agent in one prompt. Creates note, fetches GPT Realtime 2 docs from web, pastes full page content into Obsidian by voice.

14:13 – 17:59

07 · Demo: Premiere Pro via accessibility tree

Introduces accessibility tree concept and agent-desktop (open-source). One prompt wires it up. Live demo: pause/play, nudge playhead by frames, cut, mark in/out, ripple delete by voice.

18:00 – 19:16

08 · Honest caveats + outro

Three caveats: not every app exposes full controls, real API costs per command, noticeable latency. Frames this as a near-future preview, not a daily replacement.

Atomic Insights

Lines worth screenshotting.

GPT Realtime 2 differs from standard ChatGPT voice mode in exactly one way: it can fire tool calls mid-conversation, which lets it take real actions on your computer.
A tool call is just a small JSON function you hand the model -- name, parameters, description -- and the model decides when to trigger it.
Always-on voice streaming costs money every second you are silent; push-to-talk cuts that to pennies per deliberate command.
Every MCP-enabled SaaS gives the voice agent programmatic access once you hand it the server URL and an API key -- one prompt, done.
The macOS accessibility tree is a structured map that apps expose for screen readers -- an AI agent can read that map to find and click buttons without those apps having an API, though some apps expose only part of their controls.
agent-desktop translates the accessibility tree from information into action: it gives the model hands to press buttons and navigate menus in native apps.
The entire build -- WebSocket, push-to-talk, web search, MCP integration, native app control -- is completed with four plain-English prompts to a coding agent, zero manual code.
Web is the easiest tier: open tab, type query, click link via vision. MCP apps are mid-tier. Native GUI apps like Premiere are hardest and need agent-desktop.
Latency is the honest ceiling: voice-to-action takes a visible moment, so this works better for deliberate commands than replacing fast repetitive clicking.
Leaving the WebSocket streaming without push-to-talk picks up ambient conversation and bills you for every word -- the creator hit this live during the demo itself.
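
The tool-call idea in the list above can be sketched as follows. This is a minimal illustration, not the code the coding agent generated in the video: the field layout of OPEN_APP_TOOL is an assumption based on the "name, parameters, description" shape described, and open_app, HANDLERS, and dispatch are hypothetical names. The handler only returns the macOS command instead of running it, so the sketch is safe to execute anywhere.

```python
import json

# Hypothetical tool definition: a name, a description, and JSON Schema parameters.
OPEN_APP_TOOL = {
    "type": "function",
    "name": "open_app",
    "description": "Open a macOS application by name.",
    "parameters": {
        "type": "object",
        "properties": {
            "app_name": {"type": "string", "description": "For example: Spotify"},
        },
        "required": ["app_name"],
    },
}


def open_app(app_name: str) -> list[str]:
    """Return the macOS command that would launch the app (not executed here)."""
    return ["open", "-a", app_name]


# Map tool names to local handlers (hypothetical registry).
HANDLERS = {"open_app": open_app}


def dispatch(tool_name: str, arguments_json: str):
    """Log the tool call to the terminal, then run the matching local handler."""
    args = json.loads(arguments_json)
    print(f"tool call: {tool_name} {args}")
    handler = HANDLERS.get(tool_name)
    if handler is None:
        raise ValueError(f"unknown tool: {tool_name}")
    return handler(**args)


command = dispatch("open_app", '{"app_name": "Spotify"}')
print(command)
```

On a Mac, passing the returned list to subprocess.run would actually open the app; the model only chooses the tool and its arguments, and your own code performs the action.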

Takeaway

Voice commands can reach your whole computer now.

WHAT TO LEARN

The gap between speaking a command and a computer acting on it has collapsed to a few prompts and a WebSocket -- and the three-tier access model tells you exactly which approach works for any app.

GPT Realtime 2's tool call mechanism is what separates it from standard voice chat -- it fires real actions mid-conversation, not just text responses.
Every app falls into one of three access tiers: browser URLs are easiest, MCP-enabled SaaS apps connect with one URL and a key, and native GUI apps require reading the macOS accessibility tree.
Always-on voice streaming costs money every second -- push-to-talk keeps costs to pennies per deliberate command instead of dollars per session.
The macOS accessibility tree is a structured map every app publishes for screen readers; an AI agent can read that map to locate and click any button without needing a dedicated API.
agent-desktop turns the accessibility tree from read-only information into real action -- it gives the model hands to navigate menus and press buttons in apps that have no AI integration.
The entire build -- WebSocket, push-to-talk, web search, MCP app integration, native app control -- is completed with four plain-English prompts to a coding agent, no manual code written.
Latency is the honest ceiling: voice-to-action takes a visible moment, so this is better for deliberate commands than replacing fast repetitive clicking workflows.
Connecting an MCP server scales to every SaaS that publishes an MCP plugin -- the one-prompt process shown with Obsidian applies identically to any compatible tool.
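
For a sense of what travels over an MCP connection, here is a small sketch that builds the JSON-RPC 2.0 request body the protocol uses to invoke a tool (the tools/call method). The tool name create_note and its arguments are hypothetical; a real server, including the Obsidian plugin shown in the video, publishes its own tool names, and a real client would also perform the protocol's initialize handshake and send the API key before calling anything.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id


def mcp_tool_call(tool_name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 request body for MCP's tools/call method."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical tool name and arguments, for illustration only.
body = mcp_tool_call("create_note", {"path": "Quick Note.md", "content": "# Hello"})
print(body)
```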

Glossary

Terms worth knowing.

GPT Realtime 2: An OpenAI voice model that can hold a live conversation AND fire tool calls mid-sentence, enabling it to take real computer actions while still responding verbally.
Tool call: A small JSON-defined function the model can trigger on its own during a conversation. Tells the model what the action is named, what parameters it accepts, and what it does.
WebSocket: A persistent bidirectional connection between your app and a server. GPT Realtime 2 uses it to stream audio and receive tool call responses in real time.
Push-to-talk: A hotkey pattern where you hold a key to stream your microphone to the model and release to trigger action. Prevents always-on streaming and keeps API costs to pennies per command.
MCP server: A server that implements the Model Context Protocol (MCP), a standard way to expose an application's read/write/search capabilities to AI agents. Apps like Obsidian offer MCP plugins so agents can interact with them programmatically.
Accessibility tree: A structured map of every UI element in an application that macOS generates for screen readers. An AI agent can read this map to locate and interact with any button, field, or menu.
agent-desktop: An open-source repository by Lahfir on GitHub that reads the macOS accessibility tree and gives an AI agent the ability to click, navigate, and control native desktop applications.
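
The accessibility tree entry can be made concrete with a toy example. The sketch below searches a hand-written tree for an element by role and title, which is roughly the lookup an agent needs before it can click anything. The TREE data, the simplified role names, and the walk and find helpers are all assumptions for illustration; this is not agent-desktop's API, and real macOS roles use names such as AXButton.

```python
from typing import Iterator, Optional

# Hypothetical snapshot of part of an app's accessibility tree.
TREE = {
    "role": "window", "title": "Premiere Pro", "children": [
        {"role": "group", "title": "Program Monitor", "children": [
            {"role": "button", "title": "Play", "children": []},
            {"role": "button", "title": "Mark In", "children": []},
        ]},
        {"role": "menu", "title": "Edit", "children": [
            {"role": "menuitem", "title": "Ripple Delete", "children": []},
        ]},
    ],
}


def walk(node: dict, path: tuple = ()) -> Iterator[tuple]:
    """Yield (path, node) for every element, depth first."""
    here = path + (node["title"],)
    yield here, node
    for child in node["children"]:
        yield from walk(child, here)


def find(tree: dict, role: str, title: str) -> Optional[tuple]:
    """Return the path to the first element matching role and title, else None."""
    for path, node in walk(tree):
        if node["role"] == role and node["title"].lower() == title.lower():
            return path
    return None


print(find(TREE, "button", "play"))
```

The returned path tells the agent where the control lives; a tool like agent-desktop is what then presses it.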

Resources

Things they pointed at.

03:01 · tool · OpenAI Realtime Playground

05:54 · tool · OpenAI API keys

12:29 · tool · Obsidian Local REST API MCP plugin

15:53 · tool · agent-desktop (Lahfir)

00:00 · tool · voice-os GitHub repo

00:00 · link · AI for Mortals newsletter

Quotables

Lines you could clip.

02:18

“Voice mode will chat with you, but it can't reach out and touch anything else on your computer. GPT Realtime 2... can also call tools, as in fire off real actions, while it's still in conversation with you.”

Clearest one-sentence explanation of the GPT Realtime 2 unlock → IG reel cold open

14:00

“This isn't just for Obsidian. You can apply this same setup for any application, any SaaS with an MCP server or an API.”

Scope expansion -- makes the lesson universally applicable → TikTok hook

18:13

“It's cheap. It's a few pennies per command, but it's not free. And if you leave it streaming to the cloud all day... those costs are just gonna balloon really quick.”

Honest cost framing counterbalances hype and builds trust → newsletter pull-quote

The Script

Word for word.

In the beginning, it was just a command line, then a mouse, then a whole desktop, then we put it in our pocket.

We even tried talking to it. Sorry.

I didn't quite get that. And now?

Okay. Cut. Cut.

Cut. This intro is not working. Hey, chat.

Pause Premiere.

Paused.

Now, uh, go back a couple frames. Went back two frames. Yeah.

I'm just not loving this intro. I need to think through this more. Let's throw out some music to lock in.

Open Spotify.

Spotify is open. Nice. Now, put on some Herbie Hancock.

Putting on Herbie Hancock now. There we go. Much better.

Now open Claude desktop.

Claude is open. Nice. Now open my YouTube script project.

Your YouTube script project is open. Perfect.

And ask Claude to rewrite the intro.

Okay. I'll pass that rewrite request along to Claude now.

Claude responded. This is GPT real time two in action. Alright.

Nice. I've got it from here. This is GPT real time two in action and in this video, I'm gonna show you exactly how to build this yourself.

Everything from opening your apps to fully commanding them just by talking. You will get a glimpse into the future of a new kind of operating system, one you can run entirely with your own voice. And the best part, no coding or technical knowledge is required.

All it takes is a few prompts to Claude Code. So what would it actually take to run your entire computer with just your voice? Well, turns out it's way less than you think.

Hey, chat. Open the real time two GPT docs from Google. Got it.

Pulling up those docs in your browser now. The search results are open in your browser now. Thanks.

Now can you click the GPT real time two model, that first link below AI overview? Sure. Looking at first result now.

It's crazy. I'm still not used to this. So we're gonna feed these GPT real time two docs to Claude Code in a second.

But first, just to explain what GPT real time two actually is, you have almost certainly talked to ChatGPT's voice mode before or advanced voice mode, but this is not that. Voice mode will chat with you, but it can't reach out and touch anything else on your computer.

GPT real time two is the same kind of model, so you can talk with it, it can reason, it can talk back, but this one can also call tools as in fire off real actions while it's still in conversation with you. And that really is the whole unlock of GPT real time two. With these tool calls, they're just little actions you hand the model.

So a small function, a little bit of code that GPT real time two can trigger on its own. So let me show you two quick visualizations of this. If we go into GPT real time two in OpenAI's playground so go to what is this URL?

Platform.openai.com/audio/realtime/edit. I'll go back to that main page here, and I've already created this test. But what you can do is just click create, and you can start to experiment with this.

So if I click in a test, I have these basic system instructions. It doesn't really matter with this test, but I'm saying you're a fast, friendly voice assistant that controls the user's computer. Yada yada.

Now the important pieces of this are the actual functions. So you can see I have two functions here, open app and web search. So if I just click into open app, you can see it's just a little bit of JSON, and this is what tells the agent what to do.

So it instructs the agent on how to actually open an app. Now these are just sample functions. They won't actually do anything when I hit start session, but let me show you what happens when I talk with it.

So we'll just save this. I'll also show web search real quick too. It's a similar kind of JSON, and this actually instructs the agent on how to search the web.

If I just go and I hit start session hey, chat. Can you hear me?

I hear you loud and clear. Nice. Uh, how's it going?

I'm doing well and happy to be here. How are you? Good.

Can you open an app on my computer? Open Spotify.

So just like that, you can see. I don't wanna keep talking to it. App name Spotify.

Now it's not gonna actually open up the apps because it doesn't have actual access to my computer, and that's exactly what's happening with the real app that we've built where it actually has computer access. And then the second visualization I wanted to show are actually these logs here. So this is in terminal.

This is actually live in our app. You can see when I said, can you click the GBT? It didn't transcribe it right, but GBT Realtime Tube model, it was able to infer.

I meant GPT Realtime two, and it actually clicked. You can see the link. That's the tool call there.

Click the link position first. And then above when I said open the Realtime two GPT docs from Google, got it. And then you can just see the web search call, query Realtime two docs.

And that's essentially what we're doing. We're just mapping out in an easy to understand way exactly how to do things on my computer, like open apps, like do web search, like open notes. So now let me show you how to actually set this up yourself.

Of course, it's actually really easy to do. As you might imagine, all we need to do is go over to Claude Code, and we're just gonna hand it those docs that we were looking at a second ago, the GPT real time two docs. These ones here.

So let's just go here, copy this, go over to Cursor, and open up Claude in a new session. We're gonna type in this prompt. The prompt doesn't really matter, but if you want, you can pause your screen and grab this, and I'll quickly gloss over this as well.

So I say here are the GPT real time two docs, build me a simple voice assistant for my Mac using GPT real time two, which should open a live WebSocket connection to the model. I'll explain that in a second. Capture my microphone.

Let me control my computer through voice, and we're just gonna start with one tool call, and we'll keep track of the logs by printing every tool call to the terminal as it fires. So it's really as simple as that, or you could just not say that and say, here are docs. Set this up for me.

And so we'll let Claude just do the coding here. This is a relatively simple build. What Claude's doing is opening a live connection to GPT real time two over what's called a WebSocket.

It's gonna hook up my microphone in a second and actually I'll give permissions there, and then it's wiring up all the tool calls we just talked about as well. It's just starting with that one tool call, but we'll slowly start to stack in more complicated ones as we go through this build. One thing, of course, you need as well is just an OpenAI API key, which you can get at platform.openai.com.

This is exactly where we were in the playground when we were testing the audio model. So just go to API keys and then create a new key, and then just drop it into a .env.local or just give it to your agent.

And I'll talk about API costs here in a second, the normal way where it's streaming to the cloud the whole time. Cost will add up very quickly. But the method I'll detail, once we get this wired up, keeps it super efficient to where each call ends up costing, like, pennies on the dollar.

So let's go back over to Claude. And in just a couple minutes, like, literally two and a half minutes, it looks like everything is set up. So it just needs to do some sort of NPM install.

I also need to add my OpenAI key, so I didn't give it that yet. So we'll do that. Okay.

OpenAI API key is in .env.local now. Go ahead and run this build entirely, and then just do a quick test.

And Claude ran a couple tests. It looks like everything is working. Now we just need to do NPM start, and we need to get microphone capture set up.

So Claude needs to connect to my microphone. So I'm just gonna tell Claude, please do NPM start and wire up my microphone so I can talk to it. You can't change your app's notification logic for me.

I just started. If you to watch it. If it's crashing, try force quit with command plus option plus escape.

Tell me the app name, and I can open it. Tell me which one Okay. I need to stop it.

So what's happening is I was talking to Claude in another session because I'm also doing five things at once. So I was just talking to Claude about something else, and then it just started listening and then picking up my conversation with Claude separately. So this is this is one of the problems of having real time streaming just constantly listening.

So we need first to set up some sort of wake word. But before we set up a wake word, let's just have it do something to show that it is in fact working. Hey, chat.

Can you hear me? Yep. I can hear you.

What's, uh, what's going on? Just hanging out with you. What's up on your side?

Nothing. Can you open Safari? Safari opened.

Woah. Look at that. So just like that, you saw exactly what happened in the terminal.

I went and called it. Yep. Oh, okay.

Okay. Okay. Got it.

Got it. Okay. Turn off.

Stop. Turn off. Alright.

Okay. I'll keep it brief and leave things idle now. Got it.

I'll stay quiet and wait. Okay. Now that that's off, you can see how easy that was to set up.

Right? It was just a couple back and forths with Claude to actually get this set up, and really the only problem was it just listening nonstop, which is annoying.

Of course, we don't wanna do that. So what we're gonna do is set up a couple different things. So you might have noticed that in my original demo, I had that little waveform displaying when I was talking to it.

But right now, in the version that we just built, it's always listening. So it's streaming my voice to the cloud nonstop. It's constantly transcribing everything I say while it waits for a command.

Not to mention it's just sitting there every time I say something, it's just interrupting me because it thinks I'm talking to it. So here's what we're actually gonna do. One hotkey, I'll hold it down and that waveform will pop up exactly like you saw in the demo, and then I talk to it, I let it go, and then it acts.

So nothing's listening, nothing's streaming to the cloud until I press that key. And to do that, we're just gonna go back to our agent and set it up with a very simple prompt. Here's the prompt too.

You can screenshot this if you like, but it's relatively straightforward. It's just saying add push to talk to my voice assistant, set up a global hotkey. While I hold the key down, capture my microphone, and show a clean on screen HUD so I know when it's listening. When I release it, stop listening, and then have the model act on what I said.

Okay. So we've got our push to talk set up. Just hold command q, and let's see if this thing works.

Hey, chat. Can you hear me? Yep.

I can hear you loud and clear. Cool. What's going on?

I'm here and ready when you are. K. The only thing that's just annoying me, I just wanna fix this, that looks like the speaking, like, the waveform doesn't show up in real time, so I'm gonna just tell Claude that.

This is a very minor UI thing, but I know I'm gonna be bugged by that as we move through this demo, so I'm just gonna tell Claude. The only thing is when it's speaking so I see thinking, and then it's weird to have thinking be a waveform. Right?

Can we just, like, not do that? I don't know if you just need, like, a loading indicator. Just do it in the same HUD.

And then when it's speaking, it just kind of, like, either lags with the waveform or just kind of, like, is stuck. I don't know if we can get like a real time streaming of the waveform generation showing up when real time two is talking, but that would be tight.

Hey, chat. Can you hear me? Yep.

I can hear you. Nice. Alright.

There we go. That real time streaming waveform looks great. So now we've got real time two built out and fully configured.

It can trigger with a hotkey. Now it is time for our first real test. We're gonna be searching the web.

So to do that, all you need to do is go back to Claude and ask it to add three new tools. One that will open a Chrome tab, one that types a query into the search bar and hits enter, and then one that can use its vision to click a link. So again, really easy to do.

We're gonna go back to Claude and just type in this prompt. Add web search to my voice assistant. Three tools.

One, open a new Chrome. Two, type a query to the search bar. Hit enter.

Three, click a link using vision. Take a screenshot of the page. Using the model's vision, find the link I described and click it when asked to search yada yada.

So I give the actual example of that, but it's really that simple. So you just describe what we wanna do, and then Claude can reason how to actually set this up for the real time two assistant. Okay.

And we've got this set up with a couple different caveats for you to know where you'll just need to grant permissions. So the minute I ask chat to open up something here in a second, it will open up my accessibility settings, and I just need to connect some stuff for screen recording for the actual automation with controlling Chrome.

And once I grant permissions once, it should be able to just run automatically. I don't need to give permissions every single time. Well, let's run this thing and see if we can do a Google search.

Search for the best ramen in Brooklyn. It. I'll search for you now.

Search is open. K. You didn't see that because I was just in another window, but it immediately searched that.

Like, it was so fast. Let me see if I can actually show this on my monitor. Maybe I just need to try it again.

So if I say search for the best salad in Brooklyn. Okay. Look how fast that is.

That's crazy. Good job, chat. Now click the first link.

Okay. I'll open the first link on the page. Okay.

So this is where the permissions come up as you can see. Alright. We're gonna turn on okay.

Try clicking now. Which link should we click? Say something like the first result or the link's name.

I can see how fast this is too. It's crazy. Uh, just click the first link.

Click the first result. Click the Alright. So it's clicking the images.

Click the first link that's a URL, like a website URL.

The first URL link. Look at that. That is cool.

Now, okay, searching the web, that's relatively easy. Let's raise the stakes a little bit by having it open and interact with applications. So what if I wanted to, I don't know, save these GPT real time two docs somewhere or, I don't know, save any website URL for that matter.

What if I could just save this straight into Obsidian? Well, we can do exactly that. Let's go ahead and set this up in Claude to where we're now training the agent on how to actually open and access and interact with our applications.

And by train, I just mean literally showing the agent how to access this application. In this case, Obsidian, like a lot of apps you'd be using, has a documented MCP server, which means just by connecting this, it'll give our real time two agent basically full access to my Obsidian vault. So reading, writing, creating, searching notes, all of that.

Now I'm not gonna get too into the weeds, but it's genuinely easy to do this. We're just gonna go over to Obsidian, go to settings, click community plugins on the left hand side, go browse, and search for MCP, and it's the local REST API with MCP.

We're gonna install this, then we'll enable it, and then we'll click options. And then we've got an API key here, which I'm blurring out, and we also have the MCP endpoint, which we need to give the agent, which is this 127.0.0.1 URL here.

And then I'll just take this. I'll go back to Claude Code and tell it, connect my Obsidian MCP server to the voice assistant. It's running at that URL.

And then I said the API key is in .env.local, wire it up so I can read, write, create, and search notes in my vault by voice. Okay. And in six minutes, we have our Obsidian app connected to Realtime two.

So I can just go ahead and talk to the agent, trigger it with control q. Hey, chat. Open Obsidian.

Obsidian is open. Cool. Now create a new note.

Got it. I'll create a new note in Obsidian. New note created.

Sweet. Uh, can you open up the GPT real time two docs via Google search? Copy all of those docs and then paste that into this quick note.

GPT real time two documentation,

then we'll handle what's possible with it. I opened the OpenAI from search. I can't copy all their content into your note.

Yeah. You accidentally clicked, uh, AI mode in, uh, Google. Go back.

That AI overview and returned to the regular results. I reopened the search results. Yeah.

Now I just need you to click that first link there, GPT real time two. Got it. I'll try that first link again directly.

Done. It's open now. Cool.

Now can you take these docs, copy them, and then paste it into Obsidian into that note that you just created? Okay. Grabbing the page and dropping it into that note.

Done. Full page saved into quick note. Crazy.

So we just go back to Obsidian. There it is. Wild.

And then you can just feed this into an agent. Really freaking cool. Of course, obviously, there's easier ways to access docs, like an agent could just do web fetch, but the idea is you just copy anything online and paste it into Obsidian for whatever reason.

Okay. So we've got Obsidian connected. And remember, this isn't just for Obsidian.

You can apply this same setup for any application, any SaaS with an MCP server or an API. Just give that to Realtime two and it'll be able to access that programmatically. So to recap, our Realtime two agent can now interact with your browser via Google search setup and it can interact with tons of applications and SaaS via an MCP server.

This covers almost all computer use, but the kicker is what if an application doesn't have an MCP or an API? What if it's an application like say Premiere Pro? Well, in that case, we have officially reached the highest difficulty level for GPT real time two.

But this is actually really easy to set up as well. We can pull this off with a combination of two things. The first is what's called an accessibility tree.

Every operating system has a feature that's built for people who are blind or who have low vision called a screen reader, which will read everything on screen out loud. And for that to work, each application is required to hand the computer a structured map of itself. This is the accessibility tree and we can have GPT real time two leverage this same thing as well.

Now having said that, that map is just information on its own. It doesn't really do anything. So that is where this second piece comes in.

And it is an open source repo called agent desktop. Shout out to this guy Lahfir. I'll put his X profile on the screen.

But what agent desktop will do is it will read that map for the agent, and then it gives that to the agent. The agent can actually drive the app. So it can click the buttons, it can press play, it can move forward, it can go into all of these different menus.

So the accessibility tree tells the agent what's there, and then agent desktop is the actual hands that operate this. By the way, all of this is going to be included in the GitHub repo down below, including my fork of agent desktop. But for this example, we're just going to go here.

We're gonna copy this agent desktop GitHub, and I'm just gonna feed this into Claude. And so if you wanted to just use Lahfir's agent desktop, you can do that as well. And here in this prompt, I'm just saying add the desktop app control to my voice assistant using agent desktop.

I give it the GitHub repo there explaining it should read the macOS accessibility tree so the agent can see any apps, buttons, fields, menus, act on them, and then wiring it in as tools so I can control apps by my voice. Leverage agent desktop to actually give your agent hands and start controlling applications. Hopefully, I'm not getting too in the weeds with those explanations, but I wanted to at least give people context on how this is actually being built.

But if you're lazy like I am, you can just feed the GitHub repo that I have included to your agent. Controls are now live in Premiere. It looks like everything is set up, so I'm just gonna talk to the agent and test this out.

We'll open up Premiere. So I'm in the middle of editing this actual video that you are currently watching, and let's have the agent do something. Hey.

Can you move three frames right in Premiere? Stepping forward a few frames. Move three frames forward.

Few frames forward. Let's, uh, let's play this. On your entire computer.

Can you pause? Can you pause? Oh, it's oh, it's playing my program monitor right now.

Okay. That's why. Need to quit out of this track, I'm assuming.

Okay. There we go. Let's just bring this program monitor here so you can see what's playing right now, and I will have real time two pause this.

So let's hit play.

Can you hit pause on Premiere?

Boom. Look at that. Go back three frames.

That's sick. Can you cut the timeline? I don't know what the term is.

Cut the playhead. It should actually do. That's crazy.

Look at that. Look at that cut. Boom.

So it's working. Okay. Now mark in.

Marked in. Okay. Move the playhead four frames to the right and then mark out.

Got it. Stepping in. Sipping the moved four frames That's crazy.

Now ripple delete. Ripple delete. Done.

Jeez. Alright. I don't want it to mess with my timeline.

I think you get the idea. Alright. So now you have an ability to, via GPT real time two, have an agent basically control your entire computer with just your voice.

You know how to open Chrome and search the web. You can work with applications that have an MCP server like Obsidian, and you can build out the ability for your agent to handle even the most complicated applications like Premiere Pro. But before I let you go, let's run through a few honest caveats.

First, it doesn't work perfectly on every app. The web and most standard apps are easy, but some apps just won't expose many of their controls. When that happens, it takes a bit more back and forth with your coding agent to get to the bottom of it, or you have to duct tape something together.

But between the GitHub repo I'm giving you and agent desktop, you should be able to, with an agent, figure out how to access and control a lot of these applications. Second, there is a real cost. It's cheap.

It's a few pennies per command, but it's not free. And, of course, if you leave it streaming to the cloud all day, which was that initial version that we were doing, which I highly recommend you don't do, those costs are just gonna balloon really quick. And then third, this is where it gets a little less realistic.

It is not instant. I tried to show in real time as much as possible how long this takes, but there is a little bit of latency. And so if you're used to, you know, working in a GUI and just clicking things and having it be automatic, it's gonna be a little bit weird to have an agent do this.

But I think this is important to emphasize, you can see in a very near future where this thing is almost in real time. And I, for one, since I hate clicking in applications because it wastes so much time, I'm excited for that future. Anyway, uh, let me know if you're experimenting with GPT real time two.

If you have any questions, happy to help, and, uh, I'll see you in the next one.

The Hook

The bait, then the rug-pull.

The video opens with a dramatic rewind through computing history before cutting mid-sentence to a live edit suite where the creator simply says 'hey chat, pause Premiere' and it does. That 15-second sequence is the entire thesis made visceral before a single word of explanation.

Frameworks

Named ideas worth stealing.

11:40 · model

Three tiers of app control

Web/browser: open tab + type query + vision-click
MCP-enabled SaaS: standard protocol, full read/write/search
Native/legacy apps: accessibility tree + agent-desktop

Every app falls into one of three tiers for voice control, each requiring a different implementation strategy. Add each tier with a single prompt to a coding agent.

Steal for: Framing AI automation capabilities in a talk, workshop, or product demo
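
The three tiers can be read as a simple decision rule. The sketch below is one hypothetical way to encode it; the App fields and the choose_tier function are illustrative names, and which tier a real app lands in depends on what it actually exposes.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    runs_in_browser: bool = False
    has_mcp_or_api: bool = False


def choose_tier(app: App) -> str:
    """Pick the easiest control strategy that the app supports."""
    if app.runs_in_browser:
        return "web"     # open tab + type query + vision-click
    if app.has_mcp_or_api:
        return "mcp"     # connect through the app's MCP server or API
    return "native"      # accessibility tree + agent-desktop


for app in (
    App("Google search", runs_in_browser=True),
    App("Obsidian", has_mcp_or_api=True),
    App("Premiere Pro"),
):
    print(f"{app.name}: {choose_tier(app)}")
```

Checking the easiest tier first mirrors the order the video builds them in: web, then MCP, then native apps.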

07:53 · model

Push-to-talk pattern

Hold global hotkey
Waveform HUD appears
Speak command
Release key
Model acts

Solves always-on streaming cost and ambient-pickup problems. Nothing streams to the cloud until the key is held, keeping costs to pennies per deliberate command.

Steal for: Any voice-controlled app where cost or ambient noise is a concern
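
The push-to-talk steps above can be sketched as a tiny state machine. This is a minimal illustration under assumptions: the PushToTalk class and its method names are hypothetical, the hotkey and microphone are simulated with plain method calls, and the actual upload to the model is left out. The point it shows is that audio arriving while the key is up is dropped, so nothing ambient is ever buffered for sending.

```python
class PushToTalk:
    """Buffer microphone chunks only while the hotkey is held."""

    def __init__(self) -> None:
        self.held = False
        self.buffer: list[bytes] = []

    def key_down(self) -> None:
        """Hotkey pressed: start a fresh recording (the HUD would appear here)."""
        self.held = True
        self.buffer = []

    def audio_chunk(self, chunk: bytes) -> None:
        """Called for every microphone chunk; ignored unless the key is held."""
        if self.held:
            self.buffer.append(chunk)

    def key_up(self) -> bytes:
        """Hotkey released: stop listening and return the utterance to send."""
        self.held = False
        audio = b"".join(self.buffer)
        self.buffer = []
        return audio


ptt = PushToTalk()
ptt.audio_chunk(b"ambient chatter")   # key not held: dropped
ptt.key_down()
ptt.audio_chunk(b"pause ")
ptt.audio_chunk(b"premiere")
utterance = ptt.key_up()
ptt.audio_chunk(b"more chatter")      # key released: dropped
print(utterance)
```

In a real app, key_down and key_up would be wired to a global hotkey listener and audio_chunk to the microphone callback.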

CTA Breakdown