Big Idea

The argument in one line.

One free, offline app covers both ElevenLabs-style voice cloning and Wispr Flow-style dictation, eliminating up to $37 a month in subscriptions while keeping your voice data entirely on your own machine.

Who This Is For

Read if. Skip if.

READ IF YOU ARE…

You are paying for ElevenLabs or Wispr Flow and want to evaluate whether a free local alternative is ready for real use.
You are on a Mac and want to clone your voice locally without uploading audio to a cloud service.
You are building AI agents and want to give them a custom voice via MCP without recurring API costs.
You have a slower machine and need guidance on which model size to download without wasting gigabytes.

SKIP IF…

You need voice cloning on mobile — Voicebox is a desktop app only.
You need production-quality voice output today without testing — this series promises an honest verdict but has not delivered it yet in Part 1.

TL;DR

The full version, fast.

Voicebox is a free, open-source desktop app (MIT license, 29k GitHub stars) that bundles voice cloning and local dictation into one offline tool. It targets the combined $37/month cost of ElevenLabs and Wispr Flow. Part 1 covers the install path on Mac including the in-app updater bug and Gatekeeper warning, then walks through the models tab to help you pick between Qwen TTS 1.7B (best quality), Kokoro 82M (low-spec fallback), and Chatterbox Multilingual (emotion control, multilingual). The Captures tab does hotkey dictation locally via Whisper and never sends audio to any external API.
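
The $37 figure is plain addition of the two plan prices quoted in the video. A minimal sketch makes the monthly and yearly totals explicit; the prices are the ones the presenter states and may not match current plans.

```python
# Plan prices as quoted in the video (USD per month); actual pricing may differ.
SUBSCRIPTIONS = {
    "ElevenLabs": 22,
    "Wispr Flow": 15,
}


def monthly_total(plans: dict[str, int]) -> int:
    """Sum the monthly cost of all listed subscriptions."""
    return sum(plans.values())


def yearly_total(plans: dict[str, int]) -> int:
    """Twelve months of the combined subscriptions."""
    return 12 * monthly_total(plans)


if __name__ == "__main__":
    print(f"Per month: ${monthly_total(SUBSCRIPTIONS)}")
    print(f"Per year:  ${yearly_total(SUBSCRIPTIONS)}")
```

The "up to" in the claim matters: the saving only reaches $37 a month for someone paying for both plans at those prices.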

Chapters

Where the time goes.

00:00 – 01:13

01 · The claim

Hook and series overview: free local voice cloning, mini-series structure covering install, cloning test, dictation, and MCP/agents.

01:13 – 02:23

02 · What Voicebox is and what it costs vs ElevenLabs and Wispr Flow

Feature breakdown: clone voices, seven TTS engines, dictate into any app. Cost comparison: ElevenLabs $22/mo, Wispr Flow $15/mo, Voicebox free.

02:23 – 03:28

03 · Website, GitHub, and docs

voicebox.sh, Jamie Pine as creator, 29k stars, MIT license, nearly 4k forks, documentation at docs.voicebox.sh.

03:28 – 06:53

04 · Installing on Mac — updater bug and Gatekeeper fix

Download from voicebox.sh/download, Apple Silicon vs Intel vs Windows builds. The in-app updater did not work for the presenter on Mac, so download a fresh copy instead. Gatekeeper workaround via right-click > Open or System Settings > Privacy & Security.

06:53 – 07:34

05 · Inside the app — models tab

First tab to check on fresh install. Two categories: voice generation models and transcription models. Shows what is already downloaded.

07:34 – 09:05

06 · Which TTS models to download

Qwen TTS 1.7B recommended (4GB, best quality). TTS 0.6B for mid-spec. Kokoro 82M for low-spec/low-disk fallback. Parameter count explained simply.

09:05 – 10:52

07 · Chatterbox models and transcription models

Chatterbox Multilingual (3GB, emotion tags, multi-language). Chatterbox Turbo (English-only, faster, distilled). Whisper base/small/medium for transcription — pick based on disk space.

10:52 – 12:01

08 · Generate and Stories tabs

Generate tab: voice generation UI with engine and language selection. Stories tab: multi-track timeline editor to combine voices and produce podcast-style content.

12:01 – 13:11

09 · Captures — the dictation side

Hotkey-activated dictation into any app. Drop in audio files for transcription. Runs Whisper locally — data never hits OpenAI or Anthropic.

13:11 – 15:34

10 · Voices, effects, settings, and look ahead

Voice library (cloned + licensed built-ins). Effects: robotic, radio, echo, deep voice, custom presets. Settings overview. MCP/API interface teased for part 4.
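
The build choice from chapter 04 (Apple Silicon vs Intel vs Windows vs Linux) can be sketched as a small lookup. This is an illustration only: the returned labels paraphrase the options named in the video and are not the download page's exact file names.

```python
import platform


def pick_build(system: str, machine: str) -> str:
    """Map an OS name and CPU architecture to the build named in the video."""
    system, machine = system.lower(), machine.lower()
    if system == "darwin":
        # Apple Silicon Macs report "arm64"; Intel Macs report "x86_64".
        return "macOS Apple Silicon" if machine == "arm64" else "macOS Intel x64"
    if system == "windows":
        return "Windows 64-bit"
    if system == "linux":
        return "Linux"
    return "unsupported"


if __name__ == "__main__":
    # Uses the values Python reports for the current machine.
    print(pick_build(platform.system(), platform.machine()))
```

One caveat: a Python interpreter running under Rosetta on an Apple Silicon Mac reports "x86_64", so the result is a hint rather than a guarantee.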

Atomic Insights

Lines worth screenshotting.

One free open-source app — Voicebox — now covers the same jobs as ElevenLabs ($22/mo) and Wispr Flow ($15/mo) combined.
The in-app updater for Voicebox did not work for the presenter on Mac; downloading the latest version directly from the site and reinstalling fixed it, and existing voices and presets survived.
Mac Gatekeeper may block Voicebox on first launch: right-click and choose Open, or go to System Settings > Privacy & Security to allow it.
Qwen TTS 1.7B is the recommended starting model; Kokoro 82M is the fallback for machines with limited disk space or slow hardware.
Chatterbox Multilingual adds emotion control (laugh, sigh) and supports languages beyond English; Chatterbox Turbo is faster but English-only.
The Captures tab transcribes locally using Whisper — your audio never reaches OpenAI, Anthropic, or any external server.
Voicebox has an MCP interface, meaning AI agents can speak in a cloned voice without a cloud subscription.
A 1.7 billion parameter model is not categorically better for everyone — for slow machines the 82M Kokoro model is the honest choice.
The Stories tab is a multi-track timeline editor for combining multiple voices, enabling locally produced podcast-style audio.
With 29,000 GitHub stars and nearly 4,000 forks, Voicebox is an active project, not an abandoned side experiment.

Takeaway

One free app replaces two paid subscriptions for local AI voice.

WHAT TO LEARN

When a tool bundles voice cloning and local dictation under a single MIT license, the model selection decision — not the software itself — is where most people waste an afternoon.

Voicebox covers both ElevenLabs-style voice cloning and Wispr Flow-style dictation in one local app, eliminating up to $37/month in subscriptions.
The in-app updater failed for the presenter on Mac when moving from an early release to v0.5; downloading a fresh copy from voicebox.sh and reinstalling worked, and existing voices and presets survived.
On Mac, Gatekeeper may block the app on first launch; the fix is to right-click and choose Open, or to allow it under System Settings > Privacy & Security.
For most machines, Qwen TTS 1.7B is the right starting model; only drop to Kokoro 82M if disk space or hardware is genuinely constrained.
Chatterbox Multilingual adds emotion tags (laugh, sigh) and works across languages; Chatterbox Turbo is faster but English-only — the choice depends on your use case, not the spec sheet.
The Captures tab runs Whisper transcription entirely offline, meaning your dictated audio never reaches OpenAI, Anthropic, or any third-party server.
Voicebox exposes an MCP interface, which means AI agents can call it to speak in a cloned voice without paying per-character API costs.
A 1.7B-parameter model is not automatically the right choice — the honest recommendation here is hardware-gated, which is rare in AI tool tutorials.
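
To make the parameter counts concrete, a rough rule of thumb is that the raw weights take about (number of parameters) x (bytes per parameter). The sketch below assumes 16-bit weights (2 bytes each), which is an assumption and not something the video states; real downloads also include extra files, so treat the numbers as ballpark only.

```python
def approx_weights_gb(params: float, bytes_per_param: int = 2) -> float:
    """Rough size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


# Parameter counts as named in the video.
MODELS = {"Qwen TTS 1.7B": 1.7e9, "Kokoro 82M": 82e6}

if __name__ == "__main__":
    for name, params in MODELS.items():
        # bytes_per_param=2 assumes 16-bit weights (an assumption).
        print(f"{name}: ~{approx_weights_gb(params):.2f} GB at 16-bit")
```

Under that assumption the 1.7B model comes out near 3.4 GB, in the same range as the roughly 4 GB download mentioned in the video, while the 82M model is well under 1 GB.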

Glossary

Terms worth knowing.

TTS (Text to Speech): Software that converts written text into spoken audio using an AI model. The model size and architecture determine voice quality, speed, and hardware requirements.
Voice cloning: Creating a synthetic copy of a specific human voice from a short audio sample, which can then generate new speech in that voice from any text input.
Qwen TTS: A 1.7-billion-parameter text-to-speech model from Alibaba, used in Voicebox as the primary high-quality voice generation engine.
Kokoro: A lightweight 82-million-parameter TTS model. Much smaller than Qwen, suited for machines with limited RAM or disk space, at some cost to output quality.
Chatterbox TTS: An emotion-aware TTS model by Resemble AI that supports tags like laugh and sigh. The Multilingual variant works across languages; the Turbo variant is faster but English-only.
Whisper: OpenAI's open-source speech recognition model. Voicebox runs it locally so transcription never leaves the user's machine. The video shows base, small, and medium sizes in the models tab.
MCP (Model Context Protocol): A protocol that lets AI agents call external tools and services. Voicebox exposes an MCP interface so agents can generate or stream voice output without a cloud subscription.
Gatekeeper: A macOS security feature that blocks apps that come from outside the App Store and are not signed and notarized through Apple. Voicebox may trigger it; the fix is to right-click and choose Open, or to allow the app under System Settings > Privacy & Security.
MIT license: A permissive open-source software license that allows free use, modification, and distribution with minimal restrictions.
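
For readers new to MCP, here is a sketch of what a tool invocation looks like on the wire. MCP messages are JSON-RPC 2.0, and a tool is invoked with the `tools/call` method. The tool name `speak` and its `text` and `voice` arguments are hypothetical placeholders; the video does not show Voicebox's actual tool names or schema.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


if __name__ == "__main__":
    # "speak", "text" and "voice" are hypothetical, not Voicebox's real schema.
    print(build_tool_call(1, "speak", {"text": "Hello from my agent", "voice": "my-clone"}))
```

In practice an agent framework builds and sends this message itself; the point is only that the agent talks to a local tool and not to a metered cloud API.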

Resources

Things they pointed at.

00:00 · tool · Voicebox

02:23 · product · ElevenLabs

02:23 · product · Wispr Flow

02:35 · link · Voicebox GitHub

02:50 · link · Voicebox docs

15:34 · link · Creator strategy call

Quotables

Lines you could clip.

12:24

“It never leaves your system — it does not hit any APIs from OpenAI, Anthropic, or anyone else because the models are running on your machine.”

Clear, punchy privacy argument with named competitors → TikTok hook

00:08

“I cloned my own voice on my Mac last night completely for free with a tool that claims it can replace ElevenLabs and Wispr Flow completely.”

Strong cost-elimination hook, specific tool and competitors named → IG reel cold open

The Script

Word for word.

00:00I cloned my own voice on my Mac last night completely for free with a tool that claims it can replace ElevenLabs and Wispr Flow completely for free.

00:12It runs locally on your machine. The tool is called Voicebox. You can see it here.

00:19And in this mini series, I will go into detail on what Voicebox is, how to set it up.

00:28This part here is the general part where I explain everything around Voicebox. We do the install.

00:35We download the models that we need to run on your local machine to actually do this. And in the next parts, we're gonna concentrate on the voice cloning.

00:46We go in there. We use different voices. We see if this actually holds against the claim that this can completely replace ElevenLabs.

00:58We also check if this can replace Wispr Flow in another part, and we definitely will check out the MCP function and how to use this with your AI agents.

01:10So let's get started. What is Voicebox?

01:15Voicebox started out, I saw it first in January. So it's been out there for quite some time.

01:21The first version I downloaded was the 0.12, I think. Now we are on version 0.5.

01:30This is the one that we're gonna install together in this video. You can clone voices, and again, this completely on your machine. Generate speech across seven TTS engines.

01:40TTS stands for text to speech, and you can dictate into any app. So basically, the first thing, the clone voices, is what ElevenLabs does.

01:49And ElevenLabs costs, depending on the plan, and if you actually really wanna use this and have professional voice cloning, this costs $22 a month. And if you want to use the dictate function across your system into any app, then this is Wispr Flow.

02:07And this basically costs you 15, uh, euros or dollars per month depending on how strong you use it because you hit the limits rather fast on the Flow Basic.

02:20So in this one, we're gonna try it out together. Let's scroll through the website together. You can go either on the website.

02:28As always, all the links are down in the description. While you're down there, make sure to subscribe. This is voicebox.sh.

02:35You can go either to the site directly or go to the GitHub. It's from Jamie Pine.

02:41He's the one behind this. You can see this here. This has 29,000 stars on GitHub already.

02:50MIT license. This is open source.

02:54Everyone can work on this. There's already almost four k forks of this. So this is a project that's actually very much alive.

03:04You see the issues, the pull requests. And to be honest, everything is very well documented. This is the documentation, docs.voicebox.sh.

03:13You can see this here everything is in here. The introduction, the installation, quick start, and everything that you need to know but so you don't need to go through everything and read it.

03:28I got you covered in this video here. Enough talk. Let's get in there and actually install this on your local machine.

03:37As you can see here, if we get started on the voice box documentation, you can either go through here. You can also download it via the GitHub or the page itself.

03:48You just click on download and then no. We don't want this to start like this because I already downloaded it.

03:57What to look out for? This is depending on what machine you have.

04:02It should start automatically once you go to voicebox.sh/download. For me, I got an M2 Max.

04:10So this is Apple Silicon in my case. If you have a Mac or MacBook with an Intel processor, then you should go with the Intel x64.

04:21If you're on Windows, it's rather easy. Just go with, uh, Windows 64-bit, and Linux, you know how to do it.

04:31Well, you could buy Jamie a coffee because this open source is completely free. Uh, maybe after you watch this video and if you tried it out and like it, buy him a coffee. By the way, this is not an ad or sponsored video at all.

04:45I just try this out, test this, and in the end I would give my honest verdict on whether I think this can hold up against ElevenLabs and/or Wispr Flow. We will focus on that after we tried it out together.

05:01After you downloaded it, it's rather simple. Just install it the usual way. DMG for Mac, just put it in your applications folder and you're ready to go.

05:11Linux is basically the same and on Windows just go through the wizard. And then you have it installed. Quick finding from my side: as I said, the first version I had was the 0.1 or something like that.

05:26So one of the first versions. I tried to update it within the app; that did not work, at least on the Mac for me. So if you run into the same problem, you should download the newest version again from the site or from GitHub, wherever you want.

05:42Just make sure it's an official source and then just put it back in your Applications folder after you deleted the old one.

05:52And my presets and my voices, everything was still there. One more thing before we go into the actual app. On Mac, it might be the case because you're not downloading it via the App Store and depending on the settings that you have.

06:08If you double click on it, I'll try to get the DMG open. It might warn you that this is not possible either depending on the version you're on.

06:19Right click, open, and then it should show a dialogue that you want to open it anyways. If this isn't possible, then you probably need to go to your System Settings on your Mac.

06:32Privacy and Security: in there, it should show that the Mac prevented Voicebox from actually opening, and then you can allow it. Then you should be able to actually install this one.

06:45So this is a rather important fact for some of you that are not used to this on the Mac. But now let's get into the actual app once you install it.

06:55Alright. We're in the actual app. Depending on the version and system you're working with, there might be a dialogue asking you which model you want to use.

07:07This is why I'm showing the models tab in the app first. Because then once you install the app and open it for the first time, you know which models to download. Why do you need a model?

07:18Those are the AI models that generate the voice and also do the transcriptions. So you would need one for the voice generation and one for the transcriptions.

07:30What to start with? For the actual text to speech, the go-to standard model would be Qwen TTS, and this is the better one.

07:39So if you have the disk space, just download this one. Depending on the internet connection, it might take some time, so maybe start this right away. It is four gigabytes.

07:51If you have a super slow machine, then you could go with Kokoro. This is a super small model.

07:59You see the difference: 1.7 billion versus 82 million.

08:04Without going into detail this basically means more capacity in the actual model, better model.

08:12We don't need to go into detail here. But Qwen TTS would be the go-to model to actually get started.

08:18If you have the disk space, go with the 1.7 billion, and also a good machine.

08:23If not, go with the TTS 0.6B. And if you have a super slow machine with not a lot of disk space, you could try the Kokoro 82M.

08:35We will definitely try all those out so you get a feeling for them in the next video once we focus on the actual voice cloning. Again, this is the installation and walkthrough of the app in this part here.

08:49The next one will be the actual voice cloning with what we're setting up today. So please follow along so you got everything set up in the way that I have. So you have everything set for the actual voice cloning part in part two.

09:04One addition: you can see that I got Chatterbox TTS Multilingual installed as well.

09:11Why did I install this one on top of the Qwen TTS? The thing is this one gives you control over emotions.

09:19So basically, fine control over expressions. You can tell it to laugh, to sigh.

09:26So all of that is possible with Chatterbox TTS. It is three gigabytes.

09:33This is multilingual. So for example, I will try it out in German and show you how I'm speaking in German and then do the German voice cloning, see if it actually works in German as well.

09:46Maybe also French because I know a little bit of French, so we have different languages to try out and see if laughing or sighing and all that kind of stuff actually works. Chatterbox Turbo is a distilled model which only works in English and with the tags, for example, like laugh or sigh.

10:05This is faster and optimized but only works for English. Alright.

10:10You know the models. Decide on which ones to download. I will definitely show you how those for the voice generation work. For the transcription: Whisper base, small, medium, depending on how much disk space you have.

10:26You can download the different ones. Depending on the outcomes in video three, you can decide which one to go with. But maybe depending on the disk space either start with the small or medium for the transcriptions.

10:40If you don't have a lot of disk space then probably go with Kokoro and Whisper base to get an absolute basic setup here.

10:50Now let's go through the other tabs in the app. We got the generate part here. This is where the voices actually live.

10:59You can see this here. Yesterday, to see if this works in the new version, I cleaned everything out so we got a fresh start here. You can see it's set to English with the Qwen3 TTS 1.7B.

11:13No effects on it. We'll definitely try out the effects as well. And you can import voices.

11:18You can create voices. You can already see that here it's my voice. It's showing you exactly what I'm saying right now.

11:25You can then start the recording. But again, we will do this as soon as we get to video two. So you don't miss video two, make sure to subscribe.

11:33This is where we actually get to the voice cloning. This is the stories tab.

11:39This is basically a multi track timeline editor. You can combine different voices.

11:46You can basically cut a podcast here, create a podcast. We will go into this also in the next one and create a little story together here and go through this step by step. Captures is the dictation side, basically Wispr Flow.

12:01You can see this here. Whisper Turbo is active in Qwen3 right now. You can use a hotkey and capture your voice basically everywhere on your Mac and everything shows up in here.

12:15You can also drop in an audio file. It gets transcribed. So this is super helpful.

12:20For example, if you have a voice message that you want to transcribe or some notes or anything and this doesn't leave your local machine. And this is super important.

12:33Right? So it doesn't hit any APIs from OpenAI, Anthropic or anyone else because the models are running on your machine.

12:40So it never leaves your system and this is why this part is super interesting. Deep dive on the captures and the transcriptions and everything will be video three in this mini series where we go deep into this feature, talk about the benefits that running something like that on your local machine has and maybe also the disadvantages that come with this if you only have it on your local machine.

13:08This is where your voices live. You can see this here.

13:12This is the one that I set up so I have at least something in here to show you for this first video. Everything in here is either a clone from your own voice, from a voice sample, or something provided by someone else.

13:26First disclaimer here, always be careful that you only clone voices that you have the rights to.

13:36This is something important. We're gonna go into detail once we go to the actual voice cloning. You can see this here.

13:42You also have the built-in voices if you want. So there's a lot of voices that you definitely are allowed to use. For example, if we choose Nicole, then you can give it a name and, uh, also use the engine.

13:57This is definitely possible, but you can also clone your own voice.

14:02Again, all on that in the second video, which is going to come out very soon.

14:07We also have the effects, for example, the built-in ones: robotic, radio, echo chamber, deep voice. I'm gonna preview all of them as well. You can also set up your own presets that you can put in here and add effects like reverb, delay, compressor, gain.

14:23We'll definitely build a little bit of fun stuff in here as well. And we already talked about the models part. Then we definitely got the settings.

14:33There's a lot of stuff to set up. I don't wanna go through all of the settings right now because I think this video is long enough already. Most important thing, read the docs.

14:44You can always find them here in the official app. I think those are all self explanatory. We got the generation with different settings that we're gonna take a look at once we get there.

14:55Same for the captures. The MCP is something that we're definitely gonna talk about: how to get your AI agents ready.

15:02Well, I think that's it for part one of this series. The next one will be the actual voice cloning where we compare Voicebox to ElevenLabs.

15:12How well does this hold up if you clone your own voice on your machine? Make sure to subscribe so you don't miss the next part where we focus on the voice cloning, the generation, the captures, and also on the AI agent side of things.

15:30Thank you so much for watching and see you again in the next video.

The Hook

The bait, then the rug-pull.

A free, open-source app called Voicebox quietly arrived on GitHub and promised to do the job of two paid subscriptions — voice cloning and local dictation — entirely on your own machine. Part 1 is the setup: installation pitfalls, model choices, and a tour of every tab before the real tests begin in parts 2 through 4.

Frameworks

Named ideas worth stealing.

07:34 · list

Model size decision tree

Good machine + disk space: Qwen TTS 1.7B
Mid-spec: TTS 0.6B
Low-spec/low-disk: Kokoro 82M
Emotion/multilingual: Chatterbox Multilingual
English-only, fast: Chatterbox Turbo

A decision framework for picking which TTS model to download based on hardware constraints.

Steal for: Any tutorial where the reader must choose between model tiers. Frame it as a decision tree, not a ranked list.
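
The decision tree above can be sketched as a small function. The tier names ("good", "mid", "low") and the two flags are hypothetical labels chosen for illustration, and the way the Chatterbox options are layered on top of a base model is one reading of the video, where Chatterbox is installed in addition to Qwen TTS.

```python
# Base model per hardware tier, using the names from the list above.
BASE_MODELS = {
    "good": "Qwen TTS 1.7B",
    "mid": "TTS 0.6B",
    "low": "Kokoro 82M",
}


def pick_tts_models(tier: str, emotion_tags: bool = False,
                    multilingual: bool = False) -> list[str]:
    """Return the models to download for a hardware tier and feature needs."""
    if tier not in BASE_MODELS:
        raise ValueError(f"unknown tier: {tier!r}")
    models = [BASE_MODELS[tier]]
    if multilingual:
        # The Multilingual variant covers emotion tags and other languages.
        models.append("Chatterbox Multilingual")
    elif emotion_tags:
        # English-only use with tags: the faster distilled variant.
        models.append("Chatterbox Turbo")
    return models


if __name__ == "__main__":
    print(pick_tts_models("good", multilingual=True))
    print(pick_tts_models("low"))
```

Returning a list and not a single name keeps the base model and the optional add-on visible as separate download decisions.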

CTA Breakdown

How they asked for the click.

VERBAL ASK

15:12 · subscribe

“So you do not miss the next part where we focus on the voice cloning, the generation, the captures, and also on the AI agent side of things.”

Repeated subscribe CTA at natural series break points — mid-video before voice cloning section and at the end. Honest framing: the real test is in future parts.

MENTIONED ON CAMERA