Modern Creator Network
Brad | AI & Automation · YouTube · 08:36

My Claude Code Can INSTANTLY Watch Any Video (Here's How)

An 8-minute walkthrough of the Claude skill that replaced hours of manual video scrubbing with a URL paste.

Posted: 2 weeks ago
Duration: 08:36
Format: Tutorial, educational
Channel: Brad | AI & Automation
§ 01 · The Hook

The bait, then the rug-pull.

Brad opens with a claim that doubles as a threat to every expensive AI video tool on the market: for free, with no proprietary video model, Claude can now watch anything. Before you've hit play, Claude's already an expert on what's in it.

§ · Stated Promise

What the video promised.

Stated at 00:39: "I'll walk you through exactly how it all works, the use case that completely changed how I consume content, and how to set this up in your own Claude Code in under five minutes."
Delivered at 07:42.
§ · Chapters

Where the time goes.

00:00 - 00:52

01 · Cold open

Problem stated: other transcript tools only read words and miss half the video. Promise: how it works, the life-changing use case, and a 5-minute setup.

00:52 - 02:43

02 · Watch videos in minutes — live demo

Side-by-side screen recording: 45-minute Sam Altman YC lecture ingested in under 2 minutes. Claude returns structured speaker summary, queryable in terminal.

02:43 - 03:43

03 · Setup

GitHub link (free), install commands, automatic dependency install, API auth on a free-tier transcription service.

03:43 - 05:18

04 · Under the hood

Core insight: a video is just two things — frames and a transcript. yt-dlp + FFmpeg do the heavy lifting locally. No MCP, no third-party wrapper, no cloud service.

05:18 - 06:30

05 · The cost math

Frame scaling table: 1 min = 60 frames / $0.70; 1 hr = 100 frames / $1.62 (capped). YouTube captions are free; Groq Whisper free tier covers everything else.

06:30 - 07:03

06 · Analyze video hooks

Use case #1: content research — paste a winning video URL, ask Claude to break down the hook. Replaces 10 min/video of manual scrubbing.

07:03 - 07:42

07 · Debug screen recordings

Use case #2: developer QA — drop in a 30-second screen recording of a UI bug; Claude pinpoints the exact frame the state change happens.

07:42 - 08:36

08 · Content intelligence / second brain

Use case #3: Obsidian second brain — Claude auto-watches competitor videos and feeds structured notes in. Compounds over time.

§ · Storyboard

Visual structure at a glance.

hook · open · 00:00
promise · promise · 00:39
value · live demo · 00:52
value · framework · 03:43
value · cost math · 05:18
value · second brain · 07:42
cta · CTA · 08:25
§ · Frameworks

Named ideas worth stealing.

03:43 · concept

A video is just two things

  1. Frames
  2. Transcript

Instead of paying for an expensive multimodal video model, decompose any video into the two things Claude already reads natively — screenshots and timestamped text. Feed both together.

Steal for: Any explainer on giving LLMs multimodal context; the "decompose the expensive thing" framing is a teachable pattern
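The "feed both together" step can be pictured as a timestamp lookup: each screenshot is paired with the transcript line being spoken when it was captured. The following is a minimal Python sketch under that assumption, not the skill's actual code; the `Segment` type and both function names are hypothetical, and segments are assumed to be sorted by start time.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Segment:
    """One timestamped transcript line (hypothetical structure)."""
    start: float  # seconds from the start of the video
    text: str


def caption_for_frame(frame_time: float, segments: list[Segment]) -> str:
    """Return the text of the last segment starting at or before frame_time.

    Assumes `segments` is sorted by `start`. Returns "" if the frame
    precedes the first segment.
    """
    starts = [s.start for s in segments]
    idx = bisect_right(starts, frame_time) - 1
    return segments[idx].text if idx >= 0 else ""


def interleave(frame_times: list[float], segments: list[Segment]) -> list[tuple[float, str]]:
    """Pair every frame timestamp with the transcript text spoken at that moment."""
    return [(t, caption_for_frame(t, segments)) for t in frame_times]


segments = [
    Segment(0.0, "intro"),
    Segment(10.0, "graph one"),
    Segment(25.0, "wrap up"),
]
pairs = interleave([5.0, 12.0, 30.0], segments)
```

Each `(timestamp, text)` pair could then be sent to the model next to the matching screenshot, so on-screen content and spoken content share one timeline.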
04:40 · concept

Battle-tested tools, not new wrappers

  1. yt-dlp (universal video downloader)
  2. FFmpeg (frame + audio extraction)

Brad explicitly contrasts his use of decade-old, rock-solid CLI tools against MCPs and third-party wrappers. Trust signal: millions of developers, no vendor risk.

Steal for: Positioning own-your-stack tools against SaaS wrappers; direct language for MCN+ messaging
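To make the division of labor concrete, here is a hedged Python sketch that only builds the command lines such a pipeline might run. The video does not show the skill's real invocations, so the helper names, file names, and the one-frame-every-N-seconds choice are assumptions; the flags themselves (`--write-subs`, `--write-auto-subs`, `--sub-langs`, `--skip-download`, `-o` for yt-dlp; `-i`, `-vf fps=...`, `-vn`, `-ac`, `-ar` for FFmpeg) are standard options of those tools.

```python
def subtitles_cmd(url: str, out_template: str) -> list[str]:
    """yt-dlp call that fetches English captions without downloading the video."""
    return [
        "yt-dlp", "--skip-download",
        "--write-subs", "--write-auto-subs", "--sub-langs", "en",
        "-o", out_template, url,
    ]


def download_cmd(url: str, out_path: str) -> list[str]:
    """yt-dlp call that downloads the video to out_path."""
    return ["yt-dlp", "-o", out_path, url]


def frames_cmd(video: str, every_seconds: int, pattern: str) -> list[str]:
    """FFmpeg call that saves one screenshot every `every_seconds` seconds."""
    return ["ffmpeg", "-i", video, "-vf", f"fps=1/{every_seconds}", pattern]


def audio_cmd(video: str, audio_out: str) -> list[str]:
    """FFmpeg call that drops the video stream and writes mono 16 kHz audio."""
    return ["ffmpeg", "-i", video, "-vn", "-ac", "1", "-ar", "16000", audio_out]


# Hypothetical usage; pass each list to subprocess.run(cmd, check=True) to execute.
cmds = [
    download_cmd("https://example.com/video", "video.mp4"),
    frames_cmd("video.mp4", 5, "frame_%04d.jpg"),
    audio_cmd("video.mp4", "audio.wav"),
]
```

Building argument lists rather than shell strings keeps URLs and file names from being interpreted by a shell.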
05:18 · model

Frame cap cost scaling

  1. 1 min -> 60 frames / $0.70
  2. 10 min -> 80 frames / $0.82
  3. 30 min -> 100 frames / $0.95
  4. 1 hr -> 100 frames / $1.62

Capping frames at 100 beyond 30 minutes means cost is nearly flat at scale — a key objection killer for "this will torch my token budget."

Steal for: Cost math slides for any AI tool pitch; the framing of "surprisingly cheap" deserves its own section in any tutorial
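The frame counts in the table can be reproduced with a small lookup. The Python sketch below is illustrative only: the three anchor points and the 100-frame cap come from the table, but the linear interpolation between anchors and the clamp for videos shorter than one minute are assumptions, since the video does not say how the skill computes counts between the listed durations.

```python
# (duration in seconds, frame count) pairs taken from the table above.
ANCHORS = [(60, 60), (600, 80), (1800, 100)]
MAX_FRAMES = 100  # cap applied to anything longer than 30 minutes


def frame_budget(duration_s: float) -> int:
    """Estimate how many frames to extract for a video of the given length.

    Assumption: linear interpolation between table rows; durations at or
    below one minute are clamped to the first row's frame count.
    """
    if duration_s <= ANCHORS[0][0]:
        return ANCHORS[0][1]
    for (x0, y0), (x1, y1) in zip(ANCHORS, ANCHORS[1:]):
        if duration_s <= x1:
            return round(y0 + (y1 - y0) * (duration_s - x0) / (x1 - x0))
    return MAX_FRAMES


budgets = {minutes: frame_budget(minutes * 60) for minutes in (1, 10, 30, 60)}
```

Because the count stops growing at 30 minutes, the image portion of the cost is flat beyond that point; the remaining difference between the 30-minute and 1-hour rows presumably comes from the longer transcript.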
§ · Quotables

Lines you could clip.

01:30
Half of the interesting stuff in a video isn't said out loud. It happens on screen.
Tight, self-contained truth, no setup needed. · TikTok hook
02:23
You're not watching content anymore. You're actually downloading context automatically and putting it to work straight away.
"Downloading context" is a strong, memorable metaphor. · IG reel cold open
02:20
That's the matrix moment.
Three words, universally understood, maximum punch as a standalone cut. · TikTok hook
06:20
I've used this skill every day for two weeks, and I'm still on the free tier. It's crazy.
Personal proof + disbelief = social proof without feeling like marketing. · newsletter pull-quote
07:40
Whatever you're using video for, you can probably stop watching it manually because of this skill.
Broad permission statement that directly challenges current viewer behavior. · IG reel cold open
§ · Pacing

How they spent the runtime.

Hook length: 47s
Info density: high
Filler: 5%
§ · Resources Mentioned

Things they pointed at.

§ · CTA Breakdown

How they asked for the click.

08:25 · subscribe
If that's where you wanna take this, that's the next video to watch. It's linked up here. If this was useful, hit subscribe.

Clean, no hard sell. Next-video link appears visually. Subscribe ask is brief and earned after a dense value delivery.

§ · The Script

Word for word.

00:00 When you give Claude Code the ability to instantly watch any video on the Internet for free, it becomes genuinely unstoppable. With this Claude skill, Claude can understand video as well as it reads PDFs.
00:12 Hours-long YouTube videos, Instagram Reels, Looms, local files, anything. Before, Claude was just guessing.
00:18 Now it can watch the whole thing frame by frame instantly. It's like Neo plugging into the Matrix. By the time you've hit play, Claude's already watched the whole thing and become an expert.
00:28 I've tried a bunch of transcript tools before developing this one, and they all let me down. They either cost way too much or they only ever read the transcript and missed half the video. This skill gives Claude the frames and the audio together, so it actually sees what's happening on screen.
00:42 Right now, I'll walk you through exactly how it all works, the use case that completely changed how I consume content, and how to set this up in your own Claude Code in under five minutes. Here's what it actually looks like on a forty-five-minute video done in less than a few minutes. On the left, I have a YC lecture from Sam Altman about how to start a startup.
01:00 I'm gonna press play on that now and then grab the URL. All I have to do is go over to Claude and type /watch and then paste the URL here. Then Claude gets to work grabbing the subtitles from YouTube, extracting the frames, and analyzing them all together.
01:14 So the reason this is better than just pulling the transcript is because Claude can actually grab the frames from this video. In this lecture, Sam goes through and shows a bunch of really great graphs. And this is important context for Claude because if you're only getting the transcript, you're only getting half of the information.
01:30 Now here's where most of the existing video tools fall short because they base everything around the transcript. When something happens on screen and it's not explicitly referenced in the transcript, Claude doesn't know about it, and you miss out on key context, which matters because half of the interesting stuff in a video isn't said out loud.
01:45 It happens on screen. So this skill actually watches. It pulls frame-by-frame screenshots and puts them together with a per-second timestamped transcript to get Claude the full picture and full context.
01:56 And just like that, we're only two minutes into the lecture. Sam is still introducing what he's gonna talk about today, and Claude has already ingested the entire thing. I have a structured summary of all of the speakers.
02:08 I can see exactly what they talk about, and now I can actually query Claude on anything about this context and then start to put it to work instantly right here in the terminal. That's a forty-five-minute video done in less than two minutes: watched, analyzed, and applied.
02:24 That's the matrix moment. You're not watching content anymore. You're actually downloading context automatically and putting it to work straight away.
02:31 And you're probably thinking at the moment, there's some expensive API doing the heavy lifting here, but there isn't. But before we get into that, let's get into the setup. By the way, I'm giving this whole skill away for free on GitHub.
02:41 The link is in the description below. Just run these install commands and the setup takes care of the rest. Once the skill is installed, Claude runs the setup script and installs any dependencies that you don't have already.
02:53 It authenticates with the transcription API. Don't worry. This one is pretty much free and we'll get to it in a second.
02:59 But under the hood, the pipeline is actually surprisingly simple. Now here's the part that nobody really talks about. Claude can't actually watch video because Anthropic doesn't have a video model yet.
03:09 There are some other providers that can, like Google's Gemini model, but they're pretty expensive and they don't integrate nicely with Claude. So if you're watching a lot of content, that bill stacks up pretty fast. Luckily, there's a smarter way to do this because if you really break it down, a video is just two things.
03:22 It's a bunch of frames and a transcript. That's it. So instead of paying for another expensive model, I can just split the video into those two pieces and hand it to Claude in a format that it already knows how to read, pictures and text.
03:33 Now this is the part I love because the skill is doing this with two of the oldest, most battle-tested command-line tools on the Internet, yt-dlp and FFmpeg. These aren't MCPs. They're not some new wrapper.
03:44 There's no third-party service involved in the middle here. They install once and run right on your machine. Millions of developers have used them for over a decade now.
03:51 They're rock solid and completely free. And they're what every video tool you've ever touched is probably using under the hood. yt-dlp is the downloader.
03:58 You can think of it as a right-click, save video, but it works on basically the whole Internet. FFmpeg is the video engine.
04:05 It takes the video and turns it into two things that Claude actually wants. First, screenshots, which are taken every few seconds all the way through the video, and then second, the audio file, which is pulled out as a clean little file ready to be transcribed into text using Whisper. Now Claude has the full picture when we put these two together.
04:20 It's flipping through the screenshots like a flipbook, reading the transcript like a script, and the timestamps line up exactly so it knows what's on screen when something is being said. So that's the whole pipeline. yt-dlp and FFmpeg doing all the heavy lifting locally on your laptop for free.
04:34 The only thing we actually have to pay for here is the transcription and Claude usage. Caption transcription is pretty much free. The skill just pulls them.
04:41 And if it doesn't, it transcribes the audio using Whisper hosted on Groq or OpenAI. I prefer Groq because it's extremely fast, and their free tier covers basically anything you throw at it. So most videos cost you literally nothing to transcribe.
04:54 I even used this exact skill to grow a universal context layer for content research, and I'll show you exactly how it works in a minute. And I can literally hear the keyboards clattering right now. Brad, this is gonna torch your token budget.
05:06 But this actually surprised me, so let's do the math. The skill scales frame count to video length, and it actually caps anything over thirty minutes to 100 frames. So a thirty-minute video and a one-hour video pretty much cost the same amount in dollar terms, and that's about $1 per run.
05:20 I ran every test in this video three times in parallel and burned less than 10% of my session, and that's over five hours of video watched live by Claude with transcription. And the transcription part's where it gets ridiculous. Every YouTube video comes with a free transcript.
05:34 The skill just pulls them. There's no Whisper, no API call. It's totally free.
05:38 And that goes for a bunch of other sites too. Whisper only kicks in for the stuff without captions, like a raw MP4, a Loom, or an Instagram Reel. Groq's free tier actually gives you two hours of transcription per hour, which covers more than you'll realistically throw at it.
05:52 I've used this skill every day for two weeks, and I'm still on the free tier. It's crazy. Look.
05:56 I'm not saying this is perfect, and there's probably optimizations I haven't thought of, but for most people watching, this is essentially free. If you've got ideas to make it cheaper or quicker, drop them in the comments below.
06:07 Once I realized this was basically free, I started running it on everything, which is how I ended up building the system I'll show you at the end, and it's one that's genuinely changed how I consume content. Here's the part that actually makes this skill a must-have.
06:20 It works on any URL yt-dlp supports, which is over a thousand sites. So this isn't just limited to the big social media companies or YouTube. And it even works if you have the files locally downloaded.
06:31 So that opens up a bunch of use cases that you probably wouldn't expect. So this is what I'm doing for content research now. I take a winning video from the Internet, and I ask Claude to break down the hook.
06:40 Claude tells me the visual setup, the exact words, where the pattern interrupt lands, and what's on screen at the moment of the hook. That used to take me ten minutes per video pausing and scrubbing; now it's just a paste. And for developers, there's another use case, debugging screen recordings.
06:53 If you're a developer and a UI bug shows up, you record a thirty-second screen recording, drop it into Claude, and ask what happens right before the crash. Claude reads the frames around that moment, finds a state change, and tells you the exact frame the issue starts with. That alone has saved me hours.
07:09 The skill also has a zoom flag, start time and end time. So you can drop those in and focus frame-by-frame extraction on a specific window of a video. So you can ask about a ten-second segment of a two-hour video without burning your entire context window.
07:22 Whatever you're using video for, you can probably stop watching it manually because of this skill. So earlier, I told you that once you start using this thing, it seriously starts to change how you consume content. Now I wanna show you my personal favorite use case for it, which is feeding my second brain.
07:36 I keep a knowledge base in Obsidian with notes, snippets, ideas for content. And the bottleneck for me has always been throughput because there's just so much good content out there by creators at the moment. There's not enough time to watch it all and write it all down.
07:47 So I let Claude do both. I give it every single competitor that I think makes great content, and then from there, Claude uses the watch skill to automatically watch it and feed it straight into my second brain. So Claude watches each of these videos, frames, audio, everything, and then comes back with a clean structure and notes about what made the video work.
08:05 It fills that straight into the second brain. And this is where things start to compound because the skill and your second brain are watching more and more videos, getting more and more context, and it's getting better and better over time, getting smarter automatically. The second brain side of this whole thing is a video on its own, and I walk through exactly how I run mine, content research, competitor intel, every podcast video I've ever listened to, all in one searchable layer in Obsidian.
08:28 If that's where you wanna take this, that's the next video to watch. It's linked up here. If this was useful, hit subscribe.
08:34 Thanks for watching, and I'll see you in the next one.
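The zoom option mentioned at 07:09 (a start time and an end time) maps naturally onto FFmpeg's seeking options. The Python sketch below shows one way such a window could be turned into a command; it is an assumption about the approach, not the skill's code. The function name, its parameters, and the frames-per-second value are hypothetical, while `-ss`, `-to`, `-i`, and the `fps` filter are standard FFmpeg options.

```python
def window_frames_cmd(video: str, start_s: int, end_s: int,
                      fps: int, pattern: str) -> list[str]:
    """Build an FFmpeg call that extracts frames only between start_s and end_s.

    Placing -ss/-to before -i makes FFmpeg seek in the input, so only the
    requested window is decoded.
    """
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    return [
        "ffmpeg",
        "-ss", str(start_s), "-to", str(end_s),
        "-i", video,
        "-vf", f"fps={fps}",
        pattern,
    ]


# Hypothetical usage: a ten-second window one hour into a two-hour video.
cmd = window_frames_cmd("talk.mp4", 3600, 3610, 2, "zoom_%03d.jpg")
```

Sampling densely inside a short window is what keeps the frame count, and therefore the context used, small even for very long videos.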
§ · For Joe

Lead with the outcome, not the tool.

Format to steal

Brad never explains yt-dlp until after you've already watched a 45-minute lecture get ingested in 90 seconds — the demo sells, the explainer closes.

  • Open with the biggest possible before/after. The "matrix moment" lands at 2:23 before a single line of pipeline explanation.
  • The "decompose the expensive thing" framing is reusable: any time AI cannot do X natively, ask what X is made of and feed those pieces instead.
  • The cost math section is a trust unlock — show the pricing table, kill the "this will be expensive" objection, then say "I have been on the free tier for two weeks."
  • The second-brain use case buried at 7:42 is actually the strongest hook in the video. Lead with compound intelligence, not one-off demos.
  • The GitHub link early (2:43) plus no-account-needed install turns a tutorial into a distribution channel — every viewer becomes a potential user in 5 minutes.
§ · For You

Stop watching videos manually.

What this means for your time

If you spend time watching tutorials, lectures, or competitor content to extract information from them, this skill replaces that work with a URL paste — and the result is queryable.

  • Paste any YouTube URL (or local video file) and ask Claude specific questions about it — no watching required.
  • A 45-minute lecture takes Claude under 2 minutes to fully process; you can then ask follow-up questions like it is a conversation.
  • It works on over 1,000 sites beyond YouTube (Loom, Instagram Reels, TikTok) and on local MP4s.
  • Cost is roughly $1 of Claude usage per run on a long video; transcription is free for YouTube because captions are pulled directly, with no transcription API needed.
  • Setup takes 5 minutes. The skill is free and open-source on GitHub at bradautomates/claude-video.
  • For developers: drop in a screen recording of a UI bug and ask Claude which exact frame the crash starts on — saves hours of scrubbing.