Modern Creator
Taoufik · YouTube

This Claude Skill Watches Videos So You Don't Have To

A 4-minute tutorial on giving Claude the ability to truly watch any video -- by replacing timer-based frame grabs with FFmpeg scene detection.

Posted: 5 days ago
Format: Tutorial (educational)
Views: 4.3K
Likes: 266
Big Idea

The argument in one line.

Timer-based frame sampling makes Claude effectively blind on long videos -- switching to FFmpeg scene detection fixes this by capturing one frame every time the content actually changes, regardless of duration.

Who This Is For

Read if. Skip if.

READ IF YOU ARE…
  • You use Claude Code or Claude CoWork and want it to analyze long YouTube videos, Loom recordings, or Zoom calls without missing visual content.
  • You have tried giving Claude a transcript and found it missed on-screen graphics, diagrams, or code demos.
  • You are building a personal knowledge base (Obsidian or similar) and want to auto-ingest video research.
SKIP IF…
  • You only need quick summaries of short YouTube videos where the transcript alone is sufficient.
  • You are looking for a hosted tool that requires no local setup -- both install paths require yt-dlp and FFmpeg on your machine.
TL;DR

The full version, fast.

The old approach of grabbing 100 frames at fixed intervals leaves Claude nearly blind on videos longer than an hour -- one frame every six minutes on a ten-hour course. The fix is FFmpeg scene detection, which captures a frame every time the actual content changes rather than on a clock, paired with yt-dlp to pull YouTube captions for free. The result is a /watch skill usable from both Claude's desktop app and Claude Code CLI, with an optional Obsidian ingestion step that connects each video to an existing knowledge graph.

Chapters

Where the time goes.

00:00 - 00:47

01 · Hook -- result first

Opens on completed output with caption 'it already did everything.' States the skill's capabilities (YouTube, Zoom, Loom) and previews the video structure: how it works, what it does, how to install.

00:47 - 01:50

02 · The problem with the old approach

Explains the original Brad skill and its blind spot: 100 frames at fixed intervals = 1 frame/36s on a 1-hour video, 1 frame/6min on a 10-hour course. Claude was reading transcript with a few random screenshots.
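
The arithmetic behind those numbers can be checked with a short sketch. The function name `seconds_per_frame` is made up for illustration; the 100-frame budget is the figure quoted in the video.

```python
def seconds_per_frame(duration_hours: float, frame_budget: int = 100) -> float:
    """Average gap, in seconds, between frames sampled at a fixed interval."""
    return duration_hours * 3600 / frame_budget


for hours in (1, 10):
    gap = seconds_per_frame(hours)
    print(f"{hours:>2} h video: one frame every {gap:.0f} s ({gap / 60:.1f} min)")
```

With a fixed budget the gap grows linearly with duration: 36 seconds for one hour, 360 seconds (6 minutes) for ten hours.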

01:5002:17

03 · The FFmpeg scene-detection fix

Introduces the scene-detection approach: one frame every time the content actually changes. Animated timeline shows the dense, event-triggered frame capture vs the sparse timer-based grid.
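
One common way to do this with FFmpeg is the `select` filter and its `scene` score. The sketch below only assumes that approach; the video does not show the skill's actual command, and the 0.3 threshold, file names, and function names are illustrative.

```python
import subprocess
from pathlib import Path


def scene_frames_command(video: str, out_dir: str, threshold: float = 0.3) -> list:
    """Build an FFmpeg command that writes one JPEG per detected scene change."""
    pattern = str(Path(out_dir) / "frame_%04d.jpg")
    return [
        "ffmpeg",
        "-i", video,
        # Keep only frames whose scene-change score exceeds the threshold.
        # The inner quotes stop the comma from being read as a filter separator.
        "-vf", f"select='gt(scene,{threshold})'",
        # Variable frame rate output, so skipped frames are not duplicated.
        # Older FFmpeg builds spell this option "-vsync vfr".
        "-fps_mode", "vfr",
        pattern,
    ]


def extract_scene_frames(video: str, out_dir: str, threshold: float = 0.3) -> None:
    """Run the command; requires ffmpeg on PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(scene_frames_command(video, out_dir, threshold), check=True)
```

Calling `extract_scene_frames("talk.mp4", "frames")` would fill `frames/` with one image per detected change. A lower threshold captures more frames; the right value depends on the footage.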

02:17 - 02:51

04 · Audio pipeline

yt-dlp pulls YouTube captions for free. Non-YouTube sources (Loom, Zoom) fall back to Whisper or Deepgram. Illustrated with animated audio-to-text flow diagram.
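
A caption-first step could look like the following sketch. It uses yt-dlp's subtitle options, but the function names, the English-only default, and the "return None so the caller falls back to speech-to-text" convention are assumptions, not the skill's confirmed behavior.

```python
import subprocess
from pathlib import Path
from typing import Optional


def caption_command(url: str, out_dir: str, lang: str = "en") -> list:
    """Build a yt-dlp command that downloads captions only."""
    return [
        "yt-dlp",
        "--skip-download",     # captions only, no video file
        "--write-subs",        # uploader-provided captions, if any
        "--write-auto-subs",   # auto-generated captions, if any
        "--sub-langs", lang,
        "--sub-format", "vtt",
        "-o", str(Path(out_dir) / "%(id)s.%(ext)s"),
        url,
    ]


def fetch_captions(url: str, out_dir: str, lang: str = "en") -> Optional[Path]:
    """Return a downloaded caption file, or None when no captions were found."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(caption_command(url, out_dir, lang), check=False)
    found = sorted(Path(out_dir).glob("*.vtt"))
    return found[0] if found else None
```

When `fetch_captions` returns `None`, the caller would hand the audio to a transcription service instead.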

02:51 - 03:37

05 · What it does -- Obsidian integration

After analysis, Claude asks: save to knowledge base? Demo shows Obsidian graph view with the new video node connected to existing research. Positions the skill as a second-brain feed, not just a one-off summarizer.
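
Obsidian stores notes as Markdown files and links them with `[[wikilinks]]`, so an ingestion step can be as small as writing one file into the vault. The sketch below is a guess at the shape of such a step; the note layout, function names, and vault path are hypothetical.

```python
from pathlib import Path


def build_note(title: str, url: str, summary: str, related: list) -> str:
    """Render a Markdown note whose Related section links to existing notes."""
    links = "\n".join(f"- [[{name}]]" for name in related)
    return (
        f"# {title}\n\n"
        f"Source: {url}\n\n"
        f"## Summary\n\n{summary}\n\n"
        f"## Related\n\n{links}\n"
    )


def save_note(vault: str, title: str, body: str) -> Path:
    """Write the note into the vault folder; the title must be a safe file name."""
    path = Path(vault) / f"{title}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    return path
```

Each `[[name]]` that matches an existing note becomes an edge in the graph view, which is what connects the new video node to earlier research.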

03:37 - 04:10

06 · Install -- desktop (Claude CoWork)

GitHub releases page, download the .skill file, Settings -> Capabilities -> enable Code Execution + File Creation, import skill, type /watch <link>.

04:10 - 04:53

07 · Install -- Claude Code CLI + live demo

Clone the repo (or paste the GitHub URL and let Claude clone it for you). Demo on a Karpathy video: /watch <url>, Claude runs, extracts frames, pulls transcript, delivers analysis. Closes with Obsidian ingestion prompt and teaser for next video on second-brain setup.

Atomic Insights

Lines worth screenshotting.

  • Grabbing 100 frames from a 10-hour video means one frame every 6 minutes -- Claude is reading a transcript with a handful of random screenshots.
  • FFmpeg scene detection captures one frame per content change, not per clock tick, making the frame set representative regardless of video length.
  • YouTube captions pulled via yt-dlp are free; sources without captions (such as Loom or Zoom recordings) fall back to a speech-to-text service like Whisper or Deepgram.
  • A Claude skill can be distributed as a single .skill file for desktop users, or as a GitHub repo that Claude Code can clone on its own from a pasted URL.
  • The /watch skill was originally built by someone named Brad; this video documents one person's improvements and redistribution of that original work.
  • Showing the finished output before the explanation is a reliable hook structure for tool-demo videos -- the result is on screen within the first five seconds.
  • Obsidian's graph view becomes substantially more useful when video research is auto-connected to existing notes rather than saved in isolation.
Takeaway

Scene detection is what makes Claude actually see a video.

WHAT TO LEARN

Fixed-interval frame sampling fails silently on long videos -- switching to scene-change detection produces a frame set that is always proportional to content density, not duration.

  • A 100-frame budget spread across a 10-hour video yields one frame every 6 minutes -- Claude reads the transcript with a handful of screenshots and misses every diagram, code demo, and motion graphic.
  • FFmpeg scene detection triggers a capture when the image content actually changes, so a fast-cut tutorial gets dense coverage and a static talking-head section gets sparse coverage automatically.
  • yt-dlp can pull YouTube captions directly at no cost; sources without captions, like Loom or Zoom recordings, require a transcription fallback such as Whisper or Deepgram.
  • Distributing a Claude skill as a single .skill file lowers the install barrier enough that non-technical users can add it via drag-and-drop in the desktop app.
  • Connecting video analysis outputs to an existing knowledge graph (Obsidian) turns one-off research sessions into a compounding asset -- each new video becomes a node linked to prior context.
Glossary

Terms worth knowing.

Scene detection
An FFmpeg algorithm that compares adjacent frames and triggers a capture when the visual content changes beyond a set threshold, rather than at fixed time intervals.
yt-dlp
A command-line tool that downloads YouTube videos and can extract auto-generated or manual captions as a text file without re-transcribing the audio.
Claude CoWork (desktop)
The desktop client for Claude that supports drag-and-drop skill installation via .skill files and exposes a /watch command when the skill is loaded.
Second brain
A personal knowledge management system (here, Obsidian) where notes, video analyses, and research are stored and cross-linked for future retrieval.
Resources

Things they pointed at.

00:44 · channel · Brad | AI & Automation (YouTube channel)
02:17 · tool · yt-dlp
02:17 · tool · FFmpeg
02:17 · tool · Whisper (OpenAI)
03:10 · tool · Obsidian
03:37 · link · claude-watch GitHub repo
05:16 · channel · Andrej Karpathy YouTube channel
Quotables

Lines you could clip.

01:09
On a one-hour video, that is one frame every thirty-six seconds. On a ten-hour video, one frame every six minutes.
Concrete numbers land the scale of the problem in one sentence · TikTok hook
01:30
Claude was basically just reading the transcript with a few random screenshots.
Pithy summary of a relatable limitation · IG reel cold open
02:10
So instead of grabbing frames on a timer, it grabs one frame every time the scene actually changes.
Clean one-sentence explanation of the core fix · newsletter pull-quote
The Script

Word for word.

00:00 While still on the first chapter. As you can see here, it gives me all the information that I need.
00:06 So I'll show you right now the Claude skill that I gave to Claude. So it will be able to watch YouTube videos, Zoom calls, or Loom videos. Anything with a link, you give it to Claude, and it will be able to watch it.
00:17 So I'm going to show you how it works, what it does, and how to install it on your own machine. Link's in the description. Before, I used to use transcript tools on YouTube videos when I wanted Claude Code to analyze the transcript, as well as the transcript for meetings and the Zoom calls and so on.
00:35 However, there was always a missing piece, which was the visual side of the video. Then I discovered this skill shared by a guy named Brad. Shout out to him.
00:45 So when I gave it something like this, a ten-hour course from NatureC, it just couldn't keep up. It skipped what was on screen.
00:52 So I've spent some time and fixed it, and it can actually see what's happening, the motion graphics, the edits, the B-roll. So here's how it works. A video is two things, frames and transcripts.
01:05 The transcript is the easy part. However, the frames were the problem.
01:10 The old way grabbed 100 frames from the whole video. On a one-hour video, that's one frame every thirty-six seconds. On a ten-hour video, one frame every six minutes.
01:21 So Claude was basically just reading the transcript with a few random screenshots. So after a lot of trial and error, I switched to FFmpeg scene detection.
01:31 So instead of grabbing frames on a timer, it grabs one frame every time the scene actually changes. For the audio, yt-dlp pulls the captions straight off YouTube when the video has them, which is free. If it doesn't, like Loom or Zoom calls, in this case, it falls back to Whisper or Grok.
01:50 And when Claude Code is done watching, it asks you one question. Do you want to save this into your knowledge base? For this, I use Obsidian.
01:58 And as you can see right now, whenever I watch a new video or I want to analyze a new video, as you can see, this got added to my second brain, where everything lives, all my context. So everything got connected here, and I can leverage this in the future when I want. And let me show you right now how you can use this skill.
02:17 So it's pretty simple. You just go to my GitHub repository. And if you want to use the desktop version, which is using Claude CoWork, you just go and download from releases.
02:28 Here, you go to the release section, and you'll find here the watch skill. You download this one.
02:34 And after downloading it, you just go to settings. After that, you go to capabilities, and you need to ensure that code execution and file creation are enabled.
02:45 Then you import the skill. So right now, the skill lives on customized. So you can see here the watch.skill, and you are done.
02:52 Right now, it is uploaded, and you can see all the information about this skill. So in order to use it, you just simply type slash watch, and you can see here the first option with the description.
03:03 So you give it just the link to a YouTube video or a Zoom call or a Loom, and it will go and see the whole video. However, if you want to use the Claude Code version, you just need to go to my GitHub repository, and you can clone the link.
03:18 You can even simply just take the link to my GitHub, and it will automatically clone it for you, and you can start using it right away. For example, I will take a video from, let's say, Andrej Karpathy. I hope I pronounced his name correctly.
03:33 So I just go and paste the link here. First, you do slash watch.
03:40 You will have the skill, and you paste the URL and why you are watching it.
03:45 So for my case, it's just to analyze it. And I can simply say, analyze this video and tell me if there are some graphics or visuals that are mentioned in this video and what are the key concepts from this video. In your case, it may first need to download yt-dlp and FFmpeg.
04:03 So you don't need to do anything right now. It will watch the video, and it gives you all the analysis.
04:12 So here, it's finished watching the video. So it gives me all the information as you can see, and I can go through this deeply and analyze it. And as well, it will ask me if I want to ingest this now.
04:24 This means that it can add it to my context so it will be related to the other pages that I have in my second brain. So this is how you can watch a YouTube video while you are doing other work. And when it's finished, it will give you the full report.
04:38 And in the next video, I will show you how to set up this second brain. This took me a month in order to build it correctly. I will show you how you can do it as well.
04:47 So that's it for this video, and I hope you find this useful.
The Hook

The bait, then the rug-pull.

The video opens in medias res on a finished Claude analysis -- frames, transcript, and key concepts already populated -- with the on-screen caption 'it already did everything.' The spoken hook follows immediately: a result-first structure that shows the payoff before a single line of explanation.

Frameworks

Named ideas worth stealing.

01:50 · concept

Scene-detection frame sampling

Replace fixed-interval frame grabs with FFmpeg scene-change detection so the frame set is always proportional to content density, not video length.

Steal for: any pipeline that needs to give an LLM visual context from video without ballooning token count on static sections
00:00 · concept
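
The difference between the two sampling strategies can be illustrated with a toy model, where each second of video is reduced to a single number standing in for what is on screen. This is only a sketch of the idea: FFmpeg's real scene score compares image content, and every name and value below is made up.

```python
def fixed_interval_indices(n_frames: int, budget: int) -> list:
    """Indices picked by spreading a fixed frame budget evenly over the video."""
    step = max(1, n_frames // budget)
    return list(range(0, n_frames, step))[:budget]


def scene_change_indices(signal: list, threshold: float) -> list:
    """Index 0, plus every index whose value jumps by more than threshold."""
    picks = [0]
    for i in range(1, len(signal)):
        if abs(signal[i] - signal[i - 1]) > threshold:
            picks.append(i)
    return picks


# Toy "video": 40 s static, three quick slide changes, then 57 s static.
video = [0.0] * 40 + [1.0, 2.0, 3.0] + [3.0] * 57

print(fixed_interval_indices(len(video), budget=4))  # [0, 25, 50, 75]
print(scene_change_indices(video, threshold=0.5))    # [0, 40, 41, 42]
```

Both strategies return four frames here, but the timer-based picks never see the slides with values 1.0 and 2.0, while the change-based picks land on all three transitions.
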

Result-first hook

Show the finished, impressive output in the first five seconds. Explanation follows only after the viewer has already seen proof that it works.

Steal for: any tool-demo YouTube video or product intro
CTA Breakdown

How they asked for the click.

VERBAL ASK
04:36 · next-video
In the next video, I will show you how to set up this second brain. This took me a month in order to build it correctly.

Soft close with a compelling hook for the follow-up. GitHub link placed in description rather than spoken.

MENTIONED ON CAMERA
02:17 · tool · yt-dlp
02:17 · tool · FFmpeg
03:10 · tool · Obsidian
Storyboard

Visual structure at a glance.

00:00 · hook · result-first hook
00:42 · promise · skill overview
01:00 · value · old-way problem
01:50 · value · scene detection fix
02:17 · value · audio pipeline
02:51 · value · Obsidian output
03:37 · value · install desktop
04:13 · value · install CLI + demo
04:36 · cta · outro / next video