Big Idea

The argument in one line.

Prompt caching is what makes Claude Code's session limits feel generous, and three behavioral habits cover about 95% of users: don't let a session sit idle past the one-hour TTL, start fresh between tasks, and keep large one-off documents out of the chat.

Who This Is For

Read if. Skip if.

READ IF YOU ARE…

A Claude user running multi-turn coding sessions or long projects who wants to understand why their token limits feel tight and how caching actually works.
Someone using Claude via API or sub-agents who's building automation and needs to optimize token spend without overhauling their existing workflow.
A developer or technical founder who's curious about prompt caching but finds the official Anthropic docs overwhelming and wants the 80/20 version.

SKIP IF…

You're using Claude only for occasional chat or one-off queries — caching won't meaningfully impact your token usage or session limits.
You've already optimized your Claude workflows and understand cache TTL rules, hit rates, and session handoff patterns — this is introductory material.

TL;DR

The full version, fast.

Claude prompt caching quietly cuts the cost of cached tokens to 10% of normal input, but most users burn through session limits by triggering recaches they don't understand. Cached tokens live for one hour on a Claude Code subscription, five minutes on API or sub-agents, and the cache builds in three layers: system instructions and tools, project files like CLAUDE.md, and the growing conversation. Switching models mid-session, including the opusplan plan/execute toggle, invalidates the cache entirely, as does pausing past the TTL or editing the system prompt. Three habits cover almost everyone: don't pause longer than an hour, start fresh when switching tasks with a handoff summary instead of compact, and put large documents into projects rather than pasting them into chat.

Chapters

Where the time goes.

00:00 – 01:25

01 · What caching actually costs

Hook on real dashboard numbers. 10% cost for cached tokens. 1hr TTL on subscription, 5min on API/sub-agents. Thariq (Anthropic) quote on cache hit rate monitoring.

01:25 – 02:47

02 · How the cache grows per turn

System layer globally cached, Project layer per-project, Conversation layer grows every turn. Prefix-matching via Thariq's diagram.

02:47 – 04:32

03 · The 4-turn visual example

Four-turn diagram showing what is cached vs processed fresh each turn. Danger: changing system prompt at message 16+ means full recache.

04:32 – 05:05

04 · Three layers and what breaks each

System / Project / Conversation table with exact events that bust each layer.

05:05 – 06:12

05 · Cache lifetime TTL table

Subscription within plan ~1hr. On usage credits: 5min. API key: 5min. Sub-agents: 5min. Addresses April Reddit panic.

06:12 – 07:34

06 · Three habits that cover 95%

1. Do not pause too long. 2. Start fresh when you switch. 3. Do not paste big one-off docs. Demo of session-handoff skill.

07:34 – 09:01

07 · What else breaks the cache

Model switching = full recache. Opus plan mode = cache-breaking on every plan/execute toggle. Editing CLAUDE.md mid-session is safe.

09:01 – 10:43

08 · Token dashboard and CTA

Free GitHub repo in School community. Tracks sessions/turns/tokens/cache reads/cost. Local device only. Setup via one Claude Code command.

Atomic Insights

Lines worth screenshotting.

Cached tokens cost only 10 percent of normal input tokens, so reading 91 million tokens from cache in a single day costs about the same as processing 9 million fresh ones.
The Claude Code cache window is one hour on a subscription; leave a session idle for an hour and you pay full price to reload everything.
If your Claude subscription runs out and you tip into API usage, the cache TTL drops from one hour to five minutes by default.
Changing your system prompt mid-session forces every token to recache from scratch, no matter how far into the conversation you are.
Anthropic runs internal alerts on cache hit rates and declares service-level incidents if they drop too low.
The three habits that cover 95 percent of users are: do not let sessions sit idle for over an hour, start a fresh session when you switch tasks, and do not paste large one-off documents into the chat.
Sub-agents using the API default to a five-minute TTL, so any gap longer than five minutes between calls forces a full recache unless the TTL is extended.

Takeaway

Three habits stop 95% of cache waste

What it teaches

Claude's prompt cache is on by default and saves up to 90% on repeated tokens — but three common habits silently bust it, and knowing the TTL rules is the whole game.

01 · What caching actually costs

Cached tokens cost only 10% of normal input, so a high cache hit rate is the primary lever on both session limits and per-token costs.
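
The 10% figure can be turned into a quick back-of-the-envelope calculation. The sketch below assumes the flat 10% cache-read rate quoted in the video; the function name and constant are made up for illustration and are not part of any Anthropic tool.

```python
# Assumption from the video: a cache read is billed at 10% of a normal input token.
CACHE_READ_PERCENT = 10


def effective_input_tokens(fresh_tokens: int, cached_tokens: int) -> float:
    """Express a mix of fresh and cached input as full-price token equivalents."""
    return fresh_tokens + cached_tokens * CACHE_READ_PERCENT / 100


# The day quoted in the video: 91 million tokens served from cache.
equivalent = effective_input_tokens(fresh_tokens=0, cached_tokens=91_000_000)
print(f"{equivalent:,.0f} full-price token equivalents")  # 9,100,000
```

This is where the "about 9 million" number in the video comes from: 91 million cached tokens are billed like roughly 9.1 million fresh ones.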

02 · How the cache grows per turn

The cache grows in three layers — system instructions cached globally, project files cached per-project, and conversation turns cached and extended each turn.

03 · The 4-turn visual example

The cache is rebuilt from scratch on three events: waiting past the TTL, changing the system prompt, or switching models — any of which can turn a cheap session expensive.
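
A toy simulation makes the four-turn picture concrete. This is a minimal sketch, not how Claude Code accounts for tokens internally: the token counts are invented, and `simulate`, `TurnCost`, and `bust_on` are hypothetical names. It only models the rule described in the video, where everything before the current turn is read from cache unless a cache-busting event happened first.

```python
from dataclasses import dataclass


@dataclass
class TurnCost:
    turn: int
    cache_read: int   # tokens reused from the cache
    cache_write: int  # tokens processed fresh and written to the cache


def simulate(prefix_tokens: int, turn_tokens: list[int], bust_on: set[int]) -> list[TurnCost]:
    """Walk a session turn by turn.

    prefix_tokens: system prompt plus project context (hypothetical size).
    turn_tokens:   new tokens added at each turn (previous reply + new message).
    bust_on:       turns preceded by a cache-busting event
                   (TTL expired, system prompt changed, or model switched).
    """
    costs = []
    cached = 0             # tokens currently held in the cache
    total = prefix_tokens  # full context size so far
    for turn, new in enumerate(turn_tokens, start=1):
        total += new
        if turn in bust_on:
            cached = 0     # nothing can be reused, everything is reprocessed
        costs.append(TurnCost(turn, cache_read=cached, cache_write=total - cached))
        cached = total     # the whole context is cached for the next turn
    return costs


# Four turns; the user waits past the TTL before turn 4.
for cost in simulate(10_000, [500, 800, 800, 800], bust_on={4}):
    print(cost)
```

Turns 2 and 3 only process the 800 new tokens, while turn 4 reprocesses the full 12,900-token context. The later in a long session the bust happens, the more expensive it is.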

04 · Three layers and what breaks each

Each cache layer has its own failure mode — system layer breaks on prompt edits or TTL expiry, project layer breaks on CLAUDE.md restarts, conversation layer resets every session.

05 · Cache lifetime TTL table

On a Claude subscription, the cache window is one hour; on API access or sub-agents, it drops to five minutes by default, which is easy to exceed when managing multiple sessions.
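
The TTL rule can be written as a small lookup. The sketch below uses the default windows stated in the video; the dictionary keys and the `cache_is_warm` helper are hypothetical labels, and treating the exact boundary as expired is an assumption.

```python
from datetime import datetime, timedelta

# Default TTLs as described in the video; the keys are illustrative labels only.
DEFAULT_TTL = {
    "subscription": timedelta(hours=1),
    "extra_usage": timedelta(minutes=5),  # past the plan limit, paying per token
    "api_key": timedelta(minutes=5),
    "sub_agent": timedelta(minutes=5),
}


def cache_is_warm(access: str, last_message: datetime, now: datetime) -> bool:
    """Return True if the idle gap is still inside the default cache window."""
    return now - last_message < DEFAULT_TTL[access]


last = datetime(2025, 1, 1, 12, 0)
print(cache_is_warm("subscription", last, last + timedelta(minutes=45)))  # True
print(cache_is_warm("api_key", last, last + timedelta(minutes=6)))        # False
```

The same 45-minute coffee break that is harmless on a subscription would already have expired the cache nine times over on the five-minute default.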

06 · Three habits that cover 95%

The three habits that cover most users: do not let a session sit idle past the TTL, start a fresh session when switching tasks, and avoid pasting large one-off documents into the chat.
A session-handoff pattern — summarize state, clear the session, paste the summary into the new session — preserves continuity without the cost of a cache miss on a long conversation.

07 · What else breaks the cache

Switching models mid-session resets the cache entirely because the cache relies on prefix matching, and a model switch changes the prefix.
The Opus-plan-mode setting causes a model switch on every plan/execute toggle, which resets the cache each time — useful to know before enabling it as a token-saving move.
Editing CLAUDE.md mid-session does not break the cache because the change is not applied until the session restarts.
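
Prefix matching and the per-model cache can be illustrated with a toy model. This is a sketch under simplifying assumptions: real caches operate on tokens rather than named blocks, and the function names and the dictionary keyed by model are invented for illustration.

```python
def cached_prefix_length(cached: list[str], incoming: list[str]) -> int:
    """Count how many leading blocks of the incoming request match the cache."""
    matched = 0
    for old, new in zip(cached, incoming):
        if old != new:
            break  # the first difference ends the reusable prefix
        matched += 1
    return matched


def reusable_blocks(cache: dict[str, list[str]], model: str, incoming: list[str]) -> int:
    """Each model has its own cache, so a different model starts with nothing."""
    return cached_prefix_length(cache.get(model, []), incoming)


cache = {"opus": ["system", "CLAUDE.md", "msg1", "reply1"]}
next_request = ["system", "CLAUDE.md", "msg1", "reply1", "msg2"]

print(reusable_blocks(cache, "opus", next_request))    # 4: only msg2 is new
print(reusable_blocks(cache, "sonnet", next_request))  # 0: model switch, no hits
print(reusable_blocks(cache, "opus", ["system v2"] + next_request[1:]))  # 0
```

The last line shows why a system prompt edit is so costly: the change sits at the very start of the prefix, so nothing after it can be reused even though the rest is identical.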

08 · Token dashboard and CTA

For large documents, dropping them into a project rather than the chat gives them better caching treatment, even though the exact mechanism is not fully documented.

Glossary

Terms worth knowing.

Prompt caching: A feature that stores parts of a conversation (system instructions, tools, project files, prior turns) so they don't need to be reprocessed on every new message, dramatically cutting cost and speeding up responses.
Cache read: Tokens that the model reuses from an existing cache instead of processing fresh. These are billed at roughly 10% of normal input token cost.
Cache create: The one-time act of writing content into the cache the first time it's seen. It costs more than a normal input token but pays off on subsequent turns that read from it.
Cache hit rate: The percentage of incoming tokens served from cache versus reprocessed from scratch. A high hit rate means cheaper, faster responses and longer practical sessions.
TTL (time to live): How long a cached snapshot stays valid before it expires and must be rebuilt. On Claude subscriptions it's one hour by default; on direct API calls and sub-agents it's five minutes unless explicitly extended.
Claude Code: Anthropic's command-line coding tool that runs Claude inside a terminal or editor extension, with access to file reads, writes, shell commands, and project-level memory.
Sub agents: Secondary Claude instances spawned by a main session to handle delegated subtasks. They run on a separate five-minute cache TTL regardless of the parent plan.
Session limits: Per-week token or usage caps tied to a Claude subscription tier. Hitting them pushes the user into pay-per-token API billing for any additional work.
System prompt: The baseline instructions and tool definitions loaded at the start of every Claude session. Changing it mid-session invalidates the entire cache and forces a full rebuild.
CLAUDE.md: A per-project markdown file that Claude Code automatically loads as persistent context, holding rules, conventions, and reference material specific to that codebase.
Prefix matching: How the cache decides whether a new request can reuse stored tokens: the incoming conversation must start with the exact same sequence of tokens as what was cached. Any change earlier in the prefix breaks the match.
/compact: A Claude Code slash command that summarizes the current conversation into a shorter version to free up context. It breaks the existing cache because the conversation prefix changes.
/clear: A Claude Code slash command that wipes the current session and starts a fresh one with no prior conversation history loaded.
Session handoff: The practice of summarizing an in-progress session's key files, decisions, and next steps so the work can be resumed cleanly in a new session without losing context.
Plan mode: A Claude Code mode where the model drafts an implementation plan before writing code. It's commonly paired with a stronger model for planning and a cheaper one for execution.
Opus / Sonnet: Two tiers of Anthropic's Claude models. Opus is the most capable and expensive; Sonnet is the mid-tier workhorse used for most coding tasks.
model opusplan: A Claude Code model setting that uses Opus during plan mode and switches to Sonnet for execution. The mid-session model switch resets the cache, so the savings come from cheaper execution, not from cache reuse.
Claude Projects: A feature inside the Claude.ai web app that lets users attach reference documents to a workspace so the model can access them across conversations, with caching optimized for repeated document lookups.
Skill: A reusable prompt or workflow packaged for Claude Code so it can be invoked on demand, like a custom slash command that performs a specific multi-step task.
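
Several of the terms above (cache read, cache create, cache hit rate) combine into one number. The sketch below computes a hit rate from a usage record; the field names follow the usage object returned by the Anthropic Messages API, but the sample values are invented, and counting cache writes as misses is one reasonable definition rather than an official one.

```python
def cache_hit_rate(usage: dict[str, int]) -> float:
    """Share of input tokens that were served from cache for one request."""
    read = usage.get("cache_read_input_tokens", 0)         # reused from cache
    created = usage.get("cache_creation_input_tokens", 0)  # written to cache
    fresh = usage.get("input_tokens", 0)                    # uncached input
    total = read + created + fresh
    return read / total if total else 0.0


# Invented numbers for a mid-session turn.
usage = {
    "input_tokens": 200,
    "cache_creation_input_tokens": 800,
    "cache_read_input_tokens": 9_000,
}
print(f"{cache_hit_rate(usage):.0%}")  # 90%
```

A turn right after a cache bust would show the opposite pattern: almost everything under cache creation and a hit rate near zero.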

Resources

Things they pointed at.

01:13 · link · Thariq X article: Lessons from Building Claude Code: Prompt Caching Is Everything

06:12 · tool · Session Handoff Claude Code Skill

09:01 · tool · Token Dashboard GitHub Repo

10:19 · link · Lance Martin X post: Prompt auto-caching with Claude

Quotables

Lines you could clip.

01:13

“We run alerts on our prompt cache hit rate and declare SEVs if they're too low.”

Anthropic-sourced authority, highest clippability → TikTok hook

07:58

“Keep it alive. Keep it focused. Start fresh when you switch.”

Three-phrase triptych, zero jargon → newsletter pull-quote

08:28

“If you switch the model, you are recaching everything.”

Counterintuitive gotcha, actionable → IG reel cold open

The Script

Word for word.

00:00So look at this. On this day, I saved 91,000,000 tokens because of cache read, and in the past week, I've saved over 300,000,000 tokens because of it. Now don't freak out.

00:07This isn't anything that you have to go change. This is happening automatically if you are using Claude Code or Claude. And I know that the concept of prompt caching might seem a little bit overwhelming, but today, I'm gonna make it as simple as possible and only really tell you what you need to know in order to make sure that you are saving your session limits and saving tokens.

00:22I'll also give you guys this entire token dashboard for free so you can actually start tracking your tokens a little bit. Anyway, so let's talk about prompt caching, why your sessions burn out, and how to stop it. So what does caching actually cost you? Well, cached tokens only cost you 10% of normal input.

00:37So all the tokens that are getting cached are saving you a ton of money. So if we go back to this example, on this day when I had 91,000,000 tokens cached, that costed me only as if I was processing about 9,000,000 of those tokens.

00:48The cache window on a Claude subscription is an hour. Meaning, if you're working with Claude Code and you don't touch it for an hour and then you send another message, everything in that session gets uncached. So if you leave a session sitting for an hour or longer, then you're gonna pay more for it.

01:03And if you're using Claude via API or sub agents, then the TTL or the time to live is only five minutes. You can change that, but it's just a little bit more expensive. You could bump it up to an hour if you want.

01:11But for Claude Code inside of your terminal or your extension, whatever it is, that's an hour. And now here's a quote from Thariq from Anthropic. He said that we actually run alerts on our prompt cache hit rate and declare SEVs if they're too low.

01:22So, basically, them saying we take this stuff really, really seriously. And if we see that the hit rate isn't very high for users' Claude Code caching, then we do something about it immediately.

01:32And that's very nice of them, but also, of course, it benefits themselves because with a high cache hit rate, Claude Code feels faster, their serving cost is lower, subscription limits feel more generous, you know, because you're using less, and long coding sessions stay practical. And then if you have low cache hit rate, this is what happens.

01:49And, obviously, it's just a lose lose for everybody. And that's why I said, like, prompt caching can get very, very complex. And if you wanna check out more, then I'll link this article right here, which Thariq really goes into some depth here.

02:01But if you read this, at least when I did, I was like, okay. This is a little bit overwhelming. I have a feeling I don't actually need to know all of this, but I do need to know at least a little bit, at least, you know, the eighty twenty of prompt caching so that I can get the most out of my session limits, and that's what I'm gonna break down today.

02:15So let's take a look at an example of how this actually grows. So by default, when you shoot off a message to Claude, there's going to be some information that needs to be cached right away.

02:25And, actually, let me just switch back to one of Thariq's graphics real quick. You can see here that we have the base system instructions get globally cached. We have tools like read, write, bash, grep, glob globally cached.

02:35We have per memory or sorry, per project things like CLAUDE.md and memory, and that gets cached per project. We've got session state, and then we have user messages which grow each turn.

02:45So now that we take this into context, when we flip back over here, this is what it looks like. This is an example where we have four turns.

02:52So on turn one, there's no cache. Basically, we're matching on a prefix. So don't really have to worry about what that means, but I might mention that later.

03:00So, anyways, on turn one, there's nothing. Right? We're opening up a fresh session.

03:03We load in the system prompt, the project context, and we shoot off our first message. And all of this is kind of in this, like, brown highlight border, which means that this is new, and it has to be fully processed, and it's being written to the cache here.

03:17So before I continue down this graphic, in this dashboard, you can see that we have the difference between cache create and cache read. So on these days, you can see what are my input tokens, my output tokens, and my cache create. And then over here, you can see my daily cache reads.

03:31And just a quick explanation, a cache create is writing something into cache for the first time. It's a onetime cost, and it pays off the next turn, unless, of course, everything gets uncached.

03:41And the cache read is tokens that Claude reused from a cache, like your CLAUDE.md or some of the files or some of the global system instructions. And these are the things that are 10 times cheaper than fresh input. So anyways, on turn two, given that we're within that one hour TTL window, everything here is already in context, so it's cached.

04:00And then all that Claude actually has to process for the first time is reply one and message two, and it caches that. So then down here in turn three, all of that's cached, and we are bumping up a reply and a message, and those are the things that only get processed each time. But if we waited an hour and then we sent another message, or if we change the system prompt, then everything from the very beginning has to get fully recached.

04:22So imagine if you were on message, like, you know, 16 and you're way, way, way over here on the right and you change the system prompt or you wait an hour, then everything getting recached is going to be a pretty expensive move that you just made.

04:34So, anyways, once again, we have the system layer, the project layer, and the conversation layer. The system layer has instructions, tool definitions, output style, and here's where it might break. The project level or the project layer has CLAUDE.md,

04:46memory, and rules, and then here's when that might break. And then we have, of course, the conversation, which is just like the replies and the messages, which gets recached every time, but that's how it should be. So here's where there's been some confusion among the community.

04:59So how long does the cache snapshot live, which is kind of called the TTL, the time to live? So on your Claude subscription, you have an hour by default because it uses your subscription.

05:11But if you go over that weekly limit and you are now playing in your extra usage territory where you are paying per token API, then by default, that will be five minutes, which is very dangerous if you're managing multiple sessions and you're constantly recaching everything because five minutes is passing. You gotta be careful about that.

05:28And people were kinda suspicious. I don't know if you remember, like, a month or so ago when everyone was complaining about their Claude subscriptions, how quick they were eating it up.

05:36People thought maybe that they switched the cache TTL from an hour to five minutes without, like, saying anything to anybody. It turns out they didn't.

05:43So it is an hour, but that's just like you know, there was a lot of confusion around that. And I get why because, honestly, it's not super clear. Like, if you're on an API, you have five minutes by default, but you can increase the cost and you can do an hour, and then your sub agents on any plan are gonna be five minutes.

05:58And for some reason, all of this is documented about Claude Code and the API, which are two very different things. But the claude.ai, like, the web, we don't know exactly how that works.

06:08At least, I haven't found documentation on that exact. I'm assuming it's the same as your subscription, but I don't know 100% for a fact.

06:15Anyways, three habits that cover 95% of people. Don't pause too long. So if you've gone over an hour on a session, just hand it off to a new session.

06:25Obviously, start fresh when you switch tasks. So do a slash compact, which will break the cache, or do a slash clear. Or you can also use my session handoff skill, which I will include as well for free.

06:35So both the token dashboard GitHub repo and the skill will be in my free school community. The link for that's down in the description. But, basically, what that means is let's say right here, I've got this project which helps me build this HTML file you guys are looking at.

06:46It's got 205,000 tokens in here. And if I come in here and just do a session handoff, this basically summarizes everything we've done, all the important files that we've built, all of the open decisions, exactly where to pick back up.

06:56And then I basically am able to just copy that summary, do a slash clear, and then keep going. And it feels like I haven't actually lost anything.

07:03So that has been basically my replacement for doing slash compact. I've just enjoyed doing this better. And sometimes the compact takes a long time.

07:10This typically doesn't take anywhere over a minute. There you go. So that is my session handoff.

07:14I do a slash copy, and then I just go ahead and clear that, paste it in, hit enter, and now I'm basically right back where I was. And then this last one is for if you're using Claude Chat specifically. If you're gonna be pasting big documents in there, you're probably better off doing a project because like I said, I don't know exactly how the caching works in Claude Chat, but we do have some confidence in saying that projects, those files are cached a little bit differently and probably more optimized for storing a bunch of documents compared to just dropping them into your Claude chat.

07:42So keep it alive, keep it focused, and start fresh when you switch. Now there's a few other things that were a little bit confusing to me as far as, like, what breaks the cache. So the first one is if you switch the model.

07:53So, you know, if you're in here and you're talking to Claude, hello, hello, hello, and then you go in here and you do a slash model and you actually switch the model, that's going to recache everything. Because if you remember earlier, I said it's prefix matching, which I'm not gonna dive into right now. But if you switch the model, then you are switching essentially the prefix, and it can't match on that same cache.

08:11So if you switch the model, you are recaching everything. Now I do wanna apologize for something here because if you do model opusplan, which is something I've shown before in, like, token hacks videos, this basically means it uses Opus for plan mode and then it switches to Sonnet for the execution.

08:28But if you do that, just keep in mind, that's actually gonna break the cache because you're switching model halfway through. So right here, you can see each model has its own cache. Switching with /model means the next request reads the entire conversation history with no cache hits.

08:40Even though the context is identical. The opusplan model setting resolves to Opus during plan mode and Sonnet during execution, so each plan toggle is a model switch and starts a fresh cache. So it's very interesting because typically the point of that is to save your session limit, and I think ultimately in long run it does, but it is important to understand that doing that does reset the cache.

08:59Now what you can do is you can edit your CLAUDE.md, and you can do that mid session because the edit actually doesn't apply until you restart that session, so the cache stays safe. And then, of course, the claude.ai projects caching.

09:11It's not exactly documented, but pretty confident that it does help to drop docs in projects rather than in the chat. But, anyways, this token dashboard, like I said, is very helpful to just be able to understand, get a little bit more visibility into your tokens. This does track your tokens on a local device.

09:27So if you switch over to a laptop, then your dashboard is gonna look different than on your main, like, PC or whatever you use. But it's very, very simple. It is a GitHub repo.

09:35You will go to my free school community. The link is in the description. You'll click on classroom.

09:38You'll click on all YouTube resources, and then you'll be able to find it right in there. And once you get that GitHub repo, all you have to do is give the link to Claude code and say, hey. This is a token dashboard.

09:47Set this up on a local host. Boom. You've got it open.

09:50And it will pull in all of your past sessions. So it's not like you're gonna start fresh as soon as you, you know, link in this repo.

09:57It will read your past files, it will pull in your tokens. And then, of course, I will also include that session handoff skill that I just mentioned to you guys. So I know this one was super quick.

10:05Hopefully, this one was helpful, though. It's just important. Like I said, when I hear about stuff like this, I love to understand it to the point where I know how to use it and I know what's going on under the hood.

10:15But truthfully, if I looked at some of these other articles, like how in-depth they go and how much nuance there is, most of the stuff right now, I just don't need to know because I'm not using the API in this way super heavily. So the reason I wanted to throw that out there is because it's important to stay updated and follow things, but just understand what do you really need to know at its core.

10:34So if you guys enjoyed the video or you learned something new, please give a like. Helps me out a ton. And as always, I appreciate you guys making it to the end of the video, and I'll see you on the next one.

10:41Thanks, guys.

The Hook

The bait, then the rug-pull.

Nate Herk opens on a live token dashboard: 91 million tokens saved in a single day, 300 million in a week, all from prompt caching running silently in the background. The pitch is disarmingly simple: you do not have to change anything, but you do need to know the two or three things that can quietly blow it all up.

Frameworks