Big Idea

The argument in one line.

Token waste in Claude Code concentrates at three chokepoints -- session startup overhead, redundant input context, and verbose model output -- and a small set of targeted tools addresses each one directly.

Who This Is For

Read if. Skip if.

READ IF YOU ARE…

You are actively using Claude Code or another AI coding CLI on a real project and noticing sessions burning through budget faster than expected.
You work on an existing codebase and want the model to orient faster without reading dozens of raw files.
You want a concrete checklist of installable tools rather than abstract prompting advice.
You are spending $50+ per month on AI coding and want to understand where the money actually goes.

SKIP IF…

You are still learning basic AI coding workflows -- these tools assume you already have a functioning session habit.
Your project is small enough that token costs are not yet a real concern.

TL;DR

The full version, fast.

Token waste in Claude Code clusters into three zones: session startup, input context during active work, and output verbosity. Token Optimizer audits the startup problem by revealing exactly how many tokens each skill, MCP, and CLAUDE.md file burns before you type a single prompt. Intent Layer and Code Review Graph address input waste by giving the model pre-indexed context instead of making it read files in chunks. The Caveman skill cuts output bloat by forcing terse, technically accurate responses. Handoff manages context across sessions, and RTK silently filters noisy shell output before the model ever reads it -- saving the host 192K tokens in a single recorded session.

Chapters

Where the time goes.

00:00 – 00:25

01 · Intro -- three areas of waste

Frames the problem: session startup waste, input token waste, output token waste.

00:25 – 04:41

02 · Tool 1: Token Optimizer

Install and run the audit plugin; inspect the dashboard showing 25K startup tokens broken down by component.

04:41 – 08:02

03 · Tool 2: Caveman

Matt Pocock skill that forces terse output; live comparison showing ~20% cost difference on the same query.

08:02 – 12:42

04 · Tool 3: Intent Layer

Skill that generates nested AGENTS.md files with token-efficient directory maps and anti-pattern rules.

12:42 – 14:48

05 · Tool 4: Handoff

Context-transfer skill; export research session summary for a clean next-session start.

14:48 – 20:14

06 · Tool 5: Writing a Good CLAUDE.md

Whiteboard section covering seven elements of an effective CLAUDE.md under 300 lines.

20:14 – 23:37

07 · Tool 6: Code Review Graph

AST knowledge graph via Tree-sitter; blast-radius analysis; query demo on fork functionality.

23:37 – 28:13

08 · Tool 7: RTK

Shell proxy filtering git, ls, test outputs; 192K tokens saved in the recording session; four filtering strategies.

Atomic Insights

Lines worth screenshotting.

Session startup can consume 25,000+ tokens before you type a single prompt -- most of it from skill files and MCPs you forgot you installed.
The Caveman skill advertises a 75% cut in token usage; the host has not seen that in testing, but treats even a 40% reduction in output tokens as a big win, and those savings compound across every subsequent message in the session.
Intent Layer is the highest-impact tool on the list -- it generates nested AGENTS.md files that orient the model to each directory without requiring raw file reads.
A good CLAUDE.md should be under 300 lines and document only what the model cannot infer from your code structure.
RTK saved 192,000 tokens during a single 28-minute recording session -- nearly all of it from git diff and git status operations during a merge conflict.
Code Review Graph gives the model a semantic AST of your codebase, enabling blast-radius analysis -- every file that depends on something you changed gets flagged automatically.
The Handoff skill lets you close a research session cleanly and open the next one with a structured summary instead of a compacted context window.
Documenting naming conventions and file structure in CLAUDE.md is waste -- models infer these from code patterns; document only tribal knowledge and hard constraints.
Output tokens are the most controllable form of waste because you choose how verbose the model needs to be.
A session that uses only 25% of a 1M-token context window is deliberate cost management, not underuse.

Takeaway

Seven tools that each target a different token drain.

WHAT TO LEARN

Most Claude Code token waste is structural, not conversational -- it happens before you type, while the model reads files, and in how verbose it decides to be when it answers.

Session startup tokens are invisible by default: a single audit tool can reveal that your installed skills and MCPs are consuming 9,000+ tokens before every session.
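The Caveman constraint does not sacrifice accuracy -- it forces the model to strip narrative filler while preserving technical precision. The skill claims about 75% fewer tokens; the host's live comparison came out roughly 20% cheaper, and the host treats even a 40% reduction as a big win.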
Intent Layer solves the new-session context gap in existing codebases by pre-generating directory-scoped AGENTS.md files the model reads instead of your raw source files.
A useful CLAUDE.md documents only what the model cannot infer from code structure: tribal knowledge, hard constraints, anti-patterns, and non-obvious tooling choices.
The Handoff skill is the right tool when you need to move research findings into an implementation session -- it produces a structured transfer document rather than requiring context compaction.
Code Review Graph speeds up context retrieval more than it cuts token count -- the real win is that the model arrives at correct answers faster because it queries relationships rather than scanning raw text.
RTK is set-and-forget: once installed, it filters every shell command output without any per-session configuration, and the savings compound across all git-heavy workflows.
The four tools the host uses weekly -- RTK, Intent Layer, Caveman, Handoff -- map directly to the four phases of a development session: environment setup, context loading, active work, and session handoff.

Glossary

Terms worth knowing.

Token Optimizer: A Claude Code plugin that runs parallel audit agents across your session configuration and generates a dashboard showing which files and tools are consuming your startup token budget.
Intent Layer: A skill that scans app directories and writes nested AGENTS.md files summarising what lives in each directory, so the model reads a map instead of the raw files.
Caveman mode: A prompt constraint that instructs the model to respond in minimal, technically precise language -- stripping filler sentences and narrative while preserving accuracy.
Handoff: A skill that exports the current session context as a structured summary document, designed to be loaded as the starting point for a new clean session.
Code Review Graph: A tool that parses a codebase into an AST-backed knowledge graph using Tree-sitter, then serves it as an MCP so the model queries relationships instead of reading files in chunks.
Blast-radius analysis: A Code Review Graph feature that, when a file changes, automatically identifies every other file, function, and test that depends on it so the model can proactively check for breakage.
RTK: A command-line proxy that intercepts shell outputs before Claude Code reads them, filtering out noise via smart filtering, grouping, truncation, and deduplication.
AGENTS.md: A directory-scoped context file similar to CLAUDE.md but nested per folder, giving the model a token-efficient summary of what lives in that directory and what rules apply there.

Resources

Things they pointed at.

00:38 · tool · Token Optimizer

04:44 · tool · Caveman (Matt Pocock)

08:02 · tool · Intent Layer (Crafter Station)

12:42 · tool · Handoff (Matt Pocock)

14:48 · link · Ideal CLAUDE.md template

20:14 · tool · Code Review Graph

23:37 · tool · RTK

00:00 · product · Tech Snack Pro (Skool community)

Quotables

Lines you could clip.

12:19

“This is probably the easiest to implement yet highest impact thing on this list.”

Strong declarative claim -- no setup needed. → TikTok hook

15:17

“A lot of conventions that people use inside of their Claude markdown file are honestly, kind of old and out of date.”

Contrarian opener challenging assumed best practices. → IG reel cold open

24:48

“A 192,000 tokens saved by using this tool just in the time that I have been recording this video.”

Concrete number, self-contained proof point. → TikTok hook

27:50

“The RTK proxy that we just went through, the intent layer tool, and then both of Matt Pocock's caveman and handoff skills. I use those on a weekly basis.”

Personal endorsement summary with no fluff. → newsletter pull-quote

The Script

Word for word.

I've had a lot of requests for this video topic. How do we stop burning tokens so quickly? This problem comes back, I think, to three core areas: waste at session startup, input token waste, and output token waste.

So in this video, I'm gonna go through seven ways to chop down your context window size to something more reasonable. I'll go through each tactic, talk about what it is, and show you how to do it in a real project.

Starting with the most important piece: auditing where your waste currently lives. You can't really solve that which you do not know is a problem. And this first tool, called Token Optimizer, helps us identify what is actually eating up our context window.

So the way that we get started with this is pretty simple. We can come down here and we can copy this Claude Code command. And then we can hop down into our terminal, start up Claude, and install the plugin.

So first, we're gonna add the marketplace, and then we are going to actually add the plugin. So once this thing is installed on our machine, we can come through and we can run this command for token optimizer. So what this thing is gonna do is it's gonna run through.

It's gonna look at all of your session history. It's gonna look at your Claude Code setup on your machine, and it's going to try and determine where exactly your tokens get used.

So while that's running, if we were to go look at the GitHub repo to see exactly what it's doing, it's running six different review agents. So it's gonna look at your Claude markdown file, any sort of memory markdown file you have.

It's gonna look at all of your skills, your MCPs, your custom slash commands, and then any sort of, like, settings and more advanced things like your hooks and general, like, Claude Code settings.

So if we were to pop in then and look at, well, what are each of these sub agents doing? They're basically receiving this command to go through and read certain aspects of each of these things.

So for example, the Claude markdown auditor is moving through, and it is finding the Claude markdown file, and it's measuring certain things like the line count, how many tokens roughly we estimate that that Claude markdown file is using, and then it's identifying optimization targets.

Now I will say the one downside of a tool like token optimizer is that a lot of these things that it is doing right now could be done deterministically. Like, we could have a script, for example, that actually counts the total number of skills, the total amount of front matter overhead.

All of these things could actually be calculated with a basic script. So this is gonna consume more tokens than it needs to. But if you're only running this as, like, a one off audit, it's really not the end of the world.
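The deterministic alternative mentioned here can be sketched in a few lines. This is a minimal sketch under stated assumptions, not how Token Optimizer works: it assumes each skill lives at `<skills_dir>/<name>/SKILL.md` with YAML front matter between `---` lines, and it estimates tokens with a crude four-characters-per-token heuristic rather than a real tokenizer. The function names are hypothetical.

```python
import tempfile
from pathlib import Path


def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return len(text) // 4


def front_matter(text: str) -> str:
    """Return the text between the first two '---' lines, or '' if there is none."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:i])
    return ""


def audit_skills(skills_dir: Path) -> dict:
    """Count skills and estimate the front-matter tokens loaded at session start."""
    per_skill = {}
    # Assumed layout: one folder per skill, each containing a SKILL.md file.
    for skill_file in sorted(skills_dir.glob("*/SKILL.md")):
        text = skill_file.read_text(encoding="utf-8")
        per_skill[skill_file.parent.name] = estimate_tokens(front_matter(text))
    return {
        "skill_count": len(per_skill),
        "front_matter_tokens": sum(per_skill.values()),
        "per_skill": per_skill,
    }


if __name__ == "__main__":
    # Demo against a throwaway directory so nothing on the machine is touched.
    with tempfile.TemporaryDirectory() as tmp:
        skill = Path(tmp) / "demo-skill"
        skill.mkdir()
        (skill / "SKILL.md").write_text(
            "---\nname: demo-skill\ndescription: A hypothetical skill.\n---\nBody text.\n",
            encoding="utf-8",
        )
        print(audit_skills(Path(tmp)))
```

A script like this costs no model tokens to run, which is the trade-off the passage describes. Pointing it at a real skills folder gives only a rough figure, since the heuristic will not match the model's actual tokenizer.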

Okay. So now that that whole process is done running, what it does is it actually spins us up a dashboard where we can look at the findings in a, like, kind of easy to navigate way.

So we can see, in this case, we have about 25,000 tokens that get loaded up on session start. Now for a lot of people, this number is significantly larger, which is obviously a problem.

So if we were to come down, for example, and try to look at this based on, like, where does that actually go, we have around a thousand tokens coming straight from our Claude markdown file. We have about 9,000, which is crazy, coming from our skill files.

750 come from slash commands, and 450 are coming from MCPs and tools.

So for example, if we were to pop in and look at these skills and think through, like, well, why is this using so many tokens? Every single skill definition that you have, the name and the description, gets loaded in when Claude starts a new session.

The reason it does that is because it needs to understand, based on the definition and description of this tool, is it something that should actually be used right now? Now, the reason that I have 82 different skills is I'm building a plugin library, so I have a lot of these skills installed globally on my computer. But if you're somebody that tests a lot of different skills, this is something that you're gonna wanna come in and address.

Same thing with the MCP tools. So it's most likely the case that between your skills and your MCP tools, you have a big chunk of tokens being used.

Now there's a lot of other stuff that you can come in here and explore that makes it really easy to move through and disable skills. Look at things based on the severity. Look at things based on your habits, and generally look at trends across how you use things.

So if you really wanna optimize your token usage, you have to know where you're starting. But what do we do from here? Like, if we've identified this, this is just looking at the things that happen for the most part when a new session starts.

But what do we do from there? How can we cut down on our token usage as we are actively building things? So the next tool up is actually a very simple skill from Matt Pocock, and it is called caveman.

So we've seen a lot of different implementations of this. The main idea behind it is that you tell Claude Code to essentially talk like a caveman, using very basic, terse, short language to cut out all of the filler and stories that language models like to spin up about things. Because the reality is that results in a lot of token waste.

Now the thing that is really valuable about Matt Pocock's library specifically, compared with other implementations of this, is that it's optimized to keep full technical accuracy.

So what that means is if there's important technical language that is needed to explain what is happening, it's not going to cut that stuff down, but it is going to communicate it in a very direct straightforward way.

So maybe an example is you ask Claude Code to explain what database connection pooling is. It's going to say: pool = reuse DB connection, skip handshake, fast under load.

Right? So instead of spinning up some narrative that consumes, like, a ton of tokens, it's going to be very direct and to the point.

So an example of how we might use this could be something like /caveman, explain how our app manages context summarization.

So in the case of our app, it is a chat based recipe companion that often needs to summarize conversations, and we have some built in functions for that. So let's see how the caveman can explain how that works.

So after, like, about forty seconds, we can see that it's come through, and it's given, like, a very basic summary of how this works. So for example, it fetches the context summary from the database, loads the last 20 messages, injects it as part of the system prompt, and then this is what Claude sees.

And so we can see it's a very direct explanation of exactly how things work. And this is realistically all Claude Code needs in order to understand how it actually works.

So if you're moving through and doing, like, a lot of planning, this is a great tool to consider using. If we were to scroll down, for example, and just look at the usage on this one, it took about forty seconds and 42¢ worth of Opus 4.7 credits.

And if we were to pop back in and run this in another chat saying, explain how our app manages context summarization, we can see how there's, like, a lot more narrative being formed around some of these things. So for example, if a summary already exists, it prompts Claude Sonnet to merge the existing summary with the latest assistant response capped at 500 words.

Like, most of this sentence is not needed in order to explain how it actually functions. Now if we were to scroll down, we can see it took a little bit longer, and it was about, like, 20% more expensive on the token cost to give us this explanation.

So this is, like, a very straightforward example of how that works, but it's a very powerful tool to consider using. Now the skill itself claims to cut token usage by about 75%. I haven't seen that in my testing, but even, like, a 40% reduction in tokens being used, especially output tokens, is a huge win.

Because realistically, this was just one pass of the planning phase. If we are now continuing this conversation and going through multiple passes, those types of token savings will really compound over time.
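One way to read "compound" here is that every response stays in the conversation and gets re-read as input on each later turn. The sketch below works through that arithmetic under loud assumptions: fixed output size per turn, no prompt caching or compaction, and user prompts ignored. The turn count and token figures are illustrative, not measurements from the video.

```python
def session_tokens(turns: int, output_per_turn: int) -> dict:
    """Total output tokens, plus input tokens spent re-reading earlier outputs."""
    output_total = turns * output_per_turn
    # Before turn k (0-based), k earlier responses are already sitting in context.
    reread_total = sum(k * output_per_turn for k in range(turns))
    return {"output": output_total, "reread_input": reread_total}


if __name__ == "__main__":
    verbose = session_tokens(turns=10, output_per_turn=1000)  # hypothetical baseline
    terse = session_tokens(turns=10, output_per_turn=600)     # same session, 40% shorter
    print("verbose:", verbose)
    print("terse:  ", terse)
    print("output saved:", verbose["output"] - terse["output"])
    print("re-read input saved:", verbose["reread_input"] - terse["reread_input"])
```

Under these assumptions, the input saved on re-reading grows with the square of the number of turns, so a fixed percentage cut in output is worth more the longer the session runs.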

So if you wanna get a lot more mileage out of each of your sessions without wasting tokens needlessly, this is a great tool to consider using. But this is really only solving the problem of output tokens.

So what do we do about input tokens? Because a massive amount of waste actually takes place before the model even responds to you. So one big problem that I get asked about a lot is specifically how to manage this stuff in the context of an existing code base.

A lot of YouTube tutorials in particular tend to focus on, like, greenfield things, showing you some new tool and building something from the ground up. But they often don't look at, like, how do you manage this type of stuff in a project that you already have. And Intent Layer is one tool that can really help us save on tokens in an existing project.

So we can install the skill and then come down into our terminal and actually run the command /intent-layer. So the reason this thing works is that generally speaking, when Claude Code starts up a new session, it doesn't really have much context at all about things that have already happened inside of the project.

And what that means is when you ask it a question, it needs to spend time and tokens trying to understand the actual grounding of the project, reading files, basically trying to understand based on what you have just asked, which files and functions are even relevant.

It will try to read those things in chunks, send it back to their APIs, and then try to start building a plan from there. And the problem with that is that, number one, that is a token intensive procedure.

And number two, it might not necessarily gather all of the context about what you know about your project and how it works and the caveats and edge cases that you've run into before. So basically, the way that this works is that it's going to look at all of your app directories, and it's going to try to parse out and understand how many tokens are inside of that directory.

Now anytime there are more than roughly 20,000 tokens, it's going to put a nested Claude markdown or agent markdown file in that directory that gives a lot of detail about what is inside of it and how it's meant to actually work.
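The threshold rule described above can be sketched as a directory scan. This is a minimal illustration, not Intent Layer's implementation: it assumes a directory's size is the total of every readable text file beneath it, uses a four-characters-per-token heuristic, and skips a hypothetical set of folders. Only the 20,000-token threshold comes from the description.

```python
from pathlib import Path

TOKEN_THRESHOLD = 20_000  # "roughly 20,000 tokens", per the description above
SKIP_DIRS = {".git", "node_modules", "__pycache__"}  # assumed exclusions


def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return len(text) // 4


def directory_tokens(directory: Path, totals: dict) -> int:
    """Recursively total the estimated tokens under a directory, recording each total."""
    total = 0
    for entry in sorted(directory.iterdir()):
        if entry.is_dir():
            if entry.name not in SKIP_DIRS:
                total += directory_tokens(entry, totals)
        elif entry.is_file():
            try:
                total += estimate_tokens(entry.read_text(encoding="utf-8"))
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
    totals[directory] = total
    return total


def directories_needing_map(root: Path, threshold: int = TOKEN_THRESHOLD) -> list:
    """List directories (relative to root) whose estimated size exceeds the threshold."""
    totals = {}
    directory_tokens(root, totals)
    return sorted(str(d.relative_to(root)) for d, t in totals.items() if t > threshold)


if __name__ == "__main__":
    # Scan the current directory; each result is a candidate for a nested AGENTS.md.
    for name in directories_needing_map(Path(".")):
        print(name)
```

Because totals are recursive in this sketch, a parent directory is flagged whenever its children add up past the threshold. A real tool may count differently.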

So now that this thing is done running, we can see that we have this new update inside of our project's AGENTS.md file. We have this intent layer section, which has a listing of, like, the primary pieces of our app architecture and where everything lives, and the types of rules that will be found inside of those sections.

And then we have this global invariants section. What are things about our project that aren't immediately clear simply by reading the code?

So an example of, like, one of those conventions might be that any error that affects billing and authentication must go through our Sentry configuration. So we can actually capture those errors and get notified when they're happening. Now the thing that's really cool about this is that it's creating a separate agent markdown file that will be read by Claude Code or Codex or whatever anytime you go to work in one of these areas.

So say, for example, we were gonna make an update to our Stripe payments. Well, when the session starts and this file gets read, it's going to know that it needs to read this specific file first. So what lives inside of this file?

If we were to go into our project and actually look at this new file that's been created, it's giving a very token efficient explanation of everything inside of this directory and what specific files handle what specific things.

And so one example might be like in the anti-pattern section. Something that a language model might decide to do is to maybe manually call the Stripe API in order to, like, work around some sort of bug, and you don't even realize it's trying to make this workaround as it is building. And so we can give an instruction that it's never allowed to bypass this get-or-create Stripe customer function, which is actually what handles the logic for, in this case, creating a new customer.

So similarly, if we were to look inside of, like, our main app directory, we have this agent markdown file, and it's again going to explain how everything works. What is the logic of the route grouping?

How does the middleware actually work in the app? What are the important patterns or conventions that should be followed? And then what are our related context files that you should be aware of if you are, again, working with any of this.

So this is probably the easiest to implement yet highest impact thing on this list, especially if you are working on an existing project. And so in a bit, I'll show you another tool that solves a similar problem in a different way.

But first, let's talk about how we can manage the context that gets generated in the middle of a conversation. So one convention that most good AI coding skills and plugins follow is having scratch pad systems in place. Meaning, we generate a bunch of context, we keep what matters, and then we move it into the next stage of the conversation, typically with a clear context window.

And so especially with these 1,000,000 token context windows, you can spend a lot of money very quickly if you are not paying attention, which is why I tend to limit the usage of those windows to 25 to 30%.

So one example of a skill that can really help us out with this, again, comes from Matt Pocock's skill library, and it's called handoff. And so I personally find a lot of the time that compacting a conversation is not really the thing I would want to do.

I just wanna have a detailed summary where I can completely clear the session and just move into a new chat to actually implement on the thing that I just spent time brainstorming around. So let's say, for example, that we wanna improve the rate limiting inside of our app, and a really helpful pattern is to go out and explore or research the problem domain first.

And then based on those findings, we can start building a plan out for how we wanna actually fix this thing or address the thing that we're researching. After about five minutes, our research is complete, and we have these takeaways for how we could consider putting rate limiting in place in our app. So we have some documentation here.

What are the libraries we would use? What are the critical rules we would need to follow? What are the different algorithms we could consider using for how the rate limiting is gonna actually, like, go?

Now let's say for whatever reason, maybe we continue this chat and we continue to aggregate more context, or we just need to be able to store this output for a later implementation. We can just come through and run this handoff command. And then it's gonna ask us what is the next session gonna be used for, and we could say implementing rate limiting or rather building a plan for implementing rate limiting in our app.

And so now that this handoff command is complete, we can see that we've transferred over all of that context that we just spent time generating, and we could now kick off a new session in order to actually implement anything that we found inside of this that we would want inside of our project. So scratch pad systems like this are great, but one thing people really don't talk about enough is the slow death caused by a bad Claude markdown file.

So a lot of conventions that people use inside of their Claude markdown file are honestly, like, kind of old and out of date. And I think this is something where, like, engineers that use AI coding tools are really good about having disciplined Claude Markdown files. And people that are more of, like, the vibe code first, they don't really give the Claude markdown file the type of justice it deserves.

So the consensus today is that these things should be pretty light, less than 300 lines long, but there's also, like, very specific things that they should include inside of them. The first thing is motivational intent or, like, a project one liner.

The thing with language models is that they will inevitably face a situation where there's ambiguity in what needs to be done. And when they have the motivation of your project or your intent behind anything that you're doing, they tend to hit the mark of good a lot more easily. So having a one liner inside of your Claude markdown file that explains the purpose of your project and the motivation behind it is really valuable.

The second thing is any non obvious tooling. So language models are actually really good at seeing patterns and knowing the things that are obvious based on what is in your project already.

But anything that is not obvious, or any package-specific considerations based on your project setup, are the types of things that should be documented in your project.

So for example, if you use Next.js to build your apps, a lot of the training data of current models is based on out of date Next.js documentation.

And so this is an example where you would want to specifically tell it the version that you are using and known constraints or differences in that version in the context of your project. So we can actually see an example of this inside of that agent markdown file that we created earlier with Intent Layer, where proxy.ts is the new middleware for Next.js.

And so the reason something like this is important is that when you're going through and you're building things that touch that middleware layer, something like Claude Code will go out there without this type of instruction and try to tell you that you don't actually have a middleware file in place and then invent one, create one on the fly, and then next thing you know, you've got a bunch of errors inside of your app.

Things aren't hooked together the right way, and it generally becomes a huge pain in the ass. Number three, a concise architectural map. Now this doesn't mean having, like, diagrams explaining where everything is.

Again, a tool like Claude Code or Codex is gonna be really good at inferring things based on the structure of your project. So anything that deviates from the norm is gonna be super valuable to document.

And by the way, if you want an example template of one of these files, I will link to one in the description below this video. And so, again, we can see a version of that inside of this intent layer output.

Rules with verifiable instructions. So for example, instead of saying "write clean code," saying something like "parameterize all SQL queries."
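A rule like "parameterize all SQL queries" is verifiable because a reviewer can check each query for it mechanically. Here is a minimal sketch of what following the rule looks like, using Python's standard `sqlite3` module. The `users` table, the `find_user` helper, and the choice of SQLite are all hypothetical; the project in the video may use a different database and language.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with one hypothetical user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
    return conn


def find_user(conn: sqlite3.Connection, email: str) -> list:
    """Look up a user with a parameterized query."""
    # The ? placeholder makes the driver bind the value separately,
    # so user input is never spliced into the SQL text itself.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)
    ).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    print(find_user(conn, "a@example.com"))  # the one matching row
    print(find_user(conn, "' OR '1'='1"))    # injection attempt matches nothing
```

The injection string is treated as a literal email address, which is the behavior the rule exists to guarantee. "Write clean code" offers no equivalent check.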

Hard constraints and anti-patterns, which again is something that we saw inside of those nested Claude markdown files in our project earlier, again, with the Intent Layer system. So an example of that might be, don't add a new public route inside of our app without updating the is public path handler, or else people are gonna get redirected to login.

Pointers to deeper documentation about specific things. So for example, inside of these areas, if your Claude markdown file is getting too large, so you're starting to get past this 300 line limit, you can start breaking some of those things down into rule directories that get linked to and called when they need to be read.

But also that intent layer system is a perfect example of pointing to deeper docs inside of the code base. And then last but not least, gotchas and tribal knowledge. It's inevitable that as you're building things, you will run into errors, run into issues that you have to work around.

And you wanna make sure those things are documented so they do not resurface later on in time. So putting all of these things together makes for a really strong Claude markdown file.
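The "less than 300 lines" guideline is easy to enforce with a small check. A minimal sketch follows, where `check_length` is a hypothetical helper and the limit is the figure quoted above rather than a rule enforced by Claude Code.

```python
from pathlib import Path

MAX_LINES = 300  # the guideline quoted above: keep the file under 300 lines


def check_length(text: str, max_lines: int = MAX_LINES) -> tuple:
    """Return (line_count, within_limit) for the contents of a context file."""
    count = len(text.splitlines())
    return count, count < max_lines


if __name__ == "__main__":
    path = Path("CLAUDE.md")  # assumed location at the project root
    if path.exists():
        count, ok = check_length(path.read_text(encoding="utf-8"))
        print(f"{path}: {count} lines, {'ok' if ok else 'consider splitting into linked rule files'}")
```

Run as a pre-commit step, a check like this catches the file growing past the limit before it starts costing tokens on every session.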

And the thing that is interesting about this structure is it's all based around this idea of documenting things that aren't going to be obvious to the model because they've come a long way in being able to recognize patterns.

And so the real intent of this file is to tell it anything that it couldn't know by just inferring it from the code or the structure of your project, and then generally telling it other things that are just simply not obvious. So all of those old conventions about how to name files and folders and all of that type of stuff is honestly a little bit of a waste.

So this one is pretty simple conceptually, but you just need to be disciplined about it to actually get it right and grow this file over time. The founder of Claude Code updates his Claude markdown file on a weekly basis, and so that's something that all of us should be doing too. But like I said earlier, we have another method for pointing Claude to specific files and functions that we want it to be aware of in a little bit of a better way.

And so what I was talking about are code graphs. Now this concept of a code graph is starting to get a lot of traction. That's why we see a lot of libraries attempting to do the same exact thing right now.

If you were to go look at GitHub's trending repos, there are probably three of them that are trending over the last day, week, or month that are all trying to solve this same problem. How do we give an AI coding tool more contextual understanding of the actual structure of our code base and where certain types of things actually live?

So the reason people are making these tools is that the way that Claude Code, like, kind of works right now is that when you ask it a question, it needs to read all of your files in chunks, and it tries to gather context about the problem that it is being asked to solve. And, honestly, it's pretty good at doing that, but we can always try to do things better, and that is what these code review graphs attempt to do.

So instead of it having to read files in chunks, it's instead going to query a graph that sits in between Claude Code and your code base.

So if we were to install this thing and then come down into our project, we could type in the command code review graph build, and we can see very quickly that it has indexed and built a graph of our entire project. So for example, if we were to come down now that this is installed and ask a question like how does forking inside of our app work, we can see that instead of just reading files, it is using this code review graph MCP server.

And so the reason that this is a little bit different than how Claude code would work out of the box is that it finds the, like, original files that we're talking about in this context.

But now since it has, like, a a graph understanding of the code base, meaning all the different functions and the different imports and how they're related to each other, it can very quickly go from this fork action file to all of the different children and all of the different other files that call it.

So it's able to very quickly come to an understanding about how something works and the exact flow and order of things because, again, it has a knowledge graph of the code base. It has a deeper, like, semantic understanding of how the code base is actually structured and what connects to what.
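To make the idea concrete, here is a minimal sketch of what a code graph boils down to: a map of which file imports which, stored in both directions so that "what does this file use" and "who calls this file" are each a single lookup. The file names and edges are hypothetical, and this is not how any particular code graph library is implemented. Real tools typically work at the level of functions and symbols, not just files.

```python
from collections import defaultdict

# Hypothetical import edges: (file, file it imports). Invented for illustration.
EDGES = [
    ("fork_action.py", "db.py"),
    ("fork_action.py", "permissions.py"),
    ("api_routes.py", "fork_action.py"),
    ("test_fork.py", "fork_action.py"),
]


def build_graph(edges):
    """Return (imports, imported_by) adjacency maps for lookups in both directions."""
    imports = defaultdict(set)
    imported_by = defaultdict(set)
    for src, dst in edges:
        imports[src].add(dst)
        imported_by[dst].add(src)
    return imports, imported_by


imports, imported_by = build_graph(EDGES)

# Children of the fork action file: what it depends on.
print(sorted(imports["fork_action.py"]))
# Callers of the fork action file: what depends on it.
print(sorted(imported_by["fork_action.py"]))
```

Once the graph exists, answering "how does forking work" starts from one node and follows edges, instead of scanning files hoping to hit the right ones.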

And that is not something that Claude Code actually has. And so a lot of people claim that there's big token savings with tools like these, and I think you tend to get some degree of token savings. But the biggest thing that I have found using tools like this is that it can retrieve similar context a lot faster.

And the reason for that is instead of having to, again, read files in chunks and try to understand how everything is connected, it's just told how everything is connected.

Now the thing that's really cool about this is as you move through and actually make changes to your code base, you're making commits, it's going to update this graph over time, and it can actually help you understand better how other files might be impacted by what you're doing.

So for example, it will run a blast radius analysis, which means anytime you change a file, it's going to look at every single other file that calls it or is dependent on it or tests it. And then our AI models can read those files specifically and determine if there's something that needs to be done.
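A blast radius check like this can be sketched as a breadth-first walk over reverse dependencies: start at the changed file, then keep collecting everything that depends on it, directly or through another file. The dependency map below is hypothetical, and the actual tool may scope or rank the results differently.

```python
from collections import deque

# Hypothetical reverse-dependency map: file -> files that import or test it.
IMPORTED_BY = {
    "db.py": {"fork_action.py"},
    "fork_action.py": {"api_routes.py", "test_fork.py"},
    "api_routes.py": {"app.py"},
}


def blast_radius(imported_by, changed):
    """Return every file that depends on `changed`, directly or transitively."""
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in imported_by.get(current, ()):
            # Skip files already collected so cycles cannot loop forever.
            if dependent not in seen and dependent != changed:
                seen.add(dependent)
                queue.append(dependent)
    return seen


print(sorted(blast_radius(IMPORTED_BY, "db.py")))
```

The result is the short list of files the model should reread after a change, which is much smaller than the whole project.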

So I haven't experienced 49 times fewer tokens. I have experienced a little bit of token efficiency when you're doing larger tasks, but the biggest thing that I have noticed is that it tends to be a lot faster at getting to a solution with really good accuracy, meaning it's finding the same types of things that Claude Code would eventually find.

So last up on this list is another one that is really easy to implement and has a surprisingly big impact on your token efficiency. And so this app is called RTK, and it is basically a tool that sits in between your command line and Claude Code, and it cleans up all of the terminal output that Claude Code would otherwise read.

So for example, when Claude Code is running a command to list everything in a directory, or it's trying to read everything in a file, or search for files, or run a git status command, or a git diff command, or run tests inside of your project, it's reading the output of all of those commands.

And so the result of that is that you can honestly waste a lot of tokens by having Claude Code just read outputs that it doesn't actually have to read. And so the simplest way to install this is to just use Homebrew and type in brew install rtk, and from there it will work pretty much out of the box.

So just to show you an example of this, if we were to come down into my terminal and run rtk gain, these are the token savings from using this tool just in the time that I have been recording this video that we've all been watching: 192,000 tokens saved by using this tool.

And the reason for that is that I had to clean up one of my worktrees before I ran that Intent Layer skill that we looked at, and that required running a lot of git diffs and helping me manage different conflicts in my merges. And running all of those git status commands and git diff commands and reading the files, that is something that would have taken a ton of tokens.

And in this case, we were able to cut down on about 160,000 tokens by simply using this RTK command as a proxy.

Now the way that this thing works is that it sets up a hook inside of your coding agent. So anytime it sees that it's going to try to use a command like git status, it is instead going to use that RTK command.

So as an example, if we wanted to see, like, what's the difference between our current development branch against some other branch that we have, it's now proxying through this RTK command in order to gain an understanding of whatever it is that we're asking.
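The rewrite step in that hook can be pictured roughly like this: look at the command the agent is about to run, and if it matches a known pattern, put the proxy in front of it. This is a sketch of the idea only. The set of proxied commands and the `rewrite` function are assumptions for illustration, not RTK's actual hook code or its full command list.

```python
import shlex

# Assumed set of commands worth proxying; the real tool's list may differ.
PROXIED = {("git", "status"), ("git", "diff"), ("git", "log")}


def rewrite(command: str) -> str:
    """Prefix a shell command with `rtk` when it matches a proxied pattern."""
    parts = shlex.split(command)
    for pattern in PROXIED:
        if tuple(parts[: len(pattern)]) == pattern:
            return shlex.join(["rtk", *parts])
    # Anything unrecognized passes through untouched.
    return command


print(rewrite("git log --oneline"))
print(rewrite("echo hello"))
```

Passing unknown commands through unchanged is the important design choice here: the proxy only steps in where it knows how to trim the output safely.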

So in this case, it's running RTK git log instead of the conventional git log. And so all of these commands are, again, proxying through this RTK library. And what that means is that we're going to be saving a lot of tokens by filtering out information that isn't really necessary for Claude Code to see, and we can always override it if we need to.

So there are really four ways that this thing works. Number one is smart filtering.

So if there's any sort of general noise, like comments or white space or boilerplate inside of an output, it's going to filter that out so that Claude Code doesn't need to waste input tokens on reading it. Number two, it can aggregate similar items together.

So for example, maybe your terminal is just dumping the same 500 error over and over and over and over again.

It can group those things together and then just tell Claude Code how many times it's happening. Number three, in the case of things like git logs, it can keep the relevant context and then truncate or cut out really long descriptions or redundant information.

Again, that can be overridden if it needs to go gather that info because it's really important. And then number four, kind of similar to what I described in grouping, it can deduplicate things. So if it has repeated log lines, it can just show that line one time with the count of how many times it is happening.

So these are the four primary strategies that get applied to every single command that gets used. And so if you want to see how effective this actually is for you, again, you can run this rtk gain command, and this is going to show you exactly how many tokens it saved you through those four methods, the filtering and the grouping and all of that stuff.
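Here is a rough sketch of those strategies applied to a blob of command output. It is a simplified illustration, not RTK's implementation: the `compact` function, the 80-character limit, and the treatment of `#` lines as noise are all assumptions, and grouping and deduplication are folded into a single counting step for brevity.

```python
from collections import Counter

MAX_LINE = 80  # assumed truncation width, chosen for illustration


def compact(output: str) -> str:
    """Shrink command output by filtering, truncating, and counting repeats."""
    lines = [ln.rstrip() for ln in output.splitlines()]

    # Filtering: drop blank lines and comment-only lines.
    lines = [ln for ln in lines if ln.strip() and not ln.lstrip().startswith("#")]

    # Truncation: cut very long lines down to MAX_LINE characters.
    lines = [ln if len(ln) <= MAX_LINE else ln[: MAX_LINE - 3] + "..." for ln in lines]

    # Grouping / deduplication: show each distinct line once, with a count.
    counts = Counter(lines)
    result = []
    for ln in dict.fromkeys(lines):  # keeps first-seen order
        n = counts[ln]
        result.append(f"{ln} (x{n})" if n > 1 else ln)
    return "\n".join(result)


sample = "# debug banner\n\nGET /fork 500\nGET /fork 500\nGET /fork 500\nbuild ok\n"
print(compact(sample))
```

Six lines of raw output collapse to two, and the model still learns the one thing that matters: the same 500 error happened three times.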

And the thing that's pretty cool about this, if we were to come down and run rtk help, is that we can actually rewrite commands and customize this. If we ever encounter something that is outside the scope of what it handles by default, we can move through and build our own commands, which is pretty cool.

So there you have it. Seven tools that will really help you cut down the number of tokens that you are using. So of the tools on this list that we went through, there are four of them that I actually use on a daily or weekly basis.

The RTK proxy that we just went through, the Intent Layer tool, and then both of Matt Pocock's Caveman and Handoff skills. I use those on a weekly basis to really help cut down on my token usage.

There will be a link to that Claude markdown file that I mentioned earlier in the description below, but that's it for this video. I will see you in the next one.

The Hook

The bait, then the rug-pull.

Every session has a token budget, and most of it drains before you type a single prompt. Seven tools attack the waste at its source -- from the skills you forgot you installed to the git diffs the model reads but never needed to see.

Frameworks