Big Idea

The argument in one line.

Software engineering fundamentals — small tasks, tight feedback loops, shared design concepts, and deep modules — are what make autonomous AI coding agents produce high-quality output, and skipping them is why most developers are frustrated with AI code.

Who This Is For

Read if. Skip if.

READ IF YOU ARE…

You write code with AI every day and keep hitting walls where the agent drifts into the dumb zone or produces garbage after the first few sessions.
You are a solo developer or small-team lead trying to build a reliable night-shift pipeline where agents ship features autonomously while you are away.
You want a reproducible workflow that takes a vague brief all the way to a committed, TDD-tested, code-reviewed feature with AI doing the heavy lifting.
You are curious how classic software engineering books map directly onto AI agent patterns.

SKIP IF…

You want a step-by-step tutorial for a specific tool or CLI — this is framework-level thinking delivered through a live demo, not a beginner setup guide.
You already run a mature agentic orchestration stack and are looking for eval benchmarks or model comparisons rather than workflow philosophy.

TL;DR

The full version, fast.

LLMs have a smart zone of roughly 100k tokens, and the entire workflow is designed to stay inside it. A slash-command grill session stress-tests a vague brief and builds a shared design concept between developer and AI before a single line of code is written. That conversation becomes a PRD — the destination document. The PRD is sliced into vertical Kanban issues that each cross all system layers, enabling the agent to get integrated feedback after every issue. An autonomous AFK loop runs TDD against those issues. The ceiling on output quality is the quality of the feedback loops: codebases with shallow modules and no tests produce slop regardless of model size.

Chapters

Where the time goes.

00:00 – 04:20

01 · Introduction & Thesis

Workshop kickoff; audience poll on AI coding experience; core claim that SE fundamentals work for AI.

04:20 – 12:45

02 · Smart Zone & Memento

LLM attention quadratic scaling; the 100k smart zone; compacting vs. clearing; multi-phase plan as precursor to DAG.

12:45 – 22:10

03 · The Grill Me Skill

Live /grill-me demo on a gamification brief; shared design concept vs. plan; sub-agents; 25k tokens of alignment.

22:10 – 35:50

04 · Q&A: Grilling & Alignment

Specs-to-code critique; meta-prompting tool ecosystem; who should run grill sessions; 1M context window reality check.

35:50 – 48:15

05 · Writing the PRD

/write-prd demo; destination doc structure; vertical vs. horizontal slicing; tracer bullet concept; proposed modules in the PRD.

48:15 – 1:05:30

06 · Slicing into Issues & AFK Agent

Kanban board from PRD; DAG blocking relationships; parallelization; Ralph loop prompt walkthrough; TDD red-green-refactor live.

1:05:30 – 1:18:45

07 · QA, Code Review & Human Touch

Human QA as taste mechanism; more code review is unavoidable; team workflow for planning phases; prototype role in front end.

1:18:45 – 1:23:30

08 · Deep vs. Shallow Modules

Ousterhout deep module concept; AI defaults to shallow; /improve-codebase-architecture skill live scan; big integration test boundaries.

1:23:30 – 1:34:06

09 · Parallelization with Sandcastle

Sandcastle TypeScript library; planner-implementer-reviewer-merger pipeline; push vs. pull for coding standards; Opus review / Sonnet implement; final summary.

Atomic Insights

Lines worth screenshotting.

LLMs go dumb at roughly 100k tokens regardless of context window size — a 1M window is just more dumb zone, not more smart zone.
Compacting preserves sediment and introduces drift; clearing resets to a known state every time.
The goal of the grill session is a shared design concept, not a plan — AI in plan mode produces a plan before you have reached alignment.
Specs-to-code fails not because specs are bad but because it encourages treating the code as irrelevant; the code is the battleground.
AI codes horizontally by default — layer by layer — which delays integrated feedback until the final phase; vertical tracer-bullet slices fix this.
A tracer bullet issue must cross schema, service, and UI in a single ticket so the agent can test the entire integrated flow.
A sequential multi-phase plan can only be worked by one agent; a DAG of issues with blocking relationships enables genuine parallelization.
The ceiling on AI coding quality is the quality of your feedback loops — agents coding without tests and type checks silently ship bad code.
AI unaided produces shallow modules — many tiny files with small exports — because it codes by spreading changes across layers.
A deep module has a large complex interior behind a small simple interface; wrap a big integration test boundary around it.
Leaving a closed PRD as a markdown file in the repo causes doc rot — the agent finds it, trusts it, and drifts from the actual codebase.
Push coding standards to the reviewer agent so they are always in context; let the implementer pull them on demand.
Human QA is the mechanism for imposing taste — automating it entirely produces apps that technically function but feel like slop.
The human role shifts to a day shift: grilling, PRD, Kanban slicing. The agent handles the night shift: implementation, TDD, automated review.
Using Opus for code review and Sonnet for implementation is a deliberate cost-quality tradeoff — reviewing requires more reasoning, implementation more throughput.

Takeaway

The agent is only as good as the codebase you hand it.

WHAT TO LEARN

Classic software engineering discipline — shared alignment, tight feedback loops, and deep modules — is the multiplier that separates high-output AI coding from expensive slop generation.

01 · Introduction & Thesis

AI is a new paradigm only in tooling — the underlying discipline of writing good software still determines output quality.

02 · Smart Zone & Memento

Size every task to fit inside roughly 100k tokens, and clear context between sessions rather than compacting, which accumulates noise.
A 1M context window gives you more dumb zone, not more smart zone — the smart ceiling has not risen proportionally.
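
One way to act on these two points is to track an estimated token count per session and clear, rather than compact, once the next task would overflow the budget. This is a minimal sketch under stated assumptions: the 100k figure is the talk's rough number, the four-characters-per-token estimate is a crude heuristic, and `SMART_ZONE_TOKENS`, `estimate_tokens`, and `should_clear` are illustrative names that do not come from any tool shown in the session.

```python
SMART_ZONE_TOKENS = 100_000  # rough figure from the talk, not a measured limit
CHARS_PER_TOKEN = 4          # assumption: crude heuristic, real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def should_clear(messages: list[str], next_task: str,
                 budget: int = SMART_ZONE_TOKENS) -> bool:
    """Return True when adding the next task would push the session past the budget."""
    used = sum(estimate_tokens(m) for m in messages)
    return used + estimate_tokens(next_task) > budget
```

In practice the estimate would come from the model provider's own token counts; the point is only that the decision to clear is made before the session drifts past the budget, not after.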

03 · The Grill Me Skill

Run a structured grilling session before writing any plan; the goal is a shared design concept with the AI, not an asset you hand to the AI.
Sub-agents that explore the codebase before the grill session add accuracy without bloating the parent context window.

04 · Q&A: Grilling & Alignment

Specs-to-code fails because it encourages you to ignore the code; keep the codebase in view throughout planning and use it to sanity-check every proposed module.
Own your planning stack rather than delegating it to a third-party framework — when it breaks, you need to know how to fix it.

05 · Writing the PRD

The PRD is a destination document and a definition of done, not a spec that lets you hand off to the AI and stop reading the code.
Slice your PRD into vertical tracer-bullet issues so the agent gets integrated feedback after every issue, not only at the end of a horizontal phase.
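
The tracer-bullet rule can be expressed as a simple check on each issue before it goes on the board. This is a sketch, assuming the three layers named in this summary (schema, service, UI); the `Issue` shape and `is_vertical_slice` are hypothetical and not part of any tool from the talk.

```python
from dataclasses import dataclass, field

# Layers a tracer-bullet issue is expected to cross (taken from the summary).
REQUIRED_LAYERS = frozenset({"schema", "service", "ui"})


@dataclass
class Issue:
    title: str
    layers: set[str] = field(default_factory=set)


def is_vertical_slice(issue: Issue) -> bool:
    """True when the issue touches every required layer in one ticket."""
    return REQUIRED_LAYERS <= issue.layers
```

A horizontal issue such as "add all tables" would fail this check because it touches only the schema layer, which is exactly the case where integrated feedback is delayed.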

06 · Slicing into Issues & AFK Agent

A DAG of Kanban issues with explicit blocking relationships enables parallel agent runs; a numbered sequential plan does not.
The ceiling on agent output is your feedback loop quality; agents coding without tests and type checks produce garbage silently.
TDD red-green-refactor is harder for the agent to cheat at than writing tests after implementation, because the failing test exists before the code it checks.
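
The difference between a DAG and a numbered plan shows up when the issues are grouped into waves that can run at the same time. This is a minimal sketch, assuming each issue lists the issues that block it; `parallel_waves` and the issue names are invented for illustration.

```python
def parallel_waves(blocked_by: dict[str, set[str]]) -> list[list[str]]:
    """Group issues into waves; every issue in a wave can run in parallel."""
    remaining = {issue: set(deps) for issue, deps in blocked_by.items()}
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # An issue is ready once all of its blockers are done.
        ready = sorted(i for i, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle or unknown blocker in issue graph")
        waves.append(ready)
        done.update(ready)
        for issue in ready:
            del remaining[issue]
    return waves


# Hypothetical board: two issues share one blocker and can run side by side.
board = {
    "first-flow": set(),
    "leaderboard": {"first-flow"},
    "badges": {"first-flow"},
    "profile-page": {"leaderboard", "badges"},
}

# A numbered plan is a chain, so every wave holds exactly one issue.
plan = {"phase-1": set(), "phase-2": {"phase-1"}, "phase-3": {"phase-2"}}
```

Calling `parallel_waves(board)` yields a middle wave with two issues, which is where a second agent becomes useful; `parallel_waves(plan)` never offers more than one issue at a time.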

07 · QA, Code Review & Human Touch

Human QA is not a bottleneck to automate away — it is the mechanism for imposing taste, and removing it produces apps that work but feel like slop.
Expect to do more code review than ever before; there is no shortcut for reviewing agent-generated output.

08 · Deep vs. Shallow Modules

AI defaults to shallow modules — many small exports with little logic; intentionally design deep modules with simple interfaces and big integration test boundaries.
Prefer closing issues over keeping completed PRDs as markdown files in the repo; stale documentation causes agents to drift from the actual codebase.
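
To make the deep-module idea concrete, here is a deliberately small sketch: one public function, with validation and scoring kept private behind it. The scoring rules, event shape, and names (`award_points`, `_validate`, `_score`) are invented for illustration; a real deep module would hide far more logic behind the same kind of narrow interface.

```python
from collections import defaultdict

# Hypothetical scoring rules, invented for illustration only.
_POINTS = {"lesson_completed": 10, "streak_day": 5}


def _validate(event: dict) -> bool:
    """Private: accept only known event types with a string user id."""
    return event.get("type") in _POINTS and isinstance(event.get("user"), str)


def _score(event: dict) -> int:
    """Private: look up the points for a validated event."""
    return _POINTS[event["type"]]


def award_points(events: list[dict]) -> dict[str, int]:
    """The module's single public entry point: events in, per-user totals out."""
    totals: dict[str, int] = defaultdict(int)
    for event in events:
        if _validate(event):
            totals[event["user"]] += _score(event)
    return dict(totals)
```

A test written against `award_points` alone exercises validation and scoring together, which is the "big integration test boundary" the section describes; the private helpers can change freely without breaking it.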

09 · Parallelization with Sandcastle

Push coding standards to the reviewer agent so they are always in context; let the implementer pull them on demand to avoid bloating every implementation session.
Classic pre-AI software books — Pragmatic Programmer, Brooks, Ousterhout, Fowler — already codified the principles that make AI agents effective; they are the highest-leverage prompt engineering resource available.
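
The push-versus-pull split for coding standards can be sketched as two prompt builders. This is an illustration only: the function names, prompt wording, and file path are assumptions, and it does not use or represent the Sandcastle API.

```python
def build_reviewer_prompt(diff: str, standards: str) -> str:
    """Push: the full standards text is always placed in the reviewer's context."""
    return f"Coding standards:\n{standards}\n\nReview this diff:\n{diff}"


def build_implementer_prompt(issue: str, standards_path: str) -> str:
    """Pull: the implementer gets only a pointer and reads the file when needed."""
    return (
        f"Implement this issue:\n{issue}\n\n"
        f"Coding standards live at {standards_path}; read them when relevant."
    )
```

The reviewer pays the token cost of the standards on every run because enforcing them is its job; the implementer keeps its context free for the code unless it decides the standards are relevant.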

Glossary

Terms worth knowing.

Smart Zone: The portion of an LLM context window — roughly the first 100k tokens — where output quality is highest before attention relationships become strained.
Dumb Zone: The portion of a context window beyond the smart zone where the model makes increasingly poor decisions; the failure mode of long uncleared sessions.
Compacting: Squeezing a long conversation history into a shorter summary to reclaim context space, at the cost of accumulated noise from the summarization.
Grill Me Skill: A Claude Code slash command that interviews the developer relentlessly about a brief — one question at a time with a recommended answer — until AI and human reach a shared design concept.
Design Concept: Frederick Brooks's term for the shared mental model of what is being built, held by all participants; the grill session is explicitly trying to build this between human and AI.
PRD: Product Requirements Document — a destination document summarizing the shared design concept, user stories, implementation decisions, and out-of-scope items used as the AI's definition of done.
Tracer Bullet: From The Pragmatic Programmer: a development unit that crosses all system layers end-to-end, providing immediate integrated feedback. Named after the phosphorescent bullets that show a gunner where the shots are landing.
Vertical Slice: A Kanban issue that touches schema, service logic, and UI in a single ticket, enabling the agent to produce testable integrated output after each issue rather than one layer at a time.
Ralph Loop: An autonomous AFK agent loop: the agent picks the next Kanban issue, implements it with TDD, runs feedback loops, commits, and repeats until the backlog is empty.
AFK: Away From Keyboard — tasks the agent can complete without human involvement. Contrasted with HITL (Human In The Loop) tasks like grilling and QA.
Deep Module: From John Ousterhout: a module with a simple public interface and a large amount of logic inside, easy to test with a big integration boundary and easy for agents to reason about.
Shallow Module: A module that exports many small functions with little internal logic; the default AI output, difficult to test meaningfully and hard for agents to navigate.
Sandcastle: A TypeScript library (@ai-hero/sandcastle) for running agent loops in parallel Docker sandbox worktrees, with a planner-implementer-reviewer-merger pipeline.
DAG: Directed Acyclic Graph — a Kanban board structure where issues have explicit blocking relationships, allowing parallel execution of non-blocked branches by independent agents.
Doc Rot: The failure mode where an outdated document remains in the repo, is discovered by the agent, and causes drift from the actual codebase because the agent trusts the stale doc.
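
The Ralph Loop entry above can be read as the following control flow. This is a sketch under assumptions: `implement` and `checks_pass` are hypothetical callables standing in for an agent run and for the test and type-check feedback loop, and the retry limit with escalation to a human is an added assumption rather than something described in the talk.

```python
from typing import Callable


def ralph_loop(
    backlog: list[str],
    implement: Callable[[str], None],
    checks_pass: Callable[[], bool],
    max_attempts: int = 3,  # assumption: the talk does not specify a retry limit
) -> tuple[list[str], list[str]]:
    """Work through the backlog; return (committed, needs_human)."""
    committed: list[str] = []
    needs_human: list[str] = []
    for issue in backlog:
        for _ in range(max_attempts):
            implement(issue)           # one agent attempt at the issue
            if checks_pass():          # tests and type checks gate the commit
                committed.append(issue)
                break
        else:
            needs_human.append(issue)  # escalate instead of committing failing code
    return committed, needs_human
```

The loop never records an issue as committed unless the feedback checks pass, which is why the quality of those checks sets the ceiling on what the loop can produce.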

Resources

Things they pointed at.

09:00 · product · Human Layer (Dex Horthy)

13:20 · link · AI Hero (aihero.dev)

1:26:40 · tool · Sandcastle (@ai-hero/sandcastle)

40:50 · book · The Pragmatic Programmer

26:20 · book · The Design of Design (Frederick P. Brooks)

1:19:40 · book · A Philosophy of Software Design (John Ousterhout)

34:20 · book · Refactoring (Martin Fowler)

1:29:40 · link · Beads Framework (Steve Yegge)

Quotables