Building an Agentic Dev Pipeline — From Ad-Hoc Prompting to a Repeatable Protocol

How eleven design decisions, a structured interview technique, and two effectiveness axes turned a slash command into a self-managing dev loop.

I had a problem that most people using LLMs for development eventually hit: inconsistency. Sometimes I’d get well-structured code with tests. Sometimes I’d get a half-finished implementation with no tests and no explanation of what changed. Sometimes I’d ask the same question twice and get architecturally different answers. The issue wasn’t the model — it was me. Every session started from scratch. No shared protocol. No handoffs. Just vibes.

The pipeline I built this week replaces that. It’s not a framework or a library — it’s nine files that define a protocol: who does what, in what order, with what information, and when to ask me before proceeding. The output is repeatable. The quality is auditable. And the design decisions that shaped it contain, I think, some generally useful lessons about building agentic systems.

One-line thesis: a well-designed agentic pipeline is a protocol, not a prompt.

The Pipeline

The result is five stages, seven specialised agents (one of them a routing brain), and a slash command that orchestrates everything.

/forge <input>
│
├─ [PLANNER — Opus]         interviews user (idea-first) or reads existing doc
│       ↓
├─ [DISPATCHER — Haiku]     certainty gate → auto-route or surface to user
│       ↓
├─ ── PLAN APPROVAL ──      hard stop, always — user reads plan.md before code runs
│       ↓
├─ [CODER — Sonnet]         implements; escalates to planner on ambiguity
│       ↓
├─ [DISPATCHER — Haiku]     certainty gate
│       ↓
├─ [QA-REVIEWER — Sonnet]   reviews output; writes review-findings.md
│       ↓
├─ [DISPATCHER — Haiku]     certainty gate
│       ↓
├─ ── REVIEW APPROVAL ──    hard stop, always — user reads findings before tests run
│       ↓
├─ [TEST-WRITER — Sonnet]   reads review findings; writes + runs pytest
│       ↓
└─ [DOCS-WRITER — Sonnet]   changelog written to {changelog-dir}

Lesson G1 — Pick model per task, not per session. The interview surfaced this early: Opus for deep architectural planning, Sonnet for implementation and review, Haiku for routing decisions and file exploration. The model is a parameter of the task, not a global setting for the session. Switching models mid-pipeline is normal and expected.


The Core Insight: Certainty Grading and the Handoff Contract

This is the part of the design I’d build into any agentic system, not just this one.

Certainty grading

Most agentic pipelines use binary escalation: the agent either succeeds or fails, and failure routes somewhere. That’s too coarse. There are three meaningfully different states an agent can be in when it finishes a task:

  • It knows it’s done correctly (sure)
  • It thinks it’s done but something feels off (unsure)
  • It doesn’t have enough information to evaluate its own output (dont-know)

Only the first state should auto-route to the next stage. The second and third should surface to the user — not with a failure message, but with the specific question the agent can’t resolve, plus the context needed to answer it.

The practical difference: an agent that returns unsure has done useful work. It has a draft. It has a specific blocker. Treating it as a failure throws away that work. Treating it as a success routes broken output downstream. Graded certainty is the right primitive: auto-route on sure, ask the human on anything else.
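A minimal sketch of that primitive, assuming the three grade strings above are the only valid values. The function name `certainty_gate` and the shape of the returned dict are illustrative, not taken from the pipeline's files.

```python
from enum import Enum


class Certainty(str, Enum):
    SURE = "sure"
    UNSURE = "unsure"
    DONT_KNOW = "dont-know"


def certainty_gate(certainty: str, next_stage: str) -> dict:
    """Auto-route only on 'sure'; surface anything else to the user."""
    grade = Certainty(certainty)  # raises ValueError on an unknown grade
    if grade is Certainty.SURE:
        return {"action": "route", "target": next_stage}
    # 'unsure' and 'dont-know' keep the agent's draft and ask the human.
    return {"action": "ask_user", "target": next_stage, "grade": grade.value}
```

An unknown grade raises instead of defaulting to either branch, so a malformed handoff cannot be mistaken for success or failure.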

The handoff log as contract

Agents don’t share memory. Each agent runs in an isolated context — what it knows is what it reads at the start of its turn. The only way information flows between agents is through files they write and files the next agent reads. This means the handoff file isn’t just a log — it’s the contract.

The handoff log has two parts, each serving a different consumer:

{
  "agent": "coder",
  "status": "done",
  "certainty": "sure",
  "escalate": false,
  "files_touched": ["file1", "file2"],
  "log_path": ".claude/workflow/run-20260519-143012/coder.md"
}
---
# coder @ 2026-05-19T14:30:12

## did
- added rate_limit_check() to api/chat.py before request processing
- added RATE_LIMIT_RPM config key to core/config.py

## state
- files-touched: [file1, file2]
- tests: not-run
- open-issues: none

The JSON front-matter is for the dispatcher. It reads only that block — fast, structured, unambiguous. The narrative is for the human reading the log after the session. Two consumers with different needs, one file with two sections.
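Reading the two sections apart can be sketched as below. This assumes the separator is a line containing only `---`, as in the example; `split_handoff` is a hypothetical helper name, and the sample file names are borrowed from the narrative above.

```python
import json


def split_handoff(text: str) -> tuple[dict, str]:
    """Split a handoff log into (front-matter dict, narrative markdown)."""
    head, sep, narrative = text.partition("\n---\n")
    if not sep:
        raise ValueError("handoff log has no '---' separator line")
    return json.loads(head), narrative.lstrip("\n")


example = """{
  "agent": "coder",
  "status": "done",
  "certainty": "sure",
  "escalate": false,
  "files_touched": ["api/chat.py", "core/config.py"]
}
---
# coder

## did
- added rate_limit_check() to api/chat.py before request processing
"""

payload, narrative = split_handoff(example)
# The dispatcher would look only at `payload`; a human reads `narrative`.
```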

This design pays off repeatedly. The qa-reviewer reads the coder’s files_touched list to know what to review. The test-writer reads the reviewer’s findings to know which files to skip. The final documentation stage reads both to write the changelog. Every downstream agent inherits structured context it didn’t have to discover itself.


Quality-Effectiveness: Decisions That Improved Output

Dispatcher isolation

The dispatcher runs on Haiku and is strictly read-only. It reads one file and returns one structured decision. That’s its entire job.

I considered embedding routing logic in the slash command itself — checking the handoff payload inline and branching from there. The problem is that routing logic grows. Edge cases accumulate. Exceptions get added. A routing brain embedded in the orchestrator becomes a routing brain nobody maintains separately.

Isolating the dispatcher means it can be reasoned about independently. It has no write access, so it can’t corrupt state. It has a fixed input format and a fixed output format. If routing breaks, the dispatcher is the only place to look.

Lesson G4 — Routing agents are read-only. A dispatcher that can write is a dispatcher that can corrupt state. The constraint isn’t just good practice — it’s what makes the dispatcher auditable. If something routes wrong, the dispatcher file is the complete record of why.

Escalation paths with clean ownership

The escalation rules follow ownership boundaries: the coder escalates to the planner when it encounters ambiguity or architectural decisions it can’t resolve. The test-writer escalates to the coder when tests reveal a source bug. Neither agent crosses into the other’s territory.

This almost breaks down in one place: what should the test-writer do if it finds a trivial issue in source code — a wrong import, a typo in a return value? The answer, after thinking it through, is still “escalate to coder.” The alternative — test-writer patches source — blurs the boundary. A test-writer that sometimes fixes source code is a test-writer you can’t trust to stay in its lane. The escalation overhead is worth the predictability.
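One way to make that boundary checkable is to inspect the test-writer's files_touched list. The sketch below assumes tests live under a top-level `tests/` directory, which the article does not state; the helper names are hypothetical.

```python
from pathlib import PurePosixPath

# Escalation ownership as described above.
ESCALATES_TO = {"coder": "planner", "test-writer": "coder"}


def source_files_touched(files_touched: list[str], tests_dir: str = "tests") -> list[str]:
    """Return touched paths that lie outside the tests directory."""
    return [f for f in files_touched if PurePosixPath(f).parts[:1] != (tests_dir,)]


def check_test_writer(files_touched: list[str]) -> dict:
    """Flag an escalation if the test-writer touched anything but tests."""
    strays = source_files_touched(files_touched)
    if strays:
        return {"escalate": True, "to": ESCALATES_TO["test-writer"], "files": strays}
    return {"escalate": False}
```

Even a one-character source fix would show up in `files`, which is the point: the rule has no "trivial" exception to argue about.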

The multiple-solutions protocol

The planner has an explicit rule: if planning surfaces more than one viable approach, it never picks silently. It either runs the @mattpocock skill:grill-me interview to surface the user’s preference, or it presents an options table with pros and cons and waits for a choice.

This matters because the first viable solution the model finds isn’t always the right one for the user’s constraints. The planner might see three ways to implement rate limiting — middleware, decorator, or gateway-level. The right choice depends on factors the planner doesn’t know: whether the user wants it to apply to all routes or specific ones, whether it needs to be user-scoped, whether they care about testability over simplicity. Making that choice visible forces it to be made deliberately.

Hard approval gates

The pipeline has two hard stops: one after planning and one after review. These fire regardless of the certainty grade — even if every agent returned sure, the user still sees the plan before code runs and the findings before tests run.

Alongside the two gates sit the two escalation paths described earlier: coder → planner (ambiguity or an architectural decision) and test-writer → coder (a source bug found during testing). Both are resolved before the pipeline continues, so no problem is silently swallowed.

Lesson G5 — Hard gates are for humans, certainty gates are for agents. Agent sure means the agent has no unresolved questions. It doesn’t mean the user has seen the output and agrees with it. These are different things. The certainty mechanism prevents broken output from flowing downstream. The approval gate prevents correct output from flowing forward without human awareness. Both are necessary.
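The two mechanisms compose in a fixed order, sketched here under some assumptions: the stage list mirrors the diagram, the gates sit after the planner and the qa-reviewer, and `next_step` with its return tuples is an illustrative shape rather than the dispatcher's real output format.

```python
from typing import Optional

STAGES = ["planner", "coder", "qa-reviewer", "test-writer", "docs-writer"]
HARD_GATE_AFTER = {"planner": "plan approval", "qa-reviewer": "review approval"}


def next_step(agent: str, certainty: str, user_approved: bool = False) -> tuple[str, Optional[str]]:
    """Apply the certainty gate first, then the hard gate, then route."""
    if certainty != "sure":
        return ("ask_user", agent)  # agent has an unresolved question
    if agent in HARD_GATE_AFTER and not user_approved:
        return ("wait_for_approval", HARD_GATE_AFTER[agent])  # fires even on 'sure'
    idx = STAGES.index(agent)
    if idx + 1 == len(STAGES):
        return ("finished", None)
    return ("route", STAGES[idx + 1])
```

Note that `user_approved` is never inferred from the certainty grade; it can only come from the human.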


Token-Savings Effectiveness: Same Quality, Fewer Tokens

The researcher subagent pattern

The planner and coder both need codebase context before they can work. The naive approach is to read that context directly — open ten files, search for symbols, trace call paths. The problem is that all of that reading lands in the main agent’s context, where it stays for the rest of the turn.

Instead, both agents dispatch a researcher subagent: a Haiku-model agent with read-only access to the codebase. The researcher answers one specific question — “where is rate limiting handled and what are the relevant invariants?” — and returns a dense summary. Its context is then discarded. Only the summary flows back.

Lesson G2 — Subagent context is throwaway; only the summary is cargo. This is the main token-savings pattern in agentic systems. If you need to read ten files to answer a question, dispatch a subagent to read them. The subagent’s context is the working memory for that investigation. The calling agent’s context only receives the conclusion. On a multi-file investigation, this can reduce main-context token usage by 60–80% compared to reading everything inline.
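A toy model of the pattern, with placeholder file contents and a word count standing in for a tokenizer. The `Context` class, the `researcher` function, and the summariser callback are all hypothetical; no real model is called and the numbers say nothing about actual savings.

```python
def rough_tokens(text: str) -> int:
    """Very rough stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())


class Context:
    """Accumulates everything an agent has read during its turn."""

    def __init__(self) -> None:
        self.entries: list[str] = []

    def read(self, text: str) -> None:
        self.entries.append(text)

    @property
    def tokens(self) -> int:
        return sum(rough_tokens(e) for e in self.entries)


def researcher(question: str, files: dict[str, str], summarise) -> str:
    """Investigate in a throwaway context and return only the summary."""
    scratch = Context()  # lives only inside this call, then is discarded
    for body in files.values():
        scratch.read(body)
    return summarise(question, files)


# Placeholder contents, not real source files.
files = {
    "api/chat.py": "placeholder source line\n" * 200,
    "core/config.py": "placeholder source line\n" * 150,
}

main = Context()
main.read(researcher(
    "where is rate limiting handled?",
    files,
    lambda q, f: f"{len(f)} files inspected; see api/chat.py",
))
```

After this runs, `main` holds only the one-line summary; the file bodies were read into `scratch` and never reached it.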

Prompt caching is a protocol

Claude’s prompt cache keys on the prefix of a conversation — the longest stable prefix gets cached and reused. This means the order in which you provide context matters for cost.

Every agent in the pipeline follows the same rule: read all static files first, generate output last. The files don’t change between invocations. The instructions don’t change between runs (stable agent descriptions). Only the dynamic content — the task-specific details — changes. By putting static content first and dynamic content last, cache hits are maximised on every re-invocation.

The agent descriptions themselves are part of this strategy. They’re fixed markdown files. Changing an agent description mid-project busts the cache for every subsequent run of that agent. Keeping them stable is a cost decision, not just a quality decision.

Lesson G3 — Prompt caching is a protocol, not a setting. You don’t “turn on” caching. You design for it. Files before instructions. Stable descriptions. Dynamic content at the end. Miss this and you pay full price on every token, every run.
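The ordering rule can be illustrated without touching any API. In this sketch, `build_prompt` is a hypothetical assembler and the shared character prefix is only a proxy for what a prefix cache could reuse; real caches work on tokens and have their own granularity rules.

```python
import os


def build_prompt(static_files: dict[str, str], agent_description: str, task: str) -> str:
    """Assemble a prompt with stable content first and the task last."""
    parts = [body for _, body in sorted(static_files.items())]  # deterministic order
    parts.append(agent_description)  # stable across runs
    parts.append(task)               # the only part that changes
    return "\n\n".join(parts)


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


# Placeholder contents, not real project files.
files = {
    "core/config.py": "<contents of core/config.py>",
    "api/chat.py": "<contents of api/chat.py>",
}
description = "<stable agent description>"

run_a = build_prompt(files, description, "Task: add rate limiting")
run_b = build_prompt(files, description, "Task: fix a typo in the changelog")
stable = build_prompt(files, description, "")
```

Sorting the file names matters as much as the ordering of the sections: a prompt that lists the same files in a different order on each run shares almost no prefix with the previous one.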

The review-to-test handoff

Before the test-writer runs, the qa-reviewer writes review-findings.md — a structured document that summarises every file it reviewed: what it found, what the test-writer should cover, and which files need a deeper look before writing tests.

The test-writer reads this document first. For files marked clean, it writes tests based on the review summary without re-reading the source. It only re-reads files explicitly flagged needs-deeper-look. On a typical change touching four or five files, this eliminates two or three full file reads from the test-writer’s context.

This is a deliberate design choice, not an optimisation discovered after the fact. The review document is the communication channel between reviewer and tester. It saves tokens and it saves time — but the primary reason to write it is to give the test-writer structured, already-interpreted context, not raw source code it has to interpret from scratch.
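On the test-writer's side the rule reduces to a filter. The sketch assumes the findings have already been parsed into dicts with `path` and `status` keys; that structure is an assumption, since the article does not show the layout of review-findings.md. Only the two status values come from the text above.

```python
def files_to_reread(findings: list[dict]) -> list[str]:
    """Return only the files the reviewer flagged for a deeper look."""
    return [f["path"] for f in findings if f["status"] == "needs-deeper-look"]


# Hypothetical parsed findings for a two-file change.
findings = [
    {"path": "api/chat.py", "status": "needs-deeper-look"},
    {"path": "core/config.py", "status": "clean"},
]

reread = files_to_reread(findings)
```

Files marked clean are covered from the review summary alone, so they never enter the test-writer's context as source.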

Orchestrator context minimalism

Over a long run, the orchestrator can drift off its routing protocol because of context dilution. To prevent that, agents are instructed to output only the compact handoff-payload JSON in their response; all narrative goes to the handoff file, not to the tool result. The orchestrator is told to treat agent responses as opaque completion signals and never reference their content. This keeps the orchestrator’s accumulated context flat across a full run. The handoff files grow with every stage; the conversation context does not. The same routing rules that were prominent at turn one are still proportionally prominent at turn ten.

Lesson G6 — Minimise tool results, not just tool calls. The same principle that makes researcher subagents efficient (throwaway context, summary cargo) applies to every agent in the pipeline. If the orchestrator never needs to read an agent’s response, make that response a compact machine-readable signal. All human-readable content belongs in the files and stdout, where it can be read by the right consumer at the right time.
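A guard for that rule might look like the sketch below. The allowed keys are the ones in the handoff payload shown earlier; the `is_compact_signal` name and the 600-character budget are assumptions chosen for illustration.

```python
import json

ALLOWED_KEYS = {"agent", "status", "certainty", "escalate", "files_touched", "log_path"}


def is_compact_signal(response: str, max_chars: int = 600) -> bool:
    """True if the response is nothing but a small handoff-payload object."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return False  # trailing narrative or prose makes the response invalid JSON
    return (
        isinstance(payload, dict)
        and set(payload) <= ALLOWED_KEYS
        and len(response) <= max_chars
    )
```

Any narrative appended after the JSON object fails the parse, so the check rejects exactly the verbose tool results that would otherwise pile up in the orchestrator's context.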


What We Shipped

Nine files. No changes to the application source. The pipeline is entirely meta — it lives in .claude/ alongside the project it manages.

.claude/
  commands/forge.md   # orchestrator slash command
  agents/
    planner.md                 # Opus — planning + grill-me interview
    dispatcher.md              # Haiku — routing brain (read-only)
    researcher.md              # Haiku — codebase explorer (subagent only)
    coder.md                   # Sonnet — implementation
    qa-reviewer.md             # Sonnet — code review + findings doc
    test-writer.md             # Sonnet — tests + pytest
    docs-writer.md             # Sonnet — changelog entry
  workflow/
    .gitignore                 # logs persist locally, not in git

Every pipeline run writes its working files to .claude/workflow/run-{timestamp}/ — plan, handoff logs, review findings. They’re gitignored but not cleaned up. After a session, I can read exactly what each agent did, what it decided, and why it routed the way it did. That introspectability isn’t an afterthought — it’s why the handoff log has a narrative section at all.

The full pipeline is at github@PazsitZ/forge — nine files, drop into any Claude Code project.

The run logs also feed the docs-writer agent. It reads `plan.md` (why the change was made) and the coder handoff log (`files_touched`, what changed) instead of re-reading source files. The structured data that made routing fast makes documentation cheap — same files, second consumer, near-zero additional tokens. The handoff log earns its cost at every stage, not just the one that wrote it.

The pipeline’s full state persistence also enables resumption: if a run is aborted mid-way — context overflow, user interrupt, network issue — the handoff logs already on disk contain everything needed to re-enter. `/forge .claude/workflow/run-{timestamp}` scans for the most advanced completed stage and picks up from there, without regenerating the timestamp or re-running completed work.
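The scan itself can be sketched as follows. It assumes each stage writes a log named after its agent (as with coder.md in the log_path example) and that a stage counts as complete when its front-matter reports status "done"; the function names are hypothetical, and the real command may decide differently.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

STAGES = ["planner", "coder", "qa-reviewer", "test-writer", "docs-writer"]


def last_completed_stage(run_dir: Path) -> Optional[str]:
    """Return the most advanced stage whose handoff log reports status 'done'."""
    latest = None
    for stage in STAGES:
        log = run_dir / f"{stage}.md"
        if not log.exists():
            break
        head, _, _ = log.read_text(encoding="utf-8").partition("\n---\n")
        try:
            done = json.loads(head).get("status") == "done"
        except (json.JSONDecodeError, AttributeError):
            done = False  # unreadable front-matter counts as incomplete
        if not done:
            break
        latest = stage
    return latest


def resume_from(run_dir: Path) -> Optional[str]:
    """Return the stage to run next, or None if the run already finished."""
    latest = last_completed_stage(run_dir)
    if latest is None:
        return STAGES[0]
    idx = STAGES.index(latest) + 1
    return STAGES[idx] if idx < len(STAGES) else None


# Usage: simulate a run that was aborted after the coder finished.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    for stage in ("planner", "coder"):
        (run_dir / f"{stage}.md").write_text(
            '{"status": "done"}\n---\n# narrative\n', encoding="utf-8"
        )
    next_stage = resume_from(run_dir)
```

The scan stops at the first missing or incomplete log, so a half-written handoff from the aborted stage causes that stage to re-run rather than being trusted.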


Key Takeaways

  • G1 — Model per task, not per session. Opus for architectural decisions. Sonnet for implementation and review. Haiku for routing and exploration. Switch freely; pay only for what the task requires.

  • G2 — Subagent context is throwaway; only the summary is cargo. Multi-file investigations belong in a subagent. The calling agent receives the conclusion, not the working memory.

  • G3 — Prompt caching is a protocol. Static files first, dynamic content last. Stable agent descriptions. Design for cache hits from the start, not as a retrofit.

  • G4 — Routing agents are read-only. A dispatcher that can write can corrupt state. The read-only constraint is what makes routing auditable and safe.

  • G5 — Hard gates are for humans; certainty gates are for agents. Agent certainty tells you the agent has no unresolved questions. It tells you nothing about whether the user has seen the output. Both mechanisms are needed.

  • G6 — Minimise tool results, not just tool calls. If the orchestrator never reads an agent’s response, make that response a compact machine-readable signal. All human-readable content belongs in the handoff files. Verbose tool results accumulate in the orchestrator’s context and dilute its routing instructions over a long run.

The pipeline took one design session and one implementation pass to build. The durable artifact isn’t the configuration it produced today — it’s the protocol that will regenerate a correct configuration every time I run it, regardless of how the codebase changes underneath it. That’s the point of building a pipeline instead of writing a better prompt.

