How seven load-bearing principles, applied across chat sessions and agentic pipelines, keep LLM dev costs manageable without degrading what the tools produce.
Occasionally I ran into a problem most people working with LLMs eventually hit: long sessions forgot their own constraints. Multi-file investigations dumped thousands of tokens into the main context and never gave them back. Pipelines paid full price for content that should have been cached. None of it was the model’s fault; all of it could be fixed just by changing how I worked with the tools. The patterns below come from experience and research on scaling LLM-assisted development. They are design choices about where to spend tokens and where not to.
One-line thesis: token efficiency is a design discipline; the goal is effectiveness, not cheapness.
The Two Axes
- Cost: tokens billed. Visible on the invoice.
- Attention: whether the model follows your intent. Invisible until you re-run.
Cutting cost by degrading attention is the tradeoff you didn’t mean to make. Aggressive length limits look cheaper per call and cost twice as much in re-runs. A truncated context window saves tokens and produces a confident answer to a different question. Every technique below is judged on both axes, cheaper and sharper, or it doesn’t earn its place.
Seven patterns are load-bearing enough to anchor as numbered lessons (T1–T7). Nine more are worth doing and get a paragraph each. Some are trivial; some need background knowledge.
Foundations: applies to chat and pipelines
Match model to task, not session
The model tier is a decision per task, not a global setting for the session.
| Task | Right tier |
|---|---|
| Architecture, complex reasoning, planning | Opus |
| Implementation, review, writing | Sonnet |
| Routing, classification, extraction, exploration | Haiku |
Paying Opus pricing for a routing decision costs 5–10× what it should. Paying Haiku pricing for architectural planning produces output you’ll re-run, which costs more in the end. The cost of switching mid-session is one cache miss. The cost of not switching is paying the wrong tier on every turn until the session ends.
Lesson T1 – Pick model per task, not per session. Architecture goes to the deep model. Implementation goes to the mid-tier. Routing and exploration go to the fast model. Switching mid-flow is a feature, not a compromise.
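The tier table above can be encoded as a lookup, so the tier is chosen per task instead of once per session. This is a minimal sketch: the task labels, the `pick_tier` helper, and the fallback to the mid tier are assumptions made for illustration, not part of any SDK.

```python
# Hypothetical task labels mapped to the tiers from the table above.
TIER_BY_TASK = {
    "architecture": "opus",
    "complex_reasoning": "opus",
    "planning": "opus",
    "implementation": "sonnet",
    "review": "sonnet",
    "writing": "sonnet",
    "routing": "haiku",
    "classification": "haiku",
    "extraction": "haiku",
    "exploration": "haiku",
}


def pick_tier(task_kind: str, default: str = "sonnet") -> str:
    """Return the tier for one task; unknown kinds fall back to the default (an assumption)."""
    return TIER_BY_TASK.get(task_kind, default)


if __name__ == "__main__":
    for kind in ("planning", "implementation", "routing"):
        print(kind, "->", pick_tier(kind))
```

Calling `pick_tier` at the start of each task, rather than fixing a model when the session opens, is what makes mid-flow switching the default behaviour.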
Batch questions before sending
Each turn adds to the context window. Three partial questions cost roughly three times the input tokens of one complete question, and each answer is less coherent because it lacks the full picture.
Before: “What does query.py do?” → “How does it filter?” → “How does it interact with embed.py?”
After: “Explain query.py – main entry points, how filters are applied, how it interacts with embed.py.”
One question, one coherent answer, planned with full context.
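A rough way to see the accumulation, assuming the chat API is stateless and the full history is sent again as input on every turn. The token counts are made-up illustrative numbers, not measurements, and `cumulative_input_tokens` is a hypothetical helper.

```python
def cumulative_input_tokens(base_context: int, turns: list[tuple[int, int]]) -> int:
    """Total input tokens across turns when the full history is re-sent each turn.

    turns is a list of (question_tokens, answer_tokens) pairs.
    """
    total = 0
    history = base_context
    for question, answer in turns:
        history += question  # the new question joins the context
        total += history     # the whole context is sent as input
        history += answer    # the answer is carried into later turns
    return total


# Illustrative numbers only: 2,000 tokens of shared context.
split = cumulative_input_tokens(2000, [(10, 300), (10, 300), (15, 300)])
batched = cumulative_input_tokens(2000, [(30, 600)])

if __name__ == "__main__":
    print("three partial questions:", split)
    print("one batched question:", batched)
```

With these numbers the split version sends a little over three times the input tokens of the batched one, because each earlier answer is paid for again on every later turn.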
Tell the model how long to be
Without a length constraint, the model defaults to comprehensive. Comprehensive is expensive and rarely what you wanted.
Add to prompt: “Answer in 2–3 sentences.” / “Bullet points only.” / “One paragraph max.” Costs ~5 tokens, cuts response length 60–70% on informational queries. Don’t apply to outputs you’ll use verbatim (truncated code costs more in re-runs than the prose you trimmed saved).
Signal uncertainty early, not after a re-run
Add to prompt: “If you’re not confident in any part of this, say so before giving the answer.”
A model that flags uncertainty lets you address it once. A model that produces a confident-sounding wrong answer costs you the original run + the correction + a re-run to verify the fix. The certainty signal converts hidden re-run cost into a visible question.
Layer structural indexes before reading source
Agents default to grep + read-walk to find symbols and trace structure. Pre-computed structural indexes turn that exploration into lookups. Cheapest to most expensive:
- Symbol index: flat name → {file, line, kind} map. O(1) for “where is X defined?”
- Package index: exports and reverse dependencies per file. O(1) for “who calls X?”, i.e. impact analysis without grep.
- Per-file structural doc: purpose, exports, dependencies, tags. Read before opening the source.
- The source file: last resort.
Generic shape of a symbol index entry:
{"<symbol>": [{"file": "<path>", "line": 42, "kind": "def"}]}
The chain is built by an AST-walking script in seconds and the script keeps a build cache keyed by file hash so regeneration is near-free. If regeneration is expensive, agents skip it; the indexes go stale; trust collapses. A cheap rebuild is what makes the discipline survive contact with reality.
“Where is X defined?” goes from 3–5 tool calls to 1 lookup. “Who calls X?” goes from a fan-out grep to 1 lookup. This is different from T4’s throwaway investigation: that pattern is per-task and ephemeral; this one is persistent across every session and every tool that reads your repo.
Lesson T2 – Pre-computed structural indexes turn exploration into lookups. A flat symbol index, a package-level reverse-deps map, and per-file structural docs collapse the most expensive LLM exploration patterns into one-call lookups. Cache the index build so regeneration stays cheap; if it doesn’t, agents will skip it and the indexes will rot.
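A minimal sketch of the symbol-index build described above, for a Python repository, using the standard-library `ast` module. The function names, the cache file layout, and the choice of SHA-256 are assumptions; the original script is not shown in this article. Files that fail to parse would raise `SyntaxError` here, which a real script would probably want to catch.

```python
import ast
import hashlib
import json
from pathlib import Path


def index_source(source: str, rel_path: str) -> dict[str, list[dict]]:
    """Map every function and class name in one file to its location."""
    kinds = {ast.FunctionDef: "def", ast.AsyncFunctionDef: "def", ast.ClassDef: "class"}
    symbols: dict[str, list[dict]] = {}
    for node in ast.walk(ast.parse(source)):
        kind = kinds.get(type(node))
        if kind:
            entry = {"file": rel_path, "line": node.lineno, "kind": kind}
            symbols.setdefault(node.name, []).append(entry)
    return symbols


def build_index(root: Path, cache_path: Path) -> dict[str, list[dict]]:
    """Build the repo-wide symbol index, re-parsing only files whose hash changed."""
    cache = json.loads(cache_path.read_text(encoding="utf-8")) if cache_path.exists() else {}
    fresh_cache: dict[str, dict] = {}
    merged: dict[str, list[dict]] = {}
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root).as_posix()
        source = path.read_text(encoding="utf-8")
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        cached = cache.get(rel)
        if cached and cached["hash"] == digest:
            symbols = cached["symbols"]          # unchanged file: reuse the cached entry
        else:
            symbols = index_source(source, rel)  # new or edited file: re-parse
        fresh_cache[rel] = {"hash": digest, "symbols": symbols}
        for name, entries in symbols.items():
            merged.setdefault(name, []).extend(entries)
    cache_path.write_text(json.dumps(fresh_cache), encoding="utf-8")
    return merged
```

The returned mapping has the same shape as the generic entry shown earlier, so “where is X defined?” becomes a single dictionary lookup. Keying the cache by file hash rather than by modification time keeps rebuilds cheap even after a fresh checkout.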
Chat sessions
Compact at every task boundary
Over a long session, turn history accumulates and dilutes your initial instructions. The model starts ignoring constraints it was following an hour ago, reverts to defaults, or answers a slightly different question. Compact at 60–70% context fill, or at every task boundary. Starting a new conversation for a new task is almost always the right call. Compacting costs a cache miss; not compacting costs re-runs.
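The compaction rule can be written down as a small check. This is a sketch under assumptions: `should_compact` is a hypothetical helper, the 0.6 default is the lower end of the 60–70% range above, and how you obtain the token counts depends on your tool.

```python
def should_compact(used_tokens: int, window_tokens: int,
                   at_task_boundary: bool = False, threshold: float = 0.6) -> bool:
    """Return True when the session should be compacted or restarted."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    # Either trigger is enough: a finished task, or a context that is filling up.
    return at_task_boundary or used_tokens / window_tokens >= threshold


if __name__ == "__main__":
    print(should_compact(130_000, 200_000))                         # fill-based trigger
    print(should_compact(20_000, 200_000, at_task_boundary=True))   # boundary trigger
```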
Use structured output instead of prose for information retrieval
Before: “Explain the difference between the two approaches.” → 4 paragraphs, ~600 tokens.
After: “Compare the two approaches. Format: markdown table. Approach | Pros | Cons | When to use.” → table, ~150 tokens.
Prose has padding, hedges, restatements, transitions, summarising prefixes (“In summary, the key difference is…”). A table doesn’t. The model can’t pad a cell that doesn’t exist. The same constraint also sharpens the answer: every column is a question the model has to answer explicitly, and missing data becomes visible instead of glossed over in a paragraph.
Use for comparisons, option analysis, file inventories, decision logs, schema dumps. Don’t use for reasoning flow, code generation, or nuanced tradeoffs that don’t compress into cells. Specify the structure in the prompt (column names, JSON schema, exact field names); leaving it open invites the model to re-invent the shape every call.
Lesson T3 – Structured output beats prose on both axes. Tables and JSON have no padding: they are 3–4× cheaper per query, and every cell forces an explicit answer instead of a paragraph that might be glossing. Specify the structure (schema, column names, field names) or the model invents its own and you’ve lost the win.
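One way to pin the structure and then check that the reply respected it. This is a minimal sketch with hypothetical helper names, and the parser assumes a simple markdown table with no escaped pipes inside cells.

```python
COLUMNS = ["Approach", "Pros", "Cons", "When to use"]


def comparison_prompt(question: str, columns: list[str]) -> str:
    """Fix the output shape in the prompt so the model does not invent its own."""
    header = " | ".join(columns)
    return f"{question} Format: markdown table. Columns: {header}. No text outside the table."


def parse_table(reply: str, columns: list[str]) -> list[dict[str, str]]:
    """Parse a markdown table reply into rows; raise if the header does not match."""
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip().startswith("|")]
    cells = [[c.strip() for c in ln.strip("|").split("|")] for ln in lines]
    if not cells or cells[0] != columns:
        raise ValueError("reply does not use the requested columns")
    # cells[1] is the |---| separator row; data rows start at index 2.
    return [dict(zip(columns, row)) for row in cells[2:]]


def missing_cells(rows: list[dict[str, str]]) -> list[tuple[int, str]]:
    """List (row index, column) pairs the model left empty."""
    return [(i, col) for i, row in enumerate(rows) for col, val in row.items() if not val]
```

Empty cells surface as explicit `(row, column)` pairs, which is the "missing data becomes visible" effect in code form.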
Throwaway context for investigation
When you need to read many files to answer one question, doing it inline dumps all that reading into your main context for the rest of the session.
How:
- Open a fresh conversation.
- Do the investigation, read files, trace symbols, explore.
- End with: “Write a dense summary of your findings to research/summary.md. Include: answer, relevant file paths with line numbers, constraints discovered, what you ruled out. No prose.”
- Return to the main conversation. Read only that file.
One screen of conclusions beats ten screens of working memory: a 60–80% main-context reduction on multi-file investigations. The pipeline analog is the researcher subagent: same idea, automated.
Lesson T4 – Subagent context is throwaway; only the summary is cargo. Working memory belongs in the investigation, not in the calling agent’s window. The conclusion is what gets carried forward: file paths, constraints, dead ends. Everything else stays behind.
Keep the system prompt stable
The cache hits on the longest stable prefix. Your system prompt is always first.
Do: put durable instructions there: coding style, output format preferences, what not to do.
Don’t: put session-specific context there. That changes every session and busts the cache.
The project’s AI guidance file – CLAUDE.md, AGENTS.md, or equivalent – is the second cache target. Everything the model reads about your project’s architecture, conventions, and constraints loads from there on every session.
A well-maintained project guide is itself a token-efficiency artifact – it prevents the agent from rediscovering documented things. The agent that knows where the architecture map lives doesn’t grep for it.
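The ordering rule (stable content first, volatile content last) can be sketched as follows. The segment labels and both helpers are hypothetical. This does not call any provider API; how prompt caching is enabled and billed differs per provider, so treat the hash only as a local check that the prefix has not changed.

```python
import hashlib


def assemble_segments(system_prompt: str, project_guide: str,
                      session_context: str, question: str) -> list[tuple[str, str]]:
    """Order prompt segments from most stable to least stable."""
    return [
        ("system", system_prompt),     # durable instructions, always first
        ("guide", project_guide),      # canonical project guide, second
        ("session", session_context),  # changes every session, so it goes after the prefix
        ("question", question),
    ]


def stable_prefix_hash(segments: list[tuple[str, str]],
                       stable_labels: tuple[str, ...] = ("system", "guide")) -> str:
    """Hash the leading stable segments; a changed hash means the cached prefix is gone."""
    hasher = hashlib.sha256()
    for label, text in segments:
        if label not in stable_labels:
            break  # the cacheable prefix ends at the first volatile segment
        hasher.update(label.encode("utf-8") + b"\x00" + text.encode("utf-8") + b"\x00")
    return hasher.hexdigest()
```

Two sessions with different session context produce the same prefix hash; putting a date or a task description into the system prompt changes it, which is the cache bust described above.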
When multiple AI tools read the repo (Claude, Codex, Cursor, Aider, Copilot), maintain one canonical guide. Each tool-specific config is a thin pointer plus tool-specific extras only. Two consequences:
- Drift kills trust. Duplicated guides drift within a sprint. The day one is stale, every agent reading it produces wrong-guidance work, and you pay for the wrong run + correction + re-verification.
- Cache stays warm. A stable canonical file is cached once across tools, not duplicated.
Lesson T5 – A canonical project guide is itself a token-efficiency artifact. It prevents agents from rediscovering documented things; the agent that knows the map doesn’t grep for it. Maintain one canonical guide; tool-specific configs are thin pointers. Duplicated guides drift, and drifted guides produce wrong-guidance re-runs.
Isolate judgment from authorship
For code review: start a new conversation with only the code and the review criteria, not the conversation where you built it. The prior work biases the model toward defending decisions it already made. A fresh context produces a more honest assessment. For extra robustness, use a different provider or model tier for review.
Agentic pipelines
The patterns above scale into multi-agent pipelines, where each agent’s context accumulates separately and a misplaced principle compounds across every stage. The patterns below assume you’re building something like the pipeline described in Building an Agentic Dev Pipeline. If you’re not, skim; they generalise.
Dispatch investigation to a subagent – discard its context
T4’s pipeline analog. A planning or implementation agent that needs ten files of context dispatches a researcher subagent: it reads everything, writes a dense summary, returns ~300 tokens instead of ~10,000. Same principle as T4, automated.
Keep agent responses minimal – payload only
Every subagent response lands in the orchestrator’s conversation as a tool result: the agent’s full response text. Over a pipeline run, planner + coder + reviewer + test-writer output accumulates, diluting the routing instructions at the top of the orchestrator’s context. This is what makes orchestrators drift off protocol mid-run.
The fix is two-sided: agents output only a compact handoff-payload JSON in their response (narrative goes to a handoff file); the orchestrator treats responses as opaque completion signals and reads the handoff file directly via known paths. The conversation context stays flat; handoff files grow on disk.
## Response to orchestrator
Output ONLY the handoff-payload JSON below – nothing before it. All narrative goes to the handoff file, not here.
Lesson T6 – Minimise tool results, not just tool calls. If the orchestrator never needs to read an agent’s response, make that response a compact machine-readable signal. All human-readable content belongs in handoff files, where the right consumer reads it at the right time.
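The agent side of that split might look like this. It is a sketch only: the payload fields (`stage`, `status`, `handoff`) and the `finish_stage` helper are hypothetical, since the article does not show the actual handoff-payload schema.

```python
import json
from pathlib import Path


def finish_stage(stage: str, narrative: str, status: str, handoff_dir: Path) -> str:
    """Write the full narrative to disk and return only a compact payload."""
    handoff_dir.mkdir(parents=True, exist_ok=True)
    handoff_file = handoff_dir / f"{stage}.md"
    handoff_file.write_text(narrative, encoding="utf-8")  # human-readable content goes here
    payload = {"stage": stage, "status": status, "handoff": handoff_file.as_posix()}
    # Compact separators keep the tool result as small as possible.
    return json.dumps(payload, separators=(",", ":"))
```

The string returned by `finish_stage` is all the orchestrator receives. Its size depends on the stage name and path, not on how much the agent wrote, which is what keeps the orchestrator’s context flat.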
Transfer context through files, not through the orchestrator
Each agent writes structured output to a file; the next agent reads it directly. The orchestrator never sees the work product, only handoff payloads (T6) and known file paths. A reviewer that writes which files are clean and which need deeper inspection lets the test-writer skip re-reading the clean ones: the work is done once, and the structured record is reused by every subsequent agent at near-zero additional cost. The same files enable resumption: an aborted run re-enters from the last completed stage without re-running earlier work, because every stage’s output is already on disk in a form the next stage knows how to parse.
Lesson T7 – Transfer context through files, not through the orchestrator. Agents write structured output; subsequent agents read it directly. Work done once is reused by every downstream agent at near-zero cost, and any run is resumable from the last completed stage on disk.
Few-shot examples: use only when zero-shot structure is wrong
Few-shot examples cost 200–500 tokens each. For most coding tasks, zero-shot output is structurally correct; it just needs a constraint you can state in 10 tokens (“return early on error”, “no inline comments”).
Use few-shot for: custom output formats the model doesn’t produce naturally; domain-specific patterns not well represented in training data.
Don’t use few-shot for: style preferences, verbosity tuning, common code patterns. State the constraint directly: it is cheaper and more flexible, and the model follows a single sentence more reliably than it generalises from two examples.
Hard gates for humans, certainty gates for agents
Two separate questions, often conflated: does the agent have unresolved questions? (certainty gate – auto-route on sure) and have I seen the output and agreed with it? (approval gate – always human). They’re independent. An agent reporting sure has no open questions; it does not mean the output is correct or that you’ve reviewed it. Auto-routing past human review finds the error downstream, where fixing it is always more expensive. Decide which steps are consequential and gate those regardless of certainty grade.
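The independence of the two gates is easier to see as a routing function. This is a sketch under assumptions: the certainty grade "sure" is taken from the text above, while the action names and the `next_action` helper are hypothetical.

```python
def next_action(certainty: str, consequential: bool, human_approved: bool) -> str:
    """Combine the two independent gates into one routing decision."""
    if certainty != "sure":
        return "ask_human"       # certainty gate: open questions go back to a person
    if consequential and not human_approved:
        return "await_approval"  # approval gate: applies even when the agent is sure
    return "auto_route"


if __name__ == "__main__":
    print(next_action("sure", consequential=True, human_approved=False))
    print(next_action("sure", consequential=False, human_approved=False))
```

Note that a "sure" grade never skips the approval gate on a consequential step; it only removes the agent’s own open questions from the picture.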
Quick reference
| Technique | Axis | Context | Gain |
|---|---|---|---|
| Match model to task (T1) | Cost | Both | 5–10× on simple tasks |
| Batch questions | Cost | Both | Eliminates turn accumulation |
| Explicit length limits | Cost | Both | 60–70% response reduction |
| Signal uncertainty early | Attention | Both | Avoids re-run cost |
| Compact at task boundary | Attention | Chat | Prevents drift |
| Structured output over prose (T3) | Cost + Attention | Both | 3–4× shorter; every cell forces an explicit answer |
| Throwaway investigation context (T4) | Cost + Attention | Both | 60–80% main-context reduction |
| Pre-built structural indexes (T2) | Cost + Attention | Both | “Where is X?” 3–5 calls → 1 lookup |
| Canonical project guide (T5) | Cost + Attention | Both | Prevents re-exploration; no drift across tools |
| Stable system prompt | Cost | Chat | Cache hit across sessions |
| Minimal agent responses (T6) | Cost + Attention | Pipelines | Flat orchestrator context |
| File-based handoff (T7) | Cost + Attention | Pipelines | Work done once, reused downstream at near-zero cost; resumable |
| Few-shot only for structure | Cost | Both | Avoids 200–500 token/example overhead |
| Human gates regardless of certainty | Attention | Pipelines | Catches confident-but-wrong output |
| Isolate judgment from authorship | Attention | Chat | Avoids self-defending review |
Key Takeaways
- T1 – Pick model per task. Opus for architecture, Sonnet for implementation, Haiku for routing and exploration.
- T2 – Structural indexes turn exploration into lookups. Symbol map, package reverse-deps, per-file docs. Cache the build so regeneration stays cheap.
- T3 – Structured output beats prose on both axes. Tables and JSON have no padding; every cell forces an explicit answer. Specify the structure or the model invents it.
- T4 – Subagent context is throwaway; only the summary is cargo. Investigation belongs in a fresh chat or dedicated subagent.
- T5 – A canonical project guide is itself a token-efficiency artifact. One canonical guide; tool-specific configs are thin pointers.
- T6 – Minimise tool results, not just tool calls. Compact machine-readable signals to the orchestrator; human-readable content in handoff files.
- T7 – Transfer context through files, not through the orchestrator. Work done once is reused by every downstream agent at near-zero cost; any run is resumable from the last completed stage on disk.
Token efficiency is not a knob you turn at the end. It’s a property of how you’ve decided to work: where you store context, how you compose prompts, which work belongs in which agent, where the human stays in the loop. Get those right and the bill stops being something you flinch at. Get them wrong and no model upgrade will save you.
Pipeline patterns are explored in depth in Building an Agentic Dev Pipeline — From Ad-Hoc Prompting to a Repeatable Protocol.