AI – Pazsit's Dev Blog

Token Efficiency for LLM-Assisted Development

How seven load-bearing principles, applied across chat sessions and agentic pipelines, keep LLM dev costs manageable without degrading what the tools produce.

Occasionally I ran into a problem most people working with LLMs eventually hit: long sessions forgot their own constraints. Multi-file investigations dumped thousands of tokens into the main context and never gave them back. Pipelines paid full price for content that should have been cached. None of it was the model’s fault; all of it could be fixed just by changing how I worked with the tools. The patterns below come from experience and research on LLM-assisted development at scale. They are design choices about where to spend tokens and where not to.

One-line thesis: token efficiency is a design discipline about effectiveness, not about being cheap.

Building an Agentic Dev Pipeline — From Ad-Hoc Prompting to a Repeatable Protocol

How eleven design decisions, a structured interview technique, and two effectiveness axes turned a slash command into a self-managing dev loop.

I had a problem that most people using LLMs for development eventually hit: inconsistency. Sometimes I’d get well-structured code with tests. Sometimes I’d get a half-finished implementation with no tests and no explanation of what changed. Sometimes I’d ask the same question twice and get architecturally different answers. The issue wasn’t the model — it was me. Every session started from scratch. No shared protocol. No handoffs. Just vibes.

The pipeline I built this week replaces that. It’s not a framework or a library — it’s nine files that define a protocol: who does what, in what order, with what information, and when to ask me before proceeding. The output is repeatable. The quality is auditable. And the design decisions that shaped it contain, I think, some generally useful lessons about building agentic systems.

One-line thesis: a well-designed agentic pipeline is a protocol, not a prompt.

Tuning RAG Retrieval Quality with the Autoresearch Pattern

Applying Karpathy’s autoresearch loop to measure and systematically improve RAG retrieval — from gut-feel tuning to +68% MRR across 8 eval runs.

A retrieval pipeline has a lot of knobs. Dense vs. sparse. Hybrid on or off. Time-decay reranking on or off. Per-collection fusion weights for every source feeding the index. I had built one with all of them, and I had gut feelings about every setting. Hybrid probably helps. Time decay probably helps for recency-sensitive queries. The fusion weights were whatever felt reasonable when I first wrote them.

What I didn’t have was a way to tell whether a change made things better or just different. Tweak a weight, run a few questions, eyeball the results — that’s not measurement, that’s superstition.

This article is the story of how I replaced the superstition with a small, fast evaluation harness, and how an automated loop borrowed from a recent Karpathy project ended up finding a +68% MRR improvement over the dense-only baseline I had been quietly running for months.
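For readers unfamiliar with the metric, here is a minimal sketch of how mean reciprocal rank (MRR) and a relative improvement over a baseline are typically computed. It is not the harness from the article: the function names, the data shapes, and the toy document IDs are all assumptions made for illustration.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over all (ranked results, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


def relative_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline (0.5 means +50%)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


# Toy data (hypothetical): two queries, each with one relevant document.
toy_runs = [
    (["doc_a", "doc_b", "doc_c"], {"doc_a"}),  # relevant hit at rank 1
    (["doc_a", "doc_b", "doc_c"], {"doc_b"}),  # relevant hit at rank 2
]
toy_mrr = mean_reciprocal_rank(toy_runs)
```

Under this definition, a figure like "+68% MRR" reads as a relative change: the tuned configuration's MRR divided by the baseline's, minus one.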