Applying Karpathy’s autoresearch loop to measure and systematically improve RAG retrieval — from gut-feel tuning to +68% MRR across 8 eval runs.
A retrieval pipeline has a lot of knobs. Dense vs. sparse. Hybrid on or off. Time-decay reranking on or off. Per-collection fusion weights for every source feeding the index. I had built one with all of them, and I had gut feelings about every setting. Hybrid probably helps. Time decay probably helps for recency-sensitive queries. The fusion weights were whatever felt reasonable when I first wrote them.
What I didn’t have was a way to tell whether a change made things better or just different. Tweak a weight, run a few questions, eyeball the results — that’s not measurement, that’s superstition.
This article is the story of how I replaced the superstition with a small, fast evaluation harness, and how an automated loop borrowed from a recent Karpathy project ended up finding a +68% MRR improvement over the dense-only baseline I had been quietly running for months.
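For readers who have not met the metric: MRR (mean reciprocal rank) scores each query by the reciprocal of the rank at which the first relevant document appears, then averages over all queries. Below is a minimal sketch of that standard definition. The function names and the toy document IDs are assumptions for illustration, not the harness described in this article.

```python
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over (ranked results, relevant set) pairs."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical eval set: two queries, each with its ranked results and gold IDs.
eval_runs = [
    (["doc_a", "doc_b", "doc_c"], {"doc_a"}),  # first hit at rank 1 -> 1.0
    (["doc_x", "doc_y", "doc_z"], {"doc_y"}),  # first hit at rank 2 -> 0.5
]
print(mean_reciprocal_rank(eval_runs))  # 0.75
```

Because the score is a single number per configuration, two retrieval settings can be compared on the same question set instead of being judged by eye.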