Tuning RAG Retrieval Quality with the Autoresearch Pattern

Applying Karpathy’s autoresearch loop to measure and systematically improve RAG retrieval — from gut-feel tuning to +68% MRR across 8 eval runs.

A retrieval pipeline has a lot of knobs. Dense vs. sparse. Hybrid on or off. Time-decay reranking on or off. Per-collection fusion weights for every source feeding the index. I had built one with all of them, and I had gut feelings about every setting. Hybrid probably helps. Time decay probably helps for recency-sensitive queries. The fusion weights were whatever felt reasonable when I first wrote them.

What I didn’t have was a way to tell whether a change made things better or just different. Tweak a weight, run a few questions, eyeball the results — that’s not measurement, that’s superstition.

This article is the story of how I replaced the superstition with a small, fast evaluation harness, and how an automated loop borrowed from a recent Karpathy project ended up finding a +68% MRR improvement over the dense-only baseline I had been quietly running for months.

In this article, we’ll walk through:

A short detour on RAG and why multi-level retrieval needs tuning at all.
The autoresearch inspiration and how it maps to RAG.
What I built: golden set generation, metrics, and the two-phase eval loop.
The evolution from all-zero metrics to a confirmed optimum across 8 runs.
The decisions and bugs that shaped the final result.

A Quick Detour: What is RAG, and Why “Multi-Level”?

Retrieval-Augmented Generation is the workflow where, before asking an LLM to answer a question, you first fetch a handful of relevant documents from a vector database and stuff them into the prompt. The LLM doesn’t “know” your data — it only sees the snippets you retrieved. So the quality of the final answer depends almost entirely on whether the retrieval step found the right snippets.

That last part is where it gets interesting. A naive RAG system has one collection, one search call, and one set of parameters. Real-world corpora aren’t that uniform. Long-form notes look nothing like chat transcripts. Dense technical write-ups don’t behave like short concept stubs. If you pour all of them into one bucket and run a single similarity search, the larger and denser-vocabulary documents tend to dominate, and the shorter but more useful entries get drowned.

Multi-level (or multi-collection) RAG splits the corpus by type, runs a search per collection, and fuses the results — usually with a weighted score combination. Each document type gets a fair share of the top-K budget, and you can tune the weights to match how useful each source actually is for the queries you care about.

But “tune the weights” is the trap. Without measurement, the weights are vibes. That’s what this article is really about: making the tuning testable.

The Setup

The pipeline I’m tuning has three live collections plus one I’ll talk about later:

Collection	Content
`resource_1`	Long-form markdown notes
`conversations`	Chat conversation summaries
`resource_2`	Shorter concept / topic pages
`resource_3`	Structured data (removed mid-experiment — was never populated)

The retrieval call runs parallel Qdrant searches (one per collection), normalises and fuses the scores with per-collection weights, and returns the top 8 documents. Hybrid (dense + BM25) and time-decay reranking are independent feature flags. That’s the surface area I wanted to explore.
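
To make the fusion step concrete, here is a minimal sketch of "normalise, weight, merge, take the top 8". It assumes min-max normalisation per collection and document IDs that are unique across collections; the article does not say which normalisation the real pipeline uses, and `min_max` and `fuse` are hypothetical names.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, raw similarity score)


def min_max(hits: List[Hit]) -> List[Hit]:
    """Rescale one collection's scores to [0, 1] so collections are comparable."""
    if not hits:
        return []
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every hit as equally strong.
        return [(doc_id, 1.0) for doc_id, _ in hits]
    return [(doc_id, (s - lo) / (hi - lo)) for doc_id, s in hits]


def fuse(per_collection: Dict[str, List[Hit]],
         weights: Dict[str, float],
         top_k: int = 8) -> List[Hit]:
    """Weight each collection's normalised scores and return the global top-k."""
    fused: List[Hit] = []
    for name, hits in per_collection.items():
        w = weights.get(name, 0.0)  # unknown collections contribute nothing
        fused.extend((doc_id, w * s) for doc_id, s in min_max(hits))
    # Highest fused score first; ties keep their insertion order.
    return sorted(fused, key=lambda h: h[1], reverse=True)[:top_k]


# Illustrative scores only, not taken from the real index.
results = fuse(
    {
        "resource_1": [("r1-a", 0.82), ("r1-b", 0.61)],
        "conversations": [("c-a", 0.40), ("c-b", 0.35)],
    },
    weights={"resource_1": 0.5, "conversations": 0.5},
)
```

Without the per-collection normalisation, the collection with the larger raw scores would win every slot regardless of its weight, which is the "drowning" effect described earlier.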

The goal: an integration-test-style harness that measures retrieval quality intrinsically — no LLM in the loop per run, fast enough to execute after every config change — and an automated loop that searches for better configs on top of it.

The Autoresearch Inspiration

Andrej Karpathy released AutoResearch earlier this year. The core idea is simple: an AI agent runs ML experiments in a loop, keeping only the changes that beat the current best result. Three files, one contract:

prepare.py — immutable evaluator that neither human nor agent can touch
train.py — the agent’s sandbox, can be rewritten freely
program.md — the human’s direction file, defines what “better” means

The ratchet loop: propose a change → train → measure → keep if better, revert if not → repeat.

What struck me was how cleanly this pattern maps to RAG tuning:

AutoResearch	RAG eval equivalent
`prepare.py` (immutable evaluator)	Golden set + metric function (Precision@K, MRR)
`train.py` (agent-modifiable)	Retrieval config (hybrid, decay, weights)
`program.md` (research direction)	Phase 1 grid + Phase 2 LLM proposals
`val_bpb` (single metric)	MRR (Mean Reciprocal Rank)

I didn’t want to pull in AutoResearch as a dependency — it’s built for ML training loops with GPU budgets. I just wanted the pattern: propose config → eval → measure → iterate.

What I Built

The harness has two scripts.

Step 1 — Golden Set Generation (one-time)

golden_set_gen.py samples 10 documents per collection, generates one retrieval query per document via Claude, and writes golden_set.json. The golden set is then frozen as a stable fixture — regenerating it between runs resets the baseline and makes all subsequent comparisons meaningless.

Sampling matters more than it looks. The naive approach is to walk the disk and sample markdown files. The correct approach is to scroll Qdrant directly: that guarantees the relevant_id in each golden entry actually exists in the collection the eval will later search. ID mismatches produce all-zero metrics, silently, and they are an absolute pain to diagnose. (Ask me how I know.)
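
A sketch of what "scroll Qdrant directly" can look like. It relies on the `scroll` method of qdrant-client, which returns a page of records plus the offset of the next page; the function name `sample_points`, the seed, the page size, and the output field names other than `relevant_id` are assumptions for illustration.

```python
import random
from typing import Any, Dict, List


def sample_points(client: Any, collection: str, n: int = 10,
                  seed: int = 0, page_size: int = 256) -> List[Dict[str, Any]]:
    """Scroll every point in `collection`, then sample up to n of them."""
    points = []
    offset = None
    while True:
        # scroll() returns (records, next_page_offset).
        batch, offset = client.scroll(
            collection_name=collection,
            limit=page_size,
            offset=offset,
            with_payload=True,
            with_vectors=False,
        )
        points.extend(batch)
        if offset is None:  # no more pages
            break
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    chosen = rng.sample(points, min(n, len(points)))
    # The ID stored here is the one the eval will later look for.
    return [
        {"collection": collection, "relevant_id": str(p.id), "payload": p.payload}
        for p in chosen
    ]
```

With a real client this would be called as something like `sample_points(QdrantClient(url="http://localhost:6333"), "conversations")`, where the URL is a placeholder. Loading every point before sampling is fine for a small corpus; a large one would want reservoir sampling instead.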

Query generation is one call per document:

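import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# `content` is the text of one sampled document; only its first 600 characters are sent.
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=80,
    messages=[{
        "role": "user",
        "content": (
            "Write ONE natural language question a user would ask to retrieve "
            "this document. Return ONLY the question, no explanation.\n\n"
            f"Document:\n{content[:600]}"
        ),
    }],
)
query = resp.content[0].text.strip()  # the generated question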

After this, no LLM is involved in scoring a config. Generating the golden set is the only AI cost of an eval run; the scoring itself is pure Qdrant. (Phase 2 does call Claude to propose weights, but never to judge results.)

Step 2 — The Two-Phase Eval Loop

Phase 1 — Pillar grid (~2 minutes):

baseline      hybrid=off  decay=off
+hybrid       hybrid=on   decay=off
+decay        hybrid=off  decay=on
+hybrid+decay hybrid=on   decay=on

All four configs share the same fusion weights. The winner by MRR advances to Phase 2.
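
The grid is small enough to enumerate directly. The sketch below is an illustration, not the harness itself: `pillar_grid` and `run_phase1` are hypothetical names, and the evaluator is a stand-in that replays the Run 4 MRR values reported later in this article instead of querying Qdrant.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Config = Dict[str, bool]


def label_for(hybrid: bool, decay: bool) -> str:
    """Readable name for one flag combination."""
    if not hybrid and not decay:
        return "baseline"
    return ("+hybrid" if hybrid else "") + ("+decay" if decay else "")


def pillar_grid() -> List[Tuple[str, Config]]:
    """Enumerate the four hybrid x decay combinations."""
    return [
        (label_for(hybrid, decay), {"hybrid": hybrid, "decay": decay})
        for hybrid, decay in product([False, True], repeat=2)
    ]


def run_phase1(evaluate: Callable[[Config], float]) -> Tuple[str, Config, float]:
    """Score every grid config and return the one with the highest MRR."""
    scored = [(label, cfg, evaluate(cfg)) for label, cfg in pillar_grid()]
    return max(scored, key=lambda row: row[2])


# Stand-in evaluator keyed by (hybrid, decay), using the Run 4 MRR column.
RUN4_MRR = {
    (False, False): 0.311,
    (True, False): 0.444,
    (False, True): 0.311,
    (True, True): 0.383,
}
winner = run_phase1(lambda cfg: RUN4_MRR[(cfg["hybrid"], cfg["decay"])])
```

Here `winner[0]` is `"+hybrid"`, the config that advanced to Phase 2 in Run 4.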

Phase 2 — LLM weight tuning (up to 10 iterations):

Starting from the best pillar config, Claude proposes new fusion weights each iteration. To get reliable structured output I used a forced tool call rather than free-text JSON parsing:

_WEIGHTS_TOOL = {
    "name": "propose_weights",
    "input_schema": {
        "type": "object",
        "properties": {
            "resource_1":    {"type": "number"},
            "conversations": {"type": "number"},
            "resource_2":    {"type": "number"},
            "rationale":     {"type": "string"},
        },
        "required": ["resource_1", "conversations", "resource_2", "rationale"],
    },
}

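Forcing tool_use means the proposal arrives as an already-parsed object in the shape the schema describes, so there is no free-text JSON to parse. The one constraint the schema can’t express is that the weights sum to 1.0, so I check that in code.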
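
A minimal sketch of that in-code check. The `tool_choice` shape is the one Anthropic's Messages API accepts for forcing a named tool; the validator itself, its name, its tolerance, and the non-negativity rule are assumptions rather than the harness's actual code.

```python
import math
from typing import Dict, Mapping

COLLECTIONS = ("resource_1", "conversations", "resource_2")

# Passed as `tool_choice=` alongside `tools=[_WEIGHTS_TOOL]` to force the call.
FORCED_TOOL_CHOICE = {"type": "tool", "name": "propose_weights"}


def validate_weights(proposal: Mapping[str, object],
                     tol: float = 1e-6) -> Dict[str, float]:
    """Return the weights if they are non-negative and sum to 1.0, else raise."""
    weights: Dict[str, float] = {}
    for name in COLLECTIONS:
        value = proposal.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} is missing or not a number: {value!r}")
        if value < 0:
            raise ValueError(f"{name} is negative: {value}")
        weights[name] = float(value)
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"weights sum to {total:.4f}, expected 1.0")
    return weights  # the free-text "rationale" field is deliberately dropped
```

Raising rather than silently renormalising keeps a bad proposal visible, and a retry loop around the proposal call can catch the `ValueError` and ask again.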

Stopping criterion: |ΔMRR| < 0.01 against the best score so far (convergence), or 10 iterations.
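
The whole of Phase 2 fits in a short ratchet loop. This sketch assumes the delta is measured against the best MRR so far, which is the reading consistent with the Run 8 trace later in the article; `tune_weights` and its parameters are hypothetical names, and the LLM call is abstracted into a `propose` callback. The example replays the Run 8 proposals rather than calling a model.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, float]
History = List[Tuple[Weights, float]]


def tune_weights(start: Weights,
                 evaluate: Callable[[Weights], float],
                 propose: Callable[[Weights, History], Weights],
                 max_iters: int = 10,
                 epsilon: float = 0.01) -> Tuple[Weights, float, History]:
    """Ratchet loop: keep a proposal only if it beats the best MRR so far."""
    best, best_mrr = start, evaluate(start)
    history: History = [(start, best_mrr)]
    for _ in range(max_iters):
        candidate = propose(best, history)
        mrr = evaluate(candidate)
        history.append((candidate, mrr))
        delta = mrr - best_mrr
        if delta > 0:
            best, best_mrr = candidate, mrr  # ratchet forward
        if abs(delta) < epsilon:
            break  # within epsilon of the best: treat as converged
    return best, best_mrr, history


# Replay of the Run 8 table: (weights, measured MRR).
RUN8 = [
    ({"resource_1": 0.35, "conversations": 0.30, "resource_2": 0.35}, 0.522),
    ({"resource_1": 0.40, "conversations": 0.20, "resource_2": 0.40}, 0.368),
    ({"resource_1": 0.30, "conversations": 0.40, "resource_2": 0.30}, 0.489),
    ({"resource_1": 0.32, "conversations": 0.35, "resource_2": 0.33}, 0.514),
]
scores = {tuple(sorted(w.items())): m for w, m in RUN8}

best, best_mrr, history = tune_weights(
    start=RUN8[0][0],
    evaluate=lambda w: scores[tuple(sorted(w.items()))],
    propose=lambda current_best, hist: RUN8[len(hist)][0],
)
```

The replay stops after the third proposal with the starting weights still in place. Note that a start of MRR=0 followed by another 0 also satisfies the convergence test, which is how a broken golden set can end Phase 2 after one iteration.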

Metrics

With a single gold document per query, Recall@K collapses to Hit@K: did the right document appear in the top K? Strict Precision@K would be Hit@K divided by K, which adds no information, so the P@K columns in this article report the hit rate. MRR adds rank position to the picture: a hit at rank 1 scores 1.0, at rank 2 scores 0.5, at rank 8 scores 0.125.

P@K (as reported) = Recall@K = Hit@K = hits / n_queries
MRR = mean(1/rank_of_first_hit)   # 0 if missed

MRR is the primary target. It penalises configs that find the right document but bury it under noise.
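
Both metrics are a few lines of code. The following is a minimal sketch under the single-gold-document assumption; `first_hit_rank` and `score_run` are hypothetical names, and the document IDs are toy values.

```python
from typing import Dict, List, Sequence


def first_hit_rank(retrieved: Sequence[str], relevant_id: str, k: int = 8) -> int:
    """1-based rank of the gold document in the top-k, or 0 if it is missing."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id == relevant_id:
            return rank
    return 0


def score_run(ranks: List[int]) -> Dict[str, float]:
    """Aggregate per-query ranks into hit rate and MRR (misses count as 0)."""
    if not ranks:
        raise ValueError("cannot score an empty golden set")
    n = len(ranks)
    hits = sum(1 for r in ranks if r > 0)
    rr_sum = sum(1.0 / r for r in ranks if r > 0)
    return {"hit_rate": hits / n, "mrr": rr_sum / n}


ranks = [
    first_hit_rank(["a", "b", "c"], "a"),    # rank 1, reciprocal 1.0
    first_hit_rank(["a", "b", "c"], "b"),    # rank 2, reciprocal 0.5
    first_hit_rank(["a", "b", "c"], "z"),    # miss,   reciprocal 0.0
    first_hit_rank(list("abcdefgh"), "h"),   # rank 8, reciprocal 0.125
]
metrics = score_run(ranks)  # hit_rate 0.75, mrr 0.40625
```

Two configs with the same hit rate can differ sharply in MRR, which is why MRR is the better tuning target.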

The Evolution: From All-Zero to +68%

What I expected: a couple of runs to dial in weights, done. What actually happened: eight runs over several days, three of them spent diagnosing why every single metric was zero, two of them learning that small defaults can quietly sabotage convergence.

The runs in order, with the lesson each one taught.

Runs 1–2 — Silent all-zero failure

Every metric across every config: 0.000. The harness reported no errors. It just couldn’t find anything.

Four compounding bugs, three in the generator and one downstream of it:

Wrong path for one collection — sampling found zero documents and the harness fell through silently.
Frontmatter mismatch — generator looked for a type field that didn’t exist on those files. Zero documents of that type sampled.
Legacy collection split — documents sampled from disk but searched in a renamed collection. IDs that don’t exist always miss.
Cascading Phase 2 collapse — with MRR=0 across Phase 1, Phase 2 started at MRR=0, Δ=0, hit the convergence threshold after one iteration and exited. The LLM proposal path was never exercised.

The single change that fixed bugs 2 and 3 in one stroke: scroll Qdrant directly instead of inferring IDs from the filesystem. If the eval searches a collection, the golden set must come from that same collection.

Lesson: silent zero-metric failures are the worst kind of bug — no exception, no log line, just numbers that look like a model problem when they’re really a data plumbing problem.

Run 3 — First real numbers

Config	P@8	MRR
baseline	0.900	0.595
+hybrid	0.000	0.000
+decay	0.700	0.483
+hybrid+decay	0.000	0.000

Dense-only retrieval was strong out of the box. Time decay hurt — older notes aren’t less relevant, they’re just old, and a 365-day half-life penalised them harshly for no good reason.
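
To see how harsh that penalty is, here is a sketch that assumes time decay is a multiplicative exponential half-life, the usual meaning of "half-life". The article does not show the reranker's actual formula, and the scores and ages below are made up.

```python
def decay_factor(age_days: float, half_life_days: float = 365.0) -> float:
    """Multiplier in (0, 1]: 1.0 for a brand-new note, 0.5 at one half-life."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def decayed_score(score: float, age_days: float,
                  half_life_days: float = 365.0) -> float:
    """Similarity score after the age penalty is applied."""
    return score * decay_factor(age_days, half_life_days)


# A strong match on a three-year-old note vs. a weaker match on a fresh one.
old_note = decayed_score(0.80, age_days=3 * 365)  # 0.80 * 0.125
new_note = decayed_score(0.45, age_days=30)
```

Under this assumption a three-year-old note keeps only an eighth of its score, so the weaker but newer match outranks it. For evergreen content that is the wrong trade.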

Hybrid scored exactly zero, which is suspicious in the same way an all-green test suite is suspicious. Cause: the collections were indexed with unnamed dense vectors, and hybrid search calls Qdrant with the named vector "dense". Qdrant returns a 400 for that, which my code swallowed gracefully — empty results, no exception bubbling up.
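
One way to keep a swallowed error from masquerading as a bad config is to make the eval runner refuse to score a run whose searches failed. This is a sketch of that guard, not the harness's actual code: `run_queries`, `RetrievalFailure`, and the `search` callback are hypothetical names.

```python
from typing import Callable, List, Sequence


class RetrievalFailure(RuntimeError):
    """Raised when a zero score would be explained by failed or empty searches."""


def run_queries(search: Callable[[str], List[str]],
                queries: Sequence[str]) -> List[List[str]]:
    """Run every query; raise if any search errors or all results are empty."""
    results: List[List[str]] = []
    errors: List[str] = []
    for q in queries:
        try:
            results.append(search(q))
        except Exception as exc:  # e.g. an HTTP 400 from the vector store
            errors.append(f"{q!r}: {exc}")
    if errors:
        raise RetrievalFailure(
            f"{len(errors)}/{len(queries)} searches failed; first: {errors[0]}")
    if queries and all(not r for r in results):
        raise RetrievalFailure("every query returned an empty result list")
    return results
```

An exact 0.000 then surfaces as an exception with the underlying error message attached instead of as a row in the results table.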

Run 4 — Hybrid testable, and a baseline trap

After rebuilding the live collections with named vectors:

Config	P@8	MRR
baseline	0.333	0.311
+hybrid	0.667	0.444
+decay	0.333	0.311
+hybrid+decay	0.633	0.383

Hybrid wins. But the absolute scores dropped from Run 3, not because retrieval regressed, but because I had also regenerated the golden set with a different query-generation model that wrote harder questions. This is exactly the trap of regenerating: the comparison to Run 3 was now meaningless. After this, the golden set was locked.
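
A cheap way to enforce the lock is to store a fingerprint of the golden set with every run and refuse to compare runs whose fingerprints differ. The sketch below is one possible approach, not something the harness is stated to do; the function names, the `golden_sha256` field, and the placeholder entries are all assumptions.

```python
import hashlib
import json
from typing import Any, Dict, List


def golden_set_fingerprint(entries: List[Any]) -> str:
    """SHA-256 of the golden set, independent of key order and whitespace."""
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def comparable(run_a: Dict[str, Any], run_b: Dict[str, Any]) -> bool:
    """Two runs can be compared only if they scored the same golden set."""
    return run_a["golden_sha256"] == run_b["golden_sha256"]


# Placeholder entries standing in for two generations of golden_set.json.
golden_v1 = [{"collection": "resource_1", "relevant_id": "id-1",
              "query": "placeholder question A"}]
golden_v2 = [{"collection": "resource_1", "relevant_id": "id-1",
              "query": "placeholder question B"}]

run_a = {"mrr": 0.595, "golden_sha256": golden_set_fingerprint(golden_v1)}
run_b = {"mrr": 0.444, "golden_sha256": golden_set_fingerprint(golden_v2)}
can_compare = comparable(run_a, run_b)  # False: different golden sets
```

Had a check like this been in place, the Run 3 to Run 4 comparison would have been rejected outright instead of looking like a regression.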

Runs 5–6 — Phase 2 stabilised, weight optimum found

Two fixes: weight proposals moved from a smaller model to a stronger one, plus a 3-attempt retry loop and tool_use enforcement. Run 6 ran all 10 Phase 2 iterations without a single proposal failure.

Both runs converged to the same weight optimum, with conversations earning the top weight at 0.30. That made sense — conversation summaries are the most contextually specific documents in the corpus for the kinds of questions the golden set asks.

Run 7 — A self-inflicted regression

I removed resource_3 from the active collections (it had been producing a 404 every run). The right call, but it left the default weights summing to 0.90 instead of 1.0. Phase 2 starts from the defaults, while the LLM proposes weights summing to 1.0 — so the starting point lived in a different space than the proposals.

Result: convergence triggered after 3 iterations on a tiny delta, MRR=0.456. The exploration died before it began.

Lesson: if the default config doesn’t live in the same space as the proposed configs, convergence checks fire on the boundary effect, not on a real plateau.

Run 8 — Normalised defaults, confirmed optimum

Fix: bump the conversations weight default from 0.20 → 0.30 so the defaults sum to 1.0.
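
A guard like the one below would have caught the problem before Phase 2 started. It is a sketch under two assumptions: that the removed `resource_3` weight was 0.10 (inferred from the defaults summing to 0.90), and that proportional rescaling is acceptable. Proportional rescaling is a different choice from the manual bump described above and produces different starting weights.

```python
import math
from typing import Dict


def drop_collection(weights: Dict[str, float], name: str) -> Dict[str, float]:
    """Remove one collection and rescale the rest so they sum to 1.0 again."""
    remaining = {k: v for k, v in weights.items() if k != name}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("remaining weights must have a positive sum")
    return {k: v / total for k, v in remaining.items()}


def assert_normalised(weights: Dict[str, float], tol: float = 1e-9) -> None:
    """Fail fast if the starting point does not sum to 1.0 like the proposals."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"default weights sum to {total:.2f}, expected 1.00")


# Inferred pre-Run-7 defaults; the resource_3 value is an assumption.
old_defaults = {"resource_1": 0.35, "conversations": 0.20,
                "resource_2": 0.35, "resource_3": 0.10}
rescaled = drop_collection(old_defaults, "resource_3")
assert_normalised(rescaled)  # passes; the unscaled 0.35/0.20/0.35 would raise
```

Calling `assert_normalised` on the defaults at the top of Phase 2 turns the Run 7 failure into an immediate, readable error.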

Iteration	resource_1	conversations	resource_2	MRR
weight-0 (start)	0.35	0.30	0.35	0.522
weight-1	0.40	0.20	0.40	0.368
weight-2	0.30	0.40	0.30	0.489
weight-3	0.32	0.35	0.33	0.514 → converge

All three proposals scored below the start. The starting point was the optimum among everything tried, and convergence fired correctly at iteration 3, when weight-3 landed within 0.008 of it (|Δ|=0.008 < 0.01).

Results

Config	P@8	MRR	vs. baseline
Dense-only (baseline)	0.333	0.311	—
+hybrid	0.667	0.447	+44% MRR
+hybrid, tuned weights	0.667	0.522	+68% MRR

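MRR=0.522 is averaged over all 30 queries, misses included. Among the 20 queries that hit, the mean reciprocal rank is 0.522 × 30/20 ≈ 0.783, which corresponds to a harmonic-mean rank of about 1.28. In other words, most successful retrievals land the right document at position 1 or 2 in the fused top-8.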
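
The 1.28 figure can be recovered from the table with two lines of arithmetic. Strictly it is a harmonic-mean rank (the reciprocal of the mean reciprocal rank among hits), so the arithmetic mean rank would be at least as large. The variable names are illustrative.

```python
n_queries, hits, mrr = 30, 20, 0.522

# Misses contribute 0 to MRR, so rescale to the queries that actually hit.
mean_rr_among_hits = mrr * n_queries / hits    # 0.783
harmonic_mean_rank = 1.0 / mean_rr_among_hits  # about 1.28
```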

The P@8 Ceiling

P@8 = 0.667 (20/30 hits) didn’t budge between the untuned and tuned hybrid configs. That isn’t a tuning problem: no retrieval setting can surface what isn’t there. The 10 persistent misses fell into three categories:

Stub documents — concept pages with only a backlink and no prose. The embedding of ---\n_Source: [[some/link]]_ has near-zero semantic overlap with any natural-language question.
Weak conversation entries — summaries missing the overview section, leaving only related-link lists. The generated query ends up asking by ID rather than by topic.
Genuinely hard queries — the query-generation model produced questions more specific than the source document actually covered.

Improving past 0.667 is a content problem, not a retrieval problem.

Key Takeaways

The golden set is the foundation. Lock it after the first clean generation and treat it like a test fixture. Regenerating between runs throws away your only reference point.
Silent zero-metric failures are the hardest bugs. Always-zero metrics with no exceptions point at data plumbing — ID mismatches, wrong collection names, schema drift — not at model quality. Sample directly from the same source the eval queries against.
Weight defaults must sum to the same total as the weights the proposer returns. If they don’t, convergence triggers on a boundary effect and the loop quits before it explores.
tool_use over free-text JSON. Forcing a structured tool call eliminated parse failures entirely. The only remaining check is whatever the JSON schema can’t express — for me, the sum constraint.
The pattern works. Propose config → eval → measure → iterate. No GPU, no training loop, no special framework. The same ratchet logic Karpathy applied to val_bpb applies cleanly to MRR, and the LLM’s weight proposals were genuinely informative — it correctly identified that conversations carry the most signal for this corpus, and it correctly confirmed the peak by exploring all directions from it.

One takeaway for the future:

The measurement loop has a longer life than the result. Today’s winning config — hybrid on, time decay off — reflects today’s corpus, which is mostly evergreen content. A time-based collection is planned for a later phase, and decay will almost certainly matter there. Every time the corpus shape shifts, the same propose → eval → measure → iterate loop runs again. The harness is the durable artifact, not the weights it produced.

The deeper lesson, for me, was this: most of the difficulty in RAG tuning isn’t in the tuning. It’s in setting up a measurement you trust enough to act on. Once that’s there, the optimisation part is almost boring — which is exactly what you want.