
Proactively Augmented Generation (PAG): Stop Paying LLM Prices for Repo Search


Your AI dev tools search your repo while you wait. Cursor greps 200 lines at a time with regex. Copilot pulls in whatever files sit nearby in the tree. Every search is synchronous, token-priced, and blocking.

This is a solved problem. rust-analyzer pre-indexes your entire crate graph. JetBrains builds a full semantic model on project open. Language servers have done background indexing for years. Then AI tooling showed up and regressed to synchronous grep, charging you tokens for the privilege.

The LLM context window is a bounded, expensive, per-call resource. Hard token limits. Real cost per call. Amnesic between calls. Whatever doesn’t make it into the window, the model guesses at. That’s a different problem than serving editor completions into a persistent connection. It deserves tooling that takes it seriously.

What PAG actually is

If LLMs are eager junior devs, current context retrieval hands them the whole repo and says “go find what you need, call me when you think you’re done.” They search. They guess. They miss the one spec that governs the change. And you pay for every token of that wandering.

PAG flips this. The senior pre-stages a briefing packet before the junior touches anything: “Here’s the spec, the dependency map, and the two files that matter. Then we talk.” That senior is a harness running cheap traditional tools during the dead time your dev loop already has.

Background indexing, time-gating, and priority queues are well-understood. None of this is novel in isolation. What caught my attention is that the LLM context window creates a fundamentally different optimization target: token-limited, discrete, expensive, amnesic between calls. That combination means the same tools that power your LSP need different orchestration when the output is a bounded prompt instead of a live index.

The mechanism: time-gated collection

All analyzers launch simultaneously at T=0. Not LLM calls. AST parsers, dependency walkers, syntax checkers, coverage maps, import scanners. Tooling that runs in single-digit milliseconds and costs nothing in tokens.

The analyzers fall into three tiers based on latency: those that finish before the cutoff, those that defer into the next shot, and the long-running stragglers behind them.

At a fixed cutoff (200ms in the current design), the system snapshots. Whatever’s ready goes into the context window. The LLM gets called immediately. No blocking on slow analyzers.

  0ms   All analyzers launch
  5ms   Syntax checker completes
  8ms   Import scanner completes
 10ms   Comment analyzer completes
 50ms   Complexity analyzer completes
 80ms   Error detector completes
120ms   Code smell detector completes
180ms   Dependency analyzer completes
200ms   *** Snapshot. LLM called. User gets a response. ***
500ms   Performance analyzer completes → context store
  1.5s  Memory profiler completes → context store
  2.0s  Architecture reviewer completes → context store
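The timeline above can be sketched as a fire-and-collect pass. This is a minimal illustration, not the post's actual implementation: the analyzer names and durations are invented stand-ins for real tools (syntax checker, import scanner, and so on), and `concurrent.futures.wait` plays the role of the 200ms cutoff.

```python
import concurrent.futures
import time

# Hypothetical stand-ins for real analyzers; names and durations are invented.
def make_analyzer(name, duration_s):
    def run():
        time.sleep(duration_s)           # pretend to do real analysis
        return name, f"{name} findings"
    return run

ANALYZERS = {"syntax": 0.005, "imports": 0.008,
             "errors": 0.08, "performance": 0.5}

def collect(cutoff_s=0.2):
    """Launch everything at T=0; snapshot whatever is done at the cutoff."""
    pool = concurrent.futures.ThreadPoolExecutor()
    futures = [pool.submit(make_analyzer(n, d)) for n, d in ANALYZERS.items()]
    done, pending = concurrent.futures.wait(futures, timeout=cutoff_s)

    window = dict(f.result() for f in done)      # handed to the LLM now
    context_store = {}                           # stragglers land here later
    for f in pending:
        f.add_done_callback(lambda fut: context_store.update([fut.result()]))
    pool.shutdown(wait=False)                    # don't block; let them finish
    return window, context_store

window, store = collect()
```

Nothing waits on the slow analyzers: the snapshot returns immediately with whatever is ready, and the `add_done_callback` hook routes late results into the store in the background.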

Slow results flow into a context store as they finish. When the next LLM call fires, those results are already there. No retrieval step. No additional cost. Dead time became better context.

A simple LLM ask takes a minute. A complex task can run ten to thirty. A 200ms collection pass before that call is rounding error. But the model notices what’s in the window. A call that starts with the right spec, the dependency map, and the relevant code doesn’t spend its first ten tool calls grepping for context it could have had for free. The 200ms pays for itself in fewer tool calls, less wandering, and a shorter path to the answer.

Multi-shot isn’t an advanced mode. It’s the default. Every interaction is shot N. Shot 1 happens to be the first. Each subsequent shot inherits whatever the previous shots’ deferred analyzers produced. The junior’s briefing packet gets thicker every round without anyone doing extra work.
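The accumulation across shots can be sketched in a few lines. This is a hypothetical shape for the context store, not the author's code: each shot drains whatever deferred analyzers have deposited since the last call and merges it with that shot's fast results.

```python
# Hypothetical sketch of cross-shot accumulation; names are invented.
class ContextStore:
    def __init__(self):
        self._results = {}

    def deposit(self, name, findings):
        self._results[name] = findings   # a deferred analyzer finished

    def drain(self):
        ready, self._results = self._results, {}
        return ready

store = ContextStore()
packet = {}   # the junior's briefing packet, thickening every round

def run_shot(fresh_window):
    packet.update(store.drain())   # inherit prior shots' deferred output
    packet.update(fresh_window)    # plus this shot's fast results
    return dict(packet)

shot1 = run_shot({"syntax": "ok"})
store.deposit("architecture", "review notes")   # finished after shot 1
shot2 = run_shot({"imports": "3 unused"})
# shot2 carries syntax, imports, and the deferred architecture review
```

Shot 2 pays nothing extra for the architecture review: it ran during shot 1's dead time and was simply sitting in the store.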

What goes in the window

Time-gating produces context fast and cheap. You still need to decide what makes it into the window. Not everything fits. Not everything should. This is where the LLM context problem diverges from what LSPs do: you’re packing a bounded token budget for a single discrete call, not maintaining a live index.

Priority tiers: upstream artifacts like specs and architectural decisions are non-evictable, first in. Docs next. Analyzer output next. Code snippets last. Strict budgets on tokens, bytes, or lines. If must-haves exceed the budget, fail fast with a clear error instead of silently dropping context and hoping the model figures it out.
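A budgeted packer along these lines might look like the following sketch. The tier names, the chars-per-token heuristic, and the `BudgetExceeded` error are all assumptions for illustration, not the described system's API.

```python
# Sketch of deterministic packet assembly under a token budget.
# Tier order and estimate_tokens() are assumptions, not the real system.
TIERS = ["spec", "docs", "analysis", "code"]   # non-evictable first

def estimate_tokens(text):
    return len(text) // 4   # rough chars-per-token heuristic

class BudgetExceeded(Exception):
    pass

def pack(items, budget):
    """items: {tier: [str, ...]}. Fill the packet in strict tier order;
    fail fast if must-have tiers alone blow the budget."""
    packet, used = [], 0
    for tier in TIERS:
        for chunk in items.get(tier, []):
            cost = estimate_tokens(chunk)
            if used + cost > budget:
                if tier == "spec":   # must-haves never get dropped silently
                    raise BudgetExceeded(
                        f"specs need {used + cost} tokens, budget is {budget}")
                return packet        # evict everything lower-priority
            packet.append((tier, chunk))
            used += cost
    return packet
```

The key design choice is the asymmetry: lower tiers degrade gracefully by eviction, but an unpackable must-have is a loud error, never a silent omission.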

Same config, same repo, same packet. You can diff what the model saw between runs. When behavior changes, compare packets, not prompts. Inspectability falls out of deterministic assembly. You don’t bolt it on.
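One way to make "diff the packets" concrete, assuming JSON-serializable packets (the fingerprint helper here is illustrative, not part of the described system): canonicalize the packet and hash it, so identical inputs provably produce identical context.

```python
import hashlib
import json

# Illustrative helper: deterministic assembly means the same inputs always
# yield the same packet, so a content hash identifies "what the model saw".
def packet_fingerprint(packet):
    canonical = json.dumps(packet, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

run_a = {"spec": "auth flow v2", "code": ["auth.py"]}
run_b = {"spec": "auth flow v2", "code": ["auth.py", "session.py"]}

# Key order doesn't matter; content does.
assert packet_fingerprint(run_a) == packet_fingerprint(
    dict(reversed(list(run_a.items()))))
```

When two runs behave differently, comparing fingerprints tells you instantly whether the model saw different context; diffing the canonical JSON tells you exactly what changed.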

How PAG compares

| Dimension | RAG | Agentic search (Cursor-style) | PAG |
|---|---|---|---|
| Who drives retrieval | Model (similarity search, tool calls) | Model (grep, regex, 200 lines at a time) | Harness (traditional tools, pre-computed) |
| When context is assembled | Reactively, on the critical path | Reactively, per keystroke or tool call | Proactively, during dead time |
| Cost per context pass | LLM tokens for retrieval + generation | LLM tokens for search + generation | Fractions of a cent (traditional tools) |
| Inspectability | Opaque retrieval intent | Opaque tool calls | Deterministic packet, diffable |
| Progressive enrichment | Re-retrieves each turn (no accumulation) | Re-searches each turn (no accumulation) | Yes (deferred results flow into next shot) |

RAG is still the better fit when you genuinely don’t know what matters: broad knowledge bases, exploratory Q&A, unstructured corpora. PAG is for structured domains where you do know, and the question is how to get that knowledge to the model cheaply and fast.

Where it breaks

The time-gated architecture is implemented. Three-tier timeout, fire-and-collect, deferred injection into subsequent shots: that machinery works. What’s missing is production data at scale across diverse repos. The benchmarks are ahead of me, not behind me.

PAG rewards structured repos. Clear module boundaries, specs, and docs give analyzers something to work with. A flat directory with no documentation produces thin packets. You don’t need perfection, but structure helps.

No dead time, no background work. If your dev loop is pure synchronous chat with no tests or builds, there’s nowhere to run analyzers. Most real workflows have test suites or inference latency, so this bites less often than it sounds. But worth knowing.

Intelligent analyzer selection is designed but early. Current implementations use keyword matching. The ML-driven layer that predicts which analyzers matter for a given query is future work. I haven’t tested multi-repo scale. Single-repo with structured artifacts is the proven ground.

What’s next

The pattern has obvious next moves: predictive deferral (learn which analyzers to defer by query type), adaptive timeouts (adjust cutoffs to system load), result streaming (partial context updates as analyzers complete), and a PAG + RAG hybrid where you build a curated base packet from pre-computed context and let RAG handle long-tail lookups within bounded scope. I’ll write those up as the implementations mature.


