
Proactively Augmented Generation (PAG): When the Harness Steers Retrieval

18 min read · infrastructure

You’ve seen this problem: you ask an LLM to help with a codebase task, and it wanders. It grabs whatever files look vaguely relevant. Sometimes it finds the right context. Sometimes it hallucinates because it missed the one spec that would have told it what to do. You can’t reproduce the run. You can’t audit what it saw. You can’t explain to anyone why it did what it did.

Proactively Augmented Generation (PAG) is a way of feeding context to an LLM where the harness, not the model, decides what to load and how.

Instead of asking the model to “go fetch what you need” (classic RAG), you:

  1. Model your domain and artifacts up front (specs, ADRs, docs, code, schemas).
  2. Use deterministic rules to assemble a context packet from that graph before you call the model.
  3. Hand that packet to the LLM as its working set.

If LLMs are eager junior devs, PAG is the senior dev sitting down before the meeting and saying:

“Here’s the spec, ADRs, and the two files you actually need to read. Then we talk.”

Not:

“Here’s the whole repo. Go wander the wiki and call me when you think you’re done.”

How PAG differs from RAG

Most setups treat RAG (Retrieval-Augmented Generation) as “let the model decide what to retrieve, using similarity search or tools.”

In typical RAG:

  • The model (or an agent loop) decides mid-generation what to retrieve.
  • Retrieval runs over a broad corpus via similarity search or tool calls.
  • What lands in context can shift from run to run as embeddings, the corpus, or the model's queries change.

In PAG:

  • The harness decides, before the call, exactly what the model will see.
  • Context is assembled by deterministic rules over a modeled graph of specs, ADRs, docs, and code.
  • The same request against the same repo produces the same packet.

The short version: RAG lets the model pull context on demand; PAG has the harness push a curated packet up front.

Both augment generation with external knowledge. PAG just moves the control point.

Scoped to one flow at a time

PAG doesn’t build “the one true context” for an entire codebase. It’s flow-scoped.

A flow is a specific kind of work:

  • Review a change against its spec.
  • Explain a module or service.
  • Generate tests for a feature.
  • Compare two designs or implementations.

At runtime, a typical pass looks like this:

  1. Request routing figures out what kind of work this is:

    • Classifies the intent (“review”, “explain”, “generate tests”, “compare”).
    • Extracts anchors: spec IDs, ADR IDs, file paths, services, domains (“billing”, “checkout”, “auth”).
    • Selects the flow and which analyzers to run.

    This can be NLP-based (a classifier + entity extractor), but it doesn’t have to be. Rule-based routing, pattern matching on file paths, git branch conventions, or explicit user selection all work. The key is that something maps the request to a flow before the packet builder runs (a small sketch of rule-based routing appears below).

  2. Analyzers and tools run for that flow:

    • Quick queries over your existing graph, metrics, or coverage.
    • Direct lookups of specs, ADRs, and docs tied to the anchors.
    • Optionally, slower jobs that can run in the background.
  3. The flow harness hands PAG a flow library:

    • Upstream specs and ADRs.
    • Relevant docs and READMEs.
    • Code snippets or modules.
    • Analyzer outputs (e.g., “entities in this module”, “callers of this function”).
    • Previous receipts for similar flows.

The packet builder operates entirely inside that flow library. It doesn’t rediscover the world; it turns “here’s everything that might matter for this flow” into “here is the exact, budgeted packet this model will see for this run.”
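
To make the routing step concrete, here is a minimal, rule-based sketch in Python. The flow names, ID formats, and regexes are hypothetical, not from any particular harness; the point is only that the request is mapped to a flow and a set of anchors deterministically, before any packet is built.

```python
import re
from dataclasses import dataclass, field

# Hypothetical flow names and trigger patterns; yours will differ.
FLOW_PATTERNS = {
    "review":         re.compile(r"\b(review|critique|check)\b", re.I),
    "generate-tests": re.compile(r"\b(tests?|coverage)\b", re.I),
    "compare":        re.compile(r"\b(compare|versus|vs\.?)\b", re.I),
    "explain":        re.compile(r"\b(explain|why|how)\b", re.I),
}

# Anchors pulled straight out of the request text (assumed ID schemes).
SPEC_ID = re.compile(r"\bSPEC-\d+\b")
ADR_ID  = re.compile(r"\bADR-\d+\b")
PATH    = re.compile(r"\b[\w./-]+\.(?:rs|py|md|sql)\b")

@dataclass
class RoutedRequest:
    flow: str
    anchors: dict = field(default_factory=dict)

def route(request: str) -> RoutedRequest:
    """Map a free-form request to a flow plus anchors, with no model call."""
    flow = next((name for name, pat in FLOW_PATTERNS.items()
                 if pat.search(request)), "explain")   # default flow
    anchors = {
        "specs": SPEC_ID.findall(request),
        "adrs":  ADR_ID.findall(request),
        "paths": PATH.findall(request),
    }
    return RoutedRequest(flow=flow, anchors=anchors)

# route("Review orders.rs against SPEC-42 and ADR-7")
# -> flow="review", anchors={"specs": ["SPEC-42"], "adrs": ["ADR-7"], "paths": ["orders.rs"]}
```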

Note that packets can be scoped at different levels:

  • Flow-level: one packet for the whole run of a flow (typical for single-shot, CI-style calls).
  • Shot-level: a fresh packet for each model call within a multi-shot flow, rebuilt as new artifacts arrive.

The same PAG principles apply at both levels: priorities, budgets, deterministic assembly, and receipts. The difference is what you’re scoping to.

This isn’t Copilot auto-context with a fancier name

Most “auto context” systems today (Copilot, Cursor, etc.) are some mix of:

  • Open and recently edited files.
  • Files that are nearby in the file tree or import graph.
  • Sometimes embedding-based search over the workspace.

From your point of view as a user, it’s basically: “we’ll try to find some relevant stuff and jam it into the prompt behind the scenes.”

You rarely get:

  • A durable record of what was actually included.
  • A way to reproduce or diff the context between runs.
  • A say in what gets priority when the window is tight.

It’s helpful for autocomplete. It’s also opaque and opportunistic.

Concrete example: you’re editing orders.rs and the editor pulls in invoice.rs and customer.rs because they’re nearby in the file tree — even though the change you’re making is governed by a spec in specs/order-processing.md that it never sees. The heuristic grabbed proximity; it missed intent.

PAG is different: this isn’t the model or the editor heuristically grabbing random nearby files. It’s a curated, spec-first library with deterministic rules.

| Dimension | Editor auto-context | PAG |
| --- | --- | --- |
| Who decides what’s relevant | Hidden heuristics + sometimes semantic search | Your harness + spec/ADR/doc graph + explicit priorities |
| When context is assembled | Just-in-time, often per keystroke, often opaque | Pre-call, single deliberate pass, with budgets and failure modes |
| Inspectability | No durable, inspectable “packet” per request | You can look at the packet, hash it, log it, diff it |
| Input library | “The repo and maybe the workspace” | A curated library: specs, summaries, explanations, selectively chosen code/docs |

If you just said “PAG = auto context”, that’d be hand-wavy. As defined here — spec-first, curated library, deterministic packet builder, receipts — it’s a distinct pattern.

Why this matters

RAG works well for broad knowledge bases, semi-unstructured corpora, exploratory question-answering.

It’s less great when:

  • You already know the structure: specs, ADRs, schemas, and a code graph you could query directly.
  • You need reproducibility, audits, or CI gates on what the model saw.
  • Latency and token budgets are tight and extra retrieval round-trips hurt.

PAG tackles three issues in those settings:

Predictability

With RAG, if the embedding model changes or the corpus shifts, your retrieval set can “wiggle” in ways that are hard to reason about.

With PAG: same config + same repo → same packet. If behaviour changes, you can diff input artifacts, packet contents, and environment (lockfiles/receipts) — not just the final prompt.

Governance

If you’re running LLMs in an SDLC or governance-heavy context, you need to answer:

  • What exactly did the model see for this run?
  • Why were those artifacts chosen over others?
  • Can you reproduce the run and prove it?

RAG gives you a retrieval intent but not necessarily a clean, auditable packet.

PAG stores packet evidence (what files, which priorities, how they were chosen), writes receipts with hashes and metadata, and fits naturally into lockfiles, CI gates, and audits.

Speed and context efficiency

RAG often means multiple model calls (“what should I search?”, “what next?”), multiple retriever calls, and sometimes redundant or noisy context.

PAG tends to be faster (one packet builder pass over a known tree, one model call) and more context-efficient (only highest-priority artifacts make it in, budgets enforced up front).

In a repo with good structure, the harness can do a better job than “semantic nearest neighbours to this text.”

Reducing hallucination and Process Confabulation

Two common failure modes when LLMs work with codebases:

  • Hallucination: the model invents APIs, behaviour, or requirements because the one artifact that would have told it the truth never made it into context.
  • Process Confabulation: the model is handed so much loosely related material that it stitches together a plausible-sounding but wrong account of how the system or process actually works.

PAG attacks both from the context side:

  • Must-have artifacts (specs, ADRs, schemas) are prioritized and budgeted in, so coverage is there.
  • Every artifact in the packet is scoped to the flow and kept concise, so the model isn’t reasoning over noise.

The goal isn’t to make the model smarter. It’s to make sure the model sees the right information with enough coverage to reduce hallucination, while keeping each chunk concise and scoped enough to avoid Process Confabulation.

How it works in practice

Build a domain graph

Start by modeling the things that matter:

  • Specs and ADRs.
  • Design docs and READMEs.
  • Code modules and schemas.
  • Analyzer outputs you already have (graphs, coverage, metrics).

Organize them so tooling can reason about them: by path (specs/, docs/, src/), by tags (“core”, “upstream”, “doc”, “code”), by dependencies (“this design depends on this ADR”).
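
As a minimal sketch, assuming a simple in-memory representation (the paths and tags below are illustrative, not from any real repo):

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    path: str                      # where it lives in the repo
    kind: str                      # "spec" | "adr" | "doc" | "code" | "schema"
    tags: set = field(default_factory=set)
    depends_on: list = field(default_factory=list)   # upstream artifact paths

# Illustrative graph; a real harness would derive this from paths,
# front matter, or explicit links between documents.
GRAPH = [
    Artifact("specs/order-processing.md", "spec", {"core", "upstream", "billing"}),
    Artifact("docs/adr/0007-retry-policy.md", "adr", {"upstream", "checkout"}),
    Artifact("src/orders.rs", "code", {"billing"},
             depends_on=["specs/order-processing.md"]),
    Artifact("schemas/order.sql", "schema", {"billing"}),
]
```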

Assign priorities and budgets

PAG is opinionated about what’s more important when context is tight:

  • Upstream specs and ADRs first (they define what “correct” means).
  • Then schemas, interfaces, and focused docs.
  • Then the code directly touched by the flow.
  • Everything else (summaries, neighbouring modules, misc) last.

Then define budgets: max tokens (or bytes/lines) per packet, max docs per type, optional fair-share rules (“never starve ADRs”).
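
A sketch of what that policy can look like as plain data; the numbers are placeholders, not recommendations:

```python
# Priority bands: lower number = packed first. Illustrative values only.
PRIORITIES = {
    "spec":   0,
    "adr":    0,
    "schema": 1,
    "doc":    1,
    "code":   2,
    "misc":   3,
}

BUDGETS = {
    "max_packet_tokens": 24_000,    # hard cap for the whole packet
    "max_docs_per_kind": 5,         # e.g. never more than five ADRs
    "fair_share": {"adr": 0.10},    # never starve ADRs below 10% of the packet
}
```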

Assemble the packet

The packet builder:

  1. Walks the graph, highest-priority first.
  2. Adds artifacts while staying under budgets.
  3. If must-have artifacts don’t fit → fails fast with a clear error (“your upstream spec is bigger than the packet budget”).

Result: a fixed packet with data, provenance, and ordering. Only now do you call the model.
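
As a rough sketch (not the author’s implementation), a greedy, deterministic packer over structures like the ones above might look like this; `estimate_tokens` is a placeholder heuristic and the artifact shape is assumed:

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    path: str
    kind: str        # "spec" | "adr" | "doc" | "code" | ...
    text: str
    priority: int    # lower = more important
    must_have: bool = False

def estimate_tokens(text: str) -> int:
    # Placeholder heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)

def build_packet(candidates: list[Artifact], max_tokens: int) -> list[Artifact]:
    """Walk the flow library highest-priority first, pack under budget, fail fast."""
    packet: list[Artifact] = []
    used = 0
    # Deterministic ordering: priority first, then path, so the same inputs
    # always produce the same packet.
    for art in sorted(candidates, key=lambda a: (a.priority, a.path)):
        cost = estimate_tokens(art.text)
        if used + cost <= max_tokens:
            packet.append(art)
            used += cost
        elif art.must_have:
            raise RuntimeError(
                f"must-have artifact {art.path} ({cost} tokens) does not fit "
                f"in the remaining budget ({max_tokens - used} tokens)"
            )
        # Optional artifacts that don't fit are simply skipped.
    return packet
```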

Emit receipts

Each run records a receipt: exit code, error kind, model info, hashes, and packet evidence. Optionally pins the environment (lockfile) for drift detection. You can inspect and diff the actual inputs to the model, not just the prompts.

This closes the loop: iterate on packet rules (priorities, budgets) based on what receipts tell you.
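
A hypothetical receipt writer, assuming the artifact shape from the packet-builder sketch; the exact fields are up to you:

```python
import hashlib, json, time

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def write_receipt(run_id: str, model: str, packet, exit_code: int, path: str) -> None:
    """Record what this run actually saw, so it can be inspected and diffed later."""
    receipt = {
        "run_id": run_id,
        "timestamp": time.time(),
        "model": model,                      # model name/version used for the call
        "exit_code": exit_code,
        "packet": [
            {"path": a.path, "kind": a.kind, "priority": a.priority,
             "sha256": sha256(a.text)}
            for a in packet
        ],
        "packet_hash": sha256("".join(a.text for a in packet)),
    }
    with open(path, "w") as f:
        json.dump(receipt, f, indent=2)
```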

Watch out for

Formalizing chaos. If your repo structure is random, your packet builder will be too. Invest a bit in organizing specs and docs before you encode them in rules.

Over-engineering the first version. Don’t spend weeks designing the perfect priority scheme. Start with “specs and ADRs are Priority 0, everything else is lower” and iterate based on what receipts tell you.

Receipts nobody reads. Receipts only matter if someone looks at them. Wire at least one check into CI — “fail if required specs are missing from the packet” — so the evidence has teeth.
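
A minimal version of that CI check, assuming receipts shaped like the sketch above; the required paths are hypothetical:

```python
import json, sys

REQUIRED = {"specs/order-processing.md"}   # hypothetical must-have artifacts

def check_receipt(receipt_path: str) -> int:
    with open(receipt_path) as f:
        receipt = json.load(f)
    included = {entry["path"] for entry in receipt["packet"]}
    missing = REQUIRED - included
    if missing:
        print(f"FAIL: required specs missing from packet: {sorted(missing)}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(check_receipt(sys.argv[1]))
```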

Try this once in your repo

Before you build a full packet builder, run a single experiment.

  1. Pick one LLM task you already run (e.g., “summarize this spec and propose tests”).
  2. Manually list the 5–10 artifacts that must be in context: the main spec, any linked ADRs, one schema, at most 1–2 code files.
  3. Build a small packet.json with file paths, byte counts, and a simple priority field.
  4. Log that packet next to the model call.

You’ve just taken the first step toward PAG: a deterministic packet you can inspect and diff later, instead of opaque auto-context.
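
A throwaway script is enough for steps 2–4; the paths below are placeholders for whatever artifacts you picked:

```python
import json, os

# Step 2: the 5-10 artifacts that must be in context, hand-picked.
ARTIFACTS = [
    {"path": "specs/order-processing.md", "priority": 0},
    {"path": "docs/adr/0007-retry-policy.md", "priority": 0},
    {"path": "schemas/order.sql", "priority": 1},
    {"path": "src/orders.rs", "priority": 2},
]

# Step 3: build packet.json with paths, byte counts, and priorities.
packet = [{**a, "bytes": os.path.getsize(a["path"])} for a in ARTIFACTS]

with open("packet.json", "w") as f:
    json.dump(packet, f, indent=2)

# Step 4: log packet.json alongside the model call, however you make it.
```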

PAG as a spectrum, not a product

PAG isn’t a specific toolchain. It’s a way of thinking about context:

“Given this request and this budget, what’s the most useful, token-efficient set of inputs I can assemble on purpose, without flooding the window with garbage?”

You can apply this at very different levels of sophistication.

Level 0: Implicit policy (no PAG yet)

The default most teams start with:

  • Whatever the editor’s auto-context grabs.
  • Whole files pasted in until the window fills up.
  • Ordering determined by whatever the tool happened to do.

This is a policy—it’s just an implicit one: “whatever fits, in whatever order the tool picks.”

Level 1: Heuristics with intent

You don’t need a Rust harness to start doing PAG-style thinking. Even with a 2M token window and cheap tokens, you can:

  • Rank candidate files by how likely they are to matter for this request.
  • Put specs, ADRs, and interfaces ahead of incidental code.
  • Cap how much low-value “misc” content is allowed in.

Here, the “packet” might be nothing more than an ordered list of files, a few simple priority bands, and a soft budget (“never let ‘misc’ consume more than 20% of the window”).

You’re still using a big window, but you’re packing it deliberately instead of dumping the repo.
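
A sketch of that kind of soft packing, assuming candidates are already sorted by priority band; the 20% cap mirrors the example above:

```python
def pack_soft(candidates, window_tokens, misc_cap=0.20):
    """candidates: list of (name, band, tokens), pre-sorted by priority band.
    band is e.g. "spec", "adr", "code", or "misc"."""
    packed, used, misc_used = [], 0, 0
    misc_budget = int(window_tokens * misc_cap)   # never let misc exceed the cap
    for name, band, tokens in candidates:
        if used + tokens > window_tokens:
            continue
        if band == "misc" and misc_used + tokens > misc_budget:
            continue
        packed.append(name)
        used += tokens
        misc_used += tokens if band == "misc" else 0
    return packed
```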

Level 2: Structured graphs and light receipts

Once you have more structure—specs and ADRs in version control, some notion of modules/services, basic analyzers (graphs, coverage, metrics)—you can move to the more “classical” PAG shape:

  • A domain graph of specs, ADRs, docs, code, and schemas.
  • Explicit priorities and budgets.
  • A deterministic packet builder that fails fast when must-haves don’t fit.
  • Light receipts: what went into the packet, and why.

This is where per-run reproducibility and CI integration become realistic.

Level 3: Full harness, multi-shot, receipts everywhere

At the far end:

  • A full harness that routes requests to flows and runs analyzers per flow.
  • Multi-shot packets that fold in background analysis between calls.
  • Receipts on every run, with hashes, lockfiles, and environment pins.
  • CI gates and audits built on top of that evidence.

You don’t have to jump straight to Level 3. If all you do is stop letting auto-context spray random neighbours into the prompt and instead rank candidates by “how likely is this to matter for this request” and enforce some notion of budgets and caps, you’re already doing a light version of PAG. The rest is tightening the rules and adding better evidence.

When PAG works (and when it doesn’t)

Works best when:

  • The repo has real structure: specs, ADRs, schemas, and code you can map to flows.
  • You need reproducibility, governance, or CI gates on model inputs.
  • The work falls into a known set of flows you can route to.

Less ideal when:

  • The corpus is broad, semi-unstructured, or constantly changing.
  • The questions are exploratory and you can’t predict what will matter.
  • There’s no structure to model ahead of time.

In those cases, classic RAG is still the better fit. You want the model to help steer retrieval because you don’t know the structure ahead of time.

Single-shot vs multi-shot PAG

So far, we’ve assumed a single-shot setup: for a given flow and phase, you build one packet, call the model once, and log the receipt.

In practice, you often want multi-shot PAG across a conversation.

Single-shot

For a single-shot call:

  • Routing picks the flow and anchors.
  • The packet builder assembles one packet from the flow library.
  • The model is called once, and a receipt is written.

Useful when you’re in CI or batch mode, you need a reproducible one-shot judgement, or you care more about determinism than interactivity.

Multi-shot

In a multi-shot setup, you stretch the same pattern across shots:

Shot 1:

  • Build the packet from fast lookups only: specs, ADRs, docs, and cheap analyzer results.
  • Call the model and answer without waiting on anything slow.

Between shots:

  • Slower analyzers keep running in the background.
  • Finished results land in the flow library as new, tagged artifacts.

Shot 2 and beyond:

  • Rebuild the packet with the same rules; the new analyzer outputs now compete on priority.
  • Each shot gets its own packet and its own receipt.

You’re not “hoping it sees more stuff next time.” You’re deliberately refusing to block the first answer on slow analysis, letting slow analysis catch up while the user reads or types, and treating finished slow results as new, high-value artifacts for the next packet in the same flow.
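
A rough sketch of that loop; every callable here (packet builder, model call, receipt writer, user-turn and analyzer polling) is injected as a parameter, since those pieces are harness-specific:

```python
def run_flow(initial_request, flow_library, build_packet, call_model,
             write_receipt, next_user_message, poll_finished_analyzers,
             max_tokens=24_000):
    """Multi-shot PAG: every shot repacks the (growing) flow library."""
    request, shot = initial_request, 0
    while request is not None:
        # Same deterministic rules every shot; only the library changes.
        packet = build_packet(flow_library, max_tokens)
        answer = call_model(request, packet)
        write_receipt(shot, packet, answer)

        # Finished background analyses become new, high-value artifacts
        # for the next packet in the same flow.
        flow_library.extend(poll_finished_analyzers())

        request = next_user_message(answer)   # None when the conversation ends
        shot += 1
```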

Example: background analysis improving the second answer

Imagine a user asks:

“Why does the checkout service keep timing out?”

Shot 1:

  • Route to the right flow for the checkout service.
  • Pull fast lookups only: the checkout spec, linked ADRs, and whatever cheap artifacts the flow defines (configs, recent changes).

The first answer can outline plausible causes and ask for clarification, without waiting on heavy analysis.

While the user is reading, the harness kicks off slower analyzers:

  • A call-graph analysis of the checkout request path.
  • A focused diff of recent changes to the components on that path.

By the time the user asks:

“Can you show which components are on the slow path?”

those heavier analyzers have finished. Their results are already scoped to the same flow, tagged with the same service and ADRs, and available as new artifacts in the flow library.

Shot 2’s packet can now include a summarized call graph for the slow path and a focused diff for the relevant components, alongside the original specs and docs. Same PAG rules, strictly better inputs.

PAG + RAG together

You don’t have to pick one forever. A mature setup might:

  • Use PAG to assemble the core packet from the known graph: specs, ADRs, schemas, scoped code.
  • Allow a bounded RAG step for anything the graph doesn’t cover, restricted to paths and sources the flow permits.
  • Record whatever that retrieval adds in the same receipt as the rest of the packet.

Think of it as: “First, assemble the curated briefing folder (PAG). If you really need something extra, run a controlled search within these bounds (RAG).”
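
A sketch of what “a controlled search within these bounds” could look like; the `search` callable (similarity search over your corpus) and the allowed prefixes are assumptions:

```python
ALLOWED_PREFIXES = ("specs/", "docs/", "src/checkout/")   # hypothetical bounds

def bounded_rag(query: str, search, k: int = 3, max_extra_tokens: int = 2_000):
    """Run a similarity search, but only keep results inside the flow's bounds
    and only up to a small, explicit budget on top of the PAG packet."""
    extras, used = [], 0
    for path, text, tokens in search(query, top_k=20):
        if not path.startswith(ALLOWED_PREFIXES):
            continue                                  # outside the allowed scope
        if len(extras) >= k or used + tokens > max_extra_tokens:
            break
        extras.append((path, text))
        used += tokens
    return extras
```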

How this relates to systems like Watson/DeepQA

If you’ve seen descriptions of IBM’s Watson DeepQA architecture, the pattern here should feel familiar.

DeepQA didn’t rely on a single model. It used:

  • Question analysis to figure out what was actually being asked.
  • Many specialized components generating candidate answers.
  • Evidence retrieval and a large set of scorers evaluating each candidate.
  • A final stage that merged and ranked everything into one answer.

PAG plays a similar role around an LLM:

  • Request routing is the question analysis.
  • Analyzers and lookups are the evidence gatherers.
  • The packet builder is the merger that decides which evidence the final answerer actually sees.
  • The LLM is the final answerer.

In a multi-shot setup, this aligns even more with DeepQA’s progressive evidence gathering:

  • Early shots run on fast, cheap evidence.
  • Later shots fold in the results of slower, deeper analysis as it completes.

The goal isn’t to re-implement DeepQA. It’s to take the same “many small, specialized reasoners feeding a final answerer” pattern and pair it with modern LLMs and explicit, inspectable packets instead of opaque prompts.

The bottom line

If you’ve been frustrated by LLMs that wander aimlessly through your codebase, miss the specs that matter, and produce unreproducible results—PAG is the pattern that fixes it.

PAG doesn’t try to make the model smarter. It makes sure the model sees the right information with enough coverage to reduce AI hallucination, while keeping each chunk concise and scoped enough to avoid Process Confabulation.

You can apply this whether you have a tight 8K window or a 2M token context: the point isn’t “use less context,” it’s “use your budget on the highest-value context you can assemble on purpose.”

PAG is what you use when you’re serious about giving your LLM every chance to succeed—and still having receipts when something goes wrong.


