Originally submitted to the Kiroween hackathon on Devpost.
Keep your code monster under control with xchecker: a Frankenstein spec harness that stitches Rust safety to LLM creativity in an auditable flow.
Inspiration
Victor’s mistake wasn’t building a monster; it was failing to take care of Adam — failing to build the structure and guardrails that would let him grow.
Vibe coding today feels a lot like Victor’s lab: wiring powerful models straight into repos without specs, receipts, or schema enforcement, then acting surprised when the results bite back.
Anthropic and others have been pushing long-running agents and harnesses: tool-first workflows over versioned artifacts instead of one-off prompts. GitHub’s Spec Kit and Amazon’s Kiro push in that direction from the workspace side.
xchecker is the pipeline version of that idea:
- Treat LLMs as powerful but junior collaborators.
- Wrap them in a Rust harness with contracts, receipts, and exit codes.
- Run a multi-phase spec and design pipeline before anyone trusts the output.
What xchecker does
xchecker is a spec-driven Rust CLI for the upstream part of LLM-driven development — requirements, design, tasks, review, fixups. It turns that work into a structured, auditable pipeline instead of a one-shot prompt.
Core premise:
LLMs are eager juniors. Give them a governed spec lifecycle — clear inputs, explicit phases, strong verification — instead of a single prompt and hope.
That structure is there so LLM “juniors” can move quickly while human reviewers look at finished packets and receipts instead of raw prompts and diffs. Trust, but verify.
Current scope
Right now, xchecker owns the spec side of the SDLC:
Requirements → Design → Tasks → Review → Fixup → Final (you end up with a spec, a plan, and change proposals; not a deployed service)
It does not run builds, tests, or deploys. The spine is designed so downstream phases (Build, Gate, Deploy, etc.) can attach later; this release focuses on upstream thinking and change proposals you can gate and audit.
Scope snapshot
| Stage | Status here |
|---|---|
| Problem → Spec | Implemented (requirements, design) |
| Spec → Tasks | Implemented (task plan, review, fixups) |
| Build / Test | Not implemented (planned downstream) |
| Gate / Deploy | Not implemented (planned downstream) |
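To make that spine concrete, here is a minimal sketch (illustrative only, not xchecker's real types) of the phase order as a Rust enum; downstream phases would attach as later variants without disturbing the upstream ones:

```rust
/// Illustrative sketch only — not xchecker's actual types.
/// The upstream spine, in order; downstream phases (Build, Gate, Deploy)
/// would slot in as later variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Phase {
    Requirements,
    Design,
    Tasks,
    Review,
    Fixup,
    Final,
}

impl Phase {
    /// The phase that follows this one, or None at the end of the pipeline.
    fn next(self) -> Option<Phase> {
        use Phase::*;
        match self {
            Requirements => Some(Design),
            Design => Some(Tasks),
            Tasks => Some(Review),
            Review => Some(Fixup),
            Fixup => Some(Final),
            Final => None,
        }
    }
}
```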
Multi-phase harness with an air gap
Each phase is first-class:
- Its own prompt and deterministic Packet.
- Its own artifacts on disk (`00-requirements.*`, `10-design.*`, `20-tasks.*`, …).
- Its own JCS-canonical JSON receipt.
LLMs never write directly to the repo:
- They propose fixups as structured diffs.
- xchecker keeps them in preview by default.
- Applying changes is explicit, via atomic writes and backups.
The junior drafts. The harness controls edits and persistence.
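As a minimal sketch of that apply-step write discipline (a hypothetical helper, not xchecker's actual code): back up the target, write to a temp file in the same directory, flush to disk, then rename over the original.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Hypothetical sketch of an atomic "apply" write: back up the original,
/// write the new content to a temp file in the same directory (so the
/// final rename stays on one filesystem), flush to disk, then rename.
fn apply_fixup(target: &Path, new_content: &[u8]) -> std::io::Result<()> {
    // Keep a backup of the current contents before touching anything.
    if target.exists() {
        fs::copy(target, target.with_extension("bak"))?;
    }
    let tmp = target.with_extension("tmp");
    {
        let mut f = fs::File::create(&tmp)?;
        f.write_all(new_content)?;
        f.sync_all()?; // ensure bytes hit disk before the rename
    }
    fs::rename(&tmp, target)?; // atomic on POSIX; best-effort on Windows
    Ok(())
}
```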
Deterministic packets instead of fuzzy RAG
Before any LLM call, xchecker builds a Packet: a deterministic, budgeted context window.
- Prioritizes problem → specs → ADRs → docs → code under strict byte/line budgets.
- If required upstream artifacts don’t fit, xchecker fails fast instead of silently dropping them.
This is an implementation of Proactively Augmented Generation:
- Assemble context proactively from a curated spec/ADR/doc graph.
- Use deterministic rules and cheap traditional indexing/search to pre-assemble what the model is likely to need, instead of “see what the retriever finds” at call time.
In practice:
- Fewer blind hops than RAG-heavy flows.
- Higher context density — only top-priority artifacts enter the window.
And behaviour is explainable:
- What did the model see? → inspect the packet.
- Why did behaviour change? → diff packets and receipts, not prompts.
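A sketch of the packet-assembly rule as described above (illustrative types, not the real implementation): walk candidates in priority order, stop at the byte budget, and fail fast if anything required doesn't fit.

```rust
/// Illustrative sketch of deterministic packet assembly — not xchecker's
/// actual code. Candidates arrive pre-sorted by priority
/// (problem → specs → ADRs → docs → code).
struct Candidate {
    path: String,
    bytes: usize,
    required: bool, // required upstream artifacts must fit
}

fn assemble_packet(
    candidates: &[Candidate],
    byte_budget: usize,
) -> Result<Vec<&Candidate>, String> {
    let mut packet = Vec::new();
    let mut used = 0;
    for c in candidates {
        if used + c.bytes <= byte_budget {
            used += c.bytes;
            packet.push(c);
        } else if c.required {
            // Fail fast instead of silently dropping a required artifact.
            return Err(format!(
                "required artifact {} ({} bytes) exceeds remaining budget",
                c.path, c.bytes
            ));
        }
        // Optional artifacts that don't fit are simply skipped.
    }
    Ok(packet)
}
```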
LLM backend layer
An `LlmBackend` abstraction hides provider details. In this Kiroween build, xchecker runs against Anthropic’s Claude Code command-line tool; the backend is structured so other CLIs/APIs (Gemini CLI, OpenRouter, Anthropic’s HTTP API, etc.) can slot in later without rewriting phase logic.
The backend layer:
- Normalizes budgets, timeouts, and error handling.
- Keeps provider quirks at the edges.
- Presents one interface to the rest of the system.
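The real trait isn't shown here, but the shape is roughly this kind of thing (hypothetical signatures; the actual `LlmBackend` may differ):

```rust
use std::time::Duration;

/// Hypothetical sketch of a provider-agnostic backend trait. The point is
/// that phase logic talks to this interface and never to a specific provider.
trait LlmBackend {
    /// Send one assembled packet as the prompt; return the raw completion.
    fn complete(&self, prompt: &str, timeout: Duration) -> Result<String, BackendError>;

    /// Stable identifier ("claude-code", "gemini-cli", …) recorded in receipts.
    fn provider_id(&self) -> &str;
}

/// Normalized error kinds, so callers never match on provider quirks.
#[derive(Debug)]
enum BackendError {
    Timeout,
    BudgetExceeded,
    Provider(String), // provider-specific detail, kept at the edge
}
```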
Receipts, lockfiles, and drift
Every run emits a JCS-canonical JSON receipt with:
- Exit code + error kind.
- Runner mode (native vs WSL).
- LLM provider/model + timing.
- BLAKE3 hashes of artifacts and context.
- Structured packet evidence (files, priorities, sizes).
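Shape-wise, a receipt is roughly this kind of record. The field names below are illustrative guesses, not the real schema, and the sketch assumes the `serde_jcs` crate for RFC 8785 canonicalization:

```rust
use serde::Serialize;

/// Illustrative receipt shape — field names are guesses, not xchecker's schema.
#[derive(Serialize)]
struct Receipt {
    schema_version: String,
    exit_code: i32,
    error_kind: Option<String>,
    runner_mode: String, // "native" | "wsl"
    provider: String,
    model: String,
    duration_ms: u64,
    artifact_hashes: Vec<(String, String)>, // (path, BLAKE3 hex)
}

fn write_receipt(r: &Receipt) -> Result<String, serde_json::Error> {
    // serde_jcs emits JCS (RFC 8785) canonical JSON, so identical inputs
    // always yield byte-identical receipts — diffable and hashable.
    serde_jcs::to_string(r)
}
```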
xchecker also treats the LLM environment as a dependency, similar to Cargo.lock:
- Lockfile pins model, CLI version, and schema version.
- Lock drift detection reports when runtime no longer matches the snapshot.
- In strict modes (`--strict-lock`, CI), drift is a hard error.
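Drift detection itself reduces to comparing the pinned snapshot against the live environment, roughly like this (hypothetical fields, illustrative only):

```rust
/// Hypothetical lockfile snapshot — analogous to Cargo.lock, but pinning
/// the LLM environment rather than crate versions.
struct LlmLock {
    model: String,
    cli_version: String,
    schema_version: String,
}

/// Report every field where the runtime no longer matches the snapshot.
/// In --strict-lock / CI modes, a non-empty result is a hard error.
fn lock_drift(pinned: &LlmLock, runtime: &LlmLock) -> Vec<&'static str> {
    let mut drift = Vec::new();
    if pinned.model != runtime.model { drift.push("model"); }
    if pinned.cli_version != runtime.cli_version { drift.push("cli_version"); }
    if pinned.schema_version != runtime.schema_version { drift.push("schema_version"); }
    drift
}
```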
Packets + receipts + lockfiles give “what changed?” answers at the spec and environment level, not just the code diff level.
Status, gate, and doctor
- `xchecker status --json` emits `status-json.v2` with:
  - `schema_version`, `spec_id`, per-phase status
  - `pending_fixups`, `has_errors`
  - `artifacts[]` (`path`, `blake3_first8`)
  - `effective_config` entries `{ value, source }`
  - optional `lock_drift` when the environment diverges
- `xchecker gate` is a CI-style spec gate (sketched after this list). It fails if:
  - the spec hasn’t reached the required phase,
  - or recent receipts are failing,
  - or there are unapplied fixups,
  - or the environment has drifted beyond the lockfile.
- `xchecker doctor` runs static checks (filesystem behaviour, config shape, runner/LLM wiring) and never calls an LLM, so it’s safe anywhere.
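Conceptually, gate folds those checks into a single pass/fail exit code for CI. A sketch (illustrative only; the real exit codes and state shape may differ):

```rust
/// Illustrative sketch of the gate decision — not the real implementation.
/// CI treats any non-zero exit code as a failed gate.
struct SpecState {
    reached_phase: u8,
    required_phase: u8,
    recent_receipts_ok: bool,
    unapplied_fixups: usize,
    lock_drift: bool,
}

fn gate_exit_code(s: &SpecState) -> i32 {
    if s.reached_phase < s.required_phase { return 1; } // spec not far enough
    if !s.recent_receipts_ok { return 2; }              // failing receipts
    if s.unapplied_fixups > 0 { return 3; }             // pending fixups
    if s.lock_drift { return 4; }                       // environment drifted
    0 // gate passes
}
```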
Docs and tests
Doc validation:
- Parses markdown with `pulldown_cmark`.
- Extracts fenced code/JSON/TOML blocks.
- Executes them against the real binary in isolated homes.
If code and docs drift, tests fail.
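A minimal sketch of that extraction step, assuming pulldown-cmark's 0.9-era event API (the real harness does more, e.g. running each block):

```rust
use pulldown_cmark::{CodeBlockKind, Event, Parser, Tag};

/// Collect (language, body) pairs for every fenced block in a doc, so each
/// can later be executed against the real binary in an isolated home.
fn fenced_blocks(markdown: &str) -> Vec<(String, String)> {
    let mut blocks = Vec::new();
    let mut current: Option<(String, String)> = None;
    for event in Parser::new(markdown) {
        match event {
            Event::Start(Tag::CodeBlock(CodeBlockKind::Fenced(lang))) => {
                current = Some((lang.to_string(), String::new()));
            }
            Event::Text(text) => {
                if let Some((_, body)) = current.as_mut() {
                    body.push_str(&text);
                }
            }
            Event::End(Tag::CodeBlock(_)) => {
                blocks.extend(current.take());
            }
            _ => {}
        }
    }
    blocks
}
```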
The broader suite includes:
- Unit and property-based tests.
- Cross-platform coverage (Linux/macOS/Windows, WSL).
- End-to-end harness flows with mocked LLMs.
- Opt-in real-LLM tests behind env flags.
The goal is a tool you can put in CI for real projects, not just a demo.
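For flavor, a property-based check in that style might pin the packet-budget invariant from the earlier sketch (illustrative, using the proptest crate; not a test from the real suite, and reusing `Candidate` / `assemble_packet` from above):

```rust
use proptest::prelude::*;

proptest! {
    /// Illustrative property: whatever the inputs, an assembled packet
    /// never exceeds its byte budget.
    #[test]
    fn packet_never_exceeds_budget(
        sizes in proptest::collection::vec(0usize..10_000, 0..50),
        budget in 0usize..100_000,
    ) {
        let candidates: Vec<Candidate> = sizes
            .iter()
            .enumerate()
            .map(|(i, &bytes)| Candidate {
                path: format!("doc-{i}.md"),
                bytes,
                required: false, // optional-only, so assembly can't fail
            })
            .collect();
        let packet = assemble_packet(&candidates, budget).unwrap();
        let total: usize = packet.iter().map(|c| c.bytes).sum();
        prop_assert!(total <= budget);
    }
}
```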
How xchecker relates to Spec Kit, Kiro, and DemoSwarm
All of these live in the same space: “how do we get from rough intent → governed change using AI.”
They make different bets about where to sit:
- Spec Kit–style flows
  - Open-source toolkit for spec-driven development: project constitution, spec, tech plan, tasks, and implementation, driven by AI assistants like Copilot, Claude, Gemini, Cursor, etc.
  - Exposed as commands (`/constitution`, `/specify`, `/plan`, `/tasks`, `/implement`) run from your AI surface or terminal; the developer decides when to advance each step and how to sequence the lifecycle.
- Kiro
  - Spec-driven AI IDE: keeps requirements and design alongside code and UX, and offers Vibe (chat-first) and Spec (plan-first) modes.
  - Uses a human-owned markdown task list as the main surface.
  - Selecting a task triggers a single Claude agent flow for that task.
  - Context is a modest bundle (shared docs + local files), not a budgeted per-phase packet; that task agent doesn’t orchestrate its own subagent tree.
  - Kiro also exposes hooks, MCP, and steering docs, but for this project I stayed with core Spec/Vibe flows; hooks and MCP were the wrong tool for this job, and I learned about steering late in the window.
- DemoSwarm (separate project)
  - A 6-flow, ~45-agent swarm I’m open-sourcing at effortlesssteven.com/demoswarm.
  - Encodes the full SDLC in agents: Signal → Specs → Plan → Draft (code/tests) → Gate → Deploy → Wisdom.
  - Runs inside Claude Code as a governed multi-agent flow, with receipts written to disk.
- xchecker
  - Pipeline-first: the upstream lifecycle encoded as a Rust harness.
  - Requirements, Design, Tasks, Review, and Fixup are top-level runs with packets and receipts.
  - Each phase can be backed by richer Claude Code or other CLI/API flows than a single IDE action.
  - The harness owns context assembly, budgets, secret scans, and stable artifacts/receipts/lockfiles.
Short version:
Spec Kit and Kiro set the pattern for spec-first, AI-assisted SDLC in the workspace.
DemoSwarm (separate repo) explores one way to push that pattern into a 6-flow, multi-agent SDLC swarm inside Claude Code.
xchecker explores the same direction from another angle: it takes that spec-and-task loop and grounds it in a deterministic Rust pipeline that can drive those orchestrators as nested phases, with an extra subagent layer and tighter context/receipt handling.
Side-by-side
High-level comparison – where they live and how they run
| Aspect / Layer | Spec Kit | Kiro | DemoSwarm (separate repo) | xchecker |
|---|---|---|---|---|
| Home surface | AI surfaces (Claude Code REPL, Copilot chat, Cursor, etc.) | Kiro IDE | Claude Code REPL + repo | CLI + repo |
| Orchestration trigger | Slash commands (/constitution, /specify, /plan, …) | Click task in markdown task list | Slash commands (/flow-1-signal, …) | CLI phases (spec, resume, status, gate) |
| L0: Orchestrator | Human in AI surface steering Spec Kit commands | Kiro runtime reacting to user-selected tasks | Claude Code flow for each /flow-* | Rust CLI orchestrator running phases (spec, resume, status, gate) |
| L1: Primary AI worker | Assistant thread executing Spec Kit commands + tools | Single Claude flow per selected task | Narrow domain subagents per flow | Provider orchestrator flow per phase (Claude Code today, others later) |
| L2: Extra AI tier | Tools / light subagent usage when running inside Claude Code | None – that task flow doesn’t spawn subagents | None – tools inside each agent; agents don’t call other agents | Subagents under the provider orchestrator for micro-scoped work per phase |
| SDLC span | Spec → Plan → Tasks / implementation | Requirements → Design → Tasks / implementation | Signal → Specs → Plan → Build → Gate → Deploy → Wisdom | Problem → Spec → Tasks (upstream) |
| Receipts / gates | Specs and logs; any gating is something you wire in CI around those files | Specs, tasks, and steering files in the repo; rely on your normal test/CI gates | Per‑flow swarm artifacts under swarm/runs/<run-id>/, plus agentic gates inside the flows (critics, BDD runs, mutation tests) | JCS‑canonical JSON receipts per phase, status --json, and a gate command intended for CI |
Read another way: Spec Kit and DemoSwarm keep the orchestrator inside the assistant, Kiro drives it from a clickable task list in the IDE, and xchecker moves the orchestrator into code so it can wrap those flows, give them deeper subagent trees, and keep the whole upstream SDLC under packets, receipts, and lockfiles.
How I built it
Most of xchecker was built in Kiro’s Spec loop, with ChatGPT providing framing and review. The final stretch used Kiro’s Vibe mode on top of that foundation.
Spec-driven loop
- Problem framing & research (ChatGPT). Used ChatGPT to map the problem: how to harness LLMs over specs/receipts/state; how Anthropic’s harness work and prior RAG patterns fit; what guarantees were actually needed.
- First-pass specs (Kiro). Fed that framing into Kiro and asked for concrete requirements, design docs, and task breakdowns for xchecker as a long-running spec harness.
- Spec review and restructuring (ChatGPT + GitHub). Pulled those specs back into ChatGPT with the GitHub connector, checked them against the real tree and the architecture I wanted, and tightened phase boundaries and artifact layout so they line up with what the Rust could support cleanly.
- Implementation (Kiro). Used the refined specs as the source of truth and let Kiro drive most of the coding work, mainly on Claude Sonnet 4.5 (with Haiku for codebase research and Opus for the last couple of days):
  - `LlmBackend` and provider scaffolding
  - Receipt / status / doctor JSON contracts
  - Doc-validation harness using `pulldown_cmark`
  - WSL and Windows runner integration
  - CI/test plumbing around docs, schemas, and end-to-end flows
My focus was architecture, invariants, and tests. MCP and hooks were intentionally left out here — they didn’t match what we needed, and I learned about Kiro’s steering at the end of Kiroween.
Challenges
- Time vs depth. Late entry meant focusing on the core harness — orchestrator, packets, receipts, lockfiles, doc validation — instead of implementing every provider combination or downstream phase. The design for those exists; the code is intentionally scoped.
- Cross-platform behaviour. Atomic writes, AV quirks on Windows, WSL paths, Job Objects, and line endings all needed explicit tests to keep behaviour aligned across Linux/macOS/Windows.
- Keeping code, docs, and schemas in sync. Once receipts and status are captured in schemas and canonicalized, every meaningful change becomes a four-way update (Rust, schemas, docs, tests). That extra ceremony is the price of letting LLM “juniors” move fast: the harness can check their work mechanically instead of burning senior attention on re-auditing every change by hand.
What this build reinforces
Most of the ideas behind xchecker — specs, harnesses, “LLMs as juniors” — predate this repo. This build just made some of the trade-offs very concrete:
- Specs don’t stand alone. Once receipts and status are schema’d, any meaningful change touches spec, code, docs, and tests together. Without doc + schema validation, drift is inevitable.
- Cross-platform needs proof, not hope. Atomic writes, Windows AV quirks, WSL path translation, and Job Objects only behaved the way they should once they had dedicated tests pinning them down.
- Fixup engines are where the sharp edges live. Fuzzy patching, path validation, and preview/apply semantics in `fixup.rs` are the parts most likely to go wrong. Treating fixups as first-class (with receipts and clear error modes) feels mandatory if LLMs are going to propose diffs.
- Orchestration and observability matter more than raw codegen. Packets, receipts, lockfiles, and gates turned out to be more valuable than any single prompt or model tweak. The hard part is making the system explainable, not making it type.