LLM-based coding agents—systems that autonomously write, test, and deploy software—have become increasingly capable, but their reliability remains uneven. In our experience maintaining a multi-platform agent system, we have observed three recurring failure patterns:
This paper describes a set of techniques we have found useful for mitigating these failures. The approach uses a phase-gated state machine with named unknowns and mandatory witnessed execution. Rather than relying on the model to self-regulate, constraints are enforced externally through hooks that operate below the agent’s decision layer.
The paper is organized as follows: §2 describes the state machine model; §3 introduces the discipline for handling unknowns; §4–5 cover work tracking and cross-session memory; §6 describes the implementation; §7–9 discuss constraints, verification, and limitations; §10 documents advances made in the three weeks following initial publication.
The framework defines five phases with forward-only progression. PLAN discovers all unknowns (exits when a full pass adds zero new unknowns). EXECUTE resolves unknowns via witnessed execution (exits when all mutables are KNOWN). EMIT writes files to disk (exits when post-emit matches pre-emit exactly). VERIFY performs end-to-end witnessed validation (exits when .prd is empty and git is clean and pushed). COMPLETE updates documentation and closes the session.
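The phase progression above can be sketched as a small transition function over conjunctive exit predicates. This is an illustrative reconstruction, not the gm implementation; the state keys (`new_unknowns_this_pass`, `mutables`, and so on) are assumed names standing in for the session state the paper describes.

```python
from enum import Enum, auto

class Phase(Enum):
    PLAN = auto()
    EXECUTE = auto()
    EMIT = auto()
    VERIFY = auto()
    COMPLETE = auto()

# Exit conditions are conjunctive predicates over session state;
# a phase advances only when its predicate holds in full.
EXIT_CONDITIONS = {
    Phase.PLAN:     lambda s: s["new_unknowns_this_pass"] == 0,
    Phase.EXECUTE:  lambda s: all(m["value"] != "UNKNOWN" for m in s["mutables"]),
    Phase.EMIT:     lambda s: s["post_emit_output"] == s["pre_emit_output"],
    Phase.VERIFY:   lambda s: not s["prd_items"] and s["git_clean"] and s["git_pushed"],
    Phase.COMPLETE: lambda s: True,
}

ORDER = list(Phase)

def next_phase(current: Phase, state: dict) -> Phase:
    """Forward-only advancement: move one phase ahead only when the
    current phase's exit condition is satisfied; otherwise stay put."""
    if EXIT_CONDITIONS[current](state) and current is not Phase.COMPLETE:
        return ORDER[ORDER.index(current) + 1]
    return current
```

Backward ("snake") transitions, triggered by newly discovered unknowns, would bypass this function and reset `current` to PLAN directly.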
Each phase is implemented as a discrete skill—a self-contained prompt with explicit tool permissions, enforcement level, and transition rules. Skills are invoked through a dedicated dispatch mechanism, never through general-purpose agent spawning, ensuring that phase identity is preserved across invocations.
A phase advances only when its exit condition is satisfied. Conditions are conjunctive: every sub-condition must hold simultaneously. This prevents partial advancement where, for example, code is written before all unknowns are resolved.
Any newly discovered unknown at any phase triggers an immediate backward transition:
We refer to this informally as a “snakes and ladders” model: forward progress requires satisfying conditions (ladders), while encountering problems sends the agent back to an earlier phase (snakes). The intent is to prevent agents from patching around problems rather than reconsidering their assumptions.
A key element of our approach is the named mutable: every unknown fact needed to make a decision or write code is explicitly named before action is taken. A mutable has a name, an expected value, a current value (initially UNKNOWN), and a resolution method. The agent is instructed not to proceed past an unresolved mutable.
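A named mutable can be modeled as a small record whose value is assignable only from witnessed execution. The sketch below is illustrative (the field and method names are assumptions); it also encodes the two-pass rule from §3 under which a repeatedly failing resolution snakes back to PLAN.

```python
from dataclasses import dataclass
from typing import Any

UNKNOWN = object()  # sentinel: no witnessed value yet

@dataclass
class Mutable:
    name: str               # e.g. "runner_port"
    expected: Any           # what the agent believes the value will be
    resolution_method: str  # how it will be witnessed (command, query, ...)
    current: Any = UNKNOWN  # assigned only from real execution output
    passes: int = 0         # failed execution passes so far

    @property
    def resolved(self) -> bool:
        return self.current is not UNKNOWN

    def resolve_from_execution(self, observed: Any) -> None:
        """Only witnessed output may assign a value; guesses never call this."""
        self.current = observed

    def failed_pass(self) -> bool:
        """Record a failed execution pass; True means the mutable must be
        reclassified as a new unknown and the agent snakes back to PLAN."""
        self.passes += 1
        return self.passes >= 2
```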
The intent is to make uncertainty visible rather than allowing the agent to generate plausible-but-unverified code.
In our framework, a mutable is considered resolved only when its value is assigned by the output of real execution—running actual code against actual services with actual data. The following do not constitute resolution:
The rule is straightforward: only witnessed output counts as ground truth. Once a value has been observed through execution, it is treated as known. Before that, it remains unknown regardless of how plausible a guess might be.
If a mutable remains unresolved after two execution passes, it is reclassified as a new unknown and the agent snakes back to PLAN. This prevents infinite loops where the agent repeatedly attempts the same failing resolution strategy.
When execution produces unexpected output, the agent is instructed not to silently absorb the surprise. Instead, every discrepancy between expected and actual output generates a new named mutable, triggering a return to PLAN. The goal is to prevent workarounds that accumulate technical debt.
The system tracks work through a .prd file (Product Requirements Document), a JSON array where each item carries an id, imperative subject, status (pending → in_progress → completed), acceptance criteria, edge cases, bidirectional dependency links, and effort estimates (small: <15 min, medium: <45 min, large: >1 hour).
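A .prd item might look like the following. The field set (id, subject, status, acceptance criteria, edge cases, dependency links, effort) is taken from the description above, but the exact key spellings and the example values are illustrative assumptions.

```json
[
  {
    "id": "prd-014",
    "subject": "Add session-ownership check to getTask",
    "status": "pending",
    "acceptance_criteria": ["requests from a foreign session are rejected"],
    "edge_cases": ["session restarted mid-task"],
    "depends_on": ["prd-012"],
    "blocks": ["prd-017"],
    "effort": "small"
  }
]
```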
The .prd’s dependency graph enables parallel execution. Each wave identifies all pending items with satisfied dependencies, launches up to 3 concurrent subagents (one per item), and upon completion checks newly unblocked items for the next wave. This exploits the inherent parallelism in software tasks.
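Wave construction reduces to repeatedly selecting pending items whose dependencies are complete, capped at the concurrency limit. A minimal sketch, assuming the item fields named above and treating subagent completion as immediate:

```python
def ready_items(items):
    """Pending items whose dependencies are all completed."""
    done = {i["id"] for i in items if i["status"] == "completed"}
    return [i for i in items
            if i["status"] == "pending" and set(i.get("depends_on", [])) <= done]

def schedule_waves(items, max_concurrency=3):
    """Yield successive waves of at most `max_concurrency` item ids.
    For illustration, each scheduled item is marked completed before the
    next wave; in the real system a subagent runs it to completion."""
    while True:
        wave = ready_items(items)[:max_concurrency]
        if not wave:
            break
        yield [i["id"] for i in wave]
        for i in wave:
            i["status"] = "completed"
```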
A stop hook inspects the .prd at session end. If any items remain, the session cannot close. A second stop hook checks for uncommitted or unpushed git changes. Together, these act as barriers against premature completion.
The system implements a file-based persistent memory system with four semantic types:
Each memory is stored as a Markdown file with YAML frontmatter, indexed in MEMORY.md. The index loads at conversation start; individual memories are read on demand.
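A stored memory might look like the fragment below. The frontmatter keys are illustrative assumptions; only the `feedback` type and the Markdown-plus-frontmatter layout come from the paper.

```markdown
---
type: feedback
created: 2026-04-02
context: rs-exec runner port handling
---
Rule: never select a random port for the runner; use the fixed port.
Reason: task IDs are keyed to the port and are lost on restart.
```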
The system avoids storing information that can be derived from current project state: code patterns, architecture, file paths, git history, and debugging solutions. In our experience, memory staleness—acting on outdated cached information—is among the more damaging failure modes.
When a memory names a specific file, function, or flag, the agent must verify its continued existence before recommending action. A memory is a claim about a point in time, not current reality.
The feedback memory type creates a persistent behavioral correction channel. When a user corrects the agent or confirms a non-obvious approach, the agent records the rule, reason, and application context—addressing the “Groundhog Day” problem where agents make the same mistakes across sessions.
Each phase is implemented as a skill with YAML frontmatter specifying name, description, enforcement level, and allowed tools. The skill chain is: gm → planning → gm-execute → gm-emit → gm-complete → update-docs.
Skills are invoked through a dedicated Skill tool ensuring phase identity is explicit, tool permissions are scoped per phase, and transition logic is localized.
Four hooks enforce the state machine at the platform level:
- pre-tool-use: routes exec:<lang> execution, enforces standards

Hooks operate below the agent’s decision layer—they cannot be bypassed by agent reasoning. This is a deliberate design choice: the agent should not be able to “decide” to skip planning or write code during verification.
All code execution passes through exec:<lang> dispatch. The pre-tool-use hook intercepts shell calls, detects the prefix, routes to the appropriate runtime, and manages background tasks. Supported languages include Node.js, Python, Go, Rust, TypeScript, and extensible language plugins.
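The prefix-routing step can be sketched as a lookup from the `exec:<lang>` prefix to a runtime invocation. This is a simplified illustration; the runtime command lines and the registry shape are assumptions, not the rs-exec implementation.

```python
# Illustrative runtime registry; real dispatch also manages background tasks.
RUNTIMES = {
    "node": ["node", "-e"],
    "python": ["python3", "-c"],
    # go, rust, typescript, and lang/ plugins would register here
}

def route(command: str):
    """Split an 'exec:<lang> <code>' invocation into a runtime argv,
    or return None when the command is not an exec dispatch."""
    if not command.startswith("exec:"):
        return None  # ordinary shell call, not intercepted
    head, _, code = command.partition(" ")
    lang = head[len("exec:"):]
    runtime = RUNTIMES.get(lang)
    if runtime is None:
        raise ValueError(f"no runtime registered for {lang!r}")
    return runtime + [code]
```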
The gm orchestrator deploys across 11 platforms via plugforge, generating platform-specific implementations from a single convention-driven source: 7 CLI platforms (Claude Code, Gemini CLI, OpenCode, Kilo, Codex, Copilot CLI, Qwen Code) and 4 IDE platforms (VS Code, Cursor, Zed, JetBrains).
The framework defines four tiers of constraints with decreasing severity:
The EMIT phase implements two-stage verification. Pre-emit: the agent imports the target module, runs proposed logic in-memory with real and error inputs, records expected outputs, and verifies all gate conditions hold simultaneously. Only then does it write to disk.
Post-emit: the agent re-imports the actual file from disk (not in-memory), runs identical inputs, and compares output exactly. Known variance triggers fix-and-reverify; unknown variance triggers a snake to PLAN. This catches errors that test-after-write approaches miss: serialization bugs, import resolution differences, and file system race conditions.
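The post-emit half of the check can be sketched as a fresh import of the on-disk file followed by an exact output comparison against the pre-emit record. A minimal sketch under assumed names; the real gate conditions are richer than a list equality.

```python
import importlib.util

def run_cases(fn, cases):
    """Apply the function to each argument tuple, preserving order."""
    return [fn(*args) for args in cases]

def post_emit_check(path, fn_name, cases, expected):
    """Re-import the file actually written to disk (not the in-memory
    version) and compare outputs exactly against the pre-emit record.
    Catches serialization and import-resolution differences that
    in-memory verification alone would miss."""
    spec = importlib.util.spec_from_file_location("emitted", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    actual = run_cases(getattr(mod, fn_name), cases)
    return actual == expected
```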
The approach shares surface similarities with model checking (named state variables, explicit transitions) and design-by-contract (pre/post-conditions on advancement). However, it operates at a much coarser level of abstraction—the “states” are task phases, not program states, and the “contracts” concern what the agent knows rather than program correctness.
There is a loose analogy to TDD: both emphasize verification before implementation. The difference is that our verifications are not persisted as test files—they exist only as witnessed execution during the EXECUTE phase. This trades the long-term regression safety of a test suite for reduced context overhead during agentic sessions. Whether this tradeoff is worthwhile likely depends on the project and use case.
The snake-back mechanism introduces overhead: an agent discovering a new unknown during EMIT must re-plan and re-execute, potentially discarding partial work. In our experience, this cost is generally offset by reduced late-stage rework, though we have not conducted controlled experiments to quantify the tradeoff.
This section documents major advances made between the paper’s initial publication (March 25, 2026) and April 12, 2026. Entries are dated to the commit that introduced the capability.
An exec:browser runtime was added to rs-exec, replacing ad-hoc browser invocations with a deterministic managed-Chrome pipeline. The runtime launches a portable Chromium instance with a fixed debugging port (32882), detects an existing session before spawning a new one, and cleans up all browser state on session end. Early versions suffered from file-lock races on Windows when two agents shared a Chrome profile; this was resolved by per-agent cwd-local profiles (April 12).
Session ownership is enforced at the RPC layer: getTask, getAndClearOutput, and waitForOutput all reject requests from sessions that did not create the task. Cross-session contamination on Windows—where a race in port-to-session mapping could return another agent’s browser—was closed on April 9.
exec:rust was added (March 31) supporting inline // cargo-dep: and // cargo-path: annotations so Rust snippets can declare their own dependencies without a manifest file. exec:serial was added (April 6) for COM-port streaming, routing through the global npm root for the serialport native module.
On runner restart, the in-memory active-PID map is lost, leaving --exec-process-mode child processes with no cleanup path. A startup reaper was added to rs-exec that uses sysinfo to find exec-process-mode processes whose parent is not a live runner-mode process (with a 5-second age guard) and kills them via recursive process-tree kill. Without this, orphans accumulated across restarts, reaching 300+ processes and ~5.9 GB RAM in observed failure cases.
The runner previously selected a random port on each start. This caused task-ID loss across restarts because the plugkit client could not reconnect to the same port. The port was fixed at 32882, making task IDs stable across runner restarts and enabling reliable exec:sleep / exec:status / exec:close workflows.
rs-search gained a vector embedding backend using nomic-embed-text-v1.5.Q4_K_M (a GGUF-quantized model, split into 6 parts for distribution). Search output is now structured as four sections: BM25 symbol hits, BM25 content hits, vector similarity hits, and git commit index hits. The combination gives the agent semantic recall for concepts that BM25 misses due to vocabulary mismatch, while retaining BM25’s precision for exact identifiers.
The git commit index was added simultaneously, enabling search queries to surface relevant commit messages alongside code results—surfacing the “why” of a change alongside the “what.”
Background tasks are now isolated by a compound key of session_id + cwd. This prevents tasks from one agent session from appearing in another agent’s task list when both operate in the same working directory. The compound key is forwarded by the pre-tool-use hook on every exec invocation.
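The compound key amounts to indexing the task registry by the pair rather than by either field alone. A minimal sketch, with assumed names:

```python
# Tasks keyed by (session_id, cwd): two agents sharing a working
# directory, or one agent across two sessions, never see each
# other's background tasks.
tasks: dict[tuple[str, str], list[str]] = {}

def register(session_id: str, cwd: str, task_id: str) -> None:
    tasks.setdefault((session_id, cwd), []).append(task_id)

def visible_tasks(session_id: str, cwd: str) -> list[str]:
    return tasks.get((session_id, cwd), [])
```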
An exec:kill-port command was added to reclaim ports held by stale processes. A start-lock mechanism was added to prevent multiple simultaneous plugkit runner instances—a race that previously caused split task state between two competing daemons.
Inline edits to CLAUDE.md were replaced by a background memorize sub-agent (Haiku model, run_in_background=true). The sub-agent receives context, extracts non-obvious caveats, deduplicates against existing entries, and writes. This separates the agent’s task stream from memory management and prevents mid-task context pollution from large file reads.
The planning skill was extended with a mandatory observability enumeration pass: on every planning iteration, the agent must enumerate every subsystem that lacks a permanent queryable inspection endpoint and add a .prd item to address each gap. The mandate distinguishes permanent structures (named, addressable, queryable without restart) from ad-hoc logs (console.log). On the client side, window.__debug must be a live structured registry—every component’s state, every active request, every WebSocket connection, addressable by key. New modules register on mount and deregister on unmount.
A four-step resolution order was codified across all phases: (1) native first—does the language or runtime already do this?; (2) library second—does an existing dependency solve this pattern?; (3) structure third—can the problem be encoded as data (map, table, pipeline) so the structure enforces correctness?; (4) write last—only author new logic when the above three are exhausted. When a structural substitution eliminates a class of wrong states, the mandate treats it as a correctness property, not a style preference: dispatch tables replace switch chains, pipelines replace loop-with-accumulator, and maps replace if/else forests.
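The dispatch-table substitution can be illustrated concretely: the set of handled cases becomes a data structure, so an unhandled case is a lookup failure rather than a silently skipped branch. The event kinds and handlers below are invented for illustration.

```python
def on_created(evt): return f"created {evt}"
def on_updated(evt): return f"updated {evt}"
def on_deleted(evt): return f"deleted {evt}"

# The table IS the specification of handled cases; adding a kind
# means adding an entry, and a missing kind fails loudly.
HANDLERS = {
    "created": on_created,
    "updated": on_updated,
    "deleted": on_deleted,
}

def handle(kind, evt):
    try:
        return HANDLERS[kind](evt)
    except KeyError:
        raise ValueError(f"unhandled event kind: {kind!r}") from None
```

Compared with an if/else chain, the table makes exhaustiveness checkable by inspecting one object, which is what elevates the pattern from style to correctness.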
The “snakes and ladders” metaphor (Section 2.3) was replaced throughout the skill chain with explicit immutable state machine language: forward transitions are named by their condition, backward transitions are named by the trigger that caused them. This eliminates ambiguity about which phase boundary an agent is at and what condition must be satisfied to leave it.
A fully automated cascade pipeline was established linking all five Rust repositories. A push to any of rs-exec, rs-codeinsight, rs-search, or rs-plugkit triggers a workflow_dispatch to rs-plugkit’s release workflow, which runs cargo update (picking up the latest git dep hashes), builds six platform binaries (linux/mac/win × x64/arm64), auto-bumps the rs-plugkit patch version, commits the binaries into plugforge, and triggers the plugforge publish workflow. The plugforge publish workflow builds and pushes to the downstream gm-cc repository, whose HEAD hash is what the /plugin command tracks to deliver updates to agents. No manual version bumps or local cargo builds are required at any step.
An eleventh platform, gm-qwen, was added for Qwen Code (a Claude Code fork with a Qwen-specific extension manifest). Simultaneously, a lang/ plugin system was introduced: language plugins in a local project’s lang/ directory are resolved before global package plugins, allowing per-project runtime extensions without modifying the package. The lang/browser.js plugin routes exec:browser through plugkit in the kilo/oc platforms.
The gm-complete skill was extended with a CI enforcement gate: the COMPLETE phase cannot be declared until CI is confirmed green on the pushed commit. This closes a gap where a session could terminate with all .prd items resolved and git clean but a failing build on the remote.
ReAct [1] interleaves reasoning and action but lacks formal phase boundaries or named unknowns. Reflexion [2] adds self-reflection but does not enforce witnessed execution as the sole resolution mechanism. AutoGPT and BabyAGI implement task decomposition but without dependency-aware parallel execution or stop-hook enforcement. SWE-Agent [3] provides a coding agent interface but delegates lifecycle control to the LLM’s judgment rather than enforcing it through external hooks.
Our approach is largely orthogonal to these efforts: it provides an external behavioral envelope that constrains agent lifecycle through hooks, stop barriers, and tool permissions, rather than improving the agent’s internal reasoning. In principle, these techniques could be combined.
We have described a set of techniques for improving the reliability of LLM coding agents: a phase-gated state machine, a discipline for naming and resolving unknowns, and a persistent memory system for cross-session coherence. These techniques are implemented in the gm orchestrator and deployed across 11 platforms.
Our experience suggests that agent reliability depends substantially on lifecycle control—not just model capability. An agent operating within explicit phase boundaries and external enforcement mechanisms tends to produce more predictable results than one relying on self-regulation alone. We offer these techniques as practical tools rather than general claims, and expect that further work will clarify their scope and limitations.