Toward Structured Lifecycle Control
in LLM Coding Agents

AnEntrypoint
March 2026
Abstract
LLM-based coding agents frequently exhibit three failure modes: silent assumption propagation, uncontrolled state drift, and memory loss across sessions. We describe a set of practical techniques for addressing these failures through external lifecycle enforcement rather than prompt-based self-discipline. The approach centers on a phase-gated state machine—PLAN → EXECUTE → EMIT → VERIFY → COMPLETE—where forward progress requires satisfying explicit conditions and newly discovered unknowns trigger backward transitions to earlier phases. We complement this with a discipline for naming unknowns before acting on them, requiring witnessed execution to resolve them, and a file-based memory system for cross-session coherence. We describe our implementation, gm, which has been deployed across 11 IDE and CLI platforms. An addendum (Section 10) documents major system advances made in the three weeks following initial publication, including a deterministic browser runtime, vector embedding search, session-scoped task isolation, an automated cascade build pipeline, and a mandatory observability enumeration discipline.

1. Introduction

LLM-based coding agents—systems that autonomously write, test, and deploy software—have become increasingly capable, but their reliability remains uneven. In our experience maintaining a multi-platform agent system, we have observed three recurring failure patterns:

  1. Assumption propagation: Agents narrate hypotheses about code behavior without verifying them, leading to cascading errors when assumptions prove false.
  2. State amnesia: Multi-step tasks lose coherence as context windows fill, with agents forgetting earlier decisions or re-solving solved subproblems.
  3. Uncontrolled lifecycle: Without explicit phase boundaries, agents freely interleave planning, coding, and verification, making it impossible to reason about task progress or enforce quality gates.

This paper describes a set of techniques we have found useful for mitigating these failures. The approach uses a phase-gated state machine with named unknowns and mandatory witnessed execution. Rather than relying on the model to self-regulate, constraints are enforced externally through hooks that operate below the agent’s decision layer.

The paper is organized as follows: §2 describes the state machine model; §3 introduces the discipline for handling unknowns; §4–5 cover work tracking and cross-session memory; §6 describes the implementation; §7–9 discuss constraints, verification, and limitations; §10 documents advances made in the three weeks following initial publication.

2. The State Machine Model

2.1 Phases

The framework defines five phases with forward-only progression. PLAN discovers all unknowns (exits when a full pass adds zero new unknowns). EXECUTE resolves unknowns via witnessed execution (exits when all mutables are KNOWN). EMIT writes files to disk (exits when post-emit matches pre-emit exactly). VERIFY performs end-to-end witnessed validation (exits when .prd is empty and git is clean and pushed). COMPLETE updates documentation and closes the session.
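The phase gating described above can be sketched as a table of conjunctive exit predicates. This is an illustrative model, not the gm implementation; the state fields and predicate names are assumptions chosen to mirror the conditions listed in the text.

```python
# Illustrative five-phase machine with conjunctive exit conditions.
# State fields are hypothetical stand-ins for the checks described in §2.1.
PHASES = ["PLAN", "EXECUTE", "EMIT", "VERIFY", "COMPLETE"]

def exit_conditions(state):
    """One list of sub-conditions per phase; ALL must hold to advance."""
    return {
        "PLAN":     [state["new_unknowns_this_pass"] == 0],
        "EXECUTE":  [all(m["value"] != "UNKNOWN" for m in state["mutables"])],
        "EMIT":     [state["post_emit_output"] == state["pre_emit_output"]],
        "VERIFY":   [not state["prd_items"], state["git_clean"], state["git_pushed"]],
        "COMPLETE": [state["docs_updated"]],
    }

def next_phase(current, state):
    # Advance only when every sub-condition of the current phase holds.
    if not all(exit_conditions(state)[current]):
        return current
    i = PHASES.index(current)
    return PHASES[min(i + 1, len(PHASES) - 1)]
```

The conjunctive form makes partial advancement impossible by construction: a phase with any failing sub-condition simply returns itself.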

Each phase is implemented as a discrete skill—a self-contained prompt with explicit tool permissions, enforcement level, and transition rules. Skills are invoked through a dedicated dispatch mechanism, never through general-purpose agent spawning, ensuring that phase identity is preserved across invocations.

2.2 Forward Transitions (Ladders)

A phase advances only when its exit condition is satisfied. Conditions are conjunctive: every sub-condition must hold simultaneously. This prevents partial advancement where, for example, code is written before all unknowns are resolved.

2.3 Backward Transitions (Snakes)

Any newly discovered unknown at any phase triggers an immediate backward transition to PLAN, where the unknown is named and incorporated into the plan before work resumes.

We refer to this informally as a “snakes and ladders” model: forward progress requires satisfying conditions (ladders), while encountering problems sends the agent back to an earlier phase (snakes). The intent is to prevent agents from patching around problems rather than reconsidering their assumptions.

Observation (Termination). If the set of discoverable unknowns is finite and each planning pass reduces the set of undiscovered unknowns, the process terminates. In our usage, we typically observe convergence within 2–4 planning passes.
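The termination observation can be made concrete with a toy discovery loop, assuming each unknown reveals a finite set of further unknowns. The `references` map is a hypothetical stand-in for whatever discovery mechanism the planner uses.

```python
# Toy model of PLAN convergence: each pass discovers unknowns referenced by
# already-discovered ones; PLAN exits when a full pass adds zero new unknowns.
def plan_until_converged(seed, references):
    """references maps each unknown to the further unknowns it reveals."""
    discovered = set(seed)
    passes = 0
    while True:
        passes += 1
        new = {r for u in discovered for r in references.get(u, ())} - discovered
        if not new:          # full pass added nothing new: exit condition met
            return discovered, passes
        discovered |= new
```

Because `discovered` grows monotonically within a finite universe, the loop must terminate; the 2–4 passes observed in practice correspond to shallow reference chains.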

3. The Mutable Discipline

3.1 Named Unknowns

A key element of our approach is the named mutable: every unknown fact needed to make a decision or write code is explicitly named before action is taken. A mutable has a name, an expected value, a current value (initially UNKNOWN), and a resolution method. The agent is instructed not to proceed past an unresolved mutable.

The intent is to make uncertainty visible rather than allowing the agent to generate plausible-but-unverified code.
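A minimal sketch of the mutable record, assuming the four fields named above; the class shape and method names are illustrative, not the gm data model.

```python
from dataclasses import dataclass

UNKNOWN = object()  # sentinel: no witnessed value has been observed yet

@dataclass
class Mutable:
    """A named unknown: name, expected value, current value, resolution method."""
    name: str
    expected: object
    resolution_method: str          # e.g. "run the health-check script"
    current: object = UNKNOWN
    attempts: int = 0

    @property
    def known(self):
        return self.current is not UNKNOWN

    def resolve(self, witnessed_value):
        # Only a value observed from real execution may be assigned here.
        self.attempts += 1
        self.current = witnessed_value
```

The sentinel makes the "do not proceed past an unresolved mutable" rule checkable: any gate can test `m.known` rather than trusting narration.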

3.2 Resolution by Witnessed Execution

In our framework, a mutable is considered resolved only when its value is assigned by the output of real execution—running actual code against actual services with actual data. Inference, recall, or reading alone does not constitute resolution.

The rule is straightforward: only witnessed output counts as ground truth. Once a value has been observed through execution, it is treated as known. Before that, it remains unknown regardless of how plausible a guess might be.

3.3 The Two-Pass Rule

If a mutable remains unresolved after two execution passes, it is reclassified as a new unknown and the agent snakes back to PLAN. This prevents infinite loops where the agent repeatedly attempts the same failing resolution strategy.
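The two-pass rule can be expressed as a bounded resolution loop. This is a hedged sketch: the `execute` callback and return shape are assumptions standing in for witnessed execution.

```python
# Sketch of the two-pass rule: after two failed resolution passes, the
# mutable is reclassified as a new unknown and the agent snakes back to PLAN.
MAX_PASSES = 2

def attempt_resolution(name, execute):
    """execute() runs the resolution method; returns a witnessed value or None."""
    for _ in range(MAX_PASSES):
        value = execute()
        if value is not None:
            return ("KNOWN", value)
    return ("SNAKE_TO_PLAN", name)   # reclassified; re-planning required
```

Bounding the loop is what breaks the retry cycle: the same failing strategy is never attempted a third time without a fresh plan.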

3.4 Surprise Absorption Prohibition

When execution produces unexpected output, the agent is instructed not to silently absorb the surprise. Instead, every discrepancy between expected and actual output generates a new named mutable, triggering a return to PLAN. The goal is to prevent workarounds that accumulate technical debt.

4. Work Item Management: The .prd

4.1 Structure

The system tracks work through a .prd file (Product Requirements Document), a JSON array where each item carries an id, imperative subject, status (pending → in_progress → completed), acceptance criteria, edge cases, bidirectional dependency links, and effort estimates (small: <15 min, medium: <45 min, large: >1 hour).
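A .prd item under this schema might look as follows. The exact field names are illustrative—assumed from the fields listed above, not taken from the gm source.

```json
[
  {
    "id": "prd-007",
    "subject": "Add retry logic to the upload client",
    "status": "pending",
    "acceptance_criteria": ["retries three times with backoff", "surfaces final error"],
    "edge_cases": ["network drop mid-upload", "zero-byte file"],
    "depends_on": ["prd-004"],
    "blocks": ["prd-009"],
    "effort": "small"
  }
]
```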

4.2 Wave-Based Parallel Execution

The .prd’s dependency graph enables parallel execution. Each wave identifies all pending items with satisfied dependencies, launches up to 3 concurrent subagents (one per item), and upon completion checks newly unblocked items for the next wave. This exploits the inherent parallelism in software tasks.
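Wave selection reduces to a readiness filter over the dependency graph. A minimal sketch, assuming the hypothetical item shape used above and the stated cap of three concurrent subagents:

```python
# Sketch of wave-based scheduling over the .prd dependency graph.
MAX_CONCURRENT = 3

def next_wave(items):
    """Return up to MAX_CONCURRENT pending items whose dependencies are all completed."""
    done = {i["id"] for i in items if i["status"] == "completed"}
    return [i for i in items
            if i["status"] == "pending"
            and all(d in done for d in i.get("depends_on", []))][:MAX_CONCURRENT]
```

After each wave completes, the same filter is re-run: items unblocked by the wave's completions become eligible for the next one.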

4.3 Stop-Hook Enforcement

A stop hook inspects the .prd at session end. If any items remain, the session cannot close. A second stop hook checks for uncommitted or unpushed git changes. Together, these act as barriers against premature completion.
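The two barriers compose into a single close-gate predicate. A minimal sketch with assumed input shapes; the real hooks inspect the .prd file and git state directly.

```python
# Sketch of the stop-hook barriers: open .prd items or dirty/unpushed git
# state both block session close.
def can_close_session(prd_items, git_status):
    if any(i["status"] != "completed" for i in prd_items):
        return False, "open .prd items remain"
    if git_status["dirty"] or git_status["unpushed"]:
        return False, "uncommitted or unpushed changes"
    return True, "clean"
```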

5. Cross-Session Memory Architecture

5.1 Memory Types

The system implements a file-based persistent memory system with four semantic types:

user:
Profile, role, expertise, preferences—enables tailored collaboration
feedback:
Behavioral corrections and confirmations—persistent learning channel
project:
Ongoing work context, deadlines, decisions—situational awareness
reference:
Pointers to external systems—navigational knowledge

Each memory is stored as a Markdown file with YAML frontmatter, indexed in MEMORY.md. The index loads at conversation start; individual memories are read on demand.
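A memory file of the `feedback` type might look as follows; the frontmatter keys are illustrative assumptions, since the text specifies only the YAML-frontmatter-plus-Markdown format.

```markdown
---
type: feedback
created: 2026-03-20
---
Rule: run integration tests against the staging database, not local fixtures.
Reason: local fixtures previously diverged from the production schema.
Applies: any task touching the orders service.
```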

5.2 Exclusion Principle

The system avoids storing information that can be derived from current project state: code patterns, architecture, file paths, git history, and debugging solutions. In our experience, memory staleness—acting on outdated cached information—is among the more damaging failure modes.

5.3 Verification Before Recall

When a memory names a specific file, function, or flag, the agent must verify its continued existence before recommending action. A memory is a claim about a point in time, not current reality.

5.4 Feedback Loop

The feedback memory type creates a persistent behavioral correction channel. When a user corrects the agent or confirms a non-obvious approach, the agent records the rule, reason, and application context—addressing the “Groundhog Day” problem where agents make the same mistakes across sessions.

6. Implementation: The gm Orchestrator

6.1 Skill-Based Architecture

Each phase is implemented as a skill with YAML frontmatter specifying name, description, enforcement level, and allowed tools. The skill chain is: gm → planning → gm-execute → gm-emit → gm-complete → update-docs.

Skills are invoked through a dedicated Skill tool ensuring phase identity is explicit, tool permissions are scoped per phase, and transition logic is localized.
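A skill's frontmatter under this scheme might look as follows. Field names and values are illustrative, inferred from the four properties listed above rather than copied from a gm skill file.

```yaml
---
name: gm-execute
description: Resolve all named mutables via witnessed execution
enforcement: strict
allowed-tools: [Read, Bash, Skill]
---
```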

6.2 Hook Lifecycle

Four hooks enforce the state machine at the platform level:

session-start:
Tool installation, agent context injection, codebase analysis
prompt-submit:
Routes all user messages through the gm skill chain
pre-tool-use:
Blocks forbidden tools, dispatches exec:<lang> execution, enforces standards
stop:
Blocks session end if .prd has remaining items or git is dirty

Hooks operate below the agent’s decision layer—they cannot be bypassed by agent reasoning. This is a deliberate design choice: the agent should not be able to “decide” to skip planning or write code during verification.

6.3 Unified Code Execution

All code execution passes through exec:<lang> dispatch. The pre-tool-use hook intercepts shell calls, detects the prefix, routes to the appropriate runtime, and manages background tasks. Supported languages include Node.js, Python, Go, Rust, TypeScript, and extensible language plugins.
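The prefix routing step can be sketched as follows; the runtime table entries are illustrative assumptions, and the real hook additionally manages background tasks.

```python
# Sketch of exec:<lang> prefix routing as performed by a pre-tool-use hook.
RUNTIMES = {"node": "node", "python": "python3", "go": "go run", "rust": "cargo"}

def route(command):
    """Return (runtime, payload) for an exec:<lang> command, else None."""
    if not command.startswith("exec:"):
        return None                      # not an exec dispatch; pass through
    lang, _, payload = command[len("exec:"):].partition(" ")
    runtime = RUNTIMES.get(lang)
    return (runtime, payload) if runtime else None
```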

6.4 Cross-Platform Deployment

The gm orchestrator deploys across 11 platforms via plugforge, generating platform-specific implementations from a single convention-driven source: 7 CLI platforms (Claude Code, Gemini CLI, OpenCode, Kilo, Codex, Copilot CLI, Qwen Code) and 4 IDE platforms (VS Code, Cursor, Zed, JetBrains).

7. Constraint Tiers

The framework defines four tiers of constraints with decreasing severity:

Tier 0 (Absolute):
no_crash, no_exit, ground_truth_only, real_execution—enforced by hooks, cannot be overridden
Tier 1 (Quality):
max_file_lines=200, hot_reloadable, checkpoint_state
Tier 2 (Modularity):
no_duplication, no_hardcoded_values
Tier 3 (Style):
no_comments, convention_over_code

8. Pre-Emit Verification Protocol

The EMIT phase implements two-stage verification. Pre-emit: the agent imports the target module, runs proposed logic in-memory with real and error inputs, records expected outputs, and verifies all gate conditions hold simultaneously. Only then does it write to disk.

Post-emit: the agent re-imports the actual file from disk (not in-memory), runs identical inputs, and compares output exactly. Known variance triggers fix-and-reverify; unknown variance triggers a snake to PLAN. This catches errors that test-after-write approaches miss: serialization bugs, import resolution differences, and file system race conditions.
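The two-stage check reduces to an exact output comparison between the in-memory run and the re-imported on-disk run. A simplified sketch over pure functions; the real protocol imports modules and replays the recorded real and error inputs.

```python
# Sketch of pre/post-emit verification: any exact-output mismatch between the
# in-memory logic and the re-imported on-disk file is an unknown variance,
# which snakes back to PLAN.
def verify_emit(in_memory_fn, on_disk_fn, inputs):
    for x in inputs:
        pre, post = in_memory_fn(x), on_disk_fn(x)
        if pre != post:
            return ("SNAKE_TO_PLAN", x, pre, post)
    return ("PASS", None, None, None)
```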

9. Discussion

9.1 Relationship to Formal Methods

The approach shares surface similarities with model checking (named state variables, explicit transitions) and design-by-contract (pre/post-conditions on advancement). However, it operates at a much coarser level of abstraction—the “states” are task phases, not program states, and the “contracts” concern what the agent knows rather than program correctness.

9.2 Relationship to Test-Driven Development

There is a loose analogy to TDD: both emphasize verification before implementation. The difference is that our verifications are not persisted as test files—they exist only as witnessed execution during the EXECUTE phase. This trades the long-term regression safety of a test suite for reduced context overhead during agentic sessions. Whether this tradeoff is worthwhile likely depends on the project and use case.

9.3 Overhead vs. Benefit

The snake-back mechanism introduces overhead: an agent discovering a new unknown during EMIT must re-plan and re-execute, potentially discarding partial work. In our experience, this cost is generally offset by reduced late-stage rework, though we have not conducted controlled experiments to quantify the tradeoff.

9.4 Limitations

  1. Finite unknown assumption: Termination requires discoverable unknowns to be finite. In adversarial environments, this may not hold.
  2. Witnessed execution cost: Real execution against real services may be prohibitive for expensive external APIs.
  3. Memory staleness window: The time between memory creation and verification creates potential for stale-memory-induced errors.

10. Advances Since Initial Publication

This section documents major advances made between the paper’s initial publication (March 25, 2026) and April 12, 2026. Entries are dated to the commit that introduced the capability.

10.1 Deterministic Browser Runtime (March 31, 2026)

An exec:browser runtime was added to rs-exec, replacing ad-hoc browser invocations with a deterministic managed-Chrome pipeline. The runtime launches a portable Chromium instance with a fixed debugging port (32882), detects an existing session before spawning a new one, and cleans up all browser state on session end. Early versions suffered from file-lock races on Windows when two agents shared a Chrome profile; this was resolved by per-agent cwd-local profiles (April 12).

Session ownership is enforced at the RPC layer: getTask, getAndClearOutput, and waitForOutput all reject requests from sessions that did not create the task. Cross-session contamination on Windows—where a race in port-to-session mapping could return another agent’s browser—was closed on April 9.

10.2 exec:rust and exec:serial (March–April 2026)

exec:rust was added (March 31) supporting inline // cargo-dep: and // cargo-path: annotations so Rust snippets can declare their own dependencies without a manifest file. exec:serial was added (April 6) for COM-port streaming, routing through the global npm root for the serialport native module.

10.3 Orphan Process Reaper (April 11, 2026)

On runner restart, the in-memory active-PID map is lost, leaving --exec-process-mode child processes with no cleanup path. A startup reaper was added to rs-exec that uses sysinfo to find exec-process-mode processes whose parent is not a live runner-mode process (with a 5-second age guard) and kills them via recursive process-tree kill. Without this, orphans accumulated across restarts, reaching 300+ processes and ~5.9 GB RAM in observed failure cases.

10.4 Fixed Port and Stable Task Tracking (April 6, 2026)

The runner previously selected a random port on each start. This caused task-ID loss across restarts because the plugkit client could not reconnect to the same port. The port was fixed at 32882, making task IDs stable across runner restarts and enabling reliable exec:sleep / exec:status / exec:close workflows.

10.5 Vector Embedding Search (April 11, 2026)

rs-search gained a vector embedding backend using nomic-embed-text-v1.5.Q4_K_M (a GGUF-quantized model, split into 6 parts for distribution). Search output is now structured as four sections: BM25 symbol hits, BM25 content hits, vector similarity hits, and git commit index hits. The combination gives the agent semantic recall for concepts that BM25 misses due to vocabulary mismatch, while retaining BM25’s precision for exact identifiers.

The git commit index was added simultaneously, enabling search queries to surface relevant commit messages alongside code results—surfacing the “why” of a change alongside the “what.”

10.6 Session-Scoped Task Isolation (April 8, 2026)

Background tasks are now isolated by a compound key of session_id + cwd. This prevents tasks from one agent session from appearing in another agent’s task list when both operate in the same working directory. The compound key is forwarded by the pre-tool-use hook on every exec invocation.
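The compound key amounts to namespacing the task registry by `(session_id, cwd)`. A minimal sketch with assumed function names:

```python
# Sketch of session-scoped task isolation via a (session_id, cwd) compound key.
tasks = {}

def register_task(session_id, cwd, task_id, task):
    tasks.setdefault((session_id, cwd), {})[task_id] = task

def list_tasks(session_id, cwd):
    """Only tasks created under the same session AND cwd are visible."""
    return list(tasks.get((session_id, cwd), {}))
```

Two sessions sharing a working directory thus see disjoint task lists, since their keys differ in the first component.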

10.7 exec:kill-port and Runner Start Lock (April 6, 2026)

An exec:kill-port command was added to reclaim ports held by stale processes. A start-lock mechanism was added to prevent multiple simultaneous plugkit runner instances—a race that previously caused split task state between two competing daemons.

10.8 Memorize Sub-Agent (April 10, 2026)

Inline edits to CLAUDE.md were replaced by a background memorize sub-agent (Haiku model, run_in_background=true). The sub-agent receives context, extracts non-obvious caveats, deduplicates against existing entries, and writes. This separates the agent’s task stream from memory management and prevents mid-task context pollution from large file reads.

10.9 Observability Mandate (April 11, 2026)

The planning skill was extended with a mandatory observability enumeration pass: on every planning iteration, the agent must enumerate every subsystem that lacks a permanent queryable inspection endpoint and add a .prd item to address each gap. The mandate distinguishes permanent structures (named, addressable, queryable without restart) from ad-hoc logs (console.log). On the client side, window.__debug must be a live structured registry—every component’s state, every active request, every WebSocket connection, addressable by key. New modules register on mount and deregister on unmount.

10.10 Minimal-Code / Maximal-DX Mandate (April 12, 2026)

A four-step resolution order was codified across all phases: (1) native first—does the language or runtime already do this?; (2) library second—does an existing dependency solve this pattern?; (3) structure third—can the problem be encoded as data (map, table, pipeline) so the structure enforces correctness?; (4) write last—only author new logic when the above three are exhausted. The mandate treats structural substitutions that eliminate a class of wrong states—dispatch tables replacing switch chains, pipelines replacing loop-with-accumulator, maps replacing if/else forests—as correctness properties, not style preferences.
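The "structure third" step can be illustrated with a dispatch table replacing an if/else forest: an unhandled case fails loudly at the lookup instead of falling through silently. The event shape and handler names are hypothetical.

```python
# Illustration of "structure third": a dispatch table as data, so the set of
# handled cases is enumerable and an unhandled kind raises rather than slips by.
HANDLERS = {
    "created": lambda e: f"created {e['id']}",
    "updated": lambda e: f"updated {e['id']}",
    "deleted": lambda e: f"deleted {e['id']}",
}

def handle(event):
    try:
        return HANDLERS[event["kind"]](event)
    except KeyError:
        raise ValueError(f"unhandled event kind: {event['kind']!r}")
```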

10.11 Immutable State Machine Terminology (April 9, 2026)

The “snakes and ladders” metaphor (Section 2.3) was replaced throughout the skill chain with explicit immutable state machine language: forward transitions are named by their condition, backward transitions are named by the trigger that caused them. This eliminates ambiguity about which phase boundary an agent is at and what condition must be satisfied to leave it.

10.12 Cascade Pipeline (April 3, 2026)

A fully automated cascade pipeline was established linking all five Rust repositories. A push to any of rs-exec, rs-codeinsight, rs-search, or rs-plugkit triggers a workflow_dispatch to rs-plugkit’s release workflow, which runs cargo update (picking up the latest git dep hashes), builds six platform binaries (linux/mac/win × x64/arm64), auto-bumps the rs-plugkit patch version, commits the binaries into plugforge, and triggers the plugforge publish workflow. The plugforge publish workflow builds and pushes to the downstream gm-cc repository, whose HEAD hash is what the /plugin command tracks to deliver updates to agents. No manual version bumps or local cargo builds are required at any step.

10.13 Qwen Platform and Lang Plugin System (April 5–6, 2026)

An eleventh platform, gm-qwen, was added for Qwen Code (a Claude Code fork with a Qwen-specific extension manifest). Simultaneously, a lang/ plugin system was introduced: language plugins in a local project’s lang/ directory are resolved before global package plugins, allowing per-project runtime extensions without modifying the package. The lang/browser.js plugin routes exec:browser through plugkit in the kilo/oc platforms.

10.14 CI Enforcement Gate (April 8, 2026)

The gm-complete skill was extended with a CI enforcement gate: the COMPLETE phase cannot be declared until CI is confirmed green on the pushed commit. This closes a gap where a session could terminate with all .prd items resolved and git clean but a failing build on the remote.

11. Related Work

ReAct [1] interleaves reasoning and action but lacks formal phase boundaries or named unknowns. Reflexion [2] adds self-reflection but does not enforce witnessed execution as the sole resolution mechanism. AutoGPT and BabyAGI implement task decomposition but without dependency-aware parallel execution or stop-hook enforcement. SWE-Agent [3] provides a coding agent interface but delegates lifecycle control to the LLM’s judgment rather than enforcing it through external hooks.

Our approach is largely orthogonal to these efforts: it provides an external behavioral envelope that constrains agent lifecycle through hooks, stop barriers, and tool permissions, rather than improving the agent’s internal reasoning. In principle, these techniques could be combined.

12. Conclusion

We have described a set of techniques for improving the reliability of LLM coding agents: a phase-gated state machine, a discipline for naming and resolving unknowns, and a persistent memory system for cross-session coherence. These techniques are implemented in the gm orchestrator and deployed across 11 platforms.

Our experience suggests that agent reliability depends substantially on lifecycle control—not just model capability. An agent operating within explicit phase boundaries and external enforcement mechanisms tends to produce more predictable results than one relying on self-regulation alone. We offer these techniques as practical tools rather than general claims, and expect that further work will clarify their scope and limitations.

References

  1. S. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” ICLR, 2023.
  2. N. Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning,” NeurIPS, 2023.
  3. J. Yang et al., “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering,” arXiv:2405.15793, 2024.
  4. J. Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” NeurIPS, 2022.
  5. J.S. Park et al., “Generative Agents: Interactive Simulacra of Human Behavior,” UIST, 2023.
  6. G. Wang et al., “Voyager: An Open-Ended Embodied Agent with Large Language Models,” NeurIPS, 2023.