Advances in Agentic Discipline for LLM Coding Agents

Abstract

We describe a set of advances to the agentic discipline of the gm orchestrator, building on the phase-gated state machine introduced in our earlier work [1]. Four concerns motivate this paper. First, agents lack a principled protocol for searching codebases, leading to false negatives and overly broad result sets. Second, agents accumulate code complexity through habitual addition—writing new logic when native or structural solutions would have sufficed. Third, agents operating across multiple concurrent tasks lack mechanisms to ensure tasks are truly independent before executing them in parallel. Fourth, agents produce outputs that cannot be inspected at runtime without restarting the process or adding temporary logging. We describe techniques addressing each concern: a minimum-two-word codebase search protocol with iteration requirements, a four-step resolution order that defers code authorship to last resort, a structural mandate for wave-based parallel execution with dependency enforcement, and a permanent observability registry discipline. Together these advances tighten the behavioral envelope within which the gm state machine operates.

1. Introduction

Our earlier paper [1] described the gm orchestrator: a phase-gated state machine for LLM coding agents enforcing lifecycle control through hooks, named unknowns, and witnessed execution. That paper addressed agent lifecycle—the order in which work may proceed and the conditions for advancing between phases. It did not address the quality of work within each phase.

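This paper describes four advances to the discipline of work within phases. Each addresses a failure mode we observed in practice: unprincipled codebase search, habitual code addition, false parallelism across concurrent tasks, and runtime opacity.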

The paper is organized as follows: §2 describes the codebase search discipline; §3 introduces the minimal-code resolution order; §4 covers structural parallelism enforcement; §5 describes the observability mandate; and §6–9 discuss the relationship to earlier work, limitations, related work, and conclusions.

2. Codebase Search Discipline

2.1 The Single-Query Failure Mode

When an agent searches a codebase for an existing implementation, the most common failure is a single overly broad query that either returns too many results (the agent gives up and writes new code anyway) or returns zero results (the agent concludes the symbol is absent when it exists under a different name). In either case, the result is a duplicate implementation or a missed consolidation.

A secondary failure is the agent treating a search result as conclusive after a single attempt. Codebase search is a vocabulary problem: the agent’s name for a concept may not match the codebase’s name for it. A single query covering only one vocabulary is not sufficient evidence of absence.

2.2 The Minimum-Two-Word Protocol

The search discipline requires every codebase query to contain at least two words. Single-word queries are rejected at the invocation layer. The rationale is simple: a two-word query with a subject and a context returns substantially fewer false positives than a one-word query, making the result set actionable rather than requiring manual filtering.

When a query returns no results, the protocol prescribes a specific iteration strategy: change one word and add a third. This forces the agent to consider alternate vocabulary rather than concluding immediately that no result exists. The protocol requires a minimum of four distinct queries before a negative result is accepted.

Only after all four attempts return no results may the agent conclude the symbol is absent. This discipline addresses the vocabulary mismatch problem systematically rather than by chance.
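The protocol above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the gm implementation: the names validate_query and search_with_iteration, the exception types, and the shape of the search callable are all hypothetical. The agent is assumed to supply the reworded queries itself; the sketch only enforces the two-word minimum and the four-attempt floor.

```python
from typing import Callable, List, Sequence

MIN_WORDS = 2      # single-word queries are rejected
MIN_ATTEMPTS = 4   # distinct queries required before absence is accepted


def validate_query(query: str) -> str:
    """Reject queries with fewer than two words; return the normalized query."""
    words = query.split()
    if len(words) < MIN_WORDS:
        raise ValueError(f"query needs at least {MIN_WORDS} words: {query!r}")
    return " ".join(words)


def search_with_iteration(
    queries: Sequence[str],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Run queries in order and stop at the first one that returns hits.

    An empty list (absence) is returned only after at least MIN_ATTEMPTS
    distinct queries have all come back empty.
    """
    tried: List[str] = []
    for raw in queries:
        query = validate_query(raw)
        if query in tried:
            continue  # a repeated query is not a new vocabulary hypothesis
        tried.append(query)
        hits = search(query)
        if hits:
            return hits
    if len(tried) < MIN_ATTEMPTS:
        raise RuntimeError(
            f"only {len(tried)} distinct queries tried; "
            f"{MIN_ATTEMPTS} are required before concluding absence"
        )
    return []
```

Raising an error when fewer than four distinct queries were tried makes a premature negative result impossible to return by accident, which is the property the protocol is after.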

2.3 Scan Before Edit

The search discipline extends to edits: before modifying any concern, the agent must scan the codebase for existing implementations of the same concern. A duplicate concern discovered during an edit—rather than during search—triggers an immediate regression to PLAN. The scan-before-edit rule converts consolidation from an afterthought into a gate condition on all modifications.

Observation (Coverage). The minimum-four-attempt protocol does not guarantee discovery of all existing implementations. It does guarantee that the agent has exhausted at least four distinct vocabulary hypotheses before concluding absence, which substantially reduces the false-negative rate compared to single-attempt protocols.

3. The Minimal-Code Resolution Order

3.1 The Complexity Accumulation Problem

LLM coding agents have a systematic tendency to write new code. This is unsurprising: generating code is the primary training objective. The failure mode is not writing bad code—it is writing code at all when no new code was needed. Every line of new code is a line that must be read, tested, maintained, and eventually deleted. An agent that reaches for a new function before checking whether the standard library already provides it is accumulating complexity without reason.

We observe this failure in three common forms: writing a utility function that wraps a one-line native API call; writing a conditional chain that could be encoded as a dispatch table; and writing a loop accumulator that could be expressed as a pipeline. None of these adds functionality. All add surface area.

3.2 Four-Step Resolution Order

The minimal-code mandate imposes a resolution order that must be followed before writing any logic. The agent stops at the first step that resolves the need.

The order is strict. An agent that reaches step 4 without documenting why steps 1–3 were inapplicable has violated the protocol. This is not enforced mechanically—the agent must apply the resolution order as a cognitive discipline before each authorship decision.

3.3 Structure as Correctness

Step 3 deserves elaboration because its benefit is not merely conciseness. When a set of cases is encoded as a dispatch table, the table shape makes it structurally impossible to dispatch to the wrong handler—there is no branching logic to contain a bug. When a data transformation is expressed as a pipeline of pure functions, there is no mutable accumulator in which intermediate state can corrupt the result. These are not style preferences: they are correctness properties that the structure enforces without requiring the reader to trace execution.

The mandate requires that when a structural choice eliminates a class of wrong states, that elimination is named explicitly. The name is not a comment—comments are prohibited. It is a function name, a type name, or a variable name that makes the structural property self-evident to any reader.
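As an illustration of the two structures discussed in this section, the following sketch shows a dispatch table and a pipeline. All handler and function names are hypothetical. The comments are included only for exposition, since the mandate itself prohibits comments and relies on names to carry the structural property.

```python
from typing import Callable, Dict, Iterable


def handle_start(payload: str) -> str:
    return f"started {payload}"


def handle_stop(payload: str) -> str:
    return f"stopped {payload}"


# Dispatch table: each command maps to exactly one handler, so there is no
# branch ordering to get wrong.
HANDLER_BY_COMMAND: Dict[str, Callable[[str], str]] = {
    "start": handle_start,
    "stop": handle_stop,
}


def dispatch(command: str, payload: str) -> str:
    # An unknown command raises KeyError instead of falling through a chain.
    return HANDLER_BY_COMMAND[command](payload)


def total_length_of_nonblank_lines(lines: Iterable[str]) -> int:
    # Pipeline: each stage consumes the previous one; no mutable accumulator.
    stripped = (line.strip() for line in lines)
    nonblank = (line for line in stripped if line)
    return sum(len(line) for line in nonblank)
```

Adding a new command means adding one table entry, and the table name states the property that a reader would otherwise have to verify by tracing an if/elif chain.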

3.4 Readability Criterion

The mandate includes a readability gate: code that requires the reader to mentally simulate execution in order to determine its behavior is not complete. Code that can be read top-to-bottom—each line following from the last without requiring the reader to hold intermediate state—is complete. This operationalizes the principle without subjective judgment about “clean code.” If you must trace execution to understand behavior, the structure must change.

4. Structural Parallelism Enforcement

4.1 False Parallelism

The .prd dependency graph enables wave-based parallel execution: items with satisfied dependencies are launched as concurrent subagents, each working on an independent task. The failure mode is false parallelism: launching items as concurrent when they have implicit dependencies that the dependency graph did not capture. The result is conflicting edits to shared files, race conditions in test execution, or one subagent completing work that invalidates another’s assumption.

False parallelism is particularly insidious because it is not visible in the .prd structure. Two items may have no declared dependencies but both modify the same file, or both depend on the same external service being in a specific state. The dependency graph captures declared dependencies; it cannot capture implicit ones.

4.2 Wave Discipline

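The wave discipline requires the agent to verify independence before launching a parallel wave. For each pair of items in the proposed wave, the agent must confirm that neither item has an implicit dependency on the other, such as a file that both modify or an external service that both require to be in a specific state.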

If any pair fails these checks, the failing items are serialized: one executes first, and the other is deferred to the next wave after the first completes. The wave size is capped at three concurrent subagents to limit blast radius when a false dependency is missed.
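The serialization rule and the wave cap can be sketched as a greedy planner. This is a sketch under assumptions: the pairwise check used here (no shared file, no shared external service) is inferred from the examples in §4.1 and may not match the full set of checks, and the Item fields and the plan_waves name are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

MAX_WAVE_SIZE = 3  # cap on concurrent subagents stated in the text


@dataclass(frozen=True)
class Item:
    name: str
    files: FrozenSet[str] = frozenset()     # files the item will modify
    services: FrozenSet[str] = frozenset()  # external services it relies on


def independent(a: Item, b: Item) -> bool:
    """Assumed pairwise check: no shared file and no shared service."""
    return not (a.files & b.files) and not (a.services & b.services)


def plan_waves(ready: List[Item]) -> List[List[Item]]:
    """Greedily pack ready items into waves of pairwise-independent items.

    An item that conflicts with the current wave, or that would exceed the
    cap, is deferred to a later wave.
    """
    waves: List[List[Item]] = []
    pending = list(ready)
    while pending:
        wave: List[Item] = []
        deferred: List[Item] = []
        for item in pending:
            fits = len(wave) < MAX_WAVE_SIZE and all(
                independent(item, other) for other in wave
            )
            (wave if fits else deferred).append(item)
        waves.append(wave)
        pending = deferred
    return waves
```

The first pending item always fits an empty wave, so every pass makes progress and the loop terminates. The sketch only orders the waves; waiting for one wave to complete before launching the next is left to the caller.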

4.3 Browser Task Serialization

Browser automation tasks are unconditionally serialized regardless of declared dependencies. The rationale is that a managed browser instance is an inherently stateful shared resource: navigation, cookies, and DOM state from one task bleed into the next. Even tasks that appear independent (navigating to different pages) share a browser profile and can interfere through cached credentials, service workers, or local storage. Serialization eliminates this failure class at the cost of reduced parallelism on browser-heavy workloads.

4.4 The Memorize Sub-Agent

Memory persistence was previously performed inline during the agent’s task stream, requiring the agent to read and edit the memory index file mid-task. This introduced both context pollution (large file reads during active task execution) and a sequencing hazard (memory writes could conflict with task edits to adjacent files).

The advance replaces inline memory persistence with a background sub-agent: a separate Haiku-model agent invoked with run_in_background=true that receives the context to persist, classifies it into the appropriate memory type, deduplicates against existing entries, and writes. The sub-agent runs independently of the main task stream, returning no result to the invoking agent. The invoking agent does not wait for it. This separates memory management from task execution both temporally and in terms of context window consumption.
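The hand-off pattern can be approximated with a worker thread and a queue. This is only an analogy under loud assumptions: the actual sub-agent is a separate model invocation, whereas here a thread stands in for it, an in-memory dictionary stands in for the memory index file, and the classify rule is a hypothetical placeholder for the model's classification.

```python
import queue
import threading
from typing import Dict, List


def classify(text: str) -> str:
    """Hypothetical classifier; a real sub-agent would use a model here."""
    return "preference" if text.lower().startswith("prefer") else "fact"


class MemoryWorker:
    """Background persister: the caller hands off context and moves on."""

    def __init__(self) -> None:
        self.entries: Dict[str, List[str]] = {}
        self._inbox: "queue.Queue[str]" = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def memorize(self, text: str) -> None:
        """Queue context for persistence; returns immediately with no result."""
        self._inbox.put(text)

    def _run(self) -> None:
        while True:
            text = self._inbox.get()
            bucket = self.entries.setdefault(classify(text), [])
            if text not in bucket:  # deduplicate against existing entries
                bucket.append(text)
            self._inbox.task_done()

    def drain(self) -> None:
        """Block until queued writes finish; intended for tests only."""
        self._inbox.join()
```

The caller never reads the memory store on its own thread, which mirrors the separation of context consumption described above; drain exists only so that a test can observe the result.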

5. Permanent Observability

5.1 The Opacity Problem

An agent that cannot inspect its own runtime state must guess when diagnosing failures. A guess is an unwitnessed hypothesis—precisely the failure mode the mutable discipline is designed to prevent. The opacity problem is therefore not merely an operational inconvenience: it is a category violation of the witnessed-execution principle. If the agent cannot see what is running, what has been output, or what phase the state machine is in without adding temporary logging, then its debugging process is structurally identical to narrating hypotheses—which it is explicitly prohibited from doing.

5.2 Permanent vs. Ad-Hoc Observability

Permanent structures exist independently of any specific debugging session. They are named, queryable, and survive across agent interactions. A structured inspection endpoint at /debug/<subsystem> that returns current state is permanent. A window.__debug.chat.pendingMessages registry entry populated on mount and removed on unmount is permanent. A structured log filterable by subsystem tag at runtime without restart is permanent.

Ad-hoc structures are temporary and disappear. A console.log statement is ad-hoc. A variable dump added during a debugging session and removed after is ad-hoc. Ad-hoc observability requires foreknowledge of where to look and disappears the moment it is no longer immediately needed. It accumulates no diagnostic value across sessions.
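A permanent structure in this sense can be as small as a named registry of snapshot functions. The sketch below is a server-side analogue written for illustration; the DebugRegistry class and its method names are assumptions, not part of gm.

```python
from typing import Any, Callable, Dict, List


class DebugRegistry:
    """Named, queryable inspection points that outlive any debugging session."""

    def __init__(self) -> None:
        self._snapshots: Dict[str, Callable[[], Any]] = {}

    def register(self, subsystem: str, snapshot: Callable[[], Any]) -> None:
        """Attach a function that returns the subsystem's current state."""
        self._snapshots[subsystem] = snapshot

    def deregister(self, subsystem: str) -> None:
        """Remove the inspection point when the subsystem shuts down."""
        self._snapshots.pop(subsystem, None)

    def inspect(self, subsystem: str) -> Any:
        """Return state as of this call, not as of registration."""
        return self._snapshots[subsystem]()

    def subsystems(self) -> List[str]:
        """List every subsystem that currently has an inspection point."""
        return sorted(self._snapshots)
```

Because each entry is a function and not a stored value, an inspection reflects live state without a restart, which is the distinction from a one-off variable dump.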

5.3 Enumeration Mandate

During every planning pass, the agent must enumerate every internal subsystem that lacks a permanent queryable inspection point and add a .prd item for each gap. The enumeration is not optional and is not deferred: observability items are classified as highest-priority and cannot be blocked by feature work. The mandate names the required subjects: state machine current phase, .prd item list with statuses, active background task map, last tool output per task, hook invocation counts, and memory index freshness.

The intent is to make each planning pass leave the system more observable than it was before the pass began, independent of what other work is in progress.
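The enumeration step can be sketched as a set difference between the subjects the mandate names and the subjects that already have an inspection point. The dictionary shape used for a .prd item here (title and priority fields) is a hypothetical stand-in, since the .prd schema is not given in this paper.

```python
from typing import Dict, Iterable, List

# Subjects named by the mandate in this section.
REQUIRED_SUBJECTS = (
    "state machine current phase",
    ".prd item list with statuses",
    "active background task map",
    "last tool output per task",
    "hook invocation counts",
    "memory index freshness",
)


def observability_gap_items(inspectable: Iterable[str]) -> List[Dict[str, str]]:
    """Return one highest-priority item per required subject that is not covered."""
    covered = set(inspectable)
    return [
        {"title": f"Add permanent inspection point: {subject}", "priority": "highest"}
        for subject in REQUIRED_SUBJECTS
        if subject not in covered
    ]
```

Running this on every planning pass yields an empty list only once every named subject is inspectable, so the gap list shrinks monotonically as items are completed.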

5.4 Client-Side Registry

For agent systems with a client-side component, window.__debug must be a live structured registry, not a dump. Each module registers itself on mount (window.__debug.moduleName = { state, methods }) and deregisters on unmount (delete window.__debug.moduleName). The registry is addressable by key: any component’s current state is reachable without refreshing. Any execution path not traceable through window.__debug is classified as an observability gap and generates a .prd item in the next planning pass.

6. Relationship to Earlier Work

The first paper [1] addressed the agent’s behavioral envelope: what it may do and in what order. This paper addresses the quality of work within that envelope. The two concerns are orthogonal: an agent can follow the phase transitions perfectly while still searching poorly, writing unnecessary code, launching false-parallel tasks, and operating opaquely. The advances described here are therefore not replacements for the state machine discipline—they are refinements of behavior within it.

The minimal-code resolution order extends the witnessed-execution discipline of §3 in the first paper. Witnessed execution prohibits acting on unverified hypotheses about runtime behavior; the resolution order prohibits acting on unverified hypotheses about whether existing code solves the need. Both address the same root failure: the agent generating plausible output without verifying the premise that output was needed.

The search discipline extends the named-mutable discipline of §3.1 of the first paper. A search result is a mutable: it has an expected value (the symbol exists here), a current value (UNKNOWN), and a resolution method (execute the query and witness the output). The minimum-four-attempt protocol operationalizes the two-pass rule of §3.3 of that paper: if two attempts with different vocabulary both fail, the mutable is reclassified and the agent regresses to PLAN.

7. Limitations

8. Related Work

The minimal-code resolution order shares motivation with the principle of least power [2]: prefer the least powerful tool that solves the problem, because less powerful tools are easier to analyze and less likely to produce unexpected behavior. Our formulation is specifically calibrated for LLM agents, where the failure mode is excess code generation rather than the use of overly expressive languages.

The observability mandate draws on the observability-as-code literature [3], which argues that structured inspection endpoints should be designed into systems from the start rather than added as an afterthought. Our contribution is applying this principle to the agent itself as a system, rather than only to the software the agent produces.

Wave-based parallelism with dependency enforcement is a standard technique in build systems [4]. Our application differs in that the dependency graph is constructed by the agent from task descriptions rather than derived mechanically from file dependency analysis, making false-parallel errors possible in ways that build-system parallelism is not.

The background memorize sub-agent pattern is related to the actor model [5]: the main agent and the memory agent are concurrent processes communicating by message, with no shared mutable state. The memory agent’s output is durable (disk) while the main agent’s context is ephemeral (conversation window), which motivates the separation.

9. Conclusion

We have described four advances to the agentic discipline of the gm orchestrator: a minimum-two-word codebase search protocol with four-attempt iteration requirements, a four-step resolution order that defers code authorship to last resort, a structural parallelism discipline requiring dependency verification before concurrent execution, and a permanent observability mandate that treats unqueryable internal state as a planning-phase defect. These advances tighten the behavioral envelope within which the gm state machine operates, addressing failure modes that the lifecycle enforcement of the first paper left unaddressed.

Our experience suggests that agent reliability depends not only on what the agent is permitted to do—the subject of the first paper—but on the discipline with which it performs the permitted actions. A well-structured lifecycle is necessary but not sufficient. The advances described here address the sufficiency gap.

References

[1] AnEntrypoint, “Toward Structured Lifecycle Control in LLM Coding Agents,” March 2026.
[2] W3C TAG, “The Principle of Least Power,” w3.org/2001/tag/doc/leastPower, 2006.
[3] C. Richardson, “Observability Patterns for Microservices,” microservices.io/patterns/observability, 2018.
[4] J. L. Bentley and M. D. McIlroy, “Engineering a Sort Function,” Software: Practice and Experience, 1993. (Build parallelism context.)
[5] C. Hewitt, P. Bishop, and R. Steiger, “A Universal Modular ACTOR Formalism for Artificial Intelligence,” IJCAI, 1973.
[6] S. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” ICLR, 2023.
