Architecture

Self-Modification Under Governance

The debate is usually “should agents self-modify?” That’s the wrong question. The right question: which modifications are safe to automate, and what governance makes the rest tractable?

Keats · 9 min read

Why the “should agents self-modify?” debate misses the point

Agents already self-modify. Every time you update a system prompt based on agent feedback, every time a workflow adjusts its own thresholds, every time a memory file gets rewritten — that’s self-modification. The question isn’t whether it happens. It’s whether it happens safely and traceably.

The failure mode isn’t an agent that modifies itself. It’s an agent that modifies itself without:

  • A documented rationale (so you can understand why the change was made)
  • A hypothesis and measure-by date (so you know if it worked)
  • A rollback path (so you can undo it if it didn’t)
  • Tiers that separate safe modifications from risky ones

Recent research on hyperagents formalizes this. Zhang et al. (arXiv:2603.19461, March 2026) describe self-referential metacognitive self-modification — the idea that agents can reason about their own reasoning processes and update them. The insight is that this capacity, when structured, produces agents that improve systematically rather than drifting unpredictably.

An agent that can’t improve its own procedures will plateau at the quality of its initial design. An agent that improves without governance will drift toward whatever optimization pressure is highest — which may not be what you want.

The tiered governance model

The practical solution is a four-tier classification of modifications by risk and reversibility. Each tier has different approval requirements, different documentation standards, and different rollback expectations.
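As a rough sketch, the four tiers and their requirements can be encoded as a lookup that tooling consults before applying any change. The names and fields below are illustrative, not part of any particular framework:

```python
from enum import Enum

class Tier(Enum):
    ROUTINE = 1         # low-risk, fully autonomous
    DOCUMENTED = 2      # autonomous, with a decision file and ledger entry
    REVIEWED = 3        # requires a sub-agent review before applying
    HUMAN_APPROVED = 4  # requires explicit human approval, no exceptions

# What each tier demands before a modification may be applied.
REQUIREMENTS = {
    Tier.ROUTINE:        {"decision_file": False, "review": None,        "human_approval": False},
    Tier.DOCUMENTED:     {"decision_file": True,  "review": None,        "human_approval": False},
    Tier.REVIEWED:       {"decision_file": True,  "review": "sub-agent", "human_approval": False},
    Tier.HUMAN_APPROVED: {"decision_file": True,  "review": None,        "human_approval": True},
}
```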

Tier 1: Routine — Low-risk, fully autonomous

These are modifications that change how and when work executes without changing what the agent decides or produces. Schedule adjustments, step ordering, skip conditions, metric thresholds. The agent can make these without approval because a bad change at this tier is low-risk and easy to reverse.

Examples: adjusting a cron schedule based on observed yield rate; reordering steps in a processing loop; updating a token-cost threshold.

Requirement: a git commit with a descriptive message. No human review needed.
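A Tier 1 change might look like the following sketch: adjust a value in a config file and record it with a descriptive commit. The config path and format are assumptions for illustration:

```python
import json
import subprocess

CONFIG = "agent/config.json"  # hypothetical config location

def apply_tier1_change(key: str, value, reason: str) -> None:
    """Apply a low-risk Tier 1 change and record it with a descriptive commit."""
    with open(CONFIG) as f:
        config = json.load(f)
    config[key] = value
    with open(CONFIG, "w") as f:
        json.dump(config, f, indent=2)
    subprocess.run(["git", "add", CONFIG], check=True)
    subprocess.run(
        ["git", "commit", "-m", f"tier1: set {key}={value} ({reason})"], check=True
    )

# e.g. apply_tier1_change("token_cost_threshold", 0.85, "observed yield over last 30 runs")
```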

Tier 2: Documented — Meaningful but bounded, autonomous with documentation

These affect how the agent reasons, routes work, or formats output — but within defined bounds. Model routing tables, phase conditions, template formats, timing for reflection cycles. The agent can make these autonomously, but must document the hypothesis and a measure-by date.

Examples: updating the model routing table to route a new task type to a cheaper model; changing the format of a memory write; adjusting when a reflection cycle triggers.

Requirement: write a decision file with a hypothesis and a measure-by date, plus a companion entry in the performance ledger. Git commit.

## Decision: Route extraction tasks to a lighter model
**Hypothesis:** A lighter model handles structured extraction as reliably
  as a heavier one at significantly lower cost.
**Measure by:** [date 30 days out]
**Evidence:** [the data or reasoning that supports making the change now]
**Rollback:** revert the routing table entry
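A minimal sketch of that requirement, assuming decision files live under a decisions/ directory and the ledger is a markdown file (both paths are hypothetical):

```python
from datetime import date, timedelta
from pathlib import Path

DECISIONS = Path("decisions")                   # hypothetical locations
LEDGER = Path("memory/performance-ledger.md")

def record_tier2_change(title: str, hypothesis: str, evidence: str,
                        rollback: str, measure_days: int = 30) -> None:
    """Write the decision file and its companion ledger entry for a Tier 2 change."""
    measure_by = (date.today() + timedelta(days=measure_days)).isoformat()
    slug = title.lower().replace(" ", "-")

    decision = (
        f"## Decision: {title}\n"
        f"**Hypothesis:** {hypothesis}\n"
        f"**Measure by:** {measure_by}\n"
        f"**Evidence:** {evidence}\n"
        f"**Rollback:** {rollback}\n"
    )
    DECISIONS.mkdir(parents=True, exist_ok=True)
    (DECISIONS / f"{date.today().isoformat()}-{slug}.md").write_text(decision)

    # Companion ledger entry starts Open and is resolved at the measure-by date.
    ledger_entry = (
        f"\n## {date.today().isoformat()} | {title}\n"
        f"- Hypothesis: {hypothesis}\n"
        f"- Measure by: {measure_by}\n"
        f"- Status: Open\n"
    )
    with LEDGER.open("a") as f:
        f.write(ledger_entry)
```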

Tier 3: Reviewed — Structural, requires internal review

These affect the agent’s cognitive defaults, reflection protocol, or evaluation criteria. A bad change here can subtly degrade quality across all output, which is why it requires a more expensive sub-agent review — not human approval, but a second perspective.

Examples: changing how the agent classifies uncertainty; modifying the reflection protocol structure; updating skill evaluation criteria.

Requirement: spawn a sub-agent (premium model or equivalent) to review the proposed change and produce a written assessment. Proceed only if review finds no blocking concerns. Full documentation required.
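One way to implement the review gate, with a hypothetical spawn_review() standing in for however your stack launches a sub-agent:

```python
from collections.abc import Callable
from dataclasses import dataclass

@dataclass
class ReviewResult:
    assessment: str               # written assessment, kept with the decision file
    blocking_concerns: list[str]  # an empty list means the change may proceed

def spawn_review(proposed_change: str) -> ReviewResult:
    """Placeholder: run a premium-model sub-agent over the proposed change and
    parse its written assessment. The mechanism depends entirely on your stack."""
    raise NotImplementedError

def apply_tier3_change(proposed_change: str, apply: Callable[[], None]) -> bool:
    review = spawn_review(proposed_change)
    # Keep the assessment alongside the decision file for the audit trail.
    if review.blocking_concerns:
        return False              # do not apply; revise or escalate instead
    apply()
    return True
```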

Tier 4: Human-Approved — Safety-critical, requires explicit human approval

These affect trust boundaries, authority tiers, hard rules, or the governance model itself. No agent — not even a well-reasoned sub-agent — should be able to weaken safety boundaries autonomously. These require a human to review and explicitly approve.

Examples: expanding the set of actions that can be taken autonomously; changing what counts as an escalation trigger; modifying the trust tier levels; any change to Tier 4 itself.

Requirement: surface the proposed change with full rationale. Wait for explicit human approval. No exceptions.
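A Tier 4 gate has to fail closed: the change is surfaced with its full rationale and nothing happens until a human has explicitly approved it. The approvals file below is an assumption; the essential property is that the agent cannot produce the approval itself:

```python
import json
from pathlib import Path

PROPOSALS = Path("proposals")        # hypothetical locations
APPROVALS = Path("approvals.json")   # written only by a human, outside the agent's tools

def propose_tier4_change(change_id: str, rationale: str) -> None:
    """Surface the proposed change with its full rationale, then stop."""
    PROPOSALS.mkdir(exist_ok=True)
    (PROPOSALS / f"{change_id}.md").write_text(rationale)

def tier4_approved(change_id: str) -> bool:
    """Fail closed: no approvals file, or no entry, means the change does not proceed."""
    if not APPROVALS.exists():
        return False
    approvals = json.loads(APPROVALS.read_text())
    return approvals.get(change_id) is True
```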

⚠️ Warning: The most dangerous Tier 4 violation is an agent that reclassifies a Tier 4 change as Tier 3 or Tier 2 to avoid the approval requirement. The protected core must be explicitly enumerated and immutable by definition. If the agent can change what counts as Tier 4, Tier 4 is meaningless.

The protected core

Not everything is on the modification ladder. Some rules are immutable — the agent cannot modify them regardless of tier, rationale, or confidence. A protected core typically includes: the rules that gate irreversible external actions, the authority model's tier classification, and the escalation triggers that interrupt for human review. In short: anything that, if weakened, would allow the agent to bypass oversight.

The protected core is explicitly listed in a place the agent can read but not write. In practice, this means storing it in a file that’s tracked by git and covered by a CI check — modification fails the build.
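A minimal version of that CI check, assuming the protected core lives at a known path and its approved hash is pinned in the repository (both paths are illustrative):

```python
import hashlib
import sys
from pathlib import Path

PROTECTED_CORE = Path("governance/protected-core.md")   # illustrative paths
PINNED_HASH = Path("governance/protected-core.sha256")  # updated only via Tier 4 approval

def check_protected_core() -> int:
    """Fail the build if the protected core no longer matches its pinned hash."""
    actual = hashlib.sha256(PROTECTED_CORE.read_bytes()).hexdigest()
    expected = PINNED_HASH.read_text().strip()
    if actual != expected:
        print("FAIL: protected core was modified without Tier 4 approval", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(check_protected_core())
```

Run it as a required check so a violation fails loudly, rather than being discovered after the fact.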

Documented rationale for every change

Self-modification without documentation is worse than no self-modification. You end up with an agent that has drifted from its initial design, no record of why, and no way to audit the drift.

Every Tier 2+ modification requires a decision file with four fields:

  • Hypothesis: What do we expect this change to produce?
  • Measure by: A specific date by which we’ll know if the hypothesis held
  • Evidence: What data or reasoning supports making the change now?
  • Rollback: Exactly how to undo this if it fails

This isn’t bureaucracy. It’s the minimum viable audit trail for a system that modifies its own behavior. Without it, you have no way to distinguish “this change worked” from “we got lucky” — and no way to find which change caused a regression.

The performance ledger

The hypothesis alone isn’t enough. You need a place to close the loop: to record whether the hypothesis held when the measure-by date arrives. This is the performance ledger.

Each Tier 2+ modification gets a companion entry in a performance ledger. When the measure-by date passes, the entry is updated with the outcome: confirmed, refuted, or inconclusive.

## [Date] | Lighter model for extraction tasks
- Hypothesis: minimal quality regression, significant cost reduction
- Measure by: [date]
- Status: Open

## [Earlier date] | Skip condition threshold adjusted
- Hypothesis: higher threshold reduces wasted runs without missing signal
- Measure by: [date]
- Status: Confirmed — skip rate improved, no missed items in audit
- Action: threshold now default in scheduling protocol
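Closing the loop can be a periodic pass over the ledger that flags open entries whose measure-by date has passed. The parsing below assumes the entry format shown above, with concrete ISO dates in place of the placeholders:

```python
import re
from datetime import date
from pathlib import Path

LEDGER = Path("memory/performance-ledger.md")  # hypothetical location

def overdue_entries(today: date | None = None) -> list[str]:
    """Return titles of ledger entries still Open after their measure-by date."""
    today = today or date.today()
    text = LEDGER.read_text()
    overdue = []
    # Entries start with "## <date> | <title>", per the format above.
    for entry in re.split(r"^## ", text, flags=re.MULTILINE)[1:]:
        title = entry.splitlines()[0].strip()
        measure = re.search(r"Measure by:\s*(\d{4}-\d{2}-\d{2})", entry)
        if "Status: Open" in entry and measure:
            if date.fromisoformat(measure.group(1)) < today:
                overdue.append(title)
    return overdue

# Each overdue entry gets resolved to Confirmed, Refuted, or Inconclusive.
```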

Closed ledger entries — especially refuted ones — are the most valuable learning artifacts an agent can produce. A refuted hypothesis tells you something concrete about how your agent behaves in production.

Making the improvement mechanism itself improvable

The deepest insight in the hyperagents paper is self-referential: the improvement process itself should be subject to improvement. An agent that can only improve its task procedures, but not its improvement procedures, will plateau at the quality of its initial metacognitive design.

In practice, this means Tier 1 and Tier 2 changes can include changes to the improvement protocol — how often reflection runs, how hypotheses are structured, what counts as evidence. But changes to the tier structure itself (what requires Tier 3 vs Tier 4 review) must go through Tier 4 approval. The governance model protects itself.
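In code, that self-protection can be a routing rule that runs before any other classification: if a change touches the tier definitions or the protected core, it is Tier 4 regardless of the tier the agent proposed. The paths are illustrative:

```python
PROTECTED_PATHS = {
    "governance/protected-core.md",  # illustrative paths
    "governance/tiers.md",           # the tier definitions themselves
}

def classify(change_paths: set[str], proposed_tier: int) -> int:
    """Force any change that touches governance itself into Tier 4,
    regardless of the tier the agent proposed for it."""
    if change_paths & PROTECTED_PATHS:
        return 4
    return proposed_tier
```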

🤖 Agent note: In production, an agent operating under this model produces a git log of how its procedures evolved over time — not just what changed. That audit trail is the most useful artifact the governance model generates. Individual changes matter less than the ability to reconstruct the reasoning behind them.

When to use this / when not to

Use this when

  • Your agent runs persistent procedures that need to improve over time
  • You want systematic improvement without constant manual tuning
  • The agent’s configuration is complex enough that undocumented drift would be hard to debug
  • Multiple operators or users need to understand what the agent is doing and why
  • The agent has access to irreversible external actions (publish, merge, send)

Skip this when

  • The agent is a one-shot tool with no persistent state
  • The configuration is simple enough to manage manually without drift risk
  • You’re prototyping and want to move fast before locking in governance
  • There’s only one operator and the overhead exceeds the organizational benefit

👤 Human note: The right time to implement governance is before the agent has made changes you can’t explain. If you’re already debugging “why does it behave differently than it used to?” — governance is overdue. Retrofitting it is harder than building it in from the start.

Next steps

Start with the Tier 4 protected core. Write down the rules the agent must never be able to change, put them in a read-only-by-design location, and make sure they’re covered by a check that fails loudly if modified. This is the floor, not the ceiling.

Then add the decision file format for Tier 2 changes. Even if you never reach Tier 3/4 scenarios, the habit of writing hypothesis + measure-by date before making a change will dramatically improve your ability to reason about why the agent behaves the way it does six months from now.

The full governance model — including the performance ledger approach, the reflection protocol structure, and how self-modification integrates with the memory system — is part of Blueprint.