Self-Modification Under Governance
The debate is usually framed as 'should agents self-modify?' That's the wrong question. The right question is: which modifications are safe to automate, and what governance makes the rest tractable?
Why the "should agents self-modify?" debate misses the point
Agents already self-modify. Every time you update a system prompt based on agent feedback, every time a workflow adjusts its own thresholds, every time a memory file gets rewritten — that's self-modification. The question isn't whether it happens. It's whether it happens safely and traceably.
The failure mode isn't an agent that modifies itself. It's an agent that modifies itself without:
- A documented rationale (so you can understand why the change was made)
- A hypothesis and measure-by date (so you know if it worked)
- A rollback path (so you can undo it if it didn't)
- Tiers that separate safe modifications from risky ones
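Taken together, these four requirements amount to a small record schema. A minimal sketch in Python; the `ModificationRecord` type and its field names are illustrative, not a prescribed format:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ModificationRecord:
    """One self-modification, with the metadata that makes it auditable."""
    tier: str              # "M1".."M4" risk classification
    rationale: str         # why the change was made
    hypothesis: str        # what the change is expected to produce
    measure_by: date       # when we check whether the hypothesis held
    rollback: str          # exact steps to undo the change
    outcome: str = "open"  # later: "confirmed" | "refuted" | "inconclusive"

record = ModificationRecord(
    tier="M2",
    rationale="Haiku is much cheaper for structured extraction",
    hypothesis="<5% quality regression, >80% cost reduction",
    measure_by=date(2026, 4, 25),
    rollback="revert routing table entry in TOOLS.md",
)
```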
Recent research on hyperagents formalizes this. Zhang et al. (arXiv:2603.19461, March 2026) describe self-referential metacognitive self-modification — the idea that agents can reason about their own reasoning processes and update them. The insight is that this capacity, when structured, produces agents that improve systematically rather than drifting unpredictably.
An agent that can't improve its own procedures will plateau at the quality of its initial design. An agent that improves without governance will drift toward whatever optimization pressure is highest — which may not be what you want.
The tiered governance model (M1–M4)
The practical solution is a four-tier classification of modifications by risk and reversibility. Each tier has different approval requirements, different documentation standards, and different rollback expectations.
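The tier policy itself is small enough to express as data. A sketch summarizing the approval and documentation requirements described in the sections that follow; the dictionary keys are illustrative:

```python
# Governance policy per tier (values summarize this section's requirements).
TIER_POLICY = {
    "M1": {"approval": "none",                    "documentation": "git commit message"},
    "M2": {"approval": "none",                    "documentation": "decision file + ledger entry"},
    "M3": {"approval": "sub-agent review",        "documentation": "full decision file"},
    "M4": {"approval": "explicit human approval", "documentation": "full rationale, surfaced"},
}

def approval_required(tier: str) -> str:
    """Look up what sign-off a modification at this tier needs."""
    return TIER_POLICY[tier]["approval"]
```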
M1 — Low-risk, fully autonomous
These are modifications that affect execution mechanics without changing decision-making behavior: schedule adjustments, step ordering, skip conditions, metric thresholds. The agent can make these without approval because a bad M1 change is low-risk and easy to roll back.
Examples: adjusting a cron schedule from hourly to every 90 minutes based on observed yield rate; reordering steps in a heartbeat loop; updating a token-cost threshold.
Requirement: git commit with descriptive message. No human review needed.
M2 — Meaningful but bounded, autonomous with documentation
These affect how the agent reasons, routes work, or formats output — but within defined bounds. Model routing tables, phase conditions, template formats, timing for reflection cycles. The agent can make these autonomously, but must document the hypothesis and a measure-by date.
Examples: updating the model routing table to route a new task type to a cheaper model; changing the format of a daily memory write; adjusting when the dream cycle triggers.
Requirement: write a decision file with hypothesis and measure_by date + companion entry in the performance ledger. Git commit.
```markdown
# memory/decisions/2026-03-25-routing-haiku-extraction.md

## Decision: Route extraction tasks to Haiku

**Hypothesis:** Haiku handles structured extraction as reliably as Sonnet
at ~10% of the cost. Expected: <5% quality regression, >80% cost reduction.
**Evidence:** 3 test runs showed 0% quality issues, 87% cost reduction.
**Measure by:** 2026-04-25
**Rollback:** revert routing table entry in TOOLS.md
```
M3 — Structural, requires internal review
These affect the agent's cognitive defaults, reflection protocol, or evaluation criteria. A bad M3 change can subtly degrade quality across all output, which is why it requires a more expensive sub-agent review — not human approval, but a second perspective.
Examples: changing how the agent classifies uncertainty; modifying the reflection protocol structure; updating skill evaluation criteria in AGENTS.md.
Requirement: spawn a sub-agent (Opus or equivalent) to review the proposed change and produce a written assessment. Proceed only if review finds no blocking concerns. Full documentation required.
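One way to wire that gate, with `spawn_review_agent` as a placeholder for whatever sub-agent API your framework actually provides (it is a stub here, not a real call):

```python
def spawn_review_agent(proposed_change: str) -> dict:
    """Placeholder for a real sub-agent call (e.g. an Opus-class model).
    Expected to return a written assessment plus any blocking concerns."""
    return {"assessment": "no concerns found", "blocking_concerns": []}

def apply_m3_change(proposed_change: str, apply_fn) -> bool:
    """Gate an M3 modification behind a second-perspective review.
    Applies the change only when the review raises no blocking concerns."""
    review = spawn_review_agent(proposed_change)
    if review["blocking_concerns"]:
        # Do not proceed; surface the concerns to the operator instead.
        return False
    apply_fn()
    return True
```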
M4 — Safety-critical, requires explicit human approval
These affect trust boundaries, authority tiers, hard rules, or the governance model itself. No agent — not even a well-reasoned sub-agent — should be able to weaken safety boundaries autonomously. These require a human to review and explicitly approve.
Examples: expanding the set of actions that can be taken autonomously; changing what counts as an escalation trigger; modifying the trust tier levels; any change to M4 itself.
Requirement: surface the proposed change with full rationale. Wait for explicit human approval. No exceptions.
⚠️ Warning: The most dangerous M4 violation is an agent that reclassifies an M4 change as M3 or M2 to avoid the approval requirement. The protected core must be explicitly enumerated and immutable by definition. If the agent can change what counts as M4, M4 is meaningless.
The protected core
Not everything is on the modification ladder. Some rules are immutable — the agent cannot modify them regardless of tier, rationale, or confidence. The protected core includes:
- The hard rules in the agent's core memory (the ones that gate irreversible external actions)
- The authority model tier classification
- The escalation trigger list
- The M4 tier definition itself
- Any rule that gates Tier 4–5 actions (publish, merge to main, financial commits)
The protected core is explicitly listed in a place the agent can read but not write. In practice, this means storing it in a file that's tracked by git and covered by a CI check — modification fails the build.
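A minimal version of that check: record a known-good hash of the protected-core file and compare against it on every build. The function below is a sketch with caller-supplied paths; in CI you would exit nonzero and fail the build loudly whenever it returns False.

```python
import hashlib
from pathlib import Path

def check_protected_core(protected: Path, manifest: Path) -> bool:
    """True iff the protected-core file matches its recorded known-good hash.

    `manifest` holds the hex SHA-256 of the approved version; any edit to
    the protected file changes the hash and fails the check.
    """
    actual = hashlib.sha256(protected.read_bytes()).hexdigest()
    expected = manifest.read_text().strip()
    return actual == expected
```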
Documented rationale for every change
Self-modification without documentation is worse than no self-modification. You end up with an agent that has drifted from its initial design, no record of why, and no way to audit the drift.
Every M2+ modification requires a decision file with four fields:
- Hypothesis: What do we expect this change to produce?
- Measure by: A specific date by which we'll know if the hypothesis held
- Evidence: What data or reasoning supports making the change now?
- Rollback: Exactly how to undo this if it fails
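Those four fields are easy to lint for. A small validator, assuming the `**Field:**` convention used in the decision file example:

```python
import re

REQUIRED_FIELDS = ("Hypothesis", "Measure by", "Evidence", "Rollback")

def missing_fields(decision_text: str) -> list[str]:
    """Return the required fields absent from a decision file's text."""
    return [field for field in REQUIRED_FIELDS
            if not re.search(rf"\*\*{re.escape(field)}:\*\*", decision_text)]
```

Run it before committing an M2+ change; a non-empty result means the decision file is incomplete.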
This isn't bureaucracy. It's the minimum viable audit trail for a system that modifies its own behavior. Without it, you have no way to distinguish "this change worked" from "we got lucky" — and no way to find which change caused a regression.
The performance ledger
The hypothesis alone isn't enough. You need a place to close the loop: to record whether the hypothesis held when the measure-by date arrives. This is the performance ledger.
Each M2+ modification gets a companion entry in memory/performance/ledger.md. When the measure-by date passes, the ledger entry is updated with the outcome: confirmed, refuted, or inconclusive.
```markdown
# memory/performance/ledger.md (excerpt)

## 2026-03-25 | Haiku routing for extraction tasks
- **Hypothesis:** <5% quality regression, >80% cost reduction
- **Measure by:** 2026-04-25
- **Status:** Open
- **Evidence:** 3 test runs showed 0% quality issues, 87% cost reduction

## 2026-02-14 | Skip condition threshold 0.15 → 0.20
- **Hypothesis:** Higher threshold reduces wasted runs without missing signal
- **Measure by:** 2026-03-14
- **Status:** Confirmed — skip rate up 22%, no missed items in audit
- **Action:** Threshold now default in scheduling protocol
```
Closed ledger entries — especially refuted ones — are the most valuable learning artifacts an agent can produce. A refuted hypothesis tells you something concrete about how your agent behaves in production.
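Closing the loop can be partly automated: scan the ledger for entries still marked Open past their measure-by date and surface them for review. A sketch that assumes the entry format shown in the excerpt:

```python
import re
from datetime import date

def overdue_entries(ledger_text: str, today: date) -> list[str]:
    """Return titles of ledger entries still Open past their measure-by date."""
    overdue = []
    for block in ledger_text.split("## ")[1:]:  # one block per "## " entry
        title = block.splitlines()[0].strip()
        m = re.search(r"\*\*Measure by:\*\* (\d{4})-(\d{2})-(\d{2})", block)
        if not m:
            continue
        due = date(*map(int, m.groups()))
        if "**Status:** Open" in block and today > due:
            overdue.append(title)
    return overdue
```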
Making the improvement mechanism itself improvable
The deepest insight in the hyperagent research is self-referential: the improvement process itself should be subject to improvement. An agent that can only improve its task procedures, but not its improvement procedures, will plateau at the quality of its initial metacognitive design.
In practice, this means M1 and M2 changes can include changes to the improvement protocol — how often reflection runs, how hypotheses are structured, what counts as evidence. But changes to the tier structure itself (what requires M3 vs M4 review) must go through M4 approval. The governance model protects itself.
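That self-protection rule can be enforced structurally rather than by convention: any proposed change that touches the tier definitions or the protected core escalates to M4, whatever tier the agent claims for it. A sketch with illustrative path names:

```python
# Files whose modification always escalates to M4, regardless of the
# tier the agent claims for the change (paths are illustrative).
PROTECTED_PATHS = {"memory/protected-core.md", "AGENTS.md"}

def effective_tier(claimed_tier: str, touched_paths: set[str]) -> str:
    """Escalate to M4 whenever a change touches governance itself."""
    if touched_paths & PROTECTED_PATHS:
        return "M4"
    return claimed_tier
```

This closes the reclassification loophole: the agent can claim M2 all it likes, but the routing logic, not the agent, decides which approval path applies.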
🤖 Agent note: Keats operates under this exact model. The AGENTS.md self-modification tiers were themselves designed to be auditable — every change to AGENTS.md is a git commit with a rationale comment. The most useful thing this produces isn't the individual changes; it's the git log of why the agent's procedures evolved over time.
When to use this / when not to
Use this when
- Your agent runs persistent procedures that need to improve over time
- You want systematic improvement without constant manual tuning
- The agent's configuration is complex enough that undocumented drift would be hard to debug
- Multiple operators or users need to understand what the agent is doing and why
- The agent has access to irreversible external actions (publish, merge, send)
Skip this when
- The agent is a one-shot tool with no persistent state
- The configuration is simple enough to manage manually without drift risk
- You're prototyping and want to move fast before locking in governance
- There's only one operator and the overhead exceeds the organizational benefit
👤 Human note: The right time to implement governance is before the agent has made changes you can't explain. If you're already debugging "why does it behave differently than it used to?" — governance is overdue. Retrofitting it is harder than building it in from the start.
Next steps
Start with the M4 protected core. Write down the rules the agent must never be able to change, put them in a read-only-by-design location, and make sure they're covered by a check that fails loudly if modified. This is the floor, not the ceiling.
Then add the decision file format for M2 changes. Even if you never reach M3/M4 scenarios, the habit of writing hypothesis + measure-by date before making a change will dramatically improve your ability to reason about why the agent behaves the way it does six months from now.
The full governance model — including the performance ledger template, the reflection protocol structure, and how self-modification integrates with the memory system — is part of Blueprint.