A previous post described what a day looks like when your engineering system runs mostly on its own. This one walks through how to reason about the structure required to drive the behaviors we want.
Start with a simple concept: the learning organization. Here is a system dynamics stock-and-flow diagram.
In the simplest terms, learning begets learning:
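The dynamic behind the diagram can be sketched in a few lines. This is a toy simulation, assuming a single stock and a fixed learning rate (both names are illustrative, not from the diagram):

```python
def simulate_reinforcing_loop(stock: float, rate: float, steps: int) -> list[float]:
    """A reinforcing loop: each step's inflow is proportional to the
    stock itself, so the more the org has learned, the faster it learns."""
    history = [stock]
    for _ in range(steps):
        stock += rate * stock  # inflow compounds on the existing stock
        history.append(stock)
    return history

# Ten steps at a 10% rate: growth is exponential, not linear.
growth = simulate_reinforcing_loop(stock=1.0, rate=0.1, steps=10)
```

The point of the sketch: with no balancing loop attached, the stock only accelerates.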

What we’re actually tracking
Most engineering metrics measure what the system produced: PRs merged, tickets closed, features shipped. Those are fine for tracking work. They’re less useful for improving the system that does the work. For that, I want to use and augment the DORA metrics that served my teams well in the cloud services space.
Every metric we care about maps to one of the two loop types:
- Balancing loops create friction and slow things down; their metric targets always point down.
- Reinforcing loops create compounding improvement; their metric targets always point up.
Here’s what that looks like in practice:
| KPI | What it measures | Goal |
| --- | --- | --- |
| Interrupt Rate | Interrupts per agent-task | ↓ |
| Autonomous Completion Rate | Tasks completed with zero interrupts | ↑ |
| Mean Time to Correct (MTTC) | Time from interrupt to human response | ↓ |
| Context Coverage Score | % of interrupt categories with a structural fix | ↑ |
| Feedback-to-Demo Cycle Time * | Time from /feedback to working demo | ↓ |
An “interrupt” is our name for the moment an agent has to stop and ask a human to make a decision because the system doesn’t have what it needs. The system encountered something it wasn’t prepared for because something (a decision record, a policy rule, a skill) was missing from its context. Our engineering response is to drive that number toward zero by patching the prompt so that class of problem never recurs.
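The first two KPIs in the table fall straight out of a task log. A minimal sketch, assuming a hypothetical `TaskRecord` shape for per-task interrupt data:

```python
from dataclasses import dataclass, field

@dataclass
class TaskRecord:
    task_id: str
    interrupts: list[str] = field(default_factory=list)  # interrupt categories that fired

def interrupt_rate(tasks: list[TaskRecord]) -> float:
    """Interrupts per agent-task (target: down)."""
    return sum(len(t.interrupts) for t in tasks) / len(tasks)

def autonomous_completion_rate(tasks: list[TaskRecord]) -> float:
    """Share of tasks completed with zero interrupts (target: up)."""
    return sum(1 for t in tasks if not t.interrupts) / len(tasks)

log = [
    TaskRecord("t1"),
    TaskRecord("t2", interrupts=["missing-context"]),
    TaskRecord("t3", interrupts=["policy-gap", "missing-context"]),
    TaskRecord("t4"),
]
```

The two numbers move together but are not redundant: one burst of interrupts on a single task moves the rate without touching the completion share.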
The mechanism that makes this work: interrupt triage
The single highest-leverage thing we do is triage interrupts by category and assign each category a structural fix.
The logic: if the same type of interrupt keeps firing, the system is missing something it should know. The fix isn’t to resolve the individual interrupt faster. The fix is to give the system what it needs, along with a way to self-serve discovery of data it doesn’t yet have, so that interrupt never fires again.
Three categories, three fix types:
- Missing context → Write an ADR (Architectural Decision Record)
- Policy gap → Write a new rule into the system’s constitution
- Tool failure → Fix the skill, or create a new one
Once you have the fix, that whole category of interrupt disappears from the queue. The system learned something. Context Coverage Score goes up. Interrupt Rate goes down.
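The triage step itself is almost mechanical once the categories exist. A minimal sketch, with hypothetical category and fix-type names based on the three pairs above:

```python
# Map interrupt categories to structural fix types (the three pairs above).
FIX_TYPE = {
    "missing-context": "write-adr",
    "policy-gap": "constitution-rule",
    "tool-failure": "skill-fix",
}

def triage(interrupt_counts: dict[str, int]) -> list[tuple[str, str]]:
    """Rank interrupt categories by frequency and pair each with its
    structural fix type; the top of the list is the highest leverage."""
    ranked = sorted(interrupt_counts.items(), key=lambda kv: -kv[1])
    return [(cat, FIX_TYPE.get(cat, "needs-classification")) for cat, _ in ranked]
```

Ranking by frequency means the weekly meeting’s “top three categories” falls out of the same function.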
Loops in the wild
Hey, the secret is out.

For many decades, systems thinkers have applied stock-and-flow concepts and balancing and reinforcing loops to design resilient systems and processes that self-correct and teach themselves. What’s changed is that these loops are now tight, virtually eliminating the DELAY box.

Our existing employees and systems were built to handle this DELAY, and in some cases rely on its existence. The delay box is also what made code itself a precious resource: it takes time to develop software. Now the delays are gone, the communication layers have collapsed, and we are left to figure out which systems we *actually* need. These concepts let us build systems that autonomously converge on user cost/performance targets while meeting SLOs.
Examples of loops in GHA
Here are three patterns from our reference library, each illustrating one of the loop types above.
Issue-to-PR: a reinforcing loop
An issue gets labeled ready-for-pr. A GitHub Actions workflow triggers: an agent reads the spec, implements the fix, and opens a draft PR. The loop: more issues get filed with confidence because engineers know a clear spec goes straight into development, and that confidence is a very human input. More issues filed means the backlog stays honest and throughput compounds. The loop runs faster the more it’s used.
Multi-agent code review: a reinforcing loop
When a PR opens, multiple specialized agents run in parallel: one for security, one for bugs, one for code quality. Critical issues get auto-fixed before a human ever sees the diff. The loop: code that reaches human review is already reviewed, refined, and refactored, so reviews are faster and can be done by agents. Trust in the agentic output goes up, and engineers unblock agents sooner. Higher throughput generates more PRs, which generates more review data, which improves agent calibration. Each cycle tightens the next.
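The fan-out/fan-in shape of that review is easy to sketch. The reviewer functions here are toy stand-ins, not the real agents; only the parallel-then-merge structure is the point:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in reviewers; each returns a list of findings.
def security_review(diff: str) -> list[str]:
    return ["hardcoded secret"] if "password=" in diff else []

def bug_review(diff: str) -> list[str]:
    return ["bare except"] if "except:" in diff else []

def quality_review(diff: str) -> list[str]:
    return []

def review_pr(diff: str) -> dict[str, list[str]]:
    """Run the specialized reviewers in parallel and merge their findings,
    so the human sees one consolidated report instead of three passes."""
    reviewers = {"security": security_review, "bugs": bug_review, "quality": quality_review}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, diff) for name, fn in reviewers.items()}
        return {name: f.result() for name, f in futures.items()}
```

In the real system each reviewer is an agent call, and critical findings feed an auto-fix step before the report is surfaced.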
Self-review reflection: eliminating a balancing loop
Without it, the cycle looks like this: agent produces code, human reviews, human redlines, agent fixes, human reviews again. That back-and-forth is a governor. It caps throughput at whatever speed a human can turn around feedback, and it normalizes multiple rounds as the expected cost of agentic development.
Self-review reflection removes that governor before it forms. The agent re-examines its own output as a reviewer would: checking edge cases, security gaps, incomplete reasoning, and fixes what it finds before presenting anything. The human sees polished work on the first pass. The balancing loop never runs.
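The reflection step can be sketched as a small loop: critique your own draft, fix what the critique finds, stop when the critique comes back clean. Here `critique` and `fix` are stand-ins for model calls:

```python
def reflect_and_fix(draft: str, critique, fix, max_rounds: int = 3) -> str:
    """Self-review before presenting: critique the draft, apply fixes,
    and repeat until the critique comes back clean or the round budget
    runs out. The human only ever sees the final draft."""
    for _ in range(max_rounds):
        issues = critique(draft)
        if not issues:
            break  # clean self-review: the balancing loop never runs
        draft = fix(draft, issues)
    return draft
```

The round budget matters: without it, a miscalibrated critic turns the reflection itself into an unbounded loop.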
Correction-to-PR: closing the loop on prompt bugs
Here’s what interrupt triage looks like when it’s fully automated:

A user corrects an agent during a chat session. The system logs the correction to a monitoring system. A separate process reviews the correction log and identifies a pattern: in this case, the agent had been implementing React Query polling logic but consistently missing the Error terminal state. It determines that the gap is structural. Not a one-off mistake. A missing prompt.
The system opens a PR. Pull request #51 in the workflows repo: “fix(bugfix-workflow): add completeness guidance to prevent incomplete fixes.” Authored by ambient-code[bot]. Merged by a human. Eighteen lines added across four files.
The interrupt category: incomplete implementation when fixing bugs involving state-dependent logic. The fix type: workflow update… a policy addition to the bugfix workflow. The correction fired three times in a single session, and it will not fire again. We hope.
This is the loop closing. User correction → Monitoring log → Automated Review → PR → merged → gone. No meeting required. The system evolved itself.
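The pattern-detection step in that pipeline is the part worth making explicit: a correction that recurs within a session is treated as structural, not incidental. A minimal sketch, with a hypothetical threshold of three to match the example above:

```python
from collections import Counter

def structural_gaps(corrections: list[str], threshold: int = 3) -> list[str]:
    """A correction category that fires repeatedly in one session is not
    a one-off mistake; flag it as a structural gap worth a workflow PR."""
    counts = Counter(corrections)
    return [cat for cat, n in counts.items() if n >= threshold]
```

Anything below the threshold stays in the log as noise; anything at or above it becomes a candidate PR for the review process to open.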
The meeting
We’ve designed (but haven’t yet run) a 30-minute weekly meeting structured entirely around “what do we triage now?”. The numbers below are estimates on what the first few weeks will show. We’ll publish real data once we have it. The design choices in this framework are worth explaining before results validate or revise them.
The agenda:
| Time | Topic | Focus |
| --- | --- | --- |
| 0:00 | Dashboard review | Deltas only. No storytelling. |
| 0:05 | Interrupt triage | Top three categories. Owner and fix type for each. |
| 0:15 | Loop state check | 2-3 loops. Absent, Present, or Throttled. |
| 0:22 | Highest-leverage intervention | One fix. That’s it. |
| 0:27 | Constitution/ADR updates | Review what shipped or needs review for agent memory. |
“Throttled” means a loop exists and is running, but something external is capping its speed. Left unaddressed, throttled loops quietly calcify into permanent ceilings. An escalation rule forces the team to name the constraint before it becomes invisible. API rate limits on a provider are a typical example: you can only go so fast, because something outside the loop caps your speed.
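For the loop state check, a sketch of the three-way classification. The signature is hypothetical; the useful part is that “Throttled” requires the external cap to be named explicitly rather than inferred:

```python
from typing import Optional

def loop_state(runs_per_week: int, external_cap: Optional[int]) -> str:
    """Classify a loop for the weekly check. 'Throttled' means the loop
    runs, but a named external constraint (e.g. a provider rate limit)
    is binding at current throughput."""
    if runs_per_week == 0:
        return "Absent"
    if external_cap is not None and runs_per_week >= external_cap:
        return "Throttled"
    return "Present"
```

Forcing `external_cap` to be an explicit argument is the escalation rule in miniature: you cannot call a loop throttled without writing down what is throttling it.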
What we expect this to look like
We’re guessing the first few weeks of interrupt data will look something like this:
- Majority from missing architectural, process, or business context.
  - Engineering response: we need a low-latency data layer that gets the right agents the right (optimized) data at the right time.
  - Our teams are still gathering, or writing down, how they want the system to work. A Spec-kit constitution is a good approach to gaining a foothold.
- Roughly half from ambiguous specs.
  - This is what this blog is focused on. Prompting is an art and we want it to be a science. The technology is probabilistic, so we need loops. We can turn art into science through continuous learning.
If that holds, the first high-leverage intervention might be: auto-generate ADR stubs from interrupt logs so that missing context gets captured before it fires again. We have these same flywheels going for skills, too. Skills are optimistically generated based on a perceived gap in the LLM’s understanding. Expected impact: a meaningful drop in interrupt rate week over week.
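If ADR-stub generation turns out to be the first intervention, the stub itself doesn’t need to be clever. A sketch of what the generator could emit (the format and field names are assumptions, not our actual template):

```python
def adr_stub(category: str, example_interrupt: str) -> str:
    """Turn an interrupt log entry into an ADR stub, so the missing
    context gets captured before the same category fires again."""
    return (
        f"# ADR: decision needed for '{category}'\n"
        f"## Context\nAgent interrupted: {example_interrupt}\n"
        f"## Decision\nTBD (fill in during triage)\n"
        f"## Status\nDraft\n"
    )
```

The stub is deliberately mostly empty: the automation captures the evidence, and the triage meeting supplies the decision.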
We don’t know yet. We’ll find out.
Why one intervention per week
There’s a temptation, once you have a framework like this, to fix everything at once. The interrupt triage table shows you exactly where the system is failing. It feels wrong to pick one thing and leave the rest.
But the meeting commits to exactly one fix. Here’s why:
The system improves by learning. Learning happens when you make a change, observe the effect, and update your model. If you ship five fixes at once, you can’t attribute the resulting change in interrupt rate to any of them. The signal disappears. You’ve made the system better at the expense of understanding how.
One fix per week is a forcing function for causality. It keeps the feedback loop tight. And over time, tight feedback loops compound in a way that scattered, high-volume changes don’t. Maybe a week is too long or too short. We will see.
The thing this framework is actually doing
Loop tightening is itself a reinforcing loop.
Every week the team does this: interrupts go down, autonomous completion goes up, context coverage improves. A more capable system handles more tasks. More tasks generate more data. Better data produces more targeted fixes. The loops are spinning and that’s the job.