the agile monkeys.

how we build software now · engineering report

AI agents carry every change from signal to production; people decide what ships. This report describes the system we run on our own products and what eleven months of measurement shows.

summary

0%

of agent pull requests accepted by human reviewers

0%

merged one-shot: no change requests, no fixup commits

0

production reverts in eleven months

0h

median time from issue to merged code

Collected by the system's own outcome collector and refreshed nightly. Accepted means a human squash-merged it; agents cannot merge.

why we built this

Five questions kept coming up.
From boards, security reviews, and engineering leads.

01

Generation is fast. Review is not.

Code arrives in minutes, then waits days for someone willing to take responsibility for it. The bottleneck moved; it didn't disappear.

02

Who is accountable for a merge?

If an agent ships a bad change, someone's name has to be on it. Boards ask this question first.

03

Can security review keep pace?

Agents produce more code than any application-security team can read. Rules either hold mechanically or they don't hold.

04

What does a shipped change cost?

Tokens, seats, runners, retries. Few teams can price a single merged change, so spend stays unmanaged.

05

What is an AI engineer?

The title is on every CV but does not yet describe the work. Hiring needs a sharper definition of the roles.

The system below is our working answer.

It has run our own products since August 2025. The rest of this page describes what it does and what it measures.

the ai sdlc

One change, end to end.

This is the AI SDLC we run on our own products. The walkthrough follows a single change through the loop.

the self-observation layer

The SDLC watches itself.

The models are interchangeable; the harness is not. Most of our engineering went into the layer that verifies, measures, and repairs the system around the agents.

every run

Run receipts

Cost, tokens, duration and check outcomes — recorded by a tamper-evident collector after every agent job, failures included. Never prompts. Never your code.
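A minimal sketch of what a run receipt and a tamper-evident log could look like. The report does not say how tamper evidence is achieved; the hash chain here is an assumption, and `RunReceipt`, `append_receipt`, `verify_log` and all field names are hypothetical. Each entry's hash covers the previous hash, so editing any earlier receipt invalidates everything after it.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

GENESIS = "0" * 64  # hypothetical starting value for the chain


@dataclass(frozen=True)
class RunReceipt:
    # Hypothetical fields mirroring "cost, tokens, duration and check outcomes".
    # No prompt text and no source code is stored.
    run_id: str
    agent: str
    cost_usd: float
    tokens: int
    duration_s: float
    checks_passed: bool


def chain_hash(prev_hash: str, receipt: RunReceipt) -> str:
    """Hash the receipt together with the previous entry's hash."""
    payload = json.dumps(asdict(receipt), sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_receipt(log: list, receipt: RunReceipt) -> None:
    """Append a receipt, linking it to the last entry in the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"receipt": asdict(receipt), "hash": chain_hash(prev, receipt)})


def verify_log(log: list) -> bool:
    """Recompute the chain; any edited receipt breaks it from that point on."""
    prev = GENESIS
    for entry in log:
        receipt = RunReceipt(**entry["receipt"])
        if chain_hash(prev, receipt) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

Failed runs are appended the same way as successful ones, with `checks_passed=False`, so the log stays complete.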

nightly

Outcome collector

Joins every agent PR with its fate: merged or rejected, time to merge, review rounds, and human commits needed after the agent's. Acceptance is a metric, not an anecdote.
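The join described here can be sketched as follows. This is an illustration under assumptions, not the collector itself: the record types `AgentPR`, `PRFate` and `Outcome`, their fields, and the choice to skip PRs that are still open are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AgentPR:
    number: int
    lane: str
    opened_at: datetime


@dataclass(frozen=True)
class PRFate:
    number: int
    merged: bool          # False means the PR was rejected (closed unmerged)
    closed_at: datetime
    review_rounds: int
    human_fixup_commits: int


@dataclass(frozen=True)
class Outcome:
    number: int
    lane: str
    merged: bool
    hours_to_close: float
    review_rounds: int
    human_fixup_commits: int


def join_outcomes(prs: list[AgentPR], fates: list[PRFate]) -> list[Outcome]:
    """Pair each agent PR with its fate; PRs without a fate are still open."""
    fate_by_number = {fate.number: fate for fate in fates}
    outcomes = []
    for pr in prs:
        fate = fate_by_number.get(pr.number)
        if fate is None:
            continue  # assumption: open PRs are picked up by a later nightly run
        hours = (fate.closed_at - pr.opened_at).total_seconds() / 3600
        outcomes.append(
            Outcome(pr.number, pr.lane, fate.merged, hours,
                    fate.review_rounds, fate.human_fixup_commits)
        )
    return outcomes
```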

daily

Health monitor

Classifies failures into machine vs agent, checks spend against per-agent budgets, and manages its own incident issue — opens it, updates it, closes it on recovery.
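The three duties of the monitor can be sketched as small pure functions. Everything below is hypothetical: the keyword list used to tell machine failures from agent failures, the function names, and the rule that an agent with no configured budget counts as over budget as soon as it spends anything.

```python
from enum import Enum


class FailureKind(Enum):
    MACHINE = "machine"  # infrastructure problem: runner, network, rate limit
    AGENT = "agent"      # the agent's own output failed a check


# Hypothetical markers; a real classifier would be tuned on observed failures.
MACHINE_MARKERS = ("timeout", "runner", "network", "rate limit")


def classify_failure(message: str) -> FailureKind:
    """Keyword-based stand-in for the machine-vs-agent classification."""
    text = message.lower()
    if any(marker in text for marker in MACHINE_MARKERS):
        return FailureKind.MACHINE
    return FailureKind.AGENT


def over_budget(spend: dict[str, float], budget: dict[str, float]) -> list[str]:
    """Return agents whose spend exceeds their budget (missing budget = 0)."""
    return sorted(a for a, s in spend.items() if s > budget.get(a, 0.0))


def next_incident_action(incident_open: bool, healthy: bool) -> str:
    """Decide what to do with the monitor's single incident issue."""
    if not healthy:
        return "update" if incident_open else "open"
    return "close" if incident_open else "none"
```

Keeping the incident logic as a function of two booleans makes the open, update, close cycle easy to test without touching an issue tracker.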

weekly

The canary

A synthetic probe runs the real delivery loop end-to-end — authorization, sandbox, agent, reply, receipt — and goes red before any human ever hits the breakage.
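A rough sketch of such a probe, assuming each stage of the loop can be exercised by a callable that returns `True` on success. The stage names come from the text; `run_canary`, the probe callables, and the rule that a missing or raising probe counts as a failure are assumptions.

```python
from typing import Callable, Dict, Optional, Tuple

# Stage order as listed in the text.
STAGES = ("authorization", "sandbox", "agent", "reply", "receipt")


def run_canary(probes: Dict[str, Callable[[], bool]]) -> Tuple[str, Optional[str]]:
    """Run every stage in order and stop at the first one that fails.

    Returns ("green", None) when all stages pass, otherwise
    ("red", <name of the failing stage>).
    """
    for stage in STAGES:
        probe = probes.get(stage)
        try:
            # A stage with no probe is treated as broken, not skipped.
            ok = probe is not None and bool(probe())
        except Exception:
            ok = False
        if not ok:
            return ("red", stage)
    return ("green", None)
```

Reporting the first failing stage, rather than only red or green, points whoever responds at the broken part of the loop.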

the ratchet

Failures are clustered weekly; any pattern that recurs becomes a proposed deterministic guard. The guard set only grows.
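One way to picture the ratchet, as a sketch only. Grouping failures by a normalized message signature is a simple stand-in for whatever clustering is actually used; `failure_signature`, `propose_guards`, `ratchet` and the `min_count` threshold are hypothetical.

```python
import re
from collections import Counter
from typing import FrozenSet, Iterable, Set


def failure_signature(message: str) -> str:
    """Normalize a failure message so similar failures share one signature."""
    text = re.sub(r"\d+", "N", message.lower())   # mask run-specific numbers
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace


def propose_guards(messages: Iterable[str],
                   guards: FrozenSet[str],
                   min_count: int = 2) -> Set[str]:
    """Return signatures that recurred this week and have no guard yet."""
    counts = Counter(failure_signature(m) for m in messages)
    return {sig for sig, n in counts.items() if n >= min_count and sig not in guards}


def ratchet(guards: FrozenSet[str], accepted: Iterable[str]) -> FrozenSet[str]:
    """Add accepted proposals; nothing is ever removed, so the set only grows."""
    return guards | frozenset(accepted)
```

Proposals and acceptance are kept separate because the text says a recurring pattern becomes a proposed guard, which leaves room for review before it joins the set.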

proof · the canary's first flight

outcomes · full history · aug 2025 – jul 2026

The record, with definitions.

Accepted means a human reviewer squash-merged the pull request; agents cannot merge. One-shot means merged with zero change requests and zero human fixup commits.
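These definitions translate directly into code. The sketch below assumes both rates use all closed agent pull requests as the denominator and that the median is taken over merged pull requests only; the report does not state either, and `ClosedPR` and its fields are hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass(frozen=True)
class ClosedPR:
    merged_by_human: bool                 # "accepted": a human squash-merged it
    change_requests: int
    human_fixup_commits: int
    hours_issue_to_merge: Optional[float] = None  # None when rejected


def is_one_shot(pr: ClosedPR) -> bool:
    """Merged with zero change requests and zero human fixup commits."""
    return pr.merged_by_human and pr.change_requests == 0 and pr.human_fixup_commits == 0


def acceptance_rate(prs: List[ClosedPR]) -> float:
    if not prs:
        raise ValueError("no closed pull requests to measure")
    return sum(pr.merged_by_human for pr in prs) / len(prs)


def one_shot_rate(prs: List[ClosedPR]) -> float:
    if not prs:
        raise ValueError("no closed pull requests to measure")
    return sum(is_one_shot(pr) for pr in prs) / len(prs)


def median_hours_to_merge(prs: List[ClosedPR]) -> float:
    """Median issue-to-merge time over merged pull requests only."""
    hours = [pr.hours_issue_to_merge for pr in prs
             if pr.merged_by_human and pr.hours_issue_to_merge is not None]
    return median(hours)
```

Under these assumptions a one-shot merge is always also an accepted one, so the one-shot rate can never exceed the acceptance rate.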

Claude Code lane

acceptance 86%

Codex lane

acceptance 100%

GitHub Copilot — trialed, retired

acceptance 60%

0%

merged one-shot: no change requests, no fixup commits

0h

median hours from issue to merge, agent lanes

0

deterministic guards; the set only grows

0%

of pull requests AI-reviewed and guard-checked, human-authored ones included

Scope note: fully autonomous agent runs produced about 18% of shipped work so far, a share that grows by design as trust compounds. The rest was written by engineers working with agents. Only the autonomous lane is counted in the acceptance figures above. Assisted work, however large, is excluded.

who runs this

When agents carry implementation, job titles stop describing the work. What we observe instead are five shapes of work, and each runs on the loop above.

01

the prototyper

ships questions

Runs new ideas knowing most won't survive. The system makes each experiment cheap enough to discard without ceremony.

runs hardest on

intake · planning

02

the builder

makes it real

Turns a validated idea into production-grade product and infrastructure without skipping a gate on the way.

runs hardest on

build · defense in depth

03

the optimizer

removes weight

Simplifies the system, unships what earns nothing, tunes performance. Reads run receipts the way a CFO reads costs.

runs hardest on

run receipts · the guards

04

the grower

compounds usage

Iterates a shipped product toward fit, steered by usage data flowing through intake rather than by opinion.

runs hardest on

intake signals · pr outcomes

05

the maintainer

keeps it true

Owns a mature system's security, reliability and cost as it scales. The self-observation layer is their instrument panel.

runs hardest on

self-observation · failures → guards

Most people span two of these. Some span three. None of them is a job title — we've seen designers in the first column and in the third.

Inside this system they share one duty: each archetype maintains the part of the system it depends on. The prototyper keeps intake honest, the optimizer keeps the guards sharp, the maintainer keeps the measurement layer true. That is how five different sets of priorities ship through one loop.

what comes next

The AI SDLC described on this page — the gates, the guards, the receipts, the self-observation layer — is being packaged as an open-source system that other teams can run. Leave your details if you want access when it is ready. One email when it ships; one before, if you want to be part of the early group.

Stored privately. Used for exactly this. Never sold, never shared.