how we build software now · engineering report
AI agents carry every change from signal to production; people decide what ships. This report describes the system we run on our own products and what eleven months of measurement shows.
summary
of agent pull requests accepted by human reviewers
merged one-shot: no change requests, no fixup commits
production reverts in eleven months
median time from issue to merged code
Collected by the system's own outcome collector and refreshed nightly. Accepted means a human squash-merged it; agents cannot merge.
why we built this
Five questions kept coming up.
From boards, security reviews, and engineering leads.
01
Generation is fast. Review is not.
Code arrives in minutes, then waits days for someone willing to take responsibility for it. The bottleneck moved; it didn't disappear.
02
Who is accountable for a merge?
If an agent ships a bad change, someone's name has to be on it. Boards ask this question first.
03
Can security review keep pace?
Agents produce more code than any application-security team can read. Rules either hold mechanically or they don't hold.
04
What does a shipped change cost?
Tokens, seats, runners, retries. Few teams can price a single merged change, so spend stays unmanaged.
05
What is an AI engineer?
The title is on every CV but doesn't yet describe the work. Hiring needs a sharper definition of the roles.
The system below is our working answer.
It has run our own products since August 2025. The rest of this page describes what it does and what it measures.
the ai sdlc
One change, end to end.
This is the AI SDLC we run on our own products. The walkthrough follows a single change through the loop.
the self-observation layer
The SDLC watches itself.
The models are interchangeable; the harness is not. Most of our engineering went into the layer that verifies, measures, and repairs the system around the agents.
every run
Run receipts
Cost, tokens, duration and check outcomes — recorded by a tamper-evident collector after every agent job, failures included. Never prompts. Never your code.
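One common way to make a receipt log tamper-evident is a hash chain, where each entry's hash covers the entry before it. The sketch below assumes that technique; the report does not say how the collector achieves it. The `Receipt` fields, `GENESIS`, and the function names are all hypothetical, modelled only on the quantities the text lists.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

GENESIS = "0" * 64  # hypothetical starting value for the chain


@dataclass(frozen=True)
class Receipt:
    # Hypothetical fields: cost, tokens, duration, check outcome.
    # No prompt text and no source code are stored.
    job_id: str
    cost_usd: float
    tokens: int
    duration_s: float
    checks_passed: bool


def entry_hash(prev_hash: str, receipt: Receipt) -> str:
    """Hash the receipt together with the previous entry's hash."""
    payload = json.dumps(asdict(receipt), sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list, receipt: Receipt) -> None:
    """Append a receipt, chaining it to the last stored hash."""
    prev = log[-1][1] if log else GENESIS
    log.append((receipt, entry_hash(prev, receipt)))


def verify(log: list) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = GENESIS
    for receipt, stored in log:
        if entry_hash(prev, receipt) != stored:
            return False
        prev = stored
    return True
```

Failed jobs are appended like any other, so a gap or an edited cost figure shows up as a broken chain rather than a silently missing row.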
nightly
Outcome collector
Joins every agent PR with its fate: merged or rejected, time to merge, review rounds, human commits needed after the bot's. Acceptance is a metric, not an anecdote.
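As an illustration, the metrics can be computed from one record per agent pull request. This is a minimal sketch, not the collector itself: the `PullRequestOutcome` shape and field names are hypothetical, and it assumes both rates use all agent pull requests as the denominator, which the report does not specify.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class PullRequestOutcome:
    # Hypothetical record shape; field names are illustrative.
    lane: str
    merged_by_human: bool              # "accepted": a human squash-merged it
    change_requests: int               # review rounds that asked for changes
    human_fixup_commits: int           # human commits added after the agent's
    hours_issue_to_merge: Optional[float]  # None when the PR was rejected


def is_one_shot(pr: PullRequestOutcome) -> bool:
    """Merged with zero change requests and zero human fixup commits."""
    return (
        pr.merged_by_human
        and pr.change_requests == 0
        and pr.human_fixup_commits == 0
    )


def summarize(outcomes: list) -> dict:
    """Acceptance rate, one-shot rate, and median hours to merge."""
    total = len(outcomes)
    accepted = [pr for pr in outcomes if pr.merged_by_human]
    one_shot = [pr for pr in accepted if is_one_shot(pr)]
    hours = [
        pr.hours_issue_to_merge
        for pr in accepted
        if pr.hours_issue_to_merge is not None
    ]
    return {
        # Assumption: both rates are shares of all agent pull requests.
        "acceptance": len(accepted) / total if total else 0.0,
        "one_shot": len(one_shot) / total if total else 0.0,
        "median_hours": median(hours) if hours else None,
    }
```

Rejected pull requests stay in the denominator, which is what keeps acceptance a measured rate instead of a selection of successes.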
daily
Health monitor
Classifies each failure as machine-caused or agent-caused, checks spend against per-agent budgets, and manages its own incident issue: opens it, updates it, and closes it on recovery.
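The three duties of the monitor can be sketched as three small functions. Everything named here is an assumption for illustration: the failure codes in `MACHINE_CODES`, the `(agent, cost)` receipt pairs, and the action names are hypothetical, and the real classifier is likely richer than a set lookup.

```python
from collections import defaultdict

# Hypothetical failure codes treated as infrastructure ("machine") problems.
MACHINE_CODES = {"runner_timeout", "network_error", "sandbox_unavailable"}


def classify(failure_code: str) -> str:
    """Attribute a failure to the machine or to the agent."""
    return "machine" if failure_code in MACHINE_CODES else "agent"


def over_budget(receipts, budgets):
    """Return agents whose summed cost exceeds their budget.

    `receipts` is an iterable of (agent, cost) pairs. Agents without a
    configured budget are never flagged (an assumption of this sketch).
    """
    spend = defaultdict(float)
    for agent, cost in receipts:
        spend[agent] += cost
    return sorted(
        agent
        for agent, total in spend.items()
        if total > budgets.get(agent, float("inf"))
    )


def next_incident_action(incident_open: bool, healthy: bool) -> str:
    """Decide what to do with the monitor's single incident issue."""
    if not healthy:
        return "update" if incident_open else "open"
    return "close" if incident_open else "none"
```

Keeping the incident logic to one issue with four possible actions means a recovery closes the same issue the failure opened, with no human triage in between.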
weekly
The canary
A synthetic probe runs the real delivery loop end-to-end — authorization, sandbox, agent, reply, receipt — and goes red before any human ever hits the breakage.
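A probe of this kind can be sketched as an ordered list of steps that stops at the first failure. The step names come from the text; the `probes` mapping of callables and the "red"/"green" return values are hypothetical stand-ins for whatever the real canary calls and reports.

```python
# Step order taken from the description of the delivery loop.
STEPS = ["authorization", "sandbox", "agent", "reply", "receipt"]


def run_canary(probes):
    """Run each step's probe in order and report the first failure.

    `probes` maps a step name to a zero-argument callable that returns
    True on success. A missing probe, a False result, or an exception
    all count as a failure of that step.
    """
    for step in STEPS:
        probe = probes.get(step)
        try:
            ok = probe is not None and bool(probe())
        except Exception:
            ok = False
        if not ok:
            return ("red", step)
    return ("green", None)
```

Returning the failing step along with the status tells whoever picks up the alert which stage of the loop to inspect first.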
the ratchet
Failures are clustered weekly; any pattern that recurs becomes a proposed deterministic guard. The guard set only grows.
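The ratchet can be sketched in a few lines, under a strong simplification: clustering is reduced here to counting identical failure signatures, where the real system presumably groups similar failures. The signature strings, the `min_occurrences` threshold, and both function names are hypothetical.

```python
from collections import Counter


def propose_guards(failure_signatures, existing_guards, min_occurrences=2):
    """Return recurring failure patterns that have no guard yet.

    Simplification: a "cluster" is a set of identical signature strings.
    """
    counts = Counter(failure_signatures)
    recurring = {sig for sig, n in counts.items() if n >= min_occurrences}
    return sorted(recurring - set(existing_guards))


def ratchet(existing_guards, approved):
    """Add approved guards. The result is a union, so nothing is removed."""
    return set(existing_guards) | set(approved)
```

Proposals are only suggestions in this sketch; a guard joins the set through `ratchet` once approved, and because the update is a set union the guard set can only stay the same size or grow.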
outcomes · full history · aug 2025 – jul 2026
The record, with definitions.
Accepted means a human reviewer squash-merged the pull request; agents cannot merge. One-shot means merged with zero change requests and zero human fixup commits.
Claude Code lane: acceptance 86%
Codex lane: acceptance 100%
GitHub Copilot (trialed, retired): acceptance 60%
merged one-shot: no change requests, no fixup commits
median hours from issue to merge, agent lanes
deterministic guards; the set only grows
of pull requests AI-reviewed and guard-checked, human-authored ones included
Scope note: fully autonomous agent runs produced about 18% of shipped work so far, a share that grows by design as trust compounds. The rest was written by engineers working with agents. Only the autonomous lane is counted in the acceptance figures above. Assisted work, however large, is excluded.
who runs this
When agents carry implementation, job titles stop describing the work. What we observe instead are five shapes of work, and each runs on the loop above.
01
the prototyper
ships questions
Runs new ideas knowing most won't survive. The system makes each experiment cheap enough to discard without ceremony.
02
the builder
makes it real
Turns a validated idea into production-grade product and infrastructure without skipping a gate on the way.
03
the optimizer
removes weight
Simplifies the system, unships what earns nothing, tunes performance. Reads run receipts the way a CFO reads costs.
04
the grower
compounds usage
Iterates a shipped product toward fit, steered by usage data flowing through intake rather than by opinion.
05
the maintainer
keeps it true
Owns a mature system's security, reliability and cost as it scales. The self-observation layer is their instrument panel.
Most people span two of these. Some span three. None of them is a job title: we've seen designers working as prototypers and as optimizers.
Inside this system they share one duty: each archetype maintains the part of the system it depends on. The prototyper keeps intake honest, the optimizer keeps the guards sharp, the maintainer keeps the measurement layer true. That is how five different sets of priorities ship through one loop.
what comes next
The AI SDLC described on this page (the gates, the guards, the receipts, the self-observation layer) is being packaged as an open-source system that other teams can run. Leave your details if you want access when it is ready. You will get one email when it ships, and one before that if you want to be part of the early group.