
Designed for Humans


I built an autonomous delivery pipeline on top of GitHub’s Agentic Workflows. You write a product brief in natural language. Agents decompose it into issues, implement each as a PR, review each other’s code, merge, deploy, and self-heal when CI breaks. No human writes a line of application code.

It works. But getting here meant discovering that every piece of developer infrastructure — git, CI/CD, code review, deployment — was designed for humans. And when agents try to use it, things break in ways nobody warned me about.

The friction

Git assumes human-paced collaboration. When four agents push to branches concurrently, you hit race conditions that no human workflow ever encounters. CI/CD pipelines assume human-triggered events: a developer pushes, a build runs, a human triages the failure. When an agent needs to programmatically dispatch a workflow, retry on a structured error, or requeue stalled work, the infrastructure fights you. Code review expects a human reader scanning diffs. When an agent reviews another agent’s PR, it needs structured evaluation criteria, not a comment box.

Over several weeks and thousands of workflow runs, the pipeline surfaced dozens of failure modes in GitHub’s Agentic Workflows platform itself. Most were fixed upstream within days. Building a pipeline that chains agents, depends on structured outputs, and runs full lifecycle workflows stresses edges that shallow integrations never touch.

The bugs themselves were fixable. What they revealed was harder to fix: the bottleneck in autonomous software delivery is the system around the model, not the model itself. The agents I use — Copilot, Claude, Codex — are capable enough. They can implement features, write tests, review code, diagnose CI failures. The hard problem is orchestrating them into a system that’s reliable enough to trust with production workloads.

The five layers

The harness in PRD to Prod is five layers deep. Each one exists because something broke without it.

Autonomy policy

Early on, an agent modified a GitHub Actions workflow file while fixing a CI failure. It was trying to help. The fix was correct. But it had just changed the routing logic that governs what agents are allowed to do — the control plane itself. Nothing prevented it.

That’s when I wrote autonomy-policy.yml: a machine-readable file that defines what agents can and cannot do. Application code, tests, documentation — autonomous. Workflow definitions, deployment routes, secrets, authentication logic — human-required. The policy is declarative. Agents read it. Unrecognized actions fail closed. The core rule: agents can execute inside policy, but they cannot redefine policy. Any change that expands system authority requires a human.
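A minimal sketch of what such a policy file could look like. The field names and path globs here are illustrative assumptions, not the actual schema of autonomy-policy.yml:

```yaml
# Hypothetical shape of autonomy-policy.yml. Agents read this file;
# anything not matched by a rule fails closed to human-required.
autonomous:
  - paths: ["src/**", "tests/**", "docs/**"]
    actions: [create_pr, push_commit, update_issue]
human_required:
  - paths: [".github/workflows/**", "deploy/**"]
    actions: [modify, delete]
  - resources: [secrets, authentication]
default: fail_closed   # unrecognized action or path -> requires a human
```

The key property is the last line: the policy enumerates what is allowed, and everything else escalates, so an agent can never grant itself new authority by inventing an action the policy never anticipated.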

Decision state machine

The first version of the review system let agents leave freeform comments on PRs. The result was unpredictable. Sometimes an agent would “approve with suggestions,” a state that doesn’t exist in any merge logic. Sometimes it would write three paragraphs of feedback that no downstream system could parse into an action.

I replaced this with a constrained enum. The review agent returns one of two verdicts: APPROVE or REQUEST_CHANGES. The merge workflow parses the verdict from a structured comment. If it can’t parse one, it defaults to REQUEST_CHANGES. With no freeform decisions, every state is reasoned about in advance and every transition is handled. The agent’s judgment still matters — it decides which verdict to issue — but the system never has to interpret ambiguity.
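The fail-closed parsing rule can be sketched in a few lines. This assumes the verdict appears on a line like `VERDICT: APPROVE` in the review comment; the actual comment format isn’t shown in the post:

```python
import re

# The constrained enum: the only two verdicts the merge logic understands.
VERDICTS = {"APPROVE", "REQUEST_CHANGES"}

def parse_verdict(comment: str) -> str:
    """Extract the review verdict; anything unparseable fails closed."""
    match = re.search(r"^VERDICT:\s*(\S+)", comment, re.MULTILINE)
    verdict = match.group(1) if match else None
    # Fail closed: an unknown or missing verdict blocks the merge.
    return verdict if verdict in VERDICTS else "REQUEST_CHANGES"

parse_verdict("Looks solid.\nVERDICT: APPROVE")        # "APPROVE"
parse_verdict("Three paragraphs of prose, no verdict")  # "REQUEST_CHANGES"
parse_verdict("VERDICT: APPROVE_WITH_SUGGESTIONS")      # "REQUEST_CHANGES"
```

The third case is exactly the “approve with suggestions” state from the early version: a verdict outside the enum is treated as a request for changes, never as an approval.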

Identity separation

In an early run, the same agent that implemented a feature also reviewed its own PR and approved it. Technically sound. Practically meaningless — the system was auditing itself. The review was a formality that checked nothing.

Now the agent that writes code can never be the same identity that approves it. repo-assist implements. pr-review-agent reviews against the specification and the autonomy policy. pr-review-submit handles the merge decision as a separate workflow. Three distinct actors, three distinct scopes of authority. The separation makes the review step actually catch things.
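The invariant is small enough to state as a guard. The three agent names come from the post; the function itself is a hypothetical illustration of the check, not the actual implementation:

```python
# Distinct actors, distinct scopes of authority.
IMPLEMENTERS = {"repo-assist"}
REVIEWERS = {"pr-review-agent"}

def review_is_valid(pr_author: str, reviewer: str) -> bool:
    """An approval counts only if writer and reviewer are distinct actors
    in their designated roles. Self-review is structurally impossible."""
    return (
        pr_author in IMPLEMENTERS
        and reviewer in REVIEWERS
        and reviewer != pr_author
    )

review_is_valid("repo-assist", "pr-review-agent")  # True
review_is_valid("repo-assist", "repo-assist")      # False: self-review
```

Because the roles are disjoint sets rather than a runtime preference, the early failure mode — an agent approving its own PR — can’t recur without a human editing the role definitions.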

Self-healing loops

CI breaks. In a human workflow, someone notices, triages, and fixes. In an autonomous pipeline running at 2 AM, a CI failure with no triage is a dead pipeline.

The self-healing layer has three parts. A failure router watches CI runs and classifies failures using log extraction. It parses the failed step’s output, categorizes the failure type, and decides whether to attempt a fix or escalate. If it’s fixable, it creates a structured issue with the failure context, the hypothesis, and the correlated files. A repair agent picks up the issue and implements the fix as a new PR. A review agent evaluates the repair against the original failure. The fix goes through the same review and merge path as any other PR: identity separation, autonomy policy, decision state machine. No shortcuts for self-repair.

The escape hatch: PIPELINE_HEALING_ENABLED=false pauses auto-repair while keeping detection active. The system can tell you something broke without trying to fix it.
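The classify-then-route step, with the escape hatch wired in, can be sketched as follows. The failure categories and string matches are assumptions for illustration; the real router’s taxonomy and log extraction are richer:

```python
import os

# Hypothetical failure taxonomy; the real router's categories aren't shown.
FIXABLE = {"test_assertion", "build_error"}

def classify(log_tail: str) -> str:
    """Crude log-based classification of a failed CI step."""
    if "Assert" in log_tail or "Expected:" in log_tail:
        return "test_assertion"
    if "error CS" in log_tail:       # e.g. a .NET compiler diagnostic
        return "build_error"
    return "unknown"                 # unknown failures always escalate

def route(log_tail: str) -> str:
    category = classify(log_tail)
    # PIPELINE_HEALING_ENABLED=false pauses repair but not detection.
    healing = os.environ.get("PIPELINE_HEALING_ENABLED", "true") != "false"
    if category in FIXABLE and healing:
        return "create_repair_issue"   # structured issue -> repair agent PR
    return "escalate_to_human"
```

The important asymmetry: turning the flag off changes only the routing decision. Classification still runs, so the system keeps telling you something broke even when it isn’t allowed to fix it.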

Deterministic scaffolding

Agents are nondeterministic. The same prompt can produce different code and different review outcomes. The scaffolding around them can’t be.

Standard GitHub Actions owns all routing, policy enforcement, deployment, and merge authority. 37 workflow files define the control plane. Agents operate inside this scaffolding, never above it. The dispatch system is deterministic: a PRD triggers decomposition, each issue triggers implementation, each PR triggers review, each approval triggers the merge check. The state transitions are defined in YAML, not in agent output. When an agent completes a task, it doesn’t decide what happens next. The workflow does.
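One link of that chain, sketched as a workflow fragment. The file name, workflow names, and inputs are illustrative assumptions, not the actual control plane:

```yaml
# Hypothetical fragment: a PR event deterministically dispatches review.
# The agent never chooses this transition; the YAML does.
name: pr-review-dispatch
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  request-review:
    runs-on: ubuntu-latest
    steps:
      - name: Dispatch the review agent workflow
        run: gh workflow run pr-review-agent.yml -f pr="${{ github.event.number }}"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

Every transition in the pipeline has this shape: an event fires, a workflow file decides what runs next, and the agent’s output is only an input to that decision, never the decision itself.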

This is the layer that makes the whole thing auditable. You can read the workflow files and know exactly what the system is capable of, without running it.

The factory

When all five layers hold, something changes. The pipeline stops being a tool you operate and becomes a system that operates on its own.

One evening, the pipeline shipped a compliance feature — six PRs across domain models, API endpoints, a scan engine, a dashboard page, and navigation links. Each PR was implemented by an agent, reviewed by a separate agent, and merged through the deterministic scaffolding. The last one landed around 7 PM. I closed my laptop.

Three hours later, CI broke. The compliance pages had changed the navigation structure, and the existing nav tests still asserted the old links. The failure router caught the break, extracted the xUnit assertion failure from the logs, and created a structured issue with the root cause and the affected files. Seven minutes later, a repair agent opened a fix PR. The review agent verified it. It merged in four minutes.

That fix surfaced a second failure — the test suite itself needed updating to match the new navigation. The CI Failure Doctor diagnosed it. Another repair PR went up three minutes later. Merged. Then a third pass: the pipeline trimmed the navigation to its final four-page structure and updated every remaining assertion. Merged.

Break: Nav tests fail after the compliance feature ships.
Detect: Failure router catches the break, extracts the root cause.
Fix #1: Repair agent restores the missing nav links (4 min).
Fix #2: Second agent updates the test assertions (3 min).
Fix #3: Third pass trims the nav to its final structure.
Stable: All 124 tests pass. Pipeline resumes.

Three fix PRs, two diagnostic investigations, 45 minutes start to finish. I was not at my computer for any of it. I found out the next morning when I opened the repo and saw the closed issues.

The pipeline didn’t nail it on the first try. It iterated — detected a break, fixed it, found the fix was incomplete, fixed again, then cleaned up. Three passes. That’s closer to how a human would work through a problem, except no human was involved. The system just needed the five layers to hold.

That’s not a code generator. That’s a factory. Multiple agents working in parallel, decomposing, implementing, reviewing, deploying, repairing, while a human-authored policy governs what they’re allowed to do and where they must escalate.


When agents do the work, human authority becomes the most important part of the system. The question is: who decides what ships, and can you prove it?


The delivery infrastructure — the harness — is what makes the answer to that question legible. The autonomy policy says what’s allowed. The decision state machine says how decisions get made. Identity separation says who checks whom. Self-healing says what happens when things break at 2 AM. Deterministic scaffolding says you can read the answer in YAML, not hope the agent got it right.

The hard part of autonomous software delivery is building the system that makes agent output trustworthy enough to ship without a human in the loop, and rigorous enough to prove it.

I’m building in this space. If you are too, I want to talk.