I Stopped Correcting 40% of My AI's Work.
Here's What Changed.
I run product and engineering for regulated healthcare software. Over the past year, I've shipped three products using AI as a core member of the delivery team — not for code suggestions or autocomplete, but for full execution: architecture, domain modeling, compliance documentation, sprint planning, pull requests.
Early on, I was correcting roughly 40% of what the AI delivered. Not hallucinations or obvious errors — structural misalignment. The code was clean, the documents were well-written, and the logic was sound. But 4 out of 10 deliverables didn't match the intent. A story would be implemented against the wrong architectural assumption. A document would reference a pattern we'd discussed but never decided on. A pull request would make a reasonable judgment call that happened to be the wrong one.
That 40% correction rate was consistent across projects, across domains, across complexity levels. It wasn't a capability problem. It was a structural one.
Today, across three shipped products, that number is roughly 5%.
The difference isn't a better model. It's a methodology.
The Problem Isn't Intelligence. It's Ambiguity.
When you hand an AI a story that says "implement the patient authorization flow," the AI will deliver something. It will make decisions about state management, error handling, API boundaries, security constraints, and data persistence. It will make those decisions confidently, because that's what large language models do — they produce plausible outputs.
The problem is that "plausible" and "correct" diverge exactly at the decision points that matter most. The AI doesn't know that your team decided last month to isolate token storage in a dedicated service. It doesn't know that your compliance framework requires audit events at the command level, not the API level. It doesn't know that the domain expert deferred the acuity algorithm to a later phase, so the placeholder invariant in the spec is intentional, not an oversight.
Every ambiguous decision point in a ticket is a coin flip. The AI will resolve it, but it won't tell you it's guessing. And the resolution will be internally consistent, well-documented, and wrong in ways that require deep domain knowledge to catch.
This is why smarter models don't fix the problem. The bottleneck was never reasoning capability. It was the absence of explicit, locked decisions upstream of execution.
Three Layers That Eliminated Guessing
I built a methodology — iteratively, over the course of these three products — that moves every material decision upstream of execution. It has three layers, and they must be completed in sequence.
Layer 1: Regulatory Foundation
Before any product design work begins, the compliance scaffolding gets built. For healthcare, that means a complete Quality Management System: policies, standard operating procedures, forms, evidence records, and a control mapping to whatever regulatory framework applies.
This isn't checkbox compliance. Every SOP becomes a governing document that execution references. When a story says "per SOP-014," that SOP exists, it's specific to the organization's architecture and tooling, and it defines exactly what evidence the story must produce. The AI doesn't interpret regulatory requirements at implementation time — the interpretation was done during the compliance authoring phase and locked in a controlled document.
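To make the mechanism concrete, here is one plausible shape for that story-to-SOP link. The field names and the evidence items are illustrative, not my actual schema:

```python
# Hypothetical shape of a story-to-SOP link. Field names and evidence
# items are illustrative, not my actual schema.
from dataclasses import dataclass

@dataclass
class EvidenceRequirement:
    artifact: str      # e.g. "command-level audit coverage report"
    produced_by: str   # the pipeline job or role that generates it

@dataclass
class StoryGovernance:
    story_id: str
    sop_id: str        # the controlled document the story executes under
    evidence: list[EvidenceRequirement]

story = StoryGovernance(
    story_id="STORY-112",
    sop_id="SOP-014",
    evidence=[EvidenceRequirement(
        artifact="signed verification test report",
        produced_by="ci:test-evidence",
    )],
)
```

The point of the structure is that the evidence obligation travels with the story, so nobody interprets the SOP from memory at implementation time.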
On the first product, a two-person team (me and the AI) produced 50 controlled documents in roughly 30 hours. Traditional timeline for that scope is 4–6 weeks with a dedicated compliance team.
Layer 2: Signal-Driven Design
This is where the product's architecture gets defined — not as a set of diagrams or a PRD that engineering interprets, but as a formal domain model that converges to zero ambiguity through iterative adversarial passes. I call this methodology Signal-Driven Design (SDD).
SDD draws from domain-driven design, event storming, and user journey mapping, but its core mechanic is adversarial convergence. The process works like this (a minimal sketch of the loop follows the steps):
1. Extract a domain specification from the PRD and architectural decision records. This produces bounded contexts, aggregates, commands, events, invariants, policies, and sagas — the full vocabulary of what the system does.
2. Run an adversarial gap analysis against the specification you just produced. The same session that wrote the spec tries to break it: structural gaps (missing aggregates, orphaned events), heuristic gaps (oversized aggregates, thin contexts), language gaps (inconsistent terminology), and decision gaps (unresolved architectural questions).
3. Resolve every gap collaboratively, one at a time. Some are mechanical fixes the AI handles as an interpreter. Others require architectural judgment — those escalate to me. The role separation is explicit: the AI proposes, I decide, and the decision gets recorded with rationale.
4. Regenerate the full specification incorporating all resolutions and run the adversarial analysis again. Repeat until gaps hit zero.
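The control flow is compact even though each step is a full working session. A minimal sketch of the loop, with the AI-assisted steps abstracted as callables; none of these names come from my actual tooling:

```python
# Minimal sketch of the adversarial convergence loop. The callables stand in
# for AI-assisted working sessions; this shows the control flow only.
from typing import Callable, NamedTuple

class Gap(NamedTuple):
    kind: str           # "structural" | "heuristic" | "language" | "decision"
    description: str
    needs_expert: bool  # True => park it with a named owner, never guess

def converge(extract: Callable, analyze: Callable, resolve: Callable,
             regenerate: Callable, max_passes: int = 5):
    spec = extract()                    # domain spec from the PRD and ADRs
    deferred: list[Gap] = []
    for _ in range(max_passes):
        gaps = analyze(spec)            # adversarial pass against the spec
        # Gaps needing a human call are flagged, not guessed at.
        # (A real tracker would dedupe these across passes.)
        deferred += [g for g in gaps if g.needs_expert]
        actionable = [g for g in gaps if not g.needs_expert]
        if not actionable:
            return spec, deferred       # converged: zero unresolved gaps
        for gap in actionable:
            spec = resolve(spec, gap)   # AI proposes, human decides, rationale recorded
        spec = regenerate(spec)         # full regeneration, then analyze again
    raise RuntimeError("gaps are oscillating instead of tightening")
```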
On every product, this converges in three passes. The gap trajectory is predictable by pass two — you can see whether the model is tightening or oscillating. What matters is that when it converges, every command, every event, every invariant has been examined adversarially and either confirmed or corrected. There are no assumptions left in the specification.
The items that can't be resolved — because a domain expert hasn't made the call yet, or a technical evaluation hasn't happened — go into a deferred resolutions tracker with an explicit owner. They don't get guessed at. They get flagged as implementation-blocking and carried forward visibly.
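A deferred item carries enough metadata that it can't quietly vanish. One plausible record shape (the field names are mine):

```python
from dataclasses import dataclass

@dataclass
class DeferredResolution:
    gap_id: str
    question: str      # the unresolved call, stated plainly
    owner: str         # the named human who must make the decision
    blocks: list[str]  # story IDs that cannot start until this is resolved
    # Re-emitted on every regeneration pass until the owner resolves it.
```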
Layer 3: Execution Planning With Closed Boundaries and Enforced Quality Gates
The converged domain specification feeds into an execution plan manifest — a single document that maps every milestone, epic, and story to the domain model, the governing ADRs, and the regulatory SOPs. Every story has four things (one entry is sketched after the list):
- A closed boundary defining exactly which files it creates, modifies, or reads
- Machine-verifiable exit criteria tied to the domain specification
- References to the specific governing documents that define how it's built
- Parallelization tags based on file-level conflict analysis between stories
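Here is one plausible shape for a single manifest entry, expressed as plain data. The keys and IDs are illustrative, not my actual schema:

```python
# One plausible shape for a manifest story entry; keys and IDs are
# illustrative.
STORY_ENTRY = {
    "id": "STORY-214",
    "epic": "EPIC-07",
    "implements": ["PatientAuthorization.AuthorizeAccess"],  # commands/events from the domain spec
    "governed_by": ["ADR-012", "SOP-014"],                   # decisions and procedures it executes under
    "boundary": {
        "creates":  ["src/auth/token_service.py"],
        "modifies": ["src/auth/routes.py"],
        "reads":    ["src/domain/patient.py"],               # context only, must not change
    },
    "exit_criteria": [
        "invariant INV-031 has a failing-then-passing test",
        "audit event emitted at the command level per SOP-014",
    ],
    "parallel_group": "B",  # no file-level conflicts with other group-B stories
}
```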
But the manifest alone is just a document. What makes it enforceable is the CI pipeline.
Every quality gate from the methodology gets encoded into continuous integration. Coverage thresholds, test evidence requirements, traceability checks, linting rules that enforce architectural boundaries — these run on every pull request, automatically, without human intervention. The pipeline doesn't care who wrote the code. It validates that the output conforms to the specification and the governing documents.
This is what keeps execution, tracking, and the AI in sync. You can cut corners, but the quality gates will catch you. A story that skips its test specification fails CI. A pull request that modifies files outside its closed boundary fails CI. Code that doesn't meet coverage thresholds fails CI. The methodology's discipline isn't enforced by willpower or code review — it's enforced mechanically, on every commit.
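The closed-boundary gate in particular is cheap to mechanize. A minimal sketch, assuming a manifest entry like the one above is available as JSON; this is illustrative, not my actual pipeline code:

```python
# Minimal closed-boundary gate: fail the build when a pull request touches
# files outside the story's declared boundary. Illustrative only.
import json
import subprocess
import sys

def changed_files(base_ref: str = "origin/main") -> list[str]:
    # Files this branch touches relative to the merge base with main.
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def check_boundary(story: dict) -> int:
    allowed = set(story["boundary"]["creates"]) | set(story["boundary"]["modifies"])
    violations = [f for f in changed_files() if f not in allowed]
    for f in violations:
        print(f"BOUNDARY VIOLATION: {f} is outside {story['id']}", file=sys.stderr)
    return 1 if violations else 0  # nonzero exit code fails the CI job

if __name__ == "__main__":
    with open(sys.argv[1]) as fh:  # path to the story's manifest entry (JSON)
        sys.exit(check_boundary(json.load(fh)))
```

Coverage thresholds and traceability checks follow the same pattern: read the story's declared obligations, compare them against what the pull request actually produced, and exit nonzero on any mismatch.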
The manifest becomes the contract between planning and execution. When the AI picks up a story, it doesn't need to make architectural decisions — every decision was already made during Layer 2, documented in an ADR, and referenced on the ticket. The AI's job is translation: take the locked specification and produce code that matches it. And CI verifies that the translation is faithful.
This is the key insight: the framework doesn't make the executor smarter. It makes the executor's judgment irrelevant to the outcome.
What the Numbers Actually Mean
The 95% accuracy rate isn't the AI getting things right 95% of the time through better reasoning. It's the AI having nothing to get wrong. When every decision is locked in a governing document, every story has a closed boundary, and every invariant has been adversarially validated, the implementation is a mechanical translation exercise. The 5% that still requires correction comes from edge cases where the specification was ambiguous in ways the gap analysis didn't catch — and those corrections feed back into the next pass.
The 40% correction rate without the framework isn't the AI being bad at coding. It's the AI being confident at guessing. Remove the guessing, and the number drops to the floor.
My time split now is roughly 60% planning (working through the three layers with the AI as a collaborator) and 40% delivery oversight. That 40% isn't debugging code line by line. It's reviewing pull requests for shape, reading CI reports, checking test coverage summaries, and confirming that the pipeline's quality gates passed cleanly. When enough quality is locked into CI, I can trust the pipeline's reports instead of validating implementation details myself. The planning investment is front-loaded and significant. But the delivery phase is fast, predictable, and the reviews are architectural confirmations rather than defect hunts.
What This Changes About AI-Augmented Teams
Yes, code is cheap. AI can produce it faster than any human team, and the quality floor keeps rising with every model generation. But producing code was never the hard part.
The hard part is knowing what to build and why. It's tying together domain-driven design, event storming, user journey mapping, product management, architecture, engineering, testing, and compliance into a coherent system where every decision reinforces every other decision. That cross-discipline integration doesn't come from a tool. It comes from years of building things, shipping things, breaking things, and learning which decisions cascade and which ones don't.
There's no bootcamp for this. No certificate course. No fast path. It's the accumulated judgment of someone who has been a product owner, an architect, a quality manager, and a compliance officer — often simultaneously — and who understands how those roles constrain and inform each other. The methodology I've described isn't a process anyone can follow mechanically. It requires someone who can see the gap between a domain specification and a regulatory requirement, between an architectural decision and its downstream impact on sales cycles, between a testing strategy and its evidence value during an audit.
AI is contextually aware of the now. It can reason about the specification in front of it with remarkable depth. But it doesn't see how today's decisions shape the product six months from now. It doesn't know that shifting regulatory compliance left — building it into the foundation instead of retrofitting it — puts a business at the front of the line in competitive markets where enterprise buyers require certification before the first demo. It doesn't know that a particular domain modeling decision will make or break a pricing tier, or that deferring a feature creates a dependency that blocks three other features in the next quarter.
That strategic reasoning — the ability to see the future implications of present decisions — is what the human brings. And it's not reducible to a prompt.
The methodology I've described is labor-intensive on the front end. You can't skip the compliance layer, shortcut the adversarial convergence, or hand-wave the closed story boundaries. Each layer produces the inputs the next layer requires, and the traceability between layers is what makes the whole chain auditable — which matters in regulated environments, but also matters in any environment where you want to understand why a decision was made six months from now.
The pattern is transferable. I've run it across three products in different domains, with different tech stacks, different team sizes, and different regulatory requirements. The 40% correction rate without the framework is consistent. The 95% accuracy with it is consistent. The methodology scales because the problem it solves — ambiguity at execution time — is universal.
The Human Role Didn't Shrink. It Clarified.
The job of humans in this era has changed. We are no longer responsible for the mundane, repetitive tasks that AI handles better and faster. What we own is design and vision.
My role shifted entirely to where it has the highest leverage: making decisions during planning, setting architectural direction during domain convergence, and reviewing delivery for intent preservation. I'm not reviewing less. I'm reviewing differently. Instead of reading every line of code to find defects, I'm looking at the shape of what's delivered, confirming it matches the specification, and trusting the CI pipeline to enforce the details mechanically.
The AI is a better executor than it is a decision-maker. The methodology I've built accepts that constraint and designs around it. Every decision gets made by a human with domain context, documented with traceability, and locked before execution begins. The AI then does what it's actually good at: translating explicit instructions into consistent output at speed. And the CI pipeline validates every translation, every time, without human intervention.
That's the unlock. Not better AI. Better inputs to AI — created by humans who understand what they're building and why it matters.