Pull Requests Need a Total Rethink in the AI Era

Code review was built for one human writing every line. AI broke that assumption. The fix is layered review, not faster reviewers.

A reviewer approved a 2,400-line PR in three minutes this morning. They did not read the diff. They could not have.

That is not a review. That is a signature.

This is what code review looks like on most AI-forward teams right now. Nobody has named it yet, but everyone knows.


The bottleneck moved again

Last week's post was about the bottleneck moving from coding to requirements once AI sped up the build step. Move forward one stage and the same dynamic plays out at review.

When four engineers each ship ten PRs in a day, two reviewers cannot read forty PRs between them with attention. They scroll. They scan. They approve. The dashboard looks healthy. The codebase quietly accumulates work nobody actually understood.

Most teams respond by adding reviewers. That is rational and it does not work. Two reviewers at twenty PRs each is the same problem. Four at ten each is the same problem with more meetings.

You cannot scale judgment by hiring more pairs of eyes. Judgment does not parallelize.


Why traditional review breaks

Traditional PR review was built on one assumption. A human wrote every line, and that human can defend it. The reviewer's job was a sanity check on a person's craft.

That assumption is dead. A developer using an agent did not write every line. They directed the agent. They might not be able to defend a particular helper function any better than the reviewer can.

So both sides of the review are partly disclaiming the code. The author trusts the agent. The reviewer trusts the author. The code lands. Nobody is sure who actually owns it.

The failure mode isn't obvious at first. Only volume makes it visible.


A three-layer review architecture

Three layers. Each does work the next layer cannot do. The point of the architecture is to make a non-deterministic input pass through a deterministic process.

Layer 1: Automated gates

Linting. Type checks. Unit tests. Security scans. Coverage thresholds. Contract tests against neighboring services.

These run before any human or AI eye is on the diff. If they fail, the PR does not advance. There is no "I will review around the broken test." There is no "the linter is wrong, ignore it."

The point of Layer 1 is not catching bugs. It is reserving the higher layers for problems only the higher layers can solve. Every minute a senior engineer spends asking an LLM-authored PR to please use four spaces is a minute stolen from architectural review.
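
A minimal sketch of the hard-gate rule, in Python. The check names and the `CheckResult` and `gate` helpers are hypothetical, not taken from any particular CI system. The only thing the sketch encodes is that one failed check blocks review, and that there is no override path to ask for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one automated check (hypothetical shape)."""
    name: str
    passed: bool


def gate(results: list[CheckResult]) -> tuple[bool, list[str]]:
    """Return (reviewable, failed_check_names).

    A PR is reviewable only if at least one check ran and all checks passed.
    There is deliberately no override parameter.
    """
    failed = [r.name for r in results if not r.passed]
    # An empty result list means nothing ran, which counts as not reviewable.
    reviewable = bool(results) and not failed
    return reviewable, failed


# Illustrative run: a single failing check blocks the whole PR.
results = [
    CheckResult("lint", True),
    CheckResult("typecheck", True),
    CheckResult("unit-tests", False),
    CheckResult("security-scan", True),
]
reviewable, failed = gate(results)
print(reviewable, failed)  # False ['unit-tests']
```

The design choice worth copying is the missing parameter, not the function. If the gate has no way to say "fail, but let it through," nobody can ask for it.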

Layer 2: AI-assisted review

Same brain, different hat. This is the pattern from Chapter 4 of the book applied at the PR boundary. The same agent that wrote the code, or a sibling agent with a different role, reads the diff with a reviewer prompt instead of an author prompt.

The reviewer prompt asks different questions. Where are the edge cases? Which integrations does this touch? What error paths are missing? What does this PR break if the upstream service goes down for thirty seconds?

The author prompt is biased toward shipping. The reviewer prompt is biased toward not shipping. Run both. They will disagree. The disagreement is the value.

Most teams skip this layer because it feels redundant. It is not redundant. It is the only layer where machine work is reviewed by machines, which makes it the only layer whose capacity scales with AI throughput.
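
One way to sketch the two-hat pattern in Python. Everything here is an assumption for illustration: `Model` is any function the caller supplies that takes a prompt and returns text, the brief wording is mine (built from the reviewer questions above), and `fake_model` is a stand-in so the example runs without a real model.

```python
from typing import Callable

# Hypothetical: any function that takes a prompt and returns text.
Model = Callable[[str], str]

AUTHOR_BRIEF = "You wrote this change. Summarize why it is ready to ship."
REVIEWER_BRIEF = (
    "You did not write this change. Argue against shipping it. "
    "List edge cases, touched integrations, missing error paths, and what "
    "breaks if an upstream service is down for thirty seconds."
)


def build_prompt(brief: str, diff: str) -> str:
    """Combine a role brief with the diff under review."""
    return f"{brief}\n\n--- DIFF ---\n{diff}"


def two_hat_review(diff: str, author: Model, reviewer: Model) -> dict[str, str]:
    """Run the same diff under both briefs and keep both answers side by side."""
    return {
        "author": author(build_prompt(AUTHOR_BRIEF, diff)),
        "reviewer": reviewer(build_prompt(REVIEWER_BRIEF, diff)),
    }


def fake_model(prompt: str) -> str:
    """Stand-in model: it only reports which brief it received."""
    return "objections" if "Argue against" in prompt else "ready to ship"


report = two_hat_review("+ retry_count = 3", fake_model, fake_model)
print(report)  # {'author': 'ready to ship', 'reviewer': 'objections'}
```

The function takes two callables on purpose, so the reviewer can be a different model from the author. Both answers are returned unmerged, so the disagreement stays visible to whoever reads the report.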

Layer 3: Human review

Scoped to the things only a human can decide. Is this the right thing to build? Does this design fit the system we are converging toward? Are we creating a coupling that will hurt us in six months? Does this PR encode a product decision that needs a product owner's signoff?

Layer 3 is not a line-by-line audit. It is a focused judgment call on the parts of the change that need domain context and taste.

If your Layer 3 reviewer is rewriting variable names, your earlier layers failed and they are paying the bill.


Where it breaks

This sounds clean on a slide. The breakdowns are predictable.

  • Skipping Layer 1 because it is annoying. It is not slowing people down. It is removing the cheapest possible work from the most expensive layer. If linting feels annoying, your linter config is wrong, not the principle.
  • Trusting Layer 2 blindly. A reviewer agent with the same prompt and context as the author agent has the same blind spots. If you give the reviewer the same role, you bought nothing. The reviewer needs different instructions, ideally a different model, and explicitly oppositional framing.
  • Compressing Layer 3 into a checklist. Judgment does not survive being turned into a form. The moment Layer 3 becomes "did you check these seven boxes," it collapses back into ceremony. You will know it happened because reviewers will start clicking through in under a minute again.
  • Treating PR size as a virtue. "This is a big PR because it is important work" is not a defense. The right PR size is whatever a human can hold in their head for the parts that need human attention. Everything else is Layer 1 and Layer 2 territory regardless of line count.
  • Reviewers becoming editors. If the Layer 3 reviewer is rewriting code, the spec was too loose or the author agent was underconstrained. Fix that upstream. Do not let the reviewer absorb the cost of weak inputs at the most expensive part of the pipeline.


Implementation checklist

Most of this is process, not tooling. The tooling is already in your repo.

  • Make Layer 1 a hard gate. Failed checks should not be reviewable. Not "approvable with override." Not reviewable.
  • Stand up Layer 2 with a different prompt and a different role. A reviewer agent that mirrors the author agent is theater. Give it an opposing brief.
  • Define what Layer 3 reviewers are responsible for, in writing. Architecture, integration boundaries, business logic, system invariants. Not formatting. Not coverage. Not edge cases the agent should have caught.
  • Use tests as a pre-review gate, not a Layer 3 deliverable. Chapter 9's point applies cleanly here: if tests are written during review, they are not protecting anything. They are decoration.
  • Track review duration alongside PR throughput. A two-hour merge with a six-second review is not a fast pipeline. It is an unreviewed pipeline.
  • Tag the parts of a PR that need human attention. A 2,000-line PR with 1,800 lines of generated migrations and 200 lines of business logic is a 200-line review. Total line count is the wrong metric.
  • Audit your rubber-stamp rate. If the average review on a non-trivial diff is under two minutes, you do not have a review process. You have a queue with a checkbox.

A review that nobody actually performed is not a review, and the codebase pays the bill later.
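
Two of the checklist items are plain arithmetic, so here is a rough Python sketch of both. The `FileChange` and `Review` shapes, the `needs_human` tag, the file paths, and the 50-line cutoff for "non-trivial" are all assumptions made for illustration. The two-minute threshold and the 1,800/200 split come from the checklist above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileChange:
    path: str
    lines_changed: int
    needs_human: bool  # hypothetical tag set by the author or by tooling


@dataclass(frozen=True)
class Review:
    human_lines: int       # lines tagged for human attention
    review_seconds: float  # time between opening the diff and approving


def human_review_lines(changes: list[FileChange]) -> int:
    """Lines a Layer 3 reviewer is actually expected to read."""
    return sum(c.lines_changed for c in changes if c.needs_human)


def rubber_stamp_rate(
    reviews: list[Review],
    min_lines: int = 50,         # assumed cutoff for a "non-trivial" diff
    min_seconds: float = 120.0,  # the two-minute threshold
) -> float:
    """Share of non-trivial reviews finished in under `min_seconds`."""
    non_trivial = [r for r in reviews if r.human_lines >= min_lines]
    if not non_trivial:
        return 0.0
    stamped = [r for r in non_trivial if r.review_seconds < min_seconds]
    return len(stamped) / len(non_trivial)


# The 2,000-line PR from the checklist: only 200 lines need a human.
pr = [
    FileChange("migrations/0042_generated.sql", 1800, needs_human=False),
    FileChange("billing/rules.py", 200, needs_human=True),
]
print(human_review_lines(pr))  # 200

# Made-up review log: two of the three non-trivial reviews took under two minutes.
reviews = [Review(200, 6), Review(120, 900), Review(10, 30), Review(300, 90)]
print(round(rubber_stamp_rate(reviews), 2))  # 0.67
```

The trivial review is left out of the denominator on purpose. A thirty-second approval of a ten-line change is not the problem being measured.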


One question

Pull request velocity makes a great chart. Chapter 5.3 is direct about this: velocity is the most visible metric and the easiest to optimize the wrong way. PRs merged is a gate count. It tells you how often the door opened. It does not tell you what walked through.

What does your team's review process actually catch right now, and which of those things should a machine have caught two layers earlier?
