Apr 17, 2026

Why Agent-Generated Code Is Breaking the Pull Request Review Model

AUTHOR

James A. Wondrasek

The pull request model was built around one assumption: a human wrote the code. One developer, one feature, submitted when they were done. Reviewers could ask the author what they meant. The pace was human.

AI coding agents have ended that assumption. They generate PRs at machine speed, in machine volume, with batch sizes no team was built to handle. The result is verification debt — unreviewed or rubber-stamped agent code piling up in your codebase while your team nods it through because the queue never empties.

This article is part of our complete guide to AI coding agents as autonomous engineering teammates, where we examine the full landscape of the shift from autocomplete tools to autonomous agents and what it means for engineering teams.

Faros AI telemetry found that AI usage correlates with 98% more PRs, PRs that are 154% larger, and review times that are 91% longer. The 2025 DORA State of DevOps Report frames this as a systems problem: AI is an amplifier, and teams with weaker feedback loops see increased rework rates, incident response delays, and compounding cognitive load.

Thomas Dohmke, former GitHub CEO, raised $60M in February 2026 to rebuild the software production system from the ground up. His thesis: the whole model — issues, Git, pull requests, deployment — was never designed for AI agents as authors.

This article explains why the PR model is failing under agent volume, how to tell whether your team is already in review collapse, and what a redesigned process looks like in practice.

Why Is the Pull Request Model Breaking Down With AI Coding Agents?

The PR model was designed to solve three things: peer accountability, shared context between author and reviewer, and a lightweight async collaboration mechanism for distributed teams. It works when code arrives in ones and twos and the author can explain their intent.

AI coding agents violate all three assumptions simultaneously. They generate code without contextual understanding, submit at machine speed — 20 concurrent PRs is routine — and produce code whose reasoning is opaque because there is no author to ask.

The outcome is SDLC backpressure: the pipeline stalls at the review gate, not the generation stage. Code flows into the review queue far faster than it flows out. Agents keep submitting regardless of pipeline state.
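The arithmetic behind backpressure is simple enough to sketch. A toy queue model, with all rates chosen purely for illustration rather than drawn from any measured team:

```python
# Toy model of review-gate backpressure: agents submit PRs faster than
# reviewers can clear them, so the queue grows without bound.
# All rates here are illustrative assumptions, not measured data.

def queue_depth(days, submitted_per_day, reviewed_per_day, start=0):
    """Review queue size per day, given constant in/out rates."""
    depth = start
    history = []
    for _ in range(days):
        depth = max(0, depth + submitted_per_day - reviewed_per_day)
        history.append(depth)
    return history

# Five agents each shipping 4 PRs/day vs. a team clearing 12 reviews/day.
backlog = queue_depth(days=10, submitted_per_day=20, reviewed_per_day=12)
print(backlog[-1])  # queue grows by 8 PRs every day: 80 after 10 days
```

The point of the sketch: because agents do not throttle on pipeline state, any sustained gap between submission and review rate produces linear, unbounded queue growth.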

There is also a second failure mode: concurrency explosion at the CI/CD layer. Shared infrastructure designed for sequential human submissions hits race conditions and flaky tests when multiple agents submit in parallel.

The failure is structural, not behavioural. You cannot tell your team to read 500-line agent PRs with the same scrutiny they’d apply to a 50-line human PR while the queue keeps growing.

In the five-level map of AI coding agent autonomy, the review bottleneck begins at levels three through five, where agents act with task-level or goal-level autonomy and generate the volume that overwhelms review capacity.

What Does the DORA 2025 Report Find About AI Adoption and Code Review Time?

The 2025 DORA State of DevOps Report positions AI as an organisational amplifier, not a productivity guarantee. In teams with strong feedback loops, AI accelerates delivery. In teams where those loops are weak, it amplifies the dysfunction — more rework, more incidents, more cognitive load.

The specific mechanism DORA flags is batch size. The report identifies working in small batches as foundational for AI to have a positive impact. The problem is that AI usage in practice creates larger batches, not smaller. Larger batches demand proportionally more reviewer attention — and review capacity has not scaled to match.

An empirical study of 567 agent-generated PRs found that 83.77% are eventually accepted, versus 91.01% for human-authored PRs. That acceptance rate looks reasonable until you see the detail: 45.1% required human revision for correctness, documentation, or code style. Nearly half of merged agent PRs slipped through with problems a careful reviewer would have caught.

Swarmia’s coding agents view operationalises this, tracking merge rate, review time per agent PR, batch size, and task success rate as distinct metrics. If your team is in the autonomous coding agent landscape and not yet tracking these separately, you’re measuring the wrong things.

Why Did Thomas Dohmke Raise $60M to Rebuild Code Review From Scratch?

Thomas Dohmke served as CEO of GitHub for four years before leaving in August 2025 to found Entire. He oversaw the rise of GitHub Copilot and knows the PR model better than almost anyone. He concluded it needed rebuilding, not iterating.

Entire raised $60M in seed funding in February 2026 at a $300M post-money valuation, led by Felicis Ventures. Co-investors include Madrona, M12, Basis Set, 20VC, Cherry Ventures, Picus Capital, and Global Founders Capital. Angels include Jerry Yang, Olivier Pomel, Garry Tan, and Gergely Orosz.

The founding thesis: “Our manual system of software production — from issues, to git repositories, to pull requests, to deployment — was never designed for the era of AI in the first place.” Dohmke is not arguing that code review needs improvement. He is arguing the entire production system needs rebuilding for a world where agents are the primary authors.

Felicis investor Aydin Senkut put it plainly: “Trying to bolt agents onto human-centric workflows is creating friction, complexity, and real bottlenecks across the ecosystem.”

Entire sits on top of GitHub and GitLab rather than replacing them. The bet is that Git alone is not enough when agents are the authors.

What Does Entire’s Checkpoints Tool Actually Do Differently?

Checkpoints is Entire’s first open-source product — a CLI tool, not a SaaS platform. No vendor lock-in, no external service required. The data lives directly in Git.

What Checkpoints captures on every AI-generated commit: the agent’s prompt, the reasoning steps it followed, and every fork in its decision tree. When someone is debugging a module six months from now, they can read the agent’s decision trail rather than reverse-engineering intent from a diff.

That makes a different kind of review possible — one where the reviewer evaluates whether the agent’s reasoning was sound, not just whether each line looks right. It also catches reasoning failures that code inspection alone misses: the agent that produces syntactically correct but semantically wrong code because its prompt was underspecified, for example.

Checkpoints currently supports Claude Code and Gemini CLI. The data is visualised through a dedicated UI that lets reviewers navigate the reasoning trail rather than reading the diff line by line. Being open-source means your team can inspect how context is captured, extend it for unsupported tools, and avoid proprietary lock-in.

How Do You Diagnose Review Collapse in Your Own Engineering Team?

Review collapse does not announce itself. It looks like velocity. PRs are merging, agents are shipping, the dashboard looks healthy. The verification debt is accumulating underneath.

There are five measurable signals you can track with existing tooling — Swarmia's agent metrics view, GitHub analytics, or equivalent:

1. PR cycle time by author type. If agent-generated PRs are merging faster than human-authored PRs despite being larger, the rubber stamp effect is already active. Faster merges on larger PRs means less scrutiny, not better code.

2. Merge rate trend. Track the ratio of agent PRs submitted to agent PRs merged over time. If the ratio is staying flat only because teams are approving faster, that is not a healthy signal.

3. Batch size growth. Increasing average line count per agent PR is a leading indicator. The DORA 2025 finding on “working in small batches” applies directly — if batch size is trending up, review compression follows.

4. Review time per agent PR. If this is decreasing while batch size is increasing, reviewers are spending less time on larger PRs. That is the definition of rubber-stamping, and the empirical data on what gets missed is not encouraging: correctness issues, documentation gaps, and code style problems in 45.1% of merged agent PRs.

5. Task success rate. The proportion of agent PRs that merge without human revision. A rising success rate sounds positive — validate it against change failure rate and post-merge bugs before treating it as evidence of improving review quality.

The rubber stamp check you can run right now: compare review comment density on agent PRs versus human PRs. If agent PRs attract fewer comments despite larger size, reviewers have already disengaged. GitHub’s PR analytics surface this directly.
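The comment-density comparison can be run against a PR export with a few lines of code. A minimal sketch, using hypothetical PR records; in practice you would pull these fields from the GitHub API or your analytics tool:

```python
# Rubber-stamp check: compare review comment density (comments per 100
# changed lines) on agent PRs vs. human PRs. The PR records below are
# hypothetical examples, not real data.

def comment_density(prs):
    """Review comments per 100 changed lines across a set of PRs."""
    lines = sum(p["lines_changed"] for p in prs)
    comments = sum(p["review_comments"] for p in prs)
    return 100 * comments / lines if lines else 0.0

agent_prs = [
    {"lines_changed": 500, "review_comments": 2},
    {"lines_changed": 420, "review_comments": 1},
]
human_prs = [
    {"lines_changed": 60, "review_comments": 4},
    {"lines_changed": 45, "review_comments": 3},
]

agent_d = comment_density(agent_prs)
human_d = comment_density(human_prs)

# Larger PRs attracting far fewer comments per line means reviewers
# have disengaged from the agent queue.
print(f"agent: {agent_d:.2f}, human: {human_d:.2f} comments/100 lines")
```

If the agent-side density sits an order of magnitude below the human-side density, as in this illustrative data, the rubber stamp effect is already active.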

The downstream risk is code rot that accumulates through review collapse — a slower-moving problem than a production incident, but a more expensive one to fix.

What Does a Redesigned Code Review Process for Agent-Generated PRs Look Like?

The emerging model replaces the single human review gate with a layered verification pipeline. Each layer handles a different type of failure mode.

Layer 1 — Deterministic gates. Before any human reviewer sees a PR, it must pass automated, non-AI verification: type checkers, linters, property-based tests, security scanners. These filter out the mechanical errors that inflate human review load — issues that currently consume reviewer attention on things that should never reach a human.
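A Layer-1 gate can be expressed as a simple policy function over PR metadata. A sketch under stated assumptions — the check names and the 400-line size cap are illustrative choices, not a standard:

```python
# Sketch of a Layer-1 deterministic gate: a PR reaches a human reviewer
# only if every required automated check passed and the batch is under a
# size cap. Check names and MAX_LINES are illustrative assumptions.

REQUIRED_CHECKS = {"typecheck", "lint", "unit-tests", "security-scan"}
MAX_LINES = 400  # force agents to resubmit oversized batches as smaller PRs

def passes_gate(pr):
    """Return (ok, reasons) for a PR dict with 'checks' and 'lines_changed'."""
    reasons = []
    passed = {name for name, ok in pr["checks"].items() if ok}
    missing = REQUIRED_CHECKS - passed
    if missing:
        reasons.append(f"failed/missing checks: {sorted(missing)}")
    if pr["lines_changed"] > MAX_LINES:
        reasons.append(f"batch too large: {pr['lines_changed']} > {MAX_LINES}")
    return (not reasons, reasons)

ok, why = passes_gate({
    "checks": {"typecheck": True, "lint": True, "unit-tests": False,
               "security-scan": True},
    "lines_changed": 520,
})
print(ok, why)  # rejected on both grounds before any human sees it
```

The size cap doubles as a batch-size control: it pushes the DORA small-batches principle back onto the agents rather than onto the reviewers.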

Layer 2 — Intent-level review. Human reviewers evaluate the agent’s recorded reasoning via Checkpoints or equivalent, rather than inspecting code line by line. The question shifts from “is this code correct?” to “was the agent’s reasoning sound for this task?” For the human review layer, focus on the constraints and requirements, not the diff itself.

Layer 3 — Selective adversarial review. For high-risk or complex PRs, adversarial agents challenge the submission before human sign-off — automated red-teaming as a review layer. AI reviewers can catch 70-80% of low-hanging fruit; adversarial agents push that coverage further on the highest-risk PRs.

Near-term: deploy deterministic gates if not already in place. Start tracking the five diagnostic metrics. Evaluate Checkpoints for the agent tools your team uses — it is open-source and Git-native, which makes adoption straightforward.

On a 12-month horizon, the direction is spec-driven development as upstream prevention — agents implementing against formal, machine-readable specifications rather than vague prompts. Because the specification is verifiable, correctness can be checked automatically, reducing the volume of PRs that require human intent-level review at source.
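What "verifiable specification" means in practice can be sketched in miniature. Both the spec format and the function under test below are hypothetical, chosen only to show the shape of the idea:

```python
# Sketch of spec-driven verification: the spec is machine-readable, so
# conformance can be checked automatically before any human review.
# The spec format and the agent-submitted implementation are hypothetical.

spec = {
    "function": "normalize_username",
    "properties": [
        ("lowercases input",  lambda f: f("Alice") == "alice"),
        ("strips whitespace", lambda f: f("  bob ") == "bob"),
        ("idempotent",        lambda f: f(f("Carol")) == f("Carol")),
    ],
}

def normalize_username(name):  # the agent-submitted implementation
    return name.strip().lower()

failures = [desc for desc, check in spec["properties"]
            if not check(normalize_username)]
print("spec satisfied" if not failures else f"violations: {failures}")
```

Because each property is executable, a conforming PR can be merged on evidence rather than reviewer attention, and only spec violations escalate to a human.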

Human review does not go away in this model — it gets redirected to architectural decisions, intent validation, and strategic trade-offs. The reviewers who come out ahead in an agent-heavy team are the ones doing that work, not the ones approving 500-line diffs.

For a broader AI coding agent overview — covering the full autonomy spectrum, context engineering, infrastructure, and investment decisions — see our complete guide to AI coding agents as autonomous engineering teammates.

Frequently Asked Questions

What is the rubber stamp effect in AI code review?

The rubber stamp effect is when engineers approve large agent-generated PRs without meaningful scrutiny — relying on passing tests and linting as a proxy for genuine review. Empirical data from 567 agent PRs found that 45.1% of merged agent contributions required human revision for correctness, documentation, and code style issues.

What is verification debt?

Verification debt is the backlog of AI-generated code that has not been meaningfully reviewed. Like technical debt, it compounds silently and surfaces as production incidents or quality degradation — often long after the agent PRs that created it were merged.

Where can I find the Entire Checkpoints open-source tool?

Checkpoints is an open-source CLI tool built by Entire (entire.io). It integrates with Claude Code and Gemini CLI, and stores captured agent reasoning data directly in Git. No external service is required.

Is the pull request model completely dead now that AI writes code?

No — but it requires structural redesign for teams with high AI adoption. The PR as a human reading exercise is inadequate for agent-generated code. The required change is the review methodology: from line-by-line code inspection to intent-level review of agent reasoning.

What is the difference between reviewing AI-generated code and reviewing agent intent?

Reviewing code evaluates what the agent produced. Reviewing intent — via tools like Checkpoints — evaluates why the agent made the decisions it did, inspecting the prompt, reasoning chain, and decision tree rather than the resulting diff. Intent-level review is faster for large PRs and catches reasoning failures that code review alone misses.

Can’t teams just use linting and automated tests instead of reviewing AI code?

Automated checks are the first layer of a redesigned verification pipeline, not a replacement for human review. They filter mechanical errors efficiently. They do not detect architectural misjudgements, misunderstood requirements, or reasoning failures that produce syntactically correct but semantically wrong code.

What metrics should I track to detect review collapse before it becomes a crisis?

Five key metrics: PR cycle time by author type, merge rate trend for agent PRs, batch size growth, review time per agent PR, and task success rate. Swarmia’s coding agents view provides all five in a dedicated dashboard designed specifically for agent PR tracking.

How does Checkpoints integrate with existing Git and GitHub workflows?

Checkpoints attaches to agent-generated commits at creation and records the reasoning chain directly in the Git repository — no external service required. It works within existing PR-based workflows, with data surfacing through a dedicated UI overlay.

What was Thomas Dohmke’s role at GitHub before founding Entire?

Thomas Dohmke served as CEO of GitHub for four years before leaving in August 2025 to found Entire. He oversaw the rise of GitHub Copilot — giving him operational authority on the PR model’s design and limitations that few founders in this space can match.

What is SDLC backpressure and how does it affect engineering teams?

SDLC backpressure is what happens when review throughput cannot keep pace with agent-generated code input. The pipeline fills at the review gate — agents keep submitting PRs but human review capacity is saturated. Velocity metrics look healthy at the generation stage but degrade at delivery.

How is the Entire funding round structured?

Entire raised a $60M seed round in February 2026 at a $300M post-money valuation, led by Felicis Ventures. Co-investors include Madrona, M12, Basis Set, 20VC, Cherry Ventures, Picus Capital, and Global Founders Capital. Angel investors include Jerry Yang, Olivier Pomel, Garry Tan, and Gergely Orosz.

What is spec-driven development and how does it relate to the review bottleneck?

Spec-driven development means agents implement against formal, machine-readable specifications rather than natural language prompts. Because the spec is verifiable, correctness can be checked automatically — reducing the volume of PRs requiring human intent-level review and addressing the bottleneck at source.
