Jun 25, 2026

How Ramp, Dropbox, and Stripe Measure the Real Impact of AI Coding Agents

Ramp’s Inspect now authors over half its merged PRs. Stripe’s Minions processes more than a thousand AI-authored pull requests a week. Dropbox has Nova. HubSpot has Crucible. These numbers have been circulating as proof that enterprise AI coding agents have arrived and are working.

But those are output numbers. They count how much code an agent shipped, not whether that code delivered value, whether it was reviewed properly, or what it cost to maintain six months later. The measurement infrastructure that would separate output from productivity does not yet exist, and the frameworks organisations are inheriting from the pre-AI era are breaking under the weight. DX research across nearly 40,000 developers puts actual productivity gains in the 5 to 15% range, not the 50 to 100% claimed in marketing, and Faros AI’s 2025 dataset shows PR volume rising 98% per developer with no measurable improvement in DORA metrics over the same period.

What Are the Real Productivity Numbers Behind Enterprise AI Coding Agents?

Ramp’s Inspect handles over 50% of merged PRs with 6,300% year-over-year growth in AI usage. Dropbox’s Nova accounts for roughly 8% of production PRs, and the company has published a four-stage measurement model: Fuel, Adoption, Output, and Impact. The fourth stage, Impact, remains empty. Dropbox has not published business-outcome data.

Stripe’s Minions processes more than 1,000 AI-authored PRs per week, the highest raw throughput among the named enterprises. But Stripe has not published its measurement framework or ROI methodology. The numbers are large. The methodology is opaque.

Then there is the perception gap. The METR study found developers using AI tools took 19% longer to complete tasks while believing they had been 20% faster. That is a 39-percentage-point gap between what is measured and what is felt. PR merge rates and weekly throughput measure output. Productivity, meaning whether that output creates value or improves quality, requires different instrumentation. And that instrumentation, in turn, depends on whether the code itself is sound.

How Serious Is the Code Review Gap, and How Does AI-Generated Code Compare to Human-Written Code?

The code review gap is an under-measured dimension of AI coding agent impact. It breaks into three layers.

First, unreviewed code. Faros AI telemetry shows 31% of PRs merging with no human review at all. The widely cited Kniberg study pegged it at 24%. If you cannot measure how much AI code skips review, you lack a key signal for governing your development pipeline.

Second, review quality. When AI code is reviewed, two biases distort the result. Automation bias means reviewers trust AI-authored code more than they should. A 2026 study found AI-generated PRs containing nearly twice the code redundancy drew fewer negative reactions from reviewers than equivalent human-written ones. Algorithm aversion runs the other way, with reviewers over-scrutinising AI code and slowing merge cycles.

Third, the code itself. CodeRabbit’s analysis of 470 open-source PRs found AI-coauthored PRs at 1.7 times the per-PR issue rate of human-only PRs, with logic defects up 75% and security issues up 174%. Sourcery Intelligence reports a 14.3% vulnerability rate in AI-generated code. These are not the defect patterns human reviewers are trained to spot. And the metrics we use to track code quality are themselves beginning to fail under the strain.

Why Do DORA Metrics Break When AI Coding Agents Enter the Development Workflow?

Classic DORA metrics were designed for a world where humans wrote every line. AI agents break them in specific ways.

Lead time for change has not shrunk. It has shifted. AI compressed authorship time and handed every saved minute to the reviewer, with median PR review time growing 441% year over year. To see what is actually happening, you need to decompose lead time into time-to-first-review, review duration, and review-to-merge as separate sub-metrics.
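The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's implementation: the `PullRequest` record and its four timestamp fields are hypothetical names, the sample timestamps are invented, and the sketch covers only the open-to-merge portion of lead time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PullRequest:
    # Hypothetical record; field names are assumptions, not a real API.
    opened_at: datetime
    first_review_at: datetime
    approved_at: datetime
    merged_at: datetime


def decompose_lead_time(pr: PullRequest) -> dict[str, timedelta]:
    """Split open-to-merge time into three review-stage sub-metrics."""
    return {
        "time_to_first_review": pr.first_review_at - pr.opened_at,
        "review_duration": pr.approved_at - pr.first_review_at,
        "review_to_merge": pr.merged_at - pr.approved_at,
    }


# Invented timestamps, for illustration only.
pr = PullRequest(
    opened_at=datetime(2026, 1, 5, 9, 0),
    first_review_at=datetime(2026, 1, 5, 15, 0),
    approved_at=datetime(2026, 1, 6, 11, 0),
    merged_at=datetime(2026, 1, 6, 12, 0),
)
for name, span in decompose_lead_time(pr).items():
    print(name, span)
```

Tracking the three spans separately shows where the time goes: a PR that is written in minutes but waits a day for its first review looks very different from one that stalls after approval.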

Deployment frequency inflates when AI agents produce more PRs. Higher deployment frequency paired with rising churn (GitClear’s two-week churn rose from 3.1% to 5.7%) signals rework, not velocity.

Change failure rate becomes harder to attribute because the authorship chain is no longer simple. Decomposing CFR by authorship source reveals whether AI-authored changes fail at different rates than human-authored ones, surfacing patterns that an aggregate CFR conceals. DORA 2025 added rework rate as a fifth metric, tracking unplanned production fixes shortly after deploy. It is a better signal for AI-assisted workflows than classic CFR because it catches the code that passes review but breaks in production.
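As a rough sketch of what decomposing CFR by authorship source could look like, the Python below groups changes by an authorship label and computes a failure rate for each. The labels (`"ai"`, `"human"`), the function name, and the sample data are all assumptions for illustration. In practice the hard part is attributing authorship reliably in the first place.

```python
from collections import defaultdict


def failure_rate_by_author(changes: list[tuple[str, bool]]) -> dict[str, float]:
    """Return the share of failed changes for each authorship label."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for author, failed in changes:
        totals[author] += 1
        if failed:
            failures[author] += 1
    return {author: failures[author] / totals[author] for author in totals}


# Invented data: (authorship label, caused a production failure?)
changes = [
    ("ai", True), ("ai", False),
    ("human", False), ("human", False), ("human", False), ("human", True),
]

# The aggregate rate hides the difference between the two groups.
aggregate = sum(failed for _, failed in changes) / len(changes)
print(f"aggregate CFR: {aggregate:.2f}")
print(failure_rate_by_author(changes))
```

In this made-up sample the aggregate rate sits between the two per-source rates, which is exactly the pattern an undivided CFR would conceal.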

How Are Ramp, Dropbox, HubSpot, and Stripe Building Their Own AI Coding Agent Infrastructure?

Ramp built Inspect on Modal’s infrastructure: sandboxed VMs starting almost instantly, persistent session state on Cloudflare Durable Objects, and a queue system routing prompts from Slack, web, CLI, and Chrome extension into the same running session. The integration depth is what sets it apart.

Dropbox built Nova as a centralised execution layer because off-the-shelf tools could not operate inside Dropbox’s monorepo, Bazel build system, and on-premise infrastructure. Nova supports both interactive developer sessions and asynchronous workflows like flaky test remediation and large-scale migrations.

Stripe’s Minions runs on a remote development platform Stripe built years before GPT-3 existed. It integrates 400-plus MCP tools via Toolshed and processes more than a thousand PRs a week.

HubSpot’s Crucible runs Claude Code on Kubernetes with custom Docker images, handling agent executions as one-to-one Kubernetes Jobs. Over 7,000 fully AI-generated PRs have been merged.

The four enterprises never coordinated, yet all landed on the same five architecture primitives: sandboxed environments, context connectivity, triggers, fleet orchestration, and governance. The pattern has standardised.

How Does Ramp’s Inspect Compare to Dropbox’s Nova and Stripe’s Minions?

The three platforms represent different architectural bets, and those bets determine what each enterprise can measure.

Ramp’s Inspect optimises for deep integration into a single workflow. It tracks per-agent throughput and adoption with a level of detail no other platform matches. Adoption was voluntary, and engineers chose it because it reduced toil.

Dropbox’s Nova is built for concurrent multi-session orchestration. It can measure coordination efficiency and session overlap in ways Inspect cannot, because Inspect was never designed to coordinate agents.

Stripe’s Minions operates as a parallel fleet. It can measure aggregate fleet throughput at a scale no single-agent platform matches. HubSpot’s Crucible sits adjacent, a managed-model-plus-custom-platform pattern that can attribute cost and quality at the model and policy layer.

The architectural divergence is not incidental. It reflects each company’s existing infrastructure investments, engineering culture, and what they chose to instrument. And that choice of instrumentation is what shapes their measurement frameworks.

How Do Ramp, Dropbox, and Stripe Differ in How They Measure AI Coding Agent Productivity?

Each enterprise’s measurement framework reflects the platform it was built to instrument, and the divergence follows directly from the architectural choices covered above.

Ramp measures what Inspect makes visible: PR merge rate, AI usage growth, and per-developer adoption. It is PR-centric because Inspect is a single-agent platform that lives inside the development workflow. The 6,300% year-over-year growth figure is the headline, but the underlying measurement tracks adoption patterns across teams.

Dropbox has published the most structured framework: a four-stage model that separates leading indicators from lagging ones. It tracks quality signals alongside speed: code review turnaround time, first-run test pass rate, defect ratio, and rework rate. Impact-stage measurement, the company acknowledges, is still in development.

Stripe’s measurement approach is the least documented. Published data covers throughput, but the company has not disclosed metric definitions, baseline comparisons, or ROI methodology. For the highest-throughput deployment among the four enterprises, the measurement gap is significant.

The DX AI Measurement Framework, used by Dropbox and others, tracks three dimensions: Utilization, Impact, and Cost. It represents the vendor alternative to internal measurement. No standard exists across the industry, and each enterprise is developing frameworks that suit its architecture.

What Should Organisations Look for When Auditing AI-Generated Code for Security and Technical Debt?

AI-authored code passes surface-level checks more easily than human code but embeds risks that existing audit frameworks were not built to address.

Veracode’s 2025 report found 45% of AI-generated code contains a security vulnerability. AI-assisted commits expose secrets at more than twice the rate of human-only commits. CVEs formally attributed to AI-generated code jumped from 6 in January 2026 to 35 in March of that same year, and researchers estimate the actual count is five to ten times higher because most AI tools leave no commit metadata.

Technical debt audits must assess architectural fit. AI agents optimise for local correctness but can introduce patterns inconsistent with the broader codebase. GitClear’s analysis shows AI-generated code created four times more duplication, and refactoring dropped from 25% of changed lines to under 10%.

Test coverage is a misleading signal. AI-generated tests often achieve high coverage with assertion-light tests that pass thresholds without meaningful validation. The “80% problem” describes a pattern where AI agents reliably produce roughly 80% of a working solution but systematically omit error handling, security, observability, and edge cases. The remaining 20% is what creates production incidents. Audit frameworks work best when there is a pre-AI baseline to compare against, which brings us to the baseline problem.

How Can Organisations Establish a Pre-AI Baseline for Engineering Metrics?

Most teams never captured pre-AI baselines. The no-baseline scenario is the norm, not the exception.

Retroactive baselines can be constructed from git history: PR cycle times, merge rates, rework patterns, and code churn. But git mining cannot recover qualitative data like review thoroughness or developer satisfaction. The before-and-after comparison will always be approximate.
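A minimal sketch of mining one such signal from git history follows. It assumes you have captured the output of `git log --merges --pretty=format:%cI`, which prints one ISO 8601 committer date per merge commit, and it counts merges per calendar month. Treating merge commits as a proxy for merged PRs is an assumption: repositories that squash or rebase on merge will not have them. The function name is hypothetical.

```python
from collections import Counter
from datetime import datetime


def merges_per_month(log_output: str) -> dict[str, int]:
    """Count merge commits per month from one-ISO-timestamp-per-line git log output."""
    counts: Counter[str] = Counter()
    for line in log_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        merged_at = datetime.fromisoformat(line)
        counts[merged_at.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))


# Invented sample standing in for real `git log` output.
sample_log = """\
2024-03-02T10:00:00+00:00
2024-03-15T12:30:00+01:00
2024-04-01T09:00:00+00:00
"""
print(merges_per_month(sample_log))
```

Running the same count over the months before and after agent rollout gives an approximate merge-rate baseline. Cycle time and rework need richer data than commit timestamps alone.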

A minimum viable baseline captures five metrics: PR cycle time, code churn rate, change failure rate by authorship, rework rate, and developer-reported satisfaction. The 30-60-90 day measurement rollout recommends 30 days of instrumentation and baseline capture before agent rollout, 60 days of parallel measurement, and a 90-day review comparing both datasets. In practice, this means accepting that the comparison you want most will be approximate regardless.
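The five-metric baseline and the 90-day comparison can be sketched as a small data structure. Everything here is an assumption for illustration: the `BaselineSnapshot` name, the field names and units, and the sample values are invented, and change failure rate is held as a single number rather than split by authorship to keep the example short.

```python
from dataclasses import asdict, dataclass


@dataclass
class BaselineSnapshot:
    # Hypothetical fields and units for the five metrics named above.
    pr_cycle_time_hours: float
    code_churn_rate: float
    change_failure_rate: float
    rework_rate: float
    developer_satisfaction: float  # e.g. a 1-5 survey score


def relative_change(before: BaselineSnapshot, after: BaselineSnapshot) -> dict[str, float]:
    """Fractional change per metric; metrics with a zero baseline are skipped."""
    b, a = asdict(before), asdict(after)
    return {name: (a[name] - b[name]) / b[name] for name in b if b[name] != 0}


# Invented numbers: a pre-rollout baseline and a 90-day review snapshot.
before = BaselineSnapshot(20.0, 0.04, 0.10, 0.05, 4.0)
after = BaselineSnapshot(25.0, 0.05, 0.10, 0.06, 4.0)

for metric, change in relative_change(before, after).items():
    print(f"{metric}: {change:+.0%}")
```

Reporting all five changes side by side keeps a gain in one metric from hiding a regression in another.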

If retroactive baselines are impossible, Dropbox’s staged approach provides a progressive alternative: begin with Fuel metrics and progress toward Output and Impact as instrumentation matures. The starting-point challenge most organisations face is not missing data. It is attempting to measure after the agents are already in production.

The headline numbers are not what they appear to be. Ramp’s PR merge rate and Stripe’s weekly throughput measure output, not productivity. The distinction matters because AI agents shift work toward review, rework, and prompt engineering rather than eliminating it.

Code review quality is the dimension most organisations are not measuring. The unreviewed rate, automation bias, and AI-specific defect patterns mean that reviewed AI code is not equivalent to reviewed human code. DORA metrics must be decomposed by authorship source to remain useful, and the architectural divergence across Ramp, Dropbox, and Stripe explains why no standard measurement framework exists. Each platform’s architecture determines what it can instrument.

A practical first step for any organisation is establishing a pre-AI baseline, even a retroactive one, and adopting a staged measurement model that acknowledges Impact-stage measurement is still under development industry-wide. Without a before, no after can be interpreted with confidence.

Frequently Asked Questions

What is the difference between an AI coding agent and a traditional AI coding assistant like GitHub Copilot?

A coding assistant suggests completions within your editor; a coding agent accepts a task, writes code across multiple files, runs tests, and opens a pull request with no intermediate human interaction. Stripe’s Minions exemplify the agent model: a developer posts a task in Slack and the agent returns a pull request ready for human review. Agents operate in sandboxed environments with tool access, while assistants remain tightly scoped to in-editor suggestions.

Do smaller organisations need to build their own AI coding platforms, or can they use commercial tools?

Most smaller organisations should start with commercial tools rather than building internal platforms. The Ramp/Stripe/Dropbox pattern of custom infrastructure emerged because these enterprises hit control, security, and scale limits that commercial tools could not address. For teams not operating at that scale, the DX measurement framework and managed offerings provide sufficient capability without the infrastructure investment. Build only when commercial tools demonstrably constrain your outcomes.

Is it true that AI coding agents will replace software developers?

No. The data tells a different story: even at over 50% of merged PRs (Ramp), AI agents generate code that requires review, architectural oversight, and maintenance, all of which demand experienced developers. The METR study found developers using AI tools took 19% longer to complete tasks despite feeling 20% faster. What changes is the nature of developer work: less boilerplate generation, more code review, architectural decision-making, and coordination. The role shifts; it does not disappear.

What happens if an organisation deploys AI coding agents without any measurement framework?

You lose the ability to distinguish between productive AI use and activity theatre. Without measurement, the 24% unreviewed code rate goes undetected, review burden silently expands, rework accumulates over six-month windows, and you cannot answer the board’s inevitable ROI question. The Faros AI data showing PR review time increasing 91% alongside 98% more PR volume illustrates what happens when output grows faster than quality monitoring. Measurement infrastructure is not optional; it is the prerequisite for governance.

How do AI coding agents affect junior developers differently than senior developers?

AI agents risk widening the experience gap. Senior engineers absorb the expanded review burden created by AI-generated code, leaving less time for mentorship. Junior developers lose the deliberate practice of writing code from scratch and may struggle to evaluate AI-generated solutions critically. The Carnegie Mellon finding that cognitive complexity rose 41% and persisted over time is particularly concerning for juniors who lack the experience to recognise when AI output is architecturally unsound. Organisations should pair AI adoption with structured mentoring programmes.

Can AI coding agents handle legacy codebases as well as greenfield projects?

Legacy codebases present distinct challenges. AI agents trained primarily on modern code patterns may introduce abstractions that clash with established architecture. On Stripe’s 30-million-line Ruby codebase, Minions succeed because they have access to 400+ internal tools via Toolshed that encode organisational conventions. Without equivalent context connectivity, agents on legacy systems risk producing code that works in isolation but creates architectural drift. The Carnegie Mellon data on persistent complexity increase is a warning for legacy environments.

What skills should developers focus on building as AI coding agents become more capable?

Three capabilities become disproportionately valuable: code review and critical evaluation (the 24% unreviewed code rate makes this urgent), system design and architectural judgement (agents generate implementation, not architecture), and prompt engineering with domain-specific context. The developer who can specify tasks precisely, evaluate AI output critically, and integrate generated code into existing systems will outperform the developer who writes more code manually. DX research confirms that coding is only 14% of a developer’s day; the other 86% is where human judgment compounds.

How do I convince my leadership team to invest in measurement before scaling AI tool adoption?

Lead with the cost of not measuring. Without measurement infrastructure, AI tool spend rises unpredictably (tokenmaxxing creates unpredictable usage-based billing) while delivery throughput stalls: DX data shows 93% adoption producing only an 8% median throughput increase. Frame measurement as a cost-tracking exercise rather than an academic project. The six-month DX implementation roadmap (baseline, controlled rollout, cohort analysis) provides a concrete timeline that leadership can evaluate against the quarterly cost of unmeasured AI tooling expansion.

Should organisations mandate AI coding tool usage or follow Ramp’s voluntary adoption model?

Ramp’s voluntary adoption model succeeded for specific reasons: engineers adopted Inspect because it demonstrably reduced toil, not because leadership mandated it. The data supports this approach. DX found that forced adoption without demonstrated value creates resistance and distorts feedback. However, voluntary adoption must be paired with measurement: track adoption rates by team, correlate usage with throughput outcomes, and address the teams where AI tools create net drag. Mandates work only after the tools have proven their value in your specific environment.

Are there open-source alternatives to building an internal AI coding platform?

Yes, and they are maturing rapidly. Open-SWE, released by LangChain under an MIT licence with over 6,000 GitHub stars, follows the same architectural pattern that Ramp, Stripe, and Dropbox built independently, using Manager, Planner, and Programmer agents plus Reviewer sub-agents in isolated cloud sandboxes. It is not a turnkey replacement for Inspect or Minions, but it provides a production-grade starting point for organisations that need internal platform capabilities without building from scratch. The convergence of enterprise architecture into open-source code is a signal that the pattern has standardised.

Is the productivity perception gap from the METR study still accurate with today’s more advanced agentic tools?

The METR study tested pre-agentic tools, and a follow-up with newer agentic tooling is underway. However, the structural dynamics that explain the gap have not changed: coding remains only 14% of a developer’s day, review burden expands to absorb generation speed gains, and coordination bottlenecks persist regardless of how capable the individual agent becomes. The DX Q1 2026 data showing daily AI users merge 2.4 PRs per week versus 1.5 for non-users suggests the gap narrows with agentic tools, but the 8% median organisational throughput figure warns against assuming the perception gap has closed entirely.