Business | SaaS | Technology
Jan 28, 2026

The Evidence Against Vibe Coding: What Research Reveals About AI Code Quality

AUTHOR
James A. Wondrasek

AI coding tools promise massive productivity gains. GitHub talks about 55% faster completion. Cursor advertises 2× productivity. Your developers are probably already using them—63% of professional developers are, according to Stack Overflow’s 2024 survey.

But independent research tells a different story. Experienced developers were 19% slower with AI tools despite feeling 20% faster. GitClear analysed 211 million lines of code and found refactoring collapsed from 25% to under 10%. CodeRabbit discovered AI code had 2.74× more security vulnerabilities.

These aren’t vendor case studies. They’re randomised controlled trials, longitudinal analyses, and systematic code reviews measuring actual outcomes.

As part of our comprehensive exploration of understanding vibe coding and the future of software craftsmanship, this article examines the empirical evidence that CTOs need before committing to enterprise-wide AI tool adoption—what research reveals about code quality, developer performance, and the hidden costs that offset those promised productivity gains.

Why were developers 19% slower with AI tools in the METR study?

METR conducted a randomised controlled trial on how early-2025 AI tools affect experienced open-source developers’ productivity. They recruited 16 developers from major repositories averaging 22,000+ stars and over 1 million lines of code each.

They assigned 246 real issues—not toy problems—to developers who could either use AI tools or work without them. Tasks averaged two hours each. Developers could use any tools they wanted, though most chose Cursor Pro with Claude models.

The results? Developers took 19% longer when AI tools were permitted.

Here’s the kicker. Participants predicted a 24% speedup before the study. After experiencing the slowdown, they still believed AI had made them 20% faster. That’s a 39-percentage-point perception-reality gap—the largest METR documented in their productivity research.

Why the slowdown? The study identified five contributors: time spent learning tools, debugging AI-generated code, reviewing and refactoring outputs, context switching between AI and manual work, and managing tool limitations and errors.

Unlike vendor benchmarks using isolated functions and greenfield projects, METR used multi-file tasks with existing codebases. Real work. The kind that requires understanding architectural context, integrating with existing code, and handling edge cases that AI tools consistently miss.

The perception gap exists because AI tools reduce cognitive load during initial coding. That feels faster. It feels easier. But the time you save typing gets consumed by verification—checking that suggested APIs exist, testing edge cases AI overlooked, understanding what the generated code does before you can extend it.

GitHub reports 55% faster completion. Cursor advertises 2× productivity multipliers. The methodology differences explain the discrepancy—toy problems vs. production scenarios, self-selected enthusiastic early adopters vs. randomised assignment, satisfaction scores vs. objective time measurement.

When evaluating what vibe coding means for your team, the METR study answers the question vendor benchmarks dodge: what happens when experienced developers use AI tools on realistic tasks in existing codebases?

They get slower. But they feel faster. And that perception gap is dangerous when making strategic decisions about tool adoption.

How does AI-generated code impact software quality metrics?

GitClear analysed 211 million lines of code changes from 2020-2024, sourced from repositories owned by major tech companies and enterprises. The longitudinal design tracked how code quality metrics changed as AI adoption increased.

Three patterns emerged: refactoring collapsed, code duplication exploded, and code churn accelerated.

Refactoring—the practice of cleaning up working code to improve its structure—dropped from 25% of changed lines in 2021 to under 10% by 2024. Developers accept AI output without the iterative improvement they’d apply to human-written code.

Code duplication increased from 8.3% of changed lines in 2021 to 12.3% by 2024, and GitClear reports that blocks of duplicated code grew roughly 4× over the same period. AI lacks whole-codebase context. It regenerates similar logic instead of reusing existing functions. And developers don’t cross-reference before accepting, because that would eliminate the time savings.
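To make the pattern concrete, here is a minimal sketch (the function and module names are hypothetical, not drawn from GitClear’s dataset) of how an assistant without whole-codebase context duplicates logic that already exists:

```python
# Hypothetical repository already contains a shared helper:
#
#   # utils/money.py
#   def format_currency(amount_cents: int) -> str:
#       return f"${amount_cents / 100:,.2f}"

# AI suggestion made without whole-codebase context: re-implements the
# formatting inline instead of importing the existing function.
def build_invoice_line(description: str, amount_cents: int) -> str:
    formatted = f"${amount_cents / 100:,.2f}"  # duplicate of utils/money.py
    return f"{description}: {formatted}"

# The reuse-oriented version a reviewer would push for:
#
#   from utils.money import format_currency
#
#   def build_invoice_line(description: str, amount_cents: int) -> str:
#       return f"{description}: {format_currency(amount_cents)}"
```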

Code reuse also declined, with moved lines (code relocated during refactoring) continuing to fall. For the first time in GitClear’s dataset, copy-pasted code exceeded moved code, reversing two decades of best practice around the DRY principle.

Code churn—premature revisions where code gets rewritten shortly after merging—nearly doubled. AI generates code that passes tests but requires revision after integration testing, architectural review, or production deployment.

Rod Cope, CTO at Perforce Software, explains the context problem: “AI is better the more context it has, but there is a limit on how much information can be supplied.” AI tools can’t see your entire codebase, understand your architectural decisions, or know that someone already solved this problem three months ago in a different module.

ThoughtWorks researchers observed that complacency sets in with prolonged use of coding assistants. Duplicate code and code churn rose even more than GitClear’s 2024 research had predicted.

Tests verify functionality. Maintainability requires different qualities—readable structure and consistent patterns. These metrics quantify the gap between functional code and production-ready systems. Maintainability determines whether you’re building a system or accumulating technical debt.

For those examining the economic implications of vibe coding, these metrics translate directly to maintenance costs. Duplicated code means bugs require fixes in multiple locations. Reduced refactoring means complexity compounds until rewrites become necessary. Code churn means development time gets consumed by rework instead of new features.

What did CodeRabbit’s comparison reveal about AI code quality and security?

CodeRabbit analysed 470 open-source GitHub pull requests—320 AI-co-authored and 150 human-only—using a structured issue taxonomy to compare quality systematically.

AI-generated PRs contained 10.83 issues per PR compared to 6.45 for human-only PRs. That’s nearly 1.7× as many issues overall.

The gaps weren’t uniform. Readability issues spiked 3× higher in AI contributions—the single largest difference across the entire dataset. AI optimises for working code, not human comprehension. It generates long functions, inconsistent naming, minimal comments, and nested complexity that experienced developers would refactor before committing.

Security issues measured 2.74× higher in AI code. The most common pattern? Improper password handling. But the catalogue extends to input validation failures, authentication bypasses, SQL injection risks, and hardcoded credentials. AI training data includes insecure examples, and models lack security-first thinking.
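To make “improper password handling” concrete, here is a hedged sketch using only Python’s standard library (illustrative, not code from the CodeRabbit dataset) of the kind of fix reviewers end up making:

```python
import hashlib
import hmac
import os

# The pattern reviewers flag in AI-generated code: the raw credential at rest.
def register_user_insecure(store: dict, username: str, password: str) -> None:
    store[username] = password  # plaintext password stored directly

# Reviewed version: salted, slow hash plus constant-time comparison.
def register_user(store: dict, username: str, password: str) -> None:
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    store[username] = (salt, digest)

def verify_user(store: dict, username: str, password: str) -> bool:
    if username not in store:
        return False
    salt, expected = store[username]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return hmac.compare_digest(candidate, expected)
```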

Logic and correctness issues were 75% more common in AI PRs—business logic errors, misconfigurations, edge case handling failures. Error handling and exception-path gaps appeared nearly 2× more often. AI often omits null checks, early returns, guardrails, and comprehensive exception logic.

Performance regressions, though small in number, skewed heavily toward AI, with excessive I/O operations approximately 8× more common. Concurrency and dependency-correctness issues saw roughly 2× increases. Formatting problems appeared 2.66× more often despite automated formatters. And AI introduced nearly 2× more naming problems: unclear names and generic identifiers.

High-issue outliers were much more common in AI PRs, creating heavy review workloads. The variability matters as much as the average. You can’t predict which AI-generated PR will explode with issues.

Unlike human code where error rates correlate with developer experience, AI code quality is unpredictable. Every line requires verification regardless of how plausible it appears. That eliminates efficiency from trust-based review—you can’t skim code from a senior developer the way you’d scrutinise a junior’s work.

A Cortex report found that while pull requests per author increased 20% year-over-year thanks to AI, incidents per pull request increased 23.5%. More output without faster cycle times increases review workload without productivity gains.

The CodeRabbit researchers conclude that “AI-generated code is consistently more variable, more error-prone, and more likely to introduce high-severity issues without the right protections in place.”

For teams working on security implications of vibe coding, the 2.74× vulnerability density isn’t just a statistic. It’s a production incident waiting to happen. It’s a penetration test failure. It’s a security audit finding that blocks a customer deal.

What is the “productivity tax” in AI-assisted development?

The productivity tax describes hidden costs that offset AI coding tool productivity gains. Time saved during initial code generation gets consumed by downstream work: debugging hallucinations, reviewing plausible-but-wrong code, refactoring duplicated outputs, and resolving premature revisions that passed tests but failed in production.

METR found developers took 19% longer despite time savings during initial coding. The tax exceeded the benefit.

The tax manifests in several ways. Consider hallucinations first. AI confidently suggests non-existent libraries, deprecated APIs, and wrong function signatures. You spend time verifying suggestions are real, checking documentation, rewriting when AI invents features. OpenAI’s own research shows AI agents “fail to root cause, resulting in partial or flawed solutions.”

Then review overhead. Traditional code review scales with developer experience—senior developers need less scrutiny because you trust their judgement. AI code requires full review regardless of confidence level. The output might be perfect or it might have a subtle authentication bypass. You can’t tell without checking.

Faros AI observed PR sizes increased 154%—more verbose, less incremental AI-generated code. Review times increased 91%, influenced by larger diff sizes and increased throughput. Organisations saw 9% more bugs per developer as AI adoption grew.

Accepting AI output without cleanup creates technical debt that accumulates invisibly until it forces dedicated refactoring work. Future changes require understanding poorly structured code, debugging duplicated logic, and refactoring before you can extend functionality.

ThoughtWorks warns: “The rise of coding agents further amplifies these risks, since AI now generates larger change sets that are harder to review.”

When does the productivity tax exceed gains? Complex existing codebases where architectural context matters. Domains requiring deep expertise where surface-level correctness isn’t sufficient. Security-facing applications where vulnerabilities create business risk. Long-lived systems where maintenance burden compounds over years.

Mitigation strategies exist—augmented coding practices, automated quality gates, policy-as-code enforcement, mandatory code review. These reduce the tax but don’t eliminate it. You’re trading AI speed for quality discipline, which puts you back at roughly human-level productivity with additional tooling overhead.

For those building economic frameworks around AI coding tools, the productivity tax is the line item vendor ROI calculations omit. It’s the reason METR’s objective measurements diverge from GitHub’s satisfaction scores.

What are the common error patterns in AI-generated code?

AI-generated code exhibits three primary error categories: hallucinations, logic errors, and security flaws. The severity ranges from compilation failures to exploitable vulnerabilities.

Hallucinations include fake libraries—inventing packages that don’t exist. API misuse—using real libraries with wrong methods. Deprecated functions—suggesting outdated approaches. Invented features—adding parameters that don’t exist.
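A hedged illustration of those shapes (every package, function, and parameter name below is deliberately fictitious, to show the pattern without misrepresenting any real library):

```python
# 1. Fake library: the suggested package simply does not exist on PyPI.
# import quickvalidate          # ModuleNotFoundError, plus a wasted install attempt

# 2. Invented feature: a plausible call padded with a keyword the (fictitious)
#    helper never accepted.
# parse_config("app.toml", strict_mode=True)   # TypeError: unexpected keyword argument

# The verification habit that consumes the "saved" time: confirm the module is
# actually importable before trusting the suggestion.
import importlib.util

def module_exists(name: str) -> bool:
    """Return True only if the module can be resolved in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:  # raised when a parent package is missing
        return False
```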

Models infer code patterns statistically, not semantically. They miss system rules that senior engineers internalise through months of working in a codebase.

Logic errors manifest as edge case blindness—missing null checks, empty array handling, boundary conditions. Algorithmic mistakes—off-by-one errors, incorrect loop conditions. Concurrency issues—race conditions and deadlocks in multi-threaded code.

AI generates surface-level correctness. Code that looks right but skips control-flow protections. According to CodeRabbit’s analysis, AI often omits null checks, early returns, guardrails, and comprehensive exception logic.
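A small, hypothetical example of that missing-guardrail pattern:

```python
# Surface-level correct: handles exactly the happy path the prompt described.
def average_latency(samples):
    return sum(samples) / len(samples)   # ZeroDivisionError on an empty list

# Reviewed version: the early returns and guards that AI output frequently omits.
def average_latency_safe(samples):
    if not samples:                       # None or empty input
        return 0.0
    if any(s < 0 for s in samples):
        raise ValueError("latency samples must be non-negative")
    return sum(samples) / len(samples)
```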

Security vulnerabilities follow patterns: input validation failures allowing malicious data, authentication bypasses from logic flaws in access control, injection attacks (SQL, command, XSS), credential exposure through hardcoded secrets or logging sensitive data.
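A minimal sketch of the injection pattern using Python’s built-in sqlite3 module (the table and columns are hypothetical):

```python
import sqlite3

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # String interpolation: input like "alice' OR '1'='1" rewrites the query.
    query = f"SELECT id, email FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchone()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Parameterised query: the driver treats the input as data, never as SQL.
    return conn.execute(
        "SELECT id, email FROM users WHERE username = ?", (username,)
    ).fetchone()
```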

Security patterns degrade without explicit prompts as models recreate legacy patterns or outdated practices from training data. Apiiro found data breaches through AI-generated code surged threefold since mid-2023.

Why does AI make these errors? Models rely on pattern matching over training data rather than semantic understanding of business logic. Training data includes bad examples. The models don’t think adversarially about edge cases, and they bring no security-first design thinking.

Gary Marcus emphasises problems with generalisation beyond training data. AI-generated code works reasonably with familiar systems but fails tackling novel problems. The tools excel at pattern recognition but “cannot build new things that previously did not exist.”

Detection strategies vary by error type. Test-driven development catches logic errors. Security scanning tools identify vulnerabilities. Code review reveals hallucinations. Compilation catches obvious fakes like non-existent libraries.
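For example, the boundary-condition tests that TDD forces you to write catch the edge-case blindness described above (a hypothetical sketch, runnable with pytest; the helper is redefined so the file stands alone):

```python
# A trimmed copy of the guarded helper from the earlier sketch.
def average_latency_safe(samples):
    if not samples:
        return 0.0
    return sum(samples) / len(samples)

def test_handles_empty_input():
    # The edge case an unguarded AI implementation would crash on.
    assert average_latency_safe([]) == 0.0

def test_normal_case():
    assert average_latency_safe([10, 20, 30]) == 20.0
```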

But detection isn’t prevention. You’re debugging AI mistakes instead of writing correct code yourself. That’s the productivity tax again—time spent fixing problems AI introduced.

For security-focused teams, the error patterns matter less than the unpredictability. You can train humans to avoid SQL injection. You can’t train AI to stop hallucinating authentication logic that looks secure but isn’t.

How can CTOs evaluate research methodology to distinguish vendor claims from independent studies?

As discussed in our comprehensive guide to understanding vibe coding, evaluate research credibility by examining four dimensions: study design, task realism, measurement objectivity, and researcher independence.

Study design hierarchy runs from randomised controlled trials at the top through controlled observational studies, self-selected samples, vendor case studies, to anecdotal evidence at the bottom.

METR’s RCT design controls for developer experience, randomly assigns AI tools vs. baseline, and measures objective outcomes. That’s the gold standard. Compare to vendor studies where enthusiastic early adopters self-select into using new tools and report satisfaction scores.

Task realism matters. METR used multi-file tasks with existing codebases. Real architectural context. Integration challenges. Edge cases. Vendor benchmarks use isolated functions—write a sorting algorithm, implement a REST endpoint, generate a CSS layout. Toy problems that reveal nothing about production scenarios.

Realistic tasks reveal integration costs, debugging time, and architectural understanding requirements that isolated exercises hide.

Measurement objectivity separates completion time (objective, METR) from self-reported productivity (subjective, vendor surveys) from satisfaction scores (sentiment, GitHub) from code quality metrics (objective, GitClear and CodeRabbit).

METR measured how long tasks took. Vendors ask developers “do you feel more productive?” Those questions measure different things.

Researcher independence affects framing. METR operates as an AI safety non-profit. GitClear sells code analytics. CodeRabbit builds review automation. None sell the AI coding tools they’re evaluating.

GitHub sells Copilot. Cursor sells their AI IDE. OpenAI sells model access. Financial incentives shape how results get presented—not necessarily the results themselves, but which results get highlighted and how limitations get discussed.

Faros AI’s analysis tracked 1,255 teams through natural work over time—full software delivery pipeline from coding through deployment. Longitudinal observational studies sit below RCTs but above vendor case studies in credibility.

Red flags: small sample sizes, self-selected samples, toy problem tasks, self-reported metrics only, vendor funding without independent analysis, unpublished claims.

Critical evaluation checklist: randomisation, sample size, task realism, measurement objectivity, researcher independence, peer review status, reproducibility.

How to use research in business decisions? Weight evidence by quality. Triangulate across multiple independent studies—METR, GitClear, and CodeRabbit converge on quality degradation despite different methodologies. Demand vendor benchmark methodology transparency. Pilot internal measurements before enterprise rollout.

When GitHub claims 55% faster completion, ask: faster at what? Isolated toy problems or production tasks? Self-selected early adopters or randomised assignment? Satisfaction or objective completion time? What happened to code quality, security, and maintainability?

The methodology matters more than the headline number. A well-designed study showing 19% slower performance tells you more than a vendor case study claiming 10× productivity from enthusiastic beta testers writing greenfield code.

What do these research findings mean for engineering leaders making AI tool decisions?

Treat AI coding tools as productivity-neutral or slightly negative for experienced developers on complex tasks. Plan for code review overhead. Budget for technical debt payback. Implement quality gates before enterprise adoption.

The evidence converges: METR’s 19% slower performance, GitClear’s refactoring collapse and 4× duplication, CodeRabbit’s 2.74× security vulnerabilities.

Index.dev reports 41% of global code is now AI-generated, rising to 61% in Java projects. Over 25% of all new code at Google is written by AI according to CEO Sundar Pichai. This isn’t a future concern—it’s current reality.

When do AI tools add value? Prototyping and proof-of-concepts where code quality matters less than speed to validation. Isolated greenfield projects without complex architectural context. Experienced developers on tedious tasks like boilerplate and test scaffolding. Learning new frameworks where AI provides examples and documentation.

When do AI tools create risk? Production systems with long lifecycles where maintenance burden compounds. Security-facing applications where vulnerabilities create business exposure. Complex existing codebases where architectural understanding matters. Junior developer-heavy teams without senior oversight.

Kent Beck distinguishes “augmented coding” from vibe coding: “In vibe coding you don’t care about the code, just the behaviour of the system. In augmented coding you care about the code, its complexity, the tests, & their coverage.”

Beck maintains strict TDD and actively intervenes to prevent AI overreach. His augmented coding approach, combining test-driven development, mandatory code review, and architectural oversight, mitigates quality degradation. The productivity tax shrinks with discipline, but realistic productivity gains come from process discipline rather than tool features alone.

Simon Willison proposes “vibe engineering” as distinct from vibe coding—how experienced engineers leverage LLMs while maintaining accountability for production code quality. He emphasises that “AI tools amplify existing expertise” and advanced LLM collaboration demands operating “at the top of your game.”

Willison identifies eleven practices for effective AI usage: automated testing, advance planning, documentation, version control discipline, automation infrastructure, code review culture, management skills, manual QA expertise, research capabilities, preview environments, strategic judgement.

These practices create productivity through process discipline. The AI tool amplifies existing capabilities but doesn’t replace foundational engineering practices.

Implementation framework: Start with a constrained pilot on non-production work. Measure quality metrics objectively—refactoring rate (target: maintain 25%), code duplication (target: no increase), code churn (target: stable), security vulnerability density (target: below human baseline), actual vs. perceived productivity.
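As one illustration of objective measurement, here is a rough duplication proxy you could run weekly during a pilot (a sketch only: the six-line window is an arbitrary assumption, and GitClear’s production metrics are considerably more sophisticated):

```python
"""Crude duplication proxy: share of 6-line blocks that appear more than once."""
import hashlib
from collections import Counter
from pathlib import Path

WINDOW = 6  # block size in lines; an arbitrary choice for this sketch

def block_hashes(path: Path):
    # Hash every sliding window of non-blank, stripped lines in the file.
    lines = [l.strip() for l in path.read_text(errors="ignore").splitlines() if l.strip()]
    for i in range(len(lines) - WINDOW + 1):
        yield hashlib.sha1("\n".join(lines[i:i + WINDOW]).encode()).hexdigest()

def duplication_ratio(root: str, suffix: str = ".py") -> float:
    counts = Counter()
    for path in Path(root).rglob(f"*{suffix}"):
        counts.update(block_hashes(path))
    total = sum(counts.values())
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / total if total else 0.0

if __name__ == "__main__":
    print(f"duplicated block ratio: {duplication_ratio('.'):.1%}")
```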

Implement quality gates before scaling: test coverage minimums before merging AI code, security scanning on all PRs, senior developer review for junior AI usage, prompt engineering guidelines, approved tool lists.
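A minimal sketch of one such gate, assuming coverage.py’s JSON report (written by `coverage json`, with a totals.percent_covered field); the threshold and path are placeholders to adapt to your pipeline:

```python
"""Fail the merge if measured test coverage drops below a project threshold."""
import json
import sys

THRESHOLD = 80.0  # placeholder: set this to your project's agreed baseline

def main(report_path: str = "coverage.json") -> int:
    with open(report_path) as fh:
        report = json.load(fh)
    # Assumes coverage.py's JSON report layout: {"totals": {"percent_covered": ...}}
    percent = report["totals"]["percent_covered"]
    if percent < THRESHOLD:
        print(f"FAIL: coverage {percent:.1f}% is below the {THRESHOLD:.0f}% gate")
        return 1
    print(f"OK: coverage {percent:.1f}% meets the {THRESHOLD:.0f}% gate")
    return 0

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```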

Train teams on augmented coding workflow, code review for AI output, prompt engineering for better results, recognising hallucinations and logic errors, security awareness.

Policy recommendations: test coverage requirements, mandatory review, security scanning. Make reckless AI usage difficult while supporting disciplined usage. ThoughtWorks notes that outright bans are impractical and counterproductive. Focus on quality outcomes, not tool prohibition.

Executive-level conversation framing: Present total cost of ownership (licensing + productivity tax + technical debt), quantify risk exposure (security vulnerabilities, maintenance burden), propose pilot with measurement before enterprise rollout.

Licence costs are the smallest expense. Budget for extended review time, security audits, refactoring sprints, and technical debt payback. Calculate total cost of ownership, not just procurement cost.

Gary Marcus presents evidence that vibe coding adoption is declining after the initial hype, with usage falling for months following an early spike. He characterises vibe coding as producing projects that start promisingly but “end badly.”

The research doesn’t say AI coding tools are useless. It says they’re not productivity multipliers for experienced developers on production work. They’re tools that require discipline, oversight, and quality gates to use safely.

Reject vendor claims of massive productivity gains. Plan for neutral or slightly negative individual productivity offset by gains in specific scenarios. Invest in process discipline—that’s where sustainable productivity comes from. Augmented coding provides a responsible framework for using AI tools while maintaining code quality and professional standards.

For a complete overview of AI-assisted development practices, quality considerations, and implementation strategies, see our comprehensive guide to understanding vibe coding and the future of software craftsmanship.

FAQ Section

What percentage of code is AI-generated today?

Index.dev reports 41% of global code is now AI-generated, rising to 61% in Java projects. However, “generated” includes everything from single-line autocomplete to entire applications—making the metric less useful than measuring what percentage goes to production unreviewed.

Do all studies show negative AI coding tool results?

No—vendor studies (GitHub, Cursor) report 50-100% productivity gains, but use self-selected early adopters and toy problems. The distinction is between tightly controlled independent research (METR, GitClear, CodeRabbit) finding quality issues and vendor marketing claiming massive gains. Methodology matters more than headline numbers.

Can augmented coding practices eliminate the productivity tax?

Partially—Kent Beck’s augmented coding approach with test-driven development, mandatory code review, and architectural oversight mitigates quality degradation but doesn’t eliminate all hidden costs. The productivity tax shrinks with discipline but realistic productivity gains come from process discipline rather than tool features alone.

How do you measure code quality degradation in your own codebase?

Track refactoring rate (target: maintain 25% of changes), code duplication (monitor for increases), code churn (flag premature revisions), security vulnerability density (compare AI vs. human code), and test coverage (ensure AI code meets minimums). GitClear’s metrics provide baseline expectations.

Are some programming languages more vulnerable to AI code quality issues?

Yes—Java shows the highest AI generation rate (61%) and more quality issues, partly because its verbose syntax encourages copy-paste. Languages with strong type systems (Rust, TypeScript) catch more AI errors at compile time. Memory-unsafe languages (C/C++) carry higher vulnerability risk when AI hallucinates.

Should CTOs ban vibe coding entirely?

Outright bans are impractical and counterproductive—better to implement augmented coding policies (test coverage requirements, mandatory review, security scanning) that make reckless AI usage difficult while supporting disciplined usage. Focus on quality outcomes, not tool prohibition.

How long does it take for technical debt from AI code to manifest?

GitClear’s longitudinal data shows quality degradation within months—refactoring collapse and duplication increase appear in quarterly metrics. Security vulnerabilities may not surface until production incidents, making them higher-risk and harder to track preventatively.

What’s the difference between AI code quality and AI code security?

Quality encompasses readability, maintainability, architectural consistency, and defect density—measurable through code review and testing. Security focuses specifically on exploitable vulnerabilities (injection attacks, authentication bypasses)—measurable through security scanning tools. CodeRabbit found both quality (3× worse readability) and security (2.74× more vulnerabilities) degradation.

Can AI tools improve code quality over time with better prompts?

Prompt engineering reduces hallucination rates and improves initial output quality, but doesn’t eliminate limitations—AI lacks whole-codebase architectural understanding, can’t reason about security threat models, and doesn’t refactor for maintainability. Better prompts help, but don’t achieve human-expert quality levels.

How do you convince executives to invest in quality gates when vendor demos look impressive?

Present METR’s perception-reality gap (20% faster feeling vs. 19% slower measurement), GitClear’s technical debt metrics (4× duplication, refactoring collapse), and CodeRabbit’s security findings (2.74× vulnerabilities). Frame as risk mitigation—quality gates cost less than production incidents, security breaches, and refactoring projects.

What research is being conducted on AI coding tool improvements?

Current research focuses on better code understanding (larger context windows, retrieval-augmented generation), security-aware generation (training on secure code patterns), and automated quality improvement (AI-powered refactoring, test generation). However, limitations (lack of architectural reasoning, security threat modelling) remain unsolved.

Should different team experience levels use AI tools differently?

Yes—senior developers benefit from AI handling tedious work (boilerplate, test scaffolding) while maintaining quality through experience-based review. Junior developers risk skill development gaps and should use AI under supervision with mandatory senior review. The METR study used experienced developers—junior developer impacts may be worse.
