Business | SaaS | Technology
Feb 17, 2026

The Case Against Vibe Coding – Understanding, Craftsmanship, and Long-Term Costs

AUTHOR

James A. Wondrasek

AI coding tools promise massive productivity gains. And they deliver, at least initially. But something is happening beneath the surface—your codebase is piling up debt faster than you realise.

GitClear analysed 211 million lines of code across 2020-2024. CodeRabbit compared 470 pull requests—320 AI-generated versus 150 human-written. Veracode tested over 100 LLMs across four programming languages. All three studies point the same way: vibe coding—generating code without understanding it—undermines maintainability, security, and the resilience of your team.

This article unpacks the difference between vibe coding and augmented coding—Kent Beck’s disciplined alternative that keeps understanding at the centre while still using AI. It’s about sustainable pace versus short-term velocity.

For a complete strategic overview of vibe coding and its implications for engineering leaders, see our comprehensive guide. This article presents the evidence for why understanding code still matters.

Why Does Understanding Code Matter for Long-Term System Health?

Understanding is the foundation of maintainability. When developers actually comprehend their codebase, they can debug, extend, and refactor it efficiently. Without comprehension, every change becomes a high-risk experiment.

Kent Beck put it simply: in vibe coding you don’t care about the code, just the behaviour of the system. In augmented coding you care about the code, its complexity, the tests, and their coverage. The value system in augmented coding is similar to hand coding—tidy code that works.

His B+Tree project demonstrated this. Augmented coding can create production-ready, performance-competitive library code while maintaining code comprehension. You focus on the consequential design decisions rather than the repetitive implementation details.

Debugging code you did not write and do not understand is expensive. The time cost compounds as the codebase grows and the original context is lost. Jeremy Twei coined the term “comprehension debt” for this—the growing gap between code a team can review syntactically and code they actually understand architecturally.

Unlike technical debt, which shows up in metrics like duplication and complexity, comprehension debt is hidden until an incident reveals that no one truly understands how something works. You can review code competently even after your ability to write it from scratch has atrophied, but there’s a threshold where “review” becomes “rubber stamping”.

For smaller teams, losing institutional knowledge is serious. Fewer engineers means each person’s understanding carries more weight. When developers rely on AI to generate code they never deeply engage with, institutional knowledge degrades. Your bus factor shrinks.

Organisational resilience depends on shared understanding across the team. That’s not Luddite resistance. That’s stewardship.

What Is the 70% Problem and Why Does “Almost Right” Code Cost More?

The 70% Problem, coined by Addy Osmani, describes AI-generated code that appears mostly correct but requires disproportionate human effort to complete, debug, and make production-ready.

AI errors have evolved from syntax bugs to conceptual failures—the kind a sloppy, hasty junior developer might make under time pressure.

Stack Overflow’s 2025 survey showed 66% of developers experience “AI solutions that are almost right, but not quite” as their top frustration. 45% reported that “debugging AI code takes longer than writing it myself”. Only 16% reported great productivity improvements from AI tools, while half saw only modest gains.

The completion cost paradox is real. Finishing the remaining 30% often takes longer than writing the code from scratch, because you have to reverse-engineer AI assumptions before fixing them.

Here’s a concrete example. An AI generates an authentication module. The happy path works—valid login credentials succeed. But edge cases fail: password reset flows break, session timeout handling doesn’t exist, concurrent login conflicts cause data corruption. You spend three hours debugging and fixing what would have taken one hour to write manually.
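
To make the pattern concrete in code, here is a minimal, hypothetical sketch of such a module: the happy path demonstrably works, while the gaps described above sit silently beside it. Every name (USERS, SESSIONS, authenticate) is invented for illustration.

```python
# Hypothetical sketch of the "almost right" failure mode. The happy
# path works; the commented gaps are the 30% that costs the hours.

import secrets

USERS = {"alice": "correct-horse"}   # toy credential store
SESSIONS: dict[str, str] = {}        # session token -> username

def authenticate(username: str, password: str) -> str | None:
    """Happy path works: valid credentials yield a session token."""
    if USERS.get(username) == password:
        token = secrets.token_hex(16)
        SESSIONS[token] = username   # gap: no expiry stored, so session
        return token                 # timeout handling cannot exist
    return None

def reset_password(username: str, new_password: str) -> None:
    # gap: nothing verifies that the requester owns the account,
    # so the reset flow breaks (or worse) under real use
    USERS[username] = new_password

# gap: concurrent logins mutate SESSIONS with no locking; under a
# threaded server, interleaved writes can corrupt shared state.
```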

The “almost right” pattern is psychologically dangerous. You trust code that looks correct, which reduces scrutiny and delays bug discovery until production. Andrej Karpathy described the problem: “The models make wrong assumptions on your behalf and run with them without checking. They don’t manage confusion, don’t seek clarifications, don’t surface inconsistencies, don’t present tradeoffs, don’t push back when they should.”

Assumption propagation compounds the issue. The model misunderstands something early and builds an entire feature on faulty premises. You don’t notice until you’re five PRs deep and the architecture is cemented.

Yoko Li captured the psychological hook: “The agent implements an amazing feature and got maybe 10% of the thing wrong, and you’re like ‘hey I can fix this if I just prompt it for 5 more mins.’ And that was 5 hrs ago.”

When completion cost exceeds writing from scratch, the productivity gain is negative.

What Does GitClear’s Analysis of 211 Million Lines of Code Reveal About Technical Debt?

GitClear analysed 211 million lines of code changed between January 2020 and December 2024 from repos owned by Google, Microsoft, Meta, and enterprise C-Corps. It’s the largest longitudinal dataset on how AI coding tools affect codebase health.

Refactoring collapsed from 25% to under 10% of developer activity—a 60% decline. Developers are generating new code rather than improving existing code.

Code duplication increased 4x in volume. For blocks of five or more lines, duplication increased 8x. This violates the DRY principle at scale and creates maintenance burdens across entire codebases.
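
As an invented illustration of what that duplication looks like, here are two handlers that each inline the same validation block rather than sharing a helper, followed by the DRY alternative. Function and field names are made up.

```python
# Invented example of the duplicated-block pattern GitClear measured.

def create_user(payload: dict) -> None:
    # duplicated block #1
    email = payload.get("email", "")
    if not email:
        raise ValueError("email is required")
    if "@" not in email:
        raise ValueError("email is invalid")
    print(f"created user {email}")

def invite_user(payload: dict) -> None:
    # duplicated block #2: the same lines, pasted rather than reused
    email = payload.get("email", "")
    if not email:
        raise ValueError("email is required")
    if "@" not in email:
        raise ValueError("email is invalid")
    print(f"invited user {email}")

def validated_email(payload: dict) -> str:
    """The DRY alternative: one canonical, testable implementation."""
    email = payload.get("email", "")
    if not email:
        raise ValueError("email is required")
    if "@" not in email:
        raise ValueError("email is invalid")
    return email
```

Every pasted copy is a future maintenance site: fix the rule once in a shared helper, or hunt down every duplicate when it changes.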

Code churn—code written and then rewritten or deleted shortly after—nearly doubled, indicating wasted effort and instability. For the first time in history, “copy/paste” code exceeded “moved” code (code reuse).

These metrics compound over time. Technical debt increases an estimated 30-41%. Unlike financial debt, technical debt accrues interest in the form of slower development, more bugs, and higher incident rates.

Research suggests that DRY, modular approaches retain high project velocity over years. Canonical systems are documented, well-tested, reused, and periodically upgraded. AI-generated code moves in the opposite direction.

Smaller teams with fewer engineers cannot absorb a 30-41% increase in maintenance burden. The debt accumulates faster than you can pay it down. For your organisation, this may be a serious threat, not an academic concern.

Thoughtworks Technology Radar cited GitClear’s research when placing “AI coding complacency” on “Hold” status—their strongest cautionary rating.

How Does AI-Generated Code Compare to Human Code on Quality Metrics?

CodeRabbit analysed 470 pull requests—320 AI-generated, 150 human-written. AI code has 1.7x more issues overall than human-written code. AI-authored changes produced 10.83 issues per PR, compared to 6.45 for human-only PRs.

Logic errors are 75% more frequent in AI-generated code. These are not formatting or style issues but functional defects that affect correctness—business logic mistakes, incorrect dependencies, flawed control flow, and misconfigurations.

Readability issues are 3x more common. Poor variable naming, convoluted structure, and inconsistent patterns make code harder for humans to review and maintain. Readability spiked more than anything else in the dataset—the single biggest difference.

Error handling and exception-path gaps were nearly 2x more common. Performance regressions, though small in number, skewed heavily toward AI—excessive I/O operations were 8x more common in AI-authored PRs.

Concurrency and dependency correctness saw 2x increases in AI PRs. Formatting problems were 2.66x more common. AI introduced nearly 2x more naming inconsistencies.

These quality gaps aren’t random. AI optimises for “looking correct” rather than being maintainable. It produces code that passes superficial review but degrades over time.

AI lacks knowledge of local business logic. Models infer code patterns statistically, not semantically. Without strict constraints, they miss the rules of the system that senior engineers internalise. They generate surface-level correctness—code that looks right but may skip control-flow protections or misuse dependency ordering. Naming patterns, architectural norms, and formatting conventions often drift toward generic defaults. And AI favours superficial simplicity over efficiency, often defaulting to naive loops, repeated I/O, or unoptimised data structures.
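
The repeated-I/O failure is easiest to see in code. Below is an invented sketch of the N+1 query pattern: one database round-trip per item where a single grouped query would do. Table and column names are made up for the example.

```python
# Invented N+1 example: the naive version issues one query per user.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INTEGER, total REAL);
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

def totals_naive(user_ids):
    # one round-trip per user: fine at 10 users, painful at 10,000
    return {
        uid: conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (uid,),
        ).fetchone()[0]
        for uid in user_ids
    }

def totals_batched(user_ids):
    # single query, grouped by the database
    marks = ",".join("?" * len(user_ids))
    rows = conn.execute(
        f"SELECT user_id, SUM(total) FROM orders "
        f"WHERE user_id IN ({marks}) GROUP BY user_id",
        tuple(user_ids),
    ).fetchall()
    return dict(rows)

print(totals_naive([1, 2]))    # {1: 15.0, 2: 7.5}
print(totals_batched([1, 2]))  # {1: 15.0, 2: 7.5}
```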

Review fatigue compounds the problem. Reviewers spend 91% more time on AI-generated PRs. Under volume pressure, quality of review declines—leading to rubber-stamping. A recent Cortex report found that while pull requests per author increased by 20% year-over-year thanks to AI, incidents per pull request increased by 23.5%.

No issue category was uniquely AI, but most categories saw significantly more errors in AI-authored PRs. Humans and AI make the same kinds of mistakes. AI just makes many of them more often and at a larger scale.

Here are the numbers in one view:

Issues per PR: 10.83 (AI) vs 6.45 (human), 1.7x overall
Logic errors: 75% more frequent in AI PRs
Readability issues: 3x more common
Error and exception handling gaps: ~2x more common
Excessive I/O operations: 8x more common
Concurrency and dependency issues: 2x more common
Formatting problems: 2.66x more common
Naming inconsistencies: ~2x more common
Reviewer time per PR: 91% higher

These are measurable, compounding costs.

Why Does AI Struggle With Complex and Legacy Codebases?

These quality problems get worse in complex production environments.

LLMs have a fundamental limitation: context windows restrict the amount of code they can process at once. Output quality declines as codebases grow—a pattern called context rot.
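
A back-of-envelope sketch shows why this bites in practice: even a modest repository overflows a typical context window. The 4-characters-per-token heuristic and the 200,000-token window below are rough assumptions, not any vendor’s specification.

```python
# Rough repository-size estimate vs a context window. Both the
# chars-per-token heuristic and the window size are assumptions.

from pathlib import Path

CHARS_PER_TOKEN = 4               # common rule of thumb for code
CONTEXT_WINDOW_TOKENS = 200_000   # illustrative window size

total_chars = sum(
    len(p.read_text(errors="ignore"))
    for p in Path(".").rglob("*.py")
)
est_tokens = total_chars // CHARS_PER_TOKEN
print(f"~{est_tokens:,} estimated tokens vs a "
      f"{CONTEXT_WINDOW_TOKENS:,}-token window "
      f"({est_tokens / CONTEXT_WINDOW_TOKENS:.1f}x)")
```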

AI excels at generating boilerplate and greenfield code but struggles with architectural decisions that require understanding system-wide implications. Most real-world development happens in complex legacy systems, not greenfield projects—the exact environment where AI tools perform worst. In mature codebases with complex invariants, the calculus inverts. The agent doesn’t know what it doesn’t know. It can’t intuit the unwritten rules. Its confidence scales inversely with context understanding.

If you’re managing existing codebases—and most teams are—the productivity narrative around AI tools is misleading. Gains demonstrated on simple projects do not transfer to production systems with years of accumulated context.

Abstraction bloat emerges when AI creates elaborate class hierarchies or 1000-line implementations where 100 lines would suffice. It optimises for “looking comprehensive” rather than maintainability.

How Vulnerable Is AI-Generated Code and What Are the Security Risks?

Veracode analysed more than 100 LLMs across four programming languages and found that AI-generated code introduced risky security flaws in 45% of tests. Nearly half of all AI output introduces potential attack vectors, and security issues were up to 2.74x more common in AI-generated code than in human-written code.

The most prominent security pattern involved improper password handling and insecure object references. Common vulnerability types include SQL injection (CWE-89), weak cryptography (CWE-327), cross-site scripting (CWE-80), and log injection (CWE-117)—patterns where AI defaults to insecure but functional implementations.
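
To ground one of those CWEs, here is a minimal sqlite3 sketch of CWE-89: the first query interpolates user input directly into the SQL string, the insecure-but-functional default, while the second parameterises it. The schema is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user_unsafe(name: str):
    # CWE-89: interpolating input into SQL executes attacker-controlled
    # syntax; "x' OR '1'='1" returns every row in the table
    return conn.execute(
        f"SELECT * FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(name: str):
    # parameterised query: the driver treats input as data, not SQL
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (name,)
    ).fetchall()

print(find_user_unsafe("x' OR '1'='1"))  # leaks the whole table
print(find_user_safe("x' OR '1'='1"))    # []
```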

No vulnerability type was unique to AI, but nearly all were amplified.

Stack Overflow’s case study of a vibe-coded app found it ripe for hacking: no security features stopped anyone from accessing the data it stored. And because vibe coding tools promise powerful results without developer experience, plenty of inexperienced builders will use tools like Bolt to ship passion projects that collect ZIP codes, email addresses, dates of birth, or passwords without proper protection.

Bigger models did not mean more secure code: larger, newer AI models showed no security improvement, and no major language was immune.

Security patterns degrade without explicit prompts. Unless guarded, models recreate legacy patterns or outdated practices found in older training data. AI lacks security context—it generates code that works but does not understand threat models, attack surfaces, or the security implications of architectural choices.
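
As a hedged example of that legacy drift: unsalted, fast MD5 hashing (a CWE-327-style weakness abundant in older training data) next to a salted, memory-hard KDF. The scrypt parameters are illustrative, not tuned guidance for any particular deployment.

```python
import hashlib
import hmac
import os

def hash_password_legacy(password: str) -> str:
    # CWE-327-style pattern common in older code: fast, unsalted MD5
    # is trivially brute-forced with commodity hardware
    return hashlib.md5(password.encode()).hexdigest()

def hash_password(password: str) -> tuple[bytes, bytes]:
    # memory-hard KDF with a per-user random salt
    salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, digest)
```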

Security debt compounds technical debt. Unresolved vulnerabilities persist and multiply, creating compliance exposure for regulated industries and increasing incident risk.

This provides a brief overview. A comprehensive security deep-dive with specific vulnerability analysis is available in our dedicated security article. For mitigation strategies and quality gates for AI code, see our responsible AI framework.

What Does the Thoughtworks Technology Radar Say About AI Coding?

These security and quality concerns haven’t gone unnoticed by industry authorities, as we explore throughout our complete guide for engineering leaders.

Thoughtworks Technology Radar placed “AI coding complacency” on “Hold” status—their strongest cautionary rating. The “Hold” status means the risks currently outweigh the benefits for most organisations and that uncritical adoption is dangerous.

Thoughtworks specifically warns against the complacency pattern—teams adopting AI coding tools without quality gates, review standards, or comprehension requirements. The rise of coding agents further amplifies these risks, since AI now generates larger change sets that are harder to review.

GitClear’s research found that duplicate code and code churn have risen more than expected, while refactoring activity in commit histories has dropped. Microsoft research on knowledge workers shows that AI-driven confidence often comes at the expense of thinking—a pattern observed as complacency sets in with prolonged use of coding assistants.

The emergence of “vibe coding”—where developers let AI generate code with minimal review—illustrates the growing trust of AI-generated outputs. Thoughtworks strongly cautions against using vibe coding for production code, though this approach can be appropriate for things like prototypes or other types of throw-away code.

The Thoughtworks assessment validates sceptical engineering leaders. It’s responsible risk management backed by an industry-respected authority.

This assessment aligns with independent findings from Kent Beck, Addy Osmani, and Chris Lattner—multiple authoritative voices converging on the same conclusion.

If you’re facing pressure to adopt AI coding tools, the Thoughtworks “Hold” provides an evidence-based rationale for cautious, measured adoption rather than wholesale embrace.

As with any system, speeding up one part of the workflow increases pressure on the others. Studies show that code quality can decline over time with prolonged AI assistant use. Thoughtworks teams are finding that using AI effectively in production requires renewed focus on code quality.

Thoughtworks recommends reinforcing established practices such as TDD and static analysis, and embedding them directly into coding workflows.
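
A minimal sketch of what that embedding can look like, assuming a pytest workflow: the failing test exists before any AI-generated implementation does, so the agent’s output is verified against behaviour you specified. The module and function names are invented.

```python
# test_discount.py -- written BEFORE prompting the agent for code.
# The test pins down behaviour the generated implementation must
# satisfy, turning review into verification rather than guesswork.

import pytest

from discount import apply_discount  # module the agent will generate

def test_discount_is_applied():
    assert apply_discount(price=100.0, percent=10) == 90.0

def test_discount_cannot_exceed_100_percent():
    with pytest.raises(ValueError):
        apply_discount(price=100.0, percent=150)
```

For precise definitions of vibe coding versus augmented coding referenced by Thoughtworks, see our terminology guide.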

What Is the Case for Cautious Adoption as a Stewardship Duty?

The evidence from GitClear, CodeRabbit, Veracode, and Thoughtworks converges on one conclusion: unrestricted AI code generation creates measurable, compounding costs that threaten long-term system health.

There’s ample evidence these tools can accelerate development—especially for prototyping and greenfield projects—but studies show that code quality can decline over time. Data from Faros AI and Google’s DORA report show teams with high AI adoption merged 98% more PRs but saw review times balloon 91%.

PR size increased 154% on average. Code review became the new bottleneck.

Atlassian’s 2025 survey found the paradox in stark terms: 99% of AI-using developers reported saving 10+ hours per week, yet most reported no decrease in overall workload. The time saved writing code was consumed by organisational friction—more context switching, more coordination overhead, managing the higher volume of changes.

When you make a resource cheaper—code generation—consumption increases faster than efficiency improves, and total resource use goes up. Economists call this the Jevons paradox.

DORA’s 2025 report crystallised the reality: AI is an amplifier of your development practices. Good processes get better (high-performing teams saw 55-70% faster delivery). Bad processes get worse (accumulating debt at unprecedented speed).

You are accountable not just for shipping features but for the long-term health of the systems your organisation depends on. That responsibility demands caution with tools that degrade quality.

Smaller organisations face disproportionate risk. They cannot absorb a 30-41% technical debt increase, they have fewer senior engineers available for review, and recovery from accumulated debt is harder.

The choice is between vibe coding (comprehension-free) and augmented coding (disciplined, quality-focused). Responsible leaders choose the approach that preserves understanding.

Kent Beck on the future: “LLM coding will split up engineers based on those who primarily liked coding and those who primarily liked building.” If you liked the act of writing code itself—the craft of it, the meditation of it—this transition might feel like loss. If you liked building things and code was the necessary means, this feels like liberation.

But the danger isn’t that the agent fails—it’s that it succeeds in the wrong direction while you stop checking the compass.

Effective patterns include agent-first drafts with tight iteration loops, declarative communication (spend 70% of effort on problem definition, 30% on execution), automated verification, deliberate learning versus production focus, and architectural hygiene.
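
As one sketch of the automated-verification piece, assuming a Python project: a gate script that refuses agent-generated changes unless the test suite and a static analyser both pass. The specific tools (pytest, ruff) are common choices, not a prescribed stack.

```python
# Hypothetical verification gate for agent-generated changes.

import subprocess
import sys

CHECKS = [
    ["pytest", "--quiet"],     # behaviour: the automated test suite
    ["ruff", "check", "."],    # hygiene: lint / static analysis
]

def verify() -> bool:
    """Run every check; reject the change set on the first failure."""
    for cmd in CHECKS:
        if subprocess.run(cmd).returncode != 0:
            print(f"FAILED: {' '.join(cmd)}")
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if verify() else 1)
```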

The developers who thrive won’t be those who generate the most code—they’ll be those who know which code to generate, when to question the output, and how to maintain comprehension even as their hands leave the keyboard.

Use AI to accelerate learning, not skip it. Focus on fundamentals that matter more than ever: robust architecture, clean code, thorough tests, thoughtful UX.

Choosing cautious adoption represents a commitment to building software that lasts.

For a complete guide to navigating vibe coding as an engineering leader, see our vibe coding strategic overview for engineering leaders. For risk assessment and vulnerability analysis, read our security deep-dive on AI-generated code vulnerabilities. For implementation guidance and quality gates, explore our responsible AI development framework.

FAQ Section

Is vibe coding always bad or are there legitimate use cases?

Vibe coding can work for throwaway prototypes, hackathon projects, and personal experiments where long-term maintenance isn’t a concern. The risks emerge when vibe-coded output enters production codebases that teams need to maintain, debug, and extend over months and years. The key distinction is whether the code carries ongoing maintenance responsibility.

It works in personal projects where you control everything, MVPs where “good enough” is actually good enough, startups in greenfield territory without legacy constraints, and teams small enough that comprehension debt stays manageable.

What is the difference between vibe coding and augmented coding?

Vibe coding means generating code via AI prompts without deeply understanding the output—you focus on behaviour over quality. Augmented coding, defined by Kent Beck, means using AI as a tool while maintaining discipline around testing, code quality, architecture, and comprehension. The difference is whether you care about and understand the code, not whether you use AI.

How much more technical debt does AI-generated code create?

GitClear’s analysis found significant declines in refactoring activity, increases in code duplication, and nearly doubled code churn. These metrics compound to an estimated 30-41% increase in technical debt, which translates to slower development velocity, more bugs, and higher incident rates over time.

Can code review catch the quality problems in AI-generated code?

Code review helps but faces significant challenges with AI-generated code. CodeRabbit found reviewers spend 91% more time on AI pull requests, and the volume of AI-generated code creates review fatigue that leads to rubber-stamping. Effective review requires comprehension—which is precisely what vibe coding bypasses.

Only 48% of developers consistently check AI-assisted code before committing it, even though 38% find that reviewing AI-generated logic actually requires more effort than reviewing human-written code.

Why does AI-generated code have more security vulnerabilities?

AI models optimise for functional correctness, not security. They default to insecure but working implementations because their training data includes vast amounts of insecure code. Veracode found AI code contains up to 2.74x more vulnerabilities because models lack threat model awareness and cannot reason about attack surfaces the way security-conscious developers can.

Does the Thoughtworks “Hold” rating mean organisations should avoid AI coding tools entirely?

No. “Hold” means adopt with extreme caution, not never use. Thoughtworks specifically warns against AI coding complacency—the pattern of adopting tools without quality gates, review standards, or comprehension requirements. Organisations can use AI coding tools responsibly through augmented coding practices that maintain human oversight.

What is comprehension debt and how does it differ from technical debt?

Comprehension debt, coined by Jeremy Twei, describes the growing gap between code a team can review syntactically and code they actually understand architecturally. Unlike technical debt, which is visible in metrics like duplication and complexity, comprehension debt is hidden until an incident reveals that no one truly understands how something works.

How does the 70% Problem affect team productivity in practice?

The 70% Problem means AI-generated code appears mostly correct but the final 30% requires disproportionate effort. In practice, you might accept AI-generated authentication code that handles the happy path but spend 3 hours debugging edge cases that would have taken 1 hour to write manually. At scale, this erases the productivity gains AI tools promise.

Are smaller organisations more at risk from vibe coding than larger enterprises?

Yes. Smaller organisations face disproportionate risk because they have fewer senior engineers to review AI code, less capacity to absorb a 30-41% technical debt increase, and smaller teams where each developer’s understanding matters more. Larger enterprises can dedicate specialised teams to code quality. Smaller organisations often cannot.

What metrics should you track to detect vibe coding problems early?

Key indicators include: refactoring rate (should not decline below 15-20% of activity), code duplication trends (rising duplication signals lack of abstraction), code churn (high write-delete cycles indicate instability), cognitive complexity scores (rising complexity means harder maintenance), and PR review time (increasing review burden suggests quality degradation).
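
For teams wanting a starting point, here is a rough sketch of measuring one of those signals, code churn, from git history. The two-week window and the ratio interpretation are illustrative assumptions, not standard thresholds.

```python
# Crude churn proxy: lines deleted relative to lines added over a
# recent window, parsed from `git log --numstat`.

import subprocess

def added_and_deleted(since: str = "2.weeks") -> tuple[int, int]:
    out = subprocess.run(
        ["git", "log", f"--since={since}", "--numstat", "--format="],
        capture_output=True, text=True, check=True,
    ).stdout
    added = deleted = 0
    for line in out.splitlines():
        parts = line.split("\t")
        # numstat lines look like "12\t4\tpath"; binary files show "-"
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            added += int(parts[0])
            deleted += int(parts[1])
    return added, deleted

added, deleted = added_and_deleted()
ratio = deleted / added if added else 0.0
print(f"+{added} / -{deleted} lines; delete/add ratio {ratio:.2f}")
```

A rising delete/add ratio suggests code is being written and then torn out shortly afterwards, the write-delete instability GitClear flagged.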

Can AI coding tools improve over time and eliminate these quality concerns?

Current limitations are architectural, not just training deficiencies. Context window constraints, lack of threat model awareness, and inability to reason about system-wide implications are fundamental to how LLMs work. While AI tools will improve incrementally, the need for human comprehension and quality oversight is unlikely to disappear. Augmented coding—human-AI collaboration with human understanding—remains the responsible approach.
