May 29, 2026

From Vibe to Spec — Why AI Coding Is Growing Up

AI-assisted development has a production problem. Three failure modes keep showing up across independent data sources: context decay, hallucinated architecture, and quality debt accumulation. These aren’t edge cases or operator errors. They’re what you get when you run a workflow at production scale that was never designed to run there.

In April 2026 an Amazon AI coding incident made this concrete at enterprise scale — a 13-hour service disruption, officially classified as user error, but read by the engineering community as something more structural. That incident has accelerated a shift that was already underway: from vibe coding to spec-driven development.

This is a diagnostic piece, not a tutorial. The broader strategic context is in the pillar article. What follows is an evidence-based walkthrough of why the shift is happening, what it means technically, and whether it’s a genuine change or just a rebranding of ideas we’ve been cycling through since TDD.

What is vibe coding and why did engineers embrace it?

Andrej Karpathy coined “vibe coding” in February 2025 to describe a specific way of working: write natural language descriptions of what you want, let the AI generate the implementation, accept the output if the application runs, and move on. His original description was explicit about the posture — “I ‘Accept All’ always, I don’t read the diffs anymore.” He also framed the scope honestly: suited for throwaway weekend projects, not production systems.

The productivity case was real. Nearly 44% of developers had adopted AI coding tools by 2023, with some projects reporting productivity gains of up to 55%. Platforms like Cursor, GitHub Copilot, Replit, Lovable, and Bolt proliferated faster than anyone expected. The barrier to prototyping dropped fast, and people who’d never thought of themselves as developers started shipping real products.

Vibe coding works in contexts where rapid exploration matters more than long-term maintainability: scripts, greenfield experiments, single-session throwaway builds. The problem is assuming a workflow optimised for small-scope single-session work will hold up when you scale it. At a certain threshold of session length, codebase size, and production consequence, the structural assumptions collapse — and that’s what the next section is about.

For spec-driven development and what it means for engineering leaders in production contexts, the pillar article covers the broader picture.

Where does vibe coding break down in production?

The three failure modes aren’t random. They’re built into the workflow’s architecture.

Context decay happens because every AI session starts with an empty context window. Architectural decisions made in session one are invisible in session ten. The codebase accumulates inconsistencies the AI cannot see.

Hallucinated architecture is the patchwork problem: the AI generates structural decisions — module boundaries, data models, inter-service dependencies — that were never requested and often don’t fit together. Steve Tarcza, director of Amazon Stores, saw this directly: AI was “doing work you didn’t ask it to, going further than you wanted.”

Quality debt accumulation is the formal name for what builds up when generated output is accepted without scrutiny. The arXiv flow–debt tradeoff paper documents the mechanism: vibe coding optimises for rapid flow at the cost of accumulating debt in the form of security vulnerabilities, missing documentation, and architectural fragility. Apiiro found a 322% spike in privilege escalation flaws in repositories with high AI code contribution. More than 170 applications built on the Lovable platform had misconfigurations exposing sensitive data.

The failure modes compound. Context decay causes hallucinated architecture, which accelerates quality debt. By the time the problem is visible, remediation is expensive.

What is context decay and why does it matter?

Context decay is what happens when an AI agent’s working memory resets between sessions, discarding the architectural decisions accumulated so far. Over a single session, that’s fine. Over a multi-session production project, the AI in session ten has no idea why module A was designed the way it was in session two.

LLM-based agentic coding assistants lack persistent memory: they lose coherence across sessions, forget project conventions, and repeat known mistakes. This is a constraint on how LLMs work, not a model quality problem.

The result is a codebase that’s internally inconsistent in ways that are hard to see from any one vantage point. The InnoGames engineering blog describes what it looks like in practice: the model “starts hallucinating issues that don’t exist, initiating unrequested refactorings, or losing track of what it was actually asked to do. This is not a theoretical concern — it happens reliably and predictably once the context gets crowded enough.”

Debugging becomes archaeology. Refactoring becomes guesswork.

SDD’s response is structural: persistent specification documents committed to the repository serve as the AI’s external memory. Any agent reads them at session start. A study of 283 development sessions building a 108,000-line distributed system found that agents with a “hot-memory constitution” — always-loaded specification documents — repeated far fewer known mistakes and maintained coherence across sessions.

That context decay problem is also what makes the Amazon incident worth understanding.

What did the Amazon incident reveal about agentic coding at scale?

In April 2026, an AI coding incident at Amazon produced a 13-hour service disruption. Amazon officially classified it as user error — a classification that matters and shouldn’t be dismissed.

Aragon Research’s framing adds the structural dimension: “The decision by an AI to delete a production environment reflects a lack of contextual awareness — the agent understands the technical goal but lacks the business judgement to weigh the cost of a thirteen-hour outage against the benefits of a fresh environment.” The agent had the capability. It lacked the contextual constraints that human engineers carry implicitly. Both frames can be true simultaneously.

The permission model is what made it possible. When an AI agent inherits senior engineer permissions, the speed of execution removes the human buffer that typically stops destructive commands from being finalised. Amazon’s response was governance escalation alongside continued adoption: Steve Tarcza stated “nothing ships without a human checking it first” — while Amazon simultaneously mandated 80% AI tool adoption.
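
The “human buffer” described above can be sketched as an approval gate between the agent and the shell. This is a minimal illustration under stated assumptions, not a description of Amazon’s tooling: the verb list, the function names `is_destructive` and `gate`, and the `approve` callback are all hypothetical.

```python
import shlex
from typing import Callable

# Illustrative verb list only (an assumption); a real policy would be
# defined per environment and would be far more precise.
DESTRUCTIVE_VERBS = {"rm", "drop", "delete", "destroy", "terminate"}


def is_destructive(command: str) -> bool:
    """Flag a command if any of its tokens matches the destructive-verb list."""
    tokens = shlex.split(command.lower())
    return any(token in DESTRUCTIVE_VERBS for token in tokens)


def gate(command: str, approve: Callable[[str], bool]) -> bool:
    """Return True if the command may run.

    Non-destructive commands pass straight through. Destructive ones
    run only if the human approver callback explicitly returns True.
    """
    if not is_destructive(command):
        return True
    return approve(command)
```

With an approver that always declines, `gate("ls -la", deny)` still passes while `gate("envctl delete prod", deny)` is blocked (`envctl` is a made-up command). The point of the design is that the agent’s inherited permissions stop being the only control.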

The full structural analysis is in the companion article on Amazon’s documented AI coding incident.

What is spec-driven development and how does it respond to these failures?

Spec-driven development (SDD) is a software development approach where structured, human-reviewed specifications are written and committed to the repository before any code is generated. The spec is the persistent source of truth across sessions and agents.

Prezi engineers describe the shift directly: “With spec-driven development, your code merely becomes the output of your work. It’s like the rendered MP4 file in a video project. You want a change? You edit the project, not the pixels in the rendered video.”

The core architecture is a “constitution” — typically Mission, Tech Stack, and Roadmap documents committed to the repository. Any agent starting a new session reads the constitution first. That’s the mechanism that addresses context decay: external memory, version-controlled, accessible in any session.
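
As a rough sketch of that mechanism, the snippet below reads three constitution documents from the repository and places them ahead of the task in every session prompt. The file names, the `spec_dir` layout, and both function names are assumptions made for illustration; the sources cited here name only the document types, not a file convention.

```python
from pathlib import Path

# Hypothetical file names; only the three document types come from the article.
CONSTITUTION_FILES = ("mission.md", "tech-stack.md", "roadmap.md")


def load_constitution(spec_dir: Path) -> str:
    """Concatenate the constitution documents into one block of text."""
    sections = []
    for name in CONSTITUTION_FILES:
        path = spec_dir / name
        if not path.is_file():
            # Fail loudly: starting a session without the constitution
            # recreates the empty-context situation it is meant to avoid.
            raise FileNotFoundError(f"missing constitution document: {path}")
        body = path.read_text(encoding="utf-8").strip()
        sections.append(f"## {name}\n{body}")
    return "\n\n".join(sections)


def build_session_prompt(spec_dir: Path, task: str) -> str:
    """Put the constitution ahead of the task so every session starts from it."""
    return f"{load_constitution(spec_dir)}\n\n## Task\n{task}"
```

Because the documents live in version control, a change to an architectural decision is a reviewed commit rather than something the next session has to rediscover.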

Against hallucinated architecture, SDD constrains the solution space before generation begins. Against quality debt, conformance tests paired with the spec mean the AI can’t silently expand scope.
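
One way to picture a scope check of this kind: the spec declares which paths a task may touch, and a check flags any changed file outside them. This is a sketch under assumptions, not a feature of any named tool; the idea of an allowed-paths list in the spec and the function name `out_of_scope` are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def out_of_scope(changed_files: Iterable[str], allowed_patterns: Iterable[str]) -> List[str]:
    """Return the changed files that match none of the spec's allowed patterns.

    Patterns use shell-style wildcards. Note that `*` in fnmatch also
    matches `/`, so "src/auth/*" covers nested directories too.
    """
    patterns = list(allowed_patterns)
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]
```

A non-empty result does not prove the change is wrong. It marks the point where the agent went beyond what the spec asked for, which is where a human reviewer should look first.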

The honest position: SDD reduces hallucination, it doesn’t eliminate it. Asked whether spec-driven development solves hallucination and prompt injection, Tarcza was direct: “No. It reduces it at best. And even then, there are cases where it still does go beyond the specification.”

Karpathy’s proposed successor term — “agentic engineering” — describes exactly this model: developers orchestrate AI agents against specifications with human oversight, rather than prompting and accepting. The coiner of “vibe coding” has effectively moved past the pure form.

Commercial implementations include AWS Kiro (Amazon’s spec-first agentic IDE), GitHub Spec Kit, and Intent from Augment Code. The full 30-plus framework landscape is covered in its own article.

Is this a genuine paradigm shift or just a rebranding of existing practices?

The sceptical position is worth taking seriously. TDD was supposed to solve the spec-before-code problem around 2000. BDD extended it in 2006. MDD tried to make models primary through the 1990s. Each generation stalled. To a sceptical reader, SDD looks like the same idea with new branding.

RedMonk analyst Kate Holterhoff’s “Adventures in Vibe Coding” contextualises this as a maturation phase rather than a revolution. That’s the useful framing — pattern recognition from people who’ve watched several of these cycles.

The load-bearing argument for why this cycle is different sits with MDD’s failure. Model-driven development collapsed because spec languages were too rigid and code generators couldn’t handle complexity. LLMs change both sides. As Thoughtworks observes: “Given LLMs’ ability to manipulate text, it’s unsurprising that specs may play so nicely with the growth of AI in software.” The constraint that defeated previous generations has been removed.

Prezi engineers describe SDD as the synthesis that LLM availability finally makes viable. The full treatment is in A Sufficiently Detailed Spec Is Code — the community principle behind the paradigm shift.

The maturation signal that distinguishes this cycle is not coming from a vendor keynote. Amazon Stores mandating human review of all AI output, post-incident, is a different kind of signal than a framework announcement. The evidence is still early, but the source matters.

Where does this leave engineering teams right now?

Vibe coding isn’t over. The arXiv practitioner guidelines are specific about its scope: small, self-contained, low-consequence tasks where rapid exploration matters more than long-term maintainability. The obituaries are premature. The scope boundary is what matters.

The decision you face right now is not whether to adopt SDD. It’s to identify which of your current AI coding workflows are already producing the conditions for context decay, hallucinated architecture, or quality debt — and to address those first.

The numbers make the gap concrete. Tarcza’s engineers spend less than 30% of their time on core engineering — the rest goes to oversight and governance. The Sonar 2026 State of Code survey found that 96% of developers don’t fully trust AI-generated code, yet only 48% always check it before committing. That gap is where quality debt accumulates.

Your practical starting point: prioritise specs where the cost of context decay is highest — long-running projects, multi-session workflows, production systems with security or data consequences. Add governance gates proportional to blast radius. Treat tooling as secondary to process.

The direction of travel, in Karpathy’s framing, is agentic engineering: you as the orchestrator of AI agents working against specifications, with human oversight at the gates that matter.

This article’s scope ends at the diagnostic. The governance and implementation action items are in Spec-Driven Development in Regulated Industries — governance, compliance, and audit trails. The tooling evaluation is in the 30-plus framework landscape. The question of what comes next is answered across the cluster.

FAQ

What is the difference between vibe coding and agentic engineering?

Vibe coding prioritises speed over structure: prompt, accept the output, iterate without reading every line. Agentic engineering keeps AI-driven implementation but adds orchestration — you manage agents against specifications with human oversight. The key distinction is governance. “Vibe coding is dead” is a less accurate read than “vibe coding has a defined scope.”

Why is context decay specifically a production problem rather than a development problem?

In development, context decay produces inconsistency that slows future work but doesn’t immediately cause failures. In production, those inconsistencies add up to a set of architectural decisions that no single review can fully audit. The AI in session ten genuinely cannot know why the auth module was designed in session two. Human reviewers face the same opacity when auditing a vibe-coded codebase under incident conditions.

Does spec-driven development actually fix hallucinations, or just reduce them?

It reduces them. Steve Tarcza at Amazon Stores is explicit that SDD reduces hallucination at best — there are still cases where the AI goes beyond the specification. A spec narrows the solution space before generation, but it can’t prevent all hallucinated decisions. SDD makes hallucinations more detectable and bounded — that’s the honest expectation to set.

What is the flow–debt tradeoff in vibe coding?

The arXiv paper “Vibe Coding in Practice: Flow, Technical Debt, and Guidelines for Sustainable Use” provides the formal framing. “Flow” is the rapid, low-friction development experience. “Debt” is the accumulating architectural inconsistencies, security vulnerabilities, and testing gaps. The tradeoff is structural: the workflow optimises for flow by skipping the specification work that prevents debt. Short-term productivity gains can be followed by maintenance costs that erode them.

Was the April 2026 Amazon incident really caused by vibe coding?

Amazon officially classified it as user error, and that classification shouldn’t be dismissed. What the engineering community observed was a governance and permission model failure — an AI agent operating with broad authorisation made decisions that required business judgement it didn’t have. Both frames can be true simultaneously: user error in the permissioning, and a structural condition that made the consequences possible at that scale.

When is vibe coding still appropriate and when does SDD make more sense?

Vibe coding works for small-scope, self-contained tasks, throwaway scripts, and single-session rapid prototyping with no production consequence. SDD is the right call when projects span multiple sessions, involve production systems, carry security or data consequences, or need to be maintained by others. The heuristic: if context decay would produce a codebase you can’t safely audit, vibe coding is not the right workflow.
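
The criteria above can be written down as a small decision function. It is only a restatement of the heuristic in code form, a sketch rather than a validated rule; the `Workflow` fields and the function name are hypothetical labels for the four conditions listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """The four conditions named in the heuristic, as yes/no answers."""
    multi_session: bool
    touches_production: bool
    security_or_data_impact: bool
    maintained_by_others: bool


def recommended_approach(workflow: Workflow) -> str:
    """Return "spec-driven" if any condition holds, otherwise "vibe"."""
    needs_spec = any((
        workflow.multi_session,
        workflow.touches_production,
        workflow.security_or_data_impact,
        workflow.maintained_by_others,
    ))
    return "spec-driven" if needs_spec else "vibe"
```

Treating any single condition as sufficient is a deliberately conservative reading; a team could weight the conditions differently.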

Is spec-driven development related to test-driven development (TDD)?

SDD is a direct descendant. InfoQ frames SDD as a “fifth-generation programming shift” with TDD, BDD, and MDD as predecessors. The key difference is that SDD is designed for AI agents as implementers: it uses natural language specifications that LLMs can interpret rather than executable tests in code. SDD’s spec layer is upstream of TDD’s test layer. They’re complementary, not competing.

What is a “constitution” in spec-driven development?

The arXiv codified context paper describes a “hot-memory constitution” as the top-level specification set: always-loaded conventions committed to the repository, typically Mission, Tech Stack, and Roadmap documents. Any agent starting a new session reads the constitution first. The architectural context lives in the repository rather than in the AI’s context window, accessible to any agent in any session.

What did Prezi engineers mean by calling SDD the “culmination of TDD, BDD, and MDD”?

Prezi engineers described SDD as synthesising the core insight from each predecessor — TDD’s executable specs, BDD’s natural language bridge, MDD’s model-as-primary-artifact. The synthesis is viable now because LLMs can interpret natural language specifications, removing the constraint that defeated MDD’s rigid spec languages. For engineers who know TDD and BDD, SDD is less a departure than a completion.

Are there companies using spec-driven development at scale in production right now?

Yes. Amazon Stores (StoreGen), led by Steve Tarcza, is the most detailed public case study — mandatory human-approval gates on all mutating AI actions, structured specifications before any AI-generated code goes to production. The Amazon Stores case emerged from governance necessity post-incident, not from framework advocacy. JetBrains and DeepLearning.AI jointly offer a “Spec-Driven Development with Coding Agents” course — practitioner training at this scale suggests adoption beyond early experimentation.

Does RedMonk think spec-driven development is just hype?

RedMonk analyst Rachel Stephens’ “Spec vs Vibes” contextualises SDD as the organised engineering response to the vibe coding failure pattern, consistent with how mature engineering practices have emerged historically. Kate Holterhoff’s “Adventures in Vibe Coding” frames it as a maturation phase, not a revolution. RedMonk’s value is vendor-neutrality and a track record of distinguishing genuine structural shifts from marketing cycles.
