James A. Wondrasek, Author at SoftwareSeni

Spec-Driven Development and the End of Vibe Coding — What Engineering Leaders Need to Know

In April 2026, a production disruption at Amazon — linked by analysts to an agentic coding session that misconfigured access controls — pushed AI coding governance onto engineering leadership’s agenda. Agentic AI coding tools had become powerful enough to do real damage without formal constraints. The response that has gained the most traction is spec-driven development (SDD): a methodology where structured specification documents, written before any code is generated, serve as the binding contract for AI agents.

This page answers the broad questions in plain terms and points you to the dedicated articles for the depth you need.


What is spec-driven development and why is it gaining ground now?

Spec-driven development is a methodology where structured specification documents — written before any code — serve as the source of truth for AI agents that then generate, validate, and iterate on the implementation. Unlike vibe coding, where prompts generate code with minimal formal constraints, SDD enforces scope boundaries, architectural decisions, and verification criteria from the outset. It gained commercial traction in 2025–2026 as autonomous AI agents became powerful enough that undirected prompting started producing costly failures in production.

Its intellectual lineage runs through formal methods (Hoare, Meyer) and industrial practitioners, but the urgency is new — Thoughtworks placed it in the Assess ring of their 2025 Tech Radar as a genuine maturation phase, not a rebranding. The methodology sits in direct lineage with TDD and BDD: specs govern AI agents the way tests govern interfaces.

For the full diagnostic case — why the shift is happening now and what the documented failure modes look like — see the vibe coding failure-mode analysis.

What is the difference between vibe coding and spec-driven development?

Vibe coding — coined by Andrej Karpathy in February 2025 — describes the practice of using natural language prompts to generate complete application code with minimal structured constraints or review. SDD is the counter-pattern: it front-loads the definition of outcomes, scope boundaries, and verification criteria in a formal specification before any code is generated. The spec acts as a persistent contract the agent must satisfy; vibe coding provides no such contract.

Karpathy was candid about what vibe coding was designed for: “I ‘Accept All’ always, I don’t read the diffs anymore” — and he flagged it explicitly as “not too bad for throwaway weekend projects.” The problems emerged when teams applied the same approach to production systems. Hallucinated APIs, mixed library versions, and unintended side effects followed. The distinction is not about tools — you can run a vibe coding session and an SDD session in the same IDE. It is whether a formalised spec governs the agent’s work.

The full failure-mode breakdown covers why production deployments diverge from prompt intent.

What happened with Amazon’s AI coding tools in April 2026?

In April 2026, a 13-hour production disruption at Amazon was linked — by The Register (29 April) and Aragon Research — to an agentic coding session that misconfigured access controls. Amazon officially classified the event as “user error” and denied direct involvement by its Kiro IDE, but the incident prompted an internal mandate: all AI-generated code must be reviewed by an engineer before it is accepted. Amazon’s official position and the analysts’ framing remain in tension.

Aragon Research’s assessment was direct: “The primary driver behind these incidents was the deployment of agentic AI tools… granted broad permissions that allowed autonomous actions to bypass traditional human-in-the-loop safeguards.” The outcome was Amazon’s mandatory review policy: “Nothing ships without someone looking at it and validating it. Spec-driven development helps reduce how much time that takes” (Steve Tarcza, Amazon Stores).

The full timeline, with sourcing, is in Amazon’s Internal Probe — What AI Coding Outages Reveal About Production Risk.

What is AWS Kiro and how does it implement spec-driven development?

AWS Kiro is Amazon’s agentic IDE — built on Code OSS, the VS Code base — that enforces a three-phase spec workflow before any code is generated: requirements.md (user stories in EARS notation), design.md (architecture decisions), and tasks.md (testable implementation units). Agent Hooks extend the IDE with event-driven automations that fire on file save, handling tasks such as test updates and security scans without manual prompting. Kiro replaced Amazon Q Developer (EOL announced April 30, 2026) as AWS’s primary AI coding product.

EARS (Easy Approach to Requirements Syntax) is the structured format Kiro uses for acceptance criteria — it produces machine-parseable requirements that cover edge cases by default. Steering Files embed compliance standards and architectural non-negotiables as persistent context the agent always references. Kiro does not require an AWS account, is built on VS Code so the environment is familiar, and has GovCloud availability for regulated verticals.
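To make the EARS templates concrete, here is a minimal Python sketch that classifies a requirement sentence by which of the five commonly cited EARS patterns it follows. The regular expressions and the `classify_requirement` name are illustrative assumptions; this is not how Kiro parses requirements.md.

```python
import re
from typing import Optional

# Simplified regexes for the five commonly cited EARS templates.
# Order matters: the keyword-led patterns are tried before the ubiquitous one.
EARS_PATTERNS = {
    "event-driven": re.compile(r"^When .+, the .+ shall .+", re.IGNORECASE),
    "state-driven": re.compile(r"^While .+, the .+ shall .+", re.IGNORECASE),
    "unwanted-behaviour": re.compile(r"^If .+, then the .+ shall .+", re.IGNORECASE),
    "optional-feature": re.compile(r"^Where .+, the .+ shall .+", re.IGNORECASE),
    "ubiquitous": re.compile(r"^The .+ shall .+", re.IGNORECASE),
}


def classify_requirement(sentence: str) -> Optional[str]:
    """Return the name of the first EARS template the sentence matches, else None."""
    text = sentence.strip()
    for name, pattern in EARS_PATTERNS.items():
        if pattern.match(text):
            return name
    return None


if __name__ == "__main__":
    examples = [
        "When the user submits an empty form, the system shall display a validation error.",
        "If the payment gateway times out, then the system shall retry once.",
        "Make login fast",  # not EARS: no actor, no 'shall', nothing testable
    ]
    for sentence in examples:
        print(classify_requirement(sentence), "<-", sentence)
```

A sentence that matches none of the templates is a useful signal in review: it usually lacks a trigger, an actor, or a testable response.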

For the full three-phase workflow, Agent Hooks configuration, and a Kiro-versus-Cursor evaluation, see AWS Kiro — Amazon’s Spec-First Bet on Agentic Development.

How does GitHub SpecKit compare to Kiro?

GitHub SpecKit is an IDE-agnostic, open-source Python CLI framework with 93,000+ GitHub stars (v0.8.7, May 2026) that runs a four-phase workflow: Specify, Plan, Tasks, Implement. Its key differentiator is the “constitution” — a persistent, project-wide principles document that governs every agent session across tools, comparable to an ADR or RFC in function. Kiro mandates its three-phase workflow inside a VS Code environment; SpecKit’s governance layer works with any compatible agent, including Claude Code, Gemini CLI, and GitHub Copilot.

The choice between them often resolves on stack. Microsoft-ecosystem teams — Copilot, Azure DevOps — align naturally with SpecKit. AWS-native teams lean toward Kiro. SpecKit’s IDE-agnostic design is its portability argument; Kiro’s deeper IDE integration enables Agent Hooks that SpecKit cannot replicate natively. Both enforce spec-first discipline; neither is objectively superior — they suit different organisational contexts.

The constitution concept, SpecKit’s four-phase workflow in practice, and a full Kiro comparison are all covered in GitHub SpecKit and the Microsoft Approach to AI Coding Governance.

Which SDD framework is right for my team?

The primary evaluation axis is brownfield versus greenfield. For new projects, AWS Kiro and GitHub SpecKit are the vendor-backed starting points. For existing codebases, OpenSpec’s delta-marker workflow (ADDED/MODIFIED/REMOVED) is designed specifically for change-scoped specs. BMAD-METHOD (46,700+ GitHub stars) suits complex multi-agent orchestration; GSD (61,000+ GitHub stars) is a leaner alternative for Claude Code users who want meta-prompting without ceremony. Cursor Plan Mode is a low-friction entry point for teams not yet ready for a full framework.

Three tiers organise the landscape: vendor-backed (Kiro, SpecKit), community-led (BMAD, GSD, Cursor Plan Mode), and niche-optimised (OpenSpec for brownfield, Tessl for API hallucination prevention). Per-feature cost signals from RanTheBuilder (February 2026): BMAD Full at ~$200, OpenSpec at ~$95, SpecKit at ~$75. Most teams need to assess only the 2–3 frameworks that match their codebase type and existing toolchain.

The three-tier comparison with per-feature cost data and the brownfield/greenfield decision guide are in The 30-Plus Framework Landscape — Navigating Spec-Driven Development Options in 2026.

What is the “A Sufficiently Detailed Spec Is Code” principle?

The principle — articulated most clearly by Prezi engineers and the specdriven.com community — holds that when a specification is detailed enough to constrain an AI agent’s output completely, it is functionally equivalent to code. Code becomes a generated artifact; the spec is the deliverable. This is the spec-as-source end of Martin Fowler’s three-level taxonomy (spec-first, spec-anchored, spec-as-source) and represents the philosophical north star of the SDD movement.

Spec Drift — the divergence between a spec and the actual codebase over time — is the failure mode the principle is designed to prevent. Living specs, which auto-update as agents complete work, are the practical response. See A Sufficiently Detailed Spec Is Code — The Community Principle Behind Spec-Driven Development for the TDD/BDD/MDD lineage and an honest account of where the current frontier sits.

Is spec-driven development just waterfall with a new name?

No — but the concern is legitimate and worth addressing directly. Waterfall front-loads all specification work before any execution begins and treats the spec as a fixed contract. SDD treats the spec as a living document that evolves with the project; implementation begins incrementally from task-level specs, not a completed requirements freeze. The key structural difference: SDD specs govern AI agents continuously, not human developers once at the start of a project.

The “big upfront specification” critique applies to the spec-as-source end of the spectrum — the most radical position. Most practitioners operate at spec-first or spec-anchored, where the process is iterative within a feature or change scope. The TDD parallel is useful: tests drive development iteratively; specs do the same at the architecture and scope level.

A Sufficiently Detailed Spec Is Code addresses the lineage and the antipattern critique in full.

How does spec-driven development satisfy EU AI Act requirements?

The EU AI Act’s full enforcement deadline is August 2, 2026. For organisations deploying high-risk AI systems, a category that includes many AI-assisted development workflows in FinTech, HealthTech, and government, Articles 9–17 (provider obligations), Article 26 (deployer obligations), and Article 50 (AI content disclosure) create documentation and traceability requirements. Spec-driven workflows produce the compliance artifacts these articles require: a structured audit trail, human-oversight records at each review checkpoint, and AI authorship attribution in version control.

The AugmentCode compliance evaluation (May 2026) ranks tools by EU AI Act posture: Intent (Augment Code) and Claude Code at Tier 1; Kiro at Tier 2 (partial); Cursor at Tier 3. ISO/IEC 42001 and SOC 2 Type II certification are the credibility signals to look for when evaluating tools for regulated environments. The April 2026 AWS incident has elevated this from a compliance checkbox to a board-level accountability question.

EU AI Act article citations, the full compliance matrix, and an August 2026 action checklist are in Spec-Driven Development in Regulated Industries — Governance, Compliance, and Audit Trails.

What should you do before deploying AI coding agents to production?

At minimum: establish a human-in-the-loop review policy for all AI-generated code before it is merged or deployed. Amazon’s internal mandate after the April 2026 incident is the operational baseline — “nothing ships without someone looking at it.” Beyond that, introduce a specification layer (even a lightweight CLAUDE.md or project rules file is a starting point) and select a framework matched to your codebase type and team size before scaling agentic workflows.

Human-in-the-loop (HITL) governance is not just best practice — EU AI Act Article 14 mandates it for high-risk AI systems. Start there regardless of which framework you adopt. A full framework adoption — Kiro, SpecKit, BMAD — is the next step once your team has validated the spec-review loop on a contained project.
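As a rough illustration of the review policy described above, the following Python sketch expresses "nothing ships without someone looking at it" as a merge gate. The `ChangeSet` type, its fields, and the `may_merge` rule are hypothetical; a real gate would live in your code host's branch protection or CI configuration.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ChangeSet:
    author: str                # who (or which agent session owner) opened the change
    ai_generated: bool         # flagged when an agent produced any part of the diff
    approvals: Set[str] = field(default_factory=set)  # humans who approved


def may_merge(change: ChangeSet) -> bool:
    """AI-generated changes need at least one approval from a human other than the author."""
    if not change.ai_generated:
        return True  # outside this policy; other review rules may still apply
    independent_reviewers = change.approvals - {change.author}
    return len(independent_reviewers) >= 1


if __name__ == "__main__":
    unreviewed = ChangeSet(author="dana", ai_generated=True)
    self_approved = ChangeSet(author="dana", ai_generated=True, approvals={"dana"})
    reviewed = ChangeSet(author="dana", ai_generated=True, approvals={"lee"})
    print(may_merge(unreviewed), may_merge(self_approved), may_merge(reviewed))
```

Excluding the author from the approval count is a design choice in this sketch: it keeps the person who drove the agent session from being the only human in the loop.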

For tool evaluation, see The 30-Plus Framework Landscape. For governance and compliance specifics, see Spec-Driven Development in Regulated Industries. For the incident that drove Amazon’s policy, see Amazon’s Internal Probe.

Spec-Driven Development — Article Series

Understanding the Shift (Start Here)

From Vibe to Spec — Why AI Coding Is Growing Up: The diagnostic case — vibe coding’s documented failure modes, context decay, and the April 2026 incident that crystallised the shift.
A Sufficiently Detailed Spec Is Code — The Community Principle: The intellectual foundation — TDD/BDD/MDD lineage, context engineering, and the paradigm-shift argument.
Amazon’s Internal Probe — What AI Coding Outages Reveal About Production Risk: The incident in full — timeline, attribution, Amazon’s mandatory review policy, and the accountability question.

Evaluating Tools

AWS Kiro — Amazon’s Spec-First Bet on Agentic Development: Three-phase workflow (requirements.md → design.md → tasks.md), Agent Hooks, Steering Files, and a Kiro-versus-Cursor evaluation.
GitHub SpecKit and the Microsoft Approach to AI Coding Governance: The constitution concept, four-phase workflow, and IDE-agnostic portability versus Kiro’s VS Code environment.
The 30-Plus Framework Landscape — Navigating SDD Options in 2026: BMAD, GSD, OpenSpec, Cursor Plan Mode, Tessl — three-tier comparison with per-feature cost signals.

Governance and Compliance

Spec-Driven Development in Regulated Industries — Governance, Compliance, and Audit Trails: EU AI Act Articles 9–17, 26, and 50; the August 2026 enforcement deadline; compliance matrix; and the CTO liability angle.

FAQ

What is an “agentic IDE”?

An agentic IDE is a development environment where the AI model operates as an autonomous agent — planning, implementing, and iterating across multi-step tasks with minimal mid-task prompting, rather than responding to individual queries. AWS Kiro is the highest-profile current example. For a full evaluation, see Kiro’s three-phase spec workflow and Agent Hooks.

How does SDD differ from TDD (test-driven development)?

TDD uses unit tests to drive interface design at the code level. SDD operates at a higher architectural layer — the spec defines outcomes, scope, and constraints before any tests or code are written. The two are complementary: an SDD workflow typically produces tests as part of the task list, which are then driven by TDD at implementation. A Sufficiently Detailed Spec Is Code traces the full TDD/BDD/MDD lineage.

What is context engineering and how does it relate to SDD?

Context engineering is the discipline of curating which information AI agents receive — providing precise, task-relevant context rather than exposing them to a full repository. Thoughtworks identifies it as the operational complement to SDD: the specification defines what the agent must achieve; context engineering defines what the agent is allowed to see. It is distinct from prompt engineering, which optimises human-to-LLM interaction.
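A toy Python sketch of that curation step follows: given a task description, it picks only the documents that share vocabulary with the task and fit within a token budget. The word-overlap scoring, the four-characters-per-token estimate, and the `select_context` name are all simplifying assumptions, not a description of how any particular tool selects context.

```python
from typing import Dict, List


def select_context(task: str, documents: Dict[str, str], budget_tokens: int) -> List[str]:
    """Return names of task-relevant documents that fit inside the token budget."""
    task_words = set(task.lower().split())

    def relevance(text: str) -> int:
        # Crude relevance score: number of distinct words shared with the task.
        return len(task_words & set(text.lower().split()))

    ranked = sorted(documents.items(), key=lambda item: relevance(item[1]), reverse=True)
    chosen: List[str] = []
    used = 0
    for name, text in ranked:
        if relevance(text) == 0:
            continue  # irrelevant material stays out of the agent's context
        cost = max(1, len(text) // 4)  # rough token estimate (assumption)
        if used + cost <= budget_tokens:
            chosen.append(name)
            used += cost
    return chosen


if __name__ == "__main__":
    docs = {
        "auth.md": "login endpoint validates password and returns token",
        "billing.md": "invoice totals use two decimals",
        "style.md": "use yarn not npm",
    }
    print(select_context("add rate limiting to login endpoint", docs, budget_tokens=50))
```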

Is spec-driven development worth the overhead for small teams?

For teams of fewer than about five engineers on a greenfield project with a contained scope, a lightweight approach (a project constitution file such as CLAUDE.md or equivalent, plus a structured task list) captures most of the benefit without the overhead of a full framework. A framework like BMAD or Kiro pays off when multiple agents run in parallel, when the codebase is large, or when compliance requirements mandate an audit trail.

What are BMAD, OpenSpec, and GSD?

BMAD-METHOD (Build More Architect Dreams) is an open-source multi-agent orchestration framework with 12+ specialised agent roles and 46,700+ GitHub stars. OpenSpec is a proposal-centred workflow designed for brownfield codebases, using delta markers (ADDED/MODIFIED/REMOVED) to scope specs to the change rather than the full system. GSD (Get Shit Done) is a lean, low-ceremony meta-prompting framework built primarily for Claude Code. All three are compared in The 30-Plus Framework Landscape.
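The delta-marker idea can be sketched in Python as a small function that applies ADDED/MODIFIED/REMOVED entries to a baseline set of requirements. The `Delta` type, the requirement identifiers, and `apply_deltas` are hypothetical and do not reflect OpenSpec's actual file format or tooling; the point is only that the change proposal carries the difference, not the whole system spec.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Delta:
    marker: str          # "ADDED", "MODIFIED", or "REMOVED"
    requirement_id: str  # hypothetical identifier, e.g. "AUTH-2"
    text: str = ""       # new wording; ignored for REMOVED


def apply_deltas(baseline: Dict[str, str], deltas: List[Delta]) -> Dict[str, str]:
    """Return a new spec with the change-scoped deltas applied; the baseline is not mutated."""
    spec = dict(baseline)
    for delta in deltas:
        exists = delta.requirement_id in spec
        if delta.marker == "ADDED":
            if exists:
                raise ValueError(f"{delta.requirement_id} already exists")
            spec[delta.requirement_id] = delta.text
        elif delta.marker == "MODIFIED":
            if not exists:
                raise ValueError(f"{delta.requirement_id} not in baseline")
            spec[delta.requirement_id] = delta.text
        elif delta.marker == "REMOVED":
            if not exists:
                raise ValueError(f"{delta.requirement_id} not in baseline")
            del spec[delta.requirement_id]
        else:
            raise ValueError(f"unknown marker {delta.marker!r}")
    return spec


if __name__ == "__main__":
    baseline = {"AUTH-1": "Users log in with email and password."}
    proposal = [
        Delta("MODIFIED", "AUTH-1", "Users log in with email and password or SSO."),
        Delta("ADDED", "AUTH-2", "Sessions expire after 30 minutes of inactivity."),
    ]
    print(apply_deltas(baseline, proposal))
```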

Where can I find the AWS Kiro documentation?

Kiro’s official documentation and download are at kiro.dev. For an independent technical walkthrough of the three-phase spec workflow, Agent Hooks, and Steering Files — without the marketing framing — see the independent Kiro technical review.

Whether your starting point is a lightweight project constitution or a full framework adoption, the spec-review loop is the minimum viable governance layer for any team running AI agents in production.

A Sufficiently Detailed Spec Is Code — The Community Principle Behind Spec-Driven Development

There is a phrase doing the rounds in developer communities that started life in Haskell type theory and has ended up in everyday AI-assisted development conversations: “a sufficiently detailed spec is code.” It came from Gabriella Gonzalez’s March 2026 post on Haskell for All, where she used it as a critique: when a spec is precise enough to determine a unique implementation, it becomes indistinguishable from thinly veiled code. The SDD community heard the same argument and drew the opposite conclusion: if a spec and its implementation are informationally equivalent, and AI agents are the execution layer, then the spec is the primary artefact.

If you know TDD and BDD but haven’t seen the formal argument behind spec-driven development laid out, this is that piece. It covers the intellectual lineage, names the key operational concepts — Context Engineering, Spec Drift, Living Spec — and closes with Malte Ubl’s “free as in puppies” framing that keeps the whole thing honest. For the full SDD picture, start with our guide to what spec-driven development is and what it means for engineering practice.

What does “a sufficiently detailed spec is code” actually mean?

When a specification is precise enough that only one valid implementation exists, the spec and the code are functionally equivalent — the gap between intent and execution closes. Gonzalez’s original argument from Haskell type theory is that in a sufficiently constrained formal system, the correspondence between a spec and a unique implementation is not an aspiration; it is a theorem.

For AI-assisted development, the same equivalence holds — not as a formal proof, but as an observable engineering property. A well-detailed spec eliminates the guesswork that produces unreliable AI-generated code. The agent does not fill gaps with judgement; it fills them with whatever it can infer, which is not the same thing.

Hillel Wayne has a useful counter-argument worth sitting with: “a specification corresponds to a set of possible implementations, and code is a single implementation in that set.” This is not a refutation — it is a precision requirement. It defines what “sufficiently detailed” actually means: unambiguous data structures, explicit error behaviours, defined state transitions, testable acceptance criteria. Working through that list is where the real effort of SDD lives.
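Wayne's point can be shown in a few lines of Python. In this sketch a spec is a predicate over input and output, and an implementation "satisfies" it if the predicate holds on every test case. The sorting example, function names, and test cases are invented for illustration. A loose spec admits a useless implementation; adding one more constraint shrinks the set of valid implementations.

```python
from collections import Counter
from typing import Callable, List


def is_sorted(values: List[int]) -> bool:
    return all(a <= b for a, b in zip(values, values[1:]))


def loose_spec(inp: List[int], out: List[int]) -> bool:
    """'The output is in ascending order.' Says nothing about which elements appear."""
    return is_sorted(out)


def tight_spec(inp: List[int], out: List[int]) -> bool:
    """Adds: 'The output contains exactly the input's elements.'"""
    return is_sorted(out) and Counter(inp) == Counter(out)


def real_sort(values: List[int]) -> List[int]:
    return sorted(values)


def drop_everything(values: List[int]) -> List[int]:
    return []  # trivially sorted, and obviously not what anyone wanted


def satisfies(spec: Callable, impl: Callable, cases: List[List[int]]) -> bool:
    return all(spec(case, impl(list(case))) for case in cases)


CASES = [[3, 1, 2], [5, 5, 1], []]

if __name__ == "__main__":
    for impl in (real_sort, drop_everything):
        print(impl.__name__, satisfies(loose_spec, impl, CASES), satisfies(tight_spec, impl, CASES))
```

Both implementations pass the loose spec; only `real_sort` passes the tight one. "Sufficiently detailed" means continuing to add constraints like this until the implementations left in the set are all acceptable.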

The distinction worth drawing: “spec as documentation” records what was built; “spec as code” determines what must be built. Where it sits in time relative to implementation is what separates the two.

Where did this idea come from? Tracing the lineage from TDD to SDD

Spec-driven development is the fourth step in a lineage that began with TDD in the late 1990s. Each step moved the source of truth further from code and closer to intent.

TDD (Kent Beck, ~2000): write a failing test before writing code. Tests become the specification proxy — they describe what the code should do before the code exists.

BDD (Dan North, mid-2000s): extends TDD by expressing tests in natural language — Gherkin, Given/When/Then — so that specifications are readable by non-technical stakeholders while staying machine-executable.

MDD (OMG/UML era): the specification becomes a formal model — a UML diagram, a DSL. Code is generated from the model. An artefact other than code becomes the source of truth. MDD largely failed because the spec languages were too rigid and the generators could not handle real-world complexity. But it proved the concept.

SDD (2025–2026): the AI agent replaces the code generator; the specification replaces the model. Agents execute directly from the spec, completing the transfer of authority from code to intent.

The Prezi engineering team named this lineage well. Their post on trying spec-driven development describes how working with SDD made them understand “even aspects that made TDD, BDD and why not even MDD interesting and popular at least for a while.” Each methodology moved the source of truth one step further toward intent, and SDD is where that progression terminates. The ArXiv paper on SDD frames it as a “spec spectrum” — from Spec-First through Spec-Anchored to Spec-as-Source, where code is entirely generated from specifications.

What is Context Engineering and how does it operationalise the spec-as-code principle?

Context Engineering is the discipline of designing the full information environment — system prompts, memory, retrieved context, tools — that an AI agent operates within. It is the successor to prompt engineering, which was about crafting single-turn instructions.

Tobi Lutke (Shopify CEO) put it directly: “I really like the term ‘context engineering’ over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.” Short version: prompt engineering is what you do inside the context window; context engineering is how you decide what fills it.

Writing a sufficiently detailed spec is doing context engineering. A vague spec does not produce agent failure — it produces context ambiguity, which produces inconsistent behaviour. Lutke’s framing of AI agents as new-hire contractors is apt: contractors need a full brief to function without constant supervision. The spec is that brief. CLAUDE.md files, Project Rules, and steering files are all lightweight versions of the same principle.

What is Spec Drift and why does it matter for engineering teams?

Spec Drift is the progressive divergence between a specification and the codebase it originally described, caused by undocumented implementation decisions accumulating over time. It is the operational failure mode that makes the spec-as-code principle collapse in practice. The academic literature calls the same thing “specification rot” — same phenomenon, different discourse community.

The mechanism is simple: every implementation decision that deviates from the spec without a corresponding spec update creates a gap. Over time, the spec describes a system that no longer exists, and agents working from it reproduce stale intent. AI-assisted development makes this worse — agents generate code faster than spec updates can follow. The classic failure: three months later, the team has adopted Vitest, two packages have been restructured, one library deprecated entirely, and the CLAUDE.md still says “we use Jest.”

JGCarmona’s practitioner framing captures it well: specifications act as “semantic anchors”. When the anchor drags, the agent drifts with it.

The Living Spec is the response — a specification treated as a live document, updated whenever implementation decisions diverge. Conformance tests are the detection mechanism. Without drift detection, SDD collapses back into documentation-driven development. With it, the system becomes self-policing.
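A minimal drift check for the Jest/Vitest scenario above might look like the following Python sketch. It assumes the team records its declared conventions as a simple mapping (here passed in directly rather than parsed from CLAUDE.md) and compares them with what a package.json manifest actually installs. The `find_drift` function and the convention keys are hypothetical.

```python
import json
from typing import Dict, List


def find_drift(declared: Dict[str, str], package_json_text: str) -> List[str]:
    """List conventions the spec declares that the manifest no longer backs up."""
    manifest = json.loads(package_json_text)
    installed = set(manifest.get("dependencies", {})) | set(manifest.get("devDependencies", {}))
    findings = []
    for convention, package in declared.items():
        if package not in installed:
            findings.append(
                f"spec declares {convention}={package}, but {package} is not installed"
            )
    return findings


if __name__ == "__main__":
    declared_in_spec = {"test_runner": "jest"}  # what the steering file still says
    package_json = '{"devDependencies": {"vitest": "^1.0.0"}}'  # what the repo uses now
    for finding in find_drift(declared_in_spec, package_json):
        print(finding)
```

Run in CI, a check like this turns a stale line in the steering file into a failing build instead of a silent source of stale agent context.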

For teams operating in regulated environments, the stakes of Spec Drift are considerably higher. The article on why regulated industries find spec-as-code compelling as a compliance strategy covers that case in detail.

What does the frontier evidence tell us about what spec-driven development can and cannot do?

The clearest current demonstration of SDD in a production-grade open-source codebase is whenwords, built by Drew Breunig. He distributed a library without implementation code — just a markdown specification, approximately 750 conformance tests in YAML format, and an installation guide for agent integration. The library attracted over 1,000 GitHub stars. Community members submitted pull requests identifying inconsistencies between specifications and tests, showing that collaborative development was viable without traditional code.

That is the SDD Triangle in practice: Spec drives Implementation; Conformance Tests verify against the Spec; failed tests flag divergence; Spec or Implementation is updated accordingly; the loop runs continuously. Vercel, Anthropic, and Pydantic operate as Spec-Anchored development organisations, maintaining specs as living documents throughout feature lifecycles, an approach the ArXiv paper on SDD identifies as “the sweet spot for most production systems.” The frameworks that implement the spec-as-code principle are mapped out in the framework landscape article.
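The verification corner of the triangle can be sketched as a data-driven test runner in Python. The cases below are plain dictionaries standing in for entries loaded from a YAML file; the case format, the `humanize_seconds` function, and the expected strings are invented for illustration and are not taken from whenwords.

```python
from typing import Callable, Dict, List

# Stand-in for conformance cases that would normally be loaded from a spec-owned data file.
CASES: List[Dict] = [
    {"name": "seconds", "input": 45, "expected": "45 seconds"},
    {"name": "one minute", "input": 60, "expected": "1 minute"},
    {"name": "hours", "input": 7200, "expected": "2 hours"},
]


def humanize_seconds(seconds: int) -> str:
    """Hypothetical implementation under test (agent-generated in an SDD workflow)."""
    for unit, size in (("hour", 3600), ("minute", 60), ("second", 1)):
        if seconds >= size:
            count = seconds // size
            return f"{count} {unit}" + ("" if count == 1 else "s")
    return "0 seconds"


def run_conformance(impl: Callable[[int], str], cases: List[Dict]) -> List[str]:
    """Return the names of cases where the implementation diverges from the spec."""
    return [case["name"] for case in cases if impl(case["input"]) != case["expected"]]


if __name__ == "__main__":
    print(run_conformance(humanize_seconds, CASES))         # conforming implementation
    print(run_conformance(lambda s: f"{s} seconds", CASES))  # naive one that ignores units
```

Because the cases are data rather than code, the same suite can check any implementation, whatever language an agent generates it in. A non-empty result means either the implementation or the spec has to change.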

The honest limit: Malte Ubl (Vercel CTO) framed it as “Software is free now. (Free as in puppies).” You can ship something fast, but now you have to take care of it. The spec must be kept current or the cost compounds. SDD is not a free lunch.

Hillel Wayne’s counter resurfaces here too: the equivalence holds only for well-bounded problems. For ill-bounded problems — highly ambiguous domains, tasks requiring significant judgement calls, greenfield architecture without precedent — the valid implementation set stays large. And even if code generation from specifications became fully automated, humans would still need to write the specifications.

How does this principle compare to RFCs, ADRs, and existing architecture documentation practices?

RFCs and ADRs share the spec-first intent — document decisions before or alongside implementation — but they were designed for human readers who will then write code. SDD specs are designed for agents that will execute code directly. That changes the required level of precision substantially.

What RFCs and ADRs do well: record architectural intent, socialise decisions, provide rationale for future maintainers. SDD inherits all of those goals. The difference is tolerance for ambiguity. RFCs tolerate it because human engineers fill gaps with judgement. SDD specs cannot — agents generate from what is present.

The InnoBlog analysis is practical: “Write ADRs. Even two or three short decision records give Claude significantly better context than none. Start with the decisions that would be most expensive to violate.” That is good advice regardless of whether you call it SDD or not.

The gap between a well-written ADR and a good SDD spec is in completeness and testability, not format or philosophy. The SDD equivalent of an ADR is a constitution.md or spec.md — stored in the repo the same way ADRs live in docs/decisions/, with the same immutability principle: supersede rather than overwrite. Teams with existing RFC/ADR cultures have a shorter distance to travel. The discipline is already present — the upgrade is adding testability and completeness.
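The "supersede rather than overwrite" principle can be sketched in Python with immutable records. The `Decision` type and `supersede` helper are hypothetical; in a real repository the same effect comes from adding a new numbered file and changing only the status line of the old one.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Decision:
    number: int
    title: str
    status: str = "accepted"
    superseded_by: Optional[int] = None


def supersede(log: List[Decision], old_number: int, new_title: str) -> List[Decision]:
    """Append a new decision and mark the old one superseded; its title is never rewritten."""
    if not any(d.number == old_number for d in log):
        raise ValueError(f"decision {old_number} not found")
    new_number = max(d.number for d in log) + 1
    updated = [
        replace(d, status="superseded", superseded_by=new_number)
        if d.number == old_number
        else d
        for d in log
    ]
    return updated + [Decision(new_number, new_title)]


if __name__ == "__main__":
    log = [Decision(1, "Use Jest for unit tests")]
    for decision in supersede(log, 1, "Use Vitest for unit tests"):
        print(decision)
```

Keeping the superseded record in place preserves the rationale trail for future maintainers, while an agent reading the log can skip any decision whose status is no longer "accepted".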

Where does a team start if they want to apply this principle without adopting a full framework?

The lowest-friction entry point is a CLAUDE.md or equivalent steering file — a structured markdown document at the repository root that describes the codebase’s architecture, conventions, and constraints persistently. Many developers are already practising a form of SDD with Claude Code without formally labelling it as such.

Martin Fowler’s framing of what belongs in a CLAUDE.md is practical: “we use yarn, not npm”; “don’t forget to activate the virtual environment before running anything”; “when we refactor, we don’t care about backwards compatibility.” These are the conventions that would otherwise require repeated explanation.

The three-step progression: write a CLAUDE.md that describes what currently exists; extend it with what must remain true — the invariants the codebase should never violate; add conformance tests that verify those invariants. At step three, the team has adopted SDD without necessarily calling it that.

Start with data structures and API contracts — the most constrained parts of any system — before behaviour and error handling. Simple rule: if you explained it twice, write it down. Treat a spec-code divergence as a bug rather than documentation debt, and commit spec changes alongside code changes. When coordination overhead across multiple agents and subsystems becomes a bottleneck, that is when a full SDD framework earns its place. Until then, a well-maintained CLAUDE.md and growing conformance test suite will take you further than most teams expect. For the broader SDD landscape, the full spec-driven development overview maps where this fits.
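One way to nudge a team toward committing spec changes alongside code changes is a simple check over the list of paths touched by a commit. The Python sketch below is a deliberately blunt heuristic under stated assumptions: the spec file names and code suffixes are placeholders, and a real hook would obtain the changed paths from version control and would probably warn rather than block.

```python
from pathlib import PurePosixPath
from typing import Iterable

SPEC_FILES = {"CLAUDE.md", "spec.md", "constitution.md"}  # assumed spec file names
CODE_SUFFIXES = {".py", ".ts", ".js", ".go"}              # assumed source file types


def spec_updated_with_code(changed_paths: Iterable[str]) -> bool:
    """Return False only when code changed and no spec file changed in the same commit."""
    paths = [PurePosixPath(p) for p in changed_paths]
    code_changed = any(p.suffix in CODE_SUFFIXES for p in paths)
    spec_changed = any(p.name in SPEC_FILES for p in paths)
    return spec_changed or not code_changed


if __name__ == "__main__":
    print(spec_updated_with_code(["src/app.ts"]))               # code only
    print(spec_updated_with_code(["src/app.ts", "CLAUDE.md"]))  # code plus spec
    print(spec_updated_with_code(["README.md"]))                # neither
```

Not every code change alters a convention or invariant, so the result is best treated as a prompt for the reviewer ("did this change what the spec says?") rather than a hard rule.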

Frequently Asked Questions

What is the difference between a spec and a requirement in software development?

Requirements describe what a system must do at a business or user level. A spec in the SDD sense describes how the system must behave at a technical level, with enough precision that an AI agent can implement it without ambiguity. The distinction is precision and testability, not vocabulary.

What is “specification rot” and is it the same as spec drift?

Specification rot is the academic term — appearing in ArXiv 2602.00180 — for the same phenomenon practitioners call Spec Drift: the progressive divergence between a specification and the implemented codebase. Same phenomenon, different discourse communities. Spec Drift is the preferred practitioner-facing term.

Can you use spec-driven development with any AI coding tool, or does it require specific frameworks?

The principle applies to any AI coding agent. Claude Code, for example, “handles large specification documents well within a single session, processing complete requirement sets and generating implementations in one coherent pass” without any dedicated SDD framework. A well-written CLAUDE.md or Project Rules file is sufficient to begin.

Is context engineering just a rebranding of prompt engineering?

No. Anthropic draws a clear line: “Prompt engineering refers to methods for writing and organizing LLM instructions for optimal outcomes… Context engineering refers to the set of strategies for curating and maintaining the optimal set of tokens during LLM inference.” Prompt engineering is a single-session activity. Context engineering is permanent infrastructure, evolving over time. The scope difference is fundamental, not cosmetic.

What are conformance tests and how are they different from unit tests?

Conformance tests verify that the codebase matches the spec: they test the relationship between artefacts (spec and code) rather than the internal correctness of isolated units. Drew Breunig’s whenwords project uses approximately 750 conformance tests in YAML format as the primary SDD verification mechanism. They are functionally equivalent to BDD acceptance tests, but spec-specific in framing.

Does spec-driven development work for greenfield projects or only for existing codebases?

SDD is well-suited to greenfield projects where the spec can be written before any implementation exists. ArXiv identifies Spec-Anchored Development as “the sweet spot for most production systems,” whether greenfield or existing. For existing codebases, spec-writing requires reverse-engineering current behaviour — a useful exercise, but an expensive one.

Why do Prezi engineers say SDD is the “culmination” of TDD, BDD, and MDD?

Because each methodology moved the source of truth one step further from code toward intent: TDD made tests the spec proxy; BDD made tests human-readable; MDD made a formal model the source of truth; SDD makes the spec itself the execution directive. The Prezi engineering post describes this as the insight that emerges once you actually try SDD — you begin to understand what each prior methodology was attempting.

What is the “free as in puppies” metaphor and why does it matter for SDD?

Malte Ubl (Vercel CTO) used the phrase to describe AI-generated code: like a free puppy, it has a low acquisition cost but a high ongoing maintenance burden. For SDD, the spec work that makes code generation reliable creates a maintenance obligation — the spec must be kept current or the cost compounds. It is a concise framing of why SDD is not a free lunch.

How does spec-driven development relate to formal methods in computer science?

Formal methods — Z notation, TLA+, Alloy — share the spec-as-code intuition. In a fully formal system, a verified spec is the authoritative artefact. SDD is a pragmatic, lightweight version of the same idea applied to AI agent execution rather than formal verification. The specdriven.com timeline documents the lineage: formal methods “proved something important: specifications could be mathematically verified.” Gabriella Gonzalez’s original argument draws on this tradition from Haskell type theory.

How does Hillel Wayne’s counter-argument affect the case for SDD?

Hillel Wayne argues that a specification corresponds to a set of possible implementations, not a single one, which means a spec is only equivalent to code when it is complete enough to eliminate all implementation ambiguity. This is not a refutation — it is a precision requirement. It defines the threshold of “sufficiently detailed” and identifies the class of problems (well-bounded, high-constraint) where SDD works best.

What is the SDD Triangle and how does it create a self-correcting system?

The SDD Triangle (dbreunig) describes an iterative loop: the spec drives implementation; conformance tests verify the implementation against the spec; failed tests identify where spec and implementation have diverged; the spec or the implementation is updated accordingly; and the loop runs continuously. The whenwords project makes this concrete: it is a production-grade demonstration of the triangle as a self-correcting feedback system rather than a linear process.

Can spec-driven development co-exist with agile and iterative development?

Yes. Marc Brooker’s framing: “Spec Driven Development isn’t Waterfall. In specification driven development, the specification is the thing being iterated on, rather than the implementation.” The spec evolves alongside the codebase. What changes is the ordering discipline: spec changes precede or accompany implementation changes rather than following them. Microsoft’s developer blog documents SDD running alongside agile sprints using GitHub issues as locked specifications during delivery.

Spec-Driven Development in Regulated Industries — Governance, Compliance, and Audit Trails

AI coding tools are now table stakes in most engineering organisations — including those in finance, healthcare, and government. Everyone’s using them. The governance question — how you use them responsibly in a regulated context — is the one that still hasn’t been answered cleanly. The EU AI Act’s high-risk enforcement obligations begin August 2, 2026, and most engineering teams are not ready.

The gap between vibe coding and structured spec-driven development (SDD) is about to become a regulatory liability. SDD produces the compliance artefacts that EU AI Act Articles 11 and 12 require — not as additional overhead, but as a direct by-product of how the workflow operates. That’s the practical case for it in regulated industries.

What follows covers what the EU AI Act actually obligates your organisation to do, how SDD produces the required artefacts, how the leading AI coding tools compare on compliance, and a concrete pre-August 2026 action checklist. Spec-driven development and what it means for engineering leadership has the broader methodology context if you need it.

Why do regulated industries need more than good intentions from AI coding tools?

Finance, healthcare, and government engineering teams already operate under documentation and accountability obligations that have nothing to do with AI. Every production change must be traceable, reviewable, and attributable. AI coding tools don’t change that requirement. They make it harder to satisfy unless you have a deliberate process in place.

The compliance risk has a name: AI Dark Code, meaning code with no upstream specification, no attribution, and no documented human review step. In a regulated environment, AI Dark Code is a compliance breach.

Only 18% of surveyed organisations had approved tools for vibe coding — that informal, prompt-driven style of AI-assisted development that can’t produce the traceability artefacts regulators require. And 81% of enterprise technology leaders report production failures tied to AI-generated code. Governance frameworks remain weak across the board.

Under the EU AI Act, documentation of governance practices is the required evidence for compliance — not just awareness of the obligation. SDD converts an informal AI-assisted process into a documented, auditable one. Which is exactly what regulators will ask for.

What does the EU AI Act actually require from AI coding tool deployers?

The EU AI Act classifies AI systems by risk tier. For most engineering teams, the first question is whether your AI coding tool use triggers Annex III high-risk classification. There are three real triggers: using AI to evaluate developer productivity or rank engineers; agentic tools that autonomously deploy to financial or healthcare systems; and building software that itself qualifies as high-risk under the Act.

Article 26 is the provision that directly addresses your organisation as a deployer. You must implement human oversight, maintain usage records, ensure staff training, and assign a named person responsible for overseeing the AI system’s operation. Log retention is a minimum of six months per interaction. Article 50 requires that systems generating code or text make that generation transparent to users in the review chain.

Full enforcement begins August 2, 2026. ISO/IEC 42001 — the international AI management system standard, think ISO 27001 but for AI governance — is referenced by the EU AI Act as a recognised conformity pathway. Certification isn’t required, but it shifts the conversation in a regulatory investigation from “did you have processes?” to “were your processes sufficient?” The November 2025 Digital Omnibus proposal would delay some obligations to December 2, 2027 — but it hasn’t been enacted. Plan to the August 2026 deadline and treat any extension as a bonus.

How does spec-driven development produce the compliance artefacts regulators require?

Article 11 requires full technical documentation for high-risk AI systems, drawn up before deployment. In a spec-first workflow, the specification exists before code generation begins. It is the Article 11 pre-market artefact — not something you have to produce retrospectively.

Article 12 requires automatic logging of AI system decisions and outputs, integrated into the core design rather than bolted on afterward. In a spec-driven pipeline, every agent action is traceable back to the specification that authorised it, producing a decision trail that satisfies Article 12 without bespoke tooling.

Article 25 adds a further obligation for multi-agent pipelines: when multiple AI systems co-author code, each contribution must be attributed separately. SDD’s explicit specification layer enables per-agent attribution because each agent’s actions are scoped to specific parts of the specification.

The audit trail a regulator will examine has four components: the upstream specification; the AI system and version that acted on it; the human authorisation step; and the version control record linking all three. SDD produces all four. Vibe coding produces only the fourth — a commit exists, but there’s no upstream specification, no AI system attribution, no documented review.
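The four components can be treated as a completeness check on each production change. The sketch below is one hedged way to express that; the `ChangeRecord` type, its field names, and the example values are assumptions made for illustration, not a schema defined by the EU AI Act or by any tool mentioned here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeRecord:
    """One production change and the evidence attached to it (hypothetical)."""
    spec_ref: Optional[str]     # upstream specification, e.g. path plus revision
    ai_system: Optional[str]    # AI system and version that acted on the spec
    approved_by: Optional[str]  # human who authorised the change
    commit_sha: Optional[str]   # version control record linking the other three


# Field name -> the wording used for each audit trail component above.
COMPONENTS = {
    "spec_ref": "upstream specification",
    "ai_system": "AI system and version",
    "approved_by": "human authorisation step",
    "commit_sha": "version control record",
}


def missing_components(record: ChangeRecord) -> list[str]:
    """List the audit trail components this change cannot evidence."""
    return [label for name, label in COMPONENTS.items() if not getattr(record, name)]


# Illustrative records: a spec-driven change and a prompt-only change.
spec_driven = ChangeRecord("specs/limits.md@v3", "example-agent 1.2", "j.doe", "a1b2c3d")
vibe_coded = ChangeRecord(None, None, None, "e4f5a6b")
```

Here `missing_components(spec_driven)` returns an empty list, while `missing_components(vibe_coded)` returns the first three labels: the commit exists but nothing upstream of it does.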

AI authorship attribution makes component two auditable. The most widely adopted implementation is Co-Authored-By git attribution: a commit metadata tag that explicitly records an AI system as a co-author. Claude Code applies this natively, making the attribution visible in version control without additional tooling — and it lives in git history permanently, outside any vendor’s retention window.

AugmentCode’s living specs take this further: rather than a static requirements document, the specification is continuously updated as the codebase evolves, creating a persistent compliance record auditable at any point in the system’s lifecycle.

How do the leading AI coding tools compare on EU AI Act compliance?

No tool delivers full compliance out of the box. The choice comes down to which gaps your organisation is best equipped to fill. Here’s how the leading tools compare across the compliance dimensions that matter for regulated-industry procurement.

Intent by AugmentCode sits at the top. The platform holds ISO/IEC 42001 certification — the first AI coding assistant to receive it — and SOC 2 Type II. Intent’s coordinator-implementor-verifier workflow creates three audit boundaries, with compliance records as structural by-products rather than add-ons. Still in public beta at time of writing.

Claude Code (Anthropic) is second. Native Co-Authored-By git attribution directly addresses Article 26 traceability obligations. Enterprise admins can push managed configurations, the permission system defaults to strict read-only, and Anthropic has confirmed it will sign the EU GPAI Code of Practice. No ISO 42001 certification, but strong auditability overall.

OpenAI Codex is third. Its Compliance Logs Platform provides immutable JSONL audit events — solid Article 12 infrastructure. The gap: 30-day default log retention means meeting the six-month Article 26(6) requirement needs a continuous export pipeline into your own archive.
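To make the retention gap concrete, here is a minimal sketch of the archive side of such an export pipeline. It assumes each exported JSONL line carries a unique `id` and an ISO-format `date`; both field names are hypothetical and should be replaced with whatever the real export contains. It also approximates six months as 183 days, which is an assumption rather than a figure from the Act. How the export itself is fetched is deliberately left out.

```python
import json
from datetime import date, timedelta

REQUIRED_DAYS = 183  # assumption: "six months" approximated as 183 days


def merge_export(archive: dict, exported_lines: list) -> int:
    """Add events from a JSONL export to the archive, skipping duplicates.

    Returns the number of events that were new to the archive.
    """
    added = 0
    for line in exported_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        event = json.loads(line)
        event_id = event["id"]  # hypothetical unique event identifier
        if event_id not in archive:
            archive[event_id] = event
            added += 1
    return added


def purgeable(archive: dict, today: date) -> list:
    """Return ids of archived events older than the required retention window."""
    cutoff = today - timedelta(days=REQUIRED_DAYS)
    return sorted(
        event_id
        for event_id, event in archive.items()
        if date.fromisoformat(event["date"]) < cutoff
    )
```

Running `merge_export` on overlapping exports is safe because duplicates are skipped, so the job can run more often than the vendor's 30-day window without inflating the archive. Nothing returned by `purgeable` should be deleted without checking whether other retention rules apply.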

Kiro (AWS) and Kiro GovCloud are spec-first by design — requirements.md, design.md, and tasks.md generate Article 11 documentation as a by-product of normal development. The gap is certification: there’s no explicit EU AI Act positioning in official materials.

Cursor, Devin (Cognition), and Google Antigravity are the lowest tier. Capable tools, but none produces a persistent, compliance-grade audit record without significant wrapper infrastructure.

The AugmentCode compliance evaluation frames this as a decision tool, not a ranking. Building high-risk software? Intent. Priority is exportable audit logs today? Codex. Durable AI authorship attribution? Claude Code. For a broader look at where each tool sits in the full spec-driven development movement, the pillar article maps the landscape across all seven dimensions.

What is AI authorship attribution and why does it matter legally?

AI authorship attribution is the practice of recording, in a traceable way, which AI system generated which portion of a code artefact. In a compliance audit, every line of AI-generated code needs to be traceable back to its originating system, version, and authorising specification.

Without it, you can’t satisfy Article 26 deployer obligations requiring documented human oversight. That gap can constitute a compliance breach independent of whether the underlying code caused harm.

Co-Authored-By git attribution is the practical mechanism: a commit metadata tag records the AI system as a co-author, visible in the git log and on GitHub. Claude Code applies this natively. 57.5% of developers in one study claimed sole authorship when implementing reviewed AI suggestions — exactly the kind of ambiguity this resolves.

The December 2025 AWS outage — and the April 2026 reporting that followed it — frames this liability scenario concretely. The full analysis is in Amazon’s documented production incident and what it means for engineering.

What does AWS Kiro GovCloud offer for government and regulated cloud environments?

Kiro GovCloud is an AWS-hosted, network-isolated variant of Kiro designed for government and regulated-industry workloads. It runs within AWS GovCloud (US-West and US-East) infrastructure, which supports data residency, IAM Identity Center enforcement, and private connectivity requirements.

The compliance-relevant features: data collection opt-out by default; enterprise authentication only via AWS IAM Identity Center; private connectivity over VPN or Direct Connect; and CMEK (Customer-Managed Encryption Keys), giving you control of your own encryption keys independently of the vendor.

Kiro’s spec-first workflow generates Article 11 documentation as a by-product of standard development. Kiro’s Agent Hooks trigger automated compliance and security checks at specific workflow points — catching issues during development rather than post-deployment.

The outstanding gap: FedRAMP High and DoD CC SRG authorisation are pending, not certified. US federal agencies should treat Kiro as a tool in assessment until that status is confirmed. For commercial regulated industries under EU AI Act obligations rather than FedRAMP, Kiro GovCloud’s data residency features may satisfy EU data governance requirements even without FedRAMP status.

What does post-outage executive accountability look like in a regulated sector?

When an AI-related incident occurs in a regulated industry, the investigation doesn’t stop at the system level. Regulators, boards, and legal counsel will ask whether the engineering organisation had adequate governance documentation — and the answer to that question sits with the CTO.

The December 2025 AWS outage — where Kiro was cited as “possibly involved” in initial reporting, a characterisation AWS denied — is the clearest recent example of what this looks like in practice. The full liability analysis is in Amazon’s documented production incident and what it means for engineering.

The accountability structure has three layers: technical (can you produce an audit trail tracing the failure to its specification origin?), governance (do you have documented AI coding oversight processes satisfying Article 26?), and personal (did engineering leadership execute its duty of care?).

The Air Canada chatbot ruling established the precedent: “You own the system. The system spoke on your behalf. You are liable for what it said.” Air Canada’s defence that the chatbot was a separate legal entity was rejected. That logic applies to any organisation with a deployed AI system.

ISO/IEC 42001 certification is the organisational defence — a certified AI management system shifts the regulatory conversation from “did you have processes?” to “were your processes sufficient?” AugmentCode holds both ISO/IEC 42001 and SOC 2 Type II, which is why it sits at the top of the compliance matrix. The combination of vendor-side SOC 2 Type II and your Article 26 oversight obligations substantially contains liability exposure compared to a zero-documentation scenario.

The worst-case scenario: no specification, no audit trail, no authorship attribution. In a regulated sector, that exposes the organisation to enforcement action and leadership to personal accountability that no indemnity clause resolves.

What should engineering leaders do before August 2026?

August 2, 2026 is a hard date. Planning to the statutory deadline and treating any Digital Omnibus extension as schedule relief — not a planning basis — is the lower-risk approach.

Step 1 — Classification. Run the Article 6 / Annex III high-risk classification test for every AI coding tool in use. This is the threshold question: tools not classified as high-risk have lower documentation obligations. Document the classification decision with supporting evidence regardless of the outcome — the documentation itself is a compliance artefact.

Step 2 — Tool audit. Assess each tool against the compliance dimensions: SOC 2 Type II, ISO 42001 status, audit trail mechanism, AI authorship attribution, data residency, and spec-first workflow support. Tools that can’t satisfy Article 26 deployer obligations need to be restricted to non-production use or replaced.

Step 3 — Spec-first workflow adoption. Mandate structured specifications as the upstream input to all AI coding tool interactions in production. This single process change generates the Article 11 technical documentation and Article 12 audit trail simultaneously. Adopting the governance framing that makes SDD necessary is the prerequisite.

Step 4 — Authorship attribution. Ensure AI authorship attribution is in place for all AI coding tool interactions in production — verify that every AI-generated commit carries an attribution record in version control. This is the minimum viable mechanism for Article 26 compliance.

Step 5 — ISO/IEC 42001 assessment. Evaluate whether your organisation’s AI governance maturity warrants beginning an ISO 42001 programme. The gap analysis phase — identifying which controls exist and which are missing — is achievable before August 2026 even if full certification isn’t.

Step 6 — Board-level documentation. Brief the board on your EU AI Act compliance posture. Aligning legal, compliance, product, and engineering teams around a common understanding of AI regulatory exposure is the prerequisite. Executive accountability is partly discharged by demonstrating the risk has been assessed, documented, and escalated to governance level.

Frequently asked questions

What is the EU AI Act enforcement deadline and what happens if we miss it?

August 2, 2026 is when high-risk AI compliance (Articles 8–15) and Article 50 transparency obligations become enforceable. Penalties for breaching these obligations reach €15 million or 3% of global annual turnover, whichever is higher. The Digital Omnibus proposal would delay this to December 2, 2027 for standalone Annex III systems, but it hasn’t been enacted.
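A quick worked calculation shows how the two penalty figures interact. The sketch assumes the ceiling is the higher of the fixed amount and the turnover percentage, which is how the Act's penalty article is generally read for larger undertakings; smaller companies are treated differently, and none of this is legal advice. The function name and the turnover values are illustrative only.

```python
FIXED_CAP_EUR = 15_000_000   # fixed amount quoted above
TURNOVER_RATE_PERCENT = 3    # percentage of global annual turnover quoted above


def max_fine_eur(global_turnover_eur: int) -> int:
    """Upper bound of the fine, assuming the higher of the two figures applies."""
    turnover_based = global_turnover_eur * TURNOVER_RATE_PERCENT // 100
    return max(FIXED_CAP_EUR, turnover_based)


# Illustrative turnovers: EUR 200 million and EUR 2 billion.
small = max_fine_eur(200_000_000)    # 3% is EUR 6 million, so the fixed cap applies
large = max_fine_eur(2_000_000_000)  # 3% is EUR 60 million, so the percentage applies
```

Under this reading the fixed amount dominates up to €500 million in turnover, and the percentage takes over above that.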

Does the EU AI Act apply to our AI coding tools if we are not an EU company?

Yes. The EU AI Act has extraterritorial scope similar to GDPR: it applies to any organisation placing AI systems on the EU market or whose AI outputs are used within the EU. Where your customers are located matters as much as where you are headquartered.

Is spec-driven development the same as writing traditional requirements documents?

Related, but distinct. Traditional requirements documents are static and typically disconnected from the code generation process. In an SDD context, the specification is the live upstream input to the AI agent’s actions. AugmentCode’s living specs extend this further: the specification is continuously updated as the codebase evolves. Kiro’s three interconnected spec files (requirements, design, tasks) apply the same idea with different tooling.

What is ISO/IEC 42001 and do we need to be certified to comply with the EU AI Act?

ISO/IEC 42001 is the international standard for AI management systems, analogous to ISO 27001 for information security. The EU AI Act doesn’t require certification, but it is a recognised conformity pathway and puts you in a stronger position in a regulatory investigation. The assessment phase (gap analysis) is achievable before August 2026 even if full certification isn’t.

How does Co-Authored-By git attribution work in practice?

The commit message includes a “Co-Authored-By: [AI system name]” tag, visible in the git log and on GitHub. Claude Code applies this natively; other tools may require configuration. Every commit containing AI-generated code carries an attribution record in version control — outside any vendor’s retention window, making it the most durable AI authorship signal currently available.
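A check for this tag can be sketched in a few lines. The example below parses a commit message string for co-author trailers and tests whether any of them names a known AI system. The function names, the list of AI names, and the example address are assumptions for illustration; in practice the message text would come from your version control system, and the exact trailer a given tool writes should be confirmed against its own output.

```python
import re

# Matches lines such as "Co-Authored-By: Name <address>", case-insensitively.
TRAILER = re.compile(
    r"^co-authored-by:[ \t]*(?P<name>[^<\n]+?)[ \t]*<(?P<email>[^>\n]+)>[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def co_authors(commit_message: str) -> list:
    """Return (name, email) pairs for every co-author trailer in the message."""
    return [(m.group("name"), m.group("email")) for m in TRAILER.finditer(commit_message)]


def has_ai_attribution(commit_message: str, ai_names: set) -> bool:
    """True if at least one co-author trailer names a known AI system."""
    known = {name.lower() for name in ai_names}
    return any(name.lower() in known for name, _ in co_authors(commit_message))


# Illustrative commit message; the address is a placeholder.
EXAMPLE_MESSAGE = (
    "Add rate limiter per specs/limits.md\n"
    "\n"
    "Implements section 3 of the spec.\n"
    "\n"
    "Co-Authored-By: Claude <noreply@example.com>\n"
)
```

A check like `has_ai_attribution(message, {"Claude"})` could run in a pre-merge hook, but it only proves the tag is present. It cannot detect AI-generated code that was committed without one.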

What is Kiro GovCloud and is it FedRAMP authorised?

Kiro GovCloud is an AWS-hosted, network-isolated variant of Kiro available in AWS GovCloud (US-West and US-East), supporting IAM Identity Center, private connectivity, data collection opt-out, and CMEK. FedRAMP High and DoD CC SRG authorisation are pending and not yet certified. US federal agencies should not deploy Kiro as an authorised platform until that status is confirmed.

What does “AI Dark Code” mean and why is it a compliance risk?

AI Dark Code is AI-generated code that entered production without adequate review, audit trail, or architectural oversight — code that can’t be traced back to an authorising specification or a documented human review step. In a regulated environment, it’s a compliance breach under EU AI Act Articles 11, 12, and 26. It’s the specific failure mode that spec-driven development is designed to prevent.

How do we classify our AI coding tool use under the EU AI Act high-risk test?

The Article 6 / Annex III test asks whether the AI system’s intended purpose falls within one of eight Annex III domains (for example financial services, healthcare, or critical infrastructure) or serves as a safety component in a regulated product. There are three real classification triggers: using AI to evaluate or rank engineers; agentic tools autonomously deploying to critical infrastructure; and building software that itself qualifies as high-risk. Even if none applies, documenting the classification decision is required.

Can we use multiple AI coding tools in the same pipeline and still maintain a compliant audit trail?

Yes, but Article 25 applies: each AI system’s contribution must be separately attributed. The minimum viable multi-agent log schema covers invoking user, governing specification version, model identifier, input context, output artefact, human reviewer, and disposition. Compliance at the GPAI provider level does not discharge your organisation’s deployer obligations.
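The seven fields listed above map naturally onto one log line per agent contribution. The sketch below is a possible shape, not a prescribed one: the class name, the field names, and the three allowed disposition values are assumptions chosen for the example.

```python
import json
from dataclasses import asdict, dataclass

# Assumed vocabulary for the reviewer's decision; adjust to local policy.
ALLOWED_DISPOSITIONS = {"accepted", "rejected", "amended"}


@dataclass(frozen=True)
class AgentLogEntry:
    """One AI system's contribution to a change (hypothetical schema)."""
    invoking_user: str
    spec_version: str      # governing specification version
    model_id: str          # model identifier, one entry per AI system
    input_context: str     # reference to the context the model was given
    output_artefact: str   # reference to what the model produced
    human_reviewer: str
    disposition: str


def to_jsonl(entry: AgentLogEntry) -> str:
    """Serialise an entry as one JSON line, rejecting unknown dispositions."""
    if entry.disposition not in ALLOWED_DISPOSITIONS:
        raise ValueError(f"unknown disposition: {entry.disposition!r}")
    return json.dumps(asdict(entry), sort_keys=True)


# Illustrative entry with placeholder values.
example_entry = AgentLogEntry(
    invoking_user="j.doe",
    spec_version="specs/limits.md@v3",
    model_id="example-model-1",
    input_context="ctx/2026-05-01-001",
    output_artefact="commit:a1b2c3d",
    human_reviewer="a.smith",
    disposition="accepted",
)
```

Writing one entry per model, rather than one per change, is what keeps each system's contribution separately attributable when several agents touch the same commit.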

What is the difference between SOC 2 Type II and ISO/IEC 42001 as vendor evaluation criteria?

SOC 2 Type II audits security controls over a defined period — it says nothing specific about AI governance. ISO/IEC 42001 is AI-specific: it covers AI data handling, risk management, and security throughout AI pipeline operations. For regulated-industry procurement, both are relevant: SOC 2 Type II is the security baseline; ISO 42001 is the AI governance signal. AugmentCode holds both.

Amazon’s Internal Probe — What AI Coding Outages Reveal About Production Risk

In December 2025, AWS Cost Explorer went down for thirteen hours in one of Amazon’s two Mainland China regions. The Register reported it. The Financial Times blamed Amazon’s agentic coding tool, Kiro. Amazon pushed back: “user error — specifically misconfigured access controls — not AI.” What followed was mandatory review policies, a public AWS London Summit interview, and a disagreement about what the incident actually proved. This article walks through what was reported, what Amazon did about it, and what it means for anyone running agentic AI tools in production. The spec-driven development pillar covers the broader context. The thesis on why AI coding is maturing is the entry point. This is the incident evidence.

💡 Agentic AI refers to AI tools that autonomously chain multiple steps — analysing a problem, selecting a solution, executing changes — with minimal human approval between steps. This differs from traditional AI-assisted coding, where a human reviews and accepts each suggestion individually.

What was reported about the December 2025 AWS outage and when?

The timeline matters because two distinct events are easy to conflate.

The incident happened in December 2025. Amazon confirmed the scope was narrow — AWS Cost Explorer in one Mainland China region — and said it “did not receive any customer inquiries regarding the interruption.” No compute, storage, database, or AI services were affected.

The Financial Times published the original attribution in February 2026, citing four sources who said Kiro made changes that caused the outage. The Register picked up the FT story. Amazon responded on aboutamazon.com: “The brief service interruption they reported on was the result of user error — specifically misconfigured access controls — not AI.”

In April 2026, The Register published a follow-up on Amazon’s internal policy response. Aragon Research published “AWS AI Outages Raise Questions on Agentic Autonomy,” framing the incident as evidence of “a critical maturity gap in the shift from generative AI to agentic AI.” MSN carried additional coverage.

So you have two framings sitting side by side: Amazon’s “user error” classification and Aragon Research’s “governance gap” reading. Both are sourced to named publications. We’re presenting both here without taking a side.

What does “misconfigured access controls from an agentic session” actually mean?

Aragon Research documented the mechanics: an agent “determined that deleting and then recreating a specific environment was the optimal path to resolve a technical issue.” So it did exactly that. The agent understood the technical goal perfectly well. What it didn’t have was the business judgement to weigh a thirteen-hour outage against a clean environment.

That’s the structural problem with agentic tools in production. The agent chains its own decisions — assess the problem, pick a fix, execute it, move on — with no mandatory pause for a human to check the work. When you’ve handed that tool what Aragon Research calls “god-mode permissions” — service account rights broader than the task needs — a single bad decision can execute destructively before anyone even knows it’s happening.

Amazon’s Kiro documentation, via CRN, notes that “by default, Kiro requests authorisation before taking any action.” That default was overridden. Amazon’s position: misconfigured access controls are “the same issue that could occur with any developer tool or manual action.” Aragon Research’s position: agentic execution removes the human buffer that normally stops a catastrophic command from completing.

Both can be right at the same time. The misconfigured controls created the opening; the agentic tool’s autonomous chaining meant there was no human pause between the flawed permission and the destructive action. The fix, as Amazon’s Steve Tarcza spelled out at the AWS London Summit, is human-in-the-loop (HITL): “every mutating step that an AI might do requires a human to approve it. That is all the way down to publishing a document for someone to read.” How Kiro’s authorisation-first design addresses this at the tool level is covered separately.
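In code terms, a human-in-the-loop gate is a checkpoint in front of every mutating step. The sketch below is a generic illustration under stated assumptions: `Step`, `run_plan`, and `ApprovalDenied` are hypothetical names, and the approval callback stands in for whatever review mechanism an organisation actually uses. It does not describe how Kiro or Amazon's internal tooling implements the policy.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One action an agent proposes to take (hypothetical)."""
    description: str
    mutating: bool               # True if the step changes state anywhere
    action: Callable[[], str]    # the work itself, run only if permitted


class ApprovalDenied(Exception):
    """Raised when a human declines a mutating step."""


def run_plan(steps: list, approve: Callable[[Step], bool]) -> list:
    """Run steps in order, pausing for human approval before each mutating one.

    Read-only steps run without a prompt. The first denied step stops the
    plan, so nothing after it executes.
    """
    results = []
    for step in steps:
        if step.mutating and not approve(step):
            raise ApprovalDenied(step.description)
        results.append(step.action())
    return results
```

The design choice that matters is that the gate sits in the executor, not in the agent: the agent can propose deleting and recreating an environment, but the step cannot run until `approve` returns true for it.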

What was Amazon’s official position and what policy did it mandate?

Amazon’s official response to CRN: “This brief event was the result of user (AWS employee) error — specifically misconfigured access controls — not AI.” And the company confirmed it “implemented numerous additional safeguards, including mandatory peer review for production access.”

The clearest statement came from Steve Tarcza, Director of Amazon Stores and lead of the StoreGen team, at the AWS London Summit in April 2026: “Nothing ships without someone looking at it and validating it.” Every mutating step — deployments, infrastructure changes, even document publishing — requires explicit human approval.

Amazon is also running an internal target of 80% AI adoption. More AI output means more review burden. Tarcza’s team is managing that directly: spec-driven development puts AI output “in roughly the form that folks want it to be in,” which reduces review overhead without removing the requirement. The Kiro article covers how that works at the product level.

Why is the attribution contested and why does the ambiguity matter?

Amazon denied direct Kiro involvement. The FT attributed the outage to Kiro. The Register reported the FT story and carried Amazon’s denial. Aragon Research reframed the whole thing.

Amazon’s aboutamazon.com correction: “We want to address the inaccuracies in the Financial Times’ reporting.” A misconfigured IAM role caused the issue — the same misconfiguration could have happened with any tool.

Aragon Research’s framing doesn’t actually contradict this. Their point is that the governance conditions that would allow any agentic tool to cause a production failure were in place. “The significance of these outages extends far beyond a simple configuration error; it highlights a critical maturity gap.” User error created the vulnerability. Agentic execution made it consequential.

The ambiguity matters because it shapes how organisations respond. If every AI-involved failure gets classified as “user error,” the focus stays on the human configuration decision — not on the conditions under which agentic tools are allowed to act autonomously. And the accountability for those conditions sits with whoever holds the engineering mandate. That’s probably you.

What does this incident reveal about the broader risk of agentic coding in production?

The AWS incident isn’t a one-off. CloudBees’ State of Code Abundance 2026 (May 2026, 213 enterprise technology leaders) reports that 81% have had a production failure they could attribute to AI-generated code. AI adoption is running well ahead of governance maturity.

Aragon Research calls this the “AI honeymoon phase” — efficiency gains are visible, governance failures not yet consequential. “Enterprises are now facing the repercussions of replacing human oversight with unproven autonomous logic.”

Most development workflows don’t ask engineers to flag which parts of a codebase were AI-generated. As a result, the organisation can’t assess how much unreviewed AI output is already sitting in production. Governance literature calls this AI dark code; we get into that below.

Vibe coding — ad-hoc AI-assisted development without structured specifications or review gates — is what produces AI dark code. Run that code in an agentic session with broad permissions and you have exactly the conditions the AWS incident illustrates. CloudBees CEO Anuj Kapur put it well: “Enterprises are living through the same movie they watched with cloud. Adopt fast, figure out the economics and security implications later, and panic when the bill arrives.” The From Vibe to Spec article covers the broader failure-mode context.

What does accepting AI-generated code mean for engineering accountability?

CloudBees’ data: 46% of enterprise technology leaders say the CTO or VP of Engineering is ultimately accountable when AI-generated code causes a production failure.

No dedicated governance function means accountability defaults upward to whoever holds the engineering mandate. When every AI-involved failure is “user error,” the accountability chain runs to whoever authorised the tool’s use. That’s your exposure.

Tarcza flagged a compounding risk: “We can’t get to the point where we don’t have more junior engineers coming in. We can’t end up in a spot where there are not folks to maintain these systems.” Mandatory review policies only work if you have a functioning reviewer pool. Layoffs shrink that pool. More AI dark code reaches production without human eyes on it.

The legal dimension is where publicly available sourcing runs out: no source addresses contractual liability directly. In regulated industries, the accountability chain runs to the CTO regardless of how the incident gets classified. The Governance article covers the compliance implications.

What is “AI dark code” and why is it a governance problem organisations need to address now?

AI dark code is AI-generated code sitting in production without adequate review, audit trail, or architectural oversight — code whose origins are invisible to the organisation. When something breaks, you can’t tell whether the failure came from AI-generated or human-written code, and you can’t show auditors that review processes were followed.

The AWS incident’s contested attribution exists partly because the provenance of the configuration decision wasn’t logged in a form that settled the question. Amazon’s Correction of Error (COE) process is the structured post-mortem model most organisations simply don’t have.

Mandatory peer review addresses the review gap — but not the audit trail requirement: a documented record of what was reviewed, by whom, against what criteria. Only 12% of organisations have a dedicated governance function for AI-generated code. For regulated industries, the audit trail gap is fast becoming a regulatory exposure. The compliance implications are in the Governance article.

What does this incident mean for organisations thinking about spec-driven development?

Spec-driven development (SDD) is an approach where an AI agent first produces a structured specification — tasks, acceptance criteria, design decisions — for human review and approval before writing any code. It’s Amazon’s own stated approach via Kiro, and it’s the structural answer to the governance conditions the AWS incident exposed. The spec-driven development landscape covers the full range of tools and frameworks that implement this approach.

Tarcza’s assessment is worth quoting because it’s an honest one: does SDD solve hallucination and prompt injection? “No. It reduces it at best. And even then, there are cases where it still does go beyond the specification.” SDD creates a mandatory pause between problem statement and code execution, inserts a human-reviewable artefact at the decision point, and makes the review process auditable. Those are the governance structures the AWS incident shows were absent.

Tarcza also noted his team doesn’t use AI for deployments — deterministic systems handle that. SDD applies to code generation. For existing codebases, the path is incremental: establish review gates for new AI-generated changes, audit existing AI-generated code for provenance, work out how much AI dark code is already in production.

The tooling layer is covered in the Kiro article. The compliance layer is in the Governance article.

Frequently asked questions

Did Amazon’s AI coding tool Kiro actually cause the AWS outage?

Amazon officially denied direct Kiro involvement. The official classification, via CRN: “user error — specifically misconfigured access controls — not AI.” The FT attributed the outage to Kiro; Amazon’s aboutamazon.com correction addressed those “inaccuracies” directly. Aragon Research’s framing doesn’t attribute fault to Kiro specifically — it argues the governance conditions enabling any agentic tool to cause a production failure were present. The direct causal attribution is contested. The governance conditions are not.

What is “AI dark code” in plain terms?

AI-generated code integrated into production without documentation of what it does, who reviewed it, or whether it meets the organisation’s standards. When something breaks, the organisation can’t trace whether the failure originated in AI-generated or human-written code — and can’t demonstrate to auditors that its review processes were followed.

How long did the AWS outage last and what was affected?

The Register reported a thirteen-hour service disruption affecting AWS Cost Explorer in one of Amazon’s two Mainland China regions. Amazon confirmed the scope was narrow and that no customer inquiries were received.

What is Amazon’s mandatory peer review policy for AI-generated code?

Amazon confirmed it implemented “mandatory peer review for production access.” Steve Tarcza described the operating principle at the AWS London Summit: “Nothing ships without someone looking at it and validating it.” Every mutating step an AI might take (deployment, infrastructure change, document publishing) requires explicit human approval.

What are “god-mode permissions” in agentic AI?

Aragon Research’s term for service account permissions assigned to AI tools that are broader than the task requires, allowing destructive actions without secondary human approval. Their recommendation: audit all service account permissions assigned to AI tools before any agentic deployment.
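A permissions audit of this kind reduces to a set difference between what an agent has been granted and what its task needs. The sketch below assumes permissions can be listed as plain strings; the permission names are hypothetical and do not correspond to any real IAM vocabulary.

```python
def excess_permissions(granted: set[str], required: set[str]) -> set[str]:
    """Return permissions the agent holds but the task does not need."""
    return granted - required


# Hypothetical permission names, for illustration only.
granted = {"read:config", "write:config", "delete:environment"}
required = {"read:config", "write:config"}

excess = excess_permissions(granted, required)
```

Anything left in `excess` is a candidate for removal before the agent is deployed.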

What does the CloudBees report say about AI coding failures?

CloudBees’ State of Code Abundance 2026 (May 2026): 81% of enterprise technology leaders have experienced a production failure attributable to AI-generated code. 46% say the CTO or VP of Engineering is ultimately accountable. The CARE Index baseline is 83.6/100, but the report notes a gap between perceived preparedness and actual operational capability.

What is the difference between agentic AI coding and traditional AI-assisted coding?

Traditional AI-assisted coding: the AI suggests completions; a human reviews and accepts each one. Agentic coding: the AI autonomously chains multiple steps — analysing, selecting, executing, testing — with minimal human approval between steps. The governance risk is that a single flawed decision can cascade through subsequent automated steps before anyone can intervene.

Who is Steve Tarcza and why does his position matter?

Steve Tarcza is a Director at Amazon Stores leading the StoreGen team. His team operates under Amazon’s 80% AI adoption target while enforcing mandatory human review requirements. His AWS London Summit interview is the most operationally concrete governance statement in the public record on this incident.

What is the “AI honeymoon phase” and why is it ending?

Aragon Research’s term for the early AI adoption period where efficiency gains are visible and governance failures aren’t yet consequential. Organisations adopt agentic tools before their governance frameworks are mature enough to keep them in check. The AWS incident is cited as evidence this phase is ending.

What is the human-in-the-loop requirement for agentic AI?

Human-in-the-loop (HITL) requires explicit human approval before an AI agent executes a consequential action. Tarcza’s formulation: “every mutating step that an AI might do requires a human to approve it. That is all the way down to publishing a document for someone to read.”
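One way to picture the requirement is a gate that refuses to run any mutating step without a named approver. This is a minimal sketch under that assumption; the `Action` type, the `execute` function, and the `approved_by` parameter are hypothetical names, not part of any Amazon tooling.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    description: str
    mutating: bool  # True if the step changes any system or artefact


class ApprovalRequired(Exception):
    """Raised when a mutating action has no human approver."""


def execute(action: Action, run: Callable[[], str],
            approved_by: Optional[str] = None) -> str:
    """Run read-only actions freely; block mutating ones without approval."""
    if action.mutating and not approved_by:
        raise ApprovalRequired(f"'{action.description}' needs human approval")
    return run()
```

Under this shape, even "publish a document" is modelled as `mutating=True`, which matches how far down Tarcza says the approval requirement reaches.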

How does spec-driven development reduce agentic coding risk?

Spec-driven development requires the AI to produce a structured specification for human review before writing any code, creating a mandatory pause between problem statement and code execution. It does not eliminate hallucination or prompt injection. Tarcza’s assessment: “It reduces it at best. And even then, there are cases where it still does go beyond the specification.”

What should an engineering team do immediately if using agentic AI tools in production?

Aragon Research’s recommendations: audit all service account permissions — confirm no agent is operating with god-mode access. Implement mandatory human approval at every mutating step. Establish a review gate: no AI-generated change reaches production without a named engineer signing off. Then assess existing AI-generated code in production for provenance.

The 30-Plus Framework Landscape — Navigating Spec-Driven Development Options in 2026

Spec-driven development started as the industry’s answer to vibe coding — Andrej Karpathy’s term for casually prompting AI to generate code with no structure or planning. That correction worked, maybe too well. There are now more than 30 frameworks, tools, and templates all making the case that they’re the right way to put specs at the centre of AI-assisted development. For engineering leaders who need to pick one, that number is a genuine headache.

The three-tier mental model in this article is your navigation shortcut. Vendor-backed tools like AWS Kiro and GitHub SpecKit sit in Tier 1. Community-led frameworks like BMAD-METHOD, GSD, and Cursor Plan Mode sit in Tier 2. Niche-optimised tools like OpenSpec and Tessl sit in Tier 3. But before you even consult the tiers, there’s one filter question that eliminates half the field straight away: are you working in an existing codebase, or starting from scratch?

By the end of this article you’ll have that three-tier model, a concrete evaluation axis, and RanTheBuilder’s per-feature cost benchmarks to bring into a real team conversation.

How do you navigate 30-plus spec-driven development frameworks?

The ecosystem has no official taxonomy. Tools like OpenSpec, GSD, SpecKit, Superpowers, Taskmaster AI, Antigravity AgentKit, BMAD-METHOD, and Agent OS have all appeared without any coordinating body to organise them. The awesome-openspec curated list includes a spec-compare repository with six tools and scoring matrices, which helps — but even that’s a partial picture.

The seven frameworks worth evaluating seriously are: AWS Kiro and GitHub SpecKit (Tier 1, vendor-backed); BMAD-METHOD, GSD, and Cursor Plan Mode (Tier 2, community-led); and OpenSpec and Tessl (Tier 3, niche-optimised). The rest, including Taskmaster AI, Superpowers, Antigravity AgentKit, and Agent OS, have thin coverage and require significant primary research to evaluate.

The brownfield versus greenfield question is the first filter. It eliminates half the options immediately. Keep it in mind as you read through the tiers.

What do vendor-backed frameworks — AWS Kiro and GitHub SpecKit — offer that community tools don’t?

Tier 1 is defined by what “vendor-backed” actually means in practice: long-term maintenance, commercial support, IDE-native integration, and documentation that survives procurement reviews. That combination matters for regulated industries, enterprise teams with compliance requirements, or any organisation where a single-maintainer framework would fail procurement.

AWS Kiro is Amazon’s agentic IDE built on Bedrock. It runs a three-phase workflow — requirements, design, implementation — and generates user stories with acceptance criteria in EARS (Easy Approach to Requirements Syntax) notation, which structures acceptance criteria into machine-readable, testable statements rather than freeform prose. The main limitation is that it was designed for greenfield. As one Broadcom engineer put it, “most developers don’t start from a greenfield idea — they start from an existing codebase, a messy bug, or a design they already agreed on.” Kiro has added Design-first and Bugfix workflows since launch, but its core assumption is still a new project.
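To make the EARS idea concrete, the sketch below renders two of the standard EARS sentence patterns (event-driven and unwanted-behaviour) from their parts. The example requirements are invented, and the helper names are hypothetical; the exact wording Kiro emits is not confirmed by the sources here.

```python
def ears_event(trigger: str, system: str, response: str) -> str:
    """EARS event-driven pattern: When <trigger>, the <system> shall <response>."""
    return f"When {trigger}, the {system} shall {response}."


def ears_unwanted(condition: str, system: str, response: str) -> str:
    """EARS unwanted-behaviour pattern: If <condition>, then the <system> shall <response>."""
    return f"If {condition}, then the {system} shall {response}."


# Invented acceptance criteria, for illustration only.
criteria = [
    ears_event("a user submits the login form", "auth service",
               "issue a session token"),
    ears_unwanted("the password is wrong three times", "auth service",
                  "lock the account"),
]
```

Because each statement names a trigger, a system, and a response, it maps directly onto a test case, which is what makes the notation more checkable than freeform prose.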

GitHub SpecKit is the open-source, Microsoft/GitHub-backed option. Currently at v0.8.7 with 93,000+ GitHub stars and support for 30+ AI coding agents. Its core mechanism is the “constitution” — a markdown rules file containing high-level principles that apply to every change across every session. Define it once, and every spec inherits those rules. Templates flag unknowns as NEEDS CLARIFICATION rather than guessing. One flag worth noting: the RanTheBuilder benchmark found known upgrade issues that overwrite customisation files — worth testing thoroughly before committing.

Both Kiro and SpecKit skew greenfield. Neither was designed for delta-spec work on an existing codebase. When to choose Tier 1: you’re in a regulated industry, you need a support contract, or a single-maintainer bus factor would fail procurement. For everything else, start with Tiers 2 and 3.

Augment Code / Intent holds the highest EU AI Act compliance tier among SDD tools — if that’s your constraint, the governance deep-dive covers it in full.

What are BMAD-METHOD, GSD, and Cursor Plan Mode — and how do community-led frameworks differ from vendor tools?

Tier 2 frameworks are MIT-licensed, community-maintained, and have no vendor lock-in. They move faster than Tier 1 and adapt to your team’s conventions rather than imposing their own. The trade-off is straightforward: no support contract and higher bus-factor risk.

BMAD-METHOD (Build More Architect Dreams) is the grassroots anchor of the SDD ecosystem. Version 6.6.0 shipped on April 29, 2026 with 46,700+ GitHub stars and more than 5,500 forks. It orchestrates 12+ specialised AI agent roles — Mary (Business Analyst), Preston (Product Manager), Winston (Architect), Sally (Product Owner), Simon (Scrum Master), Devon (Developer), Quinn (QA Engineer), and more. Each agent has a defined persona, prompt set, and artefact responsibility. Planning artefacts — PRD, architecture doc, story breakdown — flow into the project’s docs/ folder. The /bmad-help command reads current project state and tells you which agent to invoke next. By any measure BMAD is the healthiest open-source project in this space: 458 commits in 90 days, 94.3% issue close rate, 1-day PR median age.

BMAD’s structured artefacts are how it tackles context rot — the failure mode where a long AI coding session’s accumulated context degrades output quality until the agent starts contradicting earlier decisions. That matters a lot for large, complex features. For a well-scoped solo task, the benefit is smaller.

GSD (Get Shit Done) is the lean alternative. It reached 61,000 GitHub stars in under five months after its December 2025 initial commit. Where BMAD adds sprint ceremonies, GSD’s philosophy is that complexity should live in the system, not the workflow. It spawns parallel researchers, planners, executors, and verifiers, each in a fresh 200K token context window, so no single session accumulates enough context to rot. The Minimal Install Profile cuts the system prompt from ~12,000 to 700 tokens, making it viable for local LLMs and metered APIs.

Cursor Plan Mode is the “no framework required” entry point. Built into the Cursor AI editor — no installation, no configuration. The agent produces a detailed implementation plan, asks clarifying questions, maps affected files, and presents it for your review before writing a single line of code. No spec lifecycle, no drift detection, no living-spec synchronisation. It’s SDD-lite. And it’s the lowest-friction way to build the habit that matters.

BMAD’s two operating modes — Full and Quick — are the most consequential decision within Tier 2.

BMAD Full vs BMAD Quick — when does the extra overhead pay for itself?

The cost difference is substantial, so make this decision explicitly.

BMAD Full produces a complete PRD, architecture document, and story breakdown before any implementation code runs. The adversarial code review (/bmad-bmm-code-review) surfaces design issues before passing; “Party Mode” runs multiple agent personas through the design before implementation begins. Per RanTheBuilder’s benchmark: roughly $200 per feature and six days to first PR. Highest specification quality score of any framework tested.

BMAD Quick produces a single combined spec document. Same elicitation quality as Full, but it skips the full artefact set and the adversarial review loop. Roughly $85 per feature in about two days.

RanTheBuilder’s recommendation is direct: choose Full when design correctness is critical — the adversarial review and course-correction workflow catch design mistakes before they compound. It’s also the right call when the feature involves significant architectural decisions or cross-team dependencies, when the codebase is large and context rot is a known problem, or when a non-developer stakeholder will review the spec.

Choose Quick when the feature is well-scoped and self-contained, when time-to-PR is the main constraint, or when your team is still working out whether to commit to BMAD at all.

One important note: RanTheBuilder’s cost data is per feature at the individual developer level. It does not translate directly to team cost or total cost of ownership.

What makes OpenSpec and Tessl niche-optimised — and when do you need them?

Tier 3 tools aren’t general-purpose SDD frameworks. They solve a specific problem better than anything in Tiers 1 or 2.

OpenSpec (Fission AI) is brownfield-first by design. Its key architectural decision is separating the Source of Truth (what is currently implemented) from proposed changes. Each feature request or bugfix becomes an independent subfolder under changes/, containing a proposal.md, specs/, design.md, and tasks.md. Delta markers — ADDED, MODIFIED, REMOVED — track what changes relative to existing functionality. You specify only what’s changing, not the whole system.

The workflow is propose → apply → archive. Only after a change is implemented and accepted does the delta merge back into the main specs/ directory, building up a growing source-of-truth document over time. RanTheBuilder’s benchmark has OpenSpec at $95 per feature and roughly one day to first PR, with the highest overall score of the four frameworks tested (4.00/5). It’s currently single-maintainer (Fission AI), so organisations with a low tolerance for bus-factor risk need to weigh that against how well the brownfield fit suits them. The awesome-openspec list is the best navigation resource for tutorials and comparisons.
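The archive step can be pictured as merging an accepted delta into the source of truth. The sketch below models specs as a name-to-text dictionary and applies ADDED, MODIFIED, and REMOVED entries in turn. It illustrates the delta-spec idea only; the function name, data shapes, and sample requirements are assumptions, not OpenSpec's actual file format or implementation.

```python
def archive_change(source_of_truth: dict[str, str],
                   delta: dict[str, dict[str, str]]) -> dict[str, str]:
    """Merge an accepted delta into a copy of the main specs."""
    merged = dict(source_of_truth)
    for name, text in delta.get("ADDED", {}).items():
        if name in merged:
            raise ValueError(f"ADDED requirement already exists: {name}")
        merged[name] = text
    for name, text in delta.get("MODIFIED", {}).items():
        if name not in merged:
            raise KeyError(f"MODIFIED requirement not found: {name}")
        merged[name] = text
    for name in delta.get("REMOVED", {}):
        if name not in merged:
            raise KeyError(f"REMOVED requirement not found: {name}")
        del merged[name]
    return merged


# Invented requirements, for illustration only.
specs = {"login": "Users log in with email and password.",
         "legacy-export": "Users can export CSV via FTP."}
delta = {
    "ADDED": {"two-factor": "Users may enable TOTP two-factor login."},
    "MODIFIED": {"login": "Users log in with email, password, and optional TOTP."},
    "REMOVED": {"legacy-export": ""},
}
updated = archive_change(specs, delta)
```

Note that the delta mentions three requirements regardless of how large `specs` is, which is the property that keeps brownfield specs small and reviewable.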

Tessl solves a different problem entirely. Founded by Guy Podjarny, creator of Snyk, Tessl is built around the Spec Registry — an open registry of over 10,000 machine-readable specifications for external libraries and APIs. Think of it as npm for specifications: agents pull the correct spec for a package by name rather than guessing library behaviour from training data. That prevents a specific and expensive failure mode — AI-hallucinated method signatures that reach production.

What makes Tessl different is the living-spec model. Specs update as libraries update, unlike the point-in-time specs produced by BMAD, SpecKit, OpenSpec, and Kiro. It is still in closed/limited beta as of mid-2026, so it’s not a same-day option. Once it opens up, it’s the right choice for any team integrating third-party APIs at scale.

Brownfield vs. greenfield: why is this the most important question before choosing a framework?

Most teams at 50–500 person organisations are working in brownfield codebases, not greenfield ones. The greenfield-first framing of most SDD articles doesn’t match that reality. The actual job for most teams is “build new things fast without breaking anything already in production.”

Greenfield-biased frameworks — Kiro, SpecKit, BMAD Full — were designed to define a system from scratch. Spec completeness is their primary value. They struggle with the “specify only what is changing” use case because the change surface is a small delta in a large existing system, not a blank page. Enterprise analysis of Kiro and SpecKit describes them as working well for greenfield projects and prototypes, but as a different game entirely for enterprise software with existing systems and long-term maintainability requirements.

Brownfield-first frameworks — OpenSpec — were built around the delta-spec model from the start. Only the change surface needs specifying. The accumulating source-of-truth document grows with the codebase rather than replacing it.

Dual-mode frameworks — BMAD with Quick workflow, GSD — can be adapted for brownfield work but require the team to scope the spec manually. Workable, but not native.

Cursor Plan Mode operates at the individual-change level. The plan is scoped to the current task, not the whole system. That works in both contexts, which is why it’s a sensible starting point regardless of your codebase situation.
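The filter described above can be written down as a small lookup. The classification below simply encodes this section's own groupings; the labels `"adaptable"` and `"any"` are shorthand introduced here for the dual-mode frameworks and for Cursor Plan Mode, and Tessl is left out because the section does not place it on this axis.

```python
# Context fit as grouped in this section. "adaptable" = dual-mode frameworks
# that need manual scoping; "any" = task-scoped, works in both contexts.
CONTEXT_FIT = {
    "AWS Kiro": "greenfield",
    "GitHub SpecKit": "greenfield",
    "BMAD Full": "greenfield",
    "OpenSpec": "brownfield",
    "BMAD Quick": "adaptable",
    "GSD": "adaptable",
    "Cursor Plan Mode": "any",
}


def shortlist(codebase: str) -> list[str]:
    """Keep frameworks that match the codebase context or can be adapted to it."""
    if codebase not in ("greenfield", "brownfield"):
        raise ValueError("codebase must be 'greenfield' or 'brownfield'")
    keep = {codebase, "adaptable", "any"}
    return sorted(name for name, fit in CONTEXT_FIT.items() if fit in keep)
```

Calling `shortlist("brownfield")` drops the three greenfield-biased entries and leaves OpenSpec, the two adaptable frameworks, and Cursor Plan Mode.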

If compliance and audit trail requirements are part of your brownfield evaluation, that adds a second axis. The governance and compliance detail lives in a separate article.

What does per-feature cost data tell us about framework overhead — and what does it leave unanswered?

The brownfield/greenfield axis narrows the shortlist. Cost data calibrates the decision within it.

RanTheBuilder (Ran Isenberg) ran the same real feature through multiple frameworks and scored them across 13 dimensions from an enterprise perspective. It’s the only published peer-level cost benchmark in the SDD ecosystem. All figures are at the individual developer level — they do not translate directly to team cost or total cost of ownership.

The benchmarked per-feature estimates are:

BMAD Full: roughly $200 per feature and six days to first PR — highest cost in the benchmark, highest specification quality score.

BMAD Quick: roughly $85 per feature and two days to first PR.

SpecKit: roughly $75 per feature and one day to first PR. Overall score 2.77/5 — lowest of the four, with community health signals to match (37 commits in 90 days, 36.8% issue close rate, 62-day PR median age).

OpenSpec: roughly $95 per feature and one day to first PR. Highest overall score (4.00/5).

RanTheBuilder’s summary on the BMAD Quick / SpecKit / OpenSpec cluster: “They land in the same ballpark on both speed and cost. The differences are noise.” BMAD Full is the outlier — worth it for architecturally complex work, hard to justify for everything else.
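A quick way to see why BMAD Full is the outlier is to express each framework relative to BMAD Quick. The sketch uses only the per-feature figures quoted above; the choice of BMAD Quick as the baseline is arbitrary and made here for illustration.

```python
# Per-feature figures quoted above: (USD per feature, days to first PR).
BENCHMARK = {
    "BMAD Full": (200, 6),
    "BMAD Quick": (85, 2),
    "SpecKit": (75, 1),
    "OpenSpec": (95, 1),
}

BASELINE_COST, BASELINE_DAYS = BENCHMARK["BMAD Quick"]


def relative_to_quick(name: str) -> tuple[float, float]:
    """Return (cost ratio, days ratio) against the BMAD Quick baseline."""
    cost, days = BENCHMARK[name]
    return round(cost / BASELINE_COST, 2), round(days / BASELINE_DAYS, 2)


# Spread of cost across the three-framework cluster, in USD.
cluster = ["BMAD Quick", "SpecKit", "OpenSpec"]
cluster_spread = (max(BENCHMARK[n][0] for n in cluster)
                  - min(BENCHMARK[n][0] for n in cluster))
```

On these numbers BMAD Full costs roughly 2.35 times as much as Quick and takes three times as long, while the whole cluster sits within a $20 band.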

What the data doesn’t cover: GSD, AWS Kiro, and Tessl have no equivalent published benchmark. Name that gap when presenting these numbers to your team. Bus factor, long-term maintenance cost, and downstream quality outcomes aren’t captured either. The benchmark measures specification quality; it doesn’t tell you what happens six months down the track.

How do you evaluate and pilot a spec-driven development framework before committing?

Running a pilot on your own codebase is more useful than any benchmark. Here’s a six-step process that doesn’t turn the evaluation into a research project.

Step one: apply the brownfield/greenfield filter. Eliminate frameworks that don’t match your primary codebase context.

Step two: apply the tier filter. If vendor support or compliance documentation is a hard requirement, start in Tier 1.

Step three: select one framework per shortlist tier for a pilot sprint using a representative feature: not the simplest task on the backlog, not the most fraught one.

Step four: capture time-to-PR, specification quality (reviewed by a non-developer if possible), token cost, and team friction. Team friction is the one that doesn’t show up in benchmarks.

Step five: compare your results against the RanTheBuilder benchmarks as a calibration baseline.

Step six: evaluate sustainability signals: bus factor, community activity, documentation quality, and how easy it is to migrate specs if needed.
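The measurements in steps four and five can be captured in a small record and compared against a published baseline. This is a sketch only: the 1-5 scoring scales and the pilot values are invented placeholders, and the baseline figures passed in are the OpenSpec numbers quoted earlier.

```python
from dataclasses import dataclass


@dataclass
class PilotResult:
    framework: str
    days_to_pr: float
    token_cost_usd: float
    spec_quality: int   # assumed 1-5 scale, ideally scored by a non-developer
    team_friction: int  # assumed 1-5 scale, higher means more friction


def compare_to_baseline(result: PilotResult, baseline_cost: float,
                        baseline_days: float) -> dict[str, float]:
    """Express a pilot's cost and time as ratios of a benchmark baseline."""
    return {
        "cost_ratio": round(result.token_cost_usd / baseline_cost, 2),
        "days_ratio": round(result.days_to_pr / baseline_days, 2),
    }


# Placeholder pilot numbers, not real measurements.
pilot = PilotResult("OpenSpec", days_to_pr=1.5, token_cost_usd=114.0,
                    spec_quality=4, team_friction=2)
ratios = compare_to_baseline(pilot, baseline_cost=95.0, baseline_days=1.0)
```

Ratios well above 1.0 are a prompt to ask what differs about your codebase or team, not proof that the framework is a poor fit.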

For teams with no SDD experience at all: start with Cursor Plan Mode for two weeks. No new tooling, no configuration, no framework adoption decision. Just build the habit of reviewing AI plans before execution. Once that’s a team norm, adopting GSD or BMAD is a much smaller step. If you’re not using Cursor, GSD is the lowest-ceremony standalone option. The spec-driven development as a movement article covers the broader context for any team still deciding whether SDD is the right shift at all.

The community philosophy underlying all of these frameworks is worth understanding before you commit. The frameworks differ significantly in approach and ceremony, but they share a common premise about what makes AI-assisted coding reliable at scale.

Frequently asked questions

What is the easiest way to get started with spec-driven AI development?

Cursor Plan Mode. It requires no additional tooling beyond the Cursor editor itself. The plan-review step before code execution is the core SDD habit. Establishing it costs nothing and prepares a team for adopting a fuller framework later. If you’re not using Cursor, GSD is the lowest-ceremony standalone framework — fast installation, 14 supported runtimes including Claude Code, Windsurf, and Gemini CLI.

What is BMAD-METHOD and what do the 12-plus agent roles do?

BMAD-METHOD is an open-source, MIT-licensed, community-led SDD framework that orchestrates 12+ specialised AI agent roles across the full development lifecycle. Named roles include Mary (Business Analyst), Preston (Product Manager), Winston (Architect), Sally (Product Owner), Simon (Scrum Master), Devon (Developer), and Quinn (QA Engineer). Each operates with a defined persona, prompt set, and artefact responsibility. The /bmad-help command reads current project state and tells you which agent to invoke next.

What is the difference between BMAD Full and BMAD Quick?

BMAD Full produces a complete artefact set — PRD, architecture document, story breakdown — before any code runs. Roughly $200 per feature and six days to first PR. BMAD Quick produces a single combined spec document, skips the full artefact set, and comes in at roughly $85 per feature and two days to first PR. Per RanTheBuilder: choose Full for large, architecturally complex features; choose Quick for well-scoped work or when evaluating BMAD for the first time.

Is BMAD-METHOD too complicated for a small development team?

BMAD Full has the highest learning curve of any SDD tool. The multi-agent approach adds ceremony that may be excessive for small teams or simple features. Solo developers and small teams typically get more value from GSD or Cursor Plan Mode. GSD’s Minimal Install Profile cuts the system prompt by 94% for lightweight use.
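The 94% figure follows from the token counts quoted for the Minimal Install Profile, as this short calculation shows. The ~12,000 figure is approximate in the source, so the result is approximate too.

```python
full_prompt_tokens = 12_000    # approximate figure quoted for the full profile
minimal_prompt_tokens = 700    # figure quoted for the Minimal Install Profile
context_window_tokens = 200_000

# Share of the system prompt removed by the minimal profile, in percent.
reduction_pct = round((1 - minimal_prompt_tokens / full_prompt_tokens) * 100, 1)

# Share of a 200K context window each prompt occupies, in percent.
full_share_pct = round(full_prompt_tokens / context_window_tokens * 100, 2)
minimal_share_pct = round(minimal_prompt_tokens / context_window_tokens * 100, 2)
```

The reduction works out to about 94.2%, and the prompt's share of a 200K window falls from 6% to 0.35%.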

What is OpenSpec and how does it handle brownfield codebases?

OpenSpec (Fission AI) is a brownfield-first SDD tool that uses delta markers (ADDED / MODIFIED / REMOVED) to specify only what is changing in an existing codebase. The propose → apply → archive workflow keeps specs compact and reviewable. The archive step accumulates a growing source-of-truth document. Roughly $95 per feature and one day to first PR (RanTheBuilder benchmark). Currently single-maintainer — bus-factor risk applies.

What is the Tessl Spec Registry and how does it prevent API hallucinations?

The Tessl Spec Registry is an open registry of over 10,000 machine-readable specifications for external libraries and APIs — “npm for specifications.” AI coding agents pull the correct spec for a package by name rather than guessing library behaviour from training data. Tessl uses a living-spec model: specs update as libraries update, unlike the point-in-time specs produced by BMAD, SpecKit, OpenSpec, and Kiro. Still in closed/limited beta as of mid-2026.

GSD vs BMAD — which is better for solo developers?

GSD. It has 61,000+ GitHub stars, low ceremony, native Claude Code orchestration, and fast onboarding. One solo developer documented producing 100,000 lines of code in two weeks using GSD with zero context management overhead. BMAD Full’s 12-agent artefact overhead is designed for feature complexity and team coordination — solo developers rarely need that level of governance.

How do vendor-backed SDD tools (Kiro, SpecKit) compare to community tools (BMAD, GSD) for enterprise teams?

Vendor-backed tools offer long-term maintenance guarantees, commercial support, and compliance documentation. Community tools offer faster iteration, no vendor conventions, and no licensing constraints beyond MIT. For regulated industries or enterprise procurement, Tier 1 typically clears compliance requirements more easily. The governance evaluation is covered in detail in a separate article on spec-driven development in regulated industries.

Is spec-driven development just waterfall with extra steps?

No. SDD specs are per-feature (or per-change in OpenSpec’s model), not upfront system-wide design documents. They’re generated iteratively, reviewed by humans, and consumed directly by the AI coding agent rather than handed off to a separate implementation phase. A waterfall requirements document takes weeks and goes stale. A lightweight SDD spec can take minutes, is reviewed before implementation starts, and is either archived or discarded afterwards.

What is context rot and how do SDD frameworks prevent it?

Context rot is the failure mode where a long AI coding session’s accumulated context degrades output quality — the agent starts contradicting earlier decisions or losing track of constraints. GSD combats it by executing each plan in isolated sub-agents with fresh 200K token context windows. BMAD and OpenSpec both use persistent artefacts — the docs/ folder and the accumulating source-of-truth document respectively — that outlast any single AI session.
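The difference between one long session and fresh per-task contexts can be shown with a toy model. The sketch below uses word counts as a crude stand-in for tokens and invented task strings; it illustrates the growth pattern only and is not how GSD or BMAD are implemented.

```python
def run_shared(tasks: list[str]) -> list[int]:
    """One long session: each task sees everything that came before it."""
    context: list[str] = []
    sizes = []
    for task in tasks:
        context.append(task)
        sizes.append(sum(len(t.split()) for t in context))
    return sizes


def run_isolated(tasks: list[str], spec: str) -> list[int]:
    """Fresh context per task: only the persistent spec plus the task itself."""
    return [len(spec.split()) + len(task.split()) for task in tasks]


# Invented tasks and spec text, for illustration only.
tasks = ["add login endpoint", "add logout endpoint", "add password reset flow"]
spec = "auth service spec v1"

shared_sizes = run_shared(tasks)
isolated_sizes = run_isolated(tasks, spec)
```

In the shared session the context grows with every task, while in the isolated runs it stays roughly constant because the persistent artefact carries the decisions forward instead of the chat history.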

What should I look for when evaluating an SDD framework’s long-term sustainability?

Bus factor first: is the framework maintained by a single person (OpenSpec/Fission AI), an organisation (GitHub/SpecKit, AWS/Kiro), or a broad community (BMAD, GSD)? Then community activity: commit frequency, issue response time, documentation completeness. Then ecosystem lock-in: how easy is it to migrate specs and workflows if needed? RanTheBuilder’s benchmark provides community health data for BMAD (458 commits/90 days, 94.3% issue close rate), SpecKit (37 commits/90 days, 36.8% issue close rate), and OpenSpec (158 commits/90 days, 24.2% issue close rate).

Can I use multiple SDD frameworks in the same organisation?

Yes. The three-tier model is a selection framework, not a mandate to pick one and exclude all others. A common pattern: Cursor Plan Mode for individual developers, BMAD or GSD for team-level feature specs, and Tessl’s Spec Registry as a complement to any framework for API-heavy work. OpenSpec’s delta-spec model can be layered on top of a GSD or BMAD greenfield spec once the codebase matures into brownfield territory.

GitHub SpecKit and the Microsoft Approach to AI Coding Governance

AI coding agents are powerful but undisciplined. Leave them without a persistent set of constraints and every new session starts from scratch — and your engineering standards quietly erode. That’s the problem GitHub SpecKit was built to solve.

SpecKit is Microsoft and GitHub’s open-source answer to unstructured AI-assisted development: MIT-licensed, agent-agnostic, and structured around a four-phase workflow anchored by a single governance document. If you’re evaluating where Microsoft sits in the spec-driven development landscape, this is the analysis.

What is GitHub SpecKit and how does it approach spec-driven development?

GitHub SpecKit (repo: spec-kit, CLI: specify-cli) is an open-source CLI toolkit released under the MIT licence. Its job is to replace unstructured, prompt-driven AI coding — vibe coding — with a governance-first workflow that produces versionable artefacts at every stage.

💡 Vibe coding refers to generating code through freeform AI prompting without formal specifications or persistent constraints, producing outputs that are difficult to reproduce or audit.

What sets SpecKit apart is IDE-agnostic portability. It works with 30+ AI coding agents — GitHub Copilot, Claude Code, Gemini CLI, Amazon Q, Cursor, Windsurf — without modification. Den Delimarsky from Microsoft described it at AI Dev Days as a toolkit for “product scenarios and predictable outcomes instead of vibe coding every piece from scratch.”

SpecKit is the tooling implementation of Microsoft’s Agentic-Agile framework: the idea that AI agent teams require the same discipline as human Agile teams — explicit intent, structured tasks, persistent constraints. Agents are treated as contributors, and every agent action carries the same downstream consequences as a human commit.

As of May 2026, SpecKit has 93,000+ GitHub stars. That’s a demand indicator for the spec-driven development category, not a quality rating for SpecKit specifically. Worth keeping in mind.

What is the “constitution” in GitHub SpecKit and how does it function?

The constitution.md is a project-wide constraints document. It sets out your defined engineering principles — TDD requirements, naming conventions, security baselines, architecture boundaries — and every AI agent in the workflow must respect them.

If you’ve written ADRs (Architectural Decision Records) or RFCs, the constitution does the same job. It records the persistent intent of your engineering team and makes that intent binding across all agents and workflow phases. It’s not advisory. It’s the contract between developer and agent, and a violation is a workflow failure.
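The "binding, not advisory" idea can be sketched as a check that fails the workflow when a plan breaks a principle. This is purely illustrative: SpecKit's constitution is a markdown document read by the agent, and nothing here reflects how SpecKit itself enforces it. The rule names and plan fields are hypothetical.

```python
class ConstitutionViolation(Exception):
    """Raised when a plan breaks one or more constitution principles."""


# Hypothetical principles: each maps a name to a predicate over a plan dict.
CONSTITUTION = {
    "tests-first": lambda plan: plan.get("tests_written_first", False),
    "no-plaintext-secrets": lambda plan: not plan.get("stores_plaintext_secrets", False),
}


def enforce(plan: dict) -> None:
    """Fail the workflow if the plan violates any principle."""
    violated = sorted(name for name, rule in CONSTITUTION.items()
                      if not rule(plan))
    if violated:
        raise ConstitutionViolation(", ".join(violated))
```

An advisory document would log the violation and continue; a binding one stops, which is the distinction the paragraph above draws.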

One practitioner confirmed with a SpecKit maintainer that TDD requirements added to the constitution propagate through all workflow phases. The Microsoft Learn module frames it as encoding “internal engineering guidelines (security, performance, compliance)” and ensuring generated plans adhere to these constraints. The constitution is established in the Specify phase, before any code or design begins — which connects to the community principle explored in A Sufficiently Detailed Spec Is Code.

One gap worth noting: GitHub Issue #2362 documents community demand for a native security governance preset covering threat modelling, supply-chain transparency, and ASVS verification. No official preset exists yet. If your team has compliance requirements, you’ll need to author your own governance entries until native presets are released. The issue puts it well: “Projects with compliance or audit pressure need repeatable evidence locations and starter artifacts instead of ad hoc document structures.”

How does the SpecKit four-phase workflow operate in practice?

The four phases run in sequence: Specify, Plan, Tasks, and Implement. The specify-cli drives you through each phase inside a .specify project directory.

Specify: The developer documents requirements and writes the constitution.md. Nothing moves forward until governance is in place.

Plan: SpecKit generates a technical design with AI assistance. Constitution constraints are active throughout — no design decision can contradict them. Output: plan.md.

Tasks: The design is broken into discrete, executable agent tasks — an auditable checkpoint where engineering leads can review before execution begins. Output: tasks.md. This is the phase that structurally separates SpecKit from tools that hand a design straight to an agent.

Implement: AI agents execute the task list against the spec and constitution. In one independent evaluation, it took roughly 90 minutes to reach spec, plan, and task breakdown, followed by about 35 minutes of agent execution. Once artefacts are version-controlled, you can swap agents over time without touching the spec format.
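The gating behaviour described across the four phases can be sketched in a few lines. This is a toy model under stated assumptions, not SpecKit's implementation: the `Workflow` class, its methods, and the lowercase phase names are hypothetical and exist only to show how a review gate stops a later phase from running early.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Phase order as described in the text. Everything else here is illustrative.
PHASES = ["specify", "plan", "tasks", "implement"]


@dataclass
class Workflow:
    """Toy model of a four-phase workflow with a human review gate per phase."""

    approved: list[str] = field(default_factory=list)

    def next_phase(self) -> str | None:
        # The next phase is the first one that has not been approved yet.
        done = len(self.approved)
        return PHASES[done] if done < len(PHASES) else None

    def approve(self, phase: str) -> None:
        # Refuse out-of-order approvals so that no phase can be skipped.
        expected = self.next_phase()
        if phase != expected:
            raise ValueError(f"cannot approve {phase!r}; expected {expected!r}")
        self.approved.append(phase)

    def can_run(self, phase: str) -> bool:
        # A phase may run only when exactly the earlier phases are approved.
        return self.approved == PHASES[: PHASES.index(phase)]
```

In this sketch, `can_run("implement")` stays false until Specify, Plan, and Tasks have each been approved in order, which is the review checkpoint the Tasks phase is meant to provide.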

Microsoft has also published a 13-unit Microsoft Learn curriculum covering Azure DevOps integration and multi-agent collaboration — a solid credibility signal for enterprise evaluation.

How does SpecKit compare to AWS Kiro on philosophy and workflow?

SpecKit and Kiro represent different bets about what AI coding governance actually means. SpecKit is a governance layer — a persistent constraints document that defines what every agent is allowed to do, regardless of which tool or IDE your team uses. Kiro is a workflow mandate — a structured requirements-to-implementation pipeline with tight AWS and VS Code integration built in.

SpecKit’s four-phase model separates task decomposition from implementation. Kiro’s three-phase model — Requirements, Design, Implementation — moves from design to agent execution without a dedicated task-list phase, removing one human review gate.

Kiro uses EARS notation to generate structured acceptance criteria covering edge cases.

💡 EARS (Easy Approach to Requirements Syntax) is a structured notation for software requirements that uses defined sentence patterns to ensure edge cases and alternative scenarios are explicitly captured.

SpecKit uses the free-form constitution.md, which is more flexible but demands more discipline to write governance well. The IDE constraint matters: Kiro is VS Code-first with deep AWS integration, and its models are served through Amazon Bedrock. SpecKit works across IDEs and 30+ agents. If your team runs a multi-cloud or heterogeneous tooling environment, that portability is a genuine differentiator.

SpecKit’s constitution also persists across sessions and projects. Kiro’s requirements documents are per-project artefacts. Teams inside the AWS ecosystem with VS Code standardisation have a strong case for Kiro; teams with mixed tooling or portability requirements have a strong case for SpecKit. The dedicated Kiro analysis covers the AWS side in depth.

How does SpecKit integrate with GitHub Copilot and Azure DevOps?

For teams already on the Microsoft developer stack, SpecKit forms the governance backbone of a coherent three-component setup: GitHub Copilot handles AI agent execution, Azure DevOps handles the enterprise pipeline, and SpecKit provides the spec and governance layer.

The constitution.md functions as a persistent constraints context for Copilot — an always-on governance document, not a one-time prompt. Copilot holds approximately 40% market share with over 20 million all-time users, so most Microsoft-stack teams already have it in the workflow.

Azure DevOps integration is an officially supported workflow, covered in Microsoft Learn Unit 11, which describes integrating SpecKit artefacts (specs, task lists, constitution) into Azure DevOps pipelines for enterprise SDLC governance. For configuration steps, the Microsoft Learn module is the authoritative reference.

Microsoft also provides the Agentic-Agile Template (microsoft/agentic-agile-template) as a companion project scaffold with persistent agent guidance files. For larger enterprises, SpecKit supports multiple AI agents running on different tasks, all governed by the same constitution.

How does SpecKit compare to OpenSpec on brownfield vs. greenfield teams?

SpecKit and OpenSpec have a clean philosophical divide. SpecKit is about governance permanence — establish the constitution before any code is written, and every agent honours it. OpenSpec (from Fission AI) is about continuity — keeping specs synchronised with existing, evolving code.

SpecKit performs best on greenfield codebases. Independent testing confirms it struggles with legacy frameworks and complex existing codebases: retrofitting a constitution means mapping existing patterns against governance principles, and those patterns often conflict. This is the most commonly cited enterprise limitation.

OpenSpec’s living-spec model is designed for exactly that context — its propose, apply, archive flow tracks changes relative to existing functionality rather than generating full specs upfront.

SpecKit upgrades can also overwrite customisation files — documented in v0.1.6 upgrade warnings. Enterprise teams need an abstraction strategy between SpecKit upgrades and their org-specific customisations.
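One possible shape for that abstraction strategy is to keep organisation-specific files in a separate directory and re-apply them after each upgrade. The sketch below assumes exactly that layout; the function name and both directory arguments are hypothetical and do not correspond to any real SpecKit path or command.

```python
import shutil
import tempfile
from pathlib import Path


def reapply_overrides(upgraded_dir: Path, overrides_dir: Path) -> list[Path]:
    """Copy every file under overrides_dir over its counterpart in upgraded_dir.

    Returns the relative paths that were written, so a reviewer can see which
    upgraded files were replaced by organisation-specific versions.
    """
    written = []
    for src in sorted(overrides_dir.rglob("*")):
        if src.is_file():
            rel = src.relative_to(overrides_dir)
            dest = upgraded_dir / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
            written.append(rel)
    return written


if __name__ == "__main__":
    # Demo with throwaway directories; the file names are made up.
    with tempfile.TemporaryDirectory() as tmp:
        upgraded = Path(tmp) / "tool"
        overrides = Path(tmp) / "org"
        (upgraded / "templates").mkdir(parents=True)
        (overrides / "templates").mkdir(parents=True)
        (upgraded / "templates" / "plan.md").write_text("vendor default")
        (overrides / "templates" / "plan.md").write_text("org version")
        print(reapply_overrides(upgraded, overrides))
```

Keeping the overrides in their own version-controlled directory means an upgrade can overwrite the tool's files freely, and the returned list gives you a record of what was restored.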

The practical guidance: if you are starting something new, SpecKit is a strong choice. If you are managing a large existing codebase, start with OpenSpec and consider SpecKit for new components in a hybrid approach. The full framework landscape including BMAD-METHOD gives you the broader comparison.

What does SpecKit’s adoption signal tell us about the SDD market?

The 93,000+ GitHub star count tells us that a large number of developers are watching or bookmarking the tool, not that they are running it in production at enterprise scale. Star counts accumulate through press coverage, social sharing, and developer curiosity. The VS Magazine headline, “GitHub Spec Kit Takes Off as Antidote to Piecemeal ‘Vibe Coding’”, gives you a sense of why the figure spiked: it reflects frustration with vibe coding as much as active SpecKit adoption.

No community-originated SDD framework — BMAD-METHOD, OpenSpec — has reached comparable star counts. SpecKit’s Microsoft and GitHub backing gives it discoverability and institutional credibility that community tools simply don’t have. What the figure doesn’t tell us: whether enterprise teams are running SpecKit at scale, or whether brownfield retrofit success rates are high. Named enterprise case studies aren’t yet publicly available.

SpecKit is a well-resourced entry in the SDD market with genuine enterprise traction potential. Microsoft’s backing gives it structural advantages over grassroots alternatives. Do your due diligence on brownfield fit and security governance before committing. Broader context is at spec-driven development.

Frequently Asked Questions

What does “constitution.md” mean in GitHub SpecKit?

The constitution.md is the project-wide constraints file that establishes defined engineering principles all AI agents must follow. Think of it as an Architectural Decision Record — it records the persistent intent of your engineering team and makes that intent binding across all agents and workflow phases. It’s not a README or style guide; it’s a governance document the agent workflow enforces.

Is GitHub SpecKit only for greenfield projects?

SpecKit performs best on greenfield codebases where the constitution can be established before any code is written. Independent testing confirms it struggles with legacy frameworks and complex existing codebases. A hybrid approach works for many teams: use SpecKit for new components or services within a broader brownfield codebase.

What agents are compatible with GitHub SpecKit?

SpecKit supports 30+ AI coding agents including GitHub Copilot, Claude Code, Gemini CLI, Amazon Q, Cursor, and Windsurf. Amazon Q, a competing product from AWS, is on the compatible list, which is the strongest demonstration of SpecKit’s portability argument.

How does GitHub SpecKit differ from BMAD-METHOD?

SpecKit is vendor-backed (Microsoft/GitHub), MIT-licensed, CLI-driven, and officially supported with a 13-unit Microsoft Learn curriculum. BMAD-METHOD is community-originated with no vendor backing or official enterprise support. Teams inside the Microsoft ecosystem with enterprise support requirements will find SpecKit’s institutional backing worth considering; teams that prefer community-driven tools may prefer BMAD.

Can I use GitHub SpecKit without GitHub Copilot?

Yes. SpecKit is agent-agnostic by design. GitHub Copilot is the primary showcase agent for Microsoft-stack teams but is not required. Teams already using Claude Code or Gemini CLI can adopt SpecKit without changing their AI tooling.

What is Microsoft’s Agentic-Agile philosophy and how does SpecKit fit into it?

Agentic-Agile is Microsoft’s framework for applying Agile engineering discipline to human-agent development teams. SpecKit is the tooling implementation — the spec and constraints layer. The companion Agentic-Agile Template (microsoft/agentic-agile-template) provides the project scaffold alongside SpecKit’s governance layer.

Does GitHub SpecKit support security and compliance governance?

Not natively. GitHub Issue #2362 documents community demand for a security governance preset covering threat modelling, supply-chain transparency, and ASVS (OWASP Application Security Verification Standard) verification. Teams can embed security requirements manually in the constitution.md, but there’s no official preset yet.

How does GitHub SpecKit handle upgrades without overwriting customisations?

This is a known limitation. SpecKit upgrades can overwrite customisation files — documented in v0.1.6 upgrade warnings. Teams with opinionated or extended SpecKit configurations need an abstraction layer between their org-specific customisations and the upgrade process. Factor this into your adoption plan.

What is the difference between GitHub SpecKit and OpenSpec?

SpecKit is governance permanence — a constitution established before coding that all agents must honour; strongest on greenfield projects. OpenSpec (Fission AI) is a continuity layer — living specs synchronised with evolving code; designed for brownfield codebases. The choice depends on codebase context. See the full framework landscape for the broader comparison.

Is GitHub SpecKit free to use?

Yes. SpecKit is MIT-licensed and free for commercial use with no paid enterprise tier. The specify-cli is open source. The Microsoft Learn training curriculum is free. Pricing only applies to the AI agent you use alongside it — GitHub Copilot has its own subscription.

Where can I find the official GitHub SpecKit documentation?

The Microsoft Learn module is the most structured starting point: 13 units covering constitution creation, the four-phase workflow, Azure DevOps integration, CI/CD, and multi-agent collaboration.

How does the SpecKit Tasks phase differ from Kiro’s approach?

SpecKit’s four-phase model adds an explicit Tasks phase between Plan and Implement — decomposing the technical design into discrete, executable agent tasks before any agent begins coding. This creates a human review checkpoint that Kiro’s three-phase model omits. For teams that need human approval gates before agent execution, that extra phase is a governance advantage.

AWS Kiro — Amazon’s Spec-First Bet on Agentic Development

Amazon launched Kiro internationally on May 7, 2026 — not as a feature update, but as a ground-up replacement for Amazon Q Developer. The timing wasn’t accidental. Spec-driven development is Amazon’s declared answer to a specific and well-documented failure mode: agentic coding without formal structure produces code that works locally but drifts from your architecture, misses edge cases, and falls apart under production load. Kiro is Amazon’s attempt to make that failure mode structurally harder to reach. In this article we walk through how it actually works — the three-phase spec workflow, Agent Hooks, and Steering Files — and how it stacks up against Cursor for teams thinking about adoption.

What is AWS Kiro and what problem does it solve?

AWS Kiro is an agentic IDE built on Code OSS, the open-source base of VS Code. Your existing VS Code extensions, keybindings, and Open VSX-compatible plugins carry across without reinstallation. If your team lives in VS Code, day one will feel familiar.

What Kiro actually changes is the development contract. Where Q Developer and Cursor’s chat mode let you generate code from natural language prompts, Kiro requires structured specifications before any code generation can begin. The IDE makes it structurally difficult to skip that step. The whole workflow is built around the spec, not around the prompt.

Kiro is the direct replacement for Amazon Q Developer. New Q Developer account creation was blocked from May 15, 2026; IDE plugins reach end of support April 30, 2027. From May 29, 2026, the latest coding models — including Claude Opus 4.7 — are available exclusively on Kiro. Teams staying on Q Developer are locked to an older model stack.

Model routing is automatic. Kiro’s auto-router selects across Claude Sonnet, Qwen, DeepSeek, GLM, and MiniMax per task, running on Amazon Bedrock as the unified model plane. You can pin a specific model if consistent behaviour matters more than cost optimisation. Kiro is free during preview — 50 credits per month, no AWS account required for general use.

How does Kiro’s three-phase spec workflow operate?

Kiro requires three sequential documents before any agent writes a single line of code: requirements.md, design.md, and tasks.md. Together these form the spec stack, and they’re included by default in every session.

Phase one: requirements.md. You describe the feature in natural language. Kiro generates structured user stories using EARS Notation — Easy Approach to Requirements Syntax — a formal standard that structures acceptance criteria using consistent WHEN/IF/WHILE sentence patterns. Happy paths, edge cases, failure modes. Precise enough for an agent to implement without guessing your intent.

Phase two: design.md. Once you’ve approved the requirements, Kiro generates a technical design document — architecture decisions, component boundaries, data models, API contracts, sequence diagrams. This is the gate where you’re reviewing architecture coherence, not syntax. The agent cannot proceed until you explicitly approve this document.

Phase three: tasks.md. Kiro breaks the approved design into a numbered list of atomic implementation tasks. Each task maps to a discrete code change that can be independently reviewed and reverted. Agent execution is not fully autonomous — you decide when each individual task runs.
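To make the WHEN/IF/WHILE sentence patterns from phase one concrete, here is a minimal sketch that checks whether an acceptance criterion follows one of three EARS patterns. The regular expressions, the fixed "THE SYSTEM SHALL" wording, and the `classify` function are illustrative assumptions; they are not Kiro's parser and do not reproduce its exact requirements.md output.

```python
from __future__ import annotations

import re

# Three EARS sentence patterns, keyed by a descriptive name.
# These regexes are an approximation for illustration only.
EARS_PATTERNS = {
    "event-driven": re.compile(
        r"^WHEN (?P<trigger>.+?),? THE SYSTEM SHALL (?P<response>.+)$", re.I
    ),
    "unwanted-behaviour": re.compile(
        r"^IF (?P<trigger>.+?),? THEN THE SYSTEM SHALL (?P<response>.+)$", re.I
    ),
    "state-driven": re.compile(
        r"^WHILE (?P<state>.+?),? THE SYSTEM SHALL (?P<response>.+)$", re.I
    ),
}


def classify(criterion: str) -> str | None:
    """Return the name of the EARS pattern a criterion follows, or None."""
    text = criterion.strip().rstrip(".")
    for name, pattern in EARS_PATTERNS.items():
        if pattern.match(text):
            return name
    return None
```

A criterion such as "WHEN a user submits an empty form, THE SYSTEM SHALL display a validation error" classifies as event-driven, while a loose statement like "The page should load fast" matches nothing. That gap is the ambiguity the notation is meant to remove.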

The spec stack goes into version control alongside the code. That gives you a human-readable audit trail of intent, not just implementation — which becomes genuinely useful when a change six months later needs to be understood in context.

One thing worth flagging: a Kiro technical review described the required level of specification detail as “over-specification”. The counterargument from practitioners is that this is exactly the right level of detail for an agent to produce reliable output. Your engineers will need to internalise that the spec authoring phase is the work — not overhead that precedes the work.

What are Agent Hooks and what do they automate?

Agent Hooks are Kiro’s event-driven automation system. Think GitHub Actions or AWS Lambda triggers, but scoped to IDE-level file events rather than CI/CD pipeline events. A Hook fires when a specified file event occurs — save, create, delete, rename — within a defined path pattern. The trigger is local to the developer’s working session, not to the repository pipeline.

In practice, the most common automation patterns look like this: run unit tests for any file touched by an agent run, regenerate API client stubs whenever a service spec changes, and cascade spec edits through dependent consumer specs — the pattern that Digital Applied describes as most reliably justifying the migration cost.

Hooks are defined in YAML configuration files stored in the project repository. They live in the repo, not in personal config, which means they’re versioned, reviewed, and shared across the team like any other infrastructure code.
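The event-plus-path-pattern model can be sketched as a small dispatcher. This is not Kiro's hook format or API: the `Hook` dataclass, the `dispatch` function, and the example patterns are hypothetical, and the matching uses Python's standard `fnmatch`, whose `*` also matches path separators.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Callable


@dataclass
class Hook:
    """A hypothetical hook: an event type, a path pattern, and an action."""

    event: str                     # e.g. "save", "create", "delete", "rename"
    pattern: str                   # glob-style pattern such as "src/*.py"
    action: Callable[[str], str]   # receives the path, returns a description


def dispatch(hooks: list[Hook], event: str, path: str) -> list[str]:
    """Run every hook whose event and path pattern both match."""
    return [
        hook.action(path)
        for hook in hooks
        if hook.event == event and fnmatch(path, hook.pattern)
    ]
```

With one hook registered for saves under `src/*.py` and another for saves under `specs/*.yaml`, saving `src/billing.py` runs only the first action, and deleting the same file runs nothing.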

The failure mode Hooks address is documentation drift and stale tests: the classic “I forgot to update the tests” problem that compounds over time. RedMonk notes that Agent Hooks let you offload routine maintenance tasks from the main coding session, keeping the context window focused on the work at hand.

Hooks and Steering Files are the two sides of Kiro’s persistent project infrastructure. Hooks handle automation. Steering Files handle governance. Both live in the repo, both are versioned, and both operate independently of any individual coding session.

How do Steering Files enforce compliance standards?

Steering Files are Markdown documents stored in the project repository that define standing constraints Kiro must follow during code generation: security standards, naming conventions, error handling patterns, regulatory requirements. They are automatically loaded into every agent interaction — unlike a one-time system prompt, they cannot be accidentally omitted.

That distinction — persistent context versus one-time prompt — is the architectural point worth understanding. Per-developer prompt discipline is fragile at team scale. One developer might remember to include security constraints in their prompts; another won’t. A single reviewed Steering File propagates those constraints across every team member’s code generation sessions without requiring individual discipline.

Typical Steering File contents include security controls (no hardcoded secrets, input validation requirements), coding standards (naming conventions, error handling patterns), and regulatory constraints (data handling rules for GDPR or HIPAA contexts). In an AWS mainframe modernisation case study, Steering Files were used to generate microservice specifications that conformed to organisational standards without per-prompt reminders.
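The difference between persistent context and a one-time prompt can be sketched as follows. The assumption here is a directory of Markdown constraint files that is read on every request; the function names, the directory argument, and the separator are hypothetical and do not describe how Kiro loads Steering Files internally.

```python
import tempfile
from pathlib import Path


def load_steering(steering_dir: Path) -> str:
    """Concatenate every Markdown file in steering_dir, in a stable order."""
    parts = [
        path.read_text(encoding="utf-8").strip()
        for path in sorted(steering_dir.glob("*.md"))
    ]
    return "\n\n".join(parts)


def build_prompt(steering_dir: Path, request: str) -> str:
    """Prepend the standing constraints to a request on every call.

    The constraints are read from the repository each time, so no developer
    has to remember to paste them into a prompt.
    """
    return f"{load_steering(steering_dir)}\n\n---\n\n{request}"


if __name__ == "__main__":
    # Demo with a throwaway directory; the file contents are made up.
    with tempfile.TemporaryDirectory() as tmp:
        steering = Path(tmp)
        (steering / "security.md").write_text("No hardcoded secrets.\n", encoding="utf-8")
        print(build_prompt(steering, "Add a password reset endpoint"))
```

The design point is that the constraints travel with the repository rather than with the individual, which is why a single reviewed file can replace per-developer prompt discipline.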

Steering Files complement the spec stack rather than overlap with it. requirements.md defines what to build. design.md defines how to build it. Steering Files define the constraints governing every build — regardless of feature. For teams in regulated industries, Kiro GovCloud deployments can include Steering Files pre-configured with compliance constraints. Full governance treatment is in our regulated industry analysis.

Why did Amazon replace Q Developer with Kiro?

Amazon Q Developer was a conversational coding assistant — chat-first, prompt-driven, no spec mandate. You describe what you want, the assistant generates code, you review and iterate. Amazon’s rationale for replacing it is that developers need AI that understands the entire project — architecture, requirements, tests, and intent. Kiro is that environment. These are two different product classes — agentic IDE versus coding assistant — with fundamentally different design contracts: spec-first versus prompt-first.

The context behind that strategic shift matters. The Register reported on April 29, 2026 that Amazon had instructed all engineers to review AI-generated output following a service disruption traced to an agentic coding session. Amazon classified the events as user errors. The Q Developer EOL announcement followed one day later — and that combination of a company-wide review mandate and an immediate product retirement tells you the concern extended well beyond any single incident. The full incident analysis is in our Amazon internal probe article.

The strategic reasoning, independent of the incident, is fairly straightforward: if the root cause of agentic coding failures is insufficient specification rather than insufficient AI capability, then enforcing the spec at the tooling level is a more reliable fix than asking teams to self-enforce it. Kiro embeds that enforcement in the IDE — and in doing so, it represents one of the clearest vendor implementations of spec-driven development as an engineering discipline.

For teams currently on Q Developer: the migration is not a tool swap but a workflow adoption. VS Code users get a profile import on first startup. JetBrains users get a CLI-based integration via ACP — not a native plugin, a workaround. Visual Studio and Eclipse users have no native Kiro plugin at all. The editor constraint is a real migration consideration. Don’t treat it as a footnote.

How does Kiro compare to Cursor for engineering team adoption?

Cursor, built by Anysphere, is the most widely adopted AI IDE in 2026. Its Plan Mode creates a detailed implementation plan before any code is written. Project rules add persistent context. The two tools target overlapping problems with very different philosophies: Cursor treats code as the primary artefact and makes AI assistance feel like an enhanced keyboard shortcut. Kiro inverts that order, treating the spec as source of truth and code as a build artefact.

The practical differences come down to four dimensions.

Spec mandate. Kiro requires requirements.md, design.md, and tasks.md before code generation. Cursor’s Plan Mode generates a plan but doesn’t enforce review gates or produce versioned spec artefacts. Cursor’s spec support is not native to its architecture the way Kiro’s is — no built-in spec lifecycle, no drift detection, no living-spec synchronisation.

Editor portability. Kiro is Code OSS-only. JetBrains works via ACP/CLI — a workaround, not a native experience. Visual Studio and Eclipse have no path at all. Cursor supports JetBrains and other integrations via extensions. For teams with JetBrains-heavy stacks, this needs assessing before you pilot Kiro, not during.

Vendor integration depth. Kiro integrates natively with AWS CodeCatalyst for source control, pull requests, and CI pipelines. IAM integration is built in. Cursor is cloud-agnostic. Teams on non-AWS infrastructure should weigh that integration depth as a dependency, not a feature list item.

Team adoption ramp. Moving a Cursor or VS Code team to Kiro is “less about tooling and more about adopting spec-first review culture, which is where most migrations succeed or stall.” Cursor’s gradual on-ramp is lower-friction for teams mid-migration from unstructured AI coding. Kiro’s spec mandate is higher-friction upfront, but once it’s internalised it produces more consistent output across larger teams.

Many engineering teams end up running Cursor for day-to-day development and Kiro for long-running AWS-native services — complementary rather than mutually exclusive. AWS-native teams on greenfield projects benefit from Kiro’s integration depth and spec enforcement. Multi-cloud or JetBrains-heavy teams will find Cursor’s portability more practical. See how Kiro fits in the broader SDD tool landscape and GitHub SpecKit as the primary open-source alternative for additional comparison points.

What does Kiro GovCloud mean for regulated industry teams?

AWS GovCloud (US) is an isolated AWS infrastructure region built to meet US government data residency, ITAR, and FedRAMP compliance requirements. Kiro GovCloud is available in both GovCloud (US-West) and GovCloud (US-East).

For teams that can’t use standard commercial cloud AI tooling — US federal contractors, defence-adjacent software shops, HealthTech and FinTech organisations with strict data sovereignty requirements — GovCloud deployment means AI-generated code and the associated spec artefacts never leave a compliant data boundary. Steering Files can be pre-configured with agency-specific or regulatory compliance constraints.

A few meaningful constraints apply relative to commercial Kiro. Auto model selection is disabled — Claude Sonnet 4.5 is the default. The VS Code plugin is not available; users access Kiro through the standalone IDE or CLI. And GovCloud is a US-region construct — it does not directly address EU data sovereignty. Teams under GDPR or EU AI Act obligations need to treat that as an open question. The full governance analysis — EU AI Act compliance matrix, audit trail architecture, and regulated industry adoption patterns — is in our dedicated governance article.

How do you evaluate Kiro for your organisation?

Start with infrastructure context, not the feature list. Kiro’s value proposition is strongest for teams already on AWS. CodeCatalyst integration, IAM, and GovCloud availability are meaningful differentiators for AWS-native teams — and largely irrelevant for everyone else. If you’re not already on AWS, the evaluation calculus shifts substantially.

Next, audit your editor ecosystem before you pilot. If more than a fifth of your engineering team is on JetBrains or other non-VS Code environments, Kiro’s editor constraint creates adoption friction that needs solving before rollout.

If the infrastructure and editor checks pass, design the pilot carefully. Pick one bounded greenfield feature — not a brownfield migration. Run one or two engineers through the full three-phase spec workflow for two to four weeks. The harder shift is cultural: specs become the primary review surface, which reshapes how teams plan, split tasks, and sign off on changes. Teams with strong RFC or ADR practices will adapt faster. Teams with no formal spec habit will need process change management alongside tool adoption.

Know your exit ramps before you start. If the pilot team finds the spec gates valuable but the VS Code constraint prohibitive, GitHub SpecKit — IDE-agnostic and open-source with 93,000+ stars as of May 2026 — provides a compatible workflow without the editor lock-in. If the spec mandate feels heavy for your team’s current maturity, Cursor’s Plan Mode is the lowest-friction entry point to spec-first thinking. Both are legitimate paths rather than failures to adopt Kiro.

The relevant question is whether your team needs enforced spec gates or has the discipline to self-enforce. The answer to that shapes which tool fits — and which cultural investment is actually required.

Frequently Asked Questions

What is the difference between Kiro and Amazon Q Developer?

Q Developer was a conversational coding assistant — chat-first, prompt-driven, no spec mandate. Kiro is an agentic IDE that requires formal spec documents before generating code. Different product classes, not version iterations. New Q Developer signups were blocked from May 15, 2026; IDE plugins reach end of support April 30, 2027.

Does Kiro work with JetBrains IDEs?

Kiro works in JetBrains through JetBrains AI Assistant via the Agent Client Protocol (ACP) — a CLI-based integration, not a native plugin. Visual Studio and Eclipse have no path at all; users must switch to the standalone Kiro IDE or Kiro CLI. Assess that constraint before piloting.

What is EARS Notation and why does Kiro use it?

EARS — Easy Approach to Requirements Syntax — is a formal requirements-writing standard that structures user stories using consistent WHEN/IF/WHILE sentence patterns, ensuring acceptance criteria cover happy paths, edge cases, and failure modes. Kiro uses EARS to generate requirements.md content that is precise enough for agents to implement without ambiguity.

Can I use Kiro with models other than Claude?

Kiro’s default auto-router combines Claude Sonnet, Qwen, DeepSeek, GLM, and MiniMax, selecting the optimal model per task. You can pin a specific model for consistent behaviour. In AWS GovCloud (US), auto model selection is disabled; Claude Sonnet 4.5 is the default.

How do Agent Hooks differ from GitHub Actions?

GitHub Actions are CI/CD pipeline triggers that fire on repository events — push, pull request, merge. Agent Hooks are IDE-level triggers that fire on file-system events within the developer’s local working session — save, create, edit. They operate at different points in the development cycle and complement rather than replace each other.

How long does the three-phase spec workflow take compared to just starting to code?

No controlled time comparison exists. The overhead is front-loaded — requirements.md and design.md authoring adds time per feature. Proponents claim this is recovered in reduced review cycles and lower debugging overhead. Measure it in your own pilot rather than assuming the trade-off works in your context.

Is Kiro suitable for brownfield codebases?

Kiro is designed primarily around greenfield feature development. For brownfield migration, OpenSpec by Fission AI is explicitly brownfield-first, using delta markers (ADDED/MODIFIED/REMOVED) to track changes in existing code. Kiro can be applied to brownfield features but offers no native brownfield tooling.
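The delta-marker idea can be illustrated with a simple comparison of two spec versions. This sketch assumes a spec can be reduced to a mapping from requirement ID to requirement text; the function name and the `REQ-n` identifiers are hypothetical, and the code does not reproduce OpenSpec's actual file format.

```python
def spec_delta(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify requirement IDs as ADDED, MODIFIED, or REMOVED between versions."""
    return {
        # Present only in the new version.
        "ADDED": sorted(key for key in new if key not in old),
        # Present in both, but the requirement text changed.
        "MODIFIED": sorted(key for key in new if key in old and new[key] != old[key]),
        # Present only in the old version.
        "REMOVED": sorted(key for key in old if key not in new),
    }
```

Tracking changes this way keeps the review focused on what differs from existing behaviour, instead of asking reviewers to re-read a full spec for every change.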

What is a Steering File and how is it different from a requirements.md file?

A Steering File defines standing constraints that apply to every code generation session — security standards, naming conventions, regulatory requirements. requirements.md defines what a specific feature should do. They operate at different abstraction levels: Steering Files are persistent project-wide governance, requirements.md is per-feature specification.

Does Kiro integrate with AWS CodeCatalyst and existing AWS pipelines?

Kiro integrates natively with AWS CodeCatalyst for source control, pull requests, and CI pipelines. Hooks can trigger CodeCatalyst workflows without custom glue. IAM integration is built in. Meaningful advantage for AWS-native teams; meaningful lock-in for multi-cloud organisations.

How does Kiro handle security scanning — is it built in or via Agent Hooks?

Via Agent Hooks, not a built-in static analysis engine. A Hook can trigger a security scan agent when new dependencies are added, but it is not enabled by default: teams configure it as part of project setup.

What happened to teams using Amazon Q Developer?

Q Developer IDE plugins reach end of support April 30, 2027; existing customers retain access through the transition window. Amazon’s guidance is to migrate to Kiro, which is a substantive workflow change rather than a tool swap.

How does Kiro compare to GitHub Copilot?

GitHub Copilot is an AI coding assistant — autocomplete and chat — with no spec mandate. GitHub’s spec-first tooling is GitHub SpecKit, a separate open-source framework. The closer comparison is Kiro vs. SpecKit or Kiro vs. Cursor. Comparing Kiro to Copilot conflates two different product categories.

From Vibe to Spec — Why AI Coding Is Growing Up

AI-assisted development has a production problem. Three failure modes keep showing up across independent data sources: context decay, hallucinated architecture, and quality debt accumulation. These aren’t edge cases or operator errors. They’re what you get when you run a workflow at production scale that was never designed to run there.

In April 2026 an Amazon AI coding incident made this concrete at enterprise scale — a 13-hour service disruption, officially classified as user error, but read by the engineering community as something more structural. That incident has accelerated a shift that was already underway: from vibe coding to spec-driven development.

This is a diagnostic piece, not a tutorial. The broader strategic context is in the pillar. What follows is an evidence-based walkthrough of why the shift is happening, what it means technically, and whether it’s a genuine change or just a rebranding of ideas we’ve been cycling through since TDD.

What is vibe coding and why did engineers embrace it?

Andrej Karpathy coined “vibe coding” in February 2025 to describe a specific way of working: write natural language descriptions of what you want, let the AI generate the implementation, accept the output if the application runs, and move on. His original description was explicit about the posture — “I ‘Accept All’ always, I don’t read the diffs anymore.” He also framed the scope honestly: suited for throwaway weekend projects, not production systems.

The productivity case was real. Nearly 44% of developers had adopted AI coding tools by 2023, with some projects reporting productivity gains of up to 55%. Platforms like Cursor, GitHub Copilot, Replit, Lovable, and Bolt proliferated faster than anyone expected. The barrier to prototype dropped fast, and people who’d never thought of themselves as developers started shipping real products.

Vibe coding works in contexts where rapid exploration matters more than long-term maintainability: scripts, greenfield experiments, single-session throwaway builds. The problem is assuming a workflow optimised for small-scope single-session work will hold up when you scale it. At a certain threshold of session length, codebase size, and production consequence, the structural assumptions collapse — and that’s what the next section is about.

For spec-driven development and what it means for engineering leaders in production contexts, the pillar article covers the broader picture.

Where does vibe coding break down in production?

The three failure modes aren’t random. They’re built into the workflow’s architecture.

Context decay happens because every AI session starts with an empty context window. Architectural decisions made in session one are invisible in session ten. The codebase accumulates inconsistencies the AI cannot see.

Hallucinated architecture is the patchwork problem: the AI generates structural decisions — module boundaries, data models, inter-service dependencies — that were never requested and often don’t fit together. Steve Tarcza, director of Amazon Stores, saw this directly: AI was “doing work you didn’t ask it to, going further than you wanted.”

Quality debt accumulation is the formal name for the cost that builds up behind the first two failure modes. The arXiv flow–debt tradeoff paper documents the mechanism: vibe coding optimises for rapid flow while accumulating debt in the form of security vulnerabilities, missing documentation, and architectural fragility. Apiiro found a 322% spike in privilege escalation flaws in repositories with high AI code contribution. More than 170 applications built on the Lovable platform had misconfigurations exposing sensitive data.

The failure modes compound. Context decay causes hallucinated architecture, which accelerates quality debt. By the time the problem is visible, remediation is expensive.

What is context decay and why does it matter?

Context decay is what happens when an AI agent’s working memory resets between sessions, discarding the architectural decisions accumulated so far. Within a single session, that’s fine. Over a multi-session production project, the AI in session ten has no idea why module A was designed the way it was in session two.

LLM-based agentic coding assistants lack persistent memory: they lose coherence across sessions, forget project conventions, and repeat known mistakes. This is a constraint on how LLMs work, not a model quality problem.

The result is a codebase that’s internally inconsistent in ways that are hard to see from any one vantage point. The InnoGames engineering blog describes what it looks like in practice: the model “starts hallucinating issues that don’t exist, initiating unrequested refactorings, or losing track of what it was actually asked to do. This is not a theoretical concern — it happens reliably and predictably once the context gets crowded enough.”

Debugging becomes archaeology. Refactoring becomes guesswork.

SDD’s response is structural: persistent specification documents committed to the repository serve as the AI’s external memory. Any agent reads them at session start. A study of 283 development sessions building a 108,000-line distributed system found that agents with a “hot-memory constitution” — always-loaded specification documents — repeated far fewer known mistakes and maintained coherence across sessions.

That context decay problem is also what makes the Amazon incident worth understanding.

What did the Amazon incident reveal about agentic coding at scale?

In April 2026, an AI coding incident at Amazon produced a 13-hour service disruption. Amazon officially classified it as user error — a classification that matters and shouldn’t be dismissed.

Aragon Research’s framing adds the structural dimension: “The decision by an AI to delete a production environment reflects a lack of contextual awareness — the agent understands the technical goal but lacks the business judgement to weigh the cost of a thirteen-hour outage against the benefits of a fresh environment.” The agent had the capability. It lacked the contextual constraints that human engineers carry implicitly. Both frames can be true simultaneously.

The permission model is what made it possible. When an AI agent inherits senior engineer permissions, the speed of execution removes the human buffer that typically stops destructive commands from being finalised. Amazon’s response was governance escalation alongside continued adoption: Steve Tarcza stated “nothing ships without a human checking it first” — while Amazon simultaneously mandated 80% AI tool adoption.
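A minimal sketch of that kind of human gate, assuming a deny-by-default policy: the agent may run a short allowlist of read-only commands on its own, and everything else waits for a person. The allowlist, the function names, and the `approve` callback are all hypothetical; nothing here describes Amazon’s actual tooling.

```python
import shlex
from typing import Callable

# Hypothetical allowlist of read-only commands. Anything not listed is
# treated as mutating and needs explicit human approval.
READ_ONLY = {("ls",), ("cat",), ("git", "status"), ("git", "diff"), ("git", "log")}

# Characters that let one shell line chain or redirect into another command.
SHELL_METACHARACTERS = set(";|&><`$\n")


def is_read_only(command: str) -> bool:
    """Return True only for a single command whose prefix is on the allowlist."""
    if SHELL_METACHARACTERS & set(command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    if not tokens:
        return False
    return (tokens[0],) in READ_ONLY or tuple(tokens[:2]) in READ_ONLY


def run_with_gate(command: str,
                  execute: Callable[[str], str],
                  approve: Callable[[str], bool]) -> str:
    """Execute read-only commands directly; ask a human before anything else."""
    if is_read_only(command) or approve(command):
        return execute(command)
    raise PermissionError(f"blocked without approval: {command}")
```

The design choice is that the gate fails closed: an unrecognised or chained command is never run on the agent’s own authority, so the speed of execution cannot bypass the human buffer. A prefix allowlist is deliberately coarse and would need tightening for real use.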

The full structural analysis is in Amazon’s documented AI coding incident.

What is spec-driven development and how does it respond to these failures?

Spec-driven development (SDD) is a software development approach where structured, human-reviewed specifications are written and committed to the repository before any code is generated. The spec is the persistent source of truth across sessions and agents.

Prezi engineers describe the shift directly: “With spec-driven development, your code merely becomes the output of your work. It’s like the rendered MP4 file in a video project. You want a change? You edit the project, not the pixels in the rendered video.”

The core architecture is a “constitution” — typically Mission, Tech Stack, and Roadmap documents committed to the repository. Any agent starting a new session reads the constitution first. That’s the mechanism that addresses context decay: external memory, version-controlled, accessible in any session.
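As a rough illustration of the constitution mechanism, the sketch below loads the three documents from the repository and prefixes them to every task handed to an agent. The file names and function names are hypothetical; the article names the documents but not their paths, and real tools will wire this up differently.

```python
from pathlib import Path

# Hypothetical file names for the Mission, Tech Stack, and Roadmap documents.
CONSTITUTION_FILES = ("MISSION.md", "TECH_STACK.md", "ROADMAP.md")


def load_constitution(spec_dir: Path) -> str:
    """Concatenate the constitution documents in a fixed order.

    Raises FileNotFoundError if a document is missing, so a session
    never starts with a partial constitution.
    """
    sections = []
    for name in CONSTITUTION_FILES:
        path = spec_dir / name
        if not path.is_file():
            raise FileNotFoundError(f"constitution document missing: {path}")
        text = path.read_text(encoding="utf-8").strip()
        sections.append(f"## {name}\n{text}")
    return "\n\n".join(sections)


def build_session_prompt(spec_dir: Path, task: str) -> str:
    """Prefix the task with the constitution so a fresh session sees prior decisions."""
    return f"{load_constitution(spec_dir)}\n\n## TASK\n{task.strip()}"
```

Because the documents live in version control, a change to an architectural decision is a reviewed commit rather than something a later session has to rediscover.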

Against hallucinated architecture, SDD constrains the solution space before generation begins. Against quality debt, conformance tests paired with the spec mean the AI can’t silently expand scope.
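One way to make silent scope expansion checkable, sketched under the assumption that the spec lists the path patterns a task is allowed to touch: compare the files an agent changed against those patterns and flag anything outside them. The function name and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def out_of_scope(changed_files: Iterable[str], allowed_patterns: Iterable[str]) -> List[str]:
    """Return the changed files that match none of the spec's allowed patterns.

    Note that fnmatch-style "*" also matches "/", so "src/auth/*" covers
    nested directories under src/auth/ as well.
    """
    patterns = list(allowed_patterns)
    return sorted(
        path for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    )


# Hypothetical task scoped to the auth module and its tests.
changed = ["src/auth/login.py", "src/billing/invoice.py", "tests/auth/test_login.py"]
print(out_of_scope(changed, ["src/auth/*", "tests/auth/*"]))  # ['src/billing/invoice.py']
```

A check like this does not judge whether the change is correct; it only surfaces work that went further than the spec asked, which is the point at which a human should look.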

The honest position: SDD reduces hallucination; it doesn’t eliminate it. Asked whether spec-driven development solves hallucination and prompt injection, Tarcza was direct: “No.” He added: “It reduces it at best. And even then, there are cases where it still does go beyond the specification.”

Karpathy’s proposed successor term — “agentic engineering” — describes exactly this model: developers orchestrate AI agents against specifications with human oversight, rather than prompting and accepting. The coiner of “vibe coding” has effectively moved past the pure form.

Commercial implementations include AWS Kiro, Amazon’s spec-first agentic IDE, GitHub Spec Kit, and Intent from Augment Code. The full 30-plus framework landscape is covered in its own article.

Is this a genuine paradigm shift or just a rebranding of existing practices?

The sceptical position is worth taking seriously. TDD was supposed to solve the spec-before-code problem around 2000. BDD extended it in 2006. MDD tried to make models primary through the 1990s. Each generation stalled. To a sceptical reader, SDD looks like the same idea with new branding.

RedMonk analyst Kate Holterhoff’s “Adventures in Vibe Coding” contextualises this as a maturation phase rather than a revolution. That’s the useful framing — pattern recognition from people who’ve watched several of these cycles.

The load-bearing argument for why this cycle is different sits with MDD’s failure. Model-driven development collapsed because spec languages were too rigid and code generators couldn’t handle complexity. LLMs change both sides. As Thoughtworks observes: “Given LLMs’ ability to manipulate text, it’s unsurprising that specs may play so nicely with the growth of AI in software.” The constraint that defeated previous generations has been removed.

Prezi engineers describe SDD as the synthesis that LLM availability finally makes viable. The full treatment is in A Sufficiently Detailed Spec Is Code — the community principle behind the paradigm shift.

The maturation signal that distinguishes this cycle is not coming from a vendor keynote. Amazon Stores mandating human review of all AI output, post-incident, is a different kind of signal than a framework announcement. The evidence is still early, but the source matters.

Where does this leave engineering teams right now?

Vibe coding isn’t over. The arXiv practitioner guidelines are specific about its scope: small, self-contained, low-consequence tasks where rapid exploration matters more than long-term maintainability. The obituaries are premature. The scope boundary is what matters.

The decision you face right now is not whether to adopt SDD. It’s to identify which of your current AI coding workflows are already producing the conditions for context decay, hallucinated architecture, or quality debt — and to address those first.

The numbers make the gap concrete. Tarcza’s engineers spend less than 30% of their time on core engineering — the rest goes to oversight and governance. The Sonar 2026 State of Code survey found that 96% of developers don’t fully trust AI-generated code, yet only 48% always check it before committing. That gap is where quality debt accumulates.

Your practical starting point: prioritise specs where the cost of context decay is highest — long-running projects, multi-session workflows, production systems with security or data consequences. Add governance gates proportional to blast radius. Treat tooling as secondary to process.

The direction of travel, in Karpathy’s framing, is agentic engineering: you as the orchestrator of AI agents working against specifications, with human oversight at the gates that matter.

This article’s scope ends at the diagnostic. The governance and implementation action items are in Spec-Driven Development in Regulated Industries — governance, compliance, and audit trails. The tooling evaluation is in the 30-plus framework landscape. The question of what comes next is answered across the cluster.

FAQ

What is the difference between vibe coding and agentic engineering?

Vibe coding prioritises speed over structure: prompt, accept the output, iterate without reading every line. Agentic engineering keeps AI-driven implementation but adds orchestration — you manage agents against specifications with human oversight. The key distinction is governance. “Vibe coding is dead” is a less accurate read than “vibe coding has a defined scope.”

Why is context decay specifically a production problem rather than a development problem?

In development, context decay produces inconsistency that slows future work but doesn’t immediately cause failures. In production, the decay accumulates architectural decisions that no single review can fully audit. The AI in session ten genuinely cannot know why the auth module was designed in session two. Human reviewers face the same opacity when auditing a vibe-coded codebase under incident conditions.

Does spec-driven development actually fix hallucinations, or just reduce them?

It reduces them. Steve Tarcza at Amazon Stores is explicit that SDD reduces hallucination at best — there are still cases where the AI goes beyond the specification. A spec narrows the solution space before generation, but it can’t prevent all hallucinated decisions. SDD makes hallucinations more detectable and bounded — that’s the honest expectation to set.

What is the flow–debt tradeoff in vibe coding?

The arXiv paper “Vibe Coding in Practice: Flow, Technical Debt, and Guidelines for Sustainable Use” provides the formal framing. “Flow” is the rapid, low-friction development experience. “Debt” is the accumulation of architectural inconsistencies, security vulnerabilities, and testing gaps. The tradeoff is structural: the workflow optimises for flow by skipping the specification work that prevents debt. Short-term productivity gains can be followed by maintenance costs that erode them.

Was the April 2026 Amazon incident really caused by vibe coding?

Amazon officially classified it as user error, and that classification shouldn’t be dismissed. What the engineering community observed was a governance and permission model failure — an AI agent operating with broad authorisation made decisions that required business judgement it didn’t have. Both frames can be true simultaneously: user error in the permissioning, and a structural condition that made the consequences possible at that scale.

When is vibe coding still appropriate and when does SDD make more sense?

Vibe coding works for small scope, self-contained tasks, throwaway scripts, and single-session rapid prototyping with no production consequence. SDD is the right call when projects span multiple sessions, involve production systems, carry security or data consequences, or need to be maintained by others. The heuristic: if context decay would produce a codebase you can’t safely audit, vibe coding is not the right workflow.

How does spec-driven development relate to TDD, BDD, and MDD?

SDD is a direct descendant. InfoQ frames SDD as a “fifth-generation programming shift” with TDD, BDD, and MDD as predecessors. The key difference is that SDD is designed for AI agents as implementers: it uses natural language specifications that LLMs can interpret rather than executable tests in code. SDD’s spec layer is upstream of TDD’s test layer. They’re complementary, not competing.

What is a “constitution” in spec-driven development?

The arXiv codified context paper describes a “hot-memory constitution” as the top-level specification set: always-loaded conventions committed to the repository. It typically consists of Mission, Tech Stack, and Roadmap documents. Any agent starting a new session reads the constitution first. The architectural context lives in the repository rather than in the AI’s context window, accessible to any agent in any session.

What did Prezi engineers mean by calling SDD the “culmination of TDD, BDD, and MDD”?

Prezi engineers described SDD as synthesising the core insight from each predecessor — TDD’s executable specs, BDD’s natural language bridge, MDD’s model-as-primary-artifact. The synthesis is viable now because LLMs can interpret natural language specifications, removing the constraint that defeated MDD’s rigid spec languages. For engineers who know TDD and BDD, SDD is less a departure than a completion.

Are there companies using spec-driven development at scale in production right now?

Yes. Amazon Stores (StoreGen), led by Steve Tarcza, is the most detailed public case study — mandatory human-approval gates on all mutating AI actions, structured specifications before any AI-generated code goes to production. The Amazon Stores case emerged from governance necessity post-incident, not from framework advocacy. JetBrains and DeepLearning.AI jointly offer a “Spec-Driven Development with Coding Agents” course — practitioner training at this scale suggests adoption beyond early experimentation.

Does RedMonk think spec-driven development is just hype?

RedMonk analyst Rachel Stephens’ “Spec vs Vibes” contextualises SDD as the organised engineering response to the vibe coding failure pattern, consistent with how mature engineering practices have emerged historically. Kate Holterhoff’s “Adventures in Vibe Coding” frames it as a maturation phase, not a revolution. RedMonk’s value is vendor-neutrality and a track record of distinguishing genuine structural shifts from marketing cycles.

The Model Release Treadmill — What Accelerating AI Releases Mean for Enterprise Deployments

Between February 5 and May 5, 2026 (89 days), OpenAI shipped five distinct GPT-5.x variants. Claude Opus 4.6 opened that window. GPT-5.5 Instant closed it. Anthropic, Google, Alibaba, Xiaomi, and DeepSeek matched pace throughout. The Frontier Model Release Velocity Index (FMRVI), published by Digital Applied to track substantive frontier launches across labs, recorded more than 12 releases in Q1 2026, or roughly one per week.

That cadence is the model release treadmill. If your business runs AI in production, it changes how you plan, budget, and build. This article is an orientation hub for six cluster pieces: Five Models in Three Months, Multi-Model Month: April 2026, Deprecation Pressure, Benchmark Inflation, Lock In vs Keep Up, and Architecture That Survives Model Churn.

What is the AI model release treadmill and why does it matter for enterprises?

The model release treadmill describes the structural shift from biannual flagship releases to a weekly cadence. The FMRVI recorded a doubling of release velocity between Q4 2025 and Q1 2026. Each new model is a distinct artifact — different behaviours, prompting requirements, and its own deprecation clock — not a patched version of its predecessor. Evaluation, integration, and validation must now run in weeks. Model selection has become a continuous operational burden.

Deep dive: the GPT-5.x timeline.

What did the 89-day window between February and May 2026 actually look like?

In 89 days, OpenAI shipped GPT-5.3 Instant (March 3), GPT-5.4 (March 5), GPT-5.4 mini (March 18), GPT-5.5 — nicknamed “Spud” — (April 23), and GPT-5.5 Instant (May 5). Claude Opus 4.6 opened the sequence on February 5, making this cross-vendor from day one. The cadence compressed across the window: GPT-5 to GPT-5.2 took 18 weeks; GPT-5.4 to GPT-5.5 took just 7. Each event required enterprise teams to assess regression risk and decide before the next release arrived.
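The window and interval figures can be checked with simple date arithmetic. The sketch below assumes every date quoted above falls in 2026; the helper name is illustrative.

```python
from datetime import date


def days_between(start: date, end: date) -> int:
    """Number of days from start to end."""
    return (end - start).days


# Full window: Claude Opus 4.6 (February 5) to GPT-5.5 Instant (May 5).
print(days_between(date(2026, 2, 5), date(2026, 5, 5)))       # 89

# GPT-5.4 (March 5) to GPT-5.5 (April 23), expressed in weeks.
print(days_between(date(2026, 3, 5), date(2026, 4, 23)) / 7)  # 7.0
```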

Full timeline: the 89-day release sequence.

Is this just OpenAI, or are all the major labs running at this pace?

It is industry-wide. Anthropic’s Opus cadence compressed from 26 weeks (4 → 4.5) to 10 weeks for each subsequent pair. Google halved its Pro interval from 34 weeks to 13. Alibaba shipped seven Qwen variants in ten weeks — the highest-cadence shipper in Q1 2026 per FMRVI. April 2026 saw Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, DeepSeek V4, and Kimi K2 Thinking all arrive within the same month.

Cross-vendor convergence: evaluation window collapse in April 2026.

Why do faster model releases mean faster model deprecations?

Each new release creates pressure to retire its predecessor. Production lifespans that once ran 12–18 months are compressing toward six. OpenAI retired GPT-4o in February 2026 and GPT-5.1 in March. Anthropic announced Claude Sonnet 4 and Opus 4 would reach end-of-life June 15 — approximately 62 days’ notice. Gemini 2.0 Flash shuts down June 1. The faster the treadmill runs, the shorter the runway before a forced migration.

Deprecation and retirement are distinct: deprecation blocks new deployments but leaves existing calls working; retirement removes the endpoint entirely. After retirement, calls either fail or silently activate an unplanned fallback — routing to a different, potentially weaker model — without any deployment-level alert.
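A small guard can turn a silent fallback into a visible failure. The sketch below assumes the provider’s response includes a `model` field naming the model that actually served the request; that field, the exception class, and the function name are assumptions for illustration, so check your provider’s response format before relying on it.

```python
from typing import Iterable, Mapping


class UnexpectedModelError(RuntimeError):
    """Raised when a request was served by a model we did not ask for."""


def check_served_model(requested: str,
                       response: Mapping[str, object],
                       also_accept: Iterable[str] = ()) -> str:
    """Return the serving model id, or raise if it is missing or unexpected.

    Assumes the response carries a "model" field naming the model that
    handled the request. `also_accept` lists ids you treat as equivalent,
    such as a dated snapshot of the requested model.
    """
    served = response.get("model")
    if not isinstance(served, str):
        raise UnexpectedModelError("response does not report a serving model")
    if served != requested and served not in set(also_accept):
        raise UnexpectedModelError(f"requested {requested!r}, served by {served!r}")
    return served
```

Calling this on every response, and alerting when it raises, gives you the deployment-level signal that retirement by itself does not provide.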

Migration checklist and contractual guidance: the six-month shelf life of enterprise AI.

What does a model deprecation actually cost in engineering time?

The API config change — updating the model string in a .env file — takes an afternoon; the rest does not. Prompt revalidation, where every system prompt and few-shot example must be tested against the new model’s behaviour, takes weeks. Add regression testing, schema repair for changed output formats, and compliance re-validation in regulated industries, and the true migration cost is measured in engineering weeks. Most organisations have no AI continuity plan to absorb these costs predictably.

Full breakdown: what deprecation pressure costs in practice.

Why should you stop relying on AI benchmark scores to choose a model?

At today’s release velocity, published benchmarks are structurally outdated before they reach you. GPT-5.5’s safety card compared it to Claude Opus 4.5 — already superseded by Opus 4.7 before the card was published. Scores also suffer from contamination: evaluation data infiltrates training corpora. Kimi K2 self-reported 50 percent on the HLE benchmark; independent testing found 29.4 percent. Intelligence rank and usage rank diverge sharply — a leaderboard position is not a production decision.

Alternatives: when leaderboards mislead.

What is driving the open-source and Chinese lab acceleration — and why does it matter?

Chinese labs are setting pace, not catching up. Xiaomi’s MiMo V2 reached 21 percent of OpenRouter token volume in four months. DeepSeek-V3.2 benchmarks against GPT-5 and Gemini 3.0 Pro as an open-weight model with a reported training cost of $4.6 million. Kimi K2 Thinking was described as a “DeepSeek moment” for Moonshot AI. Open-weight releases add capable new models to your evaluation queue without adding managed deprecation policies: no endpoint removal, no formal notice period.

Independent testing found Kimi K2’s self-reported HLE score of 50 percent was actually 29.4 percent — a gap that illustrates the evaluation burden open-weight models place on your team.

Evaluation window implications: the cross-vendor convergence case study.

How does constant model selection affect your operational budget?

The FMRVI documents compression of enterprise evaluation cycles from six months (2024) to three months (2025) to four weeks (Q2 2026), with no floor yet observed. At four-week cadence, model evaluation becomes a standing operation. FMRVI recommends budgeting three to five percent of total AI spend for evaluation infrastructure and holding ten to fifteen percent as an uncommitted reserve — when a release resets the capability frontier mid-quarter, organisations that can reallocate budget in two weeks outperform those on rigid annual plans.
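To see what those percentages mean in absolute terms, the sketch below applies the quoted ranges to a total AI spend. The spend figure and the function name are hypothetical; only the percentage bands come from the text above.

```python
def treadmill_budget(total_ai_spend: float) -> dict:
    """Apply the ranges quoted above: 3-5% for evaluation, 10-15% held in reserve."""
    def band(low_pct: int, high_pct: int) -> tuple:
        return (total_ai_spend * low_pct / 100, total_ai_spend * high_pct / 100)

    return {
        "evaluation_infrastructure": band(3, 5),
        "uncommitted_reserve": band(10, 15),
    }


# Hypothetical annual AI spend of 1,000,000 in any currency.
budget = treadmill_budget(1_000_000)
print(budget["evaluation_infrastructure"])  # (30000.0, 50000.0)
print(budget["uncommitted_reserve"])        # (100000.0, 150000.0)
```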

Strategic budget framing: the enterprise model strategy dilemma.

What are the two strategic choices — and what does each actually cost?

The lock-in posture commits to a specific model version, reducing evaluation overhead but accumulating technical debt toward a forced migration. The keep-up posture chases releases, maximising capability access but converting engineering capacity into a standing upgrade operation. Neither is neutral. Lock-in itself comes in two forms with different remedies: provider lock-in is dependency on a single vendor, while model lock-in is tight coupling to one model’s specific output patterns. Staying put on either eventually forces a migration, and forces it with no abstraction layer in place.

Full framework: lock-in vs keep-up decision framework.

What architectural patterns reduce the cost of living on the treadmill?

A model abstraction layer — a thin interface between your application code and the provider API — is the most effective starting point, enabling model swaps without codebase refactoring. Pair it with model-agnostic prompt design, a continuous evaluation harness running against a canonical task set of 30–50 production-representative tasks, and staged rollout gates routing a percentage of traffic to a candidate model before full cutover.
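The abstraction layer and the staged rollout gate can be sketched together. In the example below, every provider is wrapped in an adapter with the same prompt-in, text-out shape, and a router sends a fixed percentage of requests to a candidate model. The class, field names, and hashing scheme are assumptions for illustration, not any vendor’s API.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Every provider adapter is reduced to the same shape: prompt in, text out.
Completion = Callable[[str], str]


@dataclass
class ModelRouter:
    adapters: Dict[str, Completion]   # model name -> adapter function
    stable: str                       # model serving most traffic
    candidate: Optional[str] = None   # model under evaluation, if any
    candidate_percent: int = 0        # share of traffic (0-100) sent to it

    def pick(self, request_id: str) -> str:
        """Choose a model deterministically, so a given request id always routes the same way."""
        if self.candidate is None or self.candidate_percent <= 0:
            return self.stable
        digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        return self.candidate if bucket < self.candidate_percent else self.stable

    def complete(self, request_id: str, prompt: str) -> str:
        """Application code calls this and never names a provider directly."""
        return self.adapters[self.pick(request_id)](prompt)


# Stand-in adapters; real ones would call a provider SDK.
router = ModelRouter(
    adapters={"current": lambda p: f"[current] {p}", "next": lambda p: f"[next] {p}"},
    stable="current",
    candidate="next",
    candidate_percent=10,
)
print(router.complete("req-42", "Summarise this ticket"))
```

Swapping models then means registering a new adapter and adjusting `candidate_percent`, not editing call sites. Hashing the request id rather than sampling randomly keeps routing reproducible, which makes regressions on the candidate easier to trace.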

Implementation guide: AI architecture that survives model churn.

The sections above frame each dimension of the treadmill problem; below is a navigation guide for where to enter based on your current situation.

Reading Guide: Where to start based on your current situation

Just received a deprecation notice — Deprecation Pressure for migration checklist and vendor questions.

Trying to understand the last 90 days — Five Models in Three Months for the timeline, then Multi-Model Month: April 2026 for the cross-vendor picture.

Presenting this to your board — Lock In vs Keep Up for strategic framing and board-ready language.

Building or refactoring production AI systems — go directly to Architecture That Survives Model Churn.

Making a model selection decision using benchmark scores — read Benchmark Inflation first.

Resource Hub: The Model Release Treadmill Library

Understanding the Treadmill — Evidence and Timeline

Five Models in Three Months: The 89-day release sequence and why the pace is structurally new.
Multi-Model Month — April 2026: Cross-vendor convergence and the collapse of evaluation windows.
Benchmark Inflation: Why published benchmarks are structurally outdated and what to use instead.

Managing the Operational Consequences

Deprecation Pressure: What model deprecation costs in practice, with migration checklist and contractual guidance.
Lock In vs Keep Up: Strategic framework for model stability versus capability currency.
Architecture That Survives Model Churn: Abstraction layers, continuous evaluation harnesses, and staged rollout patterns.

The treadmill is not going to slow down. The competitive dynamic between Western labs and Chinese labs is a feedback loop with no obvious floor. What changes is whether your architecture treats each new release as an emergency or a routine evaluation event.