Which SaaS Vendors Will Survive the AI Reckoning — A Framework for Evaluating Your Stack

In the first week of February 2026, more than $1 trillion in software market capitalisation vanished. The market called it the SaaSpocalypse. What most technology buyers still lack is a framework for reading it correctly.

Here is the thing: AI disruption does not hit your SaaS stack evenly. The February sell-off made no distinction — Procore and Zendesk got hammered equally, as if they face identical risks. They do not.

Bain’s four-scenario model, from its Technology Report 2025, gives you the map. It places every SaaS vendor on two diagnostic axes and produces four quadrants. This article makes it usable for mid-market CTOs, with real named vendors in every quadrant.

One data point first. Publicis Sapient reduced its traditional SaaS licences by approximately 50% — including Adobe — replacing them with generative AI tools. That is not a forecast. That is a company that ran this analysis and acted. The question is which of your vendors would survive the same scrutiny.


Why do some SaaS vendors look defensible while others look doomed — and what determines the difference?

Two things distinguish defensible from doomed. First: whether AI can now perform the complete workflows the vendor’s software orchestrates. Second: whether the vendor’s revenue depends on human seat counts, or on something harder to displace — proprietary data governance, compliance infrastructure, or domain integrations built over years.

Vendors whose revenue is built on human seats doing things AI can now do face direct compression. Agentic AI does not replace HubSpot — it replaces the five marketing coordinators who each needed a HubSpot seat to run campaigns. The platform may persist; the licence revenue shrinks. This is what people are calling the “starved not killed” dynamic.

The median EV/Revenue multiple for public SaaS companies stood at 5.1x in December 2025, down from 18–19x at the pandemic peak. The market has already priced in structural change — it has just done so indiscriminately. The Bain framework disaggregates which repricing is justified and which is noise.

The outcome is K-shaped bifurcation: vendors in Core Strongholds and Gold Mines are adapting and emerging stronger; Open Doors and Battlegrounds vendors face double pressure from AI-native competition and shrinking seat demand. For how agentic AI attacks SaaS business models at the mechanism level, see our analysis here.


What is the Bain four-scenario framework and how do its two axes work?

Bain plots SaaS workflows on two independent axes. Their intersection produces four quadrants.

Axis 1 — user automation potential: Can the humans using this tool be automated away? Monday.com task coordination is structured and repetitive — high automation potential. Procore site management involves liability chains requiring human judgment — low automation potential.

Axis 2 — AI penetration potential: Can AI now perform the tasks this tool handles with equivalent accuracy? Customer support tickets score high — AI resolves them reliably today. Clinical trial data validation scores low — regulatory requirements mandate deterministic accuracy that AI cannot certifiably meet.

The two axes are independent, which is what gives you four distinct quadrants rather than a simple safe/at-risk binary:

Core Strongholds: low on both axes. Compliance moats, proprietary data, and switching costs dominate.
Open Doors: AI can perform the tasks but human participation persists; the risk is seat compression.
Gold Mines: AI creates new value that legacy SaaS cannot match.
Battlegrounds: AI directly replicates what the incumbent does.
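As a first pass, the two-question version of this test (the same one formalised in the FAQ at the end of this article) fits in a few lines. This is a sketch under a simplified reading: the assignment of the two mixed cases to Open Doors and Gold Mines is illustrative, not Bain's published mapping.

```python
# Simplified quadrant placement for a single SaaS workflow.
# The two mixed branches are an illustrative reading, not Bain's mapping.

def classify(ai_can_perform_tasks: bool, users_automatable: bool) -> str:
    if ai_can_perform_tasks and users_automatable:
        return "Battlegrounds"    # AI replicates the incumbent end-to-end
    if not ai_can_perform_tasks and not users_automatable:
        return "Core Stronghold"  # compliance / data-moat territory
    if ai_can_perform_tasks:
        return "Open Door"        # tasks replicable, humans still in the loop
    return "Gold Mine"            # automation opens room for new AI-native value

print(classify(ai_can_perform_tasks=True, users_automatable=True))    # Zendesk-style workflow
print(classify(ai_can_perform_tasks=False, users_automatable=False))  # Procore-style workflow
```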

AI competition happens at the orchestration layer — precisely where most SaaS workflow coordination lives. Vendors without defensible data or compliance moats face displacement from the layer below them. The mechanism-level breakdown is in our agentic AI analysis.


Which SaaS vendors are Core Strongholds — and what makes them genuinely hard to displace?

Core Strongholds sit in the low/low quadrant. What makes them defensible is not brand strength. It is compliance-critical data governance, deep proprietary datasets that AI-native competitors cannot quickly replicate, and switching costs that exceed the benefit of substitution.

Procore — construction project management requires strict oversight around site safety, subcontractor liability, and regulated data flows. The liability chain cannot be delegated to an AI agent without legal exposure. Core Stronghold.

Medidata — clinical trial randomisation requires 100% accuracy as a regulatory requirement. Patient safety stakes and audit trail obligations make AI replacement a compliance risk before it is even a technology question. Core Stronghold.

Epic and Cerner — both control proprietary patient data under HIPAA governance. The data moat is also a regulatory moat. Core Stronghold.

IQVIA — its proprietary pharmaceutical dataset accumulated over decades creates a moat that AI-native competitors would need years to match. Core Stronghold.

Guidewire — holds exclusive insurance claims data, placing it firmly in Core Stronghold territory on the data side. It also occupies Gold Mine territory by actively embedding AI-enhanced underwriting into that proprietary data foundation.

The data moat is the most important Core Stronghold defence. Bain puts it plainly: “Your data is your moat. While models such as GPT-4o are everywhere, the real value lies in the proprietary data you own.” An AI-native competitor can access the same foundation models. It cannot access decades of proprietary transaction history.

One limit worth noting: the data moat protects the system of record, not the engagement layer sitting above it — a tension Workday and Epic are actively navigating.

For the broader picture of how this bifurcation plays out across the market, see our pillar analysis.


Which vendors are Open Doors — and what does that classification mean for your contracts?

Open Doors vendors sit in the low/high quadrant. AI can perform the tasks these tools handle, but human participation has not yet been eliminated end-to-end. The risk is spend compression — seat-based revenue under sustained pressure as AI reduces the number of humans needed.

HubSpot — marketing automation, email campaigns, and lead scoring are probabilistic tasks AI replicates with sufficient accuracy that the human coordinator role is being compressed. UncoverAlpha puts it bluntly: AI replicates HubSpot’s core function at approximately 1% of the cost. HubSpot’s stock fell from $880 to approximately $200–233. Open Door.

Monday.com — task assignment, status tracking, and deadline management are structured and automatable. The team that needed 15 seats to coordinate a project may need five. Monday.com fell 77% in the same period. Open Door.

Atlassian — layer-dependent. The project management and ticketing tier is an Open Door. The developer tooling layer is considerably more defensible.

LegalZoom — document generation and lower-complexity compliance workflows are increasingly AI-replicable. Open Door trending toward Battlegrounds.

Contract strategy: use renewal timing to renegotiate seat counts. Open Doors vendors know they are vulnerable and may offer flex credits across seats and AI agents. Evaluate by layer, not as a monolithic product. The renegotiation playbook is in ART006.


What makes Gold Mines and Battlegrounds the most contested SaaS categories in 2026?

Gold Mines and Battlegrounds both involve high AI capability. The direction is what differs. Gold Mines is where AI creates new value legacy SaaS cannot match. Battlegrounds is where AI directly replicates what incumbents do.

Gold Mines

Cursor — an AI-native code editor that displaced legacy IDEs not by replicating their feature sets but by doing something they structurally cannot: generating code rather than assisting humans who type it. Cursor hit $1 billion ARR in less than 24 months. AI-native companies are achieving approximately $700K ARR per employee versus traditional SaaS requiring far larger teams for equivalent revenue. The evidence on who is winning these Gold Mine categories is in our AI-native versus incumbent analysis.

Guidewire on its AI-enhanced side is a legacy vendor successfully transitioning to Gold Mine territory. Its insurance-specific dataset is the asset; AI-enhanced underwriting built on proprietary data creates more value than an AI-native competitor starting from general-purpose models.

Battlegrounds

Intercom — conversational AI has directly replicated Intercom’s core function. The agentic AI handling support tickets today does not need Intercom’s routing and escalation workflow — it is the routing and escalation. Highest near-term displacement risk in the mid-market stack.

Tipalti — accounts payable automation (invoice processing, approval routing, payment execution) is precisely the multi-step, rules-based workflow that agentic AI was designed to handle.

ADP — payroll itself is deterministic and defensible. The engagement layers are Battlegrounds territory. Classify by revenue exposure: if the licence revenue comes from the engagement layer, the vendor is Battlegrounds-exposed.

Zendesk — ticket routing, escalation management, and agent assignment are precisely what agentic AI does autonomously.

Salesforce straddles the framework. CRM data depth — 15+ years of customer history — is a Core Stronghold asset. The workflow coordination layer is Battlegrounds territory. Whether incumbency plus AI pivot is sufficient to defend against AI-native CRM alternatives is the most actively contested question in enterprise software right now. For mechanism-level detail on how agentic AI attacks these workflows, see our analysis.


What does the deterministic vs. probabilistic distinction add to the Bain analysis?

When you are auditing a thirty-tool stack, a rapid first-pass filter helps. The deterministic/probabilistic distinction gives you exactly that.

Deterministic SaaS requires 100% accuracy. Payroll cannot be approximately right. Compliance reporting cannot tolerate meaningful error. CIOs are already rejecting AI replacement in financial services for this reason — a system correct “six out of ten times” is insufficient. Deterministic SaaS is inherently more defensible.

Probabilistic SaaS tolerates approximation. Marketing content needs to be good enough to generate engagement. Task coordination needs to be adequate, not optimal. AI handles these functions with sufficient accuracy that seat-based workflows become optional.

The overlay maps cleanly: deterministic SaaS maps predominantly to Core Strongholds; probabilistic SaaS maps predominantly to Open Doors and Battlegrounds.

Straddlers like ADP, Salesforce, and Workday each have a deterministic data layer (defensible) and a probabilistic engagement layer (exposed). Classify by revenue exposure: which layer generates the licence revenue?

A quick first-pass for a representative mid-market stack might look like the sketch below:
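This is illustrative only: the vendor list is an assumption, not a template, and the deterministic/probabilistic labels follow this article's own examples.

```python
# Hypothetical mid-market stack run through the deterministic/probabilistic
# first pass. Labels follow this article's examples.

STACK = {
    "ADP payroll core": "deterministic",   # 100% accuracy required
    "Procore":          "deterministic",   # liability chains, regulated data flows
    "HubSpot":          "probabilistic",   # marketing tolerates approximation
    "Monday.com":       "probabilistic",   # good-enough task coordination
    "Zendesk":          "probabilistic",   # ticket routing is automatable
}

for vendor, kind in STACK.items():
    note = ("defensible: run the full Bain test" if kind == "deterministic"
            else "exposed: audit seat counts at renewal")
    print(f"{vendor:18s} {kind:15s} {note}")
```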

The mechanism connecting deterministic/probabilistic to how agentic AI attacks these workflows is in our agentic AI analysis.


What is the Forrester REAP model and how does it complement the Bain framework?

The Bain framework classifies. The Forrester REAP model tells you what to do about it. They address different parts of the same problem.

REAP is Forrester’s application disposition matrix: Reassess, Extract, Advance, Prune.

Applied to named vendors: Procore → Extract with Advance. HubSpot → Reassess. Cursor → Advance. Tipalti, Zendesk, Intercom → Prune.
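The quadrant-to-disposition mapping is compact enough to state directly. This restates only the publicly available framing cited in the FAQ below (the full Forrester methodology is paywalled), so treat it as an approximation:

```python
# Quadrant -> REAP disposition, publicly available framing only.
QUADRANT_TO_REAP = {
    "Core Stronghold": ["Extract", "Advance"],  # keep investing, deepen the moat
    "Open Door":       ["Reassess"],            # renegotiate seats before any exit
    "Gold Mine":       ["Advance"],             # expand usage of AI-native winners
    "Battlegrounds":   ["Prune"],               # plan exit sequencing and timing
}
```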

REAP’s limitation: it gives you disposition categories, not timing strategy. Knowing you should Prune Zendesk does not tell you when to exit or how to sequence the transition. That belongs to the contract audit and action playbook.

Bain is richer for classification; REAP is richer for portfolio governance. Using both eliminates the two most common failure modes — classification without action, and action without classification rationale. See also the broader strategic context in our SaaS reckoning overview.


What does the Klarna experience tell us about applying these frameworks in practice?

Klarna is the most instructive real-world data point for organisations considering aggressive AI substitution — not because the results were uniformly positive, but precisely because they were not.

Klarna deployed an AI customer support agent handling work equivalent to 700–853 human agents. Since 2022, it reduced its workforce by approximately 50% through attrition while growing revenue from $300,000 to $1.3 million per employee. It replaced Salesforce’s CRM engagement layer with an in-house AI stack — running, in effect, its own Battlegrounds analysis and concluding the engagement layer was replaceable. The cost thesis was directionally correct.

Where it went wrong: customer satisfaction declined. The nuance lost in aggressive AI substitution was quality in complex, high-stakes interactions — the deterministic edge cases where probabilistic AI performs poorly. Klarna reversed course and began rehiring human support staff.

The lesson here is important. The Bain framework is a classification tool, not a replacement guarantee. A correct Battlegrounds classification means the vendor faces structural displacement risk — not that the AI-native alternative is ready for 100% of the workflow on day one. Klarna did not abandon the thesis; it recalibrated the application.

For Salesforce specifically: Klarna replaced the engagement layer. The data depth question is different and more complex. The Bain framework forces you to make that layer distinction explicitly rather than treating the product as a single binary choice.

The operational playbook for managing these transitions covers what to do next. This article establishes the framework.


Frequently Asked Questions

How do I know if my SaaS vendor will survive the AI disruption?

Apply the Bain two-axis test: (1) can AI now perform the tasks this tool handles with equivalent accuracy? (2) can the human users be automated away, reducing seat demand? Yes to both means Battlegrounds. No to both means Core Stronghold. Use the deterministic/probabilistic first-pass before the full test: if the tool requires 100% accuracy for its core function, it is likely more defensible.

Can AI really replace Salesforce or Workday for my company?

Not in full. Both have deep data layers that are defensible. What AI is replacing is the engagement layer — workflow coordination, approvals, service interfaces. Klarna replaced Salesforce’s CRM engagement layer, not its data depth. Evaluate each vendor by layer, not as a monolithic product.

What is the difference between SaaS vendors that are safe from AI and those that aren’t?

Safe vendors (Core Strongholds) share three characteristics: compliance-critical data governance AI cannot reliably own, deep proprietary datasets AI-native competitors cannot quickly replicate, and switching costs that exceed the benefit of substitution. At-risk vendors rely on seat-based revenue from human workflows that AI can now replicate.

What does “deterministic SaaS” mean and why does it matter?

Deterministic SaaS requires 100% accuracy for its core function — payroll, ERP, healthcare records, compliance management. AI can assist but cannot govern these workflows at current accuracy levels. Probabilistic SaaS tolerates approximation — content, marketing, task coordination — which is why AI is displacing it faster.

Why did Klarna replace Salesforce, and should I do the same?

Klarna concluded Salesforce’s CRM engagement layer was in the Battlegrounds quadrant and built an AI-native replacement. The cost savings were real; the customer satisfaction decline was also real. Whether to follow depends on your vendor’s quadrant classification for your specific workflows and your tolerance for transition risk. The action playbook is here.

What SaaS tools should I cut because AI can replace them?

Battlegrounds vendors — Tipalti, Intercom, and Zendesk are the clearest near-term candidates. Open Doors vendors (HubSpot, Monday.com) warrant renegotiation before exit. Do not exit a deeply integrated Battlegrounds vendor without first running dependency analysis and evaluating whether the AI-native alternative handles your deterministic edge cases. The audit playbook addresses that process.

What is the Forrester REAP model and where can I find the full version?

REAP stands for Reassess, Extract, Advance, and Prune — a portfolio disposition framework from Forrester Research that converts Bain quadrant classification into a decision instruction. The full methodology is paywalled. Publicly available framing: Core Strongholds → Extract or Advance; Open Doors → Reassess; Gold Mines → Advance; Battlegrounds → Prune.

Is the SaaSpocalypse a real structural shift or overblown market panic?

Both. The February 2026 sell-off treated Procore and Zendesk as equivalent risks. The structural shift is real for Battlegrounds and Open Doors vendors; it is overstated for Core Strongholds. The median EV/Revenue multiple compressing from 18–19x to 5.1x reflects structural repricing — but the sell-off applied it uniformly. The Bain framework disaggregates signal from noise.

How does seat-based pricing create AI disruption risk for SaaS vendors?

Seat-based pricing ties vendor revenue to the number of human users. When AI agents reduce the number of humans needed to execute a workflow, seat count drops even if the platform remains in use. A team needing 20 HubSpot seats to run marketing operations may need eight when AI handles content generation and campaign reporting.

What is the three-layer agentic stack and why does it matter for SaaS?

The three-layer agentic stack: systems of record at the base (governed data, compliance logic), agent operating systems in the middle (the orchestration layer), and outcome interfaces at the top. AI competition with SaaS happens at the orchestration layer — where agentic AI is replicating the workflow coordination functions that SaaS platforms were built to provide. Early agent operating systems include Microsoft’s Azure AI Foundry, Google’s Vertex AI Agent Builder, and Amazon Bedrock Agents. Vendors without defensible data or compliance moats face displacement from below.

What does the Bain framework say to do if most of my SaaS spend is in the Battlegrounds quadrant?

Treat those vendors as disposal candidates and begin transition planning. Battlegrounds classification means the vendor faces structural revenue decline as AI agents replicate their core workflows — on a timeline of months to a few years, not decades. The practical steps — exit sequencing, contract timing, AI-native tool evaluation — are the subject of the practical playbook.

Should I wait for my SaaS vendor’s AI product to mature before acting?

Only if the vendor is in Open Doors or Core Strongholds — where watching the AI transition play out is rational. For Battlegrounds vendors, the AI alternative is maturing faster than the incumbent’s AI pivot in most cases. Waiting for Zendesk to out-innovate agentic customer support AI is a lower-probability bet than beginning evaluation now. The audit playbook addresses the timing question in full.

SaaS Pricing Is Shifting from Per-Seat to Usage and Outcome — What Changes at Your Next Renewal

Per-seat pricing was built for a world of human users. AI agents don’t log in, don’t consume named-user licences, and don’t map to headcount — so the model is structurally broken in an AI agent economy.

Vendors know this and they’re moving fast. Gartner predicts 40% of enterprise SaaS spend will shift to usage- or outcome-based models by 2030. Maxio found 83% of AI-native SaaS companies have already made the switch at the vendor level. The transition is happening whether you engage with it or not. The only question is whether your organisation shapes the terms or inherits them.

This article gives you a clear taxonomy of what’s replacing per-seat pricing, what the transition looks like in practice from two real incumbent case studies, and a concrete pre-renewal checklist for your next vendor negotiation. For the full strategic picture, see how the SaaS reckoning is reshaping technology budgets.


Why is per-seat pricing structurally misaligned in an AI agent economy?

Per-seat pricing charges a fixed fee per named user account regardless of actual usage. It was designed for human-operated software where value tracked headcount — simple unit economics both sides could understand.

AI agents break every assumption that logic rests on. They don’t consume seats — they consume compute cycles, API calls, tokens, and workflow executions. Charging per seat for an AI-augmented workflow is like billing per driver for a highway used mostly by autonomous vehicles. Under per-seat pricing, adding AI agents means paying for human licences the agents never use, while absorbing AI consumption costs layered on top. Bain’s analysis of 30+ major SaaS vendors found roughly 65% have layered an AI consumption meter on top of existing seat pricing. You end up paying twice.

The mechanism vendors use to monetise this is what Tropic calls the AI Tax: a 20–37% price uplift at contract renewal, imposed through AI feature bundling or Forced SKU Migration — where vendors retire legacy pricing tiers and compel customers onto AI-inclusive packages. Slack, Google Workspace, Salesforce — all have done it in recent years.

The misalignment between agent-driven workflows and per-user billing is the root cause, not vendor opportunism. That’s why the market is moving — and why the transition is already showing up at your renewal.

For context on why the SaaS reckoning created this pressure, see why the SaaS reckoning is driving this pricing shift.


What does Gartner’s 2030 prediction mean for your next SaaS contract?

Gartner predicts that by 2030, at least 40% of enterprise SaaS spend will shift toward usage-, agent-, or outcome-based pricing; the share of vendors using seat-based pricing as their primary model has already declined from 21% to 15% (Deloitte TMT Predictions 2026). Maxio confirms the supply already exists: 83% of AI-native SaaS companies offer usage-based pricing. The negotiating lag is on the buyer side.

This creates a window that is open now and will close as pricing models standardise. Vendors in transition are often looking for design partners — buyers willing to commit on favourable terms in exchange for early access. That’s a very different conversation from “give us a discount.”

The framing that works: “The analyst consensus says seat-based pricing is declining. We’d like a contract that reflects that trajectory, with transition terms written in now.”

See how the SaaS reckoning is reshaping technology budgets for the broader strategic context.


What is the difference between usage-based, consumption-based, and outcome-based pricing — and which should you prefer?

These three terms get conflated constantly in vendor materials. The distinctions matter operationally.

Usage-based pricing charges proportionally to measured inputs: API calls, tokens processed, compute units, data volumes. Predictable if you can model usage patterns; unpredictable if AI agent activity varies. It’s the most common post-per-seat model.

Consumption-based pricing is functionally similar but applied to output volumes — pages generated, emails sent, records processed. Most vendor implementations blur the line. Joey Quirk from Chargebee puts it bluntly: “It’s just usage pricing with a marketing degree.”

Outcome-based pricing charges per verified business result — per resolved support ticket, per closed deal, per completed workflow. Costs tie to value delivered, not activity. The catch: you need a contractual Outcome Measurement Agreement defining what counts as a valid outcome before you sign.

Hybrid pricing — fixed base plus variable consumption — is the dominant transition state. Most enterprise renewals in 2025–2026 will land here.

Which model to go for: push for hybrid with a hard consumption ceiling when AI adoption is early; move to usage-based when patterns are stable; only commit to outcome-based when you have the instrumentation to support an Outcome Measurement Agreement.
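To see how the recommended hard ceiling behaves, here is a minimal sketch of a hybrid bill. The base fee, unit price, and ceiling are assumptions for illustration, not any vendor's rates:

```python
# Hybrid pricing: fixed base plus metered usage, with a hard ceiling.
# "Hard ceiling" means usage above the cap is throttled, not billed.

BASE_FEE = 5_000        # fixed monthly platform fee (assumed)
UNIT_PRICE = 0.02       # per workflow execution (assumed)
HARD_CEILING = 400_000  # executions/month; above this, usage is throttled

def monthly_bill(executions: int) -> float:
    billable = min(executions, HARD_CEILING)  # overage throttled, not charged
    return BASE_FEE + billable * UNIT_PRICE

print(monthly_bill(250_000))  # normal month: $10,000
print(monthly_bill(600_000))  # runaway-agent month: capped at $13,000
```

Without the ceiling, that second month would bill $17,000 and keep climbing with agent activity.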


What does Zendesk’s outcome-based pricing model reveal about how the transition works in practice?

In August 2024, Zendesk became the first major incumbent SaaS vendor to launch outcome-based pricing for AI agents: billing per Automated Resolution (AR) — a support ticket resolved by AI without human intervention, confirmed after 72 hours of inactivity. The pricing: $1.50 per committed AR, $2.00 pay-as-you-go.

The thing most buyers miss: you need to agree contractually on what counts as “resolved” before signing. That Outcome Measurement Agreement is the prerequisite most teams have not worked through yet. Zendesk’s definition — that the AI confirmed the response was relevant — remains Zendesk’s definition until you contractualise it.

And the buyer risk is real. Vendors control the resolution-count methodology unless it’s explicitly contractualised. Zendesk’s AI agent pricing dropped roughly 50% within a single year due to competitive pressure. Buyers who locked in early at 2024 rates are now navigating that overpayment.
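The committed-versus-PAYG trade-off is easy to model. This sketch uses Zendesk's published 2024 rates; the assumption that a committed tier bills the full commitment even when actual resolutions fall short is illustrative, since true-up mechanics are contract-specific:

```python
# Per-Automated-Resolution cost: committed vs pay-as-you-go.
COMMITTED_RATE, PAYG_RATE = 1.50, 2.00  # Zendesk's published 2024 rates

def committed_cost(actual: int, committed: int) -> float:
    # Assumes the shortfall is still billed; check your contract's true-up terms.
    return max(actual, committed) * COMMITTED_RATE

def payg_cost(actual: int) -> float:
    return actual * PAYG_RATE

# 10,000 resolutions committed, 7,000 delivered:
print(committed_cost(7_000, 10_000))  # $15,000: paying for undelivered outcomes
print(payg_cost(7_000))               # $14,000: PAYG wins when volume is uncertain
```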

Zendesk’s move signals where the market is going. Intercom confirmed it with 393% annualised Q1 growth on its AI agent Fin, tying revenue directly to ticket resolutions. That’s the template.

For evaluating vendors as part of a broader portfolio review, see the SaaS portfolio audit and vendor renegotiation playbook.


How do you negotiate a pricing model transition with a vendor that hasn’t offered one yet?

Most vendors haven’t proactively offered usage- or outcome-based pricing. The transition needs to be initiated by the buyer.

Position as a design partner, not a difficult customer. The framing that opens doors: “We’re planning significant AI agent deployment over the next two years — we’d like to explore what a usage-based structure looks like, and we’re willing to commit on a multi-year basis if the pricing model aligns with our consumption trajectory.” Contrast that with “We want lower prices.” One gets you a strategic conversation. The other closes the door.

Use data anchors. The Gartner 40% shift prediction, Maxio’s 83%, Zendesk’s August 2024 outcome-based model, and Salesforce’s Agentic Enterprise License Agreement (AELA) all give you market-consensus standing. The AELA — unlimited use of Agentforce, Data 360, and MuleSoft for a fixed fee on 2–3 year terms — is the template for how to frame a value-commitment conversation at any scale.

When the vendor isn’t ready, negotiate defensively: start renewal conversations six to nine months before contract expiry, and for hybrid pricing never commit to open-ended consumption billing — insist on a hard ceiling above which usage is throttled, not charged.

For the complete negotiation framework, see the broader SaaS portfolio audit and vendor renegotiation playbook.


What changes operationally when your SaaS billing shifts from fixed seats to variable usage?

Per-seat billing is operationally simple: a fixed line item per vendor, predictable to the cent. Variable usage billing requires new operational infrastructure across finance, procurement, and compliance.

Finance impact. Variable SaaS spend can’t be fully budgeted at the start of a fiscal year. Consumption-based line items need reserve buffers and real-time spend dashboards. Flexera found over 70% of organisations report business units purchasing more SaaS than IT is even aware of. That needs to change before you move to variable billing.

FX exposure. For Australian companies using global vendors billed in USD, variable consumption billing introduces FX variance that fixed-seat contracts don’t. Model this before committing to large consumption-based contracts.
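A rough way to model that variance before signing, with assumed usage and FX bands (placeholders, not forecasts):

```python
# Budget range for a USD-billed consumption contract paid in AUD.
usage_band_usd = (8_000, 15_000)  # monthly consumption spend range (assumed)
fx_band = (0.62, 0.70)            # AUD/USD exchange-rate band (assumed)

low = usage_band_usd[0] / fx_band[1]   # best case: low usage, strong AUD
high = usage_band_usd[1] / fx_band[0]  # worst case: high usage, weak AUD
print(f"AUD budget range: {low:,.0f} to {high:,.0f} per month")

# A fixed-seat USD contract carries only the FX leg; consumption billing
# multiplies the FX band by the usage band.
```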

Procurement complexity. Outcome Measurement Agreements require procurement to define, audit, and enforce business outcome metrics — a capability most procurement teams simply don’t have for software contracts yet.

The CFO case. Usage-based billing creates the mechanism for a strategic reframe: consumption spend is budget reclassified from headcount to technology, not an incremental cost. Bain puts it plainly — selling an AI usage model “requires shifting budget lines from labour to software.” If your AI agents reduce support headcount by two roles, the consumption billing covers work that was previously a labour cost. That’s the argument that gets finance on side.

For operational infrastructure on hybrid pricing management, see the SaaS portfolio audit and vendor renegotiation playbook.


Your pre-renewal checklist: what to look for, what to ask for, and what not to concede

This checklist goes directly into your next vendor renewal conversation.

Before the renewal meeting

Terms to ask for

  1. Price protection clause: cap annual increases at 3–5%, CPI-indexed (sketched after this list). Eliminate “market rate” language.
  2. SKU-level price lock: prevents forced migration to a more expensive AI-inclusive SKU mid-term.
  3. Consumption cap with hard ceiling: for any hybrid or usage-based component, set a maximum monthly spend above which usage is throttled, not charged.
  4. Outcome Measurement Agreement: for any outcome-based component, define measurement methodology, audit rights, and dispute resolution before signing.
  5. Transition-to-usage clause: an option to convert to usage-based pricing at any renewal during the term, at pre-agreed rates.
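A sketch of how item 1 behaves at renewal, assuming a "lesser of CPI or cap" drafting, which is one common form of the clause rather than the only one:

```python
# CPI-indexed price cap: next-term ceiling under a "lesser of CPI or cap" clause.
def max_renewal_price(current: float, cpi: float, cap: float = 0.05) -> float:
    return current * (1 + min(cpi, cap))

print(max_renewal_price(100_000, cpi=0.032))  # CPI below cap: $103,200
print(max_renewal_price(100_000, cpi=0.071))  # CPI above cap: $105,000
```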

Terms to avoid conceding

Vendor signals to watch for

The complete framework for auditing your full SaaS portfolio is in the SaaS portfolio audit and vendor renegotiation playbook. For the full strategic picture for CTOs navigating the SaaS reckoning, the overview covers all dimensions of this transition.


Frequently Asked Questions

What is usage-based SaaS pricing and how does it differ from per-seat pricing?

Usage-based pricing charges proportionally to measured consumption — API calls, tokens, compute units. Per-seat pricing charges a fixed fee per named user regardless of usage. Usage-based costs scale with activity, not headcount, which changes how finance budgets for SaaS spend.

What is outcome-based pricing in SaaS and how does it work?

Outcome-based pricing bills per verified business result delivered — per support ticket resolved, per sales deal closed. It requires a contractual Outcome Measurement Agreement. Zendesk’s per-Automated-Resolution model (August 2024), at $1.50 per committed resolution, is the primary incumbent example.

What does Gartner predict about SaaS pricing models by 2030?

Gartner predicts at least 40% of enterprise SaaS spend will shift to usage-, agent-, or outcome-based models by 2030; the share of vendors using seat-based pricing as a primary model has already fallen from 21% to 15% (Deloitte TMT Predictions 2026). The transition window to act on is 2025–2026.

What is the Salesforce Agentic Enterprise License Agreement (AELA)?

The AELA gives enterprise customers unlimited use of Agentforce, Data 360, and MuleSoft for a fixed fee on 2–3 year terms. It’s the most visible large-incumbent template for how enterprise AI pricing transitions away from per-seat metered billing.

What is the AI Tax on SaaS renewals?

The AI Tax is Tropic’s term for the 20–37% price uplift vendors impose at renewal by bundling AI features into existing products or migrating customers to more expensive AI-inclusive SKUs.

What is a hybrid SaaS pricing model and why is it becoming the default?

A hybrid model combines a fixed base subscription with variable consumption add-ons. It’s the dominant transition state in 2025–2026 because neither vendors nor buyers are ready for pure usage or outcome models yet.

What is Forced SKU Migration in SaaS contracts?

Forced SKU Migration is when a vendor eliminates a legacy pricing tier, compelling customers to migrate to a more expensive AI-inclusive SKU at renewal. Watch for SKU retirement notices six to twelve months before your renewal date.

How do I build price protection into a SaaS renewal contract?

Negotiate a price protection clause capping annual increases at 3–5% (CPI-indexed), a SKU-level price lock, and an explicit carve-out preventing AI features from triggering automatic billing uplift. For hybrid contracts, always negotiate a hard consumption ceiling.

Is outcome-based SaaS pricing good for buyers or vendors?

Outcome-based pricing is buyer-aligned — you pay only for verified value delivered. The challenge is implementation: Outcome Measurement Agreements, instrumentation to audit results, and trust in the vendor’s methodology. Buyers with high AI adoption and clear outcome metrics benefit most.

Should I switch to usage-based pricing before my next SaaS contract renewal?

Usage-based pricing makes sense if seat utilisation is low and usage patterns are predictable. Don’t switch if there’s no consumption cap or if your finance team lacks tooling to monitor real-time SaaS spend.

What does the SaaS pricing model shift mean for my IT budget?

It means shifting from a fixed, predictable software cost line to variable spend scaling with AI agent activity — requiring reserve buffers, real-time monitoring, and a budget reclassification argument: AI consumption spend replaces labour costs, not just software costs.

Is this a good time to renegotiate my SaaS contracts?

Yes. Vendors are in a transition window, building alternative pricing infrastructure while defending legacy revenue. Buyers who approach as preferred early adopters have more leverage now than they will once pricing models standardise.

The SaaS Reckoning Explained — What Happened to Enterprise Software in 2026

On January 12, 2026, Anthropic launched Claude Cowork. A journalist built a kanban board with it in under 10 minutes and posted a video, and Monday.com’s market cap dropped roughly $300 million before the session closed. That single post — not a product launch, not an earnings miss, just a demonstration — repriced a company generating $1.3 billion in annual recurring revenue.

By the time February ended, approximately $1 trillion in aggregate market capitalisation had been wiped from enterprise SaaS. The media coined the term “SaaSpocalypse”. It generates heat, but not much light.

The full picture of what the SaaS reckoning means for technology leaders needs more than a headline. This article gives you the shared vocabulary and market context: a structured account of what happened, why it happened, and what it means — with real numbers, named triggers, and the frameworks to think clearly about it. By the end, you’ll understand why “SaaS is dying” and “SaaS is fine” are both wrong, and what the accurate framing actually looks like.

What actually happened to enterprise software stocks in January and February 2026?

The January–February 2026 SaaS dislocation erased approximately $1 trillion in aggregate market capitalisation from enterprise SaaS, with the S&P 500 software index shedding that amount since January 28, 2026. Broader estimates run closer to $2 trillion, but the enterprise SaaS figure is the more defensible anchor.

The individual stock moves tell the story. HubSpot declined approximately 51% from peak to trough — from roughly $880 per share to around $233, with market cap collapsing from $42 billion to under $10 billion. Monday.com fell approximately 44%. ServiceNow declined approximately 36%. Atlassian dropped 26.9% in eighteen trading days. Workday was down approximately 13% year-to-date.

The iShares Expanded Tech-Software ETF (IGV) — a broad SaaS proxy — declined approximately 22% year-to-date, marking the steepest software sell-off since the 2022 rate hike cycle. The losses weren’t a slow bleed. The majority were compressed into two sharp sell-offs in January and one in early February. Jefferies equity trader Jeffrey Favuzza dubbed it “SaaSpocalypse” and described trading that was “very much ‘get me out’ style selling” — language not heard since 2008.

While SaaS haemorrhaged, semiconductor plays surged — Lam Research up 30.3%, KLA Corp up 29%, Applied Materials up 27.3%. Capital wasn’t leaving tech; it was moving within tech. That’s the first signal this was structural, not a general correction.

Why did the SaaSpocalypse happen when it did?

The sell-off had two convergent triggers in January 2026, each compounding the other.

The first was the Anthropic Claude Cowork launch on January 12. Claude Cowork positioned Claude as an enterprise collaboration layer capable of executing multi-step SaaS workflows autonomously — not a new AI feature added to an existing tool, but a potential replacement layer for the per-seat model that underpins SaaS valuations. On January 30, Anthropic released 11 open-source Cowork plugins covering legal, finance, marketing, sales, and customer support. The plugins collapsed a significant portion of the SaaS stack into a single AI system: productivity, marketing, finance, and data workflows all replicable through one agent layer.

The timeline: January 16 — Anthropic launches a multi-agent shared workspace; Atlassian and Asana drop roughly 4% immediately. January 20 — tech companies announce hiring freezes citing AI. January 28–29 — ServiceNow’s earnings guidance acknowledges AI substitution risk, coinciding with the OpenAI Frontier launch, which published a diagram showing value accruing to AI agents above the SaaS layer. ServiceNow drops 11% in the session.

CNBC host Deirdre Bosa’s Monday.com kanban tweet is the clearest illustration. She built a functional kanban interface using Claude Cowork and said she wanted to “try to recreate Monday.com.” Monday’s stock dropped 6% immediately and another 10% the next session — $300 million erased from a company generating over $1 billion in annual revenue.

The proximate triggers were real, but they landed on a market already repricing 18 months of enterprise AI data. Menlo Ventures documents enterprise AI spending growing from $1.7 billion in 2023 to $37 billion in 2025 — a 3.2× year-on-year growth rate. The triggers provided confirmation, not news.

For the mechanism-level breakdown of how AI agents actually threaten SaaS business models, that article covers the specifics.

Is SaaS really dying — or is something more nuanced happening?

The growth assumptions that justified 20–40× revenue multiples are no longer credible, and the repricing reflects that. SaaS products still work. Contracts still renew. But the budget that would have funded SaaS growth is now flowing into AI. That’s what this repricing is about — a contraction in growth expectations, not a collapse of the business model.

HarbourVest characterised the sell-off as “rational repricing” — AI is undermining three assumptions baked into SaaS multiples: that seat-based pricing will grow forever, that software margins are structurally fixed at 80–85%, and that recurring revenue is predictable. Morgan Stanley called it a “sentiment-driven dislocation.” Both are partially right. They’re not mutually exclusive.

Public SaaS growth rates had declined every single quarter since the 2021 peak. AI budgets are up 100%+ year-on-year while overall IT budgets are up ~8%. AI is absorbing the growth margin from total IT spend — and that margin was previously flowing into SaaS expansion.

The relevant question isn’t whether SaaS survives as a category. It’s which SaaS products are being starved, and at what rate.

What is the “uncertainty tax” — and why does it matter beyond the stock market?

The uncertainty tax is the valuation discount investors apply to SaaS businesses whose revenue model is perceived as structurally threatened by AI — a premium charged for unpredictability in ARR, margin, and net revenue retention when the 5-year model is opaque.

It’s rational. When a SaaS company’s revenue model depends on seat-count growth, and AI agents can perform the same workflows without seats, even a 10% long-term seat reduction assumption changes the entire discounted cash flow. HarbourVest documents the maths: 30% seat decline plus 10% price increase produces -23% revenue contraction; 50% seat decline plus 15% price increase produces -42.5%. AI forces a decoupling between value delivered and seats billed — value goes up but the monetisable unit goes down.
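The arithmetic behind those figures is simply the product of the two factors. A quick check:

```python
# Revenue change when seats decline and per-seat price rises:
# new/old = (1 - seat_decline) * (1 + price_increase)

def revenue_change(seat_decline: float, price_increase: float) -> float:
    return (1 - seat_decline) * (1 + price_increase) - 1

print(f"{revenue_change(0.30, 0.10):+.1%}")  # -23.0%, HarbourVest scenario 1
print(f"{revenue_change(0.50, 0.15):+.1%}")  # -42.5%, HarbourVest scenario 2
```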

The Monday.com kanban incident illustrates this precisely: ARR didn’t change that day, but $300 million in market cap was erased — because investors updated their probability on future seats. Goldman Sachs analyst Ben Snider framed it clearly: “near-term earnings results will be important signals of business resilience, but in many cases insufficient to disprove the long-term downside risk.”

The tax falls most heavily on horizontal SaaS — project management, CRM, work-OS tools — and least on system-of-record platforms with deep integration lock-in.

For CTOs, this matters operationally. A vendor carrying a significant uncertainty tax faces constrained R&D, potential pricing aggression, and elevated acquisition risk. All of which surface in renewal conversations before anything dramatic happens on the exchange.

What does the data say about AI’s actual growth versus SaaS decline?

The enterprise AI spending data isn’t speculative. Menlo Ventures documents $1.7 billion in enterprise AI spending in 2023, rising to $11.5 billion in 2024, then to $37 billion in 2025 — a 3.2× year-on-year growth rate representing more than 6% of the entire software market within three years of ChatGPT’s launch. AI-native startups captured approximately 63% of AI application layer revenue in 2025, up from 36% in 2024.

The Gartner 2030 prediction, cited through Deloitte Insights: at least 40% of enterprise SaaS spend will shift toward usage-, agent-, or outcome-based pricing by 2030. Seat-based pricing had already fallen from 21% to 15% of vendors as a primary model within twelve months, while hybrid pricing surged from 27% to 41%. These are current market data, not forecasts.

Gartner separately predicts 35% of point-product SaaS tools will be replaced by AI agents or absorbed within larger agent ecosystems by 2030. Deloitte found that 57% of enterprise respondents were allocating 21–50% of their digital transformation budgets to AI automation, with 20% allocating more than half.

Together, the datasets establish that the reckoning is structural — the demand side is genuinely migrating, and the supply side is already following. For a deeper treatment of the pricing model shift, that article covers the transition in full.

What is the K-shaped bifurcation — and which side of it are your vendors on?

The K-shaped bifurcation describes the post-dislocation divergence in SaaS: two arms separating rather than one uniform decline.

Platform incumbents — Salesforce, Oracle, Microsoft — are trending toward recovery. Horizontal point-solution SaaS — HubSpot, Monday.com, and in important respects Workday — continues to face downward pressure. HarbourVest’s analysis identifies the structural logic.

Systems of record hold proprietary customer-specific operational data, run deterministic mission-critical workflows, carry high switching costs, and are necessary for regulatory, financial, or operational continuity. AI augments these systems; it doesn’t replace them. If SAP goes down, factories stop. If Workday breaks, payroll fails.

Bolt-on tools have the opposite profile: they don’t own the data layer, perform tasks where “good enough” is acceptable, have low switching costs, and solve problems AI can replicate with general-purpose models. The tell: if Claude Cowork’s plugins can replicate a vendor’s core value proposition out of the box, that vendor’s moat was a feature wrapped in a subscription.

Workday is the nuanced case. Despite platform positioning, its HR and finance workflows are increasingly automatable at the process layer. Platform status is necessary but not sufficient — the workflows must also be irreplaceable by agents.

The valuation divergence between the two arms is pronounced. AI-native startups commanded 50× higher valuations than traditional SaaS at Series D. Harvey (legal AI) trades at 80× revenue. HubSpot trades at ~6×. The market is pricing growth assumptions, and those assumptions differ structurally between the two arms.

Understanding which arm your vendors sit in is the starting point for the practical work. The vendor evaluation framework covers that in full.

What does the SaaS reckoning mean for technology leaders right now?

The market repricing is a leading indicator of structural change, not just a financial event. It has operational consequences for every organisation managing a software stack.

The vendor financial health implication is direct. A vendor carrying a meaningful uncertainty tax faces constrained R&D, potential pricing aggression, and elevated acquisition or partnership risk. The high leverage of private equity-backed SaaS businesses inhibits reinvestment precisely when they need to spend aggressively to stay competitive. Vendor financial stress is an operational risk for buyers.

The pricing model transition is already arriving in contracts. Deloitte notes usage-based and outcome-based terms are already appearing in enterprise agreements. The Gartner 2030 prediction isn’t a distant horizon — it’s the direction renewal negotiations are already moving.

The build-vs-buy calculus has shifted. AI-assisted internal development has made custom tooling viable for more organisations. Build time for point-solution tools now measures in days to weeks, at the cost of a developer plus an AI subscription ($50–200 per month), compared to $500,000 or more annually for enterprise SaaS licences. That shift increases your negotiating leverage even where there is no genuine intention to build.

For the mechanism of how AI agents attack SaaS models, that article covers the specifics. For vendor evaluation, the framework article covers that in full. For the full picture of what the SaaS reckoning means for technology leaders, that is the right place to continue.

Frequently Asked Questions

What is the SaaS reckoning?

The January–February 2026 broad repricing of enterprise SaaS equities. Triggered by Anthropic’s Claude Cowork launch on January 12, 2026 and compounded by ServiceNow’s earnings on January 28–29. Approximately $1 trillion in aggregate SaaS market capitalisation erased across six weeks.

What does “SaaSpocalypse” mean?

Media shorthand coined by Jefferies equity trader Jeffrey Favuzza and picked up by Forbes and TechCrunch. The term overstates the finality. “SaaS reckoning” is more precise — SaaS is being structurally repriced, not eliminated.

Why did HubSpot stock fall so much in 2026?

Two factors: Claude Cowork directly threatened HubSpot’s CRM and marketing workflow seat model, and HubSpot’s position as a horizontal SaaS tool with limited system-of-record defensibility made it a candidate for the highest uncertainty tax.

Is SaaS really dying because of AI agents?

SaaS isn’t dying — it’s being starved. AI agents threaten the seat-based growth assumptions that justify high-multiple valuations, but the operational software layer doesn’t disappear quickly. The question is which categories face the most acute substitution pressure.

What is seat-based pricing and why is it under threat?

Per-named-user licences where revenue scales with headcount. AI agents can execute the same workflows without occupying seats — breaking the growth assumption that more employees means more ARR. Per-seat pricing already dropped from 21% to 15% of SaaS companies as a primary model in twelve months.

What did Anthropic’s Claude Cowork actually do?

An enterprise collaboration layer capable of executing multi-step SaaS workflows using 11 open-source plugins covering legal, finance, marketing, sales, and customer support. Investors read it as a replacement layer for per-seat pricing across the SaaS category — not a feature, a structural challenge.

What does the Gartner 2030 SaaS pricing prediction say?

Via Deloitte Insights: by 2030, at least 40% of enterprise SaaS spend will shift toward usage-, agent-, or outcome-based pricing. Seat-based pricing is already declining from 21% to 15% of vendors in just twelve months.

What does the Menlo Ventures $37 billion figure mean?

Enterprise AI spending in 2025, from the Menlo Ventures State of Generative AI in the Enterprise report. Up from $1.7 billion in 2023 and $11.5 billion in 2024 — 3.2× year-on-year growth. Enterprise AI adoption is measured and accelerating.

How does the 2026 SaaS sell-off compare to the 2001 dot-com crash?

The 2001 crash was speculative valuations on businesses with no revenue model. The 2026 reckoning is a repricing of businesses with proven revenue models whose growth assumptions are genuinely in question. SaaStr’s characterisation: “2016 was cyclical — 2026 is structural.”

What is the K-shaped bifurcation in SaaS?

Post-dislocation divergence where platform incumbents with system-of-record status (Salesforce, Oracle, Microsoft) trend toward recovery, while horizontal point-solution SaaS (HubSpot, Monday.com) continues to face downward pressure. The split is structural — data ownership, integration depth, and workflow substitutability.

Why are all my software stocks going down in 2026?

Investors repriced the category when AI agents demonstrated they could replace the workflow tasks that per-seat licences support. The stocks that fell hardest are those most dependent on seat-count growth as their revenue engine.

What should technology leaders do about the SaaS valuation crash?

Understand which of your vendors sit in the downward arm of the K-shaped split. Anticipate pricing model changes in renewal conversations. Assess whether AI-assisted internal tooling has made build alternatives viable for point-solution tools. The vendor evaluation framework covers this in full.

How AI Agents Actually Attack SaaS Business Models — A Mechanism-Level Breakdown

The January 2026 SaaS market dislocation wiped over a trillion dollars from software market caps. But the stock moves are just the symptom. What’s actually happening underneath is structural — it’s about the specific mechanisms by which AI agents are attacking SaaS business models.

The best framework for understanding this is Pakodas’ eight disruption theories. They map out how agentic AI is eroding SaaS business model assumptions across multiple attack surfaces at the same time. We’ll run through all eight, then go deeper on the mechanisms that matter most: AI unbundling, the dumb pipe risk, the cost collapse, the deterministic/probabilistic divide, and where value is accumulating in the three-layer agentic stack. For the broader picture of the SaaS reckoning, start there.

What exactly are AI agents — and why do they threaten SaaS differently from previous AI tools?

An AI agent is an autonomous multi-step action system. It perceives its environment, plans a sequence of actions, and executes them end-to-end — without needing human instruction at each step. That’s what separates agents from copilots. Copilots wait for a prompt, return a result, and wait again.

Previous AI tools worked with the SaaS interface. Agents bypass it entirely. They pull data via APIs, execute workflows, and take actions without ever loading a dashboard.

Anthropic’s Claude Cowork launched on January 12, 2026 and added eleven plugins covering legal, finance, marketing, sales, product management, and data analysis — work that previously required a stack of specialised SaaS products. When it launched, LegalZoom’s stock dropped 20% in a single session. Not because Cowork is better than LegalZoom. Because document generation is now a commodity.

The consequences for SaaS are structural. Per-seat pricing is built on the assumption that human users need logins and dashboards. One AI agent can do the work of many human users — and agents don’t need seats. As Satya Nadella put it: “Business applications are essentially CRUD databases with a bunch of business logic. The business logic is all going to these agents.”

What are Pakodas’ eight theories of AI disruption — and how do they map to the SaaS business model?

Pakodas’ eight disruption theories aren’t a menu where only one turns out to be right. They are all happening at the same time, at different speeds across different market segments.

Theory 1 — The Superagent Eats the Interface. AI agents become the primary way humans interact with software — sitting above all apps, talking to their APIs. SaaS tools become back-end services; value creation moves to the agent layer.

Theory 2 — The Great Unbundling. A sales team paying $30,000 per year for Gong can now replicate the core feature with Claude Code in a weekend. Bundle pricing power erodes when users can acquire individual features at near-zero cost.

Theory 3 — The Uncertainty Tax. Even SaaS companies with strong fundamentals are getting repriced downward because markets can’t confidently model what their business looks like in five years. ServiceNow beat Wall Street expectations for the ninth consecutive quarter, raised guidance, and the stock dropped 11%.

Theory 4 — The $0 MVP. Traditional SaaS MVPs cost $500,000 to $1 million to build. AI coding tools collapse that to near-zero. The cost moat protecting SaaS from substitution is gone.

Theory 5 — Compound Engineering. A single developer using AI coding agents can maintain and ship five software products simultaneously. AI-native companies operate at roughly half the headcount of traditional SaaS at equivalent output. The evidence in AI-native growth rates is well documented.

Theory 6 — The Invisible App. Agents don’t need good design; they need good APIs. The SaaS product becomes invisible — a back-end service users never directly interact with.

Theory 7 — The Probabilistic Divide. Probabilistic SaaS — content creation, marketing automation, task management — faces the most acute displacement risk. Deterministic SaaS — payroll, ERP, healthcare records — retains more defensibility.

Theory 8 — The End of the Seat. AI agents do the work that previously required multiple human users. Salesforce is already experimenting with Agentic Enterprise Licence Agreements — flat-fee structures for companies deploying agents at scale.

Theories 2, 4, and 7 are the mechanically richest. The next sections go deeper.

How does AI unbundling work — and why can’t SaaS vendors just add AI to defend against it?

AI unbundling is the process by which AI agents let companies extract individual features from bundled SaaS products and replace them with purpose-built alternatives at near-zero cost.

The SaaS bundle worked because building custom software was expensive. Vendors packaged multiple features together and users paid for the bundle even if they only used part of it. That friction is now gone. A sales team paying $30,000 a year for Gong can replicate the core call analysis feature with Claude Code in a weekend.

Jasper AI is the canonical example. It peaked at roughly $90M ARR in 2023. Then ChatGPT commoditised its core content generation feature, revenue dropped, and both co-founders stepped down. A Retool survey of 817 builders found 35% had already replaced at least one SaaS tool with a custom build, and 78% expect to build more in 2026.

Adding AI features to an existing bundle doesn’t restore the bundle’s value proposition. The real competitive pressure is the bundle’s price competing against à la carte alternatives available at near-zero marginal cost.

What is the “dumb pipe” risk — and which SaaS vendors face it most acutely?

“Business applications are essentially CRUD databases with a bunch of business logic. The business logic is all going to these agents.” — Satya Nadella.

Here’s the dumb pipe scenario: an AI agent accesses a SaaS vendor’s data via API and executes all business logic externally. Interface value, workflow value, and intelligence value migrate to the agent layer. The vendor narrows to a data store with no differentiation.
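A minimal sketch of that scenario. The endpoint, field names, and prioritisation rule below are hypothetical, not any real vendor's API:

```python
# The dumb pipe scenario: an agent pulls raw records over the vendor's API
# and runs the business logic locally. Endpoint and fields are hypothetical.

import requests

def fetch_records(api_base: str, token: str) -> list[dict]:
    resp = requests.get(f"{api_base}/records",
                        headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    return resp.json()

def prioritise(records: list[dict]) -> list[dict]:
    # The workflow logic the SaaS UI used to own now lives in the agent layer.
    return sorted(records, key=lambda r: r.get("deal_value", 0), reverse=True)

# Every differentiated step (routing, scoring, escalation) now happens
# outside the vendor's product; the vendor is storage plus an API.
```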

The structural distinction that matters: systems of record hold authoritative, governed data. Systems of engagement are the workflow and UI layers humans use to interact with that data. Systems of engagement are the most vulnerable; systems of record retain defensibility as long as they protect the data moat.

The most exposed vendors: HubSpot (down 51% year-over-year), Monday.com (down 44%), Atlassian at the workflow layer (down 26.9%). Their differentiation is precisely what agents are replacing.

More defensible: Workday (HCM compliance data), Epic (healthcare records), SAP/Oracle ERP. Bain names Procore’s project cost accounting and Medidata’s clinical-trial randomisation as “core strongholds.” The evaluative question is simple: does this vendor’s primary value sit in the interface layer, or in proprietary governed data? How these mechanisms map to the Bain four-scenario framework gives you the full vendor evaluation methodology.

Why has the cost of building software alternatives collapsed?

Traditional SaaS MVPs cost $500,000 to $1 million to build. AI coding tools have collapsed that to near-zero. The mechanism is vibe coding: AI-assisted development using Cursor, Claude Code, and GitHub Copilot that lets a single developer replicate core SaaS functionality in days. The skill threshold has dropped from senior engineering teams to competent solo developers.

Compound engineering extends this further. A single developer with AI coding agents can maintain and ship five products simultaneously.

Cursor crossed $1B ARR in less than 24 months with approximately 300 employees — $3.3M ARR per employee, three to five times more efficient than the best public SaaS companies. Salesforce runs at roughly $800K ARR per employee. The full picture of what these growth trajectories mean for incumbents is worth a look.

The only remaining barriers to entering a SaaS category are proprietary data, compliance requirements that embed the vendor into regulated workflows, and network effects requiring multi-sided marketplace participation. Every SaaS renewal now needs a build assessment.

What is the difference between probabilistic and deterministic SaaS — and which category is more at risk?

The probabilistic vs. deterministic distinction is the single most actionable risk-stratification tool you have for evaluating a vendor portfolio.

A deterministic system produces the same outputs given the same inputs, every time — payroll, invoicing, compliance, clinical trials. Error tolerance is near-zero. Regulatory requirements embed the vendor deeply into your operations.

A probabilistic system is one where “good enough” is acceptable: content generation, marketing automation, project management, customer support routing. AI agents handle these workflows at near-zero marginal cost.

Defensible examples: Workday, SAP, Oracle, Epic, Procore, Medidata — vendors holding governed, regulated data that can’t be casually replicated.

Vulnerable examples: HubSpot, Monday.com, Zendesk, Jasper AI, Atlassian’s workflow layer. ServiceNow reported that AI agents now resolve 90% of IT and 89% of customer support requests autonomously inside its own operations — which is what probabilistic workflow automation looks like at scale.

One important nuance: deterministic is a spectrum. Vendors like Salesforce and Workday have probabilistic workflow layers built on top of their deterministic core data. The workflow UI can be unbundled even if the underlying system of record is defensible. Probabilistic tools get heightened build-vs-buy scrutiny at every renewal. The Bain four-scenario framework maps this directly to portfolio decisions.

Where does value accumulate in the three-layer agentic stack?

Bain’s three-layer agentic stack maps out where economic value is moving as AI agents mature.

Layer 1 — Systems of Record. The data repositories: Workday, Salesforce, SAP, Epic. Their edge is unique data structures, long activity histories, and built-in regulatory logic. They remain valuable as long as they hold proprietary, governed data — and lose that value if they become pure CRUD stores.

Layer 2 — Agent Operating Systems. The middleware that orchestrates actual work: Microsoft’s Azure AI Foundry, Google’s Vertex AI Agent Builder, Amazon Bedrock Agents. These systems plan tasks, remember context, and invoke tools. This is where substantial platform value is currently accumulating — Amazon, Google, and Microsoft are forecast to spend close to $500 billion on AI infrastructure in 2026 alone.

Layer 3 — Outcome Interfaces. How humans consume agent outputs: Teams, Slack, custom apps. Layer 3 is increasingly commoditised — building a custom outcome interface has near-zero cost with vibe coding.

The emerging battleground is the semantic layer between these three levels. Anthropic’s Model Context Protocol (MCP) and Google’s Agent2Agent (A2A) standardise how agents package tool calls, security tokens, and results as they move among layers — and both show the kind of network-effect dynamics where the first standard to achieve broad adoption takes most of the market.
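
For a sense of what “packaging tool calls” looks like on the wire, here is the approximate shape of an MCP tool invocation. MCP frames messages as JSON-RPC 2.0 and exposes a tools/call method; the tool name and arguments below are invented for illustration, so treat this as a sketch of the framing rather than a spec reference.

```python
import json

# Approximate shape of an MCP tool call (JSON-RPC 2.0 framing).
# The tool name and its arguments are hypothetical.
tool_call = {
    "jsonrpc": "2.0",
    "id": 42,
    "method": "tools/call",
    "params": {
        "name": "crm_lookup",                     # a tool some server exposes
        "arguments": {"account_id": "ACME-001"},  # schema defined by that tool
    },
}
print(json.dumps(tool_call, indent=2))
```

Standardising this envelope is what lets any agent talk to any tool server, which is why adoption has winner-take-most dynamics.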

Vendors concentrated in Layer 3 face the most exposure. Vendors with a defensible role in Layer 2 or irreplaceable data in Layer 1 are better positioned. For what this means for your software stack, see the practical implications piece.

Are AI agents replacing SaaS or augmenting it — and does the distinction matter for evaluating your stack?

Both are happening. Which one applies to a specific tool depends on the probabilistic/deterministic framework from the previous section.

The replacement scenario is real. Klarna consolidated 1,200 SaaS applications into an in-house AI stack. AI handled the workload of approximately 700 customer service agents and revenue per employee grew from $300K to $1.3M. But customer satisfaction declined and the company reversed course, rehiring staff. Full stack replacement carries operational risk.

The augmentation/starvation scenario is the more common near-term mechanism. Menlo Ventures documents enterprise AI spending growing from $1.7 billion in 2023 to $37 billion in 2025 — discretionary IT budget that would have funded SaaS expansion is now flowing to AI tools instead.

And augmentation still erodes per-seat revenue. If AI agents handle 40% of a team’s workflow, that team renews at a reduced seat count. Headcount-driven SaaS growth stalls even without product replacement.

For probabilistic tools, think in replacement terms — build-vs-buy assessment, migration planning. For deterministic tools, think in starvation and renegotiation terms — seat contraction, usage-based pricing conversations, renewal discipline.

The mechanism-level picture is where this analysis starts; for what it means for your software stack, see the full-context piece.

FAQ

What is the difference between an AI agent and a copilot?

Copilots require human prompting at each step. AI agents are autonomous: they perceive their environment, plan multi-step actions, and execute tasks end-to-end without per-step human guidance. Agents interact with systems directly through APIs — they don’t surface a UI.

What are Pakodas’ eight theories of SaaS disruption?

The eight theories are: (1) The Superagent Eats the Interface, (2) The Great Unbundling, (3) The Uncertainty Tax, (4) The $0 MVP, (5) Compound Engineering, (6) The Invisible App, (7) The Probabilistic Divide, and (8) The End of the Seat. They are additive — multiple theories can attack a single vendor simultaneously.

Which SaaS tools are most at risk from AI agents right now?

Probabilistic SaaS tools — those where output accuracy doesn’t need to be exact — face the most immediate risk. HubSpot, Monday.com, Zendesk, Jasper AI, and Atlassian’s workflow layer are the commonly cited examples. Deterministic tools — payroll, ERP, healthcare records, compliance — retain more defensibility.

What does “dumb pipe” mean in the context of SaaS?

“Dumb pipe” (Satya Nadella’s term) describes the risk scenario where a SaaS vendor becomes a passive data store — accessed by AI agents via API, with all business logic executed in the agent layer. The vendor loses interface differentiation and pricing power, reduced to a CRUD database. Vendors most at risk are those whose primary value is in the workflow and UI layer rather than in proprietary governed data.

Why did Klarna replace Salesforce CRM with an in-house AI stack?

Klarna consolidated 1,200 SaaS applications as part of an AI-first rationalisation. AI handled the workload of approximately 700 employees. Customer satisfaction declined and Klarna reversed course, rehiring staff. Full stack replacement is technically feasible but operationally risky at scale.

What is vibe coding and why does it matter for SaaS buyers?

Vibe coding is AI-assisted software development using Cursor, Claude Code, and GitHub Copilot. A developer can now replicate core SaaS functionality in days at near-zero marginal cost. This changes the build-vs-buy calculus: tools that were cost-prohibitive to build internally are now feasible alternatives, which gives you real renegotiation leverage at renewal.

Is Cursor really a threat to traditional SaaS companies?

Cursor crossed $1B ARR in 24 months with approximately 300 employees — $3.3M ARR per employee, three to five times more efficient than the best public SaaS companies. That growth rate validates Compound Engineering in practice: AI-native companies can reach significant commercial scale with structural cost advantages incumbents can’t easily replicate.

What is MCP (Model Context Protocol) and why does it matter?

MCP is Anthropic’s standard for how AI agents communicate across systems — packaging tool calls, security tokens, and results as they move between applications. Google’s Agent2Agent (A2A) is its counterpart. Both show strong network-effect dynamics — winner takes most — and the outcome will influence which agent OS platforms integrate most easily with existing SaaS stacks.

What is the three-layer agentic stack?

Bain’s three-layer agentic stack: Layer 1 (Systems of Record — Workday, Salesforce, SAP), Layer 2 (Agent Operating Systems — Azure AI Foundry, Vertex AI Agent Builder, Amazon Bedrock Agents), and Layer 3 (Outcome Interfaces — Teams, Slack, custom apps). Value is accumulating most rapidly at Layer 2. Vendors concentrated in Layer 3 face the greatest displacement risk.

Are per-seat SaaS contracts still appropriate in an agentic world?

Per-seat pricing is increasingly misaligned with AI agent economics. Agents don’t need logins or user licences. Gartner predicts over 30% of enterprise SaaS solutions will incorporate outcome-based pricing components. Salesforce’s Agentforce already offers Flex Credits at $0.10 per standard action. Ask your vendors about usage-based and outcome-based alternatives at your next renewal.

What is Pakodas’ “Uncertainty Tax” — and why does it matter?

The Uncertainty Tax (Theory 3): SaaS companies with strong fundamentals get repriced downward because markets can’t model their five-year trajectory under AI disruption. Depressed valuations affect vendor stability — financially pressured vendors may cut R&D, accelerate price increases, or become acquisition targets.

Does the probabilistic/deterministic distinction mean ERP and payroll vendors are completely safe?

No — it means they are more defensible, not invulnerable. Deterministic SaaS vendors still face the starvation scenario — budget contraction, seat reduction. Vendors like Workday and SAP have probabilistic workflow layers built on top of their deterministic core data. Those workflow layers remain vulnerable even if the underlying data store is defensible.

What Is AI Slop and What Does It Mean for the Internet’s Future

In December 2025, Merriam-Webster named “slop” its Word of the Year — “digital content of low quality that is produced usually in quantity by means of artificial intelligence.” They didn’t pick it because the word was new. They picked it because everyone was already drowning in it.

More than 20% of videos shown to new YouTube users are AI slop. 51% of all internet traffic now comes from bots. 86.5% of top-ranking Google pages contain some AI-generated content. The problem isn’t that AI creates bad content — it’s that the internet’s architecture can’t tell the difference between what’s useful and what’s filler. The evidence for how widespread AI slop has become is now quantified across every major platform.

This guide maps the problem and what you can do about it:

What is AI slop and where did the term come from?

The word migrated from social media into the dictionary in under eighteen months. It started as shorthand for AI-generated images clogging Facebook feeds — the six-fingered hands, the shrimp Jesus portraits — then expanded to cover any low-effort AI output published at scale. What separates slop from ordinary bad content is the economics: generative AI dropped production costs to near zero, so volume replaced quality. AI slop YouTube channels pull an estimated $117 million per year in ad revenue. The full picture of what AI slop is and how it spread — from Facebook image feeds to enterprise workflows — covers the definition, the business model, and the algorithmic amplification that makes it self-sustaining.

Go deeper: AI Slop Is Everywhere Now, and Here Is the Evidence

How much of the internet is now AI-generated content?

More than most estimates predicted. Ahrefs found 86.5% of top Google pages contained AI-generated text, though the ranking correlation was just 0.011. Google isn’t rewarding it, but when AI produces thousands of articles per hour, signal-to-noise degrades for anyone relying on organic discovery. The consequences of this volume for how AI slop is damaging search and e-commerce are measurable and concrete — a 400% surge in AI-generated Amazon reviews and a structural shift in who wins search visibility.

Go deeper: AI Slop Is Everywhere Now, and Here Is the Evidence | How AI Slop Is Reshaping Google Search Rankings and E-Commerce Trust

How is AI slop affecting Google search and e-commerce?

Gartner projects a 25% drop in traditional search volume by 2026 as users shift to AI answer engines. AI-generated product reviews on Amazon have surged 400%, eroding trust signals buyers depend on. NerdWallet illustrates the shift — revenue climbed 35% even as organic traffic fell 20%, suggesting the old SEO playbook is breaking down. Understanding how AI slop is reshaping Google rankings and e-commerce trust in granular detail explains why the 0.011 correlation between AI content and ranking position matters more than it first appears.

Go deeper: How AI Slop Is Reshaping Google Search Rankings and E-Commerce Trust

What is vibe citing and why does it matter for AI research?

Vibe citing is what happens when researchers let AI draft their reference lists. The citations look plausible but the papers don’t exist. GPTZero analysed 4,841 papers accepted at NeurIPS 2025 and found over 100 hallucinated citations across 51 of them. These papers become training data for next-generation models — fabricated references propagating through every system built on that data. The full investigation into vibe citing and the academic integrity failure at NeurIPS shows how a 55% growth in the error rate since ChatGPT’s release is reshaping what peer review can be trusted to catch.

Go deeper: Vibe Citing and the Collapse of Peer Review at the World’s Top AI Conference

How does AI slop damage future AI systems?

Ilia Shumailov’s team at Oxford documented “model collapse” in a Nature 2024 paper. When AI models train on AI-generated content, each generation loses fidelity — like photocopying a photocopy. Minority perspectives disappear first. If you’re building or fine-tuning AI, the training data you collect tomorrow will contain more slop than what you collected last year. The detailed technical explanation of model collapse and the entropy spiral — including what this means for SMB teams fine-tuning on internal corpora — is covered in depth separately.

Go deeper: The Entropy Spiral: How AI Slop Degrades Future AI Systems Through Recursive Training

What is the Dead Internet Theory and is it real?

The Dead Internet Theory started as a fringe conspiracy — the idea that most online activity is bots, not humans. With 51% of web traffic now automated and AI content on every major platform, the data is catching up to the theory. The internet isn’t literally dead, but the assumption of human authorship no longer holds. This shifts the ground under your content strategy, engagement metrics, and data collection. Provenance verification, data quality gates, and content governance are the practical responses to a world where you can no longer assume the source of what you’re reading — or training on.

Go deeper: AI Slop Defence: Provenance Verification, Data Quality Gates, and Content Governance

Why is AI slop hard to detect — and what can you do about it?

If the assumption of human authorship no longer holds, identifying AI-generated content becomes the next question. Detection tools struggle because AI text mimics human patterns convincingly, watermarking isn’t universal, and paraphrasing strips it. The deeper issue is that detection is framed as binary — AI or human — when most content sits on a spectrum.

Reliable defence requires provenance systems that track content origin from creation through publication. Start with your own data pipelines: implement quality gates before content enters training sets or knowledge bases, adopt provenance standards like C2PA, and build governance frameworks that define acceptable AI use. The three-layer approach to what to do about AI slop — combining provenance-at-source, data quality gates, and human-in-the-loop curation — is the practical framework for SMB technology teams.

Go deeper: AI Slop Defence: Provenance Verification, Data Quality Gates, and Content Governance

What is Answer Engine Optimisation (AEO) and how does it replace SEO?

As search shifts from link lists to AI-generated answers, the optimisation game changes. AEO focuses on structuring your content so AI systems can extract, attribute, and cite it accurately. NerdWallet’s numbers show visibility in AI answers can outweigh traditional search rankings. Start with structured data, direct answers to specific questions, and authoritative sourcing. The strategic picture of the post-SEO web and content authenticity — including where the 2026–2028 content landscape is heading and how digital provenance becomes a competitive signal — maps the full transition from SEO to AEO.

Go deeper: The Post-SEO Web: Answer Engine Optimisation, Digital Provenance, and the Authenticity Advantage


Frequently Asked Questions

What is the difference between AI slop and legitimate AI-generated content?

AI slop is defined by intent and quality outcome: content produced in volume to fill space, drive ad revenue, or game search rankings — not to inform. Legitimate AI-assisted content involves meaningful human input at the planning, editing, and quality-review stages. The distinction is not about which tool was used but whether a human with domain knowledge shaped the output and stands behind its accuracy. When the volume objective overrides the quality objective, slop is the result. The full evidence base for how widespread AI slop has become shows what that volume looks like at scale.

What happened at NeurIPS 2025 with fabricated citations?

GPTZero analysed 4,841 papers accepted at NeurIPS 2025 and found over 100 fabricated citations in 51 of them — citations to papers, journals, and authors that do not exist, generated by AI and not caught by peer review. The term “vibe citing” was coined to describe the pattern: AI-assisted writing that references sources plausibly but without verifying that they exist. The NeurIPS error rate has grown 55% since ChatGPT’s release. The full investigation into vibe citing and the collapse of peer review covers how these fabrications passed review and what it means for trusting AI research.

What is model collapse and why should technology teams care?

Model collapse is what happens when AI models are trained on datasets containing AI-generated content: successive model generations produce outputs of progressively lower quality, losing rare and diverse viewpoints and converging on statistically average outputs. The concern for technology teams is not only theoretical — any fine-tuning pipeline that ingests internal corpora (support tickets, documentation, code comments) may already be training on ChatGPT-assisted content, making the collapse loop a present risk rather than a future one. The technical mechanics of the entropy spiral and model collapse explain the feedback loop and how to detect early signs of degradation.

AI slop vs spam — what is the difference and why does slop matter more?

Spam is identifiable by signature: repetitive patterns, known sender domains, structural tells that filters can learn. AI slop is fluent, structurally sound, and often passes automated quality checks. Spam filters work because spam has consistent technical signatures; slop detection fails because slop is indistinguishable at the surface from good content. The consequence is that slop contaminates training data, search indexes, and knowledge bases in ways that spam never could — it is credible enough to be propagated rather than quarantined. The practical answer to this challenge is a provenance-first defence strategy that addresses authenticity at the source rather than relying on detection after the fact.

SEO vs AEO — which should I focus on for my content strategy in 2026?

Both, transitionally. Traditional SEO still drives significant traffic for most publishers, and abandoning it prematurely is costly. However, the structural shift toward AI answer engines (Gartner’s 25% volume decline by 2026, NerdWallet’s traffic-down/revenue-up pattern) is happening fast enough that building AEO capability now is lower risk than waiting. The good news is that AEO best practices — authoritative authorship, verifiable claims, structured data, E-E-A-T signals — are also what distinguishes quality content from AI slop in search algorithms. The full strategic framework for the post-SEO web and content authenticity covers the transition in detail.

How do I protect my company’s RAG systems from AI slop contamination?

The primary defence is a data quality gate at the point of ingestion — applying provenance checks and content validation before AI-generated material enters the knowledge base, rather than attempting to clean it out afterwards. C2PA content credentials can verify that documents were human-authored; for sources without provenance metadata, human review of high-stakes content categories (policy documents, technical specifications, customer-facing knowledge articles) is the most reliable fallback.

Go deeper: AI Slop Defence: Provenance Verification, Data Quality Gates, and Content Governance

The Post-SEO Web: Answer Engine Optimisation, Digital Provenance, and the Authenticity Advantage

The web’s information layer is shifting under your feet. ChatGPT, Perplexity, Google AI Overviews, and Gemini are now the first stop for queries that used to hit a list of blue links. Meanwhile, AI-generated content — what the Reuters Institute calls “AI slop” — is flooding the web with low-quality material that’s eroding trust in traditional search.

These aren’t two separate problems. They’re the same infrastructure crisis playing out from different angles. And out of that crisis two things are emerging: a new discipline called Answer Engine Optimisation, and a new competitive advantage called digital provenance. For technology companies in the 50–500 employee range, getting your head around both — and how they connect — is what separates a content strategy that compounds in value from one that quietly becomes irrelevant.

This article is part of our comprehensive series on what AI slop is doing to the internet. If you’ve already read about how AI slop is reshaping Google search rankings and e-commerce trust or the provenance infrastructure that defends against it, this is where those two threads come together into something actionable for the 2026–2028 window.

Why Is Traditional SEO Losing Ground in the Age of AI Search?

Traditional SEO was built for a link-list world. It optimises for ranking position, click-through rates, and dwell time — all of which assume users browse to source pages. That assumption is breaking down.

Publishers expect search engine referral traffic to fall by 43% over the next three years, according to the Reuters Institute’s 2026 trends report. Google’s AI Overviews now appear for more than half the keywords tracked at Backlinko. Semrush projects that LLM referrals will overtake traditional Google organic search by the end of 2027, off the back of an 800% year-over-year increase measured over just three months.

AI slop has weaponised the same techniques that built SEO traffic in the first place. Pink slime sites — Reuters Institute’s label for automated AI-generated content farms — produce content that’s indistinguishable from legitimate sources in a link-list result. In France alone, journalist Jean-Marc Manach identified more than 4,000 fake news websites powered by generative AI, all set up to game Google’s algorithms.

It’s a self-reinforcing loop: cheap AI content degrades signal quality in traditional search, which accelerates users moving to AI answer engines that synthesise responses rather than returning link lists. A content strategy built entirely for traditional SEO is increasingly a bet on a declining channel. (For the full current-state breakdown, see how traditional SEO is being degraded by AI content.)

What Is Answer Engine Optimisation (AEO) and How Does It Differ from SEO?

Answer Engine Optimisation is the practice of structuring your content so that AI-driven answer engines — ChatGPT, Perplexity, Google AI Mode, voice assistants — choose to cite and synthesise it in generated responses. The Reuters Institute flagged AEO as a key term to watch in 2026.

The core shift is from rankability to citability.

SEO optimises for ranking signals: backlinks, domain rating, keyword density, page authority. AEO optimises for whether an AI engine trusts and quotes your content — or passes over it. What earns that trust: clearly structured arguments with direct, extractable answers; verifiable claims linked to primary sources; authoritative authorship signals; FAQ and schema markup that AI crawlers can reliably parse; and LLMs.txt, an emerging convention that gives LLM crawlers a curated map of the content you most want them to read.
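
As a concrete example of the markup piece, the sketch below emits schema.org FAQPage JSON-LD, one of the structured-data formats AI crawlers parse most reliably. The question and answer strings are placeholders.

```python
import json

def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build schema.org FAQPage JSON-LD for embedding in a
    <script type="application/ld+json"> tag."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }, indent=2)

print(faq_jsonld([
    ("What is AEO?",
     "Structuring content so AI answer engines can extract and cite it."),
]))
```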

What doesn’t transfer from SEO to AEO: keyword stuffing, thin content padded for word count, link-building tactics that accumulate links without signalling any real domain expertise.

AEO success isn’t measured in site visits. The right metric is citation share — how often your brand and content appear in AI-generated answers for your target queries. NerdWallet’s revenue rose 35% in 2024 while monthly traffic fell 20%. That tells you everything about how discovery and decision-making are shifting to AI-mediated experiences where the click is optional.

The bridge concept between your existing SEO investment and AEO is E-E-A-T — Google’s trust framework covering Experience, Expertise, Authority, and Trust. Content teams that have invested in E-E-A-T signals are in much better shape for AEO than those optimising for keywords alone. Research shows 99% of URLs appearing in AI Mode results come from the top 20 organic search results — which means foundational SEO still matters, but ranking position alone doesn’t guarantee citations.

What Is Generative Engine Optimisation (GEO) — and Is It the Same as AEO?

GEO is the Backlinko and eMarketer label for the same thing as AEO: optimising for citation and synthesis by AI-driven answer engines. GSO (Generative Search Optimisation) is a third label, this one from Digiday’s coverage.

The difference between AEO, GEO, and GSO is naming lineage, not practice. All three land on the same core tactics: structured content, authority signals, verifiable sourcing, FAQ markup, co-citations, and schema. As Backlinko’s Leigh McKenzie puts it: “Isn’t this just SEO with a different name? In many ways, it is. But there’s a reason everyone’s talking about it… it reflects a real shift.”

GEO introduces one concept worth pulling out specifically: co-citation and co-occurrence building. If your brand is consistently mentioned alongside authoritative sources in the context of a particular topic, AI engines infer domain authority. Unlike link building — which accumulates PageRank through inbound links — co-citation building focuses on earning brand mentions across Reddit, LinkedIn, industry publications, and sector surveys. Pages with quotes or statistics have 30–40% higher visibility in AI answers, according to academic research cited by Backlinko.

The terminology will settle eventually. What matters more right now is internal alignment: pick one label, build a shared vocabulary, and get moving. Don’t still be debating nomenclature while your competitors are executing.

How Does Digital Provenance Become a Competitive Advantage, Not Just a Defence?

Digital provenance is the verifiable record of a piece of content’s origin, authorship, creation method, and modification history. The Reuters Institute defined it as a key 2026 term: “the ability to verify the origin and history of digital media in an AI-infused world where sophisticated deep-fakes are becoming more common.”

The defensive case is covered in our article on C2PA and provenance as the foundation for AEO credibility. But there’s a second, more interesting case.

AI answer engines that are trained to prefer trustworthy sources will increasingly use provenance metadata as a trust signal. Authenticity and provenance are already listed as emerging AEO trust practices in Tryprofound’s 2026 AEO guide — alongside a recommendation to incorporate digital watermarking and provenance indicators (Adobe Content Credentials, SynthID) to signal authenticity.

Gartner has placed digital provenance among its top 10 technology trends through 2030. The Digital Authenticity and Provenance Act 2025 requires organisations to be transparent about their content verification practices — and regulatory momentum like that compresses the early-adopter window pretty quickly.

The technical backbone is C2PA (Coalition for Content Provenance and Authenticity), an open standard developed by Adobe, Microsoft, Sony, and major publishers. C2PA’s Content Credentials work like a nutrition label for digital content — cryptographic signatures link each modification to a specific actor, and the smallest change creates a completely different hash value, making tampering instantly detectable.

Here’s the dual-use argument worth sitting with: the same C2PA infrastructure that defends against AI impersonation is also an AEO offensive asset. Implementing it before provenance becomes a standard citation-selection criterion gives you a citation-share advantage that’s genuinely hard to acquire retroactively.

What Does the Authenticity Advantage Look Like in Practice?

In markets flooded with polished, optimised, or AI-generated content, verified authenticity becomes scarce. And scarcity creates competitive value. The Reuters Institute says it plainly: “Trusted, high-quality, accurate content will be increasingly valued in a world of AI slop, deep-fakes, and toxic social media debates — many executives believe this is a structural advantage.” As the AI slop epidemic overview documents, this dynamic is already reshaping how audiences and platforms assign trust.

The 2026–2028 window is when this advantage could become structurally durable. Provenance infrastructure is available now but not yet platform-mandated. Early adopters gain citation share before compliance requirements level the playing field and the advantage becomes table stakes rather than genuine differentiation.

Here’s what a practical 2026–2028 content strategy actually looks like:

  1. AEO-structured content: Question-based headings, FAQ format, schema markup (Article, FAQ, DefinedTerm), and LLMs.txt implementation — the foundational technical signals AI crawlers prefer
  2. C2PA provenance metadata: On key content assets — technical posts, case studies, whitepapers — to build the authenticity signal that AEO trust frameworks are beginning to incorporate
  3. Co-citation building: Brand mentions alongside authoritative industry sources across Reddit, LinkedIn, industry surveys, and sector publications — the contextual authority signals AI engines use to determine domain expertise
  4. AI search visibility measurement: Track how often your brand is cited in AI-generated answers across ChatGPT, Perplexity, and Google AI Overviews. Tools for this are still early as of 2026 — manual spot-checking is the current approach
  5. A clear SEO-to-AEO transition strategy: AEO and SEO are additive, not competing — SEO drives traffic to your site; AEO builds brand visibility in AI answers

The integration thesis is straightforward: investment in provenance plus AEO strategy plus understanding of current impact adds up to a content posture that holds its value as AI answer engines increasingly mediate discovery. You can read the full picture of the AI slop epidemic in the pillar article.

Frequently Asked Questions

Is AEO just SEO with extra steps?

AEO and SEO are complementary but distinct. They optimise for different criteria, not additional ones. SEO optimises for rankability (links, keywords, page signals). AEO optimises for citability (authority, structure, verifiability). Some SEO foundations transfer cleanly — E-E-A-T, structured data, authoritative backlinks. Others don’t — keyword density, link-building volume tactics that accumulate links without signalling domain expertise. Run both in parallel; prioritise AEO where your content is structured and authoritative enough to be worth citing.

Does verifying content with C2PA actually improve AI search rankings?

Not currently confirmed as a direct ranking signal in any AI answer engine. However, authenticity and provenance are emerging as AEO trust practices — digital watermarking and provenance signals may well become standard expectations as answer engines develop more sophisticated quality frameworks. The strategic bet is to implement before it’s confirmed or mandatory, so you gain a first-mover citation-share advantage while it’s still available.

What is LLMs.txt and should my company implement it?

LLMs.txt is an emerging web convention — analogous to robots.txt — that lets site owners publish a curated, machine-readable index of the content they most want LLM crawlers to read. It’s not yet a formal standard, but some AI crawlers are beginning to consume it. For SMB tech companies, implementing LLMs.txt for content pages (blog, product documentation, technical articles) is a low-effort AEO signal that makes your content more parseable by AI crawlers. Recommended as a baseline step.

What is the difference between AEO, GEO, and GSO?

All three describe the same discipline: optimising content to be cited and synthesised by AI-driven answer engines. AEO is the Reuters Institute framing; GEO is the Backlinko/SEO-practitioner framing; GSO is the Digiday label. No standard taxonomy exists yet. Align internally rather than waiting for industry consensus — the terminology will settle, but the discipline won’t wait for it.

What is E-E-A-T and does it apply to AEO?

E-E-A-T (Experience, Expertise, Authority, Trust) is Google’s trust framework for evaluating content credibility. Content strategies that have invested in E-E-A-T signals — authoritative bylines, primary source citations, depth of expertise — are better positioned for AEO than thin, keyword-padded content. It’s the bridge concept: what SEO teams already know that transfers most cleanly to AEO.

How do I measure AEO success if there are no click-through rates?

Replace click-through rate with citation share: how often your brand and content appear in AI-generated answers for your target queries. Key metrics are Visibility Score (how often your brand is mentioned in an AI response), Citation Score (how often your domain is cited), Sentiment Score, and Accuracy Score. Tools are still early as of 2026 — manual spot-checking across ChatGPT, Perplexity, and Google AI Overviews is the current approach.

How is AI slop affecting content marketing for B2B tech companies?

AI slop raises the bar for content that earns citation share. As AI-generated content floods every topic area, answer engines apply stricter selection criteria to surface trustworthy sources. The zero-click economy doesn’t hurt B2B tech companies as badly as it hurts publishers — brand citation without a click still builds authority and awareness in your target audience. In a world where referral traffic declines but AI-mediated brand recognition grows, the metric that matters is whether your company shows up in the answers your prospects are already receiving.

Is digital provenance just for media companies, or does it apply to tech product companies?

Digital provenance started with media companies and photojournalism. It applies equally to any content-dependent product: technical blogs, product documentation, case studies, whitepapers, developer content. Provenance metadata on those assets — recording authorship, creation method, and review status — signals the quality and verifiability that AI answer engines are beginning to prefer when selecting what to cite.

AI Slop Defence: Provenance Verification, Data Quality Gates, and Content Governance

AI slop has moved well past being a social media nuisance. It is now infiltrating the training pipelines, UGC systems, and content workflows that SMB tech companies depend on — and the instinctive response, dropping in a detection tool and calling it done, is not going to cut it.

This article covers a three-layer defence architecture sized for companies with 50–500 employees: a small data team, a production product to maintain, no dedicated ML platform organisation. The three layers are provenance-at-source using C2PA, data quality gates for training pipelines, and human-in-the-loop curation for high-stakes decisions. We also compare GPTZero vs. Originality.ai vs. manual review so engineering managers can match the right tool to the right use case.

Before we get into the defence layers, it helps to understand what you are actually defending against. For that, see our overview of the AI slop threat landscape.

Why Are Detection-Only Approaches to AI Slop Insufficient?

Detection tools like GPTZero and Originality.ai classify content after it has been created. They sit downstream of the problem. Adnan Masood, chief AI architect at UST, put it plainly: “I’ve seen teams auto-draft FAQs and knowledge base articles, ship them and then feed those same pages back into RAG as retrieval sources. A month later, you’re no longer retrieving trusted institutional knowledge; you’re retrieving yesterday’s synthetic filler.”

Accuracy is shakier than vendors would like you to believe. ZDNet’s October 2025 test series across 11 dedicated AI detectors found GPTZero and Originality.ai both scoring 80% accuracy. Copyleaks, which markets “99% accuracy backed by independent third-party studies,” declared clearly human-written content to be 100% AI-generated in the same test. Undetectable.ai scored 20% accuracy as a detector — having previously scored 100%. That kind of volatility tells you everything you need to know about accuracy guarantees.

Anti-detection tools make things worse. Services like Undetectable.ai and Bypass.ai rewrite AI-generated text specifically to evade classifiers. As enterprise adoption of detection tools grows, so does the commercial incentive to build better evasion tools. False positives compound the problem further: non-native English speakers writing technical content produce text that closely mimics AI-generation patterns. Systematically rejecting it is both a governance failure and a data quality failure.

Gartner’s position captures it well: “As AI-generated data becomes pervasive and indistinguishable from human-created data, a zero-trust posture establishing authentication and verification measures, is essential to safeguard business and financial outcomes.” Detection-only is not a zero-trust posture. Detection should be a second layer, not the primary defence.

What Is Provenance Verification and Why Is It the Durable Defence?

Provenance verification asks a different question than detection. Detection asks: does this content look AI-generated? Provenance asks: where did this content come from?

Dr. Manny Ahmed, CEO of OpenOrigins — cited by the BBC on this point — framed it directly: “We are already at the point where you cannot confidently tell what is real by inspection alone. Instead of trying to detect what is fake, we need infrastructure that allows real content to publicly prove its origin.”

The structural advantage is simple. A valid cryptographic signature from a human-operated tool cannot be spoofed by an anti-detection rewriter. The rewriter can change every word. It cannot forge a valid cryptographic signature from the original creation tool. The provenance chain is either present and unbroken, or it is absent — an anti-detection rewriting tool does not produce fake provenance, it destroys real provenance.

The Content Authenticity Initiative puts it clearly: “Detection tools will always be in an arms race against bad actors, requiring regular updates to improve their accuracy.” Provenance is not in that arms race.

One limitation worth being upfront about: C2PA adoption is not universal in 2026. The provenance approach works best in controlled-intake workflows — partner-submitted content, in-house editorial production — where you can require C2PA signing as a condition of submission. For genuinely open UGC at internet scale, detection remains necessary. The two layers are complementary, not mutually exclusive.

How Does C2PA Work and How Do You Integrate It?

The Coalition for Content Provenance and Authenticity (C2PA) is a Joint Development Foundation project founded in 2021 by Adobe, Arm, BBC, Intel, Microsoft, and Truepic. It publishes royalty-free, open technical specifications for attaching verifiable provenance metadata to digital media. Adobe Content Credentials is the consumer-facing brand for the same standard.

How it works. At signing, the originating tool assembles assertions about the content — who created it, when, which tools were used, whether AI was involved — and signs the claim with a private key whose certificate chains to a trusted Certificate Authority. The signed manifest is stored inside the file in a JUMBF container. At verification, any C2PA-compliant tool validates the certificate chain, checks the cryptographic hash against the current file bytes, and reports pass or fail. No network call required — all certificates travel inside the manifest. Unlike EXIF or IPTC metadata, the C2PA manifest breaks if tampered with.

2026 adoption. C2PA v2.2 (stable, May 2025) supports JPEG, PNG, WebP, AVIF, HEIC, MP4, MOV, and PDF. Hardware support includes the Leica M11-P, Sony α9 III, and Google Pixel 10, which signs every photo by default using hardware-backed keys. Adobe Photoshop, Lightroom, Premiere Pro, Adobe Firefly, OpenAI DALL-E 3, Sora, and Google Imagen all support C2PA signing.

SMB integration steps. For controlled-input workflows: require C2PA-signed files as a condition of submission — unsigned files are flagged at intake. For training data pipelines: treat C2PA metadata as a positive confidence signal; documents without it are not auto-rejected, but get a lower provenance confidence score. Implementation uses the open-source C2PA JavaScript SDK or Python library on GitHub — feasible with a 1–2 week engineering effort. The EU AI Act’s transparency labelling requirement (effective August 2026) is satisfied by C2PA’s AI assertion type.
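
A minimal sketch of that intake policy follows. The verify_manifest helper is hypothetical: in practice it would wrap the C2PA Python library’s manifest-reading call, whose exact signature you should confirm against the library’s documentation.

```python
from dataclasses import dataclass
from enum import Enum

class Intake(Enum):
    ACCEPT = "accept"    # valid manifest, certificate chain verified
    FLAG = "flag"        # no manifest: lower provenance confidence, not rejection
    REJECT = "reject"    # manifest present but fails validation

@dataclass
class ManifestResult:
    present: bool
    valid: bool

def verify_manifest(path: str) -> ManifestResult:
    """Hypothetical wrapper around the open-source C2PA library's
    manifest reader; confirm the real call against its docs."""
    raise NotImplementedError

def intake_decision(path: str) -> Intake:
    m = verify_manifest(path)
    if not m.present:
        return Intake.FLAG   # unsigned files are flagged at intake, per policy
    return Intake.ACCEPT if m.valid else Intake.REJECT
```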

Known limitations. Strip attacks are the primary vulnerability: a non-C2PA tool can save a file without the manifest container, silently removing all credentials. Retroactive signing is not possible — existing content without C2PA metadata cannot be signed after the fact.

How Do Data Quality Gates Work in an AI Training Pipeline?

A data quality gate is a filtering stage that validates incoming content against defined rules before ingestion — stopping contamination upstream rather than cleaning up after training has already run.

Filter before ingestion, not after training. A contaminated corpus requires retraining from scratch or expensive data-cleaning passes. Nature Medicine research found that replacing just 0.001% of training tokens with misinformation caused models to generate 7–11% more harmful completions. At SMB fine-tuning scales, contamination effects show up faster. This is the practical prevention mechanism for the model collapse entropy spiral documented by Shumailov et al. in Nature (2024) — covered in our article on the model collapse mechanism these defences prevent.

A four-signal filtering framework, in order of reliability.

C2PA provenance metadata — cryptographic, not gameable by text rewriting. Content with valid C2PA signing gets the highest confidence score. C2PA content is trusted; content without C2PA is not auto-rejected, but flagged for secondary review.

AI detection score (GPTZero or Originality.ai) — medium reliability, gameable by anti-detection rewriters. Use as a secondary signal. WitnessAI’s governance framework recommends combining detection scoring with provenance tracking.

Vocabulary diversity metrics — AI-generated text tends toward lower type-token ratio, higher phrase repetition, and characteristic sentence-length distributions. Flag statistical outliers in the bottom quartile for secondary review. This signal is harder to game than a detection classifier because it measures distributional properties rather than learned patterns.

Source provenance score — content from known high-quality sources (academic publishers, verified news outlets) gets higher baseline confidence than anonymous web scrapes.

Threshold setting. Set thresholds conservatively. A document scoring 70% AI-content probability should route to human review, not auto-rejection. Auto-rejection below 90% confidence will remove legitimate content at meaningful rates. If you cannot answer “where did this training data come from and how confident are we that it is human-authored?” — your pipeline has a governance gap.
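
Putting the four signals and the conservative thresholds together, here is a minimal sketch of the gate. The weights and cut-offs are illustrative rather than calibrated values, and the type-token ratio function stands in for the vocabulary-diversity signal described above.

```python
from dataclasses import dataclass

def type_token_ratio(text: str) -> float:
    """Vocabulary diversity: unique tokens over total tokens. AI-generated
    text tends toward lower values than comparable human writing."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

@dataclass
class Signals:
    c2pa_valid: bool       # signal 1: cryptographic provenance
    ai_probability: float  # signal 2: detector score, 0..1 (1 = likely AI)
    ttr: float             # signal 3: vocabulary diversity
    source_score: float    # signal 4: baseline confidence in the source, 0..1

def route(s: Signals) -> str:
    """Conservative routing: borderline documents go to human review,
    and only very high detector confidence auto-rejects."""
    if s.c2pa_valid and s.ai_probability < 0.5:
        return "ingest"
    if s.ai_probability >= 0.9 and not s.c2pa_valid:
        return "reject"
    if s.ai_probability >= 0.7 or s.ttr < 0.35 or s.source_score < 0.3:
        return "human_review"   # a 70% score routes to review, not rejection
    return "ingest"
```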

Auditing existing corpora. Run the corpus through batch detection, flag everything above 60% AI probability, then analyse vocabulary diversity across the flagged set. Sample 100–200 documents for human review to establish your domain-specific false positive rate. Document the methodology — this becomes your training data governance record.

When and Where Should You Use Human-in-the-Loop Curation?

Human-in-the-loop (HITL) curation is the escalation layer for pipeline decisions where automated systems handle things poorly: borderline detection scores, high-stakes training data, novel evasion patterns, and consequential UGC.

HITL is not a replacement for Layers 1 and 2. GPTZero’s batch analysis of 4,841 NeurIPS 2025 accepted papers found 100 confirmed hallucinated citations across 51 papers — in pipelines that already included multiple rounds of human peer review. The combination outperforms either approach alone.

Four pipeline stages that warrant HITL at SMB scale: documents scoring high AI-detection confidence (above 70–80%) with no C2PA metadata; borderline detection scores (50–80% AI probability) — sample 5–10% quarterly to calibrate your false positive rate; high-stakes UGC with consequential downstream use like product reviews or customer-facing model training data; and novel evasion patterns in review logs — when content classified as human shows anti-detection tool signatures, escalate and update thresholds.

Cost-benefit. A trained reviewer processes 300–500 documents per hour at $0.06–$0.17 per document — 6–17 times the automated detection cost. The economics work when HITL is the escalation layer: automated detection handles 90–95% of decisions; human review handles the borderline and high-stakes 5–10%. The blended per-document cost drops to roughly $0.01–$0.03 across the full pipeline at those ratios.
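
The arithmetic behind the blended figure is worth running yourself; the sketch below uses the per-document costs from this section, and the escalation share is the variable to tune.

```python
def blended_cost(auto_cost: float = 0.01,
                 manual_cost: float = 0.17,
                 manual_share: float = 0.10) -> float:
    """Every document passes automated detection; only the escalated
    share also gets human review."""
    return auto_cost + manual_share * manual_cost

print(blended_cost(manual_cost=0.06, manual_share=0.05))  # ~ $0.013 per document
print(blended_cost(manual_cost=0.17, manual_share=0.10))  # ~ $0.027 per document
```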

Use a three-tier queue: auto-approve (low detection score, C2PA present) → HITL review (borderline or high-stakes) → reject (very high detection score, no mitigating signals). Measure false positive rates quarterly and adjust thresholds. A well-tuned gate should reject no more than 1–3% of content that human reviewers would pass.

GPTZero vs. Originality.ai vs. Manual Review: Which Fits Your Use Case?

This is an analytical comparison based on published methodology, documented use cases, and independently tested performance — not live benchmark results. Accuracy figures are drawn from ZDNet’s October 2025 five-test series across 11 detectors, and from each tool’s published documentation.

GPTZero scored 80% accuracy in ZDNet testing. It is designed for longer-form text and handles academic and editorial corpus auditing at scale reasonably well. Watch for elevated false positive rates on technical and scientific writing, and poor performance on content under 100 words. Cost: approximately $6–$10 per 1,000 documents. Best suited for long-form academic and editorial corpus auditing. Vulnerable to anti-detection rewriting.

Originality.ai also scored 80% in ZDNet testing. It is purpose-built for web-facing, SEO-adjacent content — its Amazon reviews analysis (26,000 reviews, 400% increase in AI-generated reviews since ChatGPT launch) demonstrates the content type it is optimised for. A 100% false positive on clearly human-written content in ZDNet testing is a documented failure mode worth knowing about. Cost: approximately $10 per 1,000 documents. Better suited than GPTZero for UGC screening and web-scraped corpus auditing. Also vulnerable to anti-detection rewriting.

Manual review is the only approach with no evasion vulnerability. A skilled domain expert can exercise contextual judgement — evaluating plausibility, coherence, citation authenticity — that no classifier currently replicates. Cost: approximately $60–$170 per 1,000 documents. Right for escalation, not primary filtering.

Use GPTZero for batch auditing of long-form editorial training data. Use Originality.ai for UGC screening and web-scraped dataset auditing. Use manual review as the escalation layer for borderline scores and high-consequence decisions.

For the business risk framing that should inform how you size these investments, see the business risk context for UGC protection. For UGC-specific protection, the next section covers the practical steps.

How Do You Protect a UGC System from AI-Generated Fake Submissions?

Originality.ai’s analysis of 26,000 Amazon reviews found a 400% increase in AI-generated reviews since November 2022, with extreme 1-star and 5-star reviews 1.3 times more likely to be AI-generated than moderate reviews. AI bots account for one in every 31 visits to publisher websites in Q4 2025 (TollBit data), up from one in 200 in Q1 2025.

The practical implication: you cannot assume submitting accounts are human, and you cannot rely on detection alone.

Friction-based approaches. Account age requirements (30 days minimum before high-trust content is surfaced) raise the cost of automated abuse without depending on detection accuracy. Verified Amazon reviewers were 1.4 times less likely to have AI-generated content than unverified reviewers — friction-based verification is a meaningful signal. Submission rate limiting (flag accounts submitting more than 5 reviews per day) catches throwaway account abuse.
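
A sketch of those friction rules as a submission-time check; the thresholds mirror the ones above, and the account fields are assumptions rather than any specific platform’s schema.

```python
from datetime import datetime, timedelta, timezone

MIN_ACCOUNT_AGE = timedelta(days=30)
MAX_DAILY_SUBMISSIONS = 5

def friction_check(account_created: datetime, submissions_today: int) -> str:
    """Friction-based gating: independent of detection accuracy, so it
    cannot be defeated by anti-detection rewriting. Expects an
    aware (UTC) account_created timestamp."""
    now = datetime.now(timezone.utc)
    if now - account_created < MIN_ACCOUNT_AGE:
        return "hold"   # too new to surface as high-trust content
    if submissions_today > MAX_DAILY_SUBMISSIONS:
        return "flag"   # rate anomaly: route to the review queue
    return "pass"
```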

Detection at submission. Run AI detection scoring at submission time using Originality.ai for web-content types. Route high-score submissions to a human review queue rather than auto-publishing. False positive rates are real enough to warrant human escalation over silent rejection.

Community-based flagging. Implement user-flagging with a downstream review queue. Weight flags from established, high-trust accounts more heavily. Community flagging scales without linear cost increases.

Transparency to end users. Surface provenance signals where available: “verified purchaser,” account age, or C2PA Content Credentials badges. Consumers find AI-generated reviews less helpful — Originality.ai found a statistically significant negative correlation between helpfulness scores and AI content probability. Authentic content is a competitive differentiator as AI slop awareness grows.

Why Does the Adversarial Dynamic Favour Provenance Over Detection Long Term?

The anti-detection tools category is commercially motivated and structurally unconstrained. As long as detection tools are an enterprise barrier to AI-generated content distribution, there is money in building better evasion tools.

The degradation is already measurable. Undetectable.ai scored 100% accuracy as a detector in earlier ZDNet tests, then collapsed to 20% in October 2025. Against clean AI-generated text — before evasion tools are applied — detection accuracy sits at 80%. With well-executed evasion, real-world accuracy can drop to 50–65% — barely better than chance.

Why C2PA wins structurally. Evasion rewriting destroys real provenance but cannot manufacture fake provenance. The private signing key is held by the originating tool’s Certificate Authority — it cannot be retrospectively manufactured. Strip attacks remain the practical vulnerability, but even here the outcome is absent provenance rather than forged provenance — a manageable risk tier in a risk-tiered intake policy.

The regulatory tailwind. The EU AI Act (effective August 2026) requires transparency labelling for AI-generated content — C2PA’s AI assertion type satisfies this requirement. Companies building provenance-first infrastructure in 2026 are not just protecting their training pipelines; they are ahead of compliance.

Invest first in provenance infrastructure for controlled-intake workflows, where C2PA adoption can be required as a condition of submission. Use detection as the second-layer filter for open-intake contexts where provenance cannot be required. The two-layer architecture is more durable than either approach alone.

For the broader picture of what AI slop means for the information ecosystem, see our understanding AI slop and its risks overview.

FAQ

Does C2PA work if content is edited or shared across platforms?

C2PA supports edit manifests — each editing step in a C2PA-compliant tool appends a new cryptographic entry to the manifest chain, preserving provenance through multiple editing stages. Adobe Photoshop, Lightroom, and Premiere Pro all support this.

The failure mode is re-export through non-C2PA tools: a screenshot, re-encoding, or resave by software that does not preserve the JUMBF manifest container drops the original C2PA metadata entirely. Most social media platforms strip metadata on upload, so open web sharing introduces coverage gaps. C2PA is most reliable in controlled pipelines where the toolchain is known end-to-end.

How much does AI content detection cost at scale?

Originality.ai: approximately $0.01 per document at retail pricing. GPTZero: freemium model with paid API access above five free tests per day. Manual review: $0.06–$0.17 per document at $30–$50/hour with 300–500 documents per hour throughput.

At 100,000 submissions per month: automated tools run approximately $600–$1,000/month; manual review runs $6,000–$17,000/month. The recommended architecture — automated detection handles 90–95% of decisions, HITL escalation handles the remaining 5–10% — produces a blended cost of approximately $900–$2,700/month at that volume.

Can a data quality gate remove too much legitimate content?

Yes — false positives are the primary operational risk. Technical writing, legal prose, and domain-specific documentation have vocabulary and sentence structure patterns that closely resemble AI-generated content. Non-native English writing is disproportionately flagged across multiple detectors.

Mitigation: set detection-score thresholds conservatively and route borderline cases to human review rather than auto-rejection. A well-tuned gate should reject no more than 1–3% of content that human reviewers would pass. Measure and track false positive rates quarterly, segmented by content type.

Is manual review scalable for high-volume UGC?

At 300–500 documents per reviewer per hour, manual review scales linearly with headcount. For most SMBs, the sustainable ceiling is 10,000–20,000 manual reviews per day before cost and review quality both become problematic. Above that volume, automated detection reduces the manual queue to only borderline and high-stakes cases, so reviewers handle 5–15% of total submissions rather than 100%. Offshore review teams can reduce cost by 50–70% for well-defined review criteria.

The Entropy Spiral: How AI Slop Degrades Future AI Systems Through Recursive Training

You know what happens when you photocopy a photocopy of a photocopy. Each generation loses fidelity — not randomly, but systematically. Fine details blur first, then outlines, then structure. By generation 10 you have a grey smear.

That is exactly what AI training on AI-generated output does. The mechanism is called model collapse. If you have been using AI tools for a year or two and something feels off — responses more generic, more hedge-y, less precise about edge cases — you are experiencing what Charlie Guo at Ignorance AI calls “intelligence drift.” The drift has a structural cause; it is not a trick of perception. This article is part of our series on what the AI slop epidemic means broadly — covering how machine-generated content is degrading the systems that produce it.

In 2024, Shumailov et al. documented model collapse empirically in Nature (vol. 631, pp. 755–759). What almost no practitioner-facing writing has addressed since is what those findings mean for companies fine-tuning foundation models on their own internal data. That is what this article is about.

What is model collapse — and why does the 2024 Nature paper matter?

Model collapse is a degenerative process where a model trained on AI-generated outputs progressively loses the ability to represent rare or edge-case knowledge. Output distribution narrows. Outputs become homogenised and low-variance. It is not a software bug — it is a structural training phenomenon.

Shumailov et al.’s key finding: “indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear.” The tail is where rare, specialised, and minority knowledge lives. When the tails disappear, the model loses the ability to reason about uncommon cases — exactly the cases where expert knowledge matters most.

Two stages. Early: the model begins losing information about the tails. Late: the model converges to a distribution that bears little resemblance to the original — the “irreversible defects.”

Shumailov et al. characterise it as a structural property of the training process: “this process is inevitable, even for cases with almost ideal conditions for long-term learning.” Given sufficient recursion, collapse is not a risk. It is an outcome.

Why does this matter right now? The internet is no longer predominantly human-authored. AI slop floods the web-crawl corpora used to train the next generation of foundation models. According to Ignorance AI’s analysis, over 74% of newly created webpages contained AI-generated text as of April 2025. The recursive loop is already running at internet scale.

How does the entropy spiral work, step by step?

The recursive training mechanism — which Shumailov et al. term the “self-consuming loop” — works like this: Model N generates outputs. Those outputs enter a training corpus. Model N+1 is trained on that corpus. Model N+1 generates outputs. Those outputs enter the next corpus. Model N+2 is trained on that. Each generation’s output becomes the next generation’s training input.

The entropy framing is not a metaphor. The arXiv analysis (2509.16499, 2025) measured a strong linear correlation between model generalisability and training dataset entropy. As recursive iterations progress, entropy sharply decreases. Generalisability declines with it.

Three sources of error compound across generations:

  1. Statistical approximation error: any finite training set truncates the full distribution. Tail knowledge gets cut off each generation.
  2. Functional expressivity error: neural networks have structural limits on what distributions they can represent. Each generation’s learned distribution is a compressed approximation of what it was trained on.
  3. Functional approximation error: stochastic gradient descent amplifies the most common patterns at the expense of rare ones.

These errors compound. The Gaussian model collapse theorem establishes that the nth generation approximation “collapses to be zero variance as the number of generations increases, with probability 1.”
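The single-variable Gaussian case is easy to simulate. The sketch below is not the paper’s experimental setup, just a minimal illustration of the theorem: each generation fits a Gaussian to samples drawn from the previous generation’s fit, and because no new information ever enters the loop, the estimated variance drifts toward zero.

    import numpy as np

    # Minimal illustration of Gaussian model collapse: each generation fits a
    # Gaussian to samples drawn from the previous generation's fit. Estimation
    # noise compounds across generations and the variance drifts toward zero.
    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0   # the "true" generation-0 distribution
    n = 20                 # samples per generation; smaller n collapses faster

    for gen in range(1, 501):
        samples = rng.normal(mu, sigma, size=n)
        mu, sigma = samples.mean(), samples.std(ddof=1)  # refit on synthetic data
        if gen % 100 == 0:
            print(f"generation {gen:3d}: sigma = {sigma:.5f}")
    # sigma shrinks by orders of magnitude: the tails are the first casualty.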

The observable surface symptom before collapse becomes obvious is semantic drift: outputs converge on common phrasing, vocabulary diversity decreases, answers to unusual queries become vague or confidently wrong.

Semantic drift is distinct from catastrophic forgetting. Catastrophic forgetting happens during fine-tuning when the task is too narrow. Model collapse happens from recursive data contamination — the data is degraded, not the training objective. Different causes, different remediation.

For a richer treatment of hallucination mechanisms that manifest at the output level as model collapse progresses, see vibe citing as a professional-domain symptom of hallucination at scale.

What are data distribution drift and semantic drift — and how do you measure them?

Data distribution drift is the statistical shift in training data characteristics across successive iterations — the precursor condition for model collapse. Semantic drift is the observable symptom in model outputs — the signal that damage has already occurred.

By the time user-reported quality degradation surfaces — developers complaining that the model “feels dumber” — the degradation has typically been underway for weeks or months.

Vocabulary diversity / type-token ratio: measure the ratio of unique tokens to total tokens in model outputs, run against held-out human-authored test queries. A declining ratio over successive training runs signals semantic drift. Direction of travel matters more than absolute value.

Perplexity on held-out human-authored data: a model whose learned distribution has shifted away from human language patterns will show rising perplexity on normal human writing. The Nature paper’s OPT-125m experiments documented exactly this across recursive generations.
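A minimal sketch of that measurement, using the Hugging Face transformers library and the OPT-125m family the Nature experiments used; substitute your own fine-tuned checkpoint. This is a simplified per-text average, not a production evaluation harness.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Perplexity of a causal LM on held-out human-authored texts. Rising
    # perplexity across successive training runs signals distribution shift.
    name = "facebook/opt-125m"  # substitute your own fine-tuned checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    def perplexity(texts):
        losses = []
        for text in texts:
            enc = tok(text, return_tensors="pt", truncation=True, max_length=512)
            with torch.no_grad():
                out = model(**enc, labels=enc["input_ids"])  # shifted LM loss
            losses.append(out.loss.item())
        return torch.exp(torch.tensor(losses).mean()).item()

    print(perplexity(["A held-out, human-written paragraph goes here."]))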

Repetitive phrasing audit: manually review a sample of 50–100 production outputs for formulaic sentence openings, stock hedge phrases, or repeated structural patterns. Low-cost, and easy to spot once you know what to look for.
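Both the type-token ratio described above and a first-pass version of this audit reduce to a few lines once production outputs are saved. A minimal sketch, assuming outputs is simply a list of saved output strings:

    from collections import Counter

    def type_token_ratio(texts):
        # Unique tokens over total tokens; track the trend, not the level.
        tokens = [t for text in texts for t in text.lower().split()]
        return len(set(tokens)) / len(tokens)

    def common_openings(texts, k=10):
        # Count each output's first three words; a handful of openings
        # dominating the sample is the formulaic-phrasing signal.
        openings = Counter(" ".join(t.split()[:3]).lower() for t in texts if t.strip())
        return openings.most_common(k)

    outputs = ["Replace this list with 50-100 saved production outputs.",
               "Replace nothing else; two strings keep the sketch runnable."]
    print(type_token_ratio(outputs))
    print(common_openings(outputs))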

Establish a baseline at initial fine-tuning. Re-evaluate after each subsequent training run. Quarterly monitoring is a reasonable minimum.

Early-stage semantic drift can potentially be arrested by cleaning the corpus and retraining from an earlier checkpoint. Late-stage collapse is not recoverable without clean human-authored training data at scale. Prevention is more tractable than reversal.

Full remediation — data quality gates, provenance tracking tooling, entropy-based data selection — is treated in depth in the companion piece on how to defend against model collapse in fine-tuning pipelines.

Are you already inside the recursive training loop? The SMB fine-tuning blind spot

Most content on model collapse focuses on AI labs training frontier models from scratch. The SMB fine-tuning scenario is a structural analogue that almost nobody discusses explicitly.

The starting condition

Foundation models — GPT-4o, Claude, Gemini — are pre-trained on web-crawled corpora that already contain a growing fraction of AI-generated content. The recursive loop is already running at internet scale before your fine-tuning begins. You are not starting from clean human-authored data.

Frontier AI labs know this. Google has licensing deals with Reddit. OpenAI has deals with News Corp. These are active efforts to preserve access to verified human writing because the free supply is running low — “peak data” is a live concern at every major AI lab.

The internal corpus problem

Think about what “our data” actually contains for a team that has been using AI tools for the last 18 months.

Support tickets drafted with ChatGPT or Copilot. Internal documentation and wikis written with AI writing assistance. Code comments and README files generated by coding assistants. Marketing and HR materials produced with AI tools. None of it is tagged as AI-generated at the time of authorship. To any data ingestion pipeline, it looks like internal human knowledge.

When your team fine-tunes a foundation model on this corpus, it is entering the recursive training loop — not at generation 1, but potentially at generation 2 or 3. Invisible, because nothing is labelled “AI-generated.”

Closing the loop

If the fine-tuned model is then deployed to help write more support tickets, update more internal documentation, or assist with code comments — and that content subsequently enters the fine-tuning corpus in the next training cycle — the loop is now closed internally. The organisation has created a private entropy spiral.

Moveo AI’s research establishes a practical threshold: even 10–25% of incorrect or AI-generated data in a fine-tuning set causes measurable performance degradation, and at that level the base model outperforms the fine-tuned variant. When fine-tuning data volume increases without curation, performance does not just plateau — it often regresses: “the model becomes less coherent, and the subtle understanding it initially displayed vanishes.”

More training on unaudited data is not safer. It is faster degradation.

If your team adopted AI writing and coding tools 12–18 months ago without tagging authorship, assume a portion of your internal corpus is effectively synthetic data. The question is: “how far along are we, and do we have the baseline measurements to know?”

How do you detect early signs of model degradation in fine-tuned models?

Detection rests on baseline measurement at the first fine-tuning run, followed by measurements at key intervals compared against that baseline. None of it is onerous.

1. Baseline capture at initial fine-tuning

At the point of completing the first fine-tuning run, record: type-token ratio on a held-out set of 200–500 human-authored test queries; perplexity on the same set; a sample of 50–100 production outputs, saved verbatim. This is the reference point.

2. Dataset audit before each fine-tuning run

Before adding new data to the fine-tuning corpus, review it for AI-generated content. AI content detection tools (imperfect but directionally useful), internal tagging policies for new content, and similarity searches against your own recent model outputs can reduce contamination risk. The goal: understand what fraction you are adding and whether it is approaching the 10–25% threshold.
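One way to make that threshold concrete is to score each incoming document and estimate the synthetic fraction of the batch before committing to a training run. In the sketch below, score_fn is a hypothetical stand-in for your detection tool’s API; treat its output as directional rather than ground truth.

    def synthetic_fraction(docs, score_fn, threshold=0.8):
        # score_fn is a hypothetical stand-in for your detector's API; it
        # should map a document string to a 0-1 AI-likelihood score.
        flagged = sum(1 for d in docs if score_fn(d) >= threshold)
        return flagged / len(docs)

    # Stub scorer keeps the sketch runnable; replace with a real detection call.
    frac = synthetic_fraction(["doc one", "doc two"], lambda d: 0.85)
    if frac >= 0.10:  # approaching the 10-25% degradation threshold
        print(f"~{frac:.0%} of this batch looks synthetic; curate before training")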

3. Output sampling programme

Monthly is reasonable, quarterly is a minimum — sample 50–100 production outputs and assess for vocabulary diversity and formulaic phrasing. You are looking for direction of travel, not an absolute score. Is the type-token ratio declining? Are the same sentence openings appearing more frequently?

4. Regression testing

Maintain a set of domain-specific evaluation prompts with human-authored reference answers. After each model update, compare outputs using a semantic similarity score. A declining score — even a gradual one — is a signal. Keep the same evaluation set across all cycles.
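A sketch of that regression check, using the sentence-transformers library for the similarity score. The embedding model named here is an arbitrary choice; any model works, provided it stays fixed across cycles so scores remain comparable.

    from sentence_transformers import SentenceTransformer, util

    # Compare model outputs against human-authored reference answers.
    # A declining mean similarity across training cycles is the signal.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # keep fixed across cycles

    def regression_score(model_answers, reference_answers):
        a = embedder.encode(model_answers, convert_to_tensor=True)
        b = embedder.encode(reference_answers, convert_to_tensor=True)
        # The diagonal pairs answer i with reference i.
        return util.cos_sim(a, b).diagonal().mean().item()

    print(regression_score(["model answer to prompt 1"],
                           ["human reference answer to prompt 1"]))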

5. Catastrophic forgetting check

After each fine-tuning run, evaluate general-capability benchmarks alongside domain-specific tasks. If general performance degrades while domain performance holds, catastrophic forgetting is the likely cause, not model collapse. They require different responses. Conflating the two leads to applying the wrong fix.

The “worse than baseline” signal

If your fine-tuned model consistently underperforms the unmodified base model on domain-specific tasks, contamination is the likely cause. Fine-tuning should improve domain performance. If the base model is beating your fine-tuned version, the corpus is degrading rather than improving capabilities. If vocabulary diversity has declined more than approximately 20% from baseline, or perplexity on human holdout has increased more than approximately 15–20%, treat corpus contamination as urgent before the next training run.
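Wiring those two thresholds into a monitoring check is trivial once the step-1 baseline numbers exist. A sketch, with the thresholds above as defaults and purely illustrative numbers in the example call:

    def drift_alert(baseline_ttr, current_ttr, baseline_ppl, current_ppl,
                    ttr_drop=0.20, ppl_rise=0.15):
        # True when either metric crosses the treat-as-urgent line above.
        ttr_decline = (baseline_ttr - current_ttr) / baseline_ttr
        ppl_increase = (current_ppl - baseline_ppl) / baseline_ppl
        return ttr_decline > ttr_drop or ppl_increase > ppl_rise

    # Illustrative numbers only: a 26% TTR decline trips the alert.
    print(drift_alert(baseline_ttr=0.42, current_ttr=0.31,
                      baseline_ppl=28.0, current_ppl=30.0))  # True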

What comes next: the defence stack for model collapse

Model collapse is a documented failure mode, not a hypothetical risk. Shumailov et al.’s Nature 2024 paper established it as an inevitable outcome of sufficient recursive training. Entropy decreases with each recursive generation, compressing output diversity in a measurable, linear relationship.

SMB organisations fine-tuning foundation models on internal corpora are entering the recursive training loop without necessarily realising it. The 10–25% contamination threshold for measurable degradation is reachable without deliberate action. Early detection is possible, but only if baselines are established before complaints surface.

For organisations that have identified contamination risk or are building fine-tuning pipelines, the defence stack — data quality gates, provenance tracking, human-in-the-loop validation, entropy-based data selection — is treated in depth in how to defend against model collapse in fine-tuning pipelines.

The conditions driving this problem are not improving. As Charlie Guo frames it: “The path of least resistance (and lowest costs) leads towards AI models regurgitating AI content, over and over again.” The entropy spiral is the default trajectory for any system that does not actively counter it. The question is whether the measurements are in place to catch it before it compounds.

For the broader consequences of AI-generated content flooding the internet, see the wider consequences of AI-generated content floods — covering all dimensions of the AI slop epidemic in one place.

Frequently asked questions

What is model collapse in AI?

Model collapse is a degenerative process where a model trained on AI-generated outputs progressively loses the ability to represent rare or edge-case knowledge. Output distribution narrows until outputs become homogenised and low-variance. Shumailov et al. documented this in Nature (vol. 631, pp. 755–759, 2024): “indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear.” Structural training phenomenon, not a software bug.

Is model collapse the same as catastrophic forgetting?

No. Catastrophic forgetting occurs during fine-tuning when the task is too narrow — the model overwrites prior general capabilities with the new task distribution. Model collapse occurs from recursive contamination of training data — the data is degraded, not the training objective. If general performance degrades while domain performance holds after fine-tuning, catastrophic forgetting is the likely cause. If vocabulary diversity and perplexity on human holdout are declining over successive training cycles, model collapse is the more likely cause.

Can model collapse be reversed once it starts?

Early-stage semantic drift can potentially be arrested by cleaning the corpus and retraining from an earlier checkpoint. Late-stage collapse — where rare knowledge has been systematically extinguished from the model’s weights — is not recoverable without access to clean human-authored training data at scale. The Nature paper uses language of “irreversible defects” once collapse compounds. Prevention is more tractable than reversal.

Is it safe to fine-tune an LLM on data my team wrote with AI assistance?

Not without auditing the corpus first. AI-assisted content is indistinguishable from human-authored content at ingestion time. The Moveo research establishes that even 10–25% AI-generated or incorrect data in a fine-tuning set causes measurable degradation — at that level the base model can outperform the fine-tuned variant. If your team has been using AI writing tools for 12–18 months without tagging authorship, assume a portion of your corpus is effectively synthetic data before you fine-tune on it.

Does model collapse affect large frontier models like GPT-4o or Claude?

Yes, in principle — frontier models are trained on web-crawled corpora that already contain AI-generated content. In practice, frontier AI labs have data quality teams, filtering pipelines, and curated licensed corpora that reduce contamination. SMBs fine-tuning on internal corpora lack all of these safeguards. The risk profile for enterprise fine-tuned models is higher than for the foundation models they are built on.

What is the “self-consuming loop” in AI training?

The formal term from the Shumailov et al. literature: each generation of model output, when used as training data for the next generation, creates a feedback loop. For enterprise fine-tuning, the loop can be closed within a single organisation if the fine-tuned model assists with content creation that subsequently enters the fine-tuning corpus. The conditions for this are already met in most teams that have used AI writing tools for a year or more.

What does “peak data” mean and why does it matter for AI training?

Peak data: usable high-quality human-authored web text is approaching exhaustion as a training resource. Frontier AI companies are actively securing licensed human-authored content because the free clean supply is running low. The base models you fine-tune on are being trained on an increasingly synthetic web, meaning the starting point for fine-tuning is already further down the entropy spiral than it was two years ago.

How do I know if my internal documents count as “synthetic data”?

Any content drafted or materially edited by an AI writing tool — ChatGPT, Claude, Copilot, Gemini — functions as synthetic data in a fine-tuning context, even if a human reviewed it. Content types to audit: customer-facing communications, support ticket responses, internal documentation, wiki pages, code comments, README files, HR and policy documents. If your team has been using AI writing tools for 12–18 months without tagging authorship, assume a portion of your corpus is effectively synthetic data.

What is RAG and is it safer than fine-tuning for avoiding model collapse?

Retrieval-Augmented Generation (RAG) injects relevant context from a retrieval index into the model’s prompt at inference time. RAG does not modify model weights, so the base model is not exposed to the recursive training dynamic — model collapse is not a risk of the RAG approach itself. Trade-off: RAG is more collapse-safe but requires up-to-date retrieval indices. Fine-tuning produces more deeply integrated domain knowledge but carries collapse risk when corpus quality is uncontrolled.

Does model collapse affect coding assistants?

Yes. Coding assistants trained or fine-tuned on repositories containing AI-generated code face the same recursive contamination risk. GitHub repositories increasingly contain AI-generated code comments, boilerplate, and documentation. Observable signals: progressively more generic, less idiomatic code suggestions; increasing rates of hallucinated function signatures or API calls that exist in pattern but not in reality. At advanced stages, outputs drift toward plausible-looking but incorrect code — the AI hallucination dynamic, surfacing through the model collapse mechanism.

Vibe Citing and the Collapse of Peer Review at the World’s Top AI Conference

At NeurIPS 2025 — the world’s most prestigious AI venue — GPTZero found 100 fabricated citations in 51 accepted papers. Peer review caught none of them.

GPTZero is the AI content detection company. They scanned 4,841 accepted papers from the Conference on Neural Information Processing Systems and came up with a name for what they found: “vibe citing.” This article is part of our series on what AI slop is and where it shows up — and academic peer review is now one of its most consequential vectors.

If the research your team uses to evaluate AI tools and benchmark vendor claims contains fabricated evidence, your decisions are built on sources that don’t exist. Here’s how it happened, why peer review failed, and what you should actually do about it.

What is vibe citing — and who coined the term?

Vibe citing is when AI-hallucinated citations end up in academic papers. References that look plausible but point to works that don’t exist. Invented authors, fabricated titles, fake DOIs, arXiv IDs pointing nowhere.

GPTZero’s Head of Machine Learning, Alex Adams, coined the term as a riff on “vibe coding” — Andrej Karpathy’s name for AI-assisted programming by feel rather than comprehension. The researcher isn’t reading and synthesising sources. They’re letting an LLM generate plausible-sounding references and calling it done.

That distinction from ordinary citation errors matters. A typo in a page number can be checked against the real paper. Vibe citations reference papers that don’t exist at all. The LLM generates syntactically correct, genre-appropriate references — author names that could be real, titles that sound like legitimate ML papers, correctly formatted venue identifiers. They read as legitimate until you actually look them up.

This isn’t plagiarism. It isn’t data fabrication. It’s a third category of research misconduct, enabled by LLMs at scale, and invisible to the human eye.

What did GPTZero find in the NeurIPS 2025 papers?

The numbers are precise. GPTZero’s Hallucination Check tool scanned 4,841 of the 5,290 papers accepted by NeurIPS 2025 and found 100 confirmed hallucinated citations across 51 papers.

NeurIPS 2025 received 21,575 submissions and accepted 5,290 — a 24.52% acceptance rate. Each of those 51 affected papers cleared a competitive bar and still went out with fabricated sources.

The University of Chester’s arXiv paper (2602.05930) breaks hallucinated citations into five failure categories. Total Fabrication accounts for 66% of cases — the entire citation invented from scratch. Partial Attribute Corruption (27%) blends real elements with fabricated ones. Identifier Hijacking (4%) uses a valid DOI that points to an unrelated paper. Semantic Hallucination (1%) and Placeholder Hallucination (2%) — obvious template failures like “Firstname Lastname” — round things out.

The most significant finding: 100% of hallucinated citations exhibited multiple failure modes simultaneously. That’s what makes them so hard to catch. They defeat several verification checks at once, not a single obvious one.

GPTZero’s Hallucination Check verifies citations against Google Scholar, PubMed, arXiv, CrossRef, and DOI/URL validation — a multi-database cross-reference no human reviewer performs routinely. The tool catches 99 out of 100 flawed citations.

The trend line matters as much as the point-in-time finding. A December 2025 pre-print found the average number of objective mistakes per NeurIPS paper grew from 3.8 in 2021 to 5.9 in 2025 — a 55.3% increase that tracks directly with ChatGPT’s launch in November 2022.

How did fabricated citations get through peer review?

Peer reviewers are domain experts. Their job is to evaluate whether research claims hold up — not to audit citations. Nobody in the NeurIPS review process is formally tasked with verifying that every referenced work actually exists.

NeurIPS submissions more than doubled from 2020 to 2025 — from 9,467 to 21,575, a 128% increase. GPTZero calls this the “submission tsunami.” A typical reviewer handles four to eight papers per cycle, each with 30 to 60 references. Manually verifying hundreds of citations per cycle isn’t feasible. And vibe citations defeat visual inspection — correct journal name formats, plausible author combinations, appropriate venues for the claimed year.

The quality failure runs in both directions. At ICLR 2026, authors withdrew papers after discovering their reviewers had used AI to write feedback. NeurIPS launched its Responsible Reviewing Initiative in 2025, acknowledging the problem — but it didn’t prevent the hallucinated citations. The structural conditions remain.

Is this an isolated incident or a growing pattern?

NeurIPS isn’t alone. Before the NeurIPS investigation, GPTZero had already identified more than 50 hallucinated citations in papers submitted to ICLR 2026. GPTZero names ICLR, NeurIPS, ICML, and AAAI as the top four ML and AI conferences — all facing the same pressures.

The Reuters Institute’s 2026 report frames academic contamination as part of a broader AI content integrity problem. And it goes well beyond academia. The US MAHA report had citation errors detected by GPTZero within a week of its release. GPTZero’s analysis of a 234-page Deloitte Australia report found 19 hallucinations in 141 citations — the case that ended in a $98,000 AUD refund.

The structural driver is publication pressure combined with paper mills, with LLMs now filling the ghostwriting role faster, and less detectably, than any previous method.

The long-term risk is propagation through the citation graph. Future papers citing contaminated papers inherit corrupt evidence chains.

The same problem is showing up in courtrooms

More than 800 errant legal citations attributed to AI have been flagged in US court filings, with attorney sanctions following.

The structural parallel is obvious. Judges and opposing counsel aren’t expected to proactively verify every cited case — the same mismatch as peer review. The same root cause (LLM hallucination producing correctly formatted references pointing to nothing real) produces the same failure in both contexts.

The legal community has moved faster to enforce consequences — sanctions, mandatory disclosure in some jurisdictions — and that trajectory is worth watching as a signal for where academic responses are likely to follow.

Why this matters if you rely on AI research to make technical decisions

Here’s the specific risk. You’re evaluating a vendor’s benchmark claims. You pull an academic paper to calibrate your evaluation. The methodology might be sound — but if the literature review contains hallucinated citations, the supporting evidence base is fabricated. You’re building your evaluation on sources that don’t exist.

Research papers inform architectural choices, model capability assessments, and stakeholder briefings. Each of those has a research dependency that may be compromised at the source.

The practical response is calibrated scepticism, not blanket dismissal. Check whether the paper was submitted to a venue with automated citation verification. Hallucinated citations cluster in literature reviews, not methodology — so papers with reproducible code carry lower risk. Run suspicious citations through Google Scholar or CrossRef. For a systematic look at evaluating AI detection tools for research and training data, including their reliability limits and where human review remains necessary, see our dedicated guide.
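The manual lookup is also scriptable. The sketch below queries the public CrossRef REST API by title and checks whether a DOI resolves; treat misses as flags for human follow-up rather than proof of fabrication, since CrossRef’s coverage of preprints is incomplete.

    import requests

    # Look up a cited title against the public CrossRef REST API. No close
    # match is a flag for human follow-up, not proof of fabrication.
    def crossref_lookup(title, rows=3):
        resp = requests.get("https://api.crossref.org/works",
                            params={"query.bibliographic": title, "rows": rows},
                            timeout=10)
        resp.raise_for_status()
        return [(item.get("title") or [""])[0]
                for item in resp.json()["message"]["items"]]

    def doi_resolves(doi):
        # CrossRef returns 404 for DOIs it has never registered.
        r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
        return r.status_code == 200

    print(crossref_lookup("Attention Is All You Need"))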

There’s an irony worth naming. The AI tools generating research papers are being evaluated, in part, by research partly generated by those same tools. The circularity compounds at every layer.

The same hallucination mechanism behind vibe citing is what drives model collapse in AI training pipelines — when synthetic content is recursively fed back into future training runs, the degradation compounds at every cycle.

Frequently asked questions

What exactly is vibe citing?

Vibe citing is when AI language models generate academic citations without verifying the referenced works actually exist. The term was coined by GPTZero’s Alex Adams, riffing on “vibe coding.” These aren’t minor formatting errors — they’re wholesale inventions that happen to look syntactically correct.

Is all of NeurIPS 2025 compromised?

No. GPTZero found 100 hallucinated citations in 51 of 4,841 papers — approximately 1.05% of papers. The NeurIPS Board noted that incorrect references don’t necessarily invalidate the paper content. The concern is that the contamination is invisible to standard reading and review.

Can peer review be fixed to catch AI-generated citations?

The structural fix is automated citation verification at the submission stage, before peer review begins — similar to how plagiarism checkers now operate. ICLR has begun requiring disclosure and is coordinating with GPTZero. Policy statements without mandatory automated checking aren’t going to cut it.

What is the difference between vibe citing and just making a mistake?

Ordinary citation errors can be checked against the real paper. Vibe citations reference papers that don’t exist. GPTZero’s methodology excludes obvious spelling mistakes and dead URLs as plausibly human — vibe citing is specifically AI-generated, holistic fabrication.

How do hallucinated citations actually look in a paper?

A typical vibe citation reads like this: an author name that could be real, a title that sounds like a plausible ML paper, a venue such as “NeurIPS 2023” or “ICLR 2022,” and a DOI or arXiv ID that either leads nowhere or points to an unrelated paper. Total Fabrication (66% of cases) involves the entire reference being invented; Partial Attribute Corruption (27%) blends real elements with fabricated ones; Identifier Hijacking (4%) attaches real DOIs to wrong papers — and 100% of cases exhibit multiple failure modes simultaneously.

What is GPTZero’s Hallucination Check tool?

GPTZero’s Hallucination Check is an automated citation verification service that checks references against Google Scholar, PubMed, arXiv, CrossRef, and DOI/URL validation databases. It catches 99 out of 100 flawed citations and was the instrument used to scan 4,841 NeurIPS 2025 papers.

Why does the submission volume at NeurIPS matter?

NeurIPS submissions grew 128% from 9,467 in 2020 to 21,575 in 2025, stretching reviewer capacity across more papers with less experience per review. Citation verification — never formally required — becomes even less likely under that load.

Are AI-written reviews by peer reviewers also a problem?

Yes. At ICLR 2026, authors discovered their reviewers had used AI to write feedback, leading to paper withdrawals. The failure runs in both directions.

Does this problem only affect AI conferences?

No. The same hallucination mechanism produced 800+ fabricated legal citations in US court filings with attorney sanctions, errors in the US MAHA government report, and Deloitte’s $98,000 AUD refund. The consistent cross-domain pattern confirms this is an LLM deployment issue, not a problem confined to academia.

How should you assess whether a paper’s citations are trustworthy?

Check whether the paper was submitted to a venue with automated citation verification. Papers with reproducible code and experimental results carry lower risk — fabricated citations tend to concentrate in literature review sections, not methodology. Run citations that look suspicious through Google Scholar or CrossRef manually.

Will papers with hallucinated citations be retracted from NeurIPS 2025?

NeurIPS’s LLM policy designates hallucinated citations as grounds for revocation — but enforcement for already-accepted papers is less clear than pre-acceptance detection. ICLR’s policy is explicit about rejection. Post-publication correction in conference proceedings is structurally harder than in journals.

Is this an academic problem or does it affect technology decisions directly?

If vendor benchmark claims are supported by citations from contaminated papers, the evidence base for your technology decisions is compromised. The Deloitte case is one documented instance. As AI research informs more procurement decisions, the contamination risk moves upstream into technology governance. For the full scope of the AI slop problem — from content farms and search degradation through to model collapse and strategic response — our overview covers the full landscape.