Every publisher and CTO running a content-heavy platform gets here eventually: why can’t I just block Google’s AI bot? Content is being extracted, AI Overviews are answering queries at the top of search results, and the traffic that used to flow back is not arriving.
Here’s the problem. Blocking Googlebot removes your site from Google’s search index entirely. And with Google holding more than 90% of search queries in markets like the UK, that is commercially equivalent to switching your site off. This article explains the structural trap, what the UK Competition and Markets Authority (CMA) is now proposing to do about it, and what publishers can realistically control in the interim. For the broader governance framework for AI crawler access, see the pillar article.
Why blocking Googlebot destroys your search traffic — the structural problem
Googlebot is Google’s primary web crawler. Block it, and your site disappears from Google’s search index. Organic traffic goes to zero.
That matters because of how dominant Google is. It holds more than 90% of general search queries in the UK and accounts for 39% of combined AI and search referral traffic to publisher websites, per Cloudflare Radar data. No other discovery channel is in the same league.
robots.txt offers no structural escape. It is an honour system — crawlers choose to comply voluntarily. A Web Application Firewall (WAF) can technically block any crawler, but deploying one against Googlebot eliminates organic search traffic just as completely.
Publisher behaviour confirms the trap. Cloudflare’s AI Crawl Control data (July 2025–January 2026) found websites blocking GPTBot and ClaudeBot at nearly seven times the rate they blocked Googlebot. That is a rational calculation: block AI-only crawlers with no search traffic dependency, leave Googlebot alone. The CMA put it plainly: “publishers have no realistic option but to allow their content to be crawled for Google’s general search because of the market power Google holds.” And because they cannot block Googlebot, Google uses that content for AI Overviews — which send very little traffic back to the websites whose content generates the answers.
What does Googlebot actually do — search indexing, AI training, or both?
Googlebot performs two distinct functions using a single crawler. It builds Google’s search index, and it fetches live web content in real time to power AI Overviews and AI Mode via retrieval-augmented generation (RAG).
RAG means the AI fetches current content at query time rather than relying solely on static training data. When an AI Overview appears at the top of search results, that summary was built from content Googlebot retrieved in real time from publisher websites.
This makes Googlebot architecturally different from GPTBot (OpenAI) and ClaudeBot (Anthropic), which crawl only to build training datasets. Block those and you prevent training data use. But Googlebot does both — allowing search indexing means accepting AI Overviews' use of the same content. The two consent decisions cannot be separated.
Cloudflare Radar data shows Googlebot sees approximately 1.70× more unique URLs than ClaudeBot, 1.76× more than GPTBot, and roughly 167× more than PerplexityBot.
What Google-Extended does and does not protect you from
Get this distinction wrong and you will think you have control you do not have.
What Google-Extended covers: blocking Google-Extended tells Google you do not want your content used to train Gemini, Google’s large language model. Signal this in robots.txt by disallowing the Google-Extended user agent.
What Google-Extended does not cover: AI Overviews. AI Overviews run on Googlebot’s real-time RAG inference crawl. Google-Extended has no authority over that. A publisher can implement the directive and still have all their content summarised in AI Overviews the following day.
The nosnippet meta tag is equally inadequate — it does not address inference-time AI Overviews use. Cloudflare’s customer feedback confirms both controls “have failed to prevent content from being utilised in ways that publishers cannot control.” Google’s own representative acknowledged the gap: “We’re now exploring updates to our controls to let sites specifically opt out of search generative AI features.” That opt-out does not currently exist.
Implementing Google-Extended is still worthwhile as a training opt-out signal. Just do not mistake it for control over AI Overviews.
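A minimal robots.txt entry for the training opt-out looks like this; Googlebot itself is deliberately left alone, so search indexing (and with it, AI Overviews) continues unaffected. Narrow the Disallow path if only parts of the site should be excluded.

```txt
# Gemini training opt-out via the Google-Extended token
User-agent: Google-Extended
Disallow: /

# Googlebot is not restricted here, so search indexing continues,
# and so does AI Overviews' use of the crawled content
```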
How the referral traffic loss is playing out in practice
A Pew Research Center study (July 2025, 900 US adults) found AI Overviews cut search click-through rates from 15% to 8% — near-halving referral likelihood. MailOnline reported a 56% click-through rate drop on pages where AI Overviews appeared. For the full crawl-to-refer ratio data and per-platform breakdown, see our companion analysis.
The mechanism is zero-click search. AI Overviews answer queries at the top of the SERP; users get the answer and leave. No click. No referral.
This is producing legal action. Chegg sued Google in February 2025, citing a direct correlation between AI Overviews’ launch and its revenue collapse — the first lawsuit in the current wave. Penske Media Corporation — parent of Rolling Stone, Billboard, Variety, and Hollywood Reporter — filed suit in D.C. federal court in September 2025, attributing a one-third decline in affiliate revenue to AI Overviews.
Google disputes the data. Liz Reid, Google’s head of search, argued in August 2025 that “overall, total organic click volume from Google Search to websites has been relatively stable year-over-year.” The Pew data, MailOnline’s metrics, and Penske’s revenue figures say otherwise.
The harm is not limited to traditional media. SaaS documentation, FinTech knowledge bases, HealthTech content — any organisation that produces content Google can summarise faces the same dynamic. That breadth is exactly what strengthened the case for regulatory intervention.
What the UK CMA’s Strategic Market Status designation changes
On 10 October 2025, the UK Competition and Markets Authority designated Google as having Strategic Market Status (SMS) in general search and search advertising — the first regulator in any jurisdiction to make this specific designation.
SMS is a designation under the DMCC Act 2024 applied to firms with substantial and entrenched market power. The Act came into force 1 January 2025; the CMA launched its investigation on 14 January and confirmed the designation on 10 October 2025.
SMS designation gives the CMA powers regulators have not previously held: it can impose legally enforceable conduct requirements, with financial penalties of up to 10% of global turnover for non-compliance. Two scope points matter: Google’s Gemini AI assistant is explicitly NOT in scope. AI Overviews and AI Mode ARE in scope — the features directly responsible for zero-click search fall within the CMA’s new enforcement authority. The US DOJ secured a 2024 ruling that Google illegally monopolised the search market, but remedy proceedings remain ongoing and no equivalent enforcement power yet exists in the US.
Regulatory Timeline
- 1 January 2025: DMCC Act comes into force; CMA launches SMS investigation
- September 2025: Cloudflare publishes Responsible AI Bot Principles and Content Signals Policy
- 10 October 2025: CMA designates Google with Strategic Market Status
- 28 January 2026: CMA publishes proposed Publisher Conduct Requirements
- 25 February 2026: CMA consultation deadline
- Ongoing: DOJ antitrust remedy proceedings (US); EU DSM Directive Article 4 enforcement
What regulators are actually requiring of Google
On 28 January 2026, the CMA published proposed publisher conduct requirements. The requirements would oblige Google to give publishers a “meaningful and effective” opt-out from AI Overviews without affecting search rankings; prohibit downranking sites that opt out; require transparency about content use; require attribution in AI summaries; and provide disaggregated engagement data so publishers can evaluate what AI use is actually worth.
What the CMA declined to mandate is equally significant. Crawler separation — the structural remedy — was acknowledged as “an equally effective intervention” but was not included. Licensing payments were deferred for at least 12 months.
Publisher response was sceptical. News Media Association CEO Owen Meredith: “We’re skeptical about a remedy that relies on Google to separate data for AI Overviews versus search after it has been scraped — this is a behavioral remedy, whereas the cleanest solution would be a structural remedy.” Digital Content Next CEO Jason Kint: “Structural separation… must remain firmly on the table.” For EU context, the DSM Directive Article 4 already gives publishers text and data mining opt-out rights; the CMA aims to create equivalent UK protection.
As of publication (20 February 2026), the consultation window runs until 25 February 2026, with final conduct requirements to follow.
Why Cloudflare argues conduct requirements are not enough
Cloudflare submitted to the CMA that crawler separation is the only structural remedy that removes the conflict of interest inherent in Google managing its own opt-out. The argument is straightforward: behavioural remedies require Google to define the opt-out, implement the controls, and adjudicate compliance — on its own terms. Cloudflare: “A framework where the platform dictates the rules, manages the technical controls, and defines the scope of application does not offer ‘effective control’ to content creators… it reinforces a state of permanent dependency.”
Crawler separation — splitting Googlebot into distinct crawlers for search indexing, AI training, and AI inference — is technically feasible. Google already operates nearly 20 distinct crawlers for different functions. Paul Bannister, CRO of Raptive: “I think if Google actually wanted to do it, they could do it by tomorrow. It’s easy and straightforward and they don’t do it because it gives them a competitive advantage over OpenAI and others.”
In September 2025, Cloudflare published the Responsible AI Bot Principles — a five-principle framework for well-behaved crawlers, including the requirement that all AI bots have one distinct purpose and declare it. Googlebot does not comply. The companion Content Signals Policy extends robots.txt with machine-readable search, ai-input, and ai-train signals — already applied to 3.8 million domains — and by framing these signals as a licence agreement, Cloudflare is creating legal risk for Google if it continues to ignore them.
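For illustration, a robots.txt entry under the Content Signals Policy expresses the three published signals roughly as follows (a sketch using the signal names Cloudflare has published; consult Cloudflare's specification for the canonical preamble and syntax):

```txt
# Content Signals Policy sketch: permit search indexing, refuse use of
# content as real-time AI input (RAG) and for AI model training
User-agent: *
Content-Signal: search=yes, ai-input=no, ai-train=no
Allow: /
```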
That structural debate will not be resolved quickly. Which means publishers need a strategy for the interim.
Should you block AI crawlers or optimise for them — the GEO alternative
For most publishers, blocking Googlebot is not viable. The realistic strategy falls into two categories: what you can control now, and how to adapt where you cannot.
What you can control today: Block non-Google AI crawlers — GPTBot, ClaudeBot, PerplexityBot — via robots.txt or WAF with no organic search risk. Implement Google-Extended to signal Gemini training opt-out. Monitor the Content Signals Policy for adoption signals from Google.
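In robots.txt, that control looks like the snippet below. Because robots.txt is advisory, pair it with WAF rules matching the same user agents for actual enforcement:

```txt
# AI-only crawlers: no organic search dependency, safe to block
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

# Gemini training opt-out (does not restrict Googlebot or AI Overviews)
User-agent: Google-Extended
Disallow: /
```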
For the full toolkit of publisher tools that work within the Google constraint, see our companion guide.
The strategic alternative — Generative Engine Optimisation (GEO): For organisations that accept they cannot block Google’s AI use of their content, GEO is the pragmatic adaptation. Rather than competing for clicks that AI Overviews increasingly intercept, GEO optimises content to be cited and attributed in AI-generated answers. Some publishers are already monetising GEO expertise by selling AI citation playbooks to brand clients. GEO is not a substitute for regulatory remedies — it is a strategy for the interim.
The Really Simple Licensing (RSL) standard, being developed by Reddit, Fastly, and news publishers, offers an emerging commercial framework — essentially royalties for RAG use. It is worth watching, because the compensation gap the CMA deferred will still need resolving.
Waiting for regulatory remedies to mature is a valid position. The CMA’s conduct requirements, once finalised and enforced, may resolve the structural conflict without publishers having to make the blocking decision themselves. For a framework for building a coherent bot policy that integrates all these options, see the pillar article.
Conclusion
Googlebot’s dual-purpose architecture gives Google a structural advantage no other search engine or AI platform holds: access to publisher content for real-time AI inference, while publishers cannot refuse without destroying their search traffic.
The regulatory response is moving in the right direction. The CMA’s SMS designation — confirmed October 2025 — is the first time a regulator has held legally enforceable powers over Googlebot’s crawl behaviour. The January 2026 proposed conduct requirements would, if effectively implemented, give publishers a meaningful opt-out from AI Overviews without search ranking penalty.
Whether behavioural remedies will be sufficient remains open. Publishers and Cloudflare argue only mandatory crawler separation removes the conflict of interest. The CMA acknowledged the argument and chose a behavioural approach anyway.
No enforceable AI Overviews opt-out yet exists. Publishers who understand the structural problem can make better-informed decisions about blocking non-Google crawlers, signalling preferences via Google-Extended and the Content Signals Policy, and adapting content strategy toward GEO in the meantime.
Frequently Asked Questions
Can I block Googlebot from using my content in AI Overviews?
No — not without also blocking Googlebot from indexing your site for search. Googlebot uses the same crawler for search indexing and real-time AI inference. No current mechanism lets publishers separate consent by use case. The CMA confirmed: publishers “have no realistic option but to allow their content to be crawled for Google’s general search because of the market power Google holds.”
Does robots.txt stop AI scrapers from crawling my website?
robots.txt signals crawling preferences but does not technically enforce them. Reputable crawlers honour it voluntarily; others do not. A WAF provides technical enforcement, but using it against Googlebot eliminates organic search traffic. For non-Google AI crawlers — GPTBot, ClaudeBot, PerplexityBot — robots.txt plus WAF enforcement is effective with no search traffic risk.
What is Google-Extended and how do I use it?
Google-Extended is a robots.txt control token — not a separate crawler — that lets publishers opt out of having their content used to train Gemini, Google’s large language model. It is implemented by disallowing the Google-Extended user agent token in robots.txt. It does not stop AI Overviews, which are powered by Googlebot’s real-time inference crawl and are outside Google-Extended’s scope.
When did the UK CMA designate Google as having Strategic Market Status?
10 October 2025, under the Digital Markets, Competition and Consumers Act 2024 (DMCC Act). The investigation launched 14 January 2025; the designation was confirmed 10 October 2025 — the first under the UK’s new digital markets competition regime.
What is the difference between a behavioural remedy and a structural remedy for AI crawlers?
A behavioural remedy imposes rules on how Google manages its existing crawler — requiring Google to offer an AI Overviews opt-out, for example. A structural remedy requires Google to change crawler architecture — mandating separate crawlers for search, AI training, and AI inference. Critics argue only structural remedies remove the conflict of interest inherent in Google adjudicating its own opt-out.
What is crawler separation and why do publishers want it?
Crawler separation would require Google to operate distinct crawlers for search indexing, AI model training, and AI inference, so publishers could consent to each use case independently. Cloudflare argues this is technically feasible — Google already operates nearly 20 distinct crawlers for different functions — and is the only remedy that removes Google’s inherent conflict of interest.
What is Generative Engine Optimisation (GEO)?
GEO is a content strategy that treats AI answer engines as a separate discovery channel — optimising content to be cited and attributed in AI-generated answers rather than competing only for clicks that AI Overviews intercept. Publishers are already monetising GEO expertise by selling AI citation playbooks to brand clients.
What does the CMA’s Publisher Conduct Requirements consultation propose for Google?
Published 28 January 2026, the CMA proposed that Google give publishers a “meaningful and effective” opt-out from AI Overviews without penalising them in search rankings, provide transparency about content use, and include clear attribution in AI summaries. Licensing payment requirements were deferred for at least 12 months. The consultation closes 25 February 2026; final requirements are pending.