Business | SaaS | Technology
Jan 9, 2026

The 2025 AWS and Cloudflare Outages Explained

AUTHOR

James A. Wondrasek

Late 2025 was rough for cloud infrastructure. AWS’s US-EAST-1 region went down for roughly 15 hours in October because of DNS resolution failures. Cloudflare had two separate global outages within a month: one lasting around 3.5 hours in November, another hitting in December. The impact? Massive. Alexa smart assistants stopped responding. Ring cameras went offline. ChatGPT became unavailable. Spotify stopped streaming.

These weren’t isolated incidents. They exposed architectural weaknesses in how we’re building internet infrastructure today. DNS resolution failures cascaded through AWS’s control plane. A database permission change at Cloudflare created oversized configuration files that crashed proxy systems. A kill switch designed to improve reliability triggered a latent bug that brought down 28% of Cloudflare’s HTTP traffic.

This technical post-mortem is part of our comprehensive guide on infrastructure outages and cloud reliability in 2025, where we examine the systemic vulnerabilities affecting millions of businesses worldwide. The root causes varied (DNS failures, database configuration changes, type safety bugs), but they all demonstrated the same thing: single points of failure can amplify into region-wide or global outages. If you’re making architecture decisions about multi-region strategies, automated failover, or choosing between type-safe languages like Rust and dynamic languages like Lua, these outages provide concrete lessons.

What caused the AWS US-EAST-1 outage in October 2025?

DNS resolution failure for DynamoDB service endpoints within AWS’s internal network triggered cascading control plane failures across EC2, Lambda, and CloudWatch. The outage affected 30-40% of global AWS workloads because US-EAST-1 hosts the largest concentration of AWS infrastructure worldwide.

Here’s how it works. AWS services rely on DNS to locate internal endpoints for DynamoDB, S3, EC2 API, and CloudWatch. When DNS resolution for DynamoDB endpoints began failing around 06:50-07:00 UTC on October 20, services couldn’t coordinate operations. EC2 instances couldn’t launch because they couldn’t reach the DynamoDB endpoints needed for coordination. Lambda functions failed immediately when they tried to resolve DynamoDB endpoints and got DNS timeouts. CloudWatch stopped collecting metrics because it couldn’t resolve CloudWatch API endpoints.

The failure cascaded outward from there. DynamoDB unavailability affected IAM authentication, preventing teams from logging into the AWS console. No console access meant operators couldn’t change settings, move traffic, or restart services—even after core systems started recovering.

Multi-AZ architecture didn’t help. Availability zones within a region share control plane infrastructure, including DNS resolution services. When regional infrastructure fails, all zones are affected simultaneously.

Retry storms extended recovery time. Millions of EC2 instances and Lambda functions simultaneously retried failed DynamoDB connections when DNS resolution was restored. The connection flood overwhelmed the database control plane, causing DNS resolution to fail again. AWS had to implement rate limiting before services could fully recover.

The outage lasted approximately 15 hours, affecting over 1,000 companies worldwide. Real-world impact included Alexa smart assistants becoming unresponsive, Ring security cameras going offline, and disruptions to Snapchat, Fortnite, and Robinhood.

How did Cloudflare’s database permission change trigger the November 2025 outage?

A ClickHouse database permissions change made access controls more granular but removed the database name filter from SQL queries. The resulting duplicate rows swelled the Bot Management feature file from around 60 features to more than 200, exceeding the hard-coded 200-feature limit in Cloudflare’s Rust-based FL2 proxy.

The change allowed users to view metadata for tables in both the “r0” and “default” databases. Queries that did not filter on database name, such as SELECT name, type FROM system.columns WHERE table = 'http_requests_features' ORDER BY name;, now returned one row from each database, producing duplicate entries.

This wouldn’t normally be a problem, except the FL2 proxy had a hard-coded 200-feature memory allocation limit. When the feature file exceeded that limit, the proxy code called Result::unwrap() on an Err value, triggering a Rust panic that terminated worker threads and prevented request processing. The error message was blunt: “thread fl2_worker_thread panicked: called Result::unwrap() on an Err value.”
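
Cloudflare hasn’t published the exact FL2 code, but the failure pattern is easy to sketch. The example below uses hypothetical names (MAX_FEATURES, load_features) and shows how an unchecked unwrap() turns an oversized input into a thread-killing panic instead of a handled error:

const MAX_FEATURES: usize = 200; // hard-coded allocation limit described in the post-mortem

// Hypothetical loader: refuses feature files larger than the pre-allocated capacity.
fn load_features(names: &[String]) -> Result<Vec<String>, String> {
    if names.len() > MAX_FEATURES {
        return Err(format!("{} features exceeds limit of {}", names.len(), MAX_FEATURES));
    }
    Ok(names.to_vec())
}

fn main() {
    // Simulate the duplicated feature file: ~60 real features repeated past the limit.
    let oversized: Vec<String> = (0..240).map(|i| format!("feature_{}", i % 60)).collect();

    // The failure mode: unwrap() assumes loading can never fail, so an Err becomes
    // a panic that kills the worker thread handling live requests.
    let _features = load_features(&oversized).unwrap();
}

Handling the Err explicitly, for example by falling back to the last known-good feature file, would have degraded Bot Management rather than crashing the proxy.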

The outage affected approximately 20% of internet traffic, with the worst impact lasting around 3.5 hours and full recovery taking closer to six hours. Cloudflare received over 3.3 million Downdetector reports, and services like ChatGPT, Spotify, Discord, and X (Twitter) experienced disruptions.

What made this difficult to diagnose was the gradual rollout. The permissions change deployed incrementally across infrastructure, creating intermittent failures as nodes alternated between generating good and bad feature files.

The failure cascaded beyond Bot Management: Workers KV saw elevated error rates, Turnstile failed completely, blocking dashboard logins, and Cloudflare Access experienced authentication failures.

What is a cascading failure in cloud infrastructure?

A cascading failure occurs when a failure in one component triggers failures in dependent components, propagating through the system in a chain reaction. Component A fails, which causes Component B to fail, which triggers Component C to fail—the blast radius expands exponentially.

In distributed cloud systems, cascading failures typically spread through shared dependencies like DNS, databases, or control planes. The AWS cascading pattern looked like this: DNS resolution failure → DynamoDB unavailable → EC2 launch failures + Lambda execution failures + CloudWatch metric collection failures.

The Cloudflare cascading pattern followed a different path: ClickHouse permissions change → oversized feature files → FL2 Rust proxy panic → global HTTP 5xx errors.

Cascading failures amplify through several mechanisms. Shared dependencies mean a single control plane, DNS, or database failure affects multiple services simultaneously. Retry storms during recovery flood systems attempting to come back online. Health check failures prevent load balancers from routing traffic even when services have partially recovered.

Prevention strategies exist but require careful implementation. Circuit breakers detect failures and stop cascading by temporarily blocking access to faulty services. When a circuit breaker detects a failure threshold—say, 50% error rate over 10 seconds—it “opens” and immediately fails subsequent requests without attempting calls. This gives downstream services time to recover.
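
Here is a minimal circuit breaker sketch in Rust. It trips on a count of consecutive failures rather than an error-rate window, and the names (CircuitBreaker, call) are illustrative rather than any particular library’s API:

use std::time::{Duration, Instant};

enum State {
    Closed,
    Open { since: Instant },
}

struct CircuitBreaker {
    state: State,
    failures: u32,
    failure_threshold: u32,  // consecutive failures that trip the breaker
    reset_timeout: Duration, // how long to fail fast before allowing a trial request
}

impl CircuitBreaker {
    fn new(failure_threshold: u32, reset_timeout: Duration) -> Self {
        CircuitBreaker { state: State::Closed, failures: 0, failure_threshold, reset_timeout }
    }

    // Wraps a call to a downstream dependency. While the breaker is open,
    // requests fail immediately instead of piling onto a struggling service.
    fn call<T, E: std::fmt::Display>(&mut self, f: impl FnOnce() -> Result<T, E>) -> Result<T, String> {
        if let State::Open { since } = &self.state {
            if since.elapsed() < self.reset_timeout {
                return Err("circuit open: failing fast".to_string());
            }
            // Timeout elapsed: let one trial request through (the "half-open" state).
        }
        match f() {
            Ok(value) => {
                self.failures = 0;
                self.state = State::Closed;
                Ok(value)
            }
            Err(e) => {
                self.failures += 1;
                if self.failures >= self.failure_threshold {
                    self.state = State::Open { since: Instant::now() };
                }
                Err(format!("downstream error: {}", e))
            }
        }
    }
}

fn main() {
    let mut breaker = CircuitBreaker::new(3, Duration::from_secs(10));
    for attempt in 0..5 {
        let result = breaker.call(|| -> Result<(), String> { Err("DynamoDB endpoint unreachable".into()) });
        println!("attempt {}: {:?}", attempt, result);
    }
}

In front of a dependency like the DynamoDB calls described earlier, a breaker like this sheds load during recovery instead of feeding the retry storm.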

Graceful degradation patterns allow systems to maintain partial functionality during failures rather than collapsing completely. When a recommendation engine fails, you can show static content instead. When a payment processor is down, you can queue transactions for later processing. Combined with chaos engineering and observability practices, these patterns let engineering teams build proactive detection and automated remediation capabilities.
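
A small sketch of the same idea, with hypothetical names (fetch_recommendations, STATIC_PICKS) standing in for a real recommendation service and its curated fallback content:

// Hypothetical curated list served when the recommendation engine is down.
const STATIC_PICKS: [&str; 3] = ["editor-pick-1", "editor-pick-2", "editor-pick-3"];

// Stand-in for a call to a recommendation service that can fail.
fn fetch_recommendations(user_id: u64) -> Result<Vec<String>, String> {
    Err(format!("recommendation engine unavailable for user {}", user_id))
}

fn recommendations_with_fallback(user_id: u64) -> Vec<String> {
    match fetch_recommendations(user_id) {
        Ok(personalised) => personalised,
        // Degraded mode: show static content instead of failing the whole page.
        Err(_) => STATIC_PICKS.iter().map(|s| s.to_string()).collect(),
    }
}

fn main() {
    println!("{:?}", recommendations_with_fallback(42));
}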

Why did Cloudflare experience two major outages within a month?

Cloudflare experienced two unrelated outages on November 18 and December 5 because different subsystems failed through distinct mechanisms. The November incident involved database configuration creating oversized data structures. The December incident involved a security mitigation triggering a latent type safety bug.

The clustering reflected operational complexity at global infrastructure scale. Multiple independent failure modes coexist in systems this large.

November 18: ClickHouse permissions change → duplicate query results → oversized Bot Management feature file → panic on the hard-coded 200-feature limit → global HTTP 5xx errors. This affected FL2, Cloudflare’s newer proxy written in Rust.

December 5: CVE-2025-55182 React vulnerability mitigation → WAF buffer increase from 128KB to 1MB → internal testing tool failure → kill switch activation → latent Lua bug in FL1 proxy → nil value exception → global HTTP 5xx errors. This affected FL1, Cloudflare’s legacy proxy written in Lua.

The two proxies have different language safety properties. FL2 (Rust, newer) prevents at compile time the type errors that can occur in FL1 (Lua, legacy). But FL2 had its own vulnerability: the hard-coded 200-feature limit that caused the November outage.

The December incident demonstrated how reliability features can paradoxically create new failure modes. When an internal testing tool failed with new buffer sizes, engineers activated a kill switch to disable “execute” action types globally. The kill switch system had never been tested against “execute” type actions, exposing a years-old error where FL1 code assumed the “execute” field would always exist.

The December outage affected approximately 28% of Cloudflare’s HTTP traffic for 25 minutes. Unlike the gradual November rollout, the December incident’s global configuration system propagated network-wide within seconds.

What is a retry storm and how does it impact recovery?

A retry storm occurs when numerous clients simultaneously retry failed requests after a service disruption, overwhelming infrastructure attempting to recover. What should be a 5-10 minute outage extends to 2-3 hours as systems oscillate between partial recovery and collapse.

During the AWS US-EAST-1 incident, millions of clients retrying DynamoDB connections flooded recovering systems. Here’s the cycle: DNS resolution restored → millions of EC2 instances and Lambda functions simultaneously reconnected to DynamoDB → connection flood overwhelmed database control plane → DNS resolution failed again → cycle repeated.

The exponential growth happens because each failing service layer contributes independent retry traffic. EC2, Lambda, and CloudWatch all retry separately, compounding load on underlying services.

Prevention requires implementing exponential backoff and jitter in retry logic. Exponential backoff progressively increases wait times between retry attempts—1 second, 2 seconds, 4 seconds, 8 seconds, etc. Without this, thousands of clients retry simultaneously every second, overwhelming recovering systems.

Jitter randomises retry timing to prevent synchronisation. If every client waits exactly 2 seconds, they all retry at the same moment. Adding +/- 20% jitter spreads retries out over time.
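
A minimal sketch of backoff with jitter in Rust, assuming a hypothetical try_connect call; a real client would draw jitter from a proper random number generator (for example the rand crate) rather than the system clock:

use std::thread::sleep;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Stand-in for a request to a recovering service.
fn try_connect(attempt: u32) -> Result<(), String> {
    Err(format!("connection refused (attempt {})", attempt))
}

fn connect_with_backoff(max_attempts: u32) -> Result<(), String> {
    for attempt in 0..max_attempts {
        match try_connect(attempt) {
            Ok(()) => return Ok(()),
            Err(e) => eprintln!("{}", e),
        }
        // Exponential backoff: 1s, 2s, 4s, 8s, ... capped at 60s.
        let base_ms = (1000u64 << attempt).min(60_000);
        // Jitter: scale the delay by roughly +/-20% so clients don't retry in lockstep.
        // (The clock stands in for a proper RNG here.)
        let nanos = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().subsec_nanos() as u64;
        let jitter_percent = 80 + nanos % 41; // 80..=120
        sleep(Duration::from_millis(base_ms * jitter_percent / 100));
    }
    Err("gave up after max attempts".to_string())
}

fn main() {
    let _ = connect_with_backoff(5);
}

The delay cap and the +/- 20% spread are arbitrary choices here; the point is that clients back off further on every failure and never retry in lockstep.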

Circuit breakers detect failures and stop retry attempts entirely when thresholds are reached, capping incoming request volume during recovery.

How do DNS resolution failures cascade across AWS services?

DNS resolution failures prevent AWS services from locating internal endpoints, causing service discovery breakdown. DNS operates as a control plane function shared across all availability zones within a region, so multi-AZ deployments don’t protect against region-wide DNS failures.

AWS services use DNS for internal service-to-service communication. When EC2 instances launch, they query DNS for DynamoDB endpoints. DNS failure returns no results → EC2 launch fails without database coordination → cascading failure to dependent applications.

The Lambda execution chain follows the same pattern: function invocation → resolve DynamoDB endpoint → DNS timeout → function fails immediately → application-level cascading failures for serverless workloads.

CloudWatch impact compounds the problem. Metrics collection requires resolving CloudWatch API endpoints. DNS failure prevents monitoring, so operators lose visibility during the incident. Root cause identification gets delayed when you can’t see what’s happening.

Problems with DynamoDB also hit IAM authentication. Teams couldn’t sign into tools that change settings, move traffic, or restart services. Recovery slowed even after core systems started coming back because operators couldn’t access the controls they needed.

Restoring DNS doesn’t immediately recover dependent services. Retry storms, cached failures, and stale connection pools prevent clean restart. Services need time to drain backlogs, clear cache, and re-establish connections.

What is the difference between Cloudflare’s FL1 and FL2 proxy systems?

FL1 is Cloudflare’s legacy proxy written in Lua with dynamic typing and runtime error detection. FL2 is the newer proxy written in Rust with static typing and compile-time error detection. The December 2025 outage occurred only in FL1 because Lua’s lack of type safety allowed a nil value exception that Rust would have prevented at compile time.

The incident stemmed from a Lua exception: “attempt to index field ‘execute’ (a nil value).” When the kill switch removed “execute” action types globally, FL1 Lua code tried to index a missing “execute” field. The code assumed that “if the ruleset has action=’execute’, the ‘rule_result.execute’ object will exist”, but once those rules were skipped the execute object was never populated.

Lua permits operations on nil values that fail at runtime, whereas Rust prevents nil/null operations at compile time through Option types. FL2 Rust code wouldn’t compile if attempting unsafe nil access, preventing the entire outage class.
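
The FL1 and FL2 sources aren’t public, but the language-level difference can be shown with a hypothetical RuleResult type. In Rust the possibly-missing field has to be an Option, and the compiler rejects any code that uses it without handling the None case:

struct ExecuteResult {
    ruleset_id: String,
}

struct RuleResult {
    // "The execute object might not exist" is encoded in the type itself.
    execute: Option<ExecuteResult>,
}

fn handle(rule_result: &RuleResult) {
    // The Lua equivalent, rule_result.execute.ruleset_id, fails only at runtime
    // with "attempt to index field 'execute' (a nil value)". In Rust the same
    // unchecked access does not compile, forcing the None case to be handled.
    match &rule_result.execute {
        Some(exec) => println!("executing ruleset {}", exec.ruleset_id),
        None => println!("no execute action: skipping"),
    }
}

fn main() {
    handle(&RuleResult { execute: None });
}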

But FL2 isn’t invulnerable. It experienced the November outage from a hard-coded 200-feature limit. Even modern systems with strong type safety have vulnerabilities when you hard-code assumptions about data sizes.

Trade-offs exist. Lua offers faster development iteration but runtime risk. Rust provides compile-time guarantees but slower development velocity. Cloudflare is continuing its gradual migration from FL1 to FL2 to eliminate type safety vulnerabilities, but complete migration requires rewriting Lua business logic.

Why is US-EAST-1 considered a single point of failure for AWS?

US-EAST-1 (Northern Virginia) hosts an estimated 30-40% of global AWS workloads, making it the largest concentration of AWS infrastructure worldwide. When a regional dependency fails there, impacts propagate worldwide. This represents a classic example of cloud concentration risk, where workload clustering creates systemic vulnerabilities across entire portfolios of businesses.

This concentration happened for historical and economic reasons. US-EAST-1 is AWS’s oldest region, so many foundational services and customer architectures were built with US-EAST-1 as the default. AWS SDKs, documentation, and tutorials often use US-EAST-1 as default, leading developers to deploy there without considering geographic redundancy.

Cost incentives reinforced this concentration. US-EAST-1 has historically had the lowest pricing, encouraging workload concentration for cost optimisation at the expense of resilience.

Some AWS control plane functions operate regionally or have primary operations in US-EAST-1. IAM, Route 53, and CloudFormation fall into this category. Even if your application runs in another region, it may depend on US-EAST-1 control plane functions.

Migration to multi-region architecture involves several components. You need a data synchronisation strategy, DNS-based traffic routing, and potentially a 40-60% increase in infrastructure cost for active-active patterns.

FAQ Section

What does multi-AZ versus multi-region architecture involve?

Multi-AZ provides isolation between data centres within a single region. Multi-region provides isolation between geographically separated regions with independent control planes. The difference matters because availability zones share regional control plane infrastructure, so regional failures affect all AZs. Multi-region requires data synchronisation, increased cost (40-60% for active-active), and complex traffic routing.

How long did the 2025 AWS and Cloudflare outages last?

The AWS US-EAST-1 outage lasted approximately 15 hours. Cloudflare’s November 18 outage affected roughly 20% of internet traffic, with impact lasting approximately 3.5-6 hours. The December 5 outage lasted approximately 25 minutes and affected 28% of Cloudflare’s HTTP traffic.

What is a kill switch and why did it cause an outage?

A kill switch is a rapid shutdown mechanism for disabling misbehaving features without a full redeployment. Cloudflare’s December outage occurred when a kill switch disabled “execute” action types globally after an internal WAF testing tool failed with the new buffer sizes, triggering a latent Lua bug in the FL1 proxy where code assumed the “execute” field always existed. The kill switch system had never been tested against “execute” type actions, exposing a years-old error.

Can multi-cloud strategies prevent outages like these?

Multi-cloud distributes workloads across AWS, Azure, and GCP for true provider independence. Active-active multi-cloud architectures should exceed 99.99% reliability if individual cloud vendors provide 99.9% uptime. But multi-cloud typically increases costs by 100-150% and requires portable technologies like Kubernetes, Terraform, and Ansible plus engineering investment in abstraction layers.
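
That reliability figure follows from treating the providers’ failures as independent: the combined system is only down when both are down at the same time. A quick sketch of the arithmetic:

fn main() {
    let provider_availability: f64 = 0.999; // 99.9% uptime per provider
    // Active-active across two independent providers: unavailable only when both fail at once.
    let combined = 1.0 - (1.0 - provider_availability).powi(2);
    println!("combined availability: {:.4}%", combined * 100.0); // prints 99.9999%
}

In practice provider failures are never fully independent (shared DNS, shared upstream networks), which is why “should exceed 99.99%” is the safer way to state it.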

What does active-active multi-region architecture require?

Active-active architecture involves bidirectional data replication, DNS-based traffic routing, session state management, and conflict resolution strategies for concurrent writes. Costs typically increase 40-60% or more due to resource duplication. This pattern runs workloads in multiple regions simultaneously with traffic distributed globally, providing the highest availability with instant failover.

What is the difference between fail-open and fail-closed design?

Fail-open systems default to permissive states during failures, allowing traffic through and prioritising availability over security. Fail-closed systems default to restrictive states, blocking traffic and prioritising security over availability. Cloudflare committed to fail-open for configuration failures, accepting security risk to maintain internet connectivity.

Why did gradual rollouts fail to prevent the Cloudflare November outage?

The ClickHouse permissions change was rolled out incrementally across infrastructure, creating intermittent failures that delayed root cause identification. The rollout strategy lacked health validation checks to detect oversized feature files before full deployment, allowing corrupted configuration to propagate globally.

What is Rust’s type safety advantage over Lua?

Rust’s static type system prevents nil/null pointer errors at compile time through Option types, forcing developers to explicitly handle missing values. Lua allows operations on nil values that fail at runtime, as occurred in December when the kill switch removed the “execute” field. Rust code attempting similar access wouldn’t compile, preventing the entire outage class.

How can you quantify outage risk for resilience investment decisions?

Calculate expected annual outage cost: (Outage probability) × (Revenue per hour) × (Expected outage hours) × (Customer impact %). For example: 5% annual US-EAST-1 outage probability × $100,000 hourly revenue × 3-hour duration × 80% customer impact = $12,000 expected annual loss. Compare against multi-region architecture cost increase to determine ROI.
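
The same calculation as a quick sketch, using the example figures above:

fn main() {
    let outage_probability = 0.05;    // 5% chance of a major US-EAST-1 outage in a year
    let revenue_per_hour = 100_000.0; // USD
    let expected_outage_hours = 3.0;
    let customer_impact = 0.80;       // share of customers affected

    let expected_annual_loss =
        outage_probability * revenue_per_hour * expected_outage_hours * customer_impact;
    println!("expected annual loss: ${:.0}", expected_annual_loss); // $12000
    // Compare this figure against the annual cost increase of a multi-region
    // architecture to decide whether the resilience investment pays off.
}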

Next Steps

The 2025 AWS and Cloudflare outages demonstrate that even the largest cloud providers experience systemic failures that cascade across infrastructure. DNS resolution failures, database configuration errors, and type safety bugs—each exposed architectural vulnerabilities that affect millions of businesses worldwide.

For a complete overview of cloud reliability strategies, risk frameworks, and operational resilience practices, see our comprehensive guide to infrastructure outages and cloud reliability in 2025, which covers everything from multi-cloud architecture patterns to vendor contract negotiation.
