Business | SaaS | Technology
Jan 9, 2026

Comparing Cloud Provider Reliability: AWS, Azure and Google Cloud

AUTHOR

James A. Wondrasek

“We failed our customers and the broader internet.” That’s what Cloudflare’s CTO said after their December 2025 outage took a chunk of websites offline for 25 minutes.

Here’s the thing though—it’s not just Cloudflare. AWS US-EAST-1 went down for 15 hours in October 2025, affecting over 4 million users. That one region hosts somewhere between 30-40% of all AWS workloads.

And then there’s the Delta Air Lines situation. They got $60M in SLA compensation. Meanwhile, they lost $500M in business impact. That’s 12% coverage. Ouch.

The point is this: there’s no “best” cloud provider when it comes to reliability. What matters is understanding reliability patterns, regional differences, and what SLAs actually mean so you can build the right level of resilience for your business. For a comprehensive overview of how these outages fit into broader infrastructure challenges, see our cloud reliability comparison. Because outages will happen—the details matter way more than the marketing promises.

AWS vs Azure vs Google Cloud: Which Cloud Provider Has the Best Uptime Track Record?

Those shiny uptime percentages everyone throws around? They only tell part of the story.

AWS delivered 99.95% effective uptime in 2025 with 6 major incidents. Sounds good, right? But dig into US-EAST-1 and you’ll find 99.89% uptime versus that 99.95% global average. That’s 30% more outage incidents than any other AWS region.

Azure came in at 99.97% uptime with 4 major incidents. Fewer disruptions overall, but their mean time to recovery averaged 4.2 hours versus AWS’s 2.8 hours.

Google Cloud achieved 99.98% uptime with 3 major incidents. They recovered fastest at 1.9 hours average, but their smaller service coverage means fewer redundancy options when you need them.

Here’s where it gets interesting: these numbers are averages. If you’re in US-EAST-1 because some services are only available there, you’re accepting lower reliability. You don’t get to opt out.

The October 2025 AWS outage hit over 3,500 companies. We’re talking Snapchat, Ring, Robinhood, McDonald’s, Signal, Fortnite. In the UK: Lloyds Bank, HMRC, National Rail. Plus Coinbase and Duolingo. The root cause? A control plane failure. DNS resolution for DynamoDB failed, and that cascaded across EC2, Lambda, CloudWatch, and IAM. Multi-AZ didn’t help. Everything in US-EAST-1 went down together. For detailed technical analysis of this cascading failure mechanism, see the 2025 AWS and Cloudflare outages explained.

Azure’s October moment: an “inadvertent tenant configuration change” in Azure Front Door took down every single region worldwide.

Google Cloud’s June outage: a new feature in Service Control overloaded infrastructure. Three hours down, taking Discord and Spotify with it.

So what does this mean for your infrastructure decisions? It depends on which trade-offs fit your business. AWS has control plane issues. Azure has longer recovery times. Google has fewer incidents but smaller coverage. Pick your poison.

How Do Cloud Provider SLAs Compare in Actual Outage Compensation?

Standard SLAs promise 99.9% uptime. Drop below 99.9% and you get 10% service credit. Fall below 95% and you get 25%. Drop further still and credits can reach 100%, but they're always capped at your monthly fees.

Let’s do the maths. Your monthly AWS bill is $100K. You experience 99.5% uptime. You receive $10K in credits. Meanwhile, your business lost $2M.
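
That credit maths is easy to sketch. The tier thresholds below are illustrative of the standard terms described above; check your provider's actual credit schedule before relying on them:

```python
def sla_credit(monthly_bill: float, uptime_pct: float) -> float:
    """Tiered service credits typical of standard cloud SLAs.

    Thresholds are illustrative: 10% credit below 99.9% uptime,
    25% below 95%. Credits are capped at the monthly bill.
    """
    if uptime_pct >= 99.9:
        return 0.0
    if uptime_pct >= 95.0:
        return monthly_bill * 0.10
    return monthly_bill * 0.25

bill = 100_000                    # monthly cloud spend
credit = sla_credit(bill, 99.5)   # lands in the 10% tier
business_loss = 2_000_000

print(f"Credit: ${credit:,.0f}")                   # Credit: $10,000
print(f"Coverage: {credit / business_loss:.1%}")   # Coverage: 0.5%
```

A $10K credit against a $2M loss is 0.5% coverage, which is why the gap between SLA compensation and business impact matters so much.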

AWS, Azure, and Google have nearly identical standard terms—somewhere between 99.9% to 99.99% depending on your architecture. Multi-AZ gets you better terms. Multi-region better still. But the compensation is always the same: percentage-based credits against your infrastructure costs. Not your business impact.

The distinction between SLA, SLO, and SLI matters here. Your SLA is the contract—what providers promise and what they’ll pay when they break it. Your SLO is your internal objective, typically stricter than the SLA. SLI is what you actually measure. If your SLA promises 99.9% but your business needs 99.99%, that gap is yours to solve through architecture.

Now, if you’re spending above $500K annually, you can negotiate enhanced SLAs. We’re talking 99.95% to 99.99% commitments, financial penalties beyond credits—actual compensation up to 3x monthly spend—and priority incident response.

For smaller organisations, standard SLAs are non-negotiable. Which means architecture becomes your SLA.

What Is the Difference Between Multi-AZ and Multi-Region Architecture?

Multi-AZ gives you one level of resilience. Multi-region takes isolation quite a bit further.

Multi-AZ deploys across availability zones within one region. These are isolated datacentres 10-100km apart, but they’re sharing control plane infrastructure.

Multi-region deploys across geographic regions. Completely independent infrastructure. Completely independent control planes.

Think of it like this: multi-AZ is like backup power in different rooms. Multi-region is a second house in another city. Control plane failure is the whole house losing power. (See the FAQ “What is a control plane failure” for the details on this.)

The AWS October 2025 outage proved the point. When US-EAST-1 control plane failed, all AZs became unavailable simultaneously. Multi-AZ architecture meant absolutely nothing.

Here’s what it costs: Multi-AZ adds 15-25% to infrastructure costs. Multi-region adds 100-150% plus data transfer fees at $0.02/GB.

RTO considerations: Multi-AZ achieves failover in seconds. Multi-region runs anywhere from minutes to hours depending on your failover strategy.

For most workloads, multi-AZ gets you 99.95% reliability—that’s 22 minutes of downtime monthly. Multi-region pushes toward 99.99%—4.38 minutes. Whether that 17.6 minute difference justifies doubling your costs depends entirely on how you calculate downtime impact. For a comprehensive look at these architecture patterns and their trade-offs, see our guide on multi-cloud architecture strategies and resilience patterns.
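
Those downtime-minute figures are straightforward to verify. A minimal helper, assuming an average-length month of roughly 30.44 days:

```python
MINUTES_PER_MONTH = 30.44 * 24 * 60  # average month, ~43,834 minutes

def downtime_minutes(uptime_pct: float) -> float:
    """Convert a monthly uptime percentage into allowed downtime minutes."""
    return MINUTES_PER_MONTH * (1 - uptime_pct / 100)

print(round(downtime_minutes(99.95), 1))  # ~21.9 minutes (multi-AZ)
print(round(downtime_minutes(99.99), 2))  # ~4.38 minutes (multi-region)
print(round(downtime_minutes(99.9), 1))   # ~43.8 minutes (standard SLA)
```

The same helper backs the error budget arithmetic later in the article: 99.9% works out to 43.8 minutes of monthly budget.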

What Caused the Cloudflare December 2025 Outage and What Can We Learn?

A configuration change deployed at 08:47 UTC triggered a Lua exception. The error message: “attempt to index field ‘execute’ (a nil value)”.

The change disabled an internal WAF testing tool that couldn’t support increased buffer sizes. The bug had been sitting there for years until this particular configuration exposed it.

The impact: approximately 28% of HTTP traffic returned HTTP 500 errors for 25 minutes.

But here’s the real problem: changes propagated network-wide within seconds. No gradual rollout. No canary testing. Full blast, global deployment.

This came just weeks after Cloudflare’s November 18 outage, where database permissions caused Bot Management files to exceed memory limits. Six hours down.

Lorin Hochstein’s analysis nailed it: “good intentions, bad outcomes.” The killswitch system they’d designed to disable misbehaving rules had never been tested against “execute” type actions.

So what’s the lesson? Even mature infrastructure companies fail from routine changes. Your staging environment needs production-like fidelity. Rollouts need gradual deployment. And sometimes fail-closed logic needs to be fail-open, though that comes with its own security trade-offs.

Active-Active vs Active-Passive Failover: Which Architecture Should I Choose?

Active-active runs workloads simultaneously in multiple locations. If one node fails, others take over instantly. RPO equals zero. RTO is seconds.

Active-passive keeps your primary handling traffic while the standby idles. When primary goes down, the system detects and switches. RPO is minutes. RTO is 5-30 minutes.

The cost difference is significant. Active-active requires 100% duplicate infrastructure—you’re paying double. Active-passive adds 30-50%.

The complexity difference matters too. Active-active requires distributed consistency, session management, global load balancing across locations. Active-passive is simpler—one primary, centralised state, no conflict resolution headaches.

The trade-offs break down to availability versus complexity. Active-active provides continuous availability because multiple nodes run in parallel. All your infrastructure is doing useful work on production traffic. But you need to manage concurrent writes, load distribution, and data synchronisation across locations.

Active-passive centralises state, which makes behaviour predictable. The standby runs in minimal state or scales up only when actually needed. No conflict resolution required. But failover creates a brief service interruption.

There’s a middle ground: pilot light strategy. You keep a minimal standby that can scale quickly. This achieves 45-60 minute RTO with 40-60% additional costs.

One more thing: manual failover in active-passive takes 3-6 hours. Automated failover takes 3-6 minutes. That’s a 60-120x difference.
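
The shape of that automation is simple in outline. This is a sketch only; `check` and `promote` are placeholders for your real health check and whatever actually switches traffic (a DNS update, load balancer reconfiguration, or standby scale-up):

```python
import time

def automated_failover(check, promote, failures_needed=3, interval=10.0):
    """Minimal active-passive failover loop (sketch).

    Polls the primary via `check()`; after `failures_needed` consecutive
    failures, calls `promote()` to switch traffic to the standby.
    Requiring consecutive failures avoids flapping on a single blip.
    """
    consecutive = 0
    while True:
        if check():
            consecutive = 0
        else:
            consecutive += 1
            if consecutive >= failures_needed:
                promote()  # primary confirmed down: cut over to standby
                return
        time.sleep(interval)
```

With a 10-second interval and three required failures, detection alone takes about 30 seconds, which is how automated failover lands in the minutes rather than the hours.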

How Do I Calculate the Real Cost of Downtime for My Infrastructure?

Revenue impact: take your annual revenue, divide by 8,760 hours, multiply by outage duration.

Productivity impact: affected employees × their hourly cost × duration.

Customer churn: somewhere between 2-8% of customers leave per hour you’re down. Multiply that by customer lifetime value.

Brand damage: model this as the marketing spend needed to restore your reputation—often 5-15x your direct revenue loss.
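
Pulling those four components into one rough model, with all inputs being estimates you supply and the brand multiplier defaulting to the low end of the 5-15x range above:

```python
def downtime_cost(annual_revenue, outage_hours, employees, hourly_cost,
                  hourly_churn_rate, customers, clv, brand_multiplier=5):
    """Rough downtime cost model using the four components above.

    All figures are estimates; hourly_churn_rate is the fraction of
    customers lost per hour down (the 2-8% range cited above).
    """
    revenue = annual_revenue / 8_760 * outage_hours
    productivity = employees * hourly_cost * outage_hours
    churn = hourly_churn_rate * outage_hours * customers * clv
    brand = revenue * brand_multiplier
    return {"revenue": revenue, "productivity": productivity,
            "churn": churn, "brand": brand,
            "total": revenue + productivity + churn + brand}

# Illustrative example: $50M ARR SaaS, 2-hour outage, 200 staff at $75/hr,
# 2% hourly churn across 5,000 customers at $3K lifetime value.
cost = downtime_cost(50_000_000, 2, 200, 75, 0.02, 5_000, 3_000)
print(f"Total: ${cost['total']:,.0f}")
```

Even with modest inputs, churn usually dominates: losing customers costs far more than the revenue missed during the outage itself.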

The industry benchmarks: SaaS companies average $5,600 per minute of downtime. E-commerce $4,000. Financial services $8,900.

The Delta breakdown tells the full story: $500M total = $300M direct revenue + $150M operational recovery + $50M brand impact. They received $60M in credits, covering just 12%.

Google’s error budget concept is useful here. If your SLA is 99.9%, that’s 43.8 minutes of allowed downtime monthly. Stay within budget, prioritise features. Exceed your budget, prioritise resilience.

Here’s the ROI framework: if one hour of downtime costs you $500K and multi-region architecture costs $1M annually, you break even by preventing two one-hour outages a year. If you’re averaging six outages yearly, multi-region pays for itself three times over.
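
The break-even point is just a ratio, so it is worth computing for your own numbers rather than taking any benchmark on faith:

```python
def breakeven_prevented_outages(hourly_downtime_cost, annual_resilience_spend):
    """One-hour outages per year you must prevent before the spend pays off."""
    return annual_resilience_spend / hourly_downtime_cost

# $500K per downtime hour vs $1M/year of multi-region architecture:
print(breakeven_prevented_outages(500_000, 1_000_000))  # 2.0
```

At six prevented one-hour outages a year, $3M of avoided loss against $1M of spend is a 3x return.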

What Are the Common Root Causes of Major Cloud Provider Outages?

Configuration changes cause 45% of major outages. Either human errors or automated deployment issues.

Control plane failures account for 25%. These are management layer failures that affect all resources in a region despite multi-AZ design.

Hardware failures represent 15%—datacentre power, cooling, networking problems.

Software bugs make up 10%. Platform bugs that only surface under specific load conditions.

Cascading failures are the remaining 5%. This is when one service’s problems overwhelm the dependent services downstream.

Forrester’s analysis highlights concentration risk from dependence on single providers. When foundational services like DNS fail, even well-architected applications become unstable.

Prevention maps directly to causes. Configuration changes need gradual rollouts. Control plane failures require multi-region architecture. Cascading failures need circuit breakers.
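
The circuit breaker pattern mentioned above is worth seeing in miniature. A minimal sketch (production systems would use a hardened library, but the core state machine is this small):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency so its
    problems don't cascade into your own service.

    After `failure_threshold` consecutive failures the circuit opens and
    calls fail fast; after `reset_timeout` seconds one trial call is
    allowed through (the half-open state).
    """
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Failing fast is the point: when the dependency is down, your service returns a controlled error immediately instead of piling up timed-out requests that take it down too.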

There’s a shared responsibility model at play here: providers own infrastructure reliability. You own application resilience.

How Does AWS US-EAST-1 Region Reliability Compare to Other AWS Regions?

US-EAST-1 delivers 99.89% uptime versus the 99.95% global average. That’s 30% more incidents. MTTR averages 3.8 hours versus 1.5-2 hours in newer regions.

The region is AWS’s oldest with legacy architecture, hosting 30-40% of all workloads. Even global apps anchor their identity and metadata flows in US-EAST-1. When it fails, the impacts propagate worldwide.

Here’s the catch: many services are US-EAST-1 only. Some CloudFormation features, some API operations. This forces architectural compromises.

The October 2025 impact: Downdetector captured 17M+ global reports. The control plane failure lasted 15 hours. DNS resolution services failed, which prevented automated failovers. Control plane APIs became unavailable, blocking the infrastructure changes needed to reroute traffic. Shared services like IAM, CloudWatch and Systems Manager created single points of failure across multiple regions.

What are the alternatives? US-EAST-2 (Ohio) achieves 99.96% uptime. EU-WEST-1 (Ireland) reaches 99.98%.

But migration isn’t simple. There’s data transfer costs at $0.02/GB. You need duplicate infrastructure. Substantial engineering time.

The practical strategy: use US-EAST-1 for control plane services like CI/CD and CloudFormation. Run your production workloads in US-EAST-2, US-WEST-2, or EU-WEST-1. Split the difference between service dependency and reliability requirements.

FAQ

What monitoring tools should I use to detect cloud provider outages quickly?

You need third-party monitoring from outside the provider’s network. This detects provider-level failures that your internal monitoring will miss. Options include ThousandEyes, Pingdom, and Uptime Robot. These tools operate independently of your cloud provider’s infrastructure. Set up multi-region health checks every 30-60 seconds—this provides faster detection than relying on provider status pages. Combine automated failover with external monitoring and you’ll achieve 3-6 minute RTO versus 3-6 hour manual response. Configure your alerting thresholds at 3 consecutive failed checks from 2+ locations to avoid false positives while detecting outages 10-12 minutes faster than the official status announcements.
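
The alerting threshold described above (3 consecutive failures from 2+ locations) reduces to a small piece of state tracking. A sketch, with `OutageDetector` being a hypothetical name rather than any tool's API:

```python
from collections import defaultdict

class OutageDetector:
    """Fire an alert only after N consecutive failed checks from M+
    locations, mirroring the 3-failures / 2-locations threshold above.

    Requiring multiple locations filters out local network blips;
    requiring consecutive failures filters out transient errors.
    """
    def __init__(self, consecutive=3, min_locations=2):
        self.consecutive = consecutive
        self.min_locations = min_locations
        self.streaks = defaultdict(int)  # location -> consecutive failures

    def record(self, location: str, healthy: bool) -> bool:
        """Record one check result; returns True when the alert should fire."""
        self.streaks[location] = 0 if healthy else self.streaks[location] + 1
        failing = sum(1 for s in self.streaks.values()
                      if s >= self.consecutive)
        return failing >= self.min_locations
```

Feed it one result per probe location per interval; a single healthy check from a location resets that location's streak.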

Is multi-cloud worth the added complexity and cost?

Multi-cloud makes sense in specific situations: when your RTO requirements fall under 5 minutes, when regulatory requirements demand geographic redundancy, or when 1 hour of downtime exceeds the 100-150% infrastructure cost increase. For most organisations though, multi-region within a single provider offers a better complexity-to-resilience ratio. Kubernetes provides workload portability with 80-90% code reuse across AWS, Azure, and GCP. But managed services like RDS, CosmosDB, and BigQuery create vendor lock-in. Your data layer becomes your actual lock-in point, not compute. Our multi-cloud architecture strategies guide explores these patterns in depth.

How do I test disaster recovery procedures without causing an actual outage?

There are chaos engineering tools built for testing resilience. AWS Fault Injection Simulator, Gremlin, and Chaos Monkey inject controlled failures into your systems. These tools simulate infrastructure failures, latency issues, and service degradation. Run quarterly DR drills during low-traffic periods to validate your failover procedures. Try gradual traffic shifting—10% to standby, then 50%, then 100%—which tests capacity without risking a full production outage. Run game days that simulate specific scenarios: DNS failures, database outages, authentication service disruptions. The goal here is training your team’s response patterns, not just validating the technology works.
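
The gradual traffic shift in that drill is just weighted routing. A toy sketch of the 10% → 50% → 100% schedule; real setups would do this at the DNS or load balancer layer rather than per request in application code:

```python
import random

def choose_target(standby_weight: float) -> str:
    """Route one request to the standby with the given probability
    (0.1 means roughly 10% of traffic)."""
    return "standby" if random.random() < standby_weight else "primary"

# Shift schedule from the DR drill: 10%, then 50%, then 100%.
for weight in (0.10, 0.50, 1.00):
    hits = sum(choose_target(weight) == "standby" for _ in range(10_000))
    print(f"target {weight:.0%} -> observed {hits / 10_000:.1%}")
```

At each step you watch the standby's error rates and capacity before increasing the weight, so a misconfigured standby only ever sees a fraction of production traffic.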

What is a control plane failure and why doesn’t multi-AZ protect against it?

The control plane coordinates cloud resources across all availability zones in a region. When it fails, all AZs become unavailable simultaneously despite their physical isolation. The October 2025 AWS outage demonstrated this perfectly: control plane APIs became unavailable, blocking infrastructure changes needed for rerouting. DNS resolution failed, preventing automated failovers from working. Only multi-region architecture protects against control plane failures because each region has independent control planes. Multi-AZ protects against datacentre failures—power loss, cooling issues, hardware problems. But the control plane sits above the AZ level, which means regional failures affect everything below.

Should I negotiate custom SLAs with my cloud provider?

If you’re spending above $500K annually, yes—you can negotiate enhanced SLAs. We’re talking 99.95% to 99.99% commitments, financial penalties beyond credits up to 3x your monthly spend, and priority incident response. These negotiations typically happen during contract renewals or when you’re committing to increased spend. For smaller organisations, standard SLAs are non-negotiable. Your alternative is architectural resilience: multi-AZ for basic reliability, multi-region for stricter requirements, active-active for mission-critical workloads.

How quickly can cloud providers recover from major outages?

Mean time to recovery varies significantly: Google Cloud averages 1.9 hours, AWS 2.8 hours, Azure 4.2 hours based on 2025 data. Regional factors matter too—US-EAST-1 averages 3.8 hours due to workload density while newer regions recover in 1.5-2 hours. Recovery speed depends on the root cause. Configuration rollbacks are quick—minutes to restore the previous state. Control plane failures take longer because core infrastructure must restart. Hardware failures depend on redundancy: multi-AZ recovers quickly, single-AZ waits for physical repairs.

What is the difference between 99.9% and 99.99% uptime in practical terms?

99.9% allows 43.8 minutes of downtime monthly. 99.99% allows 4.38 minutes. For a SaaS company with 10,000 customers and $10M annual revenue, the difference between 44 minutes and 4 minutes of monthly downtime is approximately $250K annually in revenue impact and churn. That compounds with brand damage and operational recovery costs. Here’s the cost calculation: if your hourly downtime cost is $100K, that 39.4 minute monthly difference equals $65K per month or $780K annually.

Can I rely on cloud provider status pages during outages?

No. Status pages lag 8-15 minutes due to internal escalation processes. Providers detect the issue, verify it, get approval, then publish to the status page. Independent monitoring alerts you directly. Set up external health checks from multiple geographic locations—these detect failures before the internal escalation process completes. Configure thresholds at 3 consecutive failed checks from 2+ locations. This approach detects outages 10-12 minutes faster than the official status announcements, buying you time for manual intervention or automated failover initiation.

What is the blast radius concept in cloud architecture?

Blast radius is the scope of impact when a component fails. Good architecture limits this through isolation. A regional failure affects only that region, not your global traffic. One service’s failure doesn’t cascade to others because circuit breakers stop the propagation. Gradual deployment—10%, then 50%, then 100%—limits the impact of configuration changes. If your changes break things, only 10% of users are affected. Availability zones provide physical blast radius containment. Separate control planes provide logical containment. Service mesh patterns create application-level containment.

How do I balance resilience investment against feature development?

Use the error budget framework. Define acceptable downtime based on your SLA—99.9% equals 43.8 minutes monthly. Measure your actual downtime against that budget. Stay within budget, prioritise features. Exceed your budget, prioritise resilience. This quantifies the reliability-velocity trade-off and guides your quarterly planning with actual data. Track your error budget consumption: if you’re consistently using 80% of your budget, invest in resilience. If you’re only using 20%, you can increase velocity. The framework converts subjective reliability discussions into objective resource allocation decisions.

What role does Kubernetes play in multi-cloud strategies?

Kubernetes provides workload portability by abstracting infrastructure differences. You can deploy identical workloads to AWS, Azure, and GCP with 80-90% code reuse using Terraform for infrastructure provisioning. Container orchestration standardises your deployment patterns. Service mesh abstracts networking complexity. But managed services create lock-in. RDS on AWS, CosmosDB on Azure, BigQuery on GCP—none of these are portable. Your data layer becomes your actual lock-in point, not compute. Multi-cloud Kubernetes requires avoiding provider-specific databases and using cloud-agnostic alternatives like PostgreSQL running on Kubernetes.

Should I avoid AWS US-EAST-1 entirely for new workloads?

Not entirely. US-EAST-1 remains necessary for some services and API operations that aren’t available elsewhere. Here’s the best practice: deploy your control plane resources like CI/CD and CloudFormation in US-EAST-1 for service availability. Run your production workloads in US-EAST-2, US-WEST-2, or EU-WEST-1 for better reliability. This approach splits the difference between service dependency and reliability requirements. Migration complexity depends on your workload: stateless applications move easily, stateful services require careful data transfer planning. Don’t forget the $0.02/GB egress cost—it adds up quickly for large datasets.

For more on understanding the systemic vulnerabilities these outages expose, see our infrastructure outages and cloud reliability overview, which provides a comprehensive framework for evaluating and addressing infrastructure fragility across your organisation.
