Business | SaaS | Technology
Jan 23, 2026

The Observability Money Pit and How to Escape It

AUTHOR

James A. Wondrasek

Your observability spending is out of control. Organisations allocate an average of 17% of their infrastructure budgets to monitoring tools, and 36% of enterprises spend over $1 million annually. This isn’t a sign of maturity. It’s what happens when DevOps chaos meets vendor opportunism.

Microservices took over. DevOps teams adopted “you build it, you run it.” Observability requirements exploded. What started as visibility became an observability-industrial complex. Vendors capitalised on fear-driven procurement. Teams drowned in telemetry data. And 90% of that telemetry goes unread.

This article walks you through why costs are soaring and where waste hides. You’ll learn how to calculate ROI, compare vendors, and consolidate tools to fix the sprawl.

Pillar Reference: This article is part of our guide to the broader DevOps cost crisis. It shows how platform engineering provides systematic cost optimisation.

How Much Should I Budget for Observability as a Percentage of Infrastructure Costs?

Allocate 15-25% of your infrastructure budget to observability. Grafana Labs’ 2025 survey found an average of 17% with a median of 10%. Honeycomb’s guidance suggests 15-25% for quality observability.

For SMBs managing 50-500 employees, this means tens or hundreds of thousands annually. Running $500k infrastructure spend? Budget $75k-125k for observability.
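To make the band concrete, here is a minimal sketch of the arithmetic, assuming the 15-25% guidance above. The function name and figures are illustrative only.

```python
def observability_budget(infra_spend: float,
                         low_pct: float = 0.15,
                         high_pct: float = 0.25) -> tuple[float, float]:
    """Return a (low, high) observability budget band from annual infrastructure spend."""
    return infra_spend * low_pct, infra_spend * high_pct

# $500k infrastructure spend -> roughly $75k-$125k for observability
low, high = observability_budget(500_000)
print(f"Budget band: ${low:,.0f} - ${high:,.0f}")
```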

The percentage doesn’t scale linearly, though. Companies with $100k infrastructure bills should spend roughly $20k on observability, while $100m companies shouldn’t spend $20m. The reason? Economies of scope: the same tools cover more services without proportional cost increases.

What drives the variation? Microservices multiply observability requirements by 3-5x versus monoliths. Running Kubernetes with dozens of microservices? Expect the higher end of 15-25%. Simpler architectures stay closer to 10-15%.

One more thing. Over 50% of observability spending goes to logs alone. That’s your first cost optimisation target.

Why Are Observability Costs Increasing So Rapidly Year Over Year?

Observability costs are rising at 40-48% annually. Microservices are the main culprit. One Honeycomb customer’s spending grew from $50,000 in 2009 to $24 million by 2025, roughly 48% year-over-year growth sustained for 16 years.

Each microservice requires independent instrumentation generating logs, metrics, traces, and profiles. A monolith with 10 modules? One instrumentation profile. Split into 50 microservices? Fifty profiles. The math gets painful fast. This architectural complexity creates a multiplication effect on monitoring costs.

High-cardinality data creates the real cost explosion. The number of unique dimension combinations – user IDs, request IDs, container instances – multiplies with service count. Traditional tools built on search indexes suffer from massive storage overhead. Vendor pricing amplifies this because vendors charge per GB ingested or per host deployed rather than for value delivered.
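A back-of-the-envelope sketch of why cardinality bites, using made-up dimension counts; real numbers depend entirely on your architecture.

```python
from math import prod

# Hypothetical dimension cardinalities for a single request-latency metric
dimensions = {
    "service": 50,        # 50 microservices instead of one monolith
    "endpoint": 20,       # routes exposed per service
    "status_code": 5,
    "container_id": 200,  # short-lived pods churn through fresh IDs
}

# Worst case: every combination becomes a distinct, billable time series
unique_series = prod(dimensions.values())
print(f"Unique series for one metric: {unique_series:,}")  # 1,000,000
```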

What makes it worse: organisations are retaining telemetry longer. 30-day retention expanded to 90+ days for compliance. Container churn adds to it. Short-lived Kubernetes pods generate instrumentation overhead as containers spin up and down.

Matt Klein points out that open source tools and cloud infrastructure made it easier to generate massive telemetry volumes. The zero interest rate era meant companies prioritised growth over cost management. Now that financial accountability is back, everyone’s wondering how their observability bills got so high.

The simple answer? SaaS platforms charge per gigabyte ingested, per host monitored, or per high-cardinality metric tracked. The more visibility you need, the more you pay.

What Is the Observability-Industrial Complex and Why Does It Matter?

The observability-industrial complex describes vendors profiting from fear-driven procurement, proprietary data formats creating lock-in, and overlapping tool categories generating sprawl. Matt Klein coined this term to describe how vendors sell on fear. “You can’t afford to be blind during incidents.” Meanwhile, they create dependency through proprietary agents.

Here’s how it works. Vendors build integrated platforms that discourage migration. Proprietary agents instrument your applications. Custom data formats make exporting difficult. Integrated platforms couple multiple capabilities so migrating logs means losing APM and infrastructure monitoring simultaneously.

Organisations deploy an average of eight observability technologies, and survey respondents cited 101 different tools currently in use. Many run 10-20 observability tools simultaneously. Logs, metrics, APM, infrastructure monitoring, security. Each creates integration overhead and cognitive load.

The fear-driven model works because downtime is expensive. Vendors sell on the premise that inadequate visibility risks catastrophic failures. This drives over-provisioning. Teams instrument everything “just in case” rather than focusing on high-value signals.

OpenTelemetry represents the counter-movement. It’s the second fastest growing project in the Cloud Native Computing Foundation. It provides vendor-neutral telemetry collection standards that reduce lock-in.

This matters because it’s part of the broader post-DevOps financial accountability challenge. When 28% of organisations cite vendor lock-in as their biggest observability concern, it’s not just theoretical.

What Waste Patterns Exist in Observability Spending?

Observability waste manifests in three patterns. First, 90% of collected telemetry is never read. You’re paying to collect, transport, store, and index unused data. This comes from DevOps fear culture (“instrument everything or risk being blind”) and vendor pricing that charges for ingestion regardless of value.

Second, health checks represent 25% of request volume in many systems. These repetitive probes add no diagnostic value. They’re filter candidates for immediate cost reduction.

Third, over-instrumentation happens when teams collect every metric “just in case.” Default 90-day retention applies when 7-day would suffice. No one checks which data streams actually inform incident resolution and which dashboards consume budget without ever being viewed.

Alert fatigue indicates over-sensitive thresholds, and it is the number one obstacle to faster incident response. When 24% of engineering managers cite alert fatigue as a top obstacle, false positive rates are too high.

Duplicate collection is rampant. The same data gets collected by multiple tools. APM, infrastructure monitoring, and logs all capturing host metrics simultaneously.

Identifying waste requires shifting from “collect everything” to “collect what matters.” This means retention policies based on query patterns, filtering health check noise, and eliminating redundant collection.
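As a sketch of the “collect what matters” shift, here is a hypothetical health-check filter. The field names and paths are illustrative, and in practice this filtering usually lives in a collector pipeline rather than application code.

```python
# Paths probed by load balancers and Kubernetes liveness/readiness checks (adjust to yours)
HEALTH_CHECK_PATHS = {"/healthz", "/livez", "/readyz", "/ping"}

def keep_log(record: dict) -> bool:
    """Drop health-check noise; keep everything with diagnostic value."""
    return record.get("http.path") not in HEALTH_CHECK_PATHS

sample = [
    {"http.path": "/healthz", "status": 200},
    {"http.path": "/api/orders", "status": 500},
]
print([r for r in sample if keep_log(r)])  # only the /api/orders record survives
```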

Vendor Comparison: How Do Honeycomb, Datadog, New Relic, and Splunk Compare on Cost and Value?

Let’s cut through the marketing.

Datadog offers unified observability. Logs, metrics, APM, infrastructure, security. Pricing starts at $15 per host per month on Pro tier billed annually, $18 on-demand. Enterprise starts at $23 per host per month. Per-host pricing is simple until custom metrics multiply costs. SMB typical spend: $50k-200k annually.

Strength: unified platform reducing tool sprawl. Weakness: pricing may be prohibitive for large data volumes. Once you’re deep in, you’re locked into their ecosystem.

New Relic provides full-stack observability with user-based pricing. Free tier: 100 GB data ingest per month, unlimited basic users, one free full-platform user. Standard pricing: $10 for first user, $99 per additional. Data beyond 100 GB costs $0.35/GB.

User-based pricing favours large teams but penalises high-cardinality data. SMB typical spend: $40k-150k annually.

Splunk (now Cisco) delivers enterprise-grade log analytics with security integration. Observability Essentials costs $75 per host per month, Standard at $175. Complex and costly licensing targets large organisations. The proprietary SPL query language creates lock-in. Less SMB-friendly with typical enterprise spend $200k-2M+.

Honeycomb specialises in event-based observability and high-cardinality analysis. Free tier: 20M events per month. Pro: $100/month or $1,000/year. Enterprise: $24,000/year. Typical spend: $30k-100k annually.

Honeycomb excels at high-cardinality data exploration. The query-centric cost model means you pay for what you analyse rather than everything you collect.

Unified vs Best-of-Breed: Unified platforms like Datadog consolidate tools but risk lock-in. Best-of-breed approaches – Prometheus plus Grafana plus vendor APM – offer flexibility at the cost of integration complexity. Cost is the biggest criterion when selecting observability tools, but total cost includes engineering time.

OpenTelemetry compatibility matters. Vendors supporting vendor-neutral telemetry give you backend portability. Instrument once, switch backends without re-instrumenting.

What Is Matt Klein’s Control Plane/Data Plane Cost Framework for Observability?

Matt Klein’s framework separates observability into control plane and data plane. It reveals that vendors extract maximum margin from proprietary data planes while control planes could run open-source alternatives.

The control plane handles telemetry collection, routing, transformation, filtering. This is commodity functionality. OpenTelemetry agents do this with vendor neutrality.

The data plane handles storage, indexing, querying, visualisation, alerting. This is where vendors differentiate and extract margin. Proprietary data planes create lock-in because migrating means re-architecting storage and querying.

Klein’s insight: engineers must determine what data might be needed ahead of time, a paradigm unchanged for 30 years. This drives over-instrumentation because teams can’t predict future debugging needs.

Practical application: decouple collection from storage using OpenTelemetry as your control plane standard. This enables hybrid architectures routing high-value streams to premium analytics and bulk logs to cost-effective storage.

Example: OpenTelemetry agents collect all telemetry, then route traces to Honeycomb for high-cardinality analysis, metrics to Prometheus and Grafana, and logs to ClickHouse or S3. This gives backend flexibility without re-instrumenting applications.
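A minimal sketch of that decoupling using the OpenTelemetry Python SDK. The application only knows about a local collector endpoint (the address and service name below are placeholders); routing to Honeycomb, Prometheus, or ClickHouse happens in the collector’s configuration, so swapping backends never touches application code.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export OTLP to a local collector; the backend choice lives in the collector config
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process-order"):
    pass  # business logic here; the span is batched and shipped to the collector
```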

The cost optimisation lever: credible threat of backend migration gives negotiating leverage. When you’re not locked in, vendors compete on price and features rather than extraction.

How Do I Calculate the ROI of My Observability Spending?

Observability ROI balances cost against two value streams. External – customer-facing reliability reducing revenue loss. Internal – developer productivity through faster incident resolution.

Example: you’re running a $10M ARR SaaS company with 12 annual incidents averaging 2-hour MTTR. Assume 1% revenue loss per hour of downtime. Annual downtime cost: 12 × 2 × ($10M × 0.01) = $2.4M.

If observability reduces MTTR by 50% – 2 hours to 1 – you protect approximately $1.2M annually. Against $100k observability spend, that’s $1.1M net positive ROI.

Internal ROI: if observability reduces incident response from 8 hours to 2 hours across 12 incidents, that saves 72 engineering hours at $150/hour, or $10,800 annually.
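A sketch of both calculations, with the assumptions above baked in as parameters you would swap for your own numbers.

```python
def downtime_cost(incidents_per_year: int, mttr_hours: float,
                  arr: float, revenue_loss_per_hour: float = 0.01) -> float:
    """Annual revenue at risk: incidents x hours per incident x revenue lost per hour."""
    return incidents_per_year * mttr_hours * arr * revenue_loss_per_hour

baseline = downtime_cost(12, 2.0, 10_000_000)   # $2.4M at 2-hour MTTR
improved = downtime_cost(12, 1.0, 10_000_000)   # $1.2M at 1-hour MTTR
net_external = (baseline - improved) - 100_000  # minus the $100k observability spend
print(f"Net external ROI: ${net_external:,.0f}")  # $1,100,000

# Internal ROI: engineering hours saved on incident response
hours_saved = (8 - 2) * 12                      # 72 hours across 12 incidents
print(f"Internal savings: ${hours_saved * 150:,.0f}")  # $10,800 at $150/hour
```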

Real-world data: centralised observability reduced MTTR by 40%, saving 15 engineer hours per incident translating to approximately $25,000 per quarter.

Research shows 100ms latency increase equals roughly 1% revenue loss for e-commerce. This makes external ROI calculation concrete for customer-facing applications.

When ROI is negative – cost exceeds value delivered – you’ve got waste patterns or inadequate utilisation. Either you’re collecting too much unused data or incident response processes aren’t translating observability into faster resolution.

What Are Effective Strategies for Consolidating 10-20 Observability Tools Into a Coherent Stack?

Consolidation follows four phases. Audit existing tools. Identify capability overlap. Evaluate unified versus best-of-breed alternatives. Migrate incrementally using OpenTelemetry.

Phase 1: Audit. Catalogue all tools and map to capabilities. Logs, metrics, traces, APM, infrastructure, security. Identify actual usage through query analytics. You’ll find tools no one uses but everyone pays for.

Phase 2: Overlap Identification. Find redundant coverage. Multiple tools collecting the same data means duplicate costs – a short script over your tool inventory can surface this, as sketched after Phase 4.

Phase 3: Architecture Decision. Choose between a unified platform (tool consolidation, but vendor dependency) and best-of-breed (flexibility, but integration complexity).

Phase 4: Incremental Migration. Adopt OpenTelemetry for new services first. Migrate high-value workloads next. Maintain parallel collection during transition.
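The first two phases can start as a short script over your tool inventory. A sketch with a hypothetical inventory; the tool names and capability labels are examples only.

```python
from collections import defaultdict

# Phase 1 output: hypothetical inventory mapping each tool to the capabilities it covers
tools = {
    "Datadog": {"logs", "metrics", "apm", "infrastructure"},
    "Prometheus": {"metrics"},
    "Splunk": {"logs"},
    "New Relic": {"apm", "infrastructure"},
}

coverage = defaultdict(list)
for tool, capabilities in tools.items():
    for capability in capabilities:
        coverage[capability].append(tool)

# Phase 2: any capability covered by more than one tool is a consolidation candidate
for capability, owners in sorted(coverage.items()):
    if len(owners) > 1:
        print(f"Overlap on {capability}: {', '.join(owners)}")
```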

Real results: adopting OpenTelemetry and centralising tools cut maintenance complexity fivefold. Costs tumbled by at least 35%, with an expected total reduction of 67%. A systematic approach to monitoring consolidation through platform engineering provides the organisational structure needed to sustain these improvements.

Vendor negotiation: consolidation creates competitive procurement. Use multiple-vendor RFPs to negotiate better pricing. When vendors know you’re serious about migrating, pricing improves.

Challenges cited: conflicting requirements (53%), competing priorities (50%), and resource constraints (40%).

Change management matters. Involve team members in decisions. Champions drive adoption better than top-down mandates.

How Does Platform Engineering Provide Systematic Observability Cost Optimisation?

Platform engineering addresses observability costs through centralised ownership, golden paths with built-in instrumentation, and observability-as-a-service abstractions preventing tool sprawl.

Platform teams establish standardised instrumentation using OpenTelemetry. They select vendors strategically. They manage retention policies centrally. This prevents 10-20 tool sprawl.

Golden paths are templated compositions for rapid development. For observability, golden paths include pre-configured instrumentation. New services automatically get logs, metrics, and traces collected.

Contrast with DevOps tool sprawl. “You build it, you run it” meant each team selected their own tools. This created 10-20 vendor relationships and budget chaos. No one had visibility into total spending or authority to enforce standards.

Platform engineering builds guardrails that empower developers without compromising standards. Value Stream Management measures end-to-end effectiveness, not just data volume collected.

SMB implementation: a 2-3 person platform team can manage organisation-wide observability at 50-500 employee scale. You need clear ownership and systematic approaches, not massive teams.

This platform engineering cost optimisation strategy beats ad-hoc vendor consolidation because it addresses root causes. Inconsistent instrumentation. Lack of cost visibility. Absence of retention policies. For a complete overview of how platform engineering provides systematic solutions to DevOps cost challenges, see our complete guide to the post-DevOps transition.

FAQ Section

What is the total cost of ownership (TCO) for observability beyond licensing fees?

TCO includes vendor licensing (50-70%), infrastructure for self-hosted components (15-25%), engineering time (10-20%), and training (3-5%). For a 100-person organisation spending $150k on vendors, actual TCO likely reaches $200k-250k.

When does self-hosted observability become more cost-effective than SaaS vendors?

Self-hosted typically reaches cost-effectiveness at 50-100 TB/day. That’s when vendor SaaS pricing – $0.10-0.50/GB – exceeds self-hosted infrastructure and engineering costs. For SMBs generating 1-10 TB/day, SaaS vendors remain more cost-effective. However, hybrid approaches work well. Self-hosted for low-value logs, SaaS for high-value traces. This optimises costs.

How can I reduce log storage costs without losing diagnostic data?

Reduce costs through tiered retention: 7 days hot, 30 days warm, 90 days cold. Add health check filtering to eliminate the 25% of volume that is pure noise. Sample non-critical services at 1-10%. Use query-driven retention policies. Start with health check filters and tiered retention for an immediate 30-40% reduction.
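A sketch of the tiered-retention arithmetic with hypothetical per-GB prices; plug in your own vendor’s storage rates.

```python
# Hypothetical per-GB-per-month prices for each storage tier
TIER_PRICE = {"hot": 0.50, "warm": 0.10, "cold": 0.02}

def monthly_storage_cost(gb_per_day: float, hot_days: int,
                         warm_days: int, cold_days: int) -> float:
    """Approximate steady-state storage cost for a tiered retention policy."""
    return sum(gb_per_day * days * TIER_PRICE[tier]
               for tier, days in (("hot", hot_days), ("warm", warm_days), ("cold", cold_days)))

flat = monthly_storage_cost(100, hot_days=90, warm_days=0, cold_days=0)     # 90 days, all hot
tiered = monthly_storage_cost(100, hot_days=7, warm_days=23, cold_days=60)  # 7/30/90-day tiers
print(f"Flat: ${flat:,.0f}/month  Tiered: ${tiered:,.0f}/month")
```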

What vendor lock-in risks should I consider when selecting observability platforms?

Lock-in risks include proprietary data formats, integrated platforms coupling multiple capabilities, long-term contracts with exit penalties, and vendor-specific agents requiring re-instrumentation. Mitigate through OpenTelemetry adoption. Separate data collection from storage. Negotiate shorter terms. Maintain backend portability.

How do I prove observability value to non-technical stakeholders like CFOs and CEOs?

Prove value through revenue protection quantification (downtime cost multiplied by MTTR reduction), customer experience metrics (latency improvements correlating with conversion), and operational efficiency gains.

For CFOs, translate MTTR to dollars. “Reducing MTTR from 2 hours to 30 minutes saved $50k last quarter.”

For CEOs, frame as competitive advantage. “Our 30-minute incident response enables 99.95% uptime SLA while competitors average 99.9%.”

What is the difference between observability spending as a cost versus an investment?

Observability becomes an investment when ROI – revenue protection plus productivity gains – exceeds cost. This is achieved through strategic instrumentation of high-value services, query-driven retention, and MTTR reduction protecting revenue.

Observability remains a cost when organisations collect telemetry without querying it (the 90% unread problem), over-instrument low-value services, or lack processes translating observability into faster resolution.

How does microservices architecture impact observability budgets compared to monolithic applications?

Microservices typically multiply observability budgets 3-5x versus monoliths. Each service requires independent instrumentation. Distributed tracing adds cross-service visibility overhead. Service proliferation compounds telemetry volume.

A monolith with 10 modules generates one instrumentation profile. Fifty microservices generate 50 profiles. Container orchestration amplifies costs further.

What are intelligent alerting strategies to reduce alert fatigue?

Intelligent alerting includes SLO-based alerts on business-impact thresholds, anomaly detection to reduce false positives, alert grouping and deduplication, severity-based routing, and alert tuning based on historical accuracy.

Start with SLO definition for critical services, then layer anomaly detection. Target less than 30% false positive rates.
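As one example of SLO-based alerting, here is a sketch of an error-budget burn-rate check, assuming a 99.9% availability SLO; the 14.4 threshold is a commonly cited fast-burn value, not a universal constant.

```python
def burn_rate(errors: int, requests: int, slo_target: float = 0.999) -> float:
    """How fast the error budget is burning; 1.0 means burning exactly on budget."""
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget

# Page only on sustained fast burn, not on every individual error spike
if burn_rate(errors=200, requests=10_000) > 14.4:
    print("Page the on-call engineer")
```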

Should I adopt OpenTelemetry and what are the practical benefits?

Yes, adopt OpenTelemetry. It decouples instrumentation from vendor backend choice. This enables backend flexibility (switch vendors without re-instrumenting), hybrid architectures (route high-value traces to premium analytics and bulk logs to cost-effective storage), and vendor negotiation leverage.

Start with new services. Gradually migrate existing services. Leverage platform engineering golden paths to standardise adoption.

How do I benchmark my observability spending against industry standards?

Benchmark using three ratios: infrastructure budget percentage (15-25% target), per-engineer spending ($1,500-3,000 for SMBs), and observability-to-revenue ratio (0.3-0.8% of ARR).

For a 100-engineer SMB with $20M ARR and a $2M infrastructure budget, that gives $300k-500k by infrastructure percentage, $150k-300k by per-engineer spend, and $60k-160k by revenue ratio. Use the highest as a conservative benchmark.

Compare against Grafana survey data (17% average). Spending well above that indicates waste.

What is the relationship between observability costs and DORA Four Key Metrics performance?

Teams that perform well on the DORA metrics typically invest more strategically in observability. They achieve better cost-to-value ratios through targeted instrumentation, SLO-driven alerting, and rapid MTTR.

However, spending alone doesn’t guarantee DORA improvement. Organisations must translate investment into incident response processes, deployment automation that leverages observability signals, and feedback loops that inform improvements.

Platform engineering connects observability to DORA by embedding observability into golden paths, establishing Value Stream Management, and creating feedback loops where insights drive improvements.
