May 16, 2026

Enterprise OpenTelemetry Migration Playbook from Proprietary APM to Open Standard

The “is OpenTelemetry ready?” question has been answered. Airbnb’s new metrics pipeline processes over 100 million samples per second. Adobe runs thousands of Collectors per signal type across its infrastructure. Five new CVEs emerged in April-May 2026 — not because OTel has a security problem, but because it is being treated as production infrastructure worth scrutinising.

The harder question is whether your team is ready, and in what sequence to move. Elastic’s 2026 observability survey shows 11% of organisations in production with OTel and 36% experimenting. The gap between those two groups is not about OTel’s maturity — it is about migration complexity and team readiness.

This playbook covers the six phases of an enterprise OTel migration, with links to the cluster articles that go deep on each phase.

In this hub:

Airbnb’s StatsD-to-OTel migration and the dual-write pattern — Phases 1–2 production template at 100 million samples per second
Adobe’s three-tier Collector architecture at enterprise scale — Phase 3 reference for compliance-heavy environments
OTLP protocol and vendor lock-in escape strategy — Phase 4 backend selection and portability
OpenTelemetry CVEs and security hardening requirements — Phase 5 patch guidance for the April-May 2026 CVE cluster
Declarative configuration as an operational prerequisite — Phases 3 and 5 Collector fleet management

Why April-May 2026 Is the Enterprise Tipping Point for OpenTelemetry

Three convergent signals have closed the readiness debate in 2026. Airbnb validated OTel at 100 million samples per second with roughly an order-of-magnitude cost reduction. Adobe demonstrated organisation-wide adoption across a polyglot enterprise with thousands of Collectors. Five CVEs drew the kind of security scrutiny that only follows genuine production adoption. Proprietary APM costs and lock-in have not changed; the evidence for OTel’s readiness has.

The cost driver is clear. Proprietary APM pricing — per-host plus per-custom-metric models — creates predictable cost growth as your services and cardinality scale. Chronosphere customers report 60-80% telemetry cost reductions after migration. The Elastic survey data confirms the direction: vendor-sourced OTel distributions have grown from 44% to 60% of deployments year-on-year, meaning most enterprises are already in a hybrid OTel state. The question is no longer whether to move — it is how.

What Separates the Early Adopters from the Average Starting Point

Airbnb had a dedicated observability team and roughly 40% of services on a shared, platform-maintained metrics library that the team controlled directly. Adobe had a central platform group insulating hundreds of service teams from OTel complexity — service teams needed only two annotation lines in their Kubernetes manifests to get full auto-instrumentation. These conditions do not describe most 50-500 person engineering organisations.

The patterns transfer; the scale assumptions do not. Before starting, check for the conditions that justify deferring: no platform engineering capacity, an active compliance audit, a mid-scaling event, or a team without OTel-adjacent skills (Prometheus, Kubernetes, YAML at scale). Any of these is a legitimate reason to wait. The hybrid permanent architecture — OTel instrumentation with OTLP routing to your existing Datadog or New Relic backend — is a valid endpoint, not a compromise. It delivers vendor-portable instrumentation without requiring a full backend migration.

Phase 1–2: Instrumentation Audit and Dual-Write Transition

The instrumentation audit is the prerequisite most teams skip and later regret. Map every telemetry source, agent, business-critical dashboard, and alert before touching a single configuration file. Then use the dual-write pattern: run an OTel Collector pipeline alongside your existing proprietary pipeline simultaneously, validate parity under real production conditions, and cut over only after several real incident cycles confirm equivalence.
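A minimal sketch of the dual-write fan-out, assuming both destinations accept OTLP over HTTP. It builds the Collector configuration as a Python dictionary so the shape is easy to inspect. The exporter labels `otlphttp/legacy` and `otlphttp/candidate`, both endpoints, and the helper names are placeholders for illustration, not names from any of the case studies.

```python
import copy
import json

# Hypothetical exporter names; the part after "/" is only a label.
LEGACY = "otlphttp/legacy"
CANDIDATE = "otlphttp/candidate"


def dual_write_config(legacy_endpoint: str, candidate_endpoint: str) -> dict:
    """Collector config whose pipelines export to both backends at once."""
    pipeline = {
        "receivers": ["otlp"],
        "processors": ["batch"],
        "exporters": [LEGACY, CANDIDATE],  # fan-out: same data, two destinations
    }
    return {
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "processors": {"batch": {}},
        "exporters": {
            LEGACY: {"endpoint": legacy_endpoint},
            CANDIDATE: {"endpoint": candidate_endpoint},
        },
        "service": {
            "pipelines": {
                signal: copy.deepcopy(pipeline)
                for signal in ("traces", "metrics", "logs")
            }
        },
    }


def cut_over(config: dict, retired: str) -> dict:
    """Return a copy with one exporter removed from every pipeline."""
    result = copy.deepcopy(config)
    result["exporters"].pop(retired, None)
    for pipeline in result["service"]["pipelines"].values():
        pipeline["exporters"] = [e for e in pipeline["exporters"] if e != retired]
    return result


if __name__ == "__main__":
    config = dual_write_config(
        "https://legacy.example.invalid", "https://candidate.example.invalid"
    )
    print(json.dumps(cut_over(config, LEGACY), indent=2))
```

In this sketch the cutover is a configuration change only: the receivers and the applications sending to them are untouched, and restoring the legacy exporter is a one-line revert if parity breaks after the switch.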

The inventory produced by the audit determines everything downstream: which services migrate first, which dashboards must be rebuilt before cutover, and which legacy metric coverage is low-value enough to retire. Dual-write temporarily doubles telemetry volume and spend — this is expected and should be budgeted. Airbnb, Adobe, TRM Labs, and Shopify all independently arrived at dual-write as the only safe migration approach. That convergence is the strongest argument for it.
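Validating parity can start as a simple comparison of the same metrics read from both pipelines over the same window. The sketch below assumes you can export each side as a mapping from metric name to value; the 2% tolerance and the metric names in the usage example are arbitrary illustrations, not recommendations from any of the teams cited.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParityResult:
    metric: str
    legacy: Optional[float]
    candidate: Optional[float]
    ok: bool


def check_parity(
    legacy: dict[str, float],
    candidate: dict[str, float],
    rel_tol: float = 0.02,  # hypothetical tolerance; tune per metric
) -> list[ParityResult]:
    """Compare two snapshots and flag metrics that are missing or diverge."""
    results = []
    for name in sorted(set(legacy) | set(candidate)):
        a, b = legacy.get(name), candidate.get(name)
        if a is None or b is None:
            ok = False  # present in only one pipeline
        else:
            scale = max(abs(a), abs(b))
            ok = scale == 0 or abs(a - b) / scale <= rel_tol
        results.append(ParityResult(name, a, b, ok))
    return results


if __name__ == "__main__":
    legacy = {"http.requests": 100.0, "queue.depth": 50.0}
    candidate = {"http.requests": 101.0, "queue.depth": 60.0, "gc.pause": 1.0}
    for result in check_parity(legacy, candidate):
        print(result)
```

A metric that appears on only one side is treated as a failure on purpose: missing coverage is the gap the audit inventory is meant to surface before cutover.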

Read more: Airbnb’s StatsD-to-OTel migration and the dual-write pattern — Airbnb’s production benchmark for phases 1 and 2, including delta temporality configuration for high-cardinality JVM services.

Phase 3: Collector Topology Design and Configuration Management

Collector topology is the architectural decision that determines operational resilience at scale. Your deployment pattern — agent sidecar, daemonset, gateway, or three-tier — should be chosen based on signal volume, fault isolation requirements, and team capacity. Signal-level isolation (separate Collector deployments per signal type) prevents a backend rate-limit event for one signal from cascading into others. Without declarative configuration management, topology decisions degrade to configuration drift within months.
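Signal-level isolation can be pictured as one Collector configuration per signal type, each deployed separately. The following is a rough sketch under that assumption; the `per_signal_configs` helper and the endpoint mapping are hypothetical, and a real deployment would add processors and settings that are omitted here.

```python
SIGNALS = ("traces", "metrics", "logs")


def per_signal_configs(endpoints: dict[str, str]) -> dict[str, dict]:
    """One Collector config per signal type, each meant for its own deployment."""
    configs = {}
    for signal in SIGNALS:
        configs[signal] = {
            "receivers": {"otlp": {"protocols": {"grpc": {}}}},
            "processors": {"batch": {}},
            "exporters": {"otlphttp": {"endpoint": endpoints[signal]}},
            "service": {
                "pipelines": {
                    # Exactly one pipeline, so this deployment carries one signal.
                    signal: {
                        "receivers": ["otlp"],
                        "processors": ["batch"],
                        "exporters": ["otlphttp"],
                    }
                }
            },
        }
    return configs
```

Because each deployment has its own exporter and its own resources, a rate-limited logs backend in this layout backs up only the logs Collectors.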

Adobe’s three-tier architecture (sidecar Collector per pod, managed namespace Collectors per signal type, centralised routing to backends) is the reference pattern for compliance-heavy environments. Airbnb’s horizontal-sharding StatefulSet is the reference for high-throughput metrics. The starting point for a 100-person SaaS company sits closer to a simplified two-tier version of either. Fleet-scale Collector management requires GitOps discipline: version-controlled declarative YAML, a CI/CD pipeline for config changes, and a quarterly upgrade cadence for custom distributions.
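One way to make the CI/CD step concrete is a pre-merge check over the parsed configuration. The sketch below assumes the YAML has already been loaded into a dictionary and flags two common forms of drift: pipelines that reference components nobody defined, and components that are defined but no longer used. The function name and message wording are illustrative.

```python
def lint_collector_config(config: dict) -> list[str]:
    """Return human-readable problems found in a parsed Collector config."""
    errors = []
    pipelines = config.get("service", {}).get("pipelines", {})
    used = {"receivers": set(), "processors": set(), "exporters": set()}

    # Pass 1: every component a pipeline names must be defined.
    for name, pipeline in pipelines.items():
        for kind in used:
            for component in pipeline.get(kind, []):
                used[kind].add(component)
                if component not in config.get(kind, {}):
                    errors.append(
                        f"pipeline {name!r} references undefined "
                        f"{kind[:-1]} {component!r}"
                    )

    # Pass 2: defined components that no pipeline uses are likely leftovers.
    for kind, names in used.items():
        for component in config.get(kind, {}):
            if component not in names:
                errors.append(f"{kind[:-1]} {component!r} is defined but unused")
    return errors
```

Failing the build on a non-empty result keeps leftover exporters from accumulating in the fleet configuration between upgrades.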

Read more: Adobe’s three-tier Collector architecture at enterprise scale for the organisational deployment model; declarative configuration as an operational prerequisite for making that topology manageable at fleet scale.

Phase 4: Backend Selection and Vendor Lock-In Escape Strategy

Backend selection has three viable paths. OTel-native open-source (SigNoz, ClickStack) offers the lowest total cost of ownership but the highest operational ownership. OTel-native managed (Honeycomb, Grafana Cloud) reduces operational burden and has compliance certifications available — Honeycomb Private Cloud meets GDPR, HIPAA, PCI DSS, and SOC 2 Type II. Hybrid permanent (OTel instrumentation with OTLP ingestion into Datadog or New Relic) is the fastest migration path and retains your existing tooling.

OTLP makes all three choices reversible — switching backends requires only Collector configuration changes, not application re-instrumentation. Thoughtworks Technology Radar Vol. 34 places SigNoz in Trial and ClickStack in Assess, providing independent analyst validation for both open-source options. For regulated industries, data sovereignty requirements are a decision gate that should be resolved before backend selection begins.

Read more: OTLP protocol and vendor lock-in escape strategy — how OTLP makes backend selection reversible, where semantic convention gaps persist, and what the Thoughtworks Radar placements signal.

Phase 5: Security Hardening — Patch Before You Go Live

The April-May 2026 CVE cluster makes security hardening a required migration phase, not an afterthought. CVE-2026-41078 affects the deprecated Jaeger exporter — there is no available patch; migrate to the OTLP exporter instead. CVE-2026-42191 is patched in opentelemetry-dotnet 1.15.3. CVE-2026-39883 in OpenTelemetry-Go remains unpatched in Linux distribution packages as of May 2026. Audit all Collector component versions before going live.
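The version audit can be scripted against whatever inventory Phase 1 produced. This sketch assumes a hypothetical inventory format (a list of records with `service`, `component`, and `version` keys) and hypothetical component labels. The CVE identifiers, the 1.15.3 patch version, and the 1.15.0 to 1.42.0 affected range for OpenTelemetry-Go are the ones cited in this playbook. Treating every opentelemetry-dotnet release below 1.15.3 as affected is a conservative assumption, not a statement from the advisory.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '1.15.3' or 'v1.15.3' into a comparable tuple of integers."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def audit(inventory: list[dict]) -> list[str]:
    """Flag inventory entries that match the April-May 2026 CVE cluster."""
    findings = []
    for item in inventory:
        service, name = item["service"], item["component"]
        version = parse_version(item["version"])
        if name == "jaeger-exporter":
            # Deprecated component with no available patch.
            findings.append(
                f"{service}: CVE-2026-41078, migrate to the OTLP exporter"
            )
        elif name == "opentelemetry-dotnet" and version < (1, 15, 3):
            findings.append(
                f"{service}: CVE-2026-42191, upgrade to 1.15.3 or later"
            )
        elif name == "opentelemetry-go" and (1, 15, 0) <= version <= (1, 42, 0):
            findings.append(
                f"{service}: CVE-2026-39883, version is in the affected range"
            )
    return findings
```

Verify the ranges against NVD before relying on a script like this; the values above are a snapshot of the guidance in this article, not a maintained feed.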

OTel’s growing CVE cadence is itself a maturity signal: the project is being scrutinised at the level of production infrastructure because it is being treated that way. Custom Collector distributions reduce your attack surface by excluding deprecated exporters and unnecessary components. This is the same technique Adobe uses for its quarterly upgrade cadence, and it closes the CVE blind spot that affects teams running vanilla distributions with components they do not use.

Read more: OpenTelemetry CVEs and security hardening requirements — specific CVE version numbers, patch status, and remediation steps for each vulnerability in the April-May 2026 cluster.

Phase 6: Organisational Readiness and the Go/No-Go Decision

Organisational readiness is the phase that case studies document implicitly but rarely name as a decision gate. Ask four questions before starting: Does your team have the capacity to own a Collector fleet in on-call rotation? Do you have OTel-adjacent skills in-house? Is a platform engineering function available, or will application engineers own the observability infrastructure? Is the timing right given your current growth stage?

Adobe’s voluntary adoption model — introducing OTel as an option for new applications rather than mandating a full migration — is a lower-risk approach that manages scope without abandoning the goal. TRM Labs gave each team control of their own validation timeline via the dual-write architecture, allowing self-paced cutover. Neither model works without dedicated capacity that owns the migration through its complexity spikes.

The Adobe and TRM Labs case studies document the specific organisational conditions that made their migrations work. Adobe’s three-tier Collector architecture at enterprise scale covers the team structure and voluntary adoption model; Airbnb’s StatsD-to-OTel migration and the dual-write pattern covers the dual-write architecture that enabled team-paced validation at scale.

The cluster articles below go deep on each phase. If you are at the start of a migration, the production case studies and the security article are the natural first reads; if you are at backend selection or operational scaling, the vendor strategy and declarative configuration articles apply.

Resource Hub: OpenTelemetry Migration Library

Production Case Studies and Migration Patterns

Airbnb’s OpenTelemetry Migration and What the 10x CPU Reduction Actually Required: The dual-write pattern, delta temporality, and vmagent topology behind Airbnb’s metrics pipeline migration at 100 million samples per second.
Adobe’s OpenTelemetry Pipeline Architecture at Enterprise Scale: Three-tier Collector architecture, signal-level isolation, and the voluntary adoption model for a polyglot enterprise environment.

Vendor Strategy and Operational Discipline

Vendor Lock-In Escape and What OpenTelemetry’s OTLP Protocol Actually Gives You: How OTLP enables backend portability, which vendors support native ingestion, and where semantic convention gaps persist.
Declarative Configuration as OpenTelemetry’s Most Important Maturity Signal: GitOps for Collector fleet management, custom distributions, and what the Thoughtworks Technology Radar Vol. 34 placements signal.

Security and Risk Management

OpenTelemetry Security CVEs Trust Boundaries and What to Patch Now: The April-May 2026 CVE cluster — CVE-2026-41078, CVE-2026-42191, CVE-2026-39883 — with specific patch versions and remediation actions.

FAQ

What is OpenTelemetry and why are enterprises migrating to it from proprietary APM?

OpenTelemetry is a CNCF-graduated, vendor-neutral standard providing APIs, SDKs, a Collector component, and OTLP wire protocol for generating and exporting traces, metrics, and logs. Enterprises migrate because proprietary APM platforms bundle vendor lock-in into their data formats and pricing models. OTel separates the instrumentation layer from the backend, making observability infrastructure vendor-portable — switching backends requires only Collector configuration changes, not application re-instrumentation.

Is OpenTelemetry ready for production at a 100-500 person SaaS company in 2026?

The production readiness question is answered. Elastic survey data shows 11% of organisations in production and 36% experimenting, and Airbnb, Adobe, and TRM Labs have all validated OTel in demanding environments. The real question is team readiness: OTel requires platform engineering capacity that a small SRE team may not have. For teams without that capacity, starting with a hybrid permanent architecture is a lower-risk entry point — Phase 4 covers the full range of backend options in detail.

What is the dual-write pattern and why does every migration guide recommend it?

Dual-write runs your OTel pipeline as a shadow destination alongside the existing proprietary pipeline. Your Collector exports to both simultaneously. You validate alert and dashboard parity under real production conditions before cutting over. Every major migration case study (Airbnb, TRM Labs, Shopify) independently arrived at dual-write because it is the only approach that proves equivalence before removing the safety net. TRM Labs described it as “the safety net that made a cross-company migration feel manageable rather than terrifying.”

Should we migrate to a fully open-source OTel backend or keep our existing vendor?

Both are valid long-term states. The hybrid permanent architecture (OTel instrumentation with OTLP routing to Datadog or New Relic) is where Elastic survey data suggests most enterprises are landing. It delivers the key benefit, a vendor-portable instrumentation layer, without requiring a backend migration. A full move to an OTel-native backend (SigNoz, ClickStack, Grafana Cloud) offers a lower total cost of ownership but requires more operational ownership. OTLP makes either choice reversible.

What should be done about CVE-2026-39883 in OpenTelemetry-Go if there is no patch?

As of May 2026, CVE-2026-39883 (PATH hijacking in OTel-Go versions 1.15.0–1.42.0) has no vendor patch in Linux distribution packages. Mitigations include: network isolation of affected Collector deployments, monitoring for PATH manipulation indicators, and building a custom Collector distribution that excludes the affected component if your workload does not require it. Verify patch status against NVD immediately before any production deployment.
