Traditional application monitoring is built on one assumption: same input, same output, and failures are binary. AI systems break that assumption completely.
AI observability is the practice of understanding AI-powered systems by tracking telemetry signals that traditional APM tools were never built to capture — token consumption, response quality, model drift, and multi-step agent decision chains. It takes the classic three pillars of observability (logs, traces, metrics) and extends them with signals that only exist in probabilistic systems.
This article defines AI observability precisely, compares what an AI trace contains versus a traditional APM trace, clarifies the difference between monitoring and observability, and explains why OpenTelemetry is the vendor-neutral standard that prevents lock-in. By the end you will have a clear mental model of what AI observability is, why your existing tools are necessary but not sufficient, and what mature AI observability actually looks like — setting you up for the platform decisions covered in the AI observability and guardrails guide.
What is AI observability and why is it different from what you already have?
AI observability is the ability to understand AI models and AI-powered systems by monitoring their unique telemetry data — token usage, response quality, and model drift. It extends the traditional three pillars of observability with AI-specific signals that conventional APM tools were never designed to capture.
The core difference is non-determinism. Traditional software produces the same output for the same input, so monitoring can rely on threshold-based checks. LLMs produce variable outputs. Identical prompts generate different responses. “Correct” cannot be defined by a simple threshold — it requires qualitative and statistical assessment over time.
That creates a whole category of failure that is completely invisible to conventional dashboards. Traditional APM tells you a service is slow or throwing errors. AI observability tells you the model’s outputs are drifting, token costs are spiking on a specific input pattern, or an agent is choosing the wrong tool on 12% of requests — problems that look like a perfectly healthy 200 OK to your existing monitoring setup.
AI observability does not replace traditional monitoring. It layers on top of it. Your current Datadog or Prometheus setup still matters. The question is what you need to add.
What does an AI trace actually contain compared to a traditional APM trace?
A traditional APM trace records the execution flow of a request through your services: HTTP calls, database queries, cache hits, service-to-service dependencies. Each span answers the same question — how long did this call take, and did it succeed?
An AI trace asks and answers an entirely different set of questions.
Consider a single user request to a RAG-based chatbot. The resulting trace might contain:
- A retrieval span (what documents were fetched, from which store, at what latency)
- An embedding span (what text was vectorised, which embedding model was used)
- An LLM generation span (what prompt was sent, what completion was returned, how many tokens were consumed, which model version generated the output)
- One or more tool invocation spans (which tools the agent called, what arguments were passed, what was returned)
- An agent decision span (which reasoning path the agent followed)
All of this nests within a single parent trace. It is the same parent-child span model you already know from APM, but with span semantics that have no equivalent in traditional observability.
The OTel GenAI Semantic Conventions (v1.37+) define the standardised attribute schema for these AI spans. Where a traditional APM span carries http.method, http.status_code, and db.statement, an AI span carries gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, the response content, and evaluation scores.
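As a rough illustration of the span hierarchy and attribute schema described above, here is a minimal sketch using plain Python dataclasses rather than the actual OTel SDK. The span names, example model names, and retrieval attributes are illustrative assumptions; the gen_ai.* attribute keys follow the GenAI Semantic Conventions naming.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Simplified stand-in for an OTel span: a name, a flat attribute
    map, and parent-child links forming a single trace tree."""
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    parent: Optional["Span"] = None

    def child(self, name, **attributes):
        span = Span(name, attributes, parent=self)
        self.children.append(span)
        return span

# One user request to a hypothetical RAG chatbot becomes one parent
# trace with nested AI-specific spans.
root = Span("chat.request")
root.child("retrieval", **{"db.system": "vector_store", "retrieval.documents": 4})
root.child("embedding", **{"gen_ai.request.model": "text-embedding-example"})
root.child("llm.generation", **{
    "gen_ai.request.model": "example-model-v1",  # hypothetical model name
    "gen_ai.usage.input_tokens": 812,
    "gen_ai.usage.output_tokens": 143,
})

# Token usage rolls up across the whole trace, not per HTTP call.
total_tokens = sum(
    s.attributes.get("gen_ai.usage.input_tokens", 0)
    + s.attributes.get("gen_ai.usage.output_tokens", 0)
    for s in root.children
)
print(total_tokens)  # 955
```

The point of the sketch is the shape of the data: the same tree structure as an APM trace, but with token counts and model identity as first-class span attributes.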
There is one more structural difference worth flagging: AI traces capture both inputs and outputs — prompts and completions — which traditional APM never needed to do. That creates a data governance concern most APM workflows simply do not address. AI traces contain sensitive information that requires fine-grained access controls and content masking.
The structured records that traces produce — every LLM call, every tool invocation, every agent decision — are also what feed evaluation loops that convert traces into quality signals. What a trace contains is foundational to understanding how evaluation works.
What is the difference between AI monitoring and AI observability?
Monitoring tells you that something is wrong. Observability tells you why.
In traditional systems, most failure modes are known in advance: a database connection pool exhausts, a downstream service returns 503, memory climbs to 95%. You can write alerts for these because you have seen them before.
In AI systems, failure modes are often novel. A model might start hallucinating more frequently after a prompt template change that looked completely innocuous. An agent might enter a retry loop on a specific class of queries. Token costs might spike 40% on inputs containing a particular phrasing pattern. None of these produce an error code. They look completely healthy at the infrastructure layer.
AI monitoring covers the operational baseline: is the model endpoint responding, what is the P99 latency, are error rates within bounds. Necessary. But not sufficient on its own.
AI observability lets you investigate the why: What is the model actually saying, and is quality degrading? Can you trace a bad output back through every LLM call, tool invocation, and retrieval step that produced it?
Here is the practical test. A user reports a bad AI response. Can your team trace that exact request through every step, identify where quality broke down, and determine whether the root cause is in the model, the prompt, the retrieval, or the tool calls? That is observability. If all you can confirm is that the request returned a 200 OK, that is monitoring.
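The practical test above can be sketched as a simple trace walk: given the spans of one request in execution order, find the first step whose quality score falls below an acceptable threshold. The evaluation.score attribute name and the 0.7 threshold are assumptions for illustration, not a standard.

```python
def first_quality_failure(spans, threshold=0.7):
    """Return the name of the first span in execution order whose
    evaluation score falls below the threshold, or None if the
    trace looks healthy."""
    for span in spans:
        score = span.get("evaluation.score")
        if score is not None and score < threshold:
            return span["name"]
    return None

# A trace for one reported-bad response: each step carries a score.
trace = [
    {"name": "retrieval", "evaluation.score": 0.91},
    {"name": "llm.generation", "evaluation.score": 0.42},  # quality broke here
    {"name": "tool.call", "evaluation.score": 0.88},
]
print(first_quality_failure(trace))  # llm.generation
```

A 200 OK tells you none of this; only per-step quality signals attached to the trace make the root cause findable.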
Why is OpenTelemetry becoming the vendor-neutral standard for AI telemetry?
OpenTelemetry (OTel) is an open-source observability framework governed by the Cloud Native Computing Foundation (CNCF). It already dominates traditional cloud-native observability. Its extension into AI through GenAI Semantic Conventions — available since v1.37 — means the same vendor-neutral instrumentation approach now applies to LLM workloads.
The GenAI Semantic Conventions define a standardised schema for AI telemetry: attribute names like gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.provider.name. A consistent vocabulary for spans, metrics, and events across any GenAI system — making AI telemetry portable across frameworks and vendors.
The practical consequence is significant. If your instrumentation is built on OTel, you can switch observability platforms without re-instrumenting your application. Instrument once, and your AI traces can go to Datadog, Azure AI Foundry, MLflow, or any future platform without touching your application code. Without OTel, you are locked to whichever vendor’s proprietary SDK you instrumented with first.
The OTel Collector adds a data pipeline layer between your application and your observability backend: redact sensitive prompt content, enrich spans with metadata, apply sampling policies, and route telemetry to multiple backends — all before data leaves your network. For AI telemetry, where prompts regularly contain sensitive user information, that is a governance control, not just a routing convenience.
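Here is a minimal sketch of the kind of content-masking transform a Collector pipeline applies before telemetry leaves the network, written as plain Python for brevity. The attribute keys gen_ai.prompt and gen_ai.completion and the email-matching regex are illustrative assumptions.

```python
import re

# Illustrative pattern: mask email addresses in prompt content.
SENSITIVE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_attributes(attributes, content_keys=("gen_ai.prompt", "gen_ai.completion")):
    """Mask sensitive content in prompt/completion attributes before
    export -- the kind of transform an OTel Collector processor
    applies in the pipeline, before data reaches any backend."""
    redacted = dict(attributes)
    for key in content_keys:
        if key in redacted:
            redacted[key] = SENSITIVE.sub("[REDACTED]", redacted[key])
    return redacted

span_attrs = {
    "gen_ai.request.model": "example-model-v1",  # hypothetical model name
    "gen_ai.prompt": "Email my invoice to jane.doe@example.com please",
}
print(redact_attributes(span_attrs)["gen_ai.prompt"])
# Email my invoice to [REDACTED] please
```

Because the transform runs in the pipeline rather than the application, the same governance policy applies uniformly to every service emitting AI telemetry.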
OpenInference (Arize AI) and OpenLLMetry (Traceloop) are useful on-ramps here — open-source SDKs that output OTel-format telemetry for AI workloads without requiring deep OTel expertise.
As covered in how to select an AI platform on observability and control plane maturity, a platform’s OTel support level is a primary criterion for keeping your observability investment portable as your architecture evolves.
What is the AI control plane and why does it matter?
OTel provides the instrumentation layer. The control plane is the management layer above it — the centralised surface through which engineering and governance teams manage AI systems in production: evaluation, monitoring, tracing, policy enforcement, and audit logging, all in a single governed layer.
Control plane maturity varies dramatically. At the low end you get basic dashboards with no integration between pre-production testing and production monitoring. At the high end you get quality evaluation before deployment, continuous production monitoring, distributed tracing, cost governance by feature and team, and compliance audit trails — all coordinated.
Microsoft Azure AI Foundry is a concrete example, covering three capabilities across the AI application lifecycle: Evaluation (measuring quality, safety, and reliability during development), Monitoring (post-deployment production monitoring via Azure Monitor Application Insights, with continuous evaluation at sampled rates), and Tracing (distributed tracing built on OpenTelemetry supporting LangChain, Semantic Kernel, and the OpenAI Agents SDK).
Without an integrated control plane, teams default to a patchwork: open-source libraries for pre-production testing, separate tools for production monitoring, and manual processes connecting them. Insights from production rarely feed back into development. The implications for platform selection are covered in full in how to select an AI platform on observability and control plane maturity.
How do token economics change the observability conversation?
In traditional software, cost scales with compute: CPU cycles, memory, bandwidth. AI systems introduce an entirely new cost dimension. Every LLM call consumes input tokens and output tokens, and pricing varies by model, provider, and request complexity. Token spend is a first-class unit cost with no direct analogue in traditional APM.
AI observability needs to track token usage per request, per user, per feature, and per model to enable cost attribution. Without this you cannot answer questions that will become unavoidable: which feature is driving 60% of our LLM spend? Which model is most cost-effective for our use case?
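A sketch of per-feature cost attribution from token telemetry, assuming hypothetical per-1K-token prices; real pricing varies by provider, model, and contract.

```python
# Hypothetical $ per 1K tokens: (input rate, output rate).
MODEL_PRICING = {
    "example-model-small": (0.0005, 0.0015),
    "example-model-large": (0.0100, 0.0300),
}

def request_cost(model, input_tokens, output_tokens):
    """Cost of one LLM call from its token counts."""
    in_rate, out_rate = MODEL_PRICING[model]
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate

# Per-request telemetry (feature tag + token counts) aggregated
# into spend per feature -- the attribution the text describes.
requests = [
    {"feature": "chat", "model": "example-model-large", "in": 1200, "out": 400},
    {"feature": "chat", "model": "example-model-large", "in": 900, "out": 300},
    {"feature": "summarise", "model": "example-model-small", "in": 3000, "out": 250},
]

spend = {}
for r in requests:
    spend[r["feature"]] = spend.get(r["feature"], 0) + request_cost(
        r["model"], r["in"], r["out"]
    )

for feature, cost in sorted(spend.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: ${cost:.4f}")
```

The prerequisite is that every LLM span carries token counts and a feature tag; without that, cost attribution degenerates into manual spreadsheets.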
Token observability also functions as a security and quality signal. A sudden spike in output tokens might indicate a prompt injection attack. A gradual increase in input tokens might signal a code change expanding the context window inadvertently. A shift in token consumption on a specific input type might indicate the model handling that request class differently — a leading indicator of model drift. Traditional cost monitoring cannot detect any of these, because it operates at the infrastructure level, not the request level.
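Spike detection of the kind described above can be as simple as a z-score against a baseline of recent output-token counts; the threshold and baseline values here are illustrative.

```python
import statistics

def token_spike(history, latest, z_threshold=3.0):
    """Flag a request whose output-token count deviates sharply from
    the recent baseline -- a possible prompt-injection or drift
    signal that infrastructure-level monitoring cannot see."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

baseline = [140, 155, 150, 160, 145, 152, 148, 158]
print(token_spike(baseline, 151))   # False: within the normal range
print(token_spike(baseline, 1400))  # True: sudden output spike
```

In practice the baseline would be segmented per input type or feature, since the text's point is precisely that anomalies often appear only within a specific request class.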
The practical governance expression of this is establishing service level objectives that encompass token cost per request alongside latency and error rates. The minimum viable observability stack guide covers how to implement token cost tracking incrementally without requiring a full observability overhaul.
What does mature AI observability look like in practice?
Mature AI observability is not a single tool. It is an integrated capability spanning three layers:
- Telemetry collection: Every AI interaction instrumented with OTel, emitting structured spans with full provenance — model version, prompt template, retrieval context, token counts, output content, quality scores.
- Operational monitoring: Real-time dashboards and alerting covering the operational baseline alongside AI-specific signals — token cost trends, response quality distributions, agent task success rates.
- Diagnostic investigation: The ability to query telemetry data ad hoc to investigate novel failure patterns — trace a specific bad output through every step that produced it.
A mature setup captures every AI interaction as a structured trace with full provenance — the basis for compliance reporting under the EU AI Act and similar regulations — and makes it debuggable when something goes wrong.
The Dynatrace State of Observability 2025 report is a useful reality check here. Only 28% of organisations use AI to align observability data with key performance indicators. And for the first time, AI capabilities have surpassed cloud compatibility as the primary criterion for selecting an observability platform. The market has recognised AI observability as a strategic priority. Execution is still catching up. The full AI observability and guardrails platform guide maps how leading platforms deliver against each of these maturity indicators.
Here are five practical maturity indicators to assess where you stand:
1. You can trace any AI output back through every step that produced it — LLM calls, tool invocations, retrieval steps, agent decisions.
2. You can attribute token costs to specific features, teams, and models without manual spreadsheets.
3. You detect model drift before users report quality degradation, through automated statistical monitoring of output distributions.
4. Your observability data feeds evaluation loops that continuously validate output quality — turning every production trace into a potential test case.
5. Your control plane enforces governance policies across all AI workloads — access controls, audit logging, policy enforcement.
Most teams can achieve items 1 and 2 with effort, manage items 3 and 4 only partially, and rarely reach item 5. The gap is real, but it is addressable incrementally.
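One common way to implement the automated statistical drift monitoring named in the maturity indicators is the population stability index (PSI) over output quality scores; the score values below are illustrative, and the 0.2 threshold is a widely used rule of thumb rather than a fixed standard.

```python
import math

def psi(expected, actual, bins=5, lo=0.0, hi=1.0):
    """Population stability index between two score distributions.
    Values above ~0.2 are conventionally read as meaningful drift."""
    width = (hi - lo) / bins

    def proportions(scores):
        counts = [0] * bins
        for s in scores:
            idx = min(int((s - lo) / width), bins - 1)
            counts[idx] += 1
        # Floor at a tiny proportion to avoid log(0) for empty bins.
        return [max(c / len(scores), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Quality scores from a baseline week versus the current week
# (illustrative values): no error codes, just a shifted distribution.
baseline_scores = [0.82, 0.88, 0.79, 0.91, 0.85, 0.87, 0.90, 0.84]
current_scores  = [0.61, 0.58, 0.72, 0.55, 0.66, 0.63, 0.70, 0.59]

drifted = psi(baseline_scores, current_scores) > 0.2
print(drifted)  # True
```

This is exactly the class of signal that never surfaces as an error: every request succeeded, but the output distribution has moved.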
The AI observability and guardrails guide evaluates how leading platforms deliver these capabilities. For teams ready to start building, the minimum viable observability stack maps an incremental path from zero to production-grade AI observability without requiring a complete infrastructure overhaul.
FAQ
Can I use my existing Datadog setup for AI observability?
Partially. Datadog LLM Observability now supports OTel GenAI Semantic Conventions v1.37+ natively — GenAI spans can flow directly via an existing OTel Collector pipeline. But you will still need to add AI-specific instrumentation to emit LLM call spans, token usage metrics, and response quality signals that existing Datadog agents do not capture automatically. Necessary, but not sufficient on its own.
What are OpenTelemetry GenAI Semantic Conventions?
A standardised schema within OpenTelemetry (v1.37+) that defines how AI telemetry is structured — attribute names for model identity, token usage, and provider metadata. They establish a consistent vocabulary across any GenAI system, making AI telemetry portable across frameworks and vendors without re-instrumentation.
How does AI observability help explain decisions to the board?
It provides the data foundation for board-level reporting: cost attribution per feature, quality trends over time, compliance audit trails, and incident root-cause timelines. Token spend mapped to business outcomes, quality scores trending over time, and audit logs of every AI decision give the board traceable, quantified answers rather than anecdotes.
What is model drift and how do I detect it?
Model drift is the gradual change in a model’s output behaviour as real-world conditions evolve away from training. It does not produce an error code — it produces subtly different output distributions over time. AI observability detects drift by tracking output quality scores, response distributions, and token usage patterns continuously, flagging statistical deviations before they manifest as user-facing quality problems.
Is OpenTelemetry the right standard for tracking AI agents?
Yes, for instrumentation and data format. OTel’s span-based tracing model maps naturally to agentic workflows — each tool call, decision point, and LLM invocation becomes a span within a parent trace. OpenInference (Arize AI) and OpenLLMetry (Traceloop) provide agent-ready SDKs that output OTel-format telemetry without requiring deep OTel expertise.
What is the difference between AI observability and ML observability?
ML observability focuses on model-level behaviour within the machine learning lifecycle: data drift, feature importance, prediction distribution, and training/serving skew. AI observability operates at the application level: end-to-end tracing through LLM calls, agent decisions, tool invocations, and RAG pipelines. AI observability includes ML-level concerns but extends them to the full application stack.
Do I need AI observability if I am only running a single LLM-powered chatbot?
Yes. Even a single chatbot generates AI-specific failure modes — hallucinations, quality degradation, token cost spikes, prompt injection attempts — that traditional monitoring will not detect. The scope can be minimal (a single OTel-instrumented trace pipeline), but the need exists from the first production deployment.
What does it cost to implement AI observability?
The instrumentation layer — OTel SDKs, GenAI Semantic Conventions — is open source and free. Costs come from the observability backend (self-hosted or SaaS), data storage, and engineering time. For a small team, a minimum viable stack can be operational within days. The minimum viable stack guide covers this path for teams working within SMB resource constraints.
What telemetry signals matter most when starting AI observability from scratch?
Start with distributed traces for every LLM call and token cost attribution per request. Traces capture model version, token counts, latency, and prompt/completion content — the provenance record that makes a response debuggable and auditable. Token cost attribution tells you which features and models are driving LLM spend. Response quality scoring and drift detection build on this foundation and can be added incrementally.