According to LangChain’s survey of over 1,300 professionals, 89% of organisations have some form of observability running on their agents. If you’re running production deployments, that jumps to 94%.
The reason is simple. Multi-agent systems don’t behave like normal software. Same input, different output, every single time. You can’t debug with breakpoints and stack traces when the execution path changes on every run.
This guide is part of our comprehensive coverage of multi-agent orchestration, where we explore the infrastructure requirements for production deployments. Here we’re going to cover why production systems need observability, which platforms support it, and how to measure whether your agents are actually working. You’ll get platform comparisons, adoption stats for different evaluation methods, and a framework for choosing between LangSmith, Comet Opik, and OpenTelemetry.
Let’s get into it.
Why do 89% of organisations run observability on their agents?
The LangChain survey shows 89% overall adoption, jumping to 94% for production users. Among production organisations, 71.5% have detailed tracing versus 62% overall.
Quality is the top barrier to production, cited by 32% of respondents. Latency follows at 20%. Neither can be fixed without observability.
Here’s what production failures look like without observability. Hallucinations you can’t trace. Tool selection errors with no reasoning chain to debug. Planning loops that repeat forever. Your mean time to resolution stretches from minutes to hours. This is why observability enables failure diagnosis—without it, you’re flying blind trying to debug the MAST failure modes that plague multi-agent systems.
That 89% adoption rate reflects reality—observability went from optional to mandatory as agents moved from experiments to real workloads. Cost used to matter, but falling model prices mean observability costs (typically 5-15% of total spend) are negligible compared to the cost of production failures.
How does debugging multi-agent systems differ from traditional software?
Traditional software is deterministic. Same input, same output, every time. Bugs reproduce. You set breakpoints, inspect stack traces, analyse logs, track down the problem.
Multi-agent systems are non-deterministic. LLMs generate different reasoning paths for identical inputs. Same user query, different tool selections, different parameters, different outcomes.
Traditional observability has three pillars: metrics, logs, and traces. Agent observability adds two more: evaluations and governance.
The evaluation pillar measures quality beyond error rates. The governance pillar covers safety checks, compliance monitoring, ethical alignment. None of this exists in traditional APM tools like Datadog or New Relic. Those tools provide infrastructure monitoring—CPU, memory, latency, errors—but they lack reasoning trace capture, LLM-as-judge evaluation, and governance capabilities.
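To make the two extra pillars concrete, here is a minimal Python sketch of what a single agent trace record might carry once evaluations and governance are attached. The field names are illustrative, not any particular platform’s schema.

```python
from dataclasses import dataclass, field

@dataclass
class AgentTraceRecord:
    """Illustrative record combining the five pillars for one agent run."""
    # Pillars 1-3: the traditional trio (metrics, logs, traces)
    trace_id: str
    latency_ms: float
    token_usage: int
    error: str | None = None
    log_events: list[str] = field(default_factory=list)

    # Pillar 4: evaluations -- quality beyond error rates
    relevance_score: float | None = None   # e.g. 0-1 from an LLM-as-judge
    task_completed: bool | None = None     # human or automated label

    # Pillar 5: governance -- safety and compliance signals
    safety_flags: list[str] = field(default_factory=list)        # e.g. ["pii_detected"]
    policy_violations: list[str] = field(default_factory=list)
```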
What observability platforms support multi-agent systems?
The platform landscape breaks down into three categories: observability-centric tools like LangSmith, Galileo, and Helicone; evaluation-centric platforms like Comet Opik and Langfuse; and open standards like OpenTelemetry.
LangSmith is LangChain’s commercial platform with native integration and managed infrastructure. The free tier gives you a single seat and 5,000 traces per month. Paid plans start at $39 per user per month.
Comet Opik is open-source and free with LLM-as-judge integration and self-hosting. Performance benchmarks show Opik completes trace logging and evaluation in roughly 23 seconds versus Phoenix’s 170 seconds and Langfuse’s 327 seconds. The hosted plan includes 25,000 spans per month with unlimited team members. Pro plan runs $39 per month for 100,000 spans.
OpenTelemetry is the vendor-neutral standard with no platform lock-in. But you’ll need to build custom integrations for agent-specific features yourself.
Azure AI Foundry is the enterprise option with CI/CD integration and built-in governance via Microsoft Purview. Langfuse and Arize Phoenix are open-source alternatives with strong evaluation and tracing.
Traditional APM vendors are adding LLM observability extensions. Datadog and W&B Weave now provide LLM-specific monitoring on top of existing infrastructure.
The selection criteria matter more than the platforms. Match your ecosystem integration needs, evaluation priorities, deployment model, cost structure, and governance requirements to platform strengths. For a deeper look at how these platforms integrate with multi-agent framework infrastructure, including Redis state management and cloud deployment options, see our framework landscape guide.
What evaluation methods work for non-deterministic agent behaviour?
Human review has 59.8% adoption, the highest of all methods. You need it for nuanced situations and high-stakes decisions where automated evaluation misses context.
LLM-as-judge sits at 53.3% adoption. Automated quality scoring for helpfulness, relevance, coherence, and guideline adherence. Comet Opik’s strength is LLM-as-judge integration, enabling scalable automated evaluation.
Offline evaluation has 52.4% adoption. Pre-deployment testing on synthetic test sets with the lowest barrier to entry. Most teams start here.
Online evaluation sits at 37.3% adoption overall but jumps to 44.8% among production users. Real-time production monitoring, sampling actual user interactions.
Most organisations use multiple evaluation methods at once. The multi-method strategy works like this: inexpensive offline evaluation during development, sample-based online evaluation in production (10-30% of traffic is enough), LLM-as-judge for scalable automated assessment, and human review reserved for complex or high-stakes situations.
Match evaluation methods to use case maturity, risk tolerance, and resource constraints. Start with offline during development, add online sampling in production, use LLM-as-judge for scale, reserve human review for situations where automated evaluation falls short. When planning your implementation, these evaluation methods become the foundation for measuring pilot project success and establishing KPI baselines.
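Here is a minimal sketch of that multi-method pattern in code, combining a sampling gate with an LLM-as-judge score. It assumes the OpenAI Python SDK (v1.x); the judge model, rubric wording, and 20% sample rate are illustrative choices, not recommendations from the survey.

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
SAMPLE_RATE = 0.2  # evaluate roughly 20% of production traffic

JUDGE_PROMPT = """Rate the assistant response for helpfulness and relevance
to the user query on a scale of 1-5. Reply with a single integer.

Query: {query}
Response: {response}"""

def maybe_evaluate(query: str, response: str) -> int | None:
    """Sample-based online evaluation with an LLM-as-judge score."""
    if random.random() > SAMPLE_RATE:
        return None  # skip unsampled traffic to keep evaluation costs down
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model; swap for whatever you standardise on
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(query=query, response=response),
        }],
    )
    return int(result.choices[0].message.content.strip())
```

High or low scores from the judge can then be routed to the human review queue, keeping people focused on the cases automated evaluation handles worst.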
How do you measure success in multi-agent implementations?
Tool selection accuracy measures whether the agent chooses the correct tool for the task. This is your first gate—wrong tool, everything downstream fails.
Parameter correctness evaluates whether the agent provides accurate arguments when calling tools or functions. Right tool, wrong parameters still equals failure.
Task completion is the primary business outcome. Did the agent successfully fulfil the user request end-to-end?
The workflow evaluation metrics track the full pipeline. Intent resolution assesses whether the agent accurately identifies and addresses user intentions. Task adherence evaluates whether the agent follows through on identified tasks according to instructions. Step completion tracks whether individual steps in multi-step workflows execute successfully.
Step utility identifies inefficient reasoning. Does each step contribute value toward task completion, or is the agent spinning its wheels? Response completeness evaluates whether agent responses include all necessary information to satisfy requests.
Quality dimensions include relevance, coherence, and fluency as standard AI quality assessments. For RAG systems, context precision measures the quality and relevance of retrieved context.
Efficiency metrics track minimal redundant calls, optimal token usage, and acceptable latency. Azure AI Foundry includes built-in evaluators for task adherence, intent resolution, and response completeness.
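To make the first two gate metrics concrete, here is a small sketch that computes tool selection accuracy and parameter correctness against a hand-labelled test set. The record format is an assumption for illustration, not any platform’s schema.

```python
def tool_selection_accuracy(runs: list[dict]) -> float:
    """Fraction of runs where the agent picked the expected tool."""
    correct = sum(1 for r in runs if r["selected_tool"] == r["expected_tool"])
    return correct / len(runs)

def parameter_correctness(runs: list[dict]) -> float:
    """Of the runs with the right tool, fraction with the right arguments."""
    right_tool = [r for r in runs if r["selected_tool"] == r["expected_tool"]]
    if not right_tool:
        return 0.0
    correct = sum(1 for r in right_tool if r["tool_args"] == r["expected_args"])
    return correct / len(right_tool)

# Illustrative hand-labelled test set
runs = [
    {"selected_tool": "search", "expected_tool": "search",
     "tool_args": {"q": "refund policy"}, "expected_args": {"q": "refund policy"}},
    {"selected_tool": "calculator", "expected_tool": "search",
     "tool_args": {"expr": "2+2"}, "expected_args": {"q": "refund policy"}},
]
print(tool_selection_accuracy(runs))   # 0.5 -- one of two runs picked the right tool
print(parameter_correctness(runs))     # 1.0 -- the right-tool run also had right args
```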
What does TAO cycle tracing reveal that logs cannot?
The TAO cycle—Thought, Action, Observation—is the iterative loop agents use to reason, act, and learn. This cycle is fundamental to how orchestration patterns coordinate autonomous agents, and understanding it is essential for effective debugging. Traditional logs capture events and errors but miss the reasoning chains connecting decisions.
TAO tracing shows you why the agent selected a specific tool, what reasoning led to those parameters, and how results influenced the next steps. End-to-end workflow tracing captures request flow through all agents, tool calls, and LLM invocations in multi-agent systems.
Graph visualisation shrinks debugging time from hours to minutes by pinpointing exact tool invocation failures. The structural view reveals subtle coordination failures like agents repeatedly trying the same failing approach or selecting tools in the wrong sequence.
Production monitoring enables real-time alerting on reasoning anomalies, unexpected tool selections, and planning loops. You can map observed failures back to root causes using the complete execution path with correlation ID, timing data, state transitions, token usage, and error conditions.
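One of those alerts, planning-loop detection, can start as something very simple: flag any run that issues the same tool call with identical arguments more than a couple of times. A rough sketch, with an arbitrary threshold:

```python
from collections import Counter

LOOP_THRESHOLD = 3  # arbitrary: flag after 3 identical tool calls in one run

def detect_planning_loop(tool_calls: list[tuple[str, str]]) -> bool:
    """tool_calls is a list of (tool_name, serialised_args) for a single run."""
    counts = Counter(tool_calls)
    return any(n >= LOOP_THRESHOLD for n in counts.values())

# Example: an agent retrying the same failing search with identical arguments
calls = [("search", '{"q": "order 123"}')] * 3 + [("respond", "{}")]
assert detect_planning_loop(calls)
```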
LangSmith provides comprehensive span-level tracing capturing full TAO cycles. OpenTelemetry enables TAO tracing without platform lock-in through a vendor-neutral standard, though you’ll need to build the integration yourself.
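Here is a minimal sketch of what “build the integration yourself” looks like with the OpenTelemetry Python SDK: one span per TAO step, with agent-specific data attached as span attributes. The attribute names are our own convention rather than an official semantic convention, the console exporter stands in for a real backend, and the tool dispatcher is a stub.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter for illustration; swap in an OTLP exporter for a real backend
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("agent.tao")

def call_tool(tool: str, args: dict) -> str:
    """Stand-in for your real tool dispatcher."""
    return f"result of {tool}({args})"

def run_tao_step(thought: str, tool: str, args: dict) -> str:
    """One Thought-Action-Observation step, captured as a single span."""
    with tracer.start_as_current_span("tao.step") as span:
        span.set_attribute("agent.thought", thought)          # why this action
        span.set_attribute("agent.action.tool", tool)         # which tool was chosen
        span.set_attribute("agent.action.args", str(args))    # with which parameters
        observation = call_tool(tool, args)
        span.set_attribute("agent.observation", observation[:500])  # truncate large payloads
        return observation
```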
How do you choose between LangSmith, Opik, and OpenTelemetry?
Start with ecosystem integration. If you’re building on LangChain or LangGraph, LangSmith provides native integration with the lowest friction. If you’re in the Microsoft ecosystem with enterprise governance needs, Azure AI Foundry gives you Purview integration, CI/CD automation, and EU AI Act compliance support.
Evaluation priorities matter. Evaluation-centric tools like Opik and Galileo excel at measuring output quality and running comprehensive test suites. Observability-centric tools like Helicone and Phoenix prioritise operational metrics, tracing, and real-time monitoring.
Deployment model splits between managed platforms and self-hosting. Managed platforms reduce overhead but cost more. Self-hosting gives transparency, flexibility, and control but requires operating infrastructure yourself.
Cost structure varies. LangSmith charges per trace volume with paid plans starting at $39 per user per month. Opik is free open-source with a Pro plan at $39 per month for 100,000 spans. OpenTelemetry is free but requires integration effort—the cost is your team’s time.
Governance needs determine whether you need platforms with audit trails, safety evaluations, bias detection, and compliance reporting. Azure AI Foundry provides these capabilities for enterprise teams with strict compliance requirements.
Hybrid approaches use OpenTelemetry as the foundation with platform-specific evaluation layers. This gives you vendor neutrality for tracing while leveraging specialised tools for evaluation.
Map your organisational requirements—ecosystem, evaluation priorities, deployment preferences, budget, governance needs—to platform strengths. LangSmith for LangChain shops needing managed infrastructure. Opik for budget-conscious teams wanting open-source with strong LLM-as-judge capabilities. OpenTelemetry for vendor neutrality and heterogeneous stacks. Azure AI Foundry for Microsoft ecosystems with compliance requirements.
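As a rough illustration of the low-friction path for LangChain shops, the sketch below uses the langsmith SDK’s traceable decorator. Environment variable names have changed between SDK versions, so treat them as indicative and check the docs for your version; the routing function is hypothetical.

```python
import os
from langsmith import traceable

# Tracing is typically enabled via environment variables; names vary by SDK version
os.environ.setdefault("LANGSMITH_TRACING", "true")
# os.environ["LANGSMITH_API_KEY"] = "..."  # set outside source control

def route_to_team(query: str) -> str:
    """Hypothetical downstream routing logic."""
    return "billing" if "invoice" in query.lower() else "support"

@traceable(name="triage_agent")
def triage(query: str) -> str:
    # LLM and tool calls made inside this function are captured as child runs
    return route_to_team(query)
```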
For a complete view of how observability fits into the production multi-agent orchestration ecosystem, including its relationship to protocols, frameworks, and governance patterns, see our comprehensive guide to the orchestration landscape.
FAQ Section
What percentage of organisations have agents in production?
57.3% of survey respondents have agents in production environments, with 94% of those organisations implementing observability compared to 89% overall. Once agents face real users and business workloads, observability becomes non-negotiable.
Can I use existing APM tools like Datadog for agent observability?
Traditional APM tools provide infrastructure monitoring but lack agent-specific capabilities like reasoning trace capture, LLM-as-judge evaluation, and governance. Datadog now offers LLM observability extensions, but comprehensive agent observability requires the five-pillar framework of metrics, logs, traces, evaluations, and governance.
How much does observability tooling cost compared to LLM API costs?
LLM-as-judge evaluation adds API costs typically 10-20% of production LLM spend. LangSmith charges per trace volume, Opik is free open-source with self-hosting costs, OpenTelemetry requires integration effort. Most organisations find observability costs negligible (5-15%) compared to production failure costs.
What’s the difference between observability and monitoring for agents?
Monitoring tracks system health metrics—latency, errors, throughput—focused on what happened. Observability shows you why it happened through reasoning traces, tool selection logic, and quality evaluations. Agent observability extends traditional monitoring with evaluation and governance pillars needed for non-deterministic LLM behaviour.
Do I need observability if I’m only running one agent, not multi-agent?
Yes. Non-deterministic LLM behaviour, quality assurance needs, and debugging requirements exist regardless of agent count. Multi-agent systems add complexity through distributed tracing and inter-agent communication, but single agents still require reasoning visibility, evaluation, and monitoring.
How does hallucination detection work in observability platforms?
Hallucination detection combines multiple approaches. LLM-as-judge evaluators assess factual correctness against retrieved context for RAG systems. Human review flags nonsensical outputs, automated checks compare responses to ground truth datasets, and anomaly detection identifies response patterns deviating from baselines.
What observability capabilities should be in place before production deployment?
Minimum viable observability includes distributed tracing that captures TAO cycles, offline evaluation on representative test sets, basic metrics (latency, error rates, tool selection accuracy), and a human review process for quality spot-checks. Advanced needs include online evaluation, LLM-as-judge automation, and governance checks.
Can OpenTelemetry fully replace commercial platforms like LangSmith?
OpenTelemetry provides vendor-neutral distributed tracing infrastructure but requires custom implementation for agent-specific features like evaluation frameworks, LLM-as-judge integration, governance checks, and visualisation tools. Organisations choose OpenTelemetry to avoid lock-in, accepting higher integration effort versus managed platforms’ out-of-the-box capabilities.
How do evaluation adoption rates differ between early-stage and production organisations?
Production organisations show higher adoption across all methods. Online evaluation jumps from 37.3% overall to 44.8% in production, detailed tracing increases from 62% to 71.5%, and the percentage not evaluating drops from 29.5% to 22.8%. Moving to production accelerates evaluation maturity.
What role does observability play in regulatory compliance like the EU AI Act?
The EU AI Act requires risk assessment, transparency, and human oversight for high-risk AI systems. Observability platforms with governance capabilities like Azure AI Foundry provide audit trails, safety evaluations, bias detection, and compliance reporting. TAO tracing creates explainability documentation showing how agents make decisions.
Should I implement observability during development or wait until production?
Implement observability during development. Offline evaluation, tracing, and quality metrics enable rapid iteration that’s core to agent engineering workflows. Waiting until production creates technical debt, lacks baseline metrics, and increases production failure risk.
How do I balance evaluation costs with quality assurance needs?
Multi-method strategy: use inexpensive offline evaluation during development, sample-based online evaluation in production (not 100% of traffic), LLM-as-judge for scalable automated assessment, and human review reserved for high-stakes situations. Most organisations find 10-30% sampling sufficient for production monitoring while controlling costs.