For most of the past three years, the way you evaluated an AI platform was simple: look at the benchmark leaderboards. Which platform runs GPT-4o? Which scores highest on MMLU? Which has the longest context window? That is what shaped vendor pitches and procurement decisions. But Dynatrace’s State of Observability 2025 report found that AI capabilities (29%) have now overtaken cloud compatibility as the number-one criterion for selecting an observability platform. What technical buyers actually care about is no longer which model a platform hosts — it is whether the platform can operate AI reliably in production.
The differentiator is not the model. It is the control plane: does the platform give you structured observability, guardrails, governance, and lifecycle management — or just an inference endpoint that calls a model and returns a string?
In this article we’ll give you a decision framework for evaluating AI platforms on control-plane maturity. We compare Azure AI Foundry and Databricks on that framework, work through the evaluation tooling decision (MLflow vs LangSmith), address the open-source vs SaaS trade-off at SMB scale, and give you a practical selection checklist. We also name an antipattern you will want to avoid: benchmark theater.
For the broader context on why observability and guardrails matter for production AI, read the AI observability and guardrails platform guide that anchors this cluster.
Why are benchmark scores the wrong primary criterion for AI platform selection?
Benchmark theater is selecting or marketing AI platforms based on standardised benchmark scores — MMLU, HumanEval, ARC-AGI — that do not reliably predict production performance for your actual use case. It creates false confidence by measuring model capability under controlled conditions, not how the platform handles failures, edge cases, and governance requirements in production.
Here is what benchmarks actually measure: performance on standardised tasks with known inputs and outputs. Here is what they fail to predict: hallucination rates on your domain-specific data, latency under real traffic, failure detection when an agent reasons incorrectly, and policy enforcement at scale.
The gap is not speculative. Research on question-answering benchmarks has found train/test overlap rates above 45%, and models that achieve “superhuman” performance on leaderboards often fail on out-of-distribution inputs. Benchmark creators and model creators can have collaborative relationships, with models highlighted on favourable task subsets to create an illusion of across-the-board performance. Benchmark scores measure task memorisation as much as general capability.
The most telling evidence that benchmark theater is the wrong framework: despite widespread AI adoption, humans still verify 69% of all AI-driven decisions. That is not a story about model benchmark quality. It is a story about production reliability and the absence of the observability and guardrails infrastructure needed to close the trust gap.
The practical test is straightforward. Ask any AI platform vendor how their benchmark scores correlate with performance on your specific use-case data. If they cannot give you domain-specific evidence, the benchmark scores are not meaningful for your selection decision.
What is control-plane maturity and why should it drive platform selection?
Control-plane maturity is the degree to which an AI platform provides structured, production-grade capabilities across four pillars: controls (guardrails), observability (monitoring and evaluation), security (identity, threat detection, compliance), and fleet-wide operations (unified agent management and cost attribution). It is the central selection criterion explored in the AI observability and guardrails platform guide.
A model inference endpoint gives you predictions. A control plane gives you the ability to govern, observe, and operate those predictions at scale — and that operational capability is what determines whether an AI application survives contact with production traffic.
The four pillars in practice:
Controls: Guardrails applied at input, tool-call, and output stages — not just at the API boundary. This includes task adherence checking, sensitive data detection, groundedness verification, and prompt injection mitigation. A mature control plane enforces policy across the full execution path of an agent, not just at entry and exit.
Observability: End-to-end tracing of agent execution including tool calls, with evaluation at both pre-production and production stages. AI observability is not the same as traditional infrastructure monitoring. It must detect the specific failure mode of AI systems: an agent returning HTTP 200 with confidently wrong content. LLM-based agents are nondeterministic and failures often do not throw errors. Without proper observability, you cannot explain why an agent behaved a certain way or how to fix it.
Security: Identity-based agent management, threat detection, and compliance-readiness signals. Includes RBAC, audit logging of all agent decisions, and integration with broader security infrastructure.
Fleet operations: A unified view of all agents — regardless of which framework built them — showing performance, ownership, policy coverage, and cost attribution.
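The controls pillar can be made concrete with a minimal sketch. The pipeline below applies a check at each of the three stages — input, tool call, and output — rather than only at the API boundary. All names (`check_input`, `check_tool_call`, `check_output`) and the toy groundedness heuristic are illustrative assumptions, not any vendor's API:

```python
# Illustrative stage-level guardrails (hypothetical names, not a vendor API).
# Each stage can veto the request before execution continues.

BLOCKED_PATTERNS = ["ssn:", "credit card"]

def check_input(prompt: str) -> str:
    """Input-stage guardrail: reject prompts containing blocked patterns."""
    lowered = prompt.lower()
    if any(p in lowered for p in BLOCKED_PATTERNS):
        raise ValueError("input guardrail: sensitive data detected")
    return prompt

def check_tool_call(tool_name: str, allowed: set) -> str:
    """Tool-call-stage guardrail: only allow-listed tools may run."""
    if tool_name not in allowed:
        raise ValueError(f"tool guardrail: {tool_name!r} not permitted")
    return tool_name

def check_output(answer: str, source_docs: list) -> str:
    """Output-stage guardrail: toy groundedness check -- at least one
    answer token must appear in the retrieved sources."""
    tokens = set(answer.lower().split())
    grounded = any(tokens & set(doc.lower().split()) for doc in source_docs)
    if not grounded:
        raise ValueError("output guardrail: answer not grounded in sources")
    return answer

# Guard every stage of one agent step.
prompt = check_input("What is our refund policy?")
tool = check_tool_call("search_docs", allowed={"search_docs"})
answer = check_output("Refunds are issued within 14 days.",
                      ["Policy: refunds issued within 14 days of purchase."])
print(answer)
```

The point of the sketch is structural: a mature control plane runs checks like these inside the execution path, not as a single filter wrapped around the whole request.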
The lifecycle test is simple. Most platforms support stage one (base model selection and ideation) well. Stage two (pre-production evaluation) is inconsistent. Stage three (post-production monitoring) is where most platforms under-invest. A platform that cannot show you explicit lifecycle tooling for stage three is providing an inference endpoint with observability bolted on, not a mature control plane.
For detail on the observability pillar, see What AI observability actually is and how it differs from traditional monitoring. For the guardrails pillar, see The AI guardrails spectrum: from prompt filters to lifecycle controls.
How does Azure AI Foundry perform on observability and control-plane criteria?
Azure AI Foundry implements the four-pillar control plane architecture explicitly. Controls, observability, security, and fleet-wide operations are named, documented pillars — not a retrospective categorisation applied for this comparison.
Controls: Foundry provides unified guardrails spanning inputs, outputs, tool calls, and tool responses. Coverage includes task adherence, sensitive data detection, groundedness checks, and prompt injection mitigation. Azure AI Content Safety handles content filtering at the platform level.
Observability: Built-in evaluators cover quality metrics (coherence, fluency), RAG-specific metrics (groundedness, relevance), safety metrics, and agent-specific metrics (tool call accuracy, task completion). Teams can build custom evaluators using the Azure AI Evaluation SDK. Azure Monitor Application Insights gives you real-time dashboards and OpenTelemetry-based tracing with explicit support for LangChain, Semantic Kernel, and the OpenAI Agents SDK. Foundry supports both offline evaluation (pre-production test datasets) and online monitoring (production traffic sampling), with scheduled evaluation to detect drift and alerts when outputs fail quality thresholds.
Security: Every agent gets a Microsoft Entra Agent ID at creation. Foundry integrates with Microsoft Defender for threat detection and Purview for compliance visibility and organisation-wide policy enforcement.
Fleet operations: The Foundry operate dashboard gives you a single view of the entire agent estate — including agents built with external frameworks — showing performance, ownership, policy coverage, alerts, cost, and compliance gaps.
The lifecycle model — base model selection, pre-production evaluation, and post-production monitoring — is clearly separated with observability tooling at each stage. If your team values comprehensive documentation, the completeness here is a genuine differentiator.
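The offline-evaluation-plus-online-monitoring split Foundry supports can be sketched generically. This is a hedged, stdlib-only illustration — the scorer is a toy token-overlap heuristic and all function names are mine, not the Azure AI Evaluation SDK:

```python
import random

def groundedness_scorer(answer: str, context: str) -> float:
    """Toy scorer: fraction of answer tokens present in the context.
    A real evaluator would be an LLM judge or an SDK-provided metric."""
    tokens = answer.lower().split()
    if not tokens:
        return 0.0
    ctx = set(context.lower().split())
    return sum(t in ctx for t in tokens) / len(tokens)

def offline_eval(test_set, scorer, threshold=0.5):
    """Pre-production: score a fixed test dataset, gate on the average."""
    scores = [scorer(row["answer"], row["context"]) for row in test_set]
    avg = sum(scores) / len(scores)
    return {"avg": avg, "passed": avg >= threshold}

def online_monitor(live_record, scorer, sample_rate=0.1, rng=random.random):
    """Production: score a sampled fraction of live traffic with the
    SAME scorer, so offline and online numbers stay comparable."""
    if rng() >= sample_rate:
        return None  # this request was not sampled
    return scorer(live_record["answer"], live_record["context"])

test_set = [
    {"answer": "refunds within 14 days",
     "context": "refunds are issued within 14 days"},
    {"answer": "free shipping always",
     "context": "shipping costs 5 euros"},
]
print(offline_eval(test_set, groundedness_scorer))
```

The design choice worth copying regardless of platform: one scorer, two stages. If the offline gate and the production monitor use different metrics, a drift alert cannot be compared against the pre-deployment baseline.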
The honest limitation: the full Azure stack assumes Azure infrastructure investment, and governance features are designed at enterprise scale. For teams already in the Azure ecosystem, that is a strength. For teams evaluating from a non-Azure baseline, the onboarding investment is real.
For more on evaluation architecture, see How AI evaluation loops work and why they matter for production reliability.
How does Databricks perform on the same control-plane criteria?
Databricks’ control-plane story is distributed across several components: Unity Catalog (governance), Mosaic AI Gateway (guardrails), MLflow (observability and evaluation), and the Databricks AI Security Framework (DASF). Evaluating Databricks on control-plane maturity means mapping these components to the same four pillars.
Controls: Mosaic AI Gateway provides centralised guardrails with input and output safety filtering, sensitive data detection, rate limiting, and fine-grained access controls at the agent level. Where Foundry applies guardrails at the tool-call stage by default, Databricks’ guardrail coverage sits primarily at the API boundary via the gateway — a meaningful difference in depth across the execution path.
Databricks also has a distinctive guardrails philosophy: the agent calibration pattern. Rather than treating guardrails purely as content filters, the pattern designs agents to acknowledge when confidence is low instead of generating plausible-sounding wrong answers. For teams where hallucination risk, rather than content policy enforcement, is the primary concern, the calibration approach offers a different kind of production reliability.
Observability: MLflow Trace provides out-of-the-box observability for most agent orchestration frameworks — LangGraph, OpenAI, AutoGen, CrewAI, Groq — with auto-tracing via a single line of code. Traces follow the OpenTelemetry format. MLflow 3 offers built-in evaluation judges for safety, correctness, and groundedness; custom LLM judges for domain-specific criteria; and code-based scorers for deterministic business logic. The same evaluation scorers used during offline testing can run on live production traffic via Databricks GenAI application monitoring.
Security: DASF identifies 62 distinct AI risks across 12 components of an AI system, organised into four categories: security, operational, compliance and ethical, and data risks. It maps controls to 10 industry standards. This is not just a feature list — it is a systematised approach to AI risk, and the existence of a formal framework like DASF signals governance maturity.
Fleet operations: Unity Catalog provides centralised governance of AI assets — models, agents, and functions — with data lineage tracking, access controls, and compliance auditing built in.
The honest limitation: the full Databricks stack assumes data platform investment. Unity Catalog governance is most valuable at organisations with significant data engineering infrastructure already in place. Teams without existing Databricks investment face a larger onboarding burden.
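The traces MLflow captures follow the OpenTelemetry model: nested spans with parent links, timing, and attributes, where a tool call is a child span of the agent run. The stdlib-only sketch below shows that shape — the `span` helper and all names are mine for illustration, not MLflow's API:

```python
import time
import json
from contextlib import contextmanager

TRACE = []  # collected spans, OpenTelemetry-style parent/child records

@contextmanager
def span(name, parent=None, **attrs):
    """Minimal OTel-flavoured span: name, parent link, timing, attributes."""
    record = {"name": name, "parent": parent, "attributes": attrs,
              "start_ns": time.time_ns()}
    try:
        yield record
    finally:
        record["end_ns"] = time.time_ns()
        TRACE.append(record)

# One agent run: a root span with an LLM call and a tool call as children.
with span("agent_run", input="What is 2+2?") as root:
    with span("llm_call", parent=root["name"], model="example-model"):
        pass  # the model decides to call a tool
    with span("tool_call", parent=root["name"],
              tool="calculator", args="2+2", result="4"):
        pass

print(json.dumps([s["name"] for s in TRACE]))
```

Because the tool call is a span rather than a log line, the trace can answer "which tool did the agent call, with what arguments, inside which run" — the question that distinguishes AI observability from request-level monitoring.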
For guardrail maturity context, see The AI guardrails spectrum: from prompt filters to lifecycle controls.
MLflow vs LangSmith: which evaluation tooling fits your platform decision?
The evaluation tooling choice is embedded in the platform decision. Each tool has ecosystem affinities, and for smaller teams the practical question is which creates the least friction given the platform you are already building on.
MLflow is open-source and Databricks-native. Auto-tracing enables observability for most major frameworks with a single line of code, using OpenTelemetry format. It is the strongest choice for teams already in the Databricks ecosystem and for teams that prioritise open-source licensing and self-hosting control.
LangSmith is a proprietary SaaS product from LangChain with the deepest native integration for LangChain and LangGraph. Self-hosting is Enterprise-only. The free tier offers 5,000 traces/month; Plus is $39/seat/month.
Langfuse is open-source (MIT licence) and framework-agnostic. Self-hosting is first-class with full feature parity — not an enterprise add-on. It integrates with 80+ frameworks via OpenTelemetry, making it the most practical choice for teams with data residency requirements. Cloud plans start at $29/month.
Braintrust integrates evaluation directly into the observability workflow — structured evaluation as part of continuous monitoring, not just trace capture. The Pro tier is $249/month. Self-hosting is enterprise-only.
The decision is simple: follow your existing ecosystem. If you are building on Databricks, use MLflow — it is already there. If you are building on LangChain, use LangSmith. If you are on neither and have data residency requirements, use Langfuse. If you need evaluation-first observability and have the budget, Braintrust is worth evaluating.
One thing worth knowing: all four support OpenTelemetry-compatible tracing. That means migration between tools is a configuration change on the tracing export, not a full replatforming. Your switching costs are lower than they appear.
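In practice that migration surface looks like a single export configuration, not re-instrumented application code. A hedged sketch of the idea — the endpoints and header names below are deliberate placeholders, not the vendors' real ingest URLs or auth schemes:

```python
# Switching observability backends = swapping the OTLP export target.
# Endpoints and header names are placeholders, not real vendor values.

BACKENDS = {
    "langsmith": {"endpoint": "https://otel.example-langsmith.invalid/v1/traces",
                  "headers": {"x-api-key": "<key>"}},
    "langfuse":  {"endpoint": "https://otel.example-langfuse.invalid/v1/traces",
                  "headers": {"authorization": "Basic <key>"}},
    "mlflow":    {"endpoint": "http://localhost:5000/v1/traces",
                  "headers": {}},
}

def otlp_config(backend: str) -> dict:
    """The instrumentation in the app stays identical; only this
    export configuration changes when you migrate backends."""
    if backend not in BACKENDS:
        raise KeyError(f"unknown backend: {backend}")
    return BACKENDS[backend]

print(otlp_config("langfuse")["endpoint"])
```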
For implementation-level guidance, see Building a minimum viable AI observability stack for a small engineering team.
What are the real trade-offs between open-source and SaaS AI observability at SMB scale?
Three constraints shape this decision at smaller scale: limited DevOps capacity, budget sensitivity, and data residency requirements.
DevOps capacity: Self-hosting any observability tool means managing infrastructure — updates, security patches, scaling, uptime. For a team with fewer than two dedicated DevOps engineers, that overhead is not trivial. SaaS tools eliminate it at the cost of a recurring subscription and data leaving your infrastructure.
Budget: SaaS costs are more predictable but accumulate. Langfuse cloud starts at $29/month; LangSmith Plus is $39/seat/month; Braintrust Pro is $249/month. Self-hosting is free in licensing costs, but the infrastructure to run it reliably is not. For many teams, a $200–300/month SaaS subscription costs less than the engineering hours required to operate a self-hosted alternative.
Data residency: This is the hard constraint that overrides the others. If traces contain personal data, sensitive business data, or data subject to regulations like the EU AI Act, SaaS tools that send trace data to vendor infrastructure may not be viable. Self-hosted Langfuse or MLflow is not a preference in these cases — it is the only compliant path.
The 30% rule as a selection test: Roughly 30% of AI project effort should go into post-deployment monitoring and risk management. Apply it this way: if your observability tooling requires so much DevOps overhead that this 30% gets consumed by operations rather than actual monitoring, the tool is not a good fit. That 30% should produce monitoring output — alerts, evaluation scores, drift detection — not infrastructure maintenance logs.
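Applied as arithmetic, the 30% rule gives a concrete pass/fail number. The hour figures and the 50% overhead threshold below are hypothetical inputs for illustration, not published benchmarks:

```python
def thirty_percent_rule(total_project_hours: float,
                        infra_maintenance_hours: float) -> dict:
    """30% of project effort should go to post-deployment monitoring.
    The tooling fails the test if infrastructure upkeep consumes most
    of that budget instead of producing monitoring output."""
    monitoring_budget = 0.30 * total_project_hours
    overhead_share = infra_maintenance_hours / monitoring_budget
    return {
        "monitoring_budget_hours": monitoring_budget,
        "overhead_share": round(overhead_share, 2),
        # Illustrative heuristic: if more than half the budget is
        # infrastructure upkeep, the tool is not a fit for this team.
        "tool_fits": overhead_share <= 0.5,
    }

# A 1,000-hour project has a 300-hour monitoring budget. Self-hosting
# that eats 200h of upkeep fails the test; 40h of upkeep passes.
print(thirty_percent_rule(1000, 200))
print(thirty_percent_rule(1000, 40))
```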
The decision heuristic is straightforward:
- Fewer than two DevOps engineers and no data residency requirement: use SaaS. The operational risk reduction outweighs the cost.
- Data residency requirement (EU AI Act, financial services, healthcare): use self-hosted open-source. Budget for the operational overhead as a fixed infrastructure cost.
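The heuristic above reduces to two inputs. A sketch that encodes it directly — the thresholds are the article's, the function name is mine:

```python
def choose_observability_model(devops_engineers: int,
                               data_residency_required: bool) -> str:
    """Encodes the heuristic: residency requirements override
    everything else; otherwise team capacity decides."""
    if data_residency_required:
        return "self-hosted open-source (budget for operational overhead)"
    if devops_engineers < 2:
        return "SaaS (operational risk reduction outweighs cost)"
    return "either; decide on budget and ecosystem fit"

print(choose_observability_model(1, False))
print(choose_observability_model(1, True))
```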
What should a practical AI platform selection checklist include?
The following checklist translates the control-plane maturity framework into actionable evaluation criteria. Apply it to any AI platform — it is vendor-agnostic.
1. Evaluation Maturity
Does the platform support offline evaluation (pre-production test datasets with automated scoring) and online monitoring (production traffic sampling)? Can you define custom evaluation rubrics? Does the same evaluation framework cover both stages? Strong platforms have built-in evaluators plus a custom evaluator SDK, and use the same scorers on both test data and live traffic.
2. Guardrail Architecture
Are controls applied at input, tool-call, and output stages — or only at the API boundary? Can guardrail policies be updated without redeploying the application? Do the guardrail categories match your use-case risk profile: safety, groundedness, task adherence, sensitive data detection?
3. Governance Controls
Does the platform provide audit logging of all agent decisions, RBAC at the asset level, data lineage tracking, and compliance-readiness signals? Is there a formal risk framework — like DASF — or just a collection of features? A structured risk taxonomy signals that governance is designed in, not bolted on.
4. Observability Depth
Does the platform use OpenTelemetry-compatible tracing (not a proprietary format)? Does the trace capture tool calls as well as LLM calls? Does it support the agent frameworks you actually use? Can you see agents from multiple frameworks in one view?
5. SMB-Appropriate Cost Model
Can you start without enterprise contracts? Does pricing scale linearly or in cliff thresholds? Is self-hosting available without the enterprise tier? How much DevOps does this require from a team with limited platform engineering?
6. The 30% Rule Test
After provisioning the observability tooling, does your team’s monitoring effort go into improving AI quality — or maintaining the observability infrastructure? If the 30% post-deployment monitoring budget is consumed by infrastructure operations, the platform is not viable for your team size.
7. Ecosystem Compatibility
Does the platform align with your existing cloud provider and data stack? Does it support the agent frameworks you use? Can you export traces and models if you switch?
The benchmark theater test: When evaluating any vendor, ask: “How do your benchmark scores correlate with performance on our specific use-case data?” If the vendor responds with general leaderboard position and cannot provide domain-specific evidence, you are looking at benchmark theater. Weight vendor-provided benchmark data accordingly.
Verify OpenTelemetry compatibility as the minimum interoperability signal — it prevents total lock-in and enables correlation with your broader infrastructure metrics regardless of which observability frontend you choose.
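To keep vendor comparisons honest, the seven criteria can be turned into a weighted scoring sheet. The weights and the example scores below are illustrative assumptions, not recommendations — adjust them to your own risk profile:

```python
# Weighted scoring sheet for the seven checklist criteria.
# Weights and example scores are illustrative only.
WEIGHTS = {
    "evaluation_maturity": 0.20,
    "guardrail_architecture": 0.20,
    "governance_controls": 0.15,
    "observability_depth": 0.20,
    "cost_model": 0.10,
    "thirty_percent_rule": 0.10,
    "ecosystem_fit": 0.05,
}

def score_platform(scores: dict) -> float:
    """Weighted sum of per-criterion scores (each on a 0-5 scale)."""
    assert set(scores) == set(WEIGHTS), "score every criterion, skip none"
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 2)

vendor_a = {"evaluation_maturity": 4, "guardrail_architecture": 5,
            "governance_controls": 3, "observability_depth": 4,
            "cost_model": 2, "thirty_percent_rule": 3, "ecosystem_fit": 5}
print(score_platform(vendor_a))
```

A sheet like this forces the benchmark theater question into the open: leaderboard position has no row, so a vendor can only move the score by demonstrating control-plane capability.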
FAQ
What is benchmark theater?
Benchmark theater is selecting AI platforms primarily based on standardised benchmark scores — such as MMLU or HumanEval — that do not reliably predict production performance for your specific use case. Benchmarks measure model capability under controlled conditions, not how the platform handles failures, edge cases, or governance at scale. The practical test: ask any vendor how their scores correlate with performance on your domain data. If they cannot answer, the score is not meaningful for your decision.
Which is better for AI observability: Azure AI Foundry or Databricks?
Neither is universally better. Azure AI Foundry offers a tightly integrated four-pillar control plane with built-in evaluators and Azure Monitor — strongest for teams already in the Azure ecosystem who want a complete, well-documented control plane. Databricks distributes its capabilities across Unity Catalog, Mosaic AI Gateway, MLflow, and DASF — strongest for teams with existing Databricks data platform investment who value governance depth and the agent calibration pattern. The right choice depends on your existing infrastructure, not abstract feature comparisons.
Should a small team choose open-source or SaaS AI observability tools?
Teams with fewer than two dedicated DevOps engineers will generally do better with SaaS tools like LangSmith or Braintrust because managed infrastructure reduces operational risk. If data residency requirements are non-negotiable — EU AI Act compliance, healthcare, financial services — self-hosted open-source options like Langfuse or MLflow are the more practical path, but budget explicitly for the operational overhead. If the observability tooling consumes the 30% post-deployment monitoring budget on infrastructure maintenance rather than monitoring quality, choose the option that shifts that burden to a vendor.
What is the 30% rule in AI risk management?
The 30% rule is the principle that roughly 30% of AI project effort should go into post-deployment monitoring and risk management. It functions as a platform selection test: if the observability tooling requires so much DevOps overhead that this 30% budget is consumed by operations rather than actual monitoring, the platform is not a good fit. That investment should produce monitoring output — evaluation scores, drift alerts, quality metrics — not infrastructure maintenance.
What does control-plane maturity mean for an AI platform?
Control-plane maturity measures how well an AI platform provides structured capabilities across four pillars: controls (guardrails at input, tool-call, and output stages), observability (tracing, evaluation, production monitoring), security (identity management, threat detection, compliance), and fleet-wide operations (unified agent management, cost attribution, policy coverage). The lifecycle test: does the platform provide observability tooling at model selection, pre-production evaluation, and post-production monitoring stages — or does it only provide inference?
What is the Databricks AI Security Framework (DASF) and why does it matter for platform selection?
DASF is Databricks’ formal AI risk taxonomy covering 62 distinct risks across 12 AI system components, organised into security, operational, compliance and ethical, and data risk categories, mapped to 10 industry standards. It signals governance maturity — the vendor has systematised risk management rather than treating it as an afterthought. Evaluate whether any AI platform you consider has an equivalent formal risk taxonomy, or whether governance is just a collection of features without a unifying risk model.
How do I evaluate an AI platform’s control-plane maturity before committing?
Apply the structured checklist: evaluation maturity (offline and online evaluation, custom rubrics), guardrail architecture (input, tool-call, and output stage controls), governance controls (audit logging, RBAC, data lineage, formal risk framework), observability depth (OpenTelemetry support, end-to-end tracing including tool calls), and cost model viability for your team size. Run the 30% rule test, apply the benchmark theater test to vendor claims, and verify OpenTelemetry compatibility as the minimum interoperability signal.
Why do most AI applications fail to reach production?
Most AI applications fail the transition from prototype to production because they lack the observability, guardrails, and governance infrastructure needed to operate reliably under real-world conditions. Benchmark performance in controlled settings does not transfer to production environments where inputs are unpredictable, failures are silent (HTTP 200 with confidently wrong content), and compliance requirements apply. The 69% human verification rate in AI-driven decisions is a measure of how far production reliability lags behind prototype capability.
How does MLflow compare to LangSmith for AI agent observability?
MLflow is open-source, Databricks-native, and provides experiment tracking, model registry, LLM tracing, and evaluation — the default choice for any team already on Databricks. LangSmith is a SaaS product with the deepest native integration for LangChain and LangGraph. The choice is driven by existing ecosystem investment rather than isolated feature comparison. Both use OpenTelemetry-compatible tracing, which reduces switching costs if your ecosystem changes. If neither ecosystem applies, Langfuse provides a framework-agnostic, fully self-hostable open-source alternative.
What should I look for in AI platform governance capabilities?
Look for audit logging of all agent decisions and tool calls, RBAC at the asset level, data lineage tracking, compliance-readiness signals, and a formal risk framework rather than an ad hoc feature list. The absence of formal governance documentation is itself a signal: governance was added as an afterthought, not a foundational platform property.
What is the minimum viable control plane for a small AI team?
At minimum, you need input/output guardrails on your agents, end-to-end tracing that captures tool calls (not just LLM calls), offline evaluation against a test dataset before each deployment, and production traffic sampling with alerts when quality drops. RBAC and audit logging are worth adding early even if your team is small — they are much harder to retrofit than to build in from the start.
What comes next
Platform selection is a prerequisite, not a destination. Once you have chosen a platform on control-plane maturity, the implementation work begins: building the minimum viable observability stack, instrumenting agents for tracing, establishing evaluation baselines, and configuring guardrail policies for your specific risk profile.
For the implementation-level guidance that follows platform selection, read Building a minimum viable AI observability stack for a small engineering team. For the full cluster context, the AI observability and guardrails platform guide is the starting point.
The organisations closing the human verification gap are the ones that selected platforms on control-plane maturity and then invested the 30% post-deployment monitoring effort to build production reliability over time. The framework above is how you make that selection decision.