Here is a number worth sitting with: 82% of executives in Gravitee's 2026 State of AI Agent Security report feel confident their existing policies protect against unauthorised agent actions. Only 47.1% of those same organisations actively monitor their AI agents at runtime. Only 14.4% report that all their agents went live with full security or IT approval.
That is the Confidence Paradox. Your governance policy describes what your AI agents are supposed to do. Runtime observability tells you what they actually do. Most organisations have written the policy. Most have not built the infrastructure.
As AI agents move from pilot projects into operational infrastructure, the absence of runtime observability is not a tooling maturity problem. It is a governance problem.
This article covers the technical layer of the broader AI governance gap observability helps close: behavioral drift, what AI observability infrastructure actually includes, how observability-driven sandboxing works, and how to design human oversight into an automated system at scale.
Why Is Runtime Governance Different From Policy Governance?
A governance policy is written before deployment. It describes intended behaviour, sets out permitted actions, and establishes accountability. It is a statement of intent — and it has no way of detecting when agent behaviour diverges from that intent at execution time.
Here is the problem. AI agents fail differently from traditional software. A broken API call throws an exception. An agent reasoning failure produces confident, plausible output that is completely wrong — no error, no alert, no log entry unless you have built the infrastructure to generate one. The most expensive failures are silent errors amplified through multi-agent pipelines before anyone notices.
Traditional software governance assumes auditing the code is sufficient. That assumption does not transfer to agentic systems. Agents make decisions at runtime that were never explicitly coded. The Gravitee Confidence Paradox makes this concrete: of organisations in active testing or production, more than half have agents running without any security oversight or logging, and 88% reported a confirmed or suspected AI agent security incident in the last year.
Static AI policy documents describe intended behaviour. Continuous AI governance monitors and enforces actual behaviour in production. Both are required; neither replaces the other. If you have a CI/CD pipeline but no observability for your AI agents, you have automated the build but not the governance. Building enterprise AI governance means addressing both layers.
What Is Behavioral Drift and Why Does It Make AI Harder to Govern Than Traditional Software?
Behavioral drift is the progressive degradation of an AI agent’s decision patterns, tool usage, reasoning pathways, and inter-agent coordination over time in production — without any code change, parameter update, or explicit human action.
This is different from model drift, which is about training data distribution shift. Behavioral drift occurs within a deployed, unchanged model. The weights have not changed. The behaviour in production has.
Research on multi-agent LLM systems identified three distinct manifestations:

- Semantic Drift: progressive deviation from original intent as context accumulates.
- Coordination Drift: breakdown in consensus mechanisms over extended interaction sequences.
- Behavioral Drift (in the narrow sense): the emergence of unintended shortcuts.
The numbers are striking. Detectable drift emerged after a median of 73 interactions. Task success rates dropped from 87.3% to 50.6% — a 42% degradation. Human intervention requirements increased 3.2x. These are simulation results, but the directional signal is clear: drift is early-onset and its governance costs compound.
Model version changes are a primary but underacknowledged trigger. Every model upgrade or provider switch should be treated as a re-baselining event — re-establish the baseline and monitor closely in the first production window.
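One way to operationalise the re-baselining rule is a small check at the start of each monitoring cycle. This is a minimal sketch, not any particular tool's API; every name in it (`check_model_version`, the `state` keys) is illustrative:

```python
# Sketch: treat a model version or provider change as a re-baselining event.
# All field names here are illustrative, not from any observability product.

def check_model_version(observed_version: str, state: dict) -> dict:
    """Reset the behavioral baseline when the serving model changes."""
    if state.get("model_version") != observed_version:
        state = {
            "model_version": observed_version,
            "baseline": [],            # discard the old reference window
            "baseline_complete": False,
            "alert": f"model changed to {observed_version}; re-baselining",
        }
    return state

state = {"model_version": "provider-model-v1",
         "baseline": [0.91, 0.88], "baseline_complete": True}
state = check_model_version("provider-model-v2", state)
# Monitoring now re-enters the baselining phase with an empty reference window.
```

The point of the sketch is the ordering: the version check fires before any drift comparison, so the first production window after an upgrade is never scored against a baseline the new model never produced.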
The Agent Stability Index (ASI) is the measurement framework for this. It is a 12-dimensional composite tracking Response Consistency, Tool Usage Patterns, Inter-Agent Coordination, and Behavioral Boundaries over rolling 50-interaction windows. Drift is flagged when ASI drops below 0.75 for three consecutive windows. The rate of decline increases nearly 2.5x between the 0–100 interaction window and the 300–400 window — drift accelerates, making late-stage governance harder.
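The flagging rule can be sketched in a few lines. This implements only the thresholding logic described above; the twelve per-window dimension scores are assumed to come from an upstream evaluation pipeline, and the equal-weight averaging is my assumption, not part of the published framework:

```python
ASI_THRESHOLD = 0.75      # drift flagged below this score...
CONSECUTIVE_WINDOWS = 3   # ...for this many consecutive windows
WINDOW_SIZE = 50          # interactions per rolling window

def asi_score(dimension_scores: list[float]) -> float:
    """Composite ASI for one window. Equal weighting is an assumption."""
    return sum(dimension_scores) / len(dimension_scores)

def drift_flagged(window_scores: list[float]) -> bool:
    """Flag drift when ASI stays below threshold for 3 consecutive windows."""
    below = 0
    for score in window_scores:
        below = below + 1 if score < ASI_THRESHOLD else 0
        if below >= CONSECUTIVE_WINDOWS:
            return True
    return False

# Healthy start, then three consecutive degraded windows: drift is flagged.
print(drift_flagged([0.91, 0.88, 0.74, 0.72, 0.69]))  # True
```

Note that the counter resets on any recovery window, so a single transient dip does not page anyone; only sustained degradation does.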
Behavioral drift is undetectable without continuous monitoring. Code review, deployment testing, and periodic audits catch nothing here.
What Does AI Observability Infrastructure Actually Include?
That is the gap AI observability infrastructure is designed to close.
AI observability is not the same as APM. APM traces API latency, error rates, and uptime. AI observability traces reasoning chains, decision steps, tool call sequences, inter-agent messages, and policy decisions. As LangChain's 2026 survey puts it: "In software, the code documents the app. In AI, the traces do."
There are five components of production-grade AI observability infrastructure.
- Execution Tracing: recording each agent execution step as structured OpenTelemetry spans — think of traces as the call stack for an AI system.
- Evaluation Pipelines: automated continuous assessment of outputs against quality, safety, and behavioral criteria. Only 37.3% of organisations currently run online evaluations of live agents — this is the core continuous governance capability most haven’t built.
- Behavioral Baseline Establishment: recording the first N production interactions as the reference for expected behaviour. Without a baseline, drift is undetectable.
- Alerting: notification systems triggered when drift signals, policy violations, or anomalous tool use exceed thresholds. This converts passive recording into active governance.
- Visualisation: dashboards and trace explorers that make agent execution inspectable without raw log access.
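The first two components can be sketched together: each agent step recorded as a structured span, with an online evaluation verdict attached before the span is shipped. This pure-Python sketch mimics the shape of an OpenTelemetry span rather than using the SDK, and every field name and the approved-tool set are illustrative assumptions:

```python
import time
import uuid

def record_span(trace_id: str, name: str, attributes: dict) -> dict:
    """Emit one structured record for an agent execution step.
    In production this would be an OpenTelemetry span; here it is a
    plain dict with the same shape (ids, timestamp, attributes)."""
    return {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "name": name,
        "start_ns": time.time_ns(),
        "attributes": attributes,
    }

def evaluate_span(span: dict) -> dict:
    """Toy online evaluation: flag tool calls outside an approved set."""
    approved_tools = {"search_docs", "summarise"}  # illustrative allowlist
    tool = span["attributes"].get("tool.name")
    span["attributes"]["eval.verdict"] = (
        "pass" if tool in approved_tools else "flag:unapproved_tool"
    )
    return span

trace_id = uuid.uuid4().hex
span = record_span(trace_id, "agent.tool_call",
                   {"tool.name": "send_email", "agent.id": "billing-agent"})
span = evaluate_span(span)
print(span["attributes"]["eval.verdict"])  # flag:unapproved_tool
```

The design point: evaluation runs on the span, not on the raw model output, so the same record that feeds the dashboard also carries the governance verdict.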
Arize Phoenix is the recommended open-source entry point — more detail in the tooling section below.
The operational cycle runs like this: trace collection → behavioral analysis → drift detection → alert generation → intervention → re-baselining. This is what governance execution requirements look like in practice.
For organisations with scale or compliance requirements, the commercial platform tier includes Arize AX (enterprise, production-scale evaluation and compliance reporting), Fiddler AI (lifecycle focus), DataRobot (unified AI development and governance platform), and Braintrust (evaluation-first, 25+ built-in scorers).
What Is Observability-Driven Sandboxing and How Does It Work?
Monitoring tells you what happened. Observability-driven sandboxing prevents it from happening. The sandbox sits between inference and side effects: the agent plans actions, but execution is gated by explicit policy checks.
Each tool invocation is treated as a capability request. Here is the mechanism:
- The agent plans an action — “write to this file path” or “call this external host”
- The request is intercepted and evaluated against Policy-as-Code definitions
- The allow/deny decision is emitted as an OpenTelemetry span — traceable and auditable
- If denied, the agent receives a policy-violation signal recorded in the trace
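The four steps above can be sketched as a gate between the agent's plan and execution. This is a minimal illustration, not a real sandbox: the policy shapes, action dicts, and audit records are all hypothetical stand-ins:

```python
def check_policy(action: dict, policies: dict) -> tuple[bool, str]:
    """Evaluate a planned action against Policy-as-Code rules."""
    if action["type"] == "network":
        allowed = action["host"] in policies["allowed_hosts"]
        return allowed, f"network:{action['host']}"
    if action["type"] == "file_write":
        allowed = action["path"].startswith(policies["workspace"])
        return allowed, f"file_write:{action['path']}"
    return False, f"unknown:{action['type']}"  # unrecognised actions fail closed

def gate(action: dict, policies: dict, spans: list) -> bool:
    """Intercept a tool invocation; emit the decision as an audit record."""
    allowed, detail = check_policy(action, policies)
    spans.append({"name": "policy.decision", "allowed": allowed, "detail": detail})
    if not allowed:
        # The agent receives a policy-violation signal, not a silent failure.
        spans.append({"name": "policy.violation", "detail": detail})
    return allowed

policies = {"workspace": "/workspace/agent-1/", "allowed_hosts": {"api.internal"}}
spans: list = []
gate({"type": "network", "host": "exfil.example.com"}, policies, spans)       # denied
gate({"type": "file_write", "path": "/workspace/agent-1/out.txt"}, policies, spans)  # allowed
```

Both outcomes land in the audit trail: the allow decision is recorded just as the denial is, which is what makes the trace usable for compliance review rather than only for incident response.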
Three policy classes cover the primary agent execution risk surface.
- Workspace Enforcement: confines file operations to a designated directory. In practice, production workspaces routinely contain environment configuration files with active credentials — without workspace enforcement, an agent can read them simply because they are present on disk.
- Network Allowlisting: restricts external connections to pre-approved hosts. This directly addresses the OWASP Agentic Security Initiative’s data exfiltration risk category.
- Write Control: requires write operations to be explicitly versioned or approved before execution.
Policy-as-Code is the pattern engineers familiar with infrastructure-as-code will recognise immediately. Governance rules defined in code are version-controllable, auditable, and reviewable in a pull request.
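A minimal illustration of the pattern: the policy is declared as data that lives in version control, and a small function enforces it. The JSON keys are invented for this sketch; the path check resolves symlinks and `..` segments before comparing, because a naive prefix match on the raw string is trivially defeated by traversal:

```python
import json
import os

# Policy declared as data: reviewable in a pull request like any other code.
POLICY_JSON = """
{
  "workspace": "/workspace/agent-1",
  "allowed_hosts": ["api.internal.example"],
  "writes_require_approval": true
}
"""
policy = json.loads(POLICY_JSON)

def path_permitted(requested: str, workspace: str) -> bool:
    """Workspace enforcement that survives '..' traversal and symlinks."""
    resolved = os.path.realpath(requested)
    root = os.path.realpath(workspace)
    return resolved == root or resolved.startswith(root + os.sep)

print(path_permitted("/workspace/agent-1/out/report.md", policy["workspace"]))
print(path_permitted("/workspace/agent-1/../../etc/passwd", policy["workspace"]))
```

The second call is denied even though the requested path starts with the workspace string, which is exactly the class of bypass a raw `startswith` check would miss.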
For more on technical infrastructure for detecting unauthorised tool use, see the companion guide on detecting shadow AI and creating sanctioned pathways.
When Should AI Governance Be Automated and When Does It Require Human Review?
Sandboxing handles the preventive layer — but some decisions need a human in the loop, not just a policy check. The design question is which actions get which treatment.
The scale argument against defaulting to human approval everywhere is now empirical. McKinsey operates 25,000 AI agents for 45,000 employees. NVIDIA’s Jensen Huang projects roughly 100 AI agents per employee. At that density, human approval at every decision point is operationally impossible.
Here is how to think about the design decision.
Human-in-the-Loop (HITL) pauses agent execution for human approval. Use it for irreversible, high-liability, or external-facing actions — financial transactions, external communications on behalf of the organisation, sensitive customer data access.
Human-on-the-Loop (HOTL) lets the agent continue while humans monitor via dashboards and receive drift alerts. Use it for reversible, lower-risk actions within a sandboxed scope — file operations, internal data retrieval, draft generation with downstream human review.
HOTL is the scalable supervision model. Humans monitor running agents, receive ASI-driven alerts, and intervene when signals warrant — governance at scale without blocking throughput.
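One way to encode the HITL/HOTL split is a risk-tier router in front of the action executor. The action categories and return values below are illustrative examples, not a standard taxonomy:

```python
# Illustrative risk-tier router: HITL pauses for approval, HOTL proceeds
# under monitoring. The action sets are examples, not a standard.

HITL_ACTIONS = {"financial_transaction", "external_email", "customer_pii_access"}
HOTL_ACTIONS = {"file_write", "internal_query", "draft_generation"}

def route(action_type: str) -> str:
    if action_type in HITL_ACTIONS:
        return "pause_for_approval"   # human-in-the-loop checkpoint
    if action_type in HOTL_ACTIONS:
        return "execute_and_monitor"  # human-on-the-loop: alert-driven oversight
    return "deny"                     # unknown actions fail closed

print(route("external_email"))    # pause_for_approval
print(route("draft_generation"))  # execute_and_monitor
print(route("rm_rf"))             # deny
```

The fail-closed default on unrecognised actions is the governance-relevant choice here: anything not explicitly tiered is blocked rather than silently routed to the low-friction path.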
Implementation options for HITL checkpoints: LangGraph (built-in HITL support), CrewAI, HumanLayer (purpose-built approval workflows), Permit.io (fine-grained access control).
Stop Authority — the principle that specific people have explicit authority to halt AI systems — needs technical infrastructure to back it up. That infrastructure is a HITL checkpoint backed by observability dashboards. For more on implementing stop authority through observability infrastructure, see the guide on assigning accountability for enterprise AI.
What Does the AI Observability Tools Market Look Like in 2026?
The market has two tiers: open-source and free-tier tooling for mid-market companies without dedicated MLOps teams, and enterprise commercial platforms for organisations with scale or compliance requirements.
Open-source entry point — Arize Phoenix: zero cost, no budget approval required. It integrates with NVIDIA NeMo Agent Toolkit, LangGraph, and CrewAI. DeepLearning.ai's course uses Phoenix as the standard observability layer — that is an enterprise credibility signal worth noting. A 50-person company can implement foundational observability — tracing, baseline, alerting — without specialised staffing.
Commercial platform tier:
- Arize AX: enterprise extension of Phoenix. Production-scale evaluation, alerting, and compliance reporting.
- Fiddler AI: lifecycle focus. Single unified workflow from development to millions of production interactions. Trust Models run within your environment.
- DataRobot: unified AI development and governance platform with continuous drift monitoring built in.
- Braintrust: evaluation-first, 25+ built-in scorers.
One thing worth being clear on: general APM tools cannot do this job. Datadog, New Relic, and Dynatrace monitor AI systems at the infrastructure layer — latency, throughput, error rates. They cannot trace agent reasoning chains or detect behavioral drift. That category distinction matters when you are doing procurement.
If you are a team without a dedicated MLOps function, here is your evaluation checklist:
- Agent-level tracing (reasoning chains, tool call sequences)
- OpenTelemetry support — portability and auditability
- Evaluation pipeline support — automated assessment, not just logging
- Behavioral baseline and drift detection
- HITL annotation support
- OWASP Agentic Security alignment
As agent density increases, observability infrastructure becomes the de facto governance layer. The EU AI Act's requirements for high-risk AI systems — documentation, logging, human oversight — are directly addressed by execution traces, continuous monitoring, and HITL checkpoints. The organisations investing in this now are building the foundation that compliance requirements will eventually mandate for everyone. For a complete view of the broader AI governance gap observability helps close — from shadow AI through to measurement — see the series overview.
For a guide to the governance metrics that observability infrastructure enables, see the companion article on measuring whether AI governance is working beyond usage counts.
FAQ
What is the difference between AI observability and traditional application monitoring?
APM traces infrastructure metrics: latency, error rates, resource utilisation. AI observability traces what agents actually do — reasoning chains, tool selections, inter-agent interactions, and policy conformance. APM monitors performance; AI observability monitors behaviour.
What causes behavioral drift in AI agents?
Behavioral drift occurs when an agent’s decision patterns deviate from baseline without any code change. Primary causes: context accumulation (conversation histories shift response patterns), model version changes (upgrading the model can alter behaviour even when configuration is unchanged), and multi-agent coordination degradation.
What is continuous AI governance and how is it different from a static AI policy?
A static AI policy describes intended behaviour; it has no mechanism for enforcing it during execution. Continuous AI governance runs monitoring, evaluation, drift detection, and enforcement infrastructure permanently in production. Governance as ongoing infrastructure, not a compliance checkbox.
How does observability-driven sandboxing prevent unauthorised agent actions?
Observability-driven sandboxing intercepts agent tool calls before execution and enforces an allow/deny decision before any side effect occurs. The enforcement decision is emitted as an OpenTelemetry trace span — making it auditable. Denied actions generate a policy-violation signal rather than a silent failure.
When should I use human-in-the-loop versus human-on-the-loop governance?
HITL pauses execution for human approval — use it for irreversible, high-liability, or external-facing actions. HOTL allows agents to continue while humans monitor and can intervene — use it for reversible, lower-risk actions within a sandboxed scope. Most production architectures use HOTL as the default and HITL at specifically designated high-risk checkpoints.
What is the Agent Stability Index and how is it used?
The ASI is a 12-dimensional composite metric tracking Response Consistency, Tool Usage Patterns, Inter-Agent Coordination, and Operational Efficiency over rolling 50-interaction windows. Drift is flagged when ASI drops below 0.75 for three consecutive windows.
Can small companies implement AI observability without a dedicated MLOps team?
Yes. Arize Phoenix is open-source and free. It integrates with LangGraph and CrewAI and provides tracing, evaluation pipelines, and drift visualisation out of the box — no specialised staffing required.
What does the EU AI Act require in terms of runtime AI governance?
For high-risk AI systems, the EU AI Act requires technical documentation, logging, and human oversight mechanisms. Runtime observability infrastructure addresses the logging and oversight requirements directly — execution traces provide the documentation artefacts, and HITL checkpoints implement the required human oversight.
What happens to behavioral drift when you update your AI model or switch providers?
Model version changes are a primary drift trigger. Agent behaviour can shift substantially even when code and configuration are unchanged — the underlying model’s response patterns have changed. Treat every model version change as a re-baselining event.
What is Policy-as-Code in the context of AI governance?
Policy-as-Code defines sandbox rules as executable code — making enforcement deterministic, version-controllable, and auditable. The pattern will be familiar to anyone who has worked with infrastructure-as-code. In AI governance it typically covers: Workspace Enforcement (file path access), Network Allowlisting (approved external hosts), and Write Controls (operations requiring approval).
What are the OWASP Agentic Security top risks and how does observability address them?
The OWASP Agentic Security top-10 includes prompt injection, unauthorised tool invocation, data exfiltration, and uncontrolled autonomy. Sandboxing addresses tool invocation and exfiltration by intercepting tool calls before execution. Execution tracing and evaluation pipelines cover prompt injection and autonomy violations.
How do I know if our AI governance programme is actually working?
Check whether your observability infrastructure is generating actionable data. Are you collecting execution traces? Do you have a behavioral baseline? Are drift alerts firing? Are policy violations intercepted by sandboxing rather than discovered post-incident? If the answer to any of these is “no,” the governance programme exists on paper but not in infrastructure.