Your AI agents pass every test in development. You deploy to production. Three weeks later, a subset of users is getting responses that are technically valid but factually wrong — and you have no idea when it started.
Traditional software testing tells you whether the code runs correctly. AI systems need something more: a way to assess whether the outputs are actually good. That is the gap evaluation loops are designed to close.
AI evaluation loops have two phases. First, pre-deployment testing against known-good reference cases — that is offline evaluation. Second, continuous quality scoring of real user interactions in production — that is online evaluation. Together they form a closed feedback loop that catches regressions before users encounter them and picks up novel failure modes after deployment.
The monday.com AI team showed what this looks like at production scale. By building an evals-driven development framework with LangSmith, they compressed their evaluation feedback loop from 162 seconds per iteration down to 18 seconds — an 8.7x improvement. This article is part of our broader guide to AI observability and guardrails, which covers the full platform architecture from data collection through to governance.
What is the difference between AI testing and AI evaluation?
Unit tests check whether your code works. AI evaluation checks whether your AI produces good outputs. These are complementary concerns, and teams that rely on unit tests alone for LLM applications are missing the output quality dimension entirely.
Traditional software testing was built for deterministic systems. Given input X, the function returns output Y. Pass or fail. That works because the relationship between inputs and outputs is fixed.
AI systems are probabilistic. The same prompt can produce different outputs across identical runs. Outputs vary based on context, phrasing, model temperature, and emergent behaviour that no static test case anticipates. AI testing confirms your API calls work, your response schemas validate, and your agent does not crash. That is necessary — but it is not sufficient. It does not tell you whether the responses are accurate, grounded in your knowledge base, or free from hallucinations.
AI evaluation fills that gap. It assesses output quality across correctness, groundedness, relevance, and safety. The evaluation infrastructure uses two types of graders: deterministic graders for binary, unambiguous checks like JSON validity, keyword presence, and format compliance, and LLM-as-a-judge scorers for subjective quality dimensions where “correct” is a spectrum.
The monday.com team’s Group Tech Lead, Gal Ben Arieh, put it plainly: “Many teams treat evaluation as a last-mile check, but we made it a Day 0 requirement.” That shift — from evaluation as a QA afterthought to a first-class engineering discipline — is what separates teams that catch quality problems early from teams that find out through user complaints.
Regression evaluations sit at the intersection of the two approaches. They use the evaluation scoring framework but serve the same function as regression tests in traditional software: detecting when a change degrades performance on cases that previously passed.
What is offline evaluation and why is it the safety net that catches regressions before production?
Offline evaluation is pre-deployment testing of your AI system against a curated dataset of reference input-output pairs. It runs before any change reaches production — on every pull request that touches a prompt template, model configuration, or agent logic.
The mechanism is straightforward. Run the AI system against the golden dataset. Score outputs using your defined graders — deterministic checks first, LLM judges for the quality dimensions. Compare scores against baseline thresholds. Block deployment if scores fall below the minimums.
Here is what offline evaluation catches that integration tests miss: the prompt regression. This is the failure mode where optimising one dimension of output quality silently degrades another. You improve the VPN resolution flow. The agent now handles VPN tickets better, but your prompt change inadvertently affected how it handles Access & Identity requests — and none of your integration tests cover that interaction. Offline evaluation runs the full golden dataset on every change and flags the degradation before it reaches users.
The monday.com team’s approach illustrates the practical starting point: two tiers — deterministic smoke checks covering runtime health, output shape, and basic tool sanity, and LLM-as-judge correctness scoring for the dimensions that matter to users. The smoke checks are cheap and fast. The LLM judges handle the quality assessment.
One important caveat: offline evaluation only tests against scenarios you have anticipated and captured in your dataset. It cannot detect failure modes that emerge from real-world input distributions. That is not a flaw — it is a design constraint that defines the boundary between the offline and online phases.
What is a golden dataset and how do you build one for AI evaluation?
A golden dataset is a curated collection of reference input-output pairs that represents either known-good outputs or the quality criteria your system should meet. The quality and coverage of your golden dataset directly determines how much protection your offline evaluation actually provides.
The most common blocker is overengineering the dataset question. Teams assume they need hundreds of carefully annotated examples before evaluation is meaningful. The monday.com experience refutes this directly. They started with approximately 30 real, sanitised resolved IT tickets covering the most common request categories. Their team’s assessment: “The challenge wasn’t designing a perfect coverage strategy — it was simply picking a practical starting point.”
Start with 20 to 50 representative input-output pairs. Seed the dataset from three sources:
- Real production cases — sanitised historical examples of requests the system handled well
- Known failure modes and edge cases — scenarios where the system has previously struggled
- Adversarial inputs that probe common AI failure modes, particularly hallucination triggers
Each entry needs three elements: the input, the expected output or quality criteria, and annotations indicating which quality dimensions to evaluate for that input. Not every entry needs to be scored on every dimension.
The golden dataset is a living engineering artefact, not a static test fixture. It grows over time through a specific feedback mechanism: when online evaluation surfaces a novel failure mode from production traffic, that case gets added to the golden dataset. This is what makes the evaluation architecture self-improving — each production failure discovered in the online phase becomes permanent regression coverage in the offline phase.
Version the dataset alongside your code. Check it into source control. Treat dataset changes with the same engineering rigour as prompt changes.
What is online evaluation and how does it detect quality degradation in live AI traffic?
Online evaluation runs continuously against live production traces. It applies the same quality scoring framework used in offline evaluation to real user interactions, in near-real time, at production volume. Where offline evaluation is a static snapshot of anticipated scenarios, online evaluation captures the actual distribution of inputs your users send — including scenarios no curated dataset ever anticipated.
This is the failure mode online evaluation is specifically designed to catch: gradual quality degradation. A prompt that performs well on your golden dataset may drift in production as user behaviour evolves, as the knowledge base it queries changes, or as edge cases accumulate that were never represented in your pre-deployment tests. Offline evaluation cannot detect this drift. Online evaluation catches it by scoring against live traffic continuously.
The mechanism connects directly to your observability infrastructure. Distributed tracing as the raw material for evaluation captures full execution traces — inputs, intermediate reasoning steps, tool calls, outputs, latency, and cost — and those traces become the input that online evaluation scores. MLflow‘s OpenTelemetry integration stores traces as Delta tables, creating analytics-ready data that downstream evaluation pipelines can process immediately.
Monday.com implemented online evaluation using LangSmith’s Multi-Turn Evaluator. Rather than scoring individual turns in isolation, it assesses the full conversation trajectory — measuring outcomes like user satisfaction, tone, and goal resolution across the entire session. This matters for agentic systems: an agent that reaches the right answer through an inefficient reasoning path may pass output-only scoring but fail trajectory evaluation.
The two-phase architecture is the core insight. Offline evaluation prevents known regressions from deploying. Online evaluation discovers unknown failure modes in production. Neither alone is sufficient.
How did the monday.com team achieve 8.7x faster evaluation feedback loops with LangSmith?
The monday.com AI team built their internal AI service workforce on a LangGraph-based ReAct agent architecture. Their evaluation feedback problem was concrete: sequential evaluation runs on their dataset of 20 sanitised IT tickets took 162 seconds per iteration. At that speed, developers faced a clear trade-off — thorough evaluation or fast iteration. Pick one.
The solution was parallelisation at two levels using LangSmith’s Vitest integration. They used Vitest’s pool:'forks' configuration to distribute workload across multiple CPU cores, and ls.describe.concurrent to overlap LLM evaluation latency within each test file. The results: sequential baseline at 162.35 seconds, concurrent-only at 39.30 seconds (4.1x faster), and parallel plus concurrent at 18.60 seconds — that is the 8.7x improvement.
The methodology they built around this infrastructure is evals-driven development (EDD). The analogy to test-driven development is intentional. In TDD, you write the test before the code and use test results to drive implementation decisions. In EDD, you write the evaluation before the prompt change and use evaluation scores to drive every prompt edit, model swap, and architecture decision.
Their scorer architecture combined off-the-shelf and custom components. For baseline quality, they used OpenEvals correctness scorers straight out of the box — which shows that the starting investment is lower than most teams assume. For multi-step agent quality, AgentEvals Trajectory LLM-as-judge evaluates the full sequence of agent actions, not just the final output.
The evaluations-as-code implementation is what made the infrastructure sustainable. Monday.com defined judges as structured TypeScript objects subject to the same version control and peer review standards as production code. Their yarn eval deploy CLI command runs in the CI/CD pipeline on every PR merge: syncing prompts, reconciling evaluation definitions, and pruning “zombie” evaluations no longer present in the codebase.
What is LLM-as-a-judge and how reliable is it for automated quality scoring?
At production volume, manually reviewing AI outputs is not feasible. LLM-as-a-judge resolves this by automating quality scoring: use a capable language model to assess the outputs of another model against defined quality criteria, without requiring human review of every interaction.
The mechanism is simple enough. The judge model receives the original user input, the AI system’s output, and a scoring rubric. It produces a quality score with reasoning — so engineers can understand not just that a response scored poorly, but why. Scoring can be binary, categorical, or continuous depending on what the evaluation criterion requires.
Start with built-in judges for rapid coverage — these are research-backed metrics for safety, correctness, and groundedness that require no configuration. Build custom LLM judges as domain-specific needs emerge. Create custom code-based scorers for deterministic business logic where binary checks are faster and more reliable than asking a language model to decide.
LLM judges have known biases you need to manage. Verbosity bias causes longer responses to score higher independent of quality. Position bias creates preferences for certain orderings. Self-preference bias means models score outputs from similar models more favourably. The way to manage this is calibration: periodically compare LLM judge scores against human reviewer scores on a shared sample to detect systematic drift. When you change the judge model or update the scoring rubric, calibrate before trusting the new configuration.
The practical guideline: treat LLM judge scores as quality signals, not ground truth. They are reliable enough to scale evaluation beyond what human review can cover, and their known biases are manageable. Use deterministic graders for everything they can handle — binary checks are cheaper, faster, and more reliable. Reserve LLM judges for the subjective quality dimensions where telling a good response from a mediocre one requires natural language understanding.
How do you integrate AI evaluation into a CI/CD pipeline using quality gates?
Quality gates transform evaluation from a periodic audit into a continuous engineering control. The principle is the same one you already apply to application code: automated thresholds that block deployment when quality falls below defined minimums. Extend it to AI quality dimensions and you get the same protection for output quality that failing unit tests provide for code correctness.
The implementation pattern is this: on every pull request that touches a prompt template, model configuration, or agent logic, the CI pipeline triggers the offline evaluation suite against the golden dataset. The infrastructure scores outputs, compares results against baseline thresholds in version-controlled configuration, and blocks the merge if scores regress. Engineers see exactly which cases degraded and by how much before any decision to override the gate.
LangSmith’s Vitest integration logs every CI run as a distinct experiment. Braintrust provides a native GitHub Action that gates releases on evaluation results. Both implement the same principle: evaluation results gate deployment, not just inform it.
The CI/CD synchronisation is where evaluations as code becomes operational. The monday.com yarn eval deploy Reconciliation Loop runs on every PR merge and ensures the production evaluation infrastructure always reflects the repository state. Without this synchronisation, evaluation configurations drift from the code they are supposed to evaluate — and that creates false confidence in stale quality signals.
The eval-to-guardrail connection is the final element. When evaluation consistently flags a quality dimension — elevated hallucination rates on a specific input category, policy violations on a particular request type — those findings should trigger updates to runtime guardrail policies. Evaluation measures where quality is failing; guardrails enforce the constraints that prevent those failures from reaching users. For a detailed treatment of how evaluations feed guardrail policy, see the article on the AI guardrails spectrum.
The maturity of your evaluation architecture is also a signal you should use when selecting an AI platform. For a structured approach to evaluation architecture maturity as a selection criterion, see the platform selection guide, which covers how evaluation capability compares across the major platform options.
FAQ
Can unit tests replace evaluation for LLM applications?
No. Unit tests verify deterministic code paths — given input X, expect output Y. LLM applications produce probabilistic outputs that vary across runs. Unit tests confirm your API calls work and schemas validate; evaluation assesses whether the outputs are actually good. You need both: unit tests for code correctness, evaluation for output quality.
How many examples do I need in a golden dataset to get started?
Start with 20 to 50 representative input-output pairs. The monday.com team started with approximately 30 real, sanitised resolved IT tickets and it was sufficient to catch meaningful regressions. Grow the dataset iteratively as online evaluation surfaces new failure modes. A small, well-curated dataset that runs automatically is worth more than a comprehensive dataset that never gets built.
What is LLM-as-a-judge and is it reliable?
LLM-as-a-judge uses a capable language model to score another model’s outputs against defined quality criteria. It is reliable enough to scale evaluation beyond human review capacity, but it has known biases — verbosity, position, and self-preference — that require periodic calibration against human scores to manage. Treat LLM judge scores as quality signals, not absolute ground truth.
What is evals-driven development and how does it differ from test-driven development?
Evals-driven development (EDD) is an engineering methodology where evaluation results — not intuition or manual spot-checks — drive all prompt changes and model updates. The analogy to test-driven development is intentional: write the evaluation before the change, use evaluation scores to drive implementation decisions, and treat a failing evaluation as a blocking signal. The difference is that evals assess probabilistic quality across distributions, not deterministic pass or fail on fixed inputs.
How often should I calibrate an LLM judge against human reviewers?
Calibrate whenever you change the judge model, update the scoring rubric, or observe unexpected score distributions. For most teams, a monthly calibration cycle on a sample of 50 to 100 scored outputs is a practical starting point. If you swap judge models or change scoring criteria significantly, calibrate immediately before trusting the new configuration.
What is the difference between offline evaluation and online monitoring for AI?
Offline evaluation tests against a curated golden dataset before deployment — it catches known regressions. Online monitoring scores live production traffic continuously — it discovers unknown failure modes and quality drift. Neither alone is sufficient. Offline evaluation prevents regressions from deploying; online monitoring detects problems that no pre-built dataset anticipated and feeds new failure patterns back into the golden dataset.
What does “evaluations as code” mean in practice?
Evaluations as code (EaC) means treating eval definitions, grader configurations, dataset references, and quality thresholds as version-controlled source code artefacts — checked into your repository, subject to pull request reviews, and executed automatically via CI/CD. The monday.com implementation defined judges as structured TypeScript objects with a CLI command that synchronises evaluation infrastructure with the repository on every PR merge. This prevents eval logic from becoming tribal knowledge.
How do quality gates work in a CI/CD pipeline for AI applications?
Quality gates are automated thresholds that block deployment if evaluation scores fall below defined minimums. On every pull request touching prompts or agent logic, the CI pipeline triggers the offline evaluation suite, scores outputs against the golden dataset, and compares results to baseline thresholds. If scores regress, the merge is blocked. The key requirement is that quality gate configuration lives in version-controlled code alongside eval definitions — configuration that lives outside the repository drifts out of sync with the system it governs.
Do I need a dedicated ML ops team to implement AI evaluation loops?
No. The monday.com case shows that a development team using off-the-shelf tools can implement a production-grade evaluation loop without dedicated ML ops staff. They used LangSmith’s Vitest integration, OpenEvals off-the-shelf correctness scorers, and AgentEvals trajectory evaluation — standard tooling that requires no specialised ML operations expertise. Start with a minimum viable setup: 20 to 50 golden dataset examples, one or two automated scorers, and a CI integration. Expand the coverage as the system matures.
How do evaluation findings connect to guardrail policies?
When evaluation consistently flags a quality dimension — elevated hallucination rates on a specific input category, policy violations on a request type, degraded safety scores — those findings should trigger updates to runtime guardrail policies. Evaluation measures where quality is failing; guardrails enforce the constraints that prevent those failures from reaching users. The evaluation pipeline analyses trends, identifies systemic patterns, and informs which constraints to tighten or adjust. For a detailed treatment of the eval-to-guardrail lifecycle connection, see the article on guardrails implementation and the AI observability and guardrails platform guide.