Feb 25, 2026

How to Build an AI Evaluation Programme Your Engineering Team Will Actually Use

AUTHOR

James A. Wondrasek

Your AI feature worked perfectly in the demo. Three weeks after launch, customers are complaining. Your team spends a week trying to reproduce the failures, ships two changes that each fix one thing and break another, then pushes a hotfix that introduces a third problem.

Sound familiar?

Traditional QA gates were built for deterministic software. AI outputs aren’t deterministic, and the same testing discipline doesn’t apply. The broader problem of AI benchmark theater — where public benchmarks promise capability your AI can’t actually deliver for your users — is real. But it gets solved the same way software quality problems have always been solved: systematic measurement, automated gates, and continuous monitoring.

This article gives you a concrete progression path from Level 1 (manual testing, no ML ops) through to Level 5 (continuous optimisation in CI/CD), mapped to team size and resource constraints. Evaluation has become a core engineering competency, not a specialist function. Here’s how to build one.

What Is an AI Evaluation Programme and Why Is It Now a Core Engineering Discipline?

An AI evaluation programme is a structured, ongoing practice of testing and monitoring AI system outputs against defined quality criteria — embedded throughout the development and deployment lifecycle, not bolted on at the end. It runs continuously from model selection through to production monitoring.

Here’s the strategic shift: evaluation belongs to engineering, not to data science or QA.

Think about the arc DevOps followed. Infrastructure management moved from “ops handles deployment” to “engineers own the pipeline.” AI quality is following the same arc. Anthropic’s engineering team puts it directly: evaluations are to AI what tests are to software — they catch regressions early and give engineers the confidence to move fast without breaking things.

The operating model is Evaluation-Driven Development (EDD). It mirrors TDD at the conceptual level — define what success looks like before you build, then iterate against those criteria. The key difference from TDD is that you’re measuring statistically rather than in binary terms. Every change — a prompt tweak, a RAG pipeline update, a model upgrade — can improve performance in one area while quietly degrading another. Without evals, you learn this from customer complaints.

The business case is simple: risk reduction. The regulatory obligations that elevate evaluation beyond engineering best practice are covered in our companion piece. This article focuses on the engineering programme that makes any of it achievable.

How Does AI Agent Evaluation Differ from Traditional Model Evaluation?

Model evaluation is fairly straightforward: a prompt, a response, grading logic. Is the response accurate? Relevant? Safe?

Agent evaluation is a different problem entirely.

An agent uses tools across many turns, modifies state, and adapts as it goes. Mistakes compound. A model evaluation asks “Is this summary accurate?” An agent evaluation asks “Did it search the correct database, extract the right fields, format them correctly, handle the missing record, and produce an accurate summary — and did the path it took create latent reliability risk?”

Three dimensions come into play with agents that model evaluation simply doesn’t require:

Trajectory analysis: Scoring the sequence of steps, not just the final output. An agent can produce a correct final answer via an incorrect path, creating reliability risk that only surfaces under load or edge conditions.

Tool-call scoring: Did the agent select the right tool? Call it with correct parameters? Handle errors gracefully?

Multi-step assessment: Intermediate errors compound. An error in step two of seven can cascade — pass/fail on the final output misses the fragility entirely.

Non-determinism changes how you measure everything. Binary pass/fail becomes pass rate. The pass^k metric (introduced in our article on AI reliability measurement) formalises this: a 75% per-trial success rate across three trials produces a (0.75)³ ≈ 42% probability that all three succeed. For customer-facing agents, that gap is exactly what your evaluation programme must quantify.
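The pass^k arithmetic is easy to sanity-check in a few lines (a generic sketch, independent of any particular eval framework):

```python
def pass_k(per_trial_rate: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return per_trial_rate ** k

# A 75% per-trial pass rate looks respectable on its own...
print(f"{pass_k(0.75, 1):.0%}")  # 75%
# ...but three consecutive successes are much less likely.
print(f"{pass_k(0.75, 3):.0%}")  # 42%
```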

What Does an AI Evaluation Maturity Model Look Like and Where Does My Team Fit?

The Evaluation Maturity Model (attributed to Databricks) provides a five-level progression framework that maps evaluation capability to team size and resource constraints. Most teams without dedicated ML ops capacity sit at Level 1 or Level 2. Build the habit at the level your team can sustain, then grow from there.

Level 1 — Manual Testing: Engineers manually run representative tasks and inspect outputs. No automation required. You’re at this level if you have no scripted tests and quality assessment happens through team intuition and spot-checking before releases. Start with 20-50 test cases — both Anthropic and Confident AI converge on this range. Record results in a spreadsheet and establish a baseline pass rate.

Level 2 — Scripted Test Suite: A repeatable set of test cases with expected outputs, run on demand. Tooling: DeepEval or Promptfoo — both designed for software engineers, not ML specialists. This is where regression evals begin. You’re at this level if you have tests but they require manual initiation and result review, so regressions are sometimes caught before deployment and sometimes not.
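A Level 2 suite can start as little more than a list of cases and a loop. A framework-free sketch, where the `generate` callable stands in for whatever AI call your stack makes:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected: str  # substring the output must contain to pass

def run_suite(cases: list[EvalCase], generate: Callable[[str], str]) -> float:
    """Run every case, print failures, and return the overall pass rate."""
    passed = 0
    for case in cases:
        output = generate(case.input)
        if case.expected.lower() in output.lower():
            passed += 1
        else:
            print(f"FAIL: {case.input!r} -> {output!r}")
    return passed / len(cases)

# Stub model for illustration; swap in your real AI call.
cases = [
    EvalCase("What is our refund window?", "30 days"),
    EvalCase("Do you ship internationally?", "yes"),
]
rate = run_suite(cases, lambda q: "Refunds are accepted within 30 days.")
```

Frameworks like DeepEval or Promptfoo replace the loop and the substring check with richer metrics, but the shape — cases in, pass rate out — stays the same.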

Level 3 — Automated LLM-as-a-Judge Pipeline: Evaluation runs automatically on every significant change. Tooling: LangSmith for trace capture, LLM-as-a-judge for automated scoring. Move here when manual review of your scripted test suite becomes a bottleneck and prompt changes take a day to evaluate properly.

Level 4 — Continuous Monitoring: Production traffic is sampled and scored automatically. Alert thresholds trigger investigation when quality degrades. Tooling: MLflow for experiment tracking, custom alerting. Move here when offline evaluation gives you no visibility into production quality and user complaints are your primary signal for production failures.

Level 5 — CI/CD Integration and Deployment Gates: Evaluation is a mandatory gate in the deployment pipeline. No AI version ships without passing evaluation thresholds. Tooling: GitHub Actions, MLflow, evaluation deployment gates. This is evaluation-driven AI development at its most mature. Move here when evaluation is automated but still siloed from deployment, with someone occasionally checking results manually before shipping.

Level 1 to Level 2 can happen in a sprint. Level 2 to Level 3 requires tooling investment. Level 3 to Level 4 requires monitoring infrastructure. Level 4 to Level 5 requires organisational commitment to make evaluation a deployment blocker.

At the Level 3 to Level 4 transition, tool selection becomes the primary constraint. See our companion guide on the tools that implement each level of the evaluation maturity model.

How Do I Set Up Offline Evaluation Before Deploying an AI System?

Offline evaluation is the pre-deployment phase: testing against a curated dataset before any version reaches production users. Microsoft Azure AI Foundry’s three-stage evaluation lifecycle provides the structural frame. Stages one and two are your offline work.

Stage 1 — Base Model Selection: Compare candidate models against your specific use case using representative tasks. Don’t rely on public benchmarks — they measure general capability, not your workload.

Stage 2 — Pre-Production Evaluation: Run the full test suite against every prompt change, model update, or code change before deployment. This replaces intuition with structured measurement.

Three grading methods, matched to output type:

Code-based grading: String matching, regex, JSON schema validation. Fast, cheap, and objective. Use this for any structured output.

LLM-as-a-judge: A separate LLM scores outputs against defined criteria — relevance, coherence, helpfulness. Scalable for natural language quality. Requires calibration against human judgement.

Human evaluation: The gold standard for subjective quality and the calibration mechanism for LLM-as-a-judge. Expensive and slow — reserve it for calibration, not primary evaluation at scale.

The decision rule is simple: structured output gets code-based grading; natural language gets LLM-as-a-judge; high stakes or calibration work gets human evaluation.
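The decision rule translates directly into code. A sketch using stdlib checks in place of a full JSON Schema library, with the judge left as a stub: `call_llm` is a hypothetical hook for whatever LLM client your stack provides.

```python
import json
import re

def grade_json(output: str, required_fields: set[str]) -> bool:
    """Code-based grading: parse structured output and check required fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_fields <= data.keys()

def grade_regex(output: str, pattern: str) -> bool:
    """Code-based grading: pattern match, e.g. an ID or date format."""
    return re.search(pattern, output) is not None

def grade_with_judge(output: str, rubric: str, call_llm) -> bool:
    """LLM-as-a-judge: delegate to a separate model scoring against a rubric."""
    verdict = call_llm(f"Rubric: {rubric}\nOutput: {output}\nAnswer PASS or FAIL.")
    return verdict.strip().upper().startswith("PASS")

# Structured output gets the cheap, deterministic graders.
assert grade_json('{"name": "Ada", "id": 7}', {"name", "id"})
assert grade_regex("Order ORD-12345 confirmed", r"ORD-\d{5}")
```

The judge stub needs calibration against human judgement before its verdicts gate anything, as noted above.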

How Do I Implement Continuous Monitoring After Deploying an AI Application?

Continuous monitoring is Stage 3 of the Microsoft Azure AI Foundry lifecycle: sampling and scoring live production outputs to detect quality drift after you’ve shipped.

Teams that invest in offline evaluation and stop there have no visibility into production quality degradation. Real-world inputs are more diverse and adversarial than any test dataset you’ll build.

Here’s what you actually need to do:

Sample production traffic: Start with 5-10% of live requests. That’s sufficient for statistical signal without the cost of scoring every interaction.

Apply consistent scoring criteria: Use the same LLM-as-a-judge rubric from your offline evaluation suite. Consistency between offline and online scoring is what makes comparisons meaningful.

Set alert thresholds: When quality scores drop below defined thresholds, trigger automatic alerts. Without alerts, monitoring data just sits unread.

Close the feedback loop: Production failures are the most valuable additions to your offline test dataset. Each failure pattern becomes a new test case. This is how evaluation becomes a continuous improvement practice rather than a one-time quality gate.

Why Does Workload Modelling Determine Everything in AI Evaluation?

Workload modelling is constructing a test dataset that reflects the actual distribution of tasks your AI encounters in production. It’s the non-negotiable prerequisite for meaningful evaluation.

An evaluation suite built from “happy path” scenarios will pass at high rates against an AI that fails routinely in production. Consider a customer support AI tested only on politely phrased, single-issue queries. In production, the inputs are angry, multi-part, poorly punctuated, and referencing account history the AI can’t access. Your high evaluation pass rate has measured nothing useful.

Confident AI warns explicitly against building your baseline on synthetic data — you end up optimising for passing tests that have no correlation to actual outcomes.

If you have production logs, extract real user inputs and weight your test distribution to match actual usage frequency. If you have no production data yet, interview the team about expected use cases and include adversarial and edge-case inputs deliberately — these are where AI systems fail, and intuition consistently underestimates the risk here.
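Weighting by usage frequency can be as simple as counting query categories in your logs and sizing the test set to match. A sketch with illustrative category names and counts:

```python
from collections import Counter

# Category frequencies extracted from production logs (illustrative numbers).
log_categories = Counter({"billing": 600, "shipping": 300, "returns": 100})

def allocate_cases(counts: Counter, total_cases: int) -> dict[str, int]:
    """Size each category's share of the test set to match the production mix."""
    total = sum(counts.values())
    return {cat: max(1, round(total_cases * n / total))
            for cat, n in counts.items()}

print(allocate_cases(log_categories, 50))
```

The `max(1, ...)` floor guarantees rare-but-real categories keep at least one case, which is where deliberate edge-case coverage starts.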

Start with 20-50 test cases. Don’t scale beyond 100 until your metrics demonstrate correlation to real-world outcomes. As usage patterns change, the test dataset must evolve with them.

What Does Error Analysis and Trace Review Actually Look Like in Practice?

Error analysis and trace review is the structured human review of AI execution traces — the full sequence of tool calls, intermediate reasoning steps, and outputs that make up an agent’s execution path.

Automated metrics tell you your pass rate dropped from 87% to 79%. Trace review tells you why.

A trace review session in practice: 2-3 engineers, weekly, 30 minutes. Review 10-15 failed or low-scoring outputs. Walk through each trace step by step — what tool was called, what parameters were used, where the reasoning diverged. Produce a categorisation of failure patterns. Each recurring pattern becomes a new test case.

LangSmith and MLflow both provide trace capture and visualisation that make this practical. And it’s how teams at Level 2 identify what to automate at Level 3 — the patterns you discover manually become the rubric criteria for LLM-as-a-judge automation.

How Do I Integrate AI Evaluation Into an Existing CI/CD Pipeline?

Evaluation gates are a direct extension of existing code quality gates. The principle is identical: tests must pass before code ships. Your team already knows what “all tests green before deploy” means — evaluation deployment gates are the AI equivalent.

Here’s the implementation for GitHub Actions:

Fast smoke tests on every PR: Run deterministic, code-based graders only. These catch obvious regressions quickly without LLM API costs.

Full evaluation suite on merge to main: Run the complete test suite including LLM-as-a-judge scoring. It takes longer but runs at the right point in the pipeline.

Define pass/fail thresholds: a reasonable starting point is an 85% minimum pass rate on regression evals and 70% on capability evals. Start conservative and adjust based on experience.

Block deployment on threshold failure: An AI version that doesn’t meet the quality bar doesn’t ship — the same decision you make when unit tests fail.

One practical constraint worth flagging: LLM-as-a-judge evaluation runs cost money and take time. Optimise by running deterministic checks first, with LLM-as-a-judge reserved for qualifying changes. Run multiple trials per test case and aggregate results statistically — a single evaluation pass is not sufficient for non-deterministic outputs.
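A deployment gate can be a short script the pipeline runs, exiting nonzero below threshold. A sketch assuming a hypothetical `run_case` callable that executes one test case and returns True or False:

```python
import sys

REGRESSION_THRESHOLD = 0.85  # minimum aggregate pass rate to allow deployment
TRIALS_PER_CASE = 3          # repeat each case: outputs are non-deterministic

def gate(cases: list, run_case) -> bool:
    """Aggregate pass rate over repeated trials; True means safe to ship."""
    results = [run_case(case) for case in cases for _ in range(TRIALS_PER_CASE)]
    pass_rate = sum(results) / len(results)
    print(f"pass rate: {pass_rate:.1%} (threshold {REGRESSION_THRESHOLD:.0%})")
    return pass_rate >= REGRESSION_THRESHOLD

def main() -> None:
    # Wire in your real suite and model call here.
    ok = gate(cases=["case-1", "case-2"], run_case=lambda c: True)
    sys.exit(0 if ok else 1)  # nonzero exit fails the CI step
```

Run as an ordinary step in a GitHub Actions job, the nonzero exit blocks the deploy exactly the way a failed unit test does.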

What Does a Minimum Viable AI Evaluation Programme Look Like for a Small Engineering Team?

This is the practical starting point for a 3-5 person team with no ML ops capacity.

Week 1 — Level 1: Manual Testing

Identify the three most common task categories your AI handles. Create 20-50 test cases from real examples — check the support queue, check the bug tracker. Record inputs and expected outputs in a spreadsheet. Run them against the current AI version and establish your baseline pass rate. Even a rough baseline is more useful than no baseline.

Weeks 2-3 — Level 2: Scripted Test Suite

Install DeepEval (open source, implements evaluation metrics in five lines of code) or Promptfoo (YAML-configured, Anthropic-endorsed for agent evaluation). Convert your spreadsheet test cases into scripted evaluations. Add code-based graders for any structured outputs. Run the suite on every prompt or model change — same status as running unit tests before committing.

Ongoing

Weekly 30-minute trace review: 2-3 engineers, 10-15 failed or low-scoring outputs, categorical analysis. Add 5-10 new test cases per week. When the suite reaches 100+ cases and review becomes a bottleneck, you’re ready for Level 3. For tool selection at that point, see our guide on the tools that implement each level of the evaluation maturity model.

This is why evaluation has become a core engineering competency that teams of any size can build. The investment is a one-time setup of 2-3 days and a weekly 30-minute commitment. The alternative is debugging production incidents that evaluation would have caught.

Frequently Asked Questions

How many test cases do I need in my evaluation dataset to start?

Start with 20-50 test cases covering your three most common task categories — that’s the recommendation from both Anthropic and Confident AI. The goal is to establish a baseline, not achieve exhaustive coverage. Add 5-10 cases per week based on production feedback and trace review findings.

What is the difference between LLM-as-a-judge and deterministic code-based scoring?

Code-based scoring uses programmatic checks — regex, JSON schema validation, exact string matching. Fast, cheap, and deterministic, but limited to structured outputs. LLM-as-a-judge uses a separate LLM to assess subjective quality: relevance, coherence, helpfulness. It scales better than human review but requires calibration. Use code-based grading where output structure is predictable; LLM-as-a-judge for natural language quality.

How often should I run AI evaluations in CI/CD?

Deterministic smoke tests on every pull request touching AI-related code. Full evaluation suite on every merge to main. Complete regression suite nightly or weekly depending on your deployment frequency. Always run evaluations multiple times and aggregate results — single-pass evaluation is not sufficient for non-deterministic outputs.

What does a trace review session actually look like?

Weekly, 30 minutes, 2-3 engineers. Review 10-15 failed or low-scoring outputs. Walk through each trace step by step: what tool was called, what parameters were used, where the reasoning diverged. Categorise failure patterns and convert recurring ones into new test cases.

Can I build an AI evaluation programme without an ML ops team?

Yes. Levels 1 and 2 require no ML ops capacity — they use standard engineering tools. ML ops investment becomes relevant at Level 4-5 when production monitoring infrastructure needs to be built and maintained.

What is the difference between offline evaluation and online production monitoring?

Offline evaluation tests against a curated dataset before deployment — it catches known failure modes. Online monitoring scores a sample of live traffic after deployment — it catches unknown failure modes and quality drift from real-world inputs. Both are mandatory. Teams that stop at offline evaluation have no visibility into production quality degradation.

How do I handle non-deterministic AI outputs in evaluation?

Use pass rates across multiple runs rather than single pass/fail assertions. If your AI produces the correct output 8 out of 10 times, your pass rate is 80%. Set minimum pass rate thresholds for deployment gates. The pass^k metric provides a formal framework for reliability measurement across multiple trials.

What should I evaluate first when starting from scratch?

Start with correctness on your most common task type. One metric, measured consistently, is more valuable than five metrics measured sporadically. Once correctness is baselined, add relevance for retrieval-augmented tasks, then safety if your AI interacts directly with end users.

How do I convince my engineering team that AI evaluation is worth the overhead?

Frame it as risk reduction. Show a production failure that evaluation would have caught. Calculate what a single AI-generated error reaching customers costs. Then present the minimum viable programme: 20-50 test cases and a weekly 30-minute review session is not a significant time investment against the cost of debugging production incidents.

What is evaluation-driven development and how is it different from TDD?

EDD applies the TDD principle of “write the test first, then build to pass it” to AI systems. TDD uses binary pass/fail against deterministic code. EDD uses statistical pass rates against non-deterministic outputs, with requirements that evolve as usage patterns change. AI failure modes emerge from production usage rather than being predictable upfront.

How do I know when my team is ready to move from Level 2 to Level 3?

When manual review of evaluation results becomes a bottleneck — typically when your test suite exceeds 100 cases or you’re changing prompts or models more than twice a week. The signal is that you’re spending more time reviewing results than improving the AI. At that point, LLM-as-a-judge automation pays for itself immediately.

What does evaluation-driven development look like for AI agents versus simple LLM calls?

For simple LLM calls, evaluation checks input-output pairs. For agents, evaluation must also check the trajectory — did the agent select the right tools, call them in the right order, handle errors at each step? Agent evaluation requires trace capture tooling and multi-step scoring that simple prompt testing doesn’t. This is why LangSmith and similar tools become necessary at Level 3 and above.
