AI vendors love to lead with benchmark scores. The problem is that those scores are often close to meaningless. Retrieval-based audits have documented over 45% overlap between training data and test sets in QA benchmarks, and GPT-4 can infer 57% of masked MMLU answer options, well above chance. Vendors pick the metrics that make their models look best, report single-run results as if they were stable, and cite benchmarks that have been contaminated for years.
The regulatory environment is catching up. OMB M-26-04 requires US federal agencies to request evaluation artifacts from AI vendors by March 2026. The EU AI Act phases in mandatory performance disclosure for high-risk systems from August 2026. Australia’s 10 Guardrails framework treats evaluation documentation as a procurement checklist item. The regulatory landscape for AI evaluation is moving fast — and procurement requirements are where the compliance obligations become concrete.
But here’s the gap: nobody has published a concrete checklist of what to actually ask for. This article is that checklist. It’s the external procurement companion to building an internal benchmark governance framework. For the broader context, start with the AI benchmark governance guide.
What are evaluation artifacts and why should they be a procurement requirement?
Evaluation artifacts are the full set of documented evidence a vendor needs to produce to show that their claimed model performance is reproducible and independently verifiable. Think of them as the AI procurement equivalent of requesting audited financials before an acquisition. Not a nice-to-have. A basic due diligence standard.
They’re not the same as a model card. A model card tells you what the vendor says the model does. Evaluation artifacts let you verify whether that’s actually true. A vendor who hands you a model card but no artifact package has given you a summary without the underlying evidence.
The complete package has six components:
- Benchmark datasets — the held-out test data, with task diversity, domain coverage, and recency indicators
- Prompt sets — the exact prompts used during evaluation, since prompt phrasing materially affects output quality
- Scoring scripts — executable code that calculates benchmark scores, and the only way to independently reproduce published numbers
- Variance analyses — multi-run results with standard deviations showing score consistency across independent test runs
- Result logs — raw, unedited output logs to verify that published scores weren’t cherry-picked
- Eval factsheet — a structured questionnaire covering evaluation protocols, data sources, metrics, and reproducibility details
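The completeness of a package can be checked mechanically before any deeper review. A minimal Python sketch, with hypothetical field names standing in for the six components:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class ArtifactPackage:
    """One vendor's evaluation artifact submission (paths are hypothetical)."""
    benchmark_datasets: Optional[str] = None  # held-out test data
    prompt_sets: Optional[str] = None         # exact evaluation prompts
    scoring_scripts: Optional[str] = None     # executable scoring code
    variance_analyses: Optional[str] = None   # multi-run results with std devs
    result_logs: Optional[str] = None         # raw, unedited output logs
    eval_factsheet: Optional[str] = None      # structured questionnaire

    def missing_components(self) -> list:
        """List every component the vendor has not supplied."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

# A partial submission: datasets and scripts, nothing else.
pkg = ArtifactPackage(benchmark_datasets="data/heldout.jsonl",
                      scoring_scripts="scripts/score.py")
print(pkg.missing_components())
```

Anything in the missing list is a follow-up item before the review proceeds; a package with gaps is a summary, not evidence.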
Regulatory deadlines are turning this into a compliance obligation across multiple regimes. Even if you’re outside a regulated industry, building the practice now reduces your compliance burden later.
What must a complete set of evaluation artifacts contain?
Each component blocks a specific form of evasion. Understanding why each one matters helps you evaluate partial submissions — because a vendor who provides benchmark datasets but no scoring scripts has not provided reproducibility.
Benchmark datasets must be the actual held-out test data, not just benchmark names. A vendor who names benchmarks but won’t provide the data can’t be assessed for contamination.
Prompt sets matter because inconsistent prompt templates can skew scores by double-digit percentages. If you can’t see the prompts, you can’t tell whether evaluation conditions match your deployment.
Scoring scripts are the executable code that produced the scores. Without them, you have a claim, not evidence.
Variance analyses exist because AI outputs are probabilistic. Single-run scores are unreliable. You need standard deviations across at least three independent runs to tell a genuinely high-performing model from one that got lucky.
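The multi-run requirement is easy to operationalise once you have the raw scores. A minimal sketch using only the standard library; the scores are invented for illustration:

```python
import statistics

def summarise_runs(scores):
    """Summarise multi-run benchmark scores; requires >= 3 independent runs."""
    if len(scores) < 3:
        raise ValueError("variance analysis needs at least three runs")
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),   # sample standard deviation
        "spread": max(scores) - min(scores),
    }

# Two hypothetical vendors with similar headline numbers:
stable = summarise_runs([84.1, 84.4, 83.9, 84.2])  # tight cluster
lucky = summarise_runs([84.2, 78.5, 79.1, 80.3])   # one good run
print(stable["stdev"], lucky["stdev"])
```

The second vendor's headline score is a best run, not a typical one; the standard deviation makes that visible where a single number hides it.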
Result logs verify that published scores weren’t cherry-picked from the best of multiple attempts.
Eval factsheets are an emerging standardisation format — a structured questionnaire covering who ran the evaluation, what was evaluated, what datasets were used, and how scoring works.
How do you request evaluation artifacts in practice?
Request them during pre-contract due diligence, not after signing. Artifacts are a procurement input, not a post-purchase audit.
Include specific clause language in your RFP. Generic requests for “evaluation documentation” aren’t enforceable. Name each component:
“As part of our technical evaluation, we require: benchmark datasets (with task diversity, domain coverage, and recency indicators), prompt sets (exact prompts used in evaluation), scoring scripts (executable code for reproducing scores), variance analyses (multi-run results with standard deviations), result logs (raw, unedited output logs), and an eval factsheet. All artifacts must be delivered in machine-readable, version-controlled format before contract execution.”
For API access procurement, request artifacts for the specific model version being licensed. For embedded AI features, request artifacts for the AI component specifically — the evaluation obligation applies regardless of delivery mechanism.
When vendors claim artifacts are proprietary, offer NDA terms. Legitimate vendors can provide artifacts under NDA. A vendor who declines even under NDA is telling you something important about their evaluation governance.
The request itself has value regardless of outcome. Connect the artifacts you receive to your internal benchmark governance workflow — they become inputs to your internal review process, not standalone documents.
How do you cross-reference vendor benchmark claims against community and independent sources?
Community evaluations are a cross-reference source, not a replacement for vendor-supplied artifacts. You need both. Here’s the five-step workflow:
Step 1: Record the specific claims. Note every benchmark cited and the exact scores reported — benchmark name, version, task subset, and model checkpoint.
Step 2: Locate the model on community platforms. Chatbot Arena (LMSYS) for conversational AI. HELM (Stanford) for multi-task capability. LiveBench for recency-controlled, contamination-resistant evaluation. Hugging Face Open LLM Leaderboard for open-source models.
Step 3: Compare vendor scores against community results. A gap of 20 or more positions between the vendor's marketed rank and its community leaderboard placement warrants scrutiny. Consistent underperformance across multiple independent sources is a material concern.
Step 4: Check Artificial Analysis for independent cost-performance benchmarking. A model with prohibitive inference costs has a different risk profile than the capability benchmark alone suggests.
Step 5: Document the findings. Record platforms checked, scores found, and the delta between vendor claims and independent results. Every claim in your selection rationale needs to link to a specific source — not vendor marketing.
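Steps 1 through 5 reduce to recording claims, recording independent results, and computing deltas. A sketch with invented scores; the 5-point flag threshold is an assumption you should tune to your risk tolerance:

```python
# Hypothetical vendor claims vs. independently observed community scores.
vendor_claims = {"MMLU": 86.0, "HumanEval": 79.0, "LiveBench": 61.0}
community_scores = {"MMLU": 84.5, "HumanEval": 71.0, "LiveBench": 48.0}

def score_deltas(claims, observed, threshold=5.0):
    """Flag benchmarks where the vendor's claim exceeds independent results."""
    report = {}
    for bench, claimed in claims.items():
        if bench not in observed:
            report[bench] = "no independent source found"
            continue
        delta = claimed - observed[bench]
        flag = " FLAG" if delta > threshold else ""
        report[bench] = f"delta {delta:+.1f}{flag}"
    return report

for bench, finding in score_deltas(vendor_claims, community_scores).items():
    print(bench, "->", finding)
```

The report itself becomes part of the Step 5 documentation: each delta is a concrete, sourced finding rather than an impression.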
What are the red flags in vendor benchmark reporting?
Some of these are concerning. Some are disqualifying.
No scoring script disclosure. Without executable scoring scripts, reproducibility is impossible. Disqualifying.
Single-run results only. Single runs are unreliable for probabilistic models. Requires follow-up.
Cherry-picked task subsets. The vendor reports scores on tasks where the model performs well and quietly omits those where it doesn’t.
No benchmark dataset details. Benchmark names without the actual test data. Contamination risk can’t be assessed.
Stale benchmarks only. MMLU and HumanEval are contaminated and saturated. No results on LiveBench or equivalent dynamic evaluations is a problem.
Marketing-grade only. Infographics and summary statistics with no result logs, no methodology, no path to independent verification. That’s not evaluation evidence — that’s marketing collateral.
Refusal framed as IP protection. Legitimate vendors can provide artifacts under NDA. A vendor who won’t provide evidence even under NDA is indicating inadequate evaluation governance.
Demo-to-benchmark mismatch. AI demos are uniquely misleading. If the demo quality doesn’t match the benchmarks, dig into why.
How do you assess contamination risk in vendor-reported scores?
Data contamination happens when a model’s training data overlaps with the benchmark test data, producing inflated scores that don’t reflect real-world capability. Retrieval-based audits report over 45% overlap on QA benchmark datasets. With models trained on multi-trillion-token corpora, contamination is increasingly structural.
Ask vendors these four questions directly:
- What was the training data cutoff date relative to the benchmark dataset publication date?
- Were any benchmark datasets or derivatives included in training data?
- What contamination detection methods were applied?
- Can you provide results on contamination-resistant benchmarks like LiveBench?
A vendor who can’t answer the cutoff question is operating without evaluation governance.
A vendor who scores well on MMLU but poorly on LiveBench, where tasks refresh continuously, shows a gap that contamination plausibly explains. Ask whether they've participated in any proctored evaluations and request those results. PeerBench is the gold standard: secret test sets, proctored execution, continuous renewal.
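The legacy-versus-dynamic gap, together with the unusually-low-variance pattern, can be screened with a simple heuristic. A sketch; the thresholds are assumptions, and the output is a signal for follow-up questions, not proof of contamination:

```python
def contamination_signals(legacy_score, dynamic_score, legacy_stdev,
                          gap_threshold=10.0, variance_floor=0.5):
    """Heuristic screen: a large legacy-vs-dynamic gap or near-zero variance
    on a legacy benchmark suggests memorisation. Thresholds are assumptions."""
    signals = []
    if legacy_score - dynamic_score > gap_threshold:
        signals.append("legacy/dynamic gap exceeds threshold")
    if legacy_stdev < variance_floor:
        signals.append("unusually low variance on legacy benchmark")
    return signals

# Hypothetical vendor: strong MMLU, weak LiveBench, near-zero MMLU variance.
print(contamination_signals(legacy_score=86.0, dynamic_score=52.0,
                            legacy_stdev=0.1))
```

Either signal on its own justifies putting the four contamination questions to the vendor in writing; both together make the answers a gating item.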
How do you structure a traceable model selection decision?
A model selection decision document captures the full chain of evidence behind an AI procurement decision. Every claim in the selection rationale is linked to a specific artifact, cross-reference result, or red flag finding — not to vendor marketing.
Structure it in seven sections:
- Business requirements and use case definition — what the AI system needs to do, what performance matters, what constraints apply
- Vendor shortlist and evaluation criteria — which vendors were considered and what weighting was applied
- Evaluation artifact review findings per vendor — what was received, what was missing, what the review revealed
- Community eval cross-reference results per vendor — platforms checked, scores found, deltas noted
- Red flag assessment per vendor — which patterns appeared, whether concerning or disqualifying
- Contamination risk assessment per vendor — vendor responses, legacy vs dynamic benchmark comparison
- Final selection rationale with evidence references — the decision linked to findings in sections 3–6
Format it so non-technical stakeholders can read the rationale directly, with technical evidence in appendices. For teams without dedicated procurement staff, a structured template is fine — the goal is a clear evidence chain. Include a refresh clause so updated artifacts are required whenever the vendor releases a new model version. Align that with your AI benchmark governance review cycle.
A vendor evaluation artifacts checklist
Use this at procurement time. Each item is a binary verification.
Category 1: Artifact receipt — Benchmark datasets received? Prompt sets received? Scoring scripts received? (If not: disqualifying.) Variance analyses received? Result logs received? Eval factsheet received?
Category 2: Artifact completeness — Datasets include post-training-cutoff data? Scoring scripts are executable, not pseudocode? Variance analyses cover at least three runs? Result logs are raw and unedited?
Category 3: Cross-reference verification — Chatbot Arena checked. HELM checked. LiveBench checked — flag MMLU-vs-LiveBench gaps. Artificial Analysis checked. Hugging Face Open LLM Leaderboard checked where applicable.
Category 4: Red flag review — No scoring script disclosure (disqualifying). Single-run results only (request multi-run). Cherry-picked subsets (request full results). Stale benchmarks only (request dynamic equivalents). Marketing-grade only (request full package). Blanket IP refusal (offer NDA; if refused, document as material risk). Demo-benchmark mismatch (test on real use-case tasks).
Category 5: Decision documentation — Business requirements recorded. Artifact review findings recorded per vendor. Cross-reference results recorded per vendor. Red flag and contamination assessments recorded per vendor. Final rationale linked to evidence. Artifacts retained in governance system.
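Because every item is a binary check, the whole checklist reduces to a pass / follow-up / disqualify verdict. A minimal sketch over a hypothetical subset of items, with the disqualifying ones marked:

```python
# Hypothetical checklist subset: (item, disqualifying_if_failed).
CHECKLIST = [
    ("scoring scripts received", True),
    ("benchmark datasets received", False),
    ("variance analyses cover >= 3 runs", False),
    ("result logs raw and unedited", False),
]

def evaluate(answers):
    """Turn binary checklist answers into an overall procurement verdict."""
    failed = [item for item, _ in CHECKLIST if not answers.get(item, False)]
    disqualified = [item for item, dq in CHECKLIST
                    if dq and not answers.get(item, False)]
    if disqualified:
        return "DISQUALIFIED: " + ", ".join(disqualified)
    if failed:
        return "FOLLOW-UP NEEDED: " + ", ".join(failed)
    return "PASS"

print(evaluate({"scoring scripts received": True,
                "benchmark datasets received": True,
                "variance analyses cover >= 3 runs": False,
                "result logs raw and unedited": True}))
```

An unanswered item counts as failed, which matches how the checklist should work in practice: the burden of evidence sits with the vendor.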
This checklist is the external procurement companion to the internal benchmark governance framework. Together they give you end-to-end governance coverage: vendor accountability on the outside, evaluation discipline on the inside. For the full landscape of how these practices fit into the emerging regulatory picture, the AI benchmark governance overview is the place to start.
Frequently asked questions
What if a vendor refuses to provide evaluation artifacts?
Request a written explanation, offer NDA terms explicitly, and escalate to vendor management. If they still refuse, document the refusal in the model selection decision record as a material risk factor. A vendor who can’t demonstrate that their model does what they claim shouldn’t pass procurement due diligence.
Does requiring evaluation artifacts apply to API access only, or also to embedded AI features?
Both. The evaluation obligation applies regardless of delivery mechanism — direct API, embedded in a SaaS product, or on-premise. For embedded AI features, request artifacts for the AI component specifically, even when AI isn’t the primary product.
How do I know if a vendor’s benchmark scores are contaminated?
Compare vendor scores on legacy benchmarks (MMLU, HumanEval) against scores on contamination-resistant benchmarks (LiveBench). A significant gap suggests training data overlap. Request multi-run results — contaminated models tend to show unusually low variance on legacy benchmarks because they’re recalling memorised answers.
What is the difference between a model card and an evaluation artifact package?
A model card is a disclosure document. An evaluation artifact package is an evidence package. One tells you what the vendor says the model does; the other lets you verify whether that’s true. Requiring a model card without evaluation artifacts is like requesting an annual report without the underlying financial statements.
Can I use community evaluations like Chatbot Arena instead of requiring vendor artifacts?
Community evaluations are a cross-reference tool, not a substitute for vendor-supplied artifacts. Vendor artifacts tell you how the vendor tested their own model and whether those results are reproducible. You need both.
What if my organisation does not have ML expertise to review evaluation artifacts?
The checklist above is designed for procurement teams without dedicated ML staff. You can verify artifact completeness, check cross-reference results, and identify red flags without ML expertise. For scoring script review, consider a third-party technical reviewer or an open-source evaluation framework such as promptfoo.
Are there regulatory penalties for not requiring evaluation artifacts?
For US federal agencies, OMB M-26-04 creates procurement compliance obligations with a March 2026 deadline. For EU-market organisations using high-risk AI, EU AI Act requirements phase in from August 2026. For private-sector organisations outside regulated industries, no direct penalty exists yet — but the regulatory trajectory makes artifact requirements a foreseeable standard.
What should I do with evaluation artifacts once I receive them?
Review for completeness against the checklist. Cross-reference vendor scores against community evaluations. Run scoring scripts against a sample of the benchmark dataset if you have the capability. Document findings in the model selection decision record. Retain artifacts as part of the procurement audit trail and include a refresh clause in the contract.
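If you do have the capability to run the vendor's scoring script, the reproduction check can be as simple as the sketch below. The script path, sample file, and the JSON output contract (`{"score": <float>}`) are all hypothetical; adapt them to whatever the vendor actually ships:

```python
import json
import subprocess
import sys

def reproduce_score(script, sample, published, tolerance=1.0):
    """Run a vendor scoring script on a data sample and compare its output
    to the published score. Assumes the script prints {"score": <float>}
    as JSON on stdout -- a hypothetical contract, not a standard."""
    result = subprocess.run([sys.executable, script, sample],
                            capture_output=True, text=True, check=True)
    reproduced = json.loads(result.stdout)["score"]
    within_tolerance = abs(reproduced - published) <= tolerance
    return reproduced, within_tolerance
```

A reproduction outside tolerance is not automatically disqualifying (sampling a subset shifts scores), but it is a finding to record and to put back to the vendor.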