Business | SaaS | Technology
Mar 2, 2026

What an AI Bill of Materials Is and What to Demand From Vendors

AUTHOR

James A. Wondrasek

AI models arrive wearing permissive licence labels — MIT, Apache 2.0, “open weights” — that look clean in a repository but hide downstream restrictions on training data, acceptable use, and weight access. A recent analysis of LLMware supply chains found 52% exhibit at least one licence conflict, and 35.4% of AI artefacts have no licence declaration at all. The pattern has a name: permissive-washing. And the traditional Software Bill of Materials was never built to catch it.

An AI Bill of Materials (AI-BOM) closes that gap. It is a structured, machine-readable inventory of every AI artefact in a system — models, datasets, weights, inference libraries — along with provenance, licensing terms, access restrictions, and regulatory compliance metadata. This article defines what an AI-BOM must contain, explains why permissive labels are not enough, and gives you a vendor procurement checklist you can take into your next model approval conversation. For the bigger picture, see the broader licensing risk landscape and how AI licence risk compounds across your supply chain.


What is an AI Bill of Materials and how does it differ from a standard SBOM?

Most teams know what an SBOM is: a record of code dependencies — libraries, packages, versions, licences. It answers the question: what code is in this product?

An AI-BOM answers a harder question: what went into making this AI artefact, under what terms, and what restrictions does that impose?

The difference matters because a model is trained on data from thousands of sources — each potentially under different terms — through a pipeline that may have involved a third-party base model, fine-tuning steps that create new obligations, and weight access policies that have nothing to do with the headline licence. None of those dimensions appear in a standard SBOM schema.

The naming varies — AI-BOM, AIBOM, ML-BOM, AI SBOM — and you will encounter all of them in vendor documentation. The best analogy: if an SBOM is the ingredients list on a packaged food product, an AI-BOM is the full supply chain audit trail — where each ingredient was grown, who processed it, and what the grower’s contractual terms allow the manufacturer to do with the final product.
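To make the supply-chain-audit analogy concrete, here is a minimal sketch of what one AI-BOM entry might record for a single model. The field names are illustrative only — loosely inspired by the CycloneDX machine-learning-model component type, not a guaranteed schema:

```python
# Illustrative AI-BOM entry for one model. Field names and values are
# examples, not a real schema; the model names are hypothetical.
ai_bom_entry = {
    "type": "machine-learning-model",
    "name": "example-org/summariser-7b",
    "declared_licence": "Apache-2.0",      # the headline label on the repo
    "weight_access": "gated",              # open | gated | proprietary
    "base_model": "example-org/base-7b",   # lineage: one step upstream
    "training_datasets": [
        {"name": "example-corpus-v2", "licence": "CC-BY-4.0"},
        {"name": "scraped-web-mix", "licence": None},  # undeclared: a red flag
    ],
    "acceptable_use": ["no-medical-diagnosis"],
    "eu_ai_act_tds": False,  # Article 53(1d) Training Data Summary provided?
}

# The headline licence label says nothing about these other dimensions —
# e.g. which training datasets carry no licence declaration at all:
undeclared = [d["name"] for d in ai_bom_entry["training_datasets"]
              if d["licence"] is None]
print(undeclared)  # → ['scraped-web-mix']
```

Even this toy record captures dimensions — dataset terms, lineage, weight access, acceptable use — that a standard SBOM schema has no field for.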


Why do permissive AI licence labels not always mean what they say?

A model repository labelled “MIT” or “Apache 2.0” may carry acceptable use restrictions, training data licence conflicts, or weight access limitations that directly contradict that label. The licence on a model card reflects only what the publisher chose to show. This is permissive-washing: the gap between what the label signals and what your downstream rights actually are.

GitHub and Hugging Face are publisher-controlled metadata environments. Anyone who publishes a model controls what appears in the licence field — there is no independent verification. A study of 760,460 models on Hugging Face found one Google dataset appearing under six distinct licence designations in metadata, while the dataset card read “More Information Needed” in the licence field. In the LLMware analysis, one model categorised as “Other” was actually under the Llama3 licence — even the licence category itself can be wrong.

The practical risk is that you discover after deployment that training data includes copyleft-licensed material, the acceptable use policy prohibits your use case, or redistribution rights restrict how you can package your product. These things typically surface during M&A due diligence, a regulatory audit, or a customer contractual review. Not great timing.

Heather Meeker, an open-source licensing attorney and FOSSA adviser, puts it plainly: “Not all public code is open source. There’s a lot of public code on GitHub covered by other licenses that might very well specifically prohibit AI training, grant other limited licenses, or grant no rights at all.”

Checking the label and calling it done is not enough. A verified AI-BOM is the only defensible path.


What five things must an AI-BOM capture that a standard SBOM misses?

An AI-BOM has to document five categories that sit entirely outside a traditional SBOM’s schema. Each one maps to a concrete legal and operational risk.

1. Training dataset identity and licence terms — Which datasets were used, under what terms, and do those terms permit commercial use and redistribution of model outputs? If training data includes copyleft-licensed material, the permissive headline licence does not override those upstream obligations.

2. Model lineage — The documented chain from base model through fine-tuning and adaptation, with licensing terms at each stage. A restrictive licence at any point flows downstream. Only 15.4% of models on Hugging Face declare any base model relationship — without lineage, you cannot confirm the model you are deploying is legally unencumbered.

3. Weight access policy — Whether weights are openly available, gated, or proprietary — and what the access terms actually permit. A model can be “MIT licensed” in its code while the weights are gated behind terms that prohibit commercial redistribution. These need to be documented separately.

4. Acceptable use restrictions — Contractual constraints that override a permissive headline licence — prohibitions on military use, surveillance, or medical diagnosis. Often buried in terms of service documents that are entirely separate from the repository licence file.

5. EU AI Act compliance metadata — EU AI Act Article 53(1d) requires providers of general-purpose AI (GPAI) models to produce a Training Data Summary documenting datasets, copyright compliance, and data governance. Mandatory since August 2, 2025, for GPAI providers placing models on the EU market — regardless of where the provider is established.

Item 5 is a regulatory obligation, not a best practice. If you use GPAI models and deploy into EU markets, it is not optional. For what the EU AI Act requires from vendors, see EU AI Act and Cyber Resilience Act supply chain obligations explained.


What should you demand from a vendor before approving their AI model?

Treat this like a security questionnaire. You are not being difficult — you are doing your job.

AI Model Procurement Checklist

  1. Complete AI-BOM in SPDX 3.0 or CycloneDX format — An actual file, not a summary. A proprietary format with no standard export is a red flag.
  2. Training dataset licence documentation with dataset-level detail — Specific datasets, not broad categories. “We used publicly available internet data” is not adequate.
  3. Model lineage showing base model and all fine-tuning steps — The chain with licence terms at each stage. “Proprietary process” should trigger legal review.
  4. Weight access terms and redistribution rights — Specific, written, matched to your intended use case.
  5. Acceptable use policy with explicit commercial use confirmation — Ask the vendor to confirm in writing that your specific use case is permitted.
  6. EU AI Act Article 53(1d) Training Data Summary (if applicable) — Mandatory since August 2025 for GPAI providers on the EU market. Inability to provide this is a material procurement risk.
  7. Evidence of independent licence audit or SCA scan results — A tool name, scan scope, and date. Self-attestation without tooling evidence is insufficient.

Red flags that should trigger escalation to legal review: a proprietary AI-BOM format with no standard export, training data described only as "publicly available internet data", lineage described as a "proprietary process", inability to produce an Article 53(1d) Training Data Summary, and self-attestation without tooling evidence.

The EU AI Act makes these demands legitimate, not adversarial. A vendor of a general-purpose AI model has regulatory obligations to provide Training Data Summary documentation. Frame it that way in the conversation and it stops feeling like an unusual request.
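A checklist like this can be screened mechanically before anything reaches legal review. Here is a hedged sketch, assuming a vendor questionnaire response captured as a simple record — all field names and trigger strings are hypothetical illustrations of the red flags above:

```python
# Hypothetical vendor-response record; field names are illustrative.
def red_flags(response: dict) -> list[str]:
    """Return escalation-worthy findings from a vendor questionnaire."""
    flags = []
    if response.get("bom_format") not in {"SPDX-3.0", "CycloneDX"}:
        flags.append("non-standard or missing AI-BOM format")
    if response.get("dataset_detail") == "publicly available internet data":
        flags.append("no dataset-level licence detail")
    if response.get("lineage") == "proprietary process":
        flags.append("undisclosed model lineage")
    if response.get("gpai_on_eu_market") and not response.get("eu_tds_provided"):
        flags.append("missing Article 53(1d) Training Data Summary")
    if not response.get("audit_evidence"):
        flags.append("self-attestation without tooling evidence")
    return flags

vendor = {
    "bom_format": "proprietary-pdf",
    "dataset_detail": "publicly available internet data",
    "lineage": "proprietary process",
    "gpai_on_eu_market": True,
    "eu_tds_provided": False,
    "audit_evidence": None,
}
print(len(red_flags(vendor)))  # → 5: every finding triggers escalation
```

The point is not the code — it is that every checklist item is binary enough to automate, which is what makes it a standing procurement gate rather than a one-off review.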


SPDX 3.0 AI Profile vs. CycloneDX ML-BOM: which format should you require?

Two leading standards exist, and the good news is that it is not either-or: they serve different contexts, and you will probably use both.

SPDX 3.0 AI Profile has ISO/IEC 5962 lineage that gives it regulatory weight in procurement contexts and aligns with the NIST AI Risk Management Framework. Use it when legal defensibility is the priority.

CycloneDX ML-BOM comes from OWASP and is designed for CI/CD automation. The OWASP AIBOM Generator outputs CycloneDX format directly from Hugging Face model metadata, making it the practical open-source generation path.

Here is the approach that makes sense: require SPDX 3.0 AI Profile from external vendors for its regulatory weight, and use CycloneDX ML-BOM for internal generation in CI/CD pipelines where tooling availability matters more. Protobom and BomCTL translate between the two formats, so mandating SPDX from vendors does not prevent you from working with CycloneDX internally. For implementation detail, see adding AI licence compliance to your existing engineering workflow.


How do you generate an AI-BOM for models your team is already using?

Step zero is shadow AI discovery: finding every model in use, including the ones engineering teams adopted without procurement approval. Sonatype and Wiz both offer shadow AI tracking that surfaces unapproved model usage. Survey your teams, check cloud service logs, and establish an internal model registry for every model entering production. You cannot generate AI-BOMs for models you have not catalogued.

Once you have your registry, there are two tooling paths:

Enterprise path — FOSSA provides software composition analysis with snippet scanning for AI-generated code, integrating into CI/CD pipelines alongside existing dependency scanning. Sonatype is an AI governance platform with AI-BOM generation, shadow AI tracking, and procurement governance.

Open-source path — OWASP AIBOM Generator produces CycloneDX-format AI-BOMs from Hugging Face model metadata with field completeness scoring. GUAC (OpenSSF) aggregates SBOM and AI-BOM data across an organisation. BomCTL / Protobom provides CLI tooling for CI/CD integration.

If your team is already using Hugging Face models, start with the OWASP AIBOM Generator. It is the fastest path to a CycloneDX AI-BOM with completeness scoring that shows you exactly what fields are missing. For broader context on why shadow AI and licence ambiguity are converging into a single governance problem, see the open AI supply-chain licensing risk overview.


What is snippet scanning and why does it matter for AI-generated code compliance?

Snippet scanning detects code fragments that match known licensed material. Applied to AI-generated code, it identifies licence obligations introduced by AI coding assistants — GitHub Copilot, Cursor — that are invisible to standard dependency scanning because they show up as copied text, not declared dependencies.

AI-BOM covers pre-procurement risk. Snippet scanning covers post-deployment risk. Heather Meeker again: “There are two legally independent sources of IP risk with AI coding tools: model training (input risk) and model output (output risk).” You need to address both.
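Under the hood, snippet scanners work roughly by fingerprinting small windows of normalised code and matching them against an index built from known licensed sources. A toy sketch of the idea follows — real tools such as FOSSA use far more robust fingerprinting, so treat this purely as an illustration of why declared-dependency scanning cannot catch copied text:

```python
import hashlib

def fingerprints(code: str, window: int = 3) -> set[str]:
    """Hash sliding windows of normalised lines (whitespace/case stripped)."""
    lines = [ln.strip().lower() for ln in code.splitlines() if ln.strip()]
    return {
        hashlib.sha256("\n".join(lines[i:i + window]).encode()).hexdigest()
        for i in range(len(lines) - window + 1)
    }

# Index entry for a known copyleft-licensed snippet (toy example).
known_copyleft = """for (i = 0; i < n; i++) {
    sum += weights[i] * inputs[i];
}
bias_correct(sum);"""

# Code an AI assistant produced: same fragment, different indentation.
generated = """  for (i = 0; i < n; i++) {
      sum += weights[i] * inputs[i];
  }
  bias_correct(sum);"""

overlap = fingerprints(known_copyleft) & fingerprints(generated)
print(bool(overlap))  # → True: the fragments match after normalisation
```

No dependency manifest ever mentions the copied fragment, which is exactly why it is invisible to a standard SBOM scan.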

The 2026 Black Duck OSSRA report found that 68% of audited codebases contain open source licence conflicts, partly driven by AI assistants generating code derived from copyleft sources. And 76% of companies that explicitly prohibit AI coding tools acknowledge their developers are using them anyway. A policy is not a solution.

FOSSA Snippet Scanning integrates into pull request workflows and CI/CD pipelines, surfaces results alongside dependency results, and lets you apply a consistent licence policy to both.


How does AI-BOM generation fit into your existing DevSecOps pipeline?

AI-BOM generation slots into existing CI/CD pipeline gates as an additional verification step — not a new workflow built from scratch. The goal is a standard gate, not a manual one-off audit, so compliance scales as model adoption grows.

Three integration points cover the AI artefact lifecycle:

1. Model intake gate — Before any model enters your internal registry, verify its AI-BOM. Vendor-provided SPDX documentation is evaluated against your procurement checklist. Models without adequate provenance are blocked here.

2. Build gate — Generate or refresh the AI-BOM during CI. The OWASP AIBOM Generator can run as a pipeline step to produce CycloneDX output automatically. FOSSA integrates as a pipeline plugin and runs snippet scanning at the pull request and CI stage.

3. Deployment gate — Before production, confirm AI-BOM compliance metadata meets your policy requirements and link with compliance reporting so outdated models are flagged automatically.
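The model intake gate can be sketched as a small CI step: load the vendor's AI-BOM, verify the provenance fields discussed earlier, and fail the pipeline if any are missing. This is a minimal illustration with assumed field names, not a real SPDX or CycloneDX parser:

```python
# Minimal intake-gate sketch. Field names are illustrative, matching no
# particular schema; a real gate would parse SPDX 3.0 or CycloneDX.
REQUIRED = ["declared_licence", "weight_access", "base_model",
            "training_datasets", "acceptable_use"]

def check_aibom(doc: dict) -> list[str]:
    """Return the provenance problems that should block a model at intake."""
    problems = [f for f in REQUIRED if not doc.get(f)]
    datasets = doc.get("training_datasets", [])
    if all(not d.get("licence") for d in datasets):
        problems.append("dataset licence declarations")
    return problems

# A vendor record with no lineage and no dataset licence terms:
sample = {
    "declared_licence": "Apache-2.0",
    "weight_access": "gated",
    "training_datasets": [{"name": "corpus-a", "licence": None}],
    "acceptable_use": ["no-surveillance"],
}
problems = check_aibom(sample)
print(problems)  # → ['base_model', 'dataset licence declarations']
# In CI, a non-empty result would exit non-zero and block the model.
```

Run as a pipeline step, a non-empty result blocks the model before it ever reaches the internal registry — the same check, applied on every intake rather than in an annual audit.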

The EU AI Act requires this gate for GPAI model deployments into EU markets — Article 53(1d) Training Data Summary requirements need documentation that tracks with model versions, and automated CI/CD generation is the only scalable path. The EU Cyber Resilience Act adds to the pressure: its SBOM mandate for software products with digital elements extends to AI-BOM documentation for AI-enabled software.

For the full implementation path, see adding AI licence compliance to your existing engineering workflow. For the regulatory obligations behind these gates, see EU AI Act and Cyber Resilience Act supply chain obligations explained.


Frequently Asked Questions

What is the difference between an AI-BOM and a traditional SBOM?

An SBOM lists software components and dependencies. An AI-BOM extends this to cover training data provenance, model lineage, weight access policies, acceptable use restrictions, and regulatory compliance metadata. A standard SBOM was designed before AI supply chains existed and cannot capture the data and model context that governs how AI systems can legally be used.

Can a vendor’s “MIT licensed” AI model still create legal problems?

Yes. A permissive label reflects only what the publisher chose to display. It does not guarantee the training data was permissively licensed, that outputs are unencumbered, or that acceptable use terms permit your deployment. 52% of LLMware supply chains exhibit at least one licence conflict.

What tools exist for generating AI-BOMs automatically?

Enterprise: FOSSA (SCA with snippet scanning) and Sonatype (AI governance with shadow AI tracking). Open-source: the OWASP AIBOM Generator (CycloneDX output from Hugging Face metadata), GUAC (OpenSSF supply chain aggregation), and BomCTL/Protobom (CLI tooling for CI/CD integration).

Should I require SPDX or CycloneDX format from AI vendors?

Require SPDX 3.0 AI Profile from external vendors for regulatory weight. Use CycloneDX ML-BOM for internal generation in CI/CD pipelines. The two formats are interoperable via Protobom.

What is snippet scanning and do I need it?

Snippet scanning detects code fragments matching known licensed material — it identifies licence obligations introduced by AI coding tools that standard dependency scanning cannot detect. If your team uses AI coding assistants, you need it. FOSSA is the primary commercial tool.

What does the EU AI Act require regarding AI-BOM documentation?

EU AI Act Article 53(1d) requires providers of general-purpose AI models to produce a Training Data Summary documenting datasets, copyright compliance, and data governance. Mandatory since August 2, 2025, for GPAI providers placing models on the EU market. An AI-BOM operationalises this in machine-readable format.

How do I discover undocumented AI usage (shadow AI) in my organisation?

Survey teams and check cloud service logs. Sonatype and Wiz offer shadow AI tracking. Establish a model registry requirement first — you cannot generate AI-BOMs for models you have not catalogued.

What should I do if a vendor refuses to provide AI-BOM documentation?

Treat refusal as a material procurement risk. Escalate to legal review and consider alternative vendors. Under the EU AI Act, GPAI model vendors have regulatory obligations to provide Training Data Summary documentation — refusal may indicate they cannot meet their own requirements.

What is model lineage and why does it matter for procurement?

Model lineage documents the chain from base model through fine-tuning and adaptation. A restrictive licence at any point flows downstream to the final model. Only 15.4% of models on Hugging Face declare any base model relationship — without it, you cannot confirm the model you are deploying is legally unencumbered.

How does the EU Cyber Resilience Act interact with AI-BOM requirements?

The EU Cyber Resilience Act mandates SBOM documentation for software products with digital elements. For AI-enabled software, this creates convergence with AI-BOM requirements — organisations must produce both traditional SBOMs and AI-specific documentation. Both SPDX and CycloneDX are recognised by the OpenSSF as viable CRA-compliance formats.

Can I use open-source tools instead of enterprise platforms for AI-BOM generation?

Yes. The OWASP AIBOM Generator, GUAC, BomCTL, and Protobom provide a viable open-source path. The OWASP AIBOM Generator produces CycloneDX-format AI-BOMs from Hugging Face model metadata with field completeness scoring. More integration work than enterprise platforms, but a solid starting point for teams building from scratch.


Where to go next

An AI-BOM is one tool in a larger governance picture. Understanding what to demand from vendors gets you past the first gate — but the licence risk that makes that demand necessary runs through every layer of your AI supply chain. For the broader supply-chain licensing landscape — how permissive-washing, training data ambiguity, and regulatory pressure interact — see the complete overview.
