If you’ve ever managed software dependencies at scale, you know the feeling. A late-night Slack message. A transitive dependency buried four levels deep — it carries a vulnerability, and it’s already in production everywhere. AI supply chains carry the same risk. Quieter, less visible, and already inside more stacks than most engineering leaders realise.
Here’s the thing. A permissive licence label on a model repository does not mean the underlying training data carries those same rights. The MIT or Apache-2.0 tag is a metadata field. It’s not a legal grant. It describes how the model weights are distributed — not what rights were secured over the data used to train them.
A 2025 audit of 124,278 AI supply chains found that only 5.75% of applications preserved compliant licence notices from their upstream models. By the time an AI component reaches your production stack, roughly 94% of the required legal documentation has already vanished.
This article maps the three-tier AI supply chain — dataset, model, application — explains how licence risk compounds at each layer, and closes with a first-pass audit checklist for production model approval. For the full scope of open AI supply-chain licensing risk, see our pillar overview. For the foundational definition of what permissive-washing actually means, start with our ART001 explainer.
How Do AI Licences Work Across the Three Supply Chain Tiers?
The AI supply chain has three distinct layers — dataset, model, and application. Licence obligations are supposed to propagate through each one. In practice, they almost never do.
At the dataset layer, training data is collected under specific terms: Creative Commons variants, public domain dedications, commercial restrictions, or the legally ambiguous category of “publicly available” web content. Whatever restrictions exist at this layer define what downstream uses are legally permitted — regardless of any label applied later.
At the model layer, a model inherits the obligations of the most restrictive licence in its training corpus. A single restrictively licensed dataset in a training run taints the entire model’s legal posture. Documented or not.
At the application layer, any product integrating that model inherits both the model’s stated licence terms and every unresolved obligation from every upstream dataset. Each integration adds another layer of legal uncertainty.
If you work with npm or pip, you already understand how this works. A licence issue in a transitive dependency propagates through every package that depends on it. You have package-lock.json, requirements.txt, and automated scanners to track that dependency tree. In AI, there’s no equivalent. Training data licences are almost never bundled with the resulting model weights. The chain of custody breaks at the first training run.
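The inheritance rule described above (a model takes on the most restrictive licence in its training corpus, and an application inherits that in turn) can be sketched in a few lines. The restrictiveness ranking and licence identifiers below are illustrative assumptions, not a legal analysis:

```python
# Illustrative sketch of licence obligations propagating through the
# dataset -> model -> application tiers. The ordering is an assumption
# for demonstration purposes, not legal advice.

RESTRICTIVENESS = {  # higher = more restrictive (illustrative ordering)
    "public-domain": 0,
    "mit": 1,
    "apache-2.0": 1,
    "cc-by-4.0": 2,       # requires attribution
    "cc-by-sa-4.0": 3,    # attribution plus share-alike
    "cc-by-nc-4.0": 4,    # non-commercial restriction
    "unknown": 5,         # unverified provenance: assume the worst case
}

def effective_licence(upstream_licences):
    """A model inherits the most restrictive licence in its corpus."""
    return max(upstream_licences, key=lambda lic: RESTRICTIVENESS.get(lic, 5))

# One restrictive dataset taints the whole model, whatever the tag says:
model_licence = effective_licence(["mit", "apache-2.0", "cc-by-nc-4.0"])

# The application then inherits the model's effective licence plus
# the terms of its other components:
app_licence = effective_licence([model_licence, "mit"])
```

Note that `"unknown"` ranks as the most restrictive entry: where provenance cannot be verified, the conservative position is to assume the strictest obligations apply.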
Why Do Attribution Obligations Almost Never Make It Downstream?
They fall apart at scale. A 2025 audit (arXiv:2602.08816) of 124,278 dataset-to-model-to-application supply chains found:
- Only 27.59% of models preserved compliant licence notices from their upstream datasets. Three out of four models silently drop required attribution.
- Only 5.75% of applications preserved compliant licence notices from the models they integrated. About 94% of required legal notices disappear before production.
Multiply those retention rates together and the probability of a downstream application retaining complete, legally compliant attribution from its original training data comes out at roughly 1.59%. Below 2%.
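The compounding arithmetic is simple enough to check directly. The two rates are the audit's figures quoted above:

```python
# Reproducing the compounding retention rate from the audit figures.
model_retention = 0.2759   # models preserving compliant dataset notices
app_retention = 0.0575     # applications preserving compliant model notices

# Both hand-offs must succeed for attribution to survive end to end.
end_to_end = model_retention * app_retention
print(f"{end_to_end:.2%}")
```

Each tier boundary is an independent chance for documentation to drop, which is why the end-to-end figure is so much worse than either rate alone.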
Why does this happen? Model training pipelines simply aren’t designed to carry licence metadata. Model cards are the primary documentation vehicle, but they’re manually authored and frequently incomplete — only 7.1% of Hugging Face models declare their training datasets at all. Application developers consuming a model rarely look beyond the top-level licence tag, because every other software component they work with is covered by automated tooling. That tooling doesn’t exist for AI training provenance.
Why Is “Publicly Available” Training Data Not the Same as “Openly Licensed” Training Data?
This is the misconception behind most AI licence risk at the dataset layer. “Publicly accessible on the internet” and “freely usable for any purpose including AI training” are legally distinct categories. A significant chunk of the current AI industry has been treating them as the same thing.
Copyright exists by default. A blog post, a news article, a StackOverflow answer — all copyrighted at the moment of creation, regardless of whether they require a login to access. Public visibility does not waive the creator’s rights. Many of the largest pre-training datasets are derived from Common Crawl and similar web scrapes — vast quantities of copyrighted material collected without explicit licence grants. A dataset may be published with an open licence applied to the collection, but that licence cannot grant rights the publisher did not hold.
If your vendor tells you their model was trained on “publicly available” data, that is not an answer to the provenance question. It’s a deflection of it. This is one of several compounding factors across the open AI supply-chain licensing risk landscape that make dataset-level scrutiny non-negotiable.
What Is Data Laundering and How Does It Enter Your AI Stack Invisibly?
Data laundering — sometimes called licence laundering — is how the “publicly available” problem gets into your supply chain even when the model’s documented training datasets appear to have clean licences.
Here’s how it works. A dataset aggregator collects text, images, or code from sources with various licence restrictions. Rather than verifying that every constituent source permits relicensing, the aggregator publishes the combined dataset under a single permissive licence at the collection level. Downstream model trainers consume it trusting the aggregator’s label. The original restricted content is now embedded in a corpus that looks clean.
EleutherAI’s Common Pile project was created as a direct response to this problem — verify licensing at the level of individual constituent works, not at the collection level. The result is an 8TB corpus where every source is either public domain or Open Definition-compliant, and the Comma language model trained on it performs on par with or outperforms Llama 2 and Qwen3. Legally clean training is technically achievable. It’s currently just rare.
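The difference between the two verification strategies can be made concrete. A minimal sketch, assuming a hypothetical allow-list of open licences (the names are illustrative, not a legal determination):

```python
# Contrast between collection-level labelling (the hole data laundering
# exploits) and source-level verification. The allow-list below is a
# stand-in assumption, not an authoritative list of open licences.

OPENLY_LICENSED = {"public-domain", "cc0-1.0", "mit", "apache-2.0", "cc-by-4.0"}

def collection_level_check(collection_licence: str) -> bool:
    """Trusts the aggregator's single label. This is where laundering hides."""
    return collection_licence in OPENLY_LICENSED

def source_level_check(sources: dict[str, str]) -> bool:
    """Verifies every constituent work individually."""
    return all(lic in OPENLY_LICENSED for lic in sources.values())

sources = {
    "web-scrape-corpus": "unknown",       # no explicit grant from creators
    "government-reports": "public-domain",
}

looks_clean = collection_level_check("mit")   # passes: the label says so
is_clean = source_level_check(sources)        # fails: laundered content found
```

The two checks disagree on the same data, which is exactly the invisible gap a model card cannot surface.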
Data laundering is difficult to detect at the procurement layer. A model card may accurately identify its training datasets by name, and still be inheriting laundered content from those datasets with no visible signal. For the foundational definition of this problem, see our complete permissive-washing explainer.
What Is Shadow AI and Why Is It a Compliance Gap You Cannot See?
Shadow AI is the AI-era equivalent of shadow IT: AI tools and models your team is actively using that have never been through formal procurement review, compliance vetting, or IT governance. The licence exposure they create is real — and invisible to any audit.
A developer fine-tunes a Hugging Face model on company data without checking the base model’s licence. A team adopts an AI coding assistant without reviewing its training data. An engineer integrates an open-weights model labelled “MIT” without verifying that an actual licence file exists (given the statistics above, it probably does not). These aren’t hypothetical edge cases. They’re the norm when teams are iterating on models faster than any oversight process can track.
Without a policy requiring AI model review before deployment, you have no visibility into what licence obligations your team has already accepted on your behalf. For more on building that governance layer, see our guide to AI Bills of Materials.
What Does Clean Data Provenance Actually Look Like — and How Do You Ask for It?
Clean data provenance means a verifiable, documented record of where every component of a training dataset came from, what licence it was collected under, and that the licence actually permits the downstream use. Not a blanket collection-level declaration. A source-level verified audit trail.
The open weights versus open source distinction matters here. An open-weights model releases weights for download but may not disclose training data composition or provenance. An open-source AI model requires disclosure of training data, training code, and methodology in addition to weights. Most models on Hugging Face are open-weights. When a vendor cannot disclose training data provenance, the documentation may simply not exist.
The EU AI Act’s Article 53(1)(d) gives enterprise buyers a lever: GPAI model providers placing models on the EU market must publish a sufficiently detailed summary of training data content. A vendor who cannot produce that documentation is telling you something about their provenance practices.
Before approving any model, ask:
- Does your model card identify all training datasets by name?
- Can you provide licence documentation for each named training dataset?
- Has the training data been independently audited for licence compliance?
- Do you accept contractual liability for training data licence defects?
- Can you provide an AI Bill of Materials for this model?
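One way to make the answers auditable rather than conversational is to record them explicitly. A sketch, with hypothetical field names mirroring the five questions above:

```python
# Hypothetical structure for recording vendor answers so the gaps are
# explicit rather than implied. Field names are illustrative assumptions.

PROVENANCE_QUESTIONS = [
    "datasets_named_in_model_card",
    "licence_docs_per_dataset",
    "independent_licence_audit",
    "contractual_liability_for_defects",
    "ai_bom_available",
]

def provenance_gaps(vendor_answers: dict) -> list:
    """Questions the vendor answered 'no' to, or did not answer at all."""
    return [q for q in PROVENANCE_QUESTIONS if not vendor_answers.get(q)]

# A typical response: one confirmed item, one refusal, three silences.
answers = {"datasets_named_in_model_card": True, "ai_bom_available": False}
gaps = provenance_gaps(answers)
```

Treating a non-answer the same as a "no" is deliberate: silence on a provenance question is itself a finding.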
A First-Pass Audit Checklist Before Integrating an AI Model Into Production
This is the minimum-viable entry point — not a substitute for formal tooling or legal review, but the baseline every team should clear before any model goes near production.
Step 1: Read the model card beyond the licence tag. Does the model card identify training datasets by name? Only 7.1% of Hugging Face models declare any training dataset. This step will fail for roughly 93% of models, and that failure is itself a risk signal.
Step 2: Verify the licence file exists and matches the tag. Is there an actual LICENSE file in the repository, or just a metadata field in the README? 93.4% of models lack a dedicated licence file. Absence is the norm, but it is still a compliance gap.
Step 3: Check upstream dataset licences. For each training dataset named in the model card, check its licence independently. If the model card does not name datasets, you cannot perform this check — which should prompt escalation.
Step 4: Look for attribution notices. Does the model repository include copyright notices for upstream datasets? 96.5% of datasets and 95.8% of models lack required licence text. Expect to find nothing — and treat that absence as a documented risk.
Step 5: Ask about fine-tuning data. If the model has been fine-tuned, what data was used? 24% of parent-to-child model relationships have different licensing between child and parent, and fine-tuned models frequently drop licence provenance entirely.
Step 6: Check for restrictive use clauses. Does the licence contain restrictions not typical of the label it carries? Llama 2’s 700-million-monthly-active-user threshold, non-commercial clauses, output-usage restrictions: none of these are permissive terms, regardless of the metadata tag. Read the full licence text.
Step 7: Document your findings. Record what you verified, what you could not verify, and what risk you’re accepting. “We checked and found gaps” is a defensible position. “We assumed the label was accurate” is not.
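Steps 1, 2, and 7 lend themselves to a scripted first pass. This is a sketch under stated assumptions (the file listing and field names are hypothetical), not a substitute for reading the repository yourself:

```python
# First-pass sketch of audit steps 1, 2, and 7: check for named
# datasets, check for a real licence file, and record the findings.
# Inputs are illustrative; a real check would read the repo directly.

def first_pass_audit(repo_files, card_datasets, licence_tag):
    findings = {
        "datasets_named": bool(card_datasets),          # step 1
        "licence_file_present": any(                    # step 2
            f.upper().startswith(("LICENSE", "LICENCE"))
            for f in repo_files
        ),
        "licence_tag": licence_tag,
    }
    # Step 7: record explicit risk signals rather than assuming the
    # tag is accurate.
    findings["risk_signals"] = [
        key for key in ("datasets_named", "licence_file_present")
        if not findings[key]
    ]
    return findings

# The typical case per the statistics above: a tag, and nothing else.
result = first_pass_audit(
    repo_files=["README.md", "config.json", "model.safetensors"],
    card_datasets=[],
    licence_tag="mit",
)
```

A run over the typical repository yields two risk signals and a bare tag, which is precisely the "we checked and found gaps" record step 7 asks for.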
The AI supply chain does not propagate licence information reliably. Manual verification at the point of integration is, currently, the minimum standard.
For formal tooling and structured governance, see our guide to AI Bills of Materials. For the full picture of open AI supply-chain licensing risk, see the comprehensive overview.
Frequently Asked Questions
What is permissive-washing in AI?
Labelling AI artefacts with permissive licence tags like MIT or Apache-2.0 while omitting the required legal documentation — licence text, copyright notices, upstream attribution — that makes the label enforceable.
Does an MIT licence on a Hugging Face model mean the training data is also MIT-licensed?
No. The licence tag describes the terms under which model weights are distributed. It says nothing about the licences governing training data, which may include copyrighted material or data with no explicit licence at all. Only 7.1% of models declare their training datasets.
What is the difference between open weights and open source AI?
Open weights means model weights are available for download. Open source AI requires disclosure of training data, training code, and methodology in addition to weights. Most models on Hugging Face are open-weights — which determines whether independent provenance verification is even possible.
How does licence risk compound across the AI supply chain?
Risk introduced at the dataset layer transfers to any model trained on that data, and then to any application integrating that model. With only 27.59% of models preserving compliant dataset notices and only 5.75% of applications preserving compliant model notices, the probability of complete attribution in a production application is below 2%.
What is an AI Bill of Materials (AI-BOM)?
A structured inventory of all components in an AI system — training datasets, model weights, training code, and dependencies — analogous to a Software Bill of Materials (SBOM). It enables verifiable licence and provenance tracking across the AI supply chain.
What is data laundering in AI training datasets?
When copyrighted or restrictively licensed content is collected into an intermediary dataset that applies an incorrect permissive licence to the entire collection. Downstream consumers trust the intermediary’s label and unknowingly inherit legal exposure from the original restricted sources.
What should I ask an AI model vendor about data provenance before procurement?
Ask whether the model card identifies all training datasets by name, whether licence documentation exists per dataset, whether the training data has been independently audited for licence compliance, whether the vendor accepts contractual liability for licence defects, and whether they can provide an AI Bill of Materials.
What is shadow AI and why is it a licence compliance risk?
AI tools and models in active use within an organisation that have never been through formal procurement or governance review. No one has verified the licence terms, training data provenance, or attribution obligations of the models in use — so the liability accumulates invisibly.
Does the EU AI Act require AI model providers to disclose training data?
Yes. Article 53(1)(d) requires providers of GPAI models on the EU market to publish a sufficiently detailed summary of training data content. Enterprise buyers can use this as a procurement lever when vetting vendors.
Can I rely on fair use to avoid AI training data licence issues?
Fair use as a defence for AI training on copyrighted data remains unsettled in the US, with over 50 cases pending as of early 2026. Relying on an untested legal defence as your primary risk mitigation is not a defensible governance position.
What is the Common Pile and why does it matter?
A pre-training dataset created by EleutherAI with 8TB of texts verified for legal compliance at the level of individual constituent works. The Comma language model trained on it performs on par with or outperforms Llama 2 and Qwen3 — demonstrating that legally clean training is achievable.