When a package in your npm registry says MIT, you trust it. The licence file is there, the copyright notice is there — the whole compliance payload ships with the dependency. That mental model works for traditional software. AI artefacts are different. The label and the legal grant are not the same thing.
New research from Queen’s University found that 96.5% of permissively-labelled datasets and 95.8% of permissively-labelled models on Hugging Face lack the documentation required to make those labels legally enforceable. The researchers coined a term for it — permissive-washing. When that documentation is missing, the artefact reverts to all-rights-reserved status regardless of what the tag says.
This article explains what permissive-washing is, why a model card badge is not a licence, and what five checks your team can run before adopting any open AI component. It is part of a broader look at open AI supply-chain licensing risk — but this is where the explanation has to start.
What does “permissive-washing” actually mean?
Permissive-washing is what happens when an AI artefact — a model, a dataset, a fine-tuned checkpoint — gets labelled with a permissive licence (MIT, Apache-2.0, BSD-3-Clause) while leaving out the documentation that makes the label mean anything.
That documentation is the full licence text, the copyright notice, and upstream attribution notices. Jewitt, Rajbahadur, Li, Adams, and Hassan call this bundle the “compliance payload.” When it is missing, the licence grant is void.
It is rarely intentional. AI platforms make it easy to pick a licence tag from a dropdown, and that selection gets treated as equivalent to granting a licence. It is not. But the effect is the same either way: downstream users believe they have rights they do not legally hold.
When you install from npm or pip, the licence file is included by convention — its absence would be flagged. AI platforms have no equivalent norm. 53.5% of Hugging Face datasets carry MIT or Apache-2.0 labels. 96.5% lack the documentation those labels require.
Why is a licence badge on a Hugging Face model not a legal licence?
A Hugging Face model card has a license: field in its YAML header. That field is metadata — a machine-readable tag declaring intent. It is not a legal instrument. It does not grant you anything.
A legal licence grant requires three things to actually exist in the repository: the full licence text in a file, a valid copyright notice identifying a rights holder, and (for Apache-2.0) a NOTICE file preserving upstream attributions. When you select “MIT” from the Hugging Face dropdown, none of those things are created or verified. You get a populated metadata field. That is all.
Licence metadata ≠ licence grant. A tag is a claim. A licence is a legal document.
SPDX identifiers — the standard format Hugging Face uses for licence tags — were designed as cataloguing tools. They express what licence applies. They are not the licence.
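To make the distinction concrete, a model card’s YAML header looks something like the following (repository details hypothetical). The license: field holds an SPDX identifier and nothing more; no file in the repository is created or checked when it is set.

```yaml
---
license: mit            # SPDX identifier: a claim, not a grant
library_name: transformers
tags:
  - text-classification
---
```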
GitHub at least has a norm around including a LICENSE file. Hugging Face has no equivalent. Widely-used models including sentence-transformers/all-MiniLM-L6-v2 — Apache-2.0, over 3,700 likes — lack the full licence text needed to legally rely on the declared terms. That model enables 907 downstream applications. Not one can satisfy the licence’s conditions.
What did the research find — how widespread is permissive-washing?
The arXiv:2602.08816 study constructed 124,278 supply chains across three tiers — dataset to model to application. The numbers reflect the ecosystem, not a cherry-picked sample.
In the permissively-labelled subset: only 3.5% of datasets include a complete licence text file. Only 3.0% include a valid copyright notice. Full compliance — both conditions at once — is 2.3% for datasets and 3.2% for models. Fewer than one in thirty permissively-labelled artefacts is actually compliant.
Attribution propagation is worse. Only 27.59% of models preserved attribution from their training datasets, and only 5.75% of applications preserved notices from the models they incorporated.
Compare that to GitHub applications in the same study, where 91.9% include a complete licence text. The gap compounds at every tier.
What do MIT, Apache-2.0, and BSD-3-Clause actually require — and what do AI projects skip?
Developer shorthand for permissive licences is “do what you want.” That shorthand leaves out a word: “if.” Permissive licences say do what you want if you meet these conditions. The conditions are real.
MIT requires two things: include the full licence text with all copies, and include the copyright notice. Two conditions. Both routinely absent from AI artefacts.
Apache-2.0 adds more: full licence text, copyright notice, a NOTICE file preserving upstream attributions, and a statement of changes for modified files. There is also an explicit patent grant — but that grant only activates when the conditions are met. As Heather Meeker, an open source licensing attorney, puts it: “Open source licenses allow you to do anything you want with the licensed code, with conditions that mostly trigger on distribution.” Fail to meet those conditions and you lose not just the copyright permission but the patent protection too.
BSD-3-Clause requires the full licence text in both source and binary redistributions, the copyright notice in both, and a non-endorsement clause. All routinely omitted.
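For contrast, a hypothetical Apache-2.0 model repository that actually carried its compliance payload would look something like this (file names follow common convention; the repository itself is invented):

```text
my-model/
├── LICENSE            # full Apache-2.0 licence text, with copyright notice
├── NOTICE             # upstream attribution notices preserved
├── README.md          # model card; `license: apache-2.0` in the YAML header
├── config.json
└── model.safetensors
```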
When the conditions are not met, the licence grant does not activate.
What does “default copyright” mean when there is no valid licence file?
Default copyright is the legal state that applies to any creative work when no valid licence has been granted. The creator holds all rights. Nobody else may copy, modify, or distribute the work without explicit permission. Under the Berne Convention, this applies in all major jurisdictions — US, EU, UK, Australia — automatically.
When an AI artefact carries a permissive licence tag but lacks the compliance payload, the grant is void. A team using a “MIT-labelled” model without a LICENSE file has no more legal right to use it than if the model had no label at all.
Here is the scenario. Your team finds a Hugging Face model with an MIT badge, solid benchmarks, and 2,000 likes. You integrate it and ship. The repository has no LICENSE file and no copyright notice. The original creator retains full copyright. If they assert it, you have no defence. The exposure does not expire. It persists.
Default copyright is the current reality for over 95% of permissively-labelled AI artefacts right now.
Why does attribution almost never make it downstream — the 5.75% problem?
Attribution propagation is the requirement that upstream copyright notices and licence texts be preserved at each stage of the supply chain — dataset to model to application. It is how the conditions of permissive licences are supposed to travel with the work.
In traditional software, package managers track dependency trees. A licence scanner like FOSSA or Sonatype surfaces the obligations you have inherited. AI artefact platforms have no equivalent. The three-tier chain is essentially untracked.
27.59% of models preserved compliant attribution from their training datasets. 5.75% of applications preserved compliant attribution from the models they used. The platforms leave attribution tracking to the uploader. Uploaders do not track it.
For your team, this means: even if you verify the licence on the model you are adopting, you inherit unverified obligations from every dataset and component upstream. AI Bills of Materials (AIBOMs) are emerging to address this, but they are not yet standard in most teams. The structural reasons are covered in how licence risk compounds across your AI stack, which explains why attribution failure is a feature of the system, not a series of individual mistakes.
What should you check before trusting an open model licence label?
These are five-minute checks, not a legal review. They catch the overwhelming majority of permissive-washing cases and give you a clear stop signal before you commit to an artefact.
Check 1: Look for a LICENSE or LICENCE file in the repository root. Not the model card metadata field. The actual file. If it is absent, the licence grant is not established.
Check 2: Verify there is a copyright notice. “Copyright (c) [Year] [Name]” identifying a real rights holder. MIT, Apache-2.0, and BSD-3-Clause all require this. It is commonly absent even when a licence file exists.
Check 3: For Apache-2.0 models, check for a NOTICE file. This is the mechanism by which upstream attributions are preserved. If it is missing, you cannot satisfy the attribution requirement.
Check 4: Review the model card’s training data section. If it lists datasets, spot-check those datasets using Checks 1 and 2. The 5.75% propagation figure means your upstream almost certainly has problems you cannot see from the model level alone.
Check 5: Treat any failure as a stop signal. An artefact that fails any of these checks has no legally valid licence, regardless of what the badge says.
The Jewitt et al. audit found only 3.2% of permissively-labelled models satisfy Checks 1 and 2 simultaneously. Apply the checklist to every model you evaluate — it is faster than assuming and being wrong.
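The file-level checks above are easy to script. The following is a minimal sketch, not the study’s tooling: it runs Checks 1 to 3 against a locally downloaded copy of a repository (for example, one fetched with huggingface_hub’s snapshot_download). The accepted licence file names and the copyright-notice pattern are assumptions about common conventions, not a legal test.

```python
import re
from pathlib import Path

# Common names for the licence text file (Check 1); an assumption, not a standard
LICENSE_NAMES = {"LICENSE", "LICENCE", "LICENSE.txt", "LICENSE.md", "LICENCE.txt"}

# Rough shape of a dated notice, e.g. "Copyright (c) 2024 Jane Doe" (Check 2)
COPYRIGHT_RE = re.compile(r"copyright\s+(\(c\)|©)?\s*\d{4}", re.IGNORECASE)


def check_compliance(repo_dir: str) -> dict:
    """Run Checks 1-3 on a local copy of a model or dataset repository."""
    root = Path(repo_dir)
    license_files = [
        p for p in root.iterdir() if p.is_file() and p.name in LICENSE_NAMES
    ]

    has_license_text = bool(license_files)  # Check 1: full licence text present
    has_copyright = any(                    # Check 2: dated copyright notice
        COPYRIGHT_RE.search(p.read_text(errors="ignore")) for p in license_files
    )
    has_notice = (root / "NOTICE").is_file()  # Check 3: Apache-2.0 NOTICE file

    return {
        "license_text": has_license_text,
        "copyright_notice": has_copyright,
        "notice_file": has_notice,
        # Check 5: missing text or notice is a stop signal
        "stop": not (has_license_text and has_copyright),
    }
```

Check 4 still requires reading the model card and repeating the same function against each listed training dataset; no script can see upstream obligations the uploader never recorded.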
The EU Cyber Resilience Act already requires SBOMs for digital products, and SCA tooling can automate parts of this. Until you have that in place, manual checks are your first line of defence. This five-check process is one entry point into the full landscape of AI supply-chain licensing risk — covering everything from foundational definitions to regulatory obligations.
For the next step in understanding how this risk propagates, the companion piece on how AI licence risk compounds across your dataset-model-application stack maps what happens when a non-compliant artefact enters a production pipeline. If you are already evaluating specific models for commercial use, the comparison of Llama, Mistral, DeepSeek and Qwen licence terms for commercial use applies these checks to the four dominant open model families. The full series starts at the open AI supply-chain licensing risk overview.
Frequently asked questions
What is the arXiv permissive-washing paper and where can I read it?
arXiv:2602.08816 by Jewitt, Rajbahadur, Li, Adams, and Hassan (Queen’s University, February 2026). The first large-scale empirical study of licence compliance across the AI supply chain — 6,664 models, 3,338 datasets, 28,516 applications.
Does an MIT licence on a Hugging Face model mean I can use it freely in my product?
Not unless the repository includes the full MIT licence text and a valid copyright notice. Without both, the MIT label is metadata only — not a legal grant. The artefact is under default copyright, and you have no legally defensible right to use, modify, or distribute it.
Can my company get sued for using an AI model without a proper licence file?
Yes. If the model lacks valid licence documentation, it is under default copyright. Using it commercially without the rights holder’s explicit permission is copyright infringement. The original creator retains all rights and can enforce them at any time.
What is the difference between a permissive licence and a custom AI licence like Llama’s?
Permissive licences (MIT, Apache-2.0, BSD-3-Clause) are standardised instruments with specific, well-understood conditions. Custom AI licences — Meta’s Llama licence, BigScience RAIL — are bespoke and vary by model. Meta’s Llama 2 licence blocks use by companies with over 700 million monthly active users. Custom licences require case-by-case review; permissive licences require compliance payload verification.
How does licence compliance for AI models compare to compliance for traditional open-source software?
Traditional software has mature tooling: package managers track dependencies, licence scanners (FOSSA, Sonatype) automate compliance, community convention demands a LICENSE file. AI artefacts have none of that. No enforcement, opaque dependency chains, no standard tooling tracing obligations from dataset to application.
What does “open source AI” actually mean from a legal standpoint?
There is no settled definition. The OSI’s Open Source AI Definition requires model weights and detailed information about the training data, but release of the training datasets themselves remains optional in current drafts. In practice, “open source AI” often means “publicly downloadable”, not “OSI-approved with full compliance.”
What is an AI Bill of Materials (AIBOM) and do I need one?
An AIBOM is a structured inventory of all AI models, datasets, and dependencies in your system, including their licence status and provenance. It extends the SBOM concept to AI artefacts. The EU Cyber Resilience Act already requires SBOMs for digital products in EU markets, and the EU AI Act is moving toward requiring transparency documentation that an AIBOM would directly support.
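As an illustration of the shape such an inventory might take, here is a minimal CycloneDX-style fragment (component names and versions are hypothetical; CycloneDX 1.5 added machine-learning-model and data component types):

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {
      "type": "machine-learning-model",
      "name": "example-org/sentiment-model",
      "version": "1.0",
      "licenses": [{ "license": { "id": "Apache-2.0" } }]
    },
    {
      "type": "data",
      "name": "example-org/training-corpus",
      "licenses": [{ "license": { "id": "MIT" } }]
    }
  ]
}
```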
What happens if I discover a permissive-washing problem in a model I have already deployed?
Three options: contact the original creator and request proper licence documentation, replace the model with a compliant alternative, or negotiate a separate commercial licence. The longer the model stays in production without a valid licence, the greater the exposure.
Are there AI models with genuinely clean licensing that I can use without legal review?
Some exist, but they are the minority — only 3.2% of permissively-labelled models in the Jewitt et al. audit were fully compliant. Apply the verification checklist in this article to any model you evaluate. The checks take minutes; the risk of skipping them is real.
Why do AI model creators skip licence documentation if the requirements are straightforward?
Most creators are researchers or developers, not lawyers. Hugging Face makes uploading straightforward and licence tag selection optional — nothing enforces a LICENSE file. The community norm became “tag it and ship it,” mirroring early GitHub culture before licence scanners became standard.
How does the EU AI Act affect AI licence compliance requirements?
It requires GPAI model providers to publish a summary of training data (Article 53(1d)) and implement a policy to comply with EU copyright law (Article 53(1c)). It does not directly mandate licence compliance, but its provenance documentation requirements make unverified AI licensing a compliance gap for EU market deployments.