Business | SaaS | Technology
Mar 2, 2026

Adding AI Licence Compliance to Your Existing Engineering Workflow

AUTHOR

James A. Wondrasek

Your CI/CD pipeline already handles Software Composition Analysis (SCA) for open-source dependencies. It scans packages, evaluates licences, flags violations, and blocks bad merges before they reach production. That capability took years to build, and it works.

The problem is that AI artefacts — models, datasets, AI-generated code — now sit entirely outside those existing gates. When your team pulls a model from Hugging Face, adds a fine-tuned weights file to a container, or uses GitHub Copilot to write a function, none of that activity passes through the licence compliance checks you already have in place.

This matters because open AI supply-chain licensing risk is not theoretical. Permissive-washing — where a model carries an Apache 2.0 or MIT label in its repository metadata but the actual permissions granted fall short — means you cannot trust repository metadata alone. A 2026 empirical study of 760,460 Hugging Face models found that 52% of LLM supply chains exhibit at least one licence conflict.

This article walks you through extending what you already have: a four-step pre-integration model review, snippet scanning for AI-generated code, AI-BOM generation in your CI/CD pipeline, and SBOM lifecycle management. We cover both the enterprise tooling path (FOSSA, Sonatype) and the open-source path (GUAC, BomCTL, Protobom).

Why Is Your Existing SCA Workflow Already 80% of the Way to AI Licence Compliance?

SCA already does what you need — for traditional code. It identifies open-source components, evaluates licences, flags violations, and integrates into your CI/CD pipeline. The principles are identical for AI artefacts. The gap is not philosophical; it is a coverage gap.

Standard SCA tools scan declared dependencies in package manifests. They do not scan model weights in a container, dataset licence files in a model card, or inline code that an AI coding assistant generated from its training data. There are three specific gaps to understand:

Model licences: RAIL, OpenRAIL, Llama Community Licence, and custom addenda to Apache 2.0 are not in the licence databases traditional SCA tools use.

Training data provenance: An AI model is trained on datasets that may carry copyrighted material or conflicting licences. The model inherits those obligations, but manifest-based scanning cannot surface them.

AI-generated code snippets: When a developer uses GitHub Copilot or Cursor, the resulting code has no package manifest entry. If Copilot reproduced a GPL-licensed snippet, your pipeline will never know — until someone else’s counsel does.

The 2026 OSSRA report tells you how widespread this is: 97% of organisations use open-source AI models in development, yet only 54% evaluate AI-generated code for IP and licensing risks. That 46% gap accumulates legal exposure that surfaces at the worst moments — M&A due diligence, a regulatory audit, a copyright claim. For a complete overview of the full landscape of AI licensing risk — including the commercial, legal, and regulatory dimensions beyond what SCA tools address — the pillar article covers each domain in full.

One prerequisite before you start: address shadow AI — undeclared model usage already deployed in production. You cannot govern what you have not discovered.
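A first-pass discovery audit can be scripted without any vendor tooling. The sketch below walks a repository or image filesystem looking for model-weight files and AI-framework entries in `requirements.txt`. The extension list and framework names are illustrative assumptions, not an exhaustive catalogue; extend both sets for your own stack.

```python
import re
from pathlib import Path

# Illustrative, non-exhaustive lists -- extend for your environment.
WEIGHT_EXTENSIONS = {".safetensors", ".gguf", ".onnx", ".pt", ".pth", ".ckpt"}
AI_FRAMEWORKS = {"torch", "tensorflow", "transformers", "langchain"}

def find_shadow_ai(root: str) -> dict:
    """Return model-weight files and AI-framework dependencies found under root."""
    findings = {"weights": [], "framework_imports": []}
    for path in Path(root).rglob("*"):
        if path.suffix.lower() in WEIGHT_EXTENSIONS:
            findings["weights"].append(str(path))
        elif path.name == "requirements.txt":
            for line in path.read_text().splitlines():
                # Take the package name before any version specifier or extras.
                pkg = re.split(r"[=<>\[;\s]", line.strip(), maxsplit=1)[0].lower()
                if pkg in AI_FRAMEWORKS:
                    findings["framework_imports"].append(f"{path}: {pkg}")
    return findings
```

Network egress checks (calls to external model APIs) sit outside filesystem scanning and need log analysis on top of this.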

What Are the Four Steps in a Pre-Integration Review for Any AI Model?

Before automated tooling runs, a manual four-step review catches licence risks that machines cannot assess. This gate runs before engineering time is invested — 15 to 30 minutes per model, preventing weeks of remediation downstream.

Step 1: Model Card Review

Model cards document intended uses, limitations, training data, and licence declaration. In practice, model card quality remains low — ambiguity and incompleteness are the norm. Look for: the intended use statement (does it cover your use case?), out-of-scope uses (is your application listed?), training data description (are sources named and documented?), and the licence declaration.

Step 2: Acceptable Use Policy Review

This is where permissive-washing surfaces most commonly. A model’s repository metadata may show MIT or Apache 2.0, but the Acceptable Use Policy (AUP) — a separate document — may prohibit commercial distribution, restrict specific industries, or cap maximum active users. A model can carry an Apache 2.0 label while the AUP makes it unusable for commercial production. The label tells you nothing about the AUP. Read it separately. Every time.

Step 3: Licence File Verification

Verify the actual licence file. A model labelled Apache 2.0 must have an Apache 2.0 licence file with no modifications — company-released models frequently add custom addenda that negate the permissive grant. Do not rely on the metadata tag.

Step 4: Training Data Provenance Check

Assess whether the training datasets have documented origins and licences that support commercial use downstream. Look for named datasets, licence terms for each, and whether those terms permit commercial derivative works. A model trained on datasets that prohibit commercial use passes that restriction downstream to you. Where training data is entirely undisclosed, that is a risk signal that belongs in your integration decision.
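The four steps above can be recorded as a structured gate so the review outcome is auditable rather than verbal. This is a minimal sketch; the field names are our own illustration, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class ModelReview:
    """Outcome of the four-step pre-integration review (illustrative schema)."""
    intended_use_covers_ours: bool          # Step 1: model card review
    aup_permits_commercial_use: bool        # Step 2: AUP read separately
    licence_file_matches_label: bool        # Step 3: licence file verification
    training_data_supports_commercial_use: bool  # Step 4: provenance check
    notes: list = field(default_factory=list)

    def blockers(self) -> list:
        checks = {
            "model card: intended use does not cover ours": self.intended_use_covers_ours,
            "AUP: commercial use restricted": self.aup_permits_commercial_use,
            "licence file does not match metadata label": self.licence_file_matches_label,
            "training data: provenance does not support commercial use": self.training_data_supports_commercial_use,
        }
        return [name for name, ok in checks.items() if not ok]

    def approved(self) -> bool:
        return not self.blockers()
```

A review object like this belongs in the documentation trail the engineering standard (below) requires for each model.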

How Do You Extend SCA Tooling to Cover AI Artefacts Using FOSSA, Sonatype, or GUAC?

Both implementation paths produce the same outcome: visibility into AI artefact licences and automated policy enforcement in your existing CI/CD pipeline.

Enterprise Path: FOSSA and Sonatype

FOSSA supports AI model licence scanning as an extension of its existing SCA platform. Declare AI model artefacts alongside traditional package dependencies, set licence policies to include AI-specific licence types (RAIL, OpenRAIL, Llama Community Licence), and configure policy-driven model approval gates using FOSSA’s three-tier classification (approve / flag / deny). A January 2026 partnership with SCANOSS means snippet detection now operates within the FOSSA policy gate — AI-generated code goes through the same policy enforcement as any other dependency.

Sonatype provides shadow AI detection — scanning container registries and dependency manifests to identify undeclared model usage deployed without approval. Worth running before you do anything else.

Open-Source Path: GUAC, BomCTL, and Protobom

GUAC (Graph for Understanding Artifact Composition) ingests SBOM, SLSA, and vulnerability data into a queryable graph database, providing supply chain tracing and dependency visibility for AI artefacts at no licence cost. BomCTL and Protobom are OpenSSF command-line utilities built for CI/CD integration — BomCTL handles generation, validation, and transformation; Protobom handles format-agnostic conversion between SPDX, CycloneDX, and emerging schemas.

What Is Snippet Scanning and When Does Your Team Need It for AI-Generated Code?

If your team uses GitHub Copilot or Cursor, snippet scanning is not optional. It is required to maintain the validity of your existing licence compliance programme.

Here is the legal risk: if Copilot generates code that reproduces a GPL-licensed function from an open-source project in its training data, your codebase contains copyleft-licensed code with no licence notice and no compliance activity. The 2026 OSSRA report identifies “licence laundering” — AI assistants generating snippets derived from copyleft sources without retaining original licence information — as a key driver of year-over-year increases in licensing conflicts.

Heather Meeker, open-source licence expert, puts it plainly: “Snippet scanning is becoming essential because of the need to identify small, matching fragments of code that might originate from open source projects. The defensible path is to choose paid tools, enable guardrails, use snippet scanning, and apply your existing licence policies to AI outputs.”

The integration point is the pull-request workflow: snippet scanning triggers on every PR and blocks merge if copyleft-licensed snippets are detected without appropriate licence handling.
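The blocking decision itself is simple to express. This sketch assumes a hypothetical scanner result shape (a list of dicts with `licence` and `notice_present` keys); real scanners emit their own formats, and the copyleft set here is illustrative, not complete.

```python
# Illustrative copyleft identifiers -- extend per your licence policy.
COPYLEFT = {"GPL-2.0", "GPL-3.0", "AGPL-3.0", "LGPL-3.0"}

def gate_pull_request(detected_snippets: list) -> tuple:
    """Return (allow_merge, violations) for snippet-scan results.

    Each snippet is a dict like
    {"file": "src/util.py", "licence": "GPL-3.0", "notice_present": False}
    -- a hypothetical shape, not any specific scanner's output format.
    """
    violations = [
        s for s in detected_snippets
        if s["licence"] in COPYLEFT and not s.get("notice_present")
    ]
    return (len(violations) == 0, violations)
```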

When scanning detects a violation, you have four remediation paths:

  1. Remove the snippet: Delete the flagged code and write a replacement independently.
  2. Rewrite it: Implement the same logic from scratch using a clean-room approach.
  3. Relicense via dual-licensing: If the original project offers a commercial licence, obtain it.
  4. Replace it: Source the functionality from a dependency with a compatible licence.

How Do You Generate AI-BOMs in Your CI/CD Pipeline?

An AI Bill of Materials (AI-BOM) is the governance artefact this compliance workflow produces — a structured, machine-readable inventory of all AI artefacts with their provenance, licence terms, and version metadata. Unlike a traditional SBOM, it captures model identity and version, training data sources and their licences, fine-tuning parameters, framework dependencies (PyTorch, TensorFlow, LangChain), and the relationships between models, data, services, and infrastructure. For a deeper treatment, see What an AI Bill of Materials Is and What to Demand from Vendors.

Two machine-readable standards are available. SPDX 3.0 has ISO backing (ISO/IEC 5962) and dedicated AI and Data profiles — that carries significant weight in procurement and regulatory audits. CycloneDX ML-BOM (version 1.7) has broader OWASP ecosystem support, standardised as ECMA-424. OpenSSF’s Protobom provides lossless conversion between both.

For open-source AI-BOM generation in CI/CD: BomCTL handles generation and validation; OWASP AIBOM Generator supports Hugging Face-hosted models and produces CycloneDX output. The CI/CD integration pattern is straightforward — generate the AI-BOM for any AI artefact in the build, validate it against the appropriate schema, and store the signed artefact. Schema validation is the gate. If the AI-BOM fails validation, the build fails. GUAC then ingests the generated AI-BOMs alongside traditional SBOMs into a unified dependency graph.
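The generate-then-validate gate can be sketched in a few lines. This builds a minimal CycloneDX-style ML-BOM dict and checks a handful of required fields; a real pipeline would validate against the published CycloneDX JSON schema rather than this hand-rolled subset, and the field selection here is our assumption.

```python
REQUIRED_TOP = {"bomFormat", "specVersion", "components"}
REQUIRED_COMPONENT = {"type", "name", "version", "licenses"}

def build_ai_bom(model_name: str, version: str, licence: str, datasets: list) -> dict:
    """Emit a minimal CycloneDX-style ML-BOM dict (illustrative subset only)."""
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.7",
        "components": [
            {"type": "machine-learning-model", "name": model_name,
             "version": version, "licenses": [{"license": {"id": licence}}]},
            *[{"type": "data", "name": d["name"], "version": d.get("version", "n/a"),
               "licenses": [{"license": {"id": d["licence"]}}]} for d in datasets],
        ],
    }

def validate_ai_bom(bom: dict) -> None:
    """The gate: raise (fail the build) if required fields are missing."""
    missing = REQUIRED_TOP - bom.keys()
    if missing:
        raise ValueError(f"AI-BOM missing top-level fields: {missing}")
    for c in bom["components"]:
        gaps = REQUIRED_COMPONENT - c.keys()
        if gaps:
            raise ValueError(f"component {c.get('name')} missing: {gaps}")
```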

How Do You Keep AI-BOMs Current as Models and Dependencies Change?

An AI-BOM generated once at integration time has a short useful life. Models update, datasets change, CVEs are disclosed. This is where many teams stop short of a defensible compliance posture.

Five events require AI-BOM regeneration outside the normal build cycle: upstream model version bump, dataset update announcement, CVE disclosure affecting a model dependency, regulatory requirement change, and a fine-tuning or retraining event.

Tamper-evident AI-BOMs require cryptographic signing — ECDSA or Ed25519 produces a verifiable artefact that cannot be altered without invalidating the signature. SBOMit (OpenSSF) manages the end-to-end lifecycle. Dependency-Track (OWASP) provides continuous SBOM analysis with real-time findings.
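The tamper-evidence property is worth seeing concretely. As a dependency-free stand-in, this sketch tags the serialised AI-BOM with stdlib HMAC-SHA256; it is not the Ed25519/ECDSA signing described above, which production pipelines should use so that verifiers never need the signing secret.

```python
import hashlib
import hmac

def sign_bom(bom_bytes: bytes, key: bytes) -> str:
    """Produce a detached tag over the serialised AI-BOM.

    Stand-in only: stdlib HMAC-SHA256 keeps the sketch dependency-free.
    Real deployments should use asymmetric signatures (Ed25519 or ECDSA).
    """
    return hmac.new(key, bom_bytes, hashlib.sha256).hexdigest()

def verify_bom(bom_bytes: bytes, key: bytes, tag: str) -> bool:
    """Any byte-level change to the AI-BOM invalidates the tag."""
    return hmac.compare_digest(sign_bom(bom_bytes, key), tag)
```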

For any product with digital elements sold in the EU, the Cyber Resilience Act mandates that AI-BOMs be kept up to date and retained for at least ten years. For a full treatment of EU AI Act and Cyber Resilience Act supply chain obligations, there is a dedicated article that covers the detail.

When Does Fine-Tuning an Open Model Make Your Team a Provider Under the EU AI Act?

This is where an engineering decision directly determines regulatory status. Worth understanding before you start.

The EU AI Act's GPAI provisions include a one-third compute threshold: fine-tune a General-Purpose AI model using more than one-third of the original training compute, and your organisation is reclassified as a GPAI provider under Article 53, triggering transparency, documentation, and training data copyright summary obligations. Compare your planned fine-tuning FLOPs to the original model's published training FLOPs. Where training compute is undisclosed, use the fallback threshold: one-third of 10^23 FLOPs for standard GPAI models.

Domain adaptation fine-tuning — continued training on large domain-specific datasets — is the scenario most likely to cross the threshold. Calculate your fine-tuning FLOPs before assuming you are below the line.

LoRA (Low-Rank Adaptation) uses far less compute — orders of magnitude less in typical deployments. It is the safer approach for teams wanting model customisation without triggering provider obligations. Document the FLOPs calculation regardless — it belongs in the AI-BOM and serves as your defence against future regulatory inquiry.

How Do You Codify This Workflow into an Engineering Standard That Survives Team Turnover?

A compliance workflow that exists only in the heads of the people who built it does not survive a team reorganisation. Document it as an internal engineering standard — a version-controlled policy that survives team turnover, vendor changes, and audit scrutiny.

The standard should cover: the four-step pre-integration review procedure (who performs it, decision criteria, documentation produced); approved and denied licence classifications including AI-specific types (RAIL, OpenRAIL, Llama Community Licence); the snippet scanning policy (PR-level blocking, remediation workflow); AI-BOM generation and lifecycle requirements (pipeline step, output format, signing process, five update triggers); the fine-tuning threshold assessment procedure; and a quarterly review cadence.

Also specify what to request from upstream AI model providers contractually: training data source disclosure, update notification obligations, indemnification terms, and data retention commitments aligned with your CRA obligations. The Cyber Resilience Act already requires this kind of contractual arrangement with software suppliers — extend it to AI model providers.

Frequently Asked Questions

What is permissive-washing in AI model licensing?

Permissive-washing is when an AI model carries a permissive open-source licence label (Apache 2.0, MIT) in its repository metadata, but the underlying legal rights grant is insufficient for commercial downstream use. ML-specific licences often impose additional restrictions — limiting commercial use, prohibiting use of model output to train competing models — that fall well outside what the label implies.

Can standard SCA tools detect licence risks in AI-generated code?

No. Standard SCA tools scan declared dependencies in package manifests — they cannot detect code fragments that AI coding assistants like GitHub Copilot or Cursor generate by reproducing open-source training data. Snippet scanning is the separate capability you need.

What is the difference between an SBOM and an AI-BOM?

An SBOM inventories traditional software components. An AI-BOM extends this to cover AI-specific artefacts: models, datasets, training parameters, fine-tuning configurations, and their provenance and licence terms. Both use SPDX or CycloneDX formats, but AI-BOMs require additional fields for model and data lineage.

Which is better for AI-BOM generation: SPDX 3.0 or CycloneDX ML-BOM?

Both are valid. SPDX 3.0 has ISO backing (ISO/IEC 5962) and dedicated AI and Data profiles. CycloneDX ML-BOM (v1.7) has broader OWASP ecosystem support, standardised as ECMA-424. Choose based on your existing tooling. OpenSSF’s Protobom enables lossless conversion between both, so you are not locked in.

How much does snippet scanning slow down a CI/CD pipeline?

FOSSA Snippet Scanning completes most scans in under five minutes — comparable to existing linting or static analysis steps.

What happens if we fine-tune a model and the upstream provider updates it?

It triggers an SBOM lifecycle update event. Regenerate your AI-BOM, re-evaluate your fine-tuning compute ratio, and verify that the licence terms have not changed.

Do LoRA adapters count toward the EU AI Act one-third compute threshold?

LoRA uses significantly less compute — typically orders of magnitude less — and in most practical scenarios will not cross the one-third threshold. Still, document your FLOPs calculation and retain it for audit purposes. The threshold applies to cumulative compute across all fine-tuning rounds.

What is GUAC and why is it recommended for budget-constrained teams?

GUAC (Graph for Understanding Artifact Composition) is an OpenSSF open-source tool that ingests SBOM, SLSA, and vulnerability data into a queryable graph database, providing supply chain tracing and dependency visibility for AI artefacts at no cost.

How do I detect shadow AI — unsanctioned model usage already in production?

Scan container registries for model weight files, review dependency manifests for AI framework imports (PyTorch, TensorFlow, LangChain), and check network egress logs for calls to external model APIs. Sonatype offers automated shadow AI detection as a platform feature. Run this audit before implementing the compliance workflow — you cannot govern what you have not discovered.

What contractual clauses should we request from upstream AI model providers?

Request: training data source disclosure, update notification obligations, indemnification terms for downstream licence claims, and data retention commitments aligned with your CRA obligations.

Is the four-step pre-integration review necessary if we use automated SCA tools?

Yes. Automated SCA tools operate after integration. The pre-integration review runs before engineering time is invested and catches risks automated tools cannot assess: AUP restrictions that live outside machine-readable licence files, and training data provenance gaps that no automated tooling can reliably surface.

How often should we regenerate our AI-BOMs?

Regenerate on every build that changes an AI component. Five event-based triggers also require out-of-cycle regeneration: upstream model version bump, dataset update announcement, CVE disclosure affecting a model dependency, regulatory requirement change, and fine-tuning or retraining event. Retain signed copies for a minimum of ten years under CRA Article 13 requirements.

The Workflow as an Engineering Dependency

Legal ambiguity in AI licences is now an engineering dependency risk. When your team selects a model, the licence terms it carries propagate into your product, your supply chain obligations, and potentially your regulatory status. Repository metadata alone cannot be trusted to surface that risk — permissive-washing ensures it will not.

The workflow in this article — pre-integration review, snippet scanning, AI-BOM generation, and lifecycle management — extends what your team already knows. The principles are the same as SCA. The tooling integrates with your existing pipeline. The engineering standard codifies the process so the knowledge does not walk out the door when people do.

For the broader context on the open AI supply-chain licensing risk this workflow addresses — including the regulatory and commercial landscape driving these obligations — start with the pillar article. The workflow described here is the operational implementation of the governance framework it establishes.
