AI coding assistants are shipping code faster than most engineering teams have updated their security processes. A December 2025 analysis of 470 real-world pull requests found that AI-co-authored code introduces cross-site scripting vulnerabilities at 2.74 times the rate of human-written code. The code is still being merged. The review processes have not changed to match.
This article explains the methodology behind the 2.74x figure, translates it into review hour estimates and SAST configuration changes, and gives you a practical framework for adjusting sprint security capacity when AI is writing part of your codebase. For the broader picture, start with the vibe coding security landscape. This article is about the numbers and the process changes that follow from them.
Where Does the 2.74x Number Come From and How Was It Measured?
The 2.74x figure comes from CodeRabbit’s December 2025 study of 470 open-source GitHub pull requests (320 AI-co-authored and 150 human-only) drawn from real production repositories, not synthetic benchmarks.
Vulnerability density is the count of security flaws per pull request. That is the right unit because it mirrors how review workload is actually allocated — a reviewer is assigned to a PR, not to a line count.
CodeRabbit categorised each PR using commit metadata, co-author annotations, and tool-specific signatures. “AI-co-authored” means a developer used an AI coding assistant and committed the output after review — not unreviewed, machine-only submissions. The 2.74x finding applies to code that passed through developer review.
Here is where teams get tripped up: the 2.74x multiplier is XSS-specific. Cross-site scripting (CWE-80) appeared in AI-co-authored PRs at 2.74 times the rate of human-authored PRs. The same study found 1.7x more issues overall. These two figures are frequently conflated — they are not the same thing. 2.74x is the XSS-class density; 1.7x is the blended rate across all categories. Both matter, but they carry different implications for how you configure your SAST tooling.
One caveat: CodeRabbit is both the researcher and a code review vendor, so it has a commercial interest in the result. The sample was public open-source code, however, and independent research points in the same direction.
Veracode’s 2025 GenAI Code Security Report tested more than 100 LLMs across 4 million-plus code scans. Forty-five per cent of AI-generated code failed OWASP Top 10 tests. In the Spring 2026 update, XSS (CWE-80) had a 15% pass rate, the persistent worst performer. Larger models produce no better outcomes than smaller ones. The density issue is structural, not a model version problem.
For the broader research context behind the 2.74x finding — including the 91.5% vulnerability rate and the arXiv academic study — see the research synthesis article.
What Does 2.74x Actually Mean for Your Code Review Process?
The multiplier translates directly into review hours, and the maths is uncomfortable.
At 8 minutes per SAST finding (triage, reproduce, confirm, classify, document), a human-authored PR generating 4 XSS findings takes about 32 minutes. An equivalent AI-co-authored PR generating 11 XSS findings (4 × 2.74, rounded) takes about 88 minutes. That is roughly 2.74x the review time per PR. Reviewer capacity has to grow proportionally or findings start to pile up faster than you can clear them.
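The arithmetic above can be sketched in a few lines of Python. The 8-minute figure and the finding counts are taken from the worked example; the function name is illustrative, and the per-finding cost should be replaced with your own measured triage time.

```python
# Sketch of the per-PR triage estimate from the worked example.
MINUTES_PER_FINDING = 8  # triage, reproduce, confirm, classify, document


def review_minutes(findings: int, minutes_per_finding: int = MINUTES_PER_FINDING) -> int:
    """Return the estimated triage time for one PR, in minutes."""
    if findings < 0:
        raise ValueError("findings must be non-negative")
    return findings * minutes_per_finding


human_pr = review_minutes(4)   # human-authored PR with 4 XSS findings
ai_pr = review_minutes(11)     # AI-co-authored PR with 11 XSS findings
ratio = ai_pr / human_pr       # 11 / 4, because 4 * 2.74 rounds up to 11

print(f"human: {human_pr} min, AI-co-authored: {ai_pr} min, ratio: {ratio:.2f}x")
```

The ratio comes out slightly above 2.74 only because findings are whole numbers; across a queue of many PRs the rounding averages out.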
The compound effect is worse when both CodeRabbit findings land at once. Reviewers are dealing with 1.7x more total findings, with 2.74x more in XSS specifically. A queue of AI-co-authored PRs carries a skewed distribution that puts the heaviest pressure on exactly the most security-sensitive categories.
Faros.ai found that on teams with high AI coding adoption, engineers merge 98% more pull requests — but PR review time increases 91%. Writing code is no longer the bottleneck. Reviewing it is. By June 2025, AI-generated code was adding more than 10,000 new security findings per month across studied repositories — a 10x jump from December 2024. That accumulates fast when your review process has not been built to handle the volume.
The Lovable 48-day BOLA (broken object level authorization) exposure is production evidence that this is not theoretical. It is what you get when review processes have not kept pace.
What SAST and DAST Configuration Changes Does the 2.74x Finding Require?
Two quick definitions, because these tools are more often encountered than configured at scale.
SAST (Static Application Security Testing) analyses source code without executing it — it catches vulnerability patterns at commit time. DAST (Dynamic Application Security Testing) tests the running application to find exploitable paths against a live build. Both need reconfiguration when the expected finding rate has increased 2.74x for XSS and 1.7x overall. The tooling is not wrong; it is calibrated for human-authored code density.
For SAST, the primary change is rule set prioritisation, not tool replacement. Semgrep is the recommended first-line tool — open-source, supports 30-plus languages via YAML-based custom rules, integrates into CI/CD pipelines, and runs targeted rule packs as mandatory merge gates.
At 2.74x XSS density, configure these Semgrep rule packs as hard-fail gate conditions — PRs that trigger findings cannot merge without security team sign-off:
- XSS and DOM-XSS (CWE-80): the highest-density category from the CodeRabbit study
- SQL injection (CWE-89): 18% of AI-generated code still fails this at scale
- Command injection and path traversal: AI models frequently build commands and file paths by concatenating untrusted strings instead of using safe APIs
- Hardcoded credentials: AI models sometimes embed credentials in generated code
At higher alert volume, the risk is over-suppression: teams tuning away entire vulnerability classes to manage alert fatigue simultaneously tune away real XSS findings. Suppress only specific rule IDs with confirmed false-positive patterns.
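One way to picture the gate and the narrow-suppression rule together is a small script that reads a SAST report and decides whether a merge should be blocked. The sketch below is in Python and makes several assumptions: the field names (`results`, `check_id`, `extra.metadata.cwe`) follow the general shape of Semgrep's JSON output and should be checked against the version you run, the rule IDs are hypothetical, and the CWE set is one possible mapping of the four categories listed above (many rule packs tag XSS as CWE-79 rather than CWE-80, so both are included).

```python
import json
from typing import Any

# Assumed mapping of the gate categories to CWE identifiers.
BLOCKING_CWES = {"CWE-79", "CWE-80", "CWE-89", "CWE-78", "CWE-22", "CWE-798"}

# Suppressions are listed by exact rule ID only. Hypothetical example ID.
SUPPRESSED_RULE_IDS = {"example.rules.known-false-positive"}


def cwe_ids(result: dict[str, Any]) -> set[str]:
    """Extract bare CWE identifiers such as 'CWE-89' from one finding."""
    raw = result.get("extra", {}).get("metadata", {}).get("cwe", [])
    if isinstance(raw, str):
        raw = [raw]
    # Entries are assumed to look like "CWE-89: description".
    return {entry.split(":", 1)[0].strip() for entry in raw}


def blocking_findings(report: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the findings that should block the merge."""
    blocked = []
    for result in report.get("results", []):
        if result.get("check_id") in SUPPRESSED_RULE_IDS:
            continue  # suppressed by rule ID, never by whole CWE class
        if cwe_ids(result) & BLOCKING_CWES:
            blocked.append(result)
    return blocked


# Made-up report used purely for illustration.
SAMPLE_REPORT = json.loads("""
{
  "results": [
    {"check_id": "example.rules.dom-xss",
     "path": "web/render.js",
     "extra": {"metadata": {"cwe": ["CWE-79: Cross-site Scripting"]}}},
    {"check_id": "example.rules.known-false-positive",
     "path": "db/query.py",
     "extra": {"metadata": {"cwe": ["CWE-89: SQL Injection"]}}},
    {"check_id": "example.rules.weak-hash",
     "path": "util/hash.py",
     "extra": {"metadata": {"cwe": "CWE-327: Broken Crypto"}}}
  ]
}
""")

blocked = blocking_findings(SAMPLE_REPORT)
for finding in blocked:
    print(f"BLOCK {finding['check_id']} in {finding['path']}")
print("merge allowed" if not blocked else "merge blocked pending security sign-off")
```

In a CI job the report would be loaded from the scanner's output file and the script would exit non-zero when `blocked` is non-empty. Keeping the suppression list keyed on rule IDs means a new XSS rule can never be silenced by accident.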
For DAST, StackHawk provides CI/CD-integrated testing against REST, GraphQL, SOAP, and gRPC APIs. At 1.7x overall issue density, runtime-dependent vulnerabilities — authentication bypasses, session handling flaws, broken access control — become more frequent. Expand DAST coverage to every PR touching AI-generated modules.
Java teams should pay particular attention here: Java has a 71% failure rate in Veracode’s Spring 2026 data — the worst of any language. Weight CWE-89 (SQL injection) alongside CWE-80 (XSS) as dual mandatory gate conditions.
Does It Matter Whether Your Team Uses Cursor, GitHub Copilot, or Lovable?
Yes, tool type matters. There is a measurable risk spectrum, and the vulnerability density differs across it.
The 2.74x finding was produced from AI-co-authored PRs where a developer still reviewed and committed the output. That makes 2.74x the measured density for the AI-assisted tier, where tools such as Cursor and GitHub Copilot sit. It is not zero risk, as some teams assume: the multiplier describes code that developers had already reviewed and approved.
Cursor and GitHub Copilot both retain the developer review layer. The developer sees the generated code, can reject or modify it, and commits only what they accept. Both operate at the 2.74x XSS density baseline — the lower end of the risk spectrum, not the safe end.
Full vibe coding platforms such as Lovable and Bolt.new remove the mandatory developer code review layer before deployment. The CodeRabbit study did not measure this tier, but the density is plausibly higher than 2.74x, because the review layer that catches security issues before commit is absent.
Escape’s scan of 5,600 publicly deployed vibe-coded applications found more than 2,000 high-impact vulnerabilities, 400 exposed secrets (API keys, database credentials, tokens), and 175 instances of PII including medical records and payment data. Bolt.new ships with row-level security off by default.
So here is the practical policy implication. AI-assisted coding via Cursor or Copilot is manageable with the SAST/DAST reconfiguration and sprint capacity adjustments in this article. Full vibe coding platforms require stricter controls or outright restriction for customer-facing or data-handling applications — the review layer that contains the density effect is absent, and nothing substitutes for it.
For context on how the full cluster of evidence frames the risk, see the pillar article.
When Human Review Capacity Runs Out: The Case for Autonomous Penetration Testing
At some volume — and that volume arrives faster than most teams expect — human review cannot keep pace. At 2.74x XSS density and 1.7x overall, each AI-heavy sprint produces more security debt than an equivalent human-authored sprint. The review queue grows faster than it can be cleared.
Autonomous penetration testing is the economic response. It is continuous security assessment using AI agents to test a deployed application for exploitable vulnerabilities at a rate human testers cannot match within a sprint cycle. It runs against the deployed production environment — catching vulnerabilities that passed SAST and DAST gates, including those introduced by integration patterns or runtime configuration.
Escape raised $18 million specifically for this market. Lovable subsequently partnered with Aikido to bring automated pentesting to its platform. The market is confirming that the human review constraint is real.
Where it sits in the pipeline:
- SAST at the PR gate — catches known vulnerability patterns in static code before merge
- DAST against the staged build — tests the running application for exploitable paths before production
- Autonomous penetration testing in production — continuous coverage, catching what passed the upstream gates
It complements SAST/DAST; it does not replace them. Autonomous testing catches broken authentication flows, exposed data through legitimate endpoints, and logic errors that only appear at runtime. For small teams without dedicated AppSec staff, start by blocking AI-generated PRs that include new dependencies without human approval, then layer in autonomous testing as volume grows.
How Do You Adjust Sprint Security Capacity When AI Is in the Coding Loop?
Sprint security capacity planning has to treat the 2.74x density multiplier as an input variable, not background noise. If you do not recalculate security review workload when AI adoption increases, you get a growing review debt that compounds across sprints. The leading indicator is review queue length — it surfaces before anything shows up in incident data.
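The queue-length indicator can be tracked with very little tooling. The Python sketch below assumes you can export, per sprint, how many security findings were raised and how many were closed; the three-sprint threshold and the sample numbers are illustrative assumptions, not figures from the studies cited here.

```python
from itertools import accumulate


def backlog_by_sprint(sprints: list[tuple[int, int]]) -> list[int]:
    """Cumulative open findings after each sprint, from (raised, closed) pairs."""
    return list(accumulate(raised - closed for raised, closed in sprints))


def capacity_exceeded(sprints: list[tuple[int, int]], consecutive: int = 3) -> bool:
    """True if the open backlog grew for `consecutive` sprints in a row."""
    growth_run = 0
    previous = 0
    for current in backlog_by_sprint(sprints):
        growth_run = growth_run + 1 if current > previous else 0
        if growth_run >= consecutive:
            return True
        previous = current
    return False


# Hypothetical history: (findings raised, findings closed) per sprint.
history = [(10, 10), (14, 11), (18, 12), (21, 13)]

print(backlog_by_sprint(history))
print("review capacity exceeded" if capacity_exceeded(history) else "queue is stable")
```

A single bad sprint does not trip the check; a sustained run of backlog growth does, which is the pattern that signals review debt rather than noise.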
Here is a practical four-step framework.
Step 1: Estimate current AI code density. What percentage of new commits this sprint are AI-assisted? If 30% are AI-assisted, then 30% of your commits generate XSS findings at 2.74x the human baseline rate.
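A rough estimate of AI code density can be derived from commit metadata, in the same spirit as the co-author annotations used to classify PRs in the study. The Python sketch below works on plain commit message strings; the marker text is a hypothetical placeholder, since the real trailers or signatures depend on which tools your team uses, and commits with no marker will be undercounted.

```python
# Hypothetical marker; replace with the trailers your tools actually emit.
AI_MARKERS = ("co-authored-by: example-ai-assistant",)


def is_ai_assisted(commit_message: str, markers: tuple[str, ...] = AI_MARKERS) -> bool:
    """True if the commit message contains any known AI marker."""
    lowered = commit_message.lower()
    return any(marker in lowered for marker in markers)


def ai_density(commit_messages: list[str]) -> float:
    """Fraction of commits (0.0 to 1.0) flagged as AI-assisted."""
    if not commit_messages:
        return 0.0
    flagged = sum(is_ai_assisted(message) for message in commit_messages)
    return flagged / len(commit_messages)


# Made-up sprint history: 3 of 10 commits carry the marker.
ai_commit = "Add profile page\n\nCo-authored-by: example-ai-assistant <bot@example.com>"
human_commit = "Fix pagination off-by-one"
sprint_commits = [ai_commit] * 3 + [human_commit] * 7

print(f"AI density this sprint: {ai_density(sprint_commits):.0%}")
```

Treat the result as a lower bound: it only counts commits where the assistant left a trace in the message.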
Step 2: Calculate adjusted review hours. Use a blended density multiplier:
Blended multiplier = (AI_density_% × 2.74) + ((1 − AI_density_%) × 1)
At 30% AI density: (0.30 × 2.74) + (0.70 × 1) = 1.52. Apply that to your sprint’s security review allocation.
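The blended multiplier is simple enough to keep as a small helper. The Python sketch below reproduces the 30% example; the function name and the optional second parameter are illustrative. Because 2.74x is the XSS-specific figure, one option (an assumption, not something the formula above prescribes) is to pass 1.7 when estimating all-category review load instead.

```python
def blended_multiplier(ai_share: float, ai_multiplier: float = 2.74) -> float:
    """Weighted finding-density multiplier for a sprint.

    ai_share is the fraction of commits that are AI-assisted (0.0 to 1.0);
    human-authored commits are weighted at the baseline of 1.
    """
    if not 0.0 <= ai_share <= 1.0:
        raise ValueError("ai_share must be between 0.0 and 1.0")
    return ai_share * ai_multiplier + (1.0 - ai_share) * 1.0


xss_view = blended_multiplier(0.30)          # XSS-specific multiplier
overall_view = blended_multiplier(0.30, 1.7)  # all-category multiplier

print(f"XSS blended multiplier at 30% AI density: {xss_view:.2f}")
print(f"Overall blended multiplier at 30% AI density: {overall_view:.2f}")
```

Using both values gives a range: the XSS figure is the upper bound for XSS-heavy front-end work, the overall figure a more conservative estimate for mixed workloads.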
Step 3: Set a review capacity ceiling. If adjusted hours exceed available reviewer time, you have two options: reduce AI-generated code volume in the sprint, or escalate the overflow to autonomous penetration testing. Do not absorb the excess — that is how review debt accumulates.
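The ceiling check in Step 3 amounts to comparing adjusted hours against available hours and making the overflow explicit. A minimal Python sketch follows; the 20 baseline hours and 24 available hours are made-up inputs, and the multiplier is the 1.52 from the 30% example.

```python
def review_overflow_hours(baseline_hours: float, multiplier: float,
                          available_hours: float) -> float:
    """Hours of security review that exceed reviewer capacity (0.0 if none)."""
    if min(baseline_hours, multiplier, available_hours) < 0:
        raise ValueError("inputs must be non-negative")
    adjusted_hours = baseline_hours * multiplier
    return max(0.0, adjusted_hours - available_hours)


# Hypothetical sprint: 20 baseline review hours, 24 reviewer hours available.
overflow = review_overflow_hours(baseline_hours=20, multiplier=1.52, available_hours=24)

if overflow > 0:
    print(f"{overflow:.1f} h over capacity: cut AI-generated volume or escalate the overflow")
else:
    print("within review capacity")
```

Surfacing the overflow as a number in sprint planning is what keeps it from being silently absorbed.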
Step 4: Assign code ownership explicitly. The developer who accepted the AI-generated code owns its security review. Without explicit ownership, responsibilities diffuse, findings get deferred, and the pipeline loses accountability.
If security is already a constraint, the 2.74x finding means your existing bottleneck gets worse with every AI-assisted sprint. Prioritise SAST reconfiguration — Semgrep gates as hard-fail conditions — before expanding AI coding adoption. The tools need to be in place before volume increases.
Only 12% of organisations apply the same security standards to AI-generated code as to traditional code — a gap that accounts for why 74% cannot provide security provenance data for AI code. That gap closes at the sprint planning level.
The governance-level policy response — who owns the risk, what review requirements apply before production deployment, and how to structure accountability — is covered in how the density math translates into governance policy.
Frequently Asked Questions
What is vulnerability density and why does it matter for AI-generated code?
Vulnerability density is the count of security flaws per unit of code — per pull request, per file, or per thousand lines. The CodeRabbit 470-PR study found AI-co-authored PRs produce XSS vulnerabilities at 2.74 times the density of human-authored PRs. That means the same review process misses proportionally more flaws unless you adjust for it.
Does the 2.74x figure apply to all types of security vulnerabilities or just XSS?
The 2.74x multiplier specifically covers XSS (cross-site scripting, CWE-80). The same study found a separate 1.7x rate for all issue types combined. These figures are frequently conflated — 2.74x is the XSS-specific finding, not the universal multiplier.
Why haven’t newer AI models improved at writing secure code?
Veracode tested more than 100 LLMs across 4 million-plus scans and found no meaningful improvement. AI models produce plausible-looking but incomplete security controls, a failure mode sometimes grouped under AI hallucination. Security pass rates remain stuck at approximately 55% regardless of model size. The vulnerability density issue is structural.
Is AI-generated code in open-source projects as risky as AI-generated code in enterprise codebases?
The CodeRabbit study drew from open-source GitHub repositories, which may have more active reviewer communities than enterprise environments. Enterprise AI code without equivalent review scrutiny may carry higher density than the 2.74x baseline, not lower.
Which OWASP Top 10 categories are most elevated in AI-generated code?
Veracode’s 4 million-scan study found 45% of AI-generated code fails OWASP Top 10 tests. XSS (CWE-80) has an 85% failure rate in the Spring 2026 update — the persistent worst performer. Java-generated code has a 71% failure rate. SQL injection and command injection are the other priority categories for SAST configuration.
Does using GitHub Copilot or Cursor make my application less secure?
Not automatically, but AI-co-authored code — including code produced with Cursor and Copilot — introduces XSS vulnerabilities at 2.74 times the rate of human-written code. The tools do not guarantee insecure output, but they shift the probability distribution in a way that requires adjusted review processes.
What is the difference between Semgrep and a standard SAST tool for AI-generated code?
Semgrep is an open-source SAST tool with customisable rule sets and a low false-positive rate when properly configured. It lets teams run targeted XSS and injection rule packs as mandatory merge gates — the configuration most relevant given the 2.74x XSS density finding.
How do I know if my team’s AI coding volume has exceeded our security review capacity?
The leading indicator is review queue length: if findings are accumulating across sprints rather than being closed within the sprint they are raised, review capacity has been exceeded. Calculate the blended density multiplier and compare the implied review hours against available security reviewer time per sprint.
What is autonomous penetration testing and when does it make sense?
Autonomous penetration testing uses AI agents to test deployed applications for exploitable vulnerabilities at machine speed — covering attack paths, authentication flows, and integration surfaces that SAST/DAST gates may miss. It makes sense when AI coding output volume has grown past the point where manual penetration testing can provide adequate coverage within sprint cadence.
Is vibe coding riskier than using AI coding assistants for security purposes?
Yes — there is a measurable risk spectrum. The 2.74x finding was produced from code where developers retained review control. Full vibe coding platforms like Lovable and Bolt.new remove that layer entirely. Escape’s scan of 5,600 vibe-coded applications found more than 2,000 high-impact vulnerabilities and 400 exposed secrets, suggesting the density at the full vibe coding tier is substantially higher than 2.74x.
How should I handle SAST false positives at 2.74x vulnerability density?
At higher density, alert volume increases — both real findings and false positives. Avoid broad suppression rules that silence entire vulnerability classes. Triage by CWE class instead: suppress only specific rule IDs with confirmed false-positive patterns, and require explicit security team sign-off for any new suppression in XSS or injection categories.