The 91.5% statistic has been doing the rounds since Q1 2026. As a headline it hits hard: nearly every vibe-coded application has a security vulnerability. What the headline drops is the qualifier that makes the number meaningful, along with the four independent studies that give it credibility despite its commercial source.
Vibe coding has gone from novelty to engineering norm faster than the security research could keep up. The term, coined by Andrej Karpathy in February 2025 and named Collins Word of the Year for 2025, describes natural-language-directed AI code generation deployed with minimal human review. The research is catching up now. When Veracode’s 4 million scans, Georgia Tech’s CVE tracker, CodeRabbit’s 470-PR meta-analysis, Kingbird’s audit of 200+ apps, and Escape’s 5,600-app production scan all point the same direction, dismissing the finding means dismissing all five independently. This article is part of our comprehensive guide to the full vibe coding security picture, which covers all seven dimensions of vibe coding’s security reality.
What does “91.5% of vibe-coded apps have vulnerabilities” actually mean — and how was it measured?
The 91.5% figure is an app-level prevalence metric, not a code-line rate. It answers “how many applications contain at least one qualifying flaw” — not “what percentage of individual lines of code are insecure.”
The precise qualifier is “at least one AI hallucination-related flaw.” Strip that qualifier and you change the meaning entirely. A single exposed secret in a 50,000-line codebase is enough to make an app count in the 91.5%.
The source is Kingbird Solutions, a security audit firm, from a Q1 2026 internal audit of over 200 vibe-coded applications. Kingbird is a commercial source, not an independent academic institution, and its methodology is partially disclosed: flaw categories and sample size are named, but sampling method and app selection criteria are not published. That’s worth noting. It doesn’t invalidate the finding, but it does mean it shouldn’t stand alone.
The 40–62% code-level vulnerability rate is consistent with the 91.5% app-level rate because apps are made up of many components — any single component flaw triggers inclusion. As Kingbird puts it: “The 40 to 62 percent range applies to the code itself. The 91.5 percent is specifically about hallucination flaws in production vibe-coded apps.”
What is an AI hallucination flaw, and why does it produce the same vulnerability types repeatedly?
In security contexts, an AI hallucination flaw is code that compiles and runs but is built on false security assumptions. The AI invents a security function that doesn’t exist, assumes a framework enforces permissions it doesn’t enforce, or uses a deprecated security library because its training data predates the deprecation. The flaw is latent: it doesn’t surface during functional testing. It surfaces when an adversary probes the false security assumption.
Here’s a common pattern: AI-generated code calls a verifyUser() function that looks plausible but never checks object ownership. Authentication runs; authorisation never does. The app ships with a BOLA vulnerability no functional test will surface.
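A minimal sketch of that pattern, in Python for illustration. Everything here is hypothetical: the in-memory `DOCUMENTS` and `SESSIONS` stores, the `Document` type, and both handler functions are stand-ins, not code from any audited app. The vulnerable handler authenticates the caller and then returns whatever object was requested; the checked handler adds the missing ownership test.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: int
    owner_id: int
    body: str


# Hypothetical in-memory stores standing in for a database and session layer.
DOCUMENTS = {
    1: Document(doc_id=1, owner_id=100, body="alice's notes"),
    2: Document(doc_id=2, owner_id=200, body="bob's credentials"),
}
SESSIONS = {"token-alice": 100, "token-bob": 200}


def verify_user(token: str) -> int:
    """Authentication only: map a session token to a user id."""
    if token not in SESSIONS:
        raise PermissionError("not authenticated")
    return SESSIONS[token]


def get_document_vulnerable(token: str, doc_id: int) -> Document:
    # Authenticates the caller, then returns any object by id (BOLA).
    verify_user(token)
    return DOCUMENTS[doc_id]


def get_document_checked(token: str, doc_id: int) -> Document:
    # Authenticates, then authorises: the caller must own the object.
    user_id = verify_user(token)
    doc = DOCUMENTS[doc_id]
    if doc.owner_id != user_id:
        raise PermissionError("not authorised for this object")
    return doc
```

A functional test that fetches a user's own document passes against both handlers, which is why the flaw stays latent until someone requests an id they do not own.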
That mechanism is why the vulnerability distribution is predictable and recurring rather than random.
BOLA (Broken Object Level Authorization), also called IDOR in older OWASP taxonomy, is the most prevalent access control vulnerability in vibe-coded apps. API endpoints authenticate the user but fail to verify ownership of the requested object. Automated scanning often misses it because detecting it requires understanding business logic. The Lovable breach, the primary 2026 case study, is the clearest real-world illustration of what that looks like in production: 48 days of live exposure during which any free account holder could access another user’s source code and database credentials.
Credential exposure follows from training data patterns that normalise secrets in code. Escape found 400+ exposed secrets across 5,600 scanned apps. ShipSafe found 45% of scanned AI-generated repos had hardcoded secrets.
Injection flaws show an interesting asymmetry: SQL injection (CWE-89) achieves an 82% security pass rate in Veracode’s testing because parameterised query patterns dominate training data. XSS (CWE-80) achieves only 15%, and log injection (CWE-117) only 13% — both require tracking data flow across multiple function calls, which is beyond pattern-matching.
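The asymmetry is easier to see side by side. The sketch below is illustrative only, and the function names and the `users` table are assumptions: the SQL defence is a single local idiom at the call site, while the XSS and log injection defences have to be applied wherever untrusted data finally reaches an HTML or log sink, which may be several calls away from where it entered.

```python
import html
import sqlite3


def find_user(conn: sqlite3.Connection, name: str) -> list:
    # Parameterised query: the driver binds `name` as data, never as SQL text.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def render_greeting(name: str) -> str:
    # Output encoding for an HTML body context, applied at the sink.
    return f"<p>Hello, {html.escape(name)}</p>"


def log_line(event: str, user_input: str) -> str:
    # Neutralise CR/LF so user input cannot forge additional log entries.
    cleaned = user_input.replace("\r", "\\r").replace("\n", "\\n")
    return f"{event}: {cleaned}"
```

Note that `html.escape` covers HTML body and quoted-attribute contexts only; other contexts (URLs, inline scripts) need their own encoding.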
The hallucination mechanism also explains why newer, more capable models haven’t improved security pass rates despite improving syntax correctness. Better models produce better-functioning code using the same false security assumptions. The sophistication improves; the misconceptions persist.
Do four independent studies actually reach the same conclusion — or are they measuring different things?
Four independent studies, with different methodologies and different sample populations, reach consistent conclusions.
Veracode (4 million scans): Tested 150+ LLMs across 80 coding tasks and 4 languages. Finding: only 55% of AI code generation tasks produce secure code — “stubbornly stuck at approximately 55%” across the 2025–2026 testing cycles. Syntax correctness has climbed above 95%. The gap between “works” and “works securely” is widening. Limitation: controlled test conditions, not production.
One signal worth acting on for enterprise teams: Java achieves only a 29% security pass rate versus 62% for Python. If your stack is Java-heavy, that is a material per-task risk differential. Not marginal.
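To size that differential, it helps to convert pass rates into failure rates. The short calculation below uses only the two pass rates quoted above; the helper names are made up for illustration. Java fails on 71% of tasks and Python on 38%, a ratio of roughly 1.9.

```python
# Security pass rates quoted in this article (Veracode figures).
PASS_RATES = {"Java": 0.29, "Python": 0.62}


def failure_rate(pass_rate: float) -> float:
    # A task either passes or fails, so failure is the complement.
    return 1.0 - pass_rate


def failure_ratio(lang_a: str, lang_b: str) -> float:
    # How many times more often lang_a fails than lang_b, per task.
    return failure_rate(PASS_RATES[lang_a]) / failure_rate(PASS_RATES[lang_b])
```

Calling `failure_ratio("Java", "Python")` gives the per-task multiplier a team could use when weighting review effort across languages.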
CodeRabbit (470-PR meta-analysis): Analysis of 470 GitHub pull requests found AI co-authored code produces XSS (CWE-80) vulnerabilities at 2.74 times the rate of equivalent human-written code. For a deeper look at the CodeRabbit 2.74x finding and its methodology, the companion article covers the engineering implications. Limitation: primary source not independently verifiable.
Georgia Tech SSLab / Vibe Security Radar: Tracks CVEs attributable to AI coding tools across CVE.org, NVD, GitHub Advisory Database, OSV, and RustSec. As of March 2026: 74 confirmed CVEs, with 35 disclosed in March alone — up from 6 in January. Georgia Tech estimates confirmed CVEs represent 10–20% of the actual total. That’s a floor, not a ceiling. Limitation: attribution requires metadata; most AI-assisted code leaves none.
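As a quick worked example of what the 10–20% estimate implies: if 74 confirmed CVEs are between a tenth and a fifth of the real total, the real total is somewhere between 370 and 740. The function below is just that division; its name and signature are illustrative.

```python
def implied_total(confirmed: int, low_share: float, high_share: float) -> tuple[int, int]:
    # If confirmed cases are between low_share and high_share of the true
    # total, the total lies between confirmed/high_share and confirmed/low_share.
    return round(confirmed / high_share), round(confirmed / low_share)


low, high = implied_total(74, 0.10, 0.20)
print(f"Implied total AI-attributable CVEs: {low} to {high}")
```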
Escape (5,600-app production scan): 5,600 publicly deployed vibe-coded apps, DAST and AI pentesting. Finding: 2,000+ high-impact vulnerabilities, 400+ exposed secrets, 175 instances of exposed PII. This is the production-validation study: the same patterns found in controlled conditions exist in deployed, live applications. Limitation: publicly deployed apps may self-select for lower security quality.
Each study’s limitations are independent. Consistency across independent limitations is the evidentiary case.
How do the 40–62% code-level rate and the 91.5% app-level rate fit together — and which number matters more for risk assessment?
These two numbers measure fundamentally different things. Don’t conflate them.
The 40–62% range is a code-level vulnerability rate: what percentage of individual AI-generated code samples contain at least one detectable vulnerability? Veracode sits at approximately 45%, Cloud Security Alliance at 62%.
The 91.5% figure is an app-level prevalence rate: how many complete vibe-coded applications contain at least one AI hallucination-related flaw across all components? Because a production app is made up of many components, even a moderate code-level rate compounds to near-certain app-level prevalence. The two numbers aren’t in tension — they measure different levels of abstraction.
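The compounding can be sketched with a simple probability model. This is an illustration, not Kingbird’s method: it assumes components fail independently and at the same rate, and the component counts are made up. It also mixes a general vulnerability rate with a figure that counts hallucination flaws specifically, so treat it as intuition for why the two numbers are compatible rather than a derivation of 91.5%.

```python
def app_level_prevalence(component_rate: float, components: int) -> float:
    """P(at least one flawed component), assuming independent components."""
    return 1.0 - (1.0 - component_rate) ** components


# Hypothetical component counts; 0.45 is the per-sample rate quoted above.
for n in (1, 2, 4, 8):
    print(f"{n} components: {app_level_prevalence(0.45, n):.1%}")
```

Under these assumptions, a 45% per-component rate already puts an app with four AI-generated components at about 91%, and real applications usually have far more than four.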
Use the right number for the right question. The 91.5% answers “is our app likely to have a problem?” The 40–62% answers “how likely is any single piece of AI-generated code to contain a vulnerability?”
The most operationally useful single number is Veracode’s 55% security pass rate: roughly one in two AI code generation tasks produces vulnerable output. Security review structured as exception-handling is calibrated for a pass rate above 90%. At 55%, it must be a mandatory step for every component. The broader security context of AI code generation — including how these statistics play out in a real platform incident — is best understood through the Lovable breach, which demonstrated in production terms exactly what a 55% pass rate means when it reaches hostile actors. For a complete vibe coding security overview, the pillar resource frames the full evidence base.
What does the SecureVibeBench 23.8% result mean for AI code security in practice?
SecureVibeBench (arXiv 2509.22097) is a peer-reviewed academic benchmark for evaluating secure code generation by AI agents — and its result is the most practically significant single number in the Q1 2026 research for anyone who wants to stress-test these claims.
Rather than generating code from scratch, SecureVibeBench uses vulnerability-introducing commits (VICs) — real commits that introduced CVEs into OSS-Fuzz and ARVO projects — to reconstruct the exact contexts where developers originally introduced vulnerabilities. The agent must solve the task without replicating the historical error. Evaluation covers three dimensions: functional correctness, security correctness, and SAST detection via Semgrep. 105 C/C++ test cases, 41 real projects.
The result: SWE-agent with Claude Sonnet 4.5 achieved a correct-and-secure rate of 23.8%. Other agents performed worse. That’s a 76.2% failure rate for the best available agent under controlled academic conditions, against known vulnerability types.
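A correct-and-secure rate is a joint metric: a task counts only if the output is both functionally correct and free of the vulnerability. The sketch below shows how such a rate could be computed from per-task outcomes. The `TaskResult` type and the sample data are hypothetical and are not taken from the paper; the point is that the joint rate can never exceed the lower of the two individual rates.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    functional: bool  # passed the functional tests
    secure: bool      # did not reintroduce the vulnerability


def rates(results: list[TaskResult]) -> dict[str, float]:
    n = len(results)
    return {
        "functional": sum(r.functional for r in results) / n,
        "secure": sum(r.secure for r in results) / n,
        # Joint metric: both conditions must hold for the same task.
        "correct_and_secure": sum(r.functional and r.secure for r in results) / n,
    }
```

For scale, 25 tasks passing both checks out of 105 would give 23.8%, assuming every test case is weighted equally.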
The benchmark covers memory-related C/C++ vulnerabilities, not web application patterns — so don’t transpose the 23.8% directly to BOLA or XSS. What it validates is the general finding: functional correctness and security correctness remain uncoupled. AI agents have gotten better at producing code that runs. They haven’t gotten better at producing code that runs safely.
Semgrep, used as the SAST oracle in SecureVibeBench, is the validated starting point for vibe-coded codebases. For what SAST and DAST configuration changes are needed when teams use AI coding assistants, the companion article covers the full guidance.
What do these numbers mean for assessing your team’s vibe-coded codebase?
With a 45–55% chance of vulnerable output per code generation task, security review cannot be structured as exception-handling. It has to be a mandatory step in the CI/CD pipeline for every AI-generated component — not a spot-check applied to suspicious-looking code.
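One way to make review mandatory rather than discretionary is a pipeline gate that fails the build on any blocking finding. The sketch below wraps the Semgrep CLI and is a starting point under stated assumptions: the `--config auto` ruleset, the choice of blocking severities, and the function names are illustrative choices, and the JSON fields read here (`results`, `extra.severity`, `check_id`, `path`, `start.line`) should be checked against the Semgrep version in use.

```python
import json
import subprocess

# Assumption: team policy on which severities block a merge.
BLOCKING_SEVERITIES = {"ERROR", "WARNING"}


def blocking_findings(semgrep_json: str) -> list[dict]:
    # Keep only findings whose severity is in the blocking set.
    report = json.loads(semgrep_json)
    return [
        finding
        for finding in report.get("results", [])
        if finding.get("extra", {}).get("severity") in BLOCKING_SEVERITIES
    ]


def run_gate(target: str = ".") -> int:
    # Returns a process exit code: 0 = clean, 1 = findings, 2 = scan failed.
    proc = subprocess.run(
        ["semgrep", "scan", "--config", "auto", "--json", target],
        capture_output=True,
        text=True,
        check=False,
    )
    if not proc.stdout.strip():
        return 2  # fail closed when the scanner produced no report
    findings = blocking_findings(proc.stdout)
    for finding in findings:
        print(f"{finding['path']}:{finding['start']['line']} {finding['check_id']}")
    return 1 if findings else 0
```

In a CI job this would be invoked as the final step, for example `sys.exit(run_gate("."))`, so that a non-zero code blocks the merge.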
SAST configuration requires specific adjustment for AI-generated codebases. Standard configurations are tuned for codebases dominated by human-written code. AI-generated code has higher vulnerability density per unit, so the same configuration that handles human-written code adequately may be under-scaled. Semgrep, validated by SecureVibeBench, is the recommended starting point. Full SAST reconfiguration guidance is covered in the companion article on the CodeRabbit 2.74x finding and its methodology.
DAST addresses what SAST cannot: runtime behaviour of deployed applications. For teams that have already deployed vibe-coded code, Escape demonstrates the DAST plus AI pentesting agent approach for continuous production scanning.
Prioritise BOLA and credential exposure first. BOLA can’t be fully automated — it requires understanding the authorisation model to verify that ownership is being checked, not just authentication. Credential exposure requires scanning across the full git history, not only current state.
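To show why history matters, here is a rough sketch of a secret scan that reads every patch in the repository rather than only the checked-out files. The two regular expressions are illustrative and far from complete, and the function names are hypothetical; a dedicated scanner is the better tool in practice. A secret committed once and later deleted still appears in `git log -p` output.

```python
import re
import subprocess

# Illustrative patterns only; real scanners ship hundreds of rules.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_assignment": re.compile(
        r"(?i)(api[_-]?key|secret|password|token)\s*[:=]\s*['\"][^'\"\s]{8,}['\"]"
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    # Returns (line number within `text`, pattern name) for each match.
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


def scan_git_history(repo_path: str) -> list[tuple[int, str]]:
    # `git log -p --all` prints every patch on every ref, so deleted
    # secrets are included. Line numbers refer to the log output.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--all"],
        capture_output=True,
        text=True,
        errors="replace",
        check=True,
    ).stdout
    return find_secrets(log)
```

Any hit in history should be treated as a live credential and rotated, since removing the line from the current tree does not remove it from clones.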
For teams with Java-heavy stacks: a 29% security pass rate versus 62% for Python means AI-generated Java code fails on 71% of tasks versus 38%, nearly double the per-task vulnerability probability. Applying the same review process uniformly across languages leaves you under-resourced for Java.
And for worst-case context — these vulnerability patterns are not abstract risk. VECT ransomware resulted in the confirmed breach of Mercor — a $10 billion AI recruiting startup — with 4 TB of data exfiltrated via the LiteLLM supply chain compromise. For what happens when the same vulnerability patterns reach hostile actors, the VECT Ransomware article covers the full timeline.
Frequently Asked Questions
Is the 91.5% vulnerability rate a real statistic or vendor marketing?
It’s sourced from Kingbird Solutions, a security audit firm, via a Q1 2026 internal audit of 200+ vibe-coded applications. Commercial research, not academic — scepticism is warranted. Sampling method and app selection criteria are not fully disclosed.
That said, the finding is consistent with four independent studies: Veracode’s 4 million scans, CodeRabbit’s 470-PR analysis, Georgia Tech’s CVE tracking, and Escape’s 5,600-app production scan. The credibility rests on convergence, not on Kingbird alone. And the qualifier “at least one AI hallucination-related flaw” is load-bearing: the claim is not “91.5% of code lines are vulnerable” but “91.5% of apps had at least one qualifying flaw of any severity.”
Why hasn’t AI-generated code become more secure as the models have improved?
Veracode’s longitudinal data — 150+ LLMs tested across 2025–2026 — shows no improvement in security pass rate despite significant improvement in functional correctness. The same approximately 55% persists across model generations.
Better models produce better-functioning code using the same false security assumptions baked into training data. Security correctness requires tracking data flow across function boundaries and modelling threat behaviour — a different kind of reasoning than pattern-matching, however sophisticated. Improving the pattern doesn’t fix the assumption.
What does “hallucination flaw” mean in security context versus in general AI usage?
In general AI usage, hallucination refers to factually incorrect text. In security context, an AI hallucination flaw is code that compiles and runs correctly but is built on false security assumptions: invented functions, unenforced framework permissions, deprecated libraries used as current. The flaw doesn’t surface during functional testing; it surfaces during an attack.
What is the difference between BOLA and IDOR?
BOLA (Broken Object Level Authorization) is the current OWASP API Security Top 10 term. IDOR (Insecure Direct Object Reference) is the older OWASP Web Application Top 10 term for the same flaw. An attacker manipulates an object identifier in an API request to access another user’s data. The API authenticates but does not authorise. BOLA is the preferred modern term.
Which AI coding tools produce the most CVEs according to the Georgia Tech research?
Of 74 confirmed AI-attributed CVEs tracked by Georgia Tech’s Vibe Security Radar as of March 2026: Claude Code 49 (11 critical), GitHub Copilot 15 (2 critical). The imbalance is partly methodological — Claude Code always leaves a co-author git signature; Copilot’s inline suggestions leave no trace. The confirmed figure represents 10–20% of the actual total; the remainder is unattributable.
Does vibe coding security risk vary by programming language?
Yes. Veracode’s Spring 2026 study: Java 29% security pass rate, Python 62%, C# 58%, JavaScript 57%. For Java-heavy stacks, that gap should directly influence review prioritisation: Java’s 71% failure rate is nearly double Python’s 38%.
How do I know if my team’s vibe-coded app already has these vulnerabilities?
Start with credential scanning across the full git history, using tools like Gitleaks or GitHub’s built-in secret scanning; ShipSafe found hardcoded secrets in 45% of the AI-generated repos it scanned. BOLA and access control failures require manual review against the authorisation model. Semgrep with AI-code-specific rulesets provides first-pass SAST coverage; Escape provides production-level DAST validation. VibeCheck (notelon.ai) offers a free initial scan of GitHub repos or live sites.
What does “security pass rate” mean, and why is 55% low?
Veracode’s security pass rate is the percentage of AI code generation tasks that produce no detectable vulnerability when analysed by SAST — per-task, not per-line. At 55%, one in two tasks produces vulnerable output. Syntax correctness exceeds 95%. That 40-percentage-point gap is the central finding: “works” and “works securely” are not the same thing, and at 55%, review must be systematic, not exception-handling.
Are there sectors where vibe coding security risk is higher?
Financial services and healthcare report the lowest vibe coding adoption, at 34% and 28% respectively, suggesting regulated industries are already self-selecting toward caution. That won’t be enough. EU AI Act high-risk obligations take effect 2 August 2026; California S.B. 53 and the New York RAISE Act add further requirements. The 91.5% app-level prevalence of hallucination-related flaws, in AI-generated code that is not systematically reviewed, creates documentable audit exposure in any compliance-sensitive environment.
What is SecureVibeBench, and why is 23.8% the right number to cite for AI security benchmarking?
SecureVibeBench (arXiv 2509.22097) evaluates whether AI coding agents produce code that is both functionally correct and security-safe. 105 test cases from real C/C++ vulnerabilities in OSS-Fuzz and ARVO — real contexts, not synthetic examples. The 23.8% correct-and-secure rate is the best performance by any agent tested (SWE-agent with Claude Sonnet 4.5). Right number for a sceptical audience: peer-reviewed, reproducible, vendor-independent, tests both dimensions simultaneously.
What is the Escape production scan, and why does it matter alongside the controlled studies?
Escape.tech scanned 5,600 publicly deployed vibe-coded applications: 2,000+ high-impact vulnerabilities, 400+ exposed secrets, 175 instances of exposed PII. Its significance is production validation — the same patterns from controlled studies exist in live, deployed applications. The 400+ exposed secrets are the most immediately actionable finding: live credentials, accessible right now.
Do SAST tools need to be reconfigured for AI-generated code, and why?
Standard SAST configurations are tuned for human-written code. AI-generated code has higher vulnerability density per unit — the same configuration may be under-scaled. Reconfiguration priorities: lower thresholds for credential exposure; add rules for BOLA and access control patterns; increase XSS and log injection coverage. Full guidance is in the companion article on the CodeRabbit 2.74x finding.