OX Security described AI coding tools as an “army of talented junior developers—fast, eager, but fundamentally lacking judgment”. They can implement features rapidly, sure. But they miss architectural implications, security concerns, and maintainability considerations.
Vulnerable code reaches production faster than your teams can review it—this deployment velocity crisis is the real challenge with AI coding tools. Implementation speed has skyrocketed while review capacity remains static.
This article explores specific anti-patterns in AI-generated code with examples showing before/after comparisons. We examine cognitive complexity and maintainability concerns, explain why traditional code review processes miss AI-specific quality issues, and compare how different AI coding assistants differ in output quality. This analysis is part of our comprehensive guide on understanding the shift from vibe coding to context engineering.
OX Security analyzed 300+ repositories to identify 10 distinct anti-patterns that appear systematically in AI-generated code. We’re going to walk through these patterns, explain their impact via cognitive complexity metrics, and demonstrate why traditional code review processes miss AI-specific quality issues.
Different AI tools—Copilot, Cursor, Claude Code—exhibit these patterns differently based on context window constraints. And there’s a practical way to constrain AI output quality using test-driven development.
Let’s get into it.
What are the most common anti-patterns in AI-generated code?
OX Security’s research covered 50 AI-generated repositories compared against 250 human-coded baselines. They found 10 distinct anti-patterns that go against established software engineering best practices.
These aren’t random errors. They’re systematic behaviours that show how AI tools approach code generation.
The patterns break down by how often they occur:
Very High (90-100% occurrence):
- Comments everywhere
- Refactoring avoidance
- Edge case over-specification (see the before/after sketch following this list)
High (80-90%):
- By-the-book fixation (rigid adherence to literal prompt interpretation)
- Bugs déjà-vu (repeating the same errors across sessions)
Medium (40-70%):
- The lie of unit test code coverage (high coverage, shallow logic)
- Phantom bugs (non-existent edge cases that clutter code)
- Vanilla style (no architectural patterns)
- "It worked on my machine" syndrome
- The return of the monoliths
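To make the highest-frequency patterns concrete, here is a hedged before/after sketch. The `parse_price` helper is hypothetical, not drawn from OX Security's dataset; the "before" version shows comments everywhere and edge case over-specification, while the "after" version keeps only what the requirement asked for.

```python
# Before: typical AI output -- every line narrated, hypothetical edge cases handled
def parse_price(value):
    # Check if the value is None
    if value is None:
        # Return zero if the value is None
        return 0.0
    # Check if the value is already a float
    if isinstance(value, float):
        # Return the float directly
        return value
    # Check if the value is a string
    if isinstance(value, str):
        # Strip whitespace from the string
        value = value.strip()
        # Handle currency symbols that may appear
        for symbol in ("$", "€", "£", "¥"):
            value = value.replace(symbol, "")
        # Handle thousands separators
        value = value.replace(",", "")
    # Convert the value to a float and return it
    return float(value)


# After: only the behaviour the requirement actually specified
def parse_price(value: str) -> float:
    """Parse a plain numeric string such as '19.99' into a float."""
    return float(value.strip())
```

Every branch in the "before" version is code a reviewer has to read and a maintainer has to keep alive, whether or not the requirement ever asked for it.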
Francis Odum, a cybersecurity researcher, put it well: “Fast code without a framework for thinking is just noise at scale”.
How does the “army of juniors” metaphor explain AI code behaviour?
Junior developers can write syntactically correct code that solves immediate requirements. But they miss long-term maintainability, security implications, and system-wide coherence.
AI exhibits the same limitation. Strong pattern matching for common scenarios. Weak architectural judgment for edge cases and system integration.
Here’s the key insight: AI implements prompts directly, without weighing refactoring opportunities, architectural patterns, or maintainability trade-offs.
The metaphor explains why refactoring avoidance occurs 80-90% of the time. AI doesn’t think “this new feature would fit better if I restructured the existing authentication module first.” It just adds the new feature wherever you asked.
There’s one key difference though. AI doesn’t learn from mistakes within a session or across projects, unlike actual juniors who improve over time.
The implication for your team? Position AI as implementation support while humans focus on architecture, product management, and strategic oversight. Organisations must fundamentally restructure development roles.
What is cognitive complexity and why does it matter for AI code?
Cognitive complexity measures how difficult it is to read and understand code, considering nesting, conditional logic, and flow. Unlike cyclomatic complexity—which counts linearly independent code paths—cognitive complexity focuses on human comprehension difficulty.
AI-generated code often has high cognitive complexity despite passing traditional metrics like unit test coverage. The “comments everywhere” anti-pattern causes increased cognitive load. Same with edge case over-specification—each hypothetical scenario adds mental overhead.
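Here is a hedged illustration of why nesting drives cognitive complexity up. The `can_download` check is hypothetical, not taken from the study: the nested version forces a reader to hold every enclosing condition in mind, while the guard-clause version reads top to bottom.

```python
# Higher cognitive complexity: each nested branch adds mental state to track
def can_download(user, document):
    if user is not None:
        if user.is_active:
            if document is not None:
                if not document.archived:
                    if user.plan == "pro" or document.is_public:
                        return True
    return False


# Lower cognitive complexity: guard clauses, same behaviour
def can_download(user, document):
    if user is None or not user.is_active:
        return False
    if document is None or document.archived:
        return False
    return user.plan == "pro" or document.is_public
```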
Cognitive complexity scores above 15 typically indicate code that requires significant mental effort to understand. Here’s what different AI tools actually generated when measured:
- Lovable AI agent: max cognitive complexity of 22, 8 highly complex functions
- Replit AI agent: max cognitive complexity of 20, 8 highly complex functions
- Cursor: max cognitive complexity of 10, 9 highly complex functions
- Claude Code: only 2 highly complex functions with max complexity of 9
Static analysis tools like SonarQube can measure cognitive complexity automatically, giving you objective quality metrics. Flag high-complexity functions during code review and pull requests so they receive targeted attention.
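If you want a lightweight check before reaching for SonarQube, a rough proxy can be scripted with the standard library. This sketch flags functions whose control-flow nesting exceeds a threshold; it approximates, rather than reproduces, SonarQube's cognitive complexity scoring, and the threshold is an assumption to tune.

```python
import ast
import pathlib
import sys

NESTING_LIMIT = 4  # rough proxy threshold, not SonarQube's metric


def max_nesting(node, depth=0):
    """Return the deepest nesting of control-flow constructs under node."""
    nested_types = (ast.If, ast.For, ast.While, ast.Try, ast.With)
    deepest = depth
    for child in ast.iter_child_nodes(node):
        bump = 1 if isinstance(child, nested_types) else 0
        deepest = max(deepest, max_nesting(child, depth + bump))
    return deepest


def flag_deeply_nested(path):
    source = pathlib.Path(path).read_text(encoding="utf-8")
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            depth = max_nesting(node)
            if depth > NESTING_LIMIT:
                print(f"{path}:{node.lineno} {node.name} has nesting depth {depth}")


if __name__ == "__main__":
    for filename in sys.argv[1:]:
        flag_deeply_nested(filename)
```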
High cognitive complexity creates maintainability debt. Code becomes harder to debug, extend, and refactor over time. The costs compound.
How do context window constraints affect AI code quality?
Context window is the token limit determining how much code and conversation history an AI model can process simultaneously.
Tool comparison:
- Copilot: file-specific context (smallest)
- Cursor Normal: 128K tokens
- Cursor Max: 200K tokens with dynamic scaling
- Claude Code: consistent 200K tokens
When context fills, AI loses architectural understanding and creates inconsistent implementations across files. Cursor may reduce token capacity dynamically for performance, shortening input or dropping older context to keep responses fast.
Context blindness manifests as duplicated logic, inconsistent naming conventions, parallel implementations of the same functionality, and failure to maintain architectural patterns.
Example: AI reimplements authentication logic in multiple files because it can’t retain the original implementation beyond its context limit. You end up with three different approaches to the same problem scattered across your codebase.
Larger context windows provide better architectural coherence in large codebases but don’t eliminate the fundamental limitation. Code duplication percentage serves as a context blindness indicator. Track it.
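Here is a hedged sketch of tracking that indicator without a dedicated tool: hash normalised function bodies across a Python repository and report any body that appears more than once. Dedicated duplication tools (jscpd, SonarQube's duplication detection) are more robust; this only shows the idea.

```python
import ast
import hashlib
import pathlib
from collections import defaultdict


def function_fingerprints(path):
    """Yield (name, hash) pairs for each function in a Python file."""
    try:
        tree = ast.parse(path.read_text(encoding="utf-8"))
    except SyntaxError:
        return
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.dump ignores formatting and comments, so copy-pasted
            # logic still collides even after reformatting
            body = ast.dump(ast.Module(body=node.body, type_ignores=[]))
            yield node.name, hashlib.sha256(body.encode()).hexdigest()


def report_duplicates(root="."):
    seen = defaultdict(list)
    for path in pathlib.Path(root).rglob("*.py"):
        for name, digest in function_fingerprints(path):
            seen[digest].append(f"{path}:{name}")
    for locations in seen.values():
        if len(locations) > 1:
            print("Possible duplicated logic:", ", ".join(locations))


if __name__ == "__main__":
    report_duplicates()
```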
What is “vibe coding” and how does it create technical debt?
Vibe coding is an AI-dependent programming style popularised by Andrej Karpathy in early 2025. Developers describe project goals in natural language and accept AI-generated code liberally without micromanagement.
The workflow: initial prompt → AI generation → evaluation → refinement request → iteration until “it feels right.” The developer shifts from manual coding to guiding, testing, and giving feedback about AI-generated source code.
It prioritises development velocity over code correctness, relying on iterative refinement instead of upfront planning.
This creates technical debt through:
- Lack of refactoring (each iteration adds code without improving structure)
- Architectural inconsistency (session-based decisions without global coherence)
- Accumulated complexity from edge case over-specification
OX Security experimented with vibe coding a Dart web application. New features progressively took longer to integrate. The AI coding agent never suggested refactoring, resulting in monolithic architecture with tightly coupled components.
The trade-off: faster prototyping and feature implementation versus long-term maintainability costs and increased cognitive complexity.
A 2025 Pragmatic Engineer survey reported ~85% of respondents use at least one AI tool in their workflow. Most are doing some variation of vibe coding.
It’s best suited for rapid ideation or “throwaway weekend projects” where speed is the primary goal. For production systems, you need constraints. Learn how to transition your development team from vibe coding to context engineering for sustainable AI development.
Why do traditional code reviews miss AI-specific quality issues?
Traditional code review focuses on line-by-line inspection for syntax errors, style violations, and obvious bugs.
AI code appears syntactically correct and often has high unit test coverage, passing superficial review criteria. But traditional code review cannot scale with AI’s output velocity.
The numbers tell the story. Developers on teams with high AI adoption complete 21% more tasks and merge 98% more pull requests, but PR review time increases 91%.
Individual throughput soars but review queues balloon. This velocity gap forces teams into a false choice between shipping quickly and maintaining quality.
Reviewers miss:
- Subtle hallucinated code (non-existent functions that appear plausible)
- Architectural inconsistencies across files (context blindness symptoms)
- Cognitive complexity degradation (deeply nested logic that “works” but is unmaintainable)
AI can confidently invent a call to a library function that doesn’t exist, or use a deprecated API without warning. A human reviewer might assume the non-existent function is part of a newly introduced dependency, leading to broken builds.
You need a multi-layered review framework:
Layer 1: Automated Gauntlet (sketched after this framework)
- Linters
- Formatters
- Static analysis
- Security scanners
Layer 2: Strategic Human Oversight
- Architecture review
- Business logic verification
- Context-aware analysis
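Here is a hedged sketch of what Layer 1 can look like in practice, assuming a Python codebase with ruff (lint and format checks) and bandit (security scanning) installed; swap in whatever linters, formatters, and scanners your stack already uses.

```python
import subprocess
import sys

# Each command must exit 0 before a human reviewer gets involved
CHECKS = [
    ["ruff", "check", "."],              # linting
    ["ruff", "format", "--check", "."],  # formatting
    ["bandit", "-r", "src"],             # security scanning
]


def run_gauntlet():
    failed = []
    for command in CHECKS:
        result = subprocess.run(command)
        if result.returncode != 0:
            failed.append(" ".join(command))
    if failed:
        print("Automated gauntlet failed:", "; ".join(failed))
        sys.exit(1)
    print("Automated gauntlet passed; ready for human review.")


if __name__ == "__main__":
    run_gauntlet()
```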
When AI-driven productivity gains are paired with proper review processes, code quality improves in tandem: 81% of developers who use AI for code review saw quality improvements, versus 55% without AI review.
For practical implementation strategies, see our guide on building quality gates for AI-generated code.
How do GitHub Copilot, Cursor, and Claude Code differ in code quality?
All three tools exhibit the 10 anti-patterns. But context window size affects severity of context blindness and architectural inconsistency issues.
Context Window Comparison:
- Copilot: file-specific context
- Cursor Normal: 128K tokens
- Cursor Max: 200K tokens with dynamic scaling
- Claude Code: consistent 200K tokens
Use Case Strengths:
- Copilot: easy to begin using because it integrates with familiar IDEs, excels at inline completions and file-specific tasks
- Cursor: ideal for large-scale refactoring in GUI environment with familiar VS Code interface
- Claude Code: best for multi-step workflows, deep repository reasoning, and automation via CLI
Model Support:
- Copilot: OpenAI Codex with multi-model support (Claude 3 Sonnet, Gemini 2.5 Pro)
- Cursor: GPT, Claude, Gemini models
- Claude Code: built on Claude Sonnet/Opus models
GitHub Copilot remains the most widely adopted with approximately 40% market share and over 20 million all-time users.
Consider codebase size—larger projects benefit from bigger context windows. Think about workflow preference: GUI versus CLI, IDE-based versus terminal-first development.
Tool choice doesn’t eliminate anti-patterns but affects their severity and detectability. For a comprehensive comparison helping you select the right toolkit, read our analysis of comparing AI coding assistants and finding the right context engineering toolkit.
How can test-driven development constrain AI code quality?
TDD workflow with AI: Write test defining expected behaviour → Prompt AI to implement code satisfying the test → Run test to verify correctness → Refactor AI output if needed.
Tests act as constraints preventing anti-patterns:
Refactoring Avoidance: Tests force interface stability during restructuring. You can refactor implementation details while tests ensure behaviour stays consistent.
Edge Case Over-Specification: Tests define actual requirements, not hypothetical scenarios. If you didn’t write a test for OAuth integration, AI won’t add it.
Hallucinated Code: Non-existent functions fail tests immediately. No ambiguity.
TDD encourages smaller, focused functions that pass specific tests rather than monolithic implementations. This directly reduces cognitive complexity.
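A minimal sketch of that workflow in pytest style, using a hypothetical `apply_discount` function (the names and thresholds are illustrative): the test is written first and becomes the entire specification the AI is asked to satisfy.

```python
# test_pricing.py -- written by a human before any prompt is sent;
# these tests are the specification the AI must satisfy
from pricing import apply_discount


def test_ten_percent_discount():
    assert apply_discount(100.0, 0.10) == 90.0


def test_discount_cannot_exceed_full_price():
    assert apply_discount(50.0, 1.50) == 0.0
```

```python
# pricing.py -- the implementation the AI is asked to generate;
# it ships only when the tests above pass
def apply_discount(price: float, rate: float) -> float:
    """Return the price after applying a discount rate, never below zero."""
    return max(price * (1.0 - rate), 0.0)
```

Running pytest closes the loop: hallucinated helpers fail immediately, and anything not covered by a test stands out as unasked-for.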
Typical quality gate implementations include:
- Test coverage thresholds
- Complexity limits (cyclomatic ≤ 10, cognitive ≤ 15)
- Maintainability index of at least 70
- Zero high-severity vulnerabilities
- No hardcoded secrets
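Here is a hedged sketch of turning those thresholds into a single gate decision. The metric values would come from your coverage, complexity, and security tooling, and the coverage threshold is an example number, not one from the source.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    coverage: float             # percentage from your coverage tool
    max_cyclomatic: int         # from static analysis
    max_cognitive: int          # from static analysis
    maintainability_index: int
    high_severity_vulns: int
    hardcoded_secrets: int


def passes_quality_gate(m: Metrics) -> bool:
    """Apply the thresholds listed above; fail closed on any violation."""
    return (
        m.coverage >= 80.0               # example threshold; tune per team
        and m.max_cyclomatic <= 10
        and m.max_cognitive <= 15
        and m.maintainability_index >= 70
        and m.high_severity_vulns == 0
        and m.hardcoded_secrets == 0
    )


# Example: cyclomatic complexity of 12 violates the limit, so the merge is blocked
if not passes_quality_gate(Metrics(82.0, 12, 9, 75, 0, 0)):
    raise SystemExit("Quality gate failed: refactor before merging.")
```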
The trade-off: TDD slows initial development velocity compared to vibe coding but reduces technical debt accumulation and review burden.
Generating working code is no longer the challenge; making it production-ready is. AI copilots can quickly produce functional implementations, but speed often masks subtle flaws.
FAQ Section
What is “insecure by dumbness” in AI-generated code?
The phenomenon where non-technical users develop and deploy production applications without cybersecurity knowledge. Neither the developers nor their AI assistants possess the knowledge to identify which security measures to implement or how to remediate vulnerabilities. The resulting code is not insecure by malpractice or malicious intent, but insecure by ignorance.
How many security alerts are typical in AI-augmented development?
According to OX research, organisations were dealing with an average of 569,000 security alerts at any given time before AI adoption. With AI accelerating deployment velocity, the alert volume increases proportionally while remediation capacity remains constant, creating an unsustainable detection-led security approach.
Can I use AI code in production safely?
Yes, with appropriate guardrails: automated security scanning, static analysis for complexity metrics, and focused human review of architecture and business logic. Human review on every AI-generated pull request catches the logical flaws that automated scanning tools miss. The primary risk is deploying AI code faster than quality assurance can scale.
Why does AI code avoid refactoring?
AI implements prompts directly without considering existing code structure opportunities. It lacks the human developer instinct to recognise “this new feature would fit better if I restructured the existing authentication module first.” Each prompt generates additive code rather than integrative improvements, leading to 80-90% occurrence of refactoring avoidance.
What is the difference between cyclomatic and cognitive complexity?
Cyclomatic complexity counts linearly independent code paths, a structural metric. Cognitive complexity measures human comprehension difficulty by weighting nested control structures and complex logic patterns. Cognitive complexity evaluates how difficult it is to read and understand code, giving insight into maintainability.
How do I know if my team is doing vibe coding?
Indicators include: rapid iteration cycles with minimal upfront planning, acceptance of AI code with minor tweaks rather than architectural review, high deployment velocity with increasing bug reports, lack of refactoring in commit history, and developers describing workflows as “I asked the AI to add X and it worked”.
Which AI coding tool has the largest context window?
Claude Code and Cursor Max mode both offer 200K-token context windows. However, Claude Code maintains consistent capacity across sessions while Cursor may dynamically reduce tokens for performance. Copilot operates primarily on file-specific context, significantly smaller than repository-wide awareness tools.
What are hallucinated code patterns and how do I detect them?
Hallucinated code occurs when AI generates functions, methods, or APIs that appear plausible but don’t actually exist. The library itself is usually real, and the functionality sounds like it belongs there, but the call simply doesn’t exist. Detection requires systematically verifying that every function call references a real library method: use IDE error checking and validate against official documentation.
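A hedged sketch of that verification step for Python dependencies: import each module and confirm the referenced attribute actually exists. The `requests.fetch_json` entry is a deliberately invented example of a plausible-looking hallucination, and the check assumes the listed packages should be installed.

```python
import importlib

# Calls the AI produced, as (module, attribute) pairs to verify
SUSPECT_CALLS = [
    ("json", "loads"),           # real, standard library
    ("requests", "get"),         # real, if requests is installed
    ("requests", "fetch_json"),  # plausible-sounding, but does not exist
]

for module_name, attr in SUSPECT_CALLS:
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        print(f"{module_name}: not installed -- possible hallucinated dependency")
        continue
    if not hasattr(module, attr):
        print(f"{module_name}.{attr}: not found -- likely hallucinated")
```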
Can AI tools understand our existing architecture patterns?
AI tools can recognise patterns within their context window but lack persistent understanding across sessions. They may follow patterns in currently loaded files but won’t maintain architectural consistency when context exceeds token limits. This leads to context blindness and parallel implementations of existing functionality.
Should I worry about technical debt from AI coding?
Yes, but the concern is deployment velocity, not AI quality per se. AI code accumulates technical debt similarly to junior developer code through refactoring avoidance and edge case over-specification, but reaches production faster than traditional review can process. Implement automated quality gates and focused architectural review to manage this risk.
How much faster can I develop with AI coding assistants?
Development velocity varies by task complexity and tool proficiency. Research shows significant productivity gains, but OX Security research indicates AI enables code to reach production faster than human review capacity can scale. The bottleneck shifts from implementation speed to quality assurance throughput.
What metrics should I track for AI code quality?
Priority metrics: Cognitive complexity scores via SonarQube or similar static analysis, refactoring frequency in commit history to detect avoidance patterns, code duplication percentage as context blindness indicator, security alert volume and remediation time, and ratio of automated versus human-detected issues in review. Defect density in production reveals real-world reliability.