Mar 20, 2026

Vibe Citing and the Collapse of Peer Review at the World’s Top AI Conference

By James A. Wondrasek

At NeurIPS 2025 — the world’s most prestigious AI venue — GPTZero found 100 fabricated citations in 51 accepted papers. Peer review caught none of them.

GPTZero is an AI content detection company. They scanned 4,841 accepted papers from the Conference on Neural Information Processing Systems and came up with a name for what they found: “vibe citing.” This article is part of our series on what AI slop is and where it shows up — and academic peer review is now one of its most consequential vectors.

If the research your team uses to evaluate AI tools and benchmark vendor claims contains fabricated evidence, your decisions are built on sources that don’t exist. Here’s how it happened, why peer review failed, and what you should actually do about it.

What is vibe citing — and who coined the term?

Vibe citing is when AI-hallucinated citations end up in academic papers. References that look plausible but point to works that don’t exist. Invented authors, fabricated titles, fake DOIs, arXiv IDs pointing nowhere.

GPTZero’s Head of Machine Learning, Alex Adams, coined the term as a riff on “vibe coding” — Andrej Karpathy’s name for AI-assisted programming by feel rather than comprehension. The researcher isn’t reading and synthesising sources. They’re letting an LLM generate plausible-sounding references and calling it done.

That distinction from ordinary citation errors matters. A typo in a page number can be checked against the real paper. Vibe citations reference papers that don’t exist at all. The LLM generates syntactically correct, genre-appropriate references — author names that could be real, titles that sound like legitimate ML papers, correctly formatted venue identifiers. They read as legitimate until you actually look them up.

This isn’t plagiarism. It isn’t data fabrication. It’s a third category of research misconduct, enabled by LLMs at scale, and invisible to the human eye.

What did GPTZero find in the NeurIPS 2025 papers?

The numbers are precise. GPTZero’s Hallucination Check tool scanned 4,841 of the 5,290 papers accepted by NeurIPS 2025 and found 100 confirmed hallucinated citations across 51 papers.

NeurIPS 2025 received 21,575 submissions and accepted 5,290 — a 24.52% acceptance rate. Each of those 51 affected papers cleared a competitive bar and still went out with fabricated sources.

The University of Chester’s arXiv paper (2602.05930) breaks hallucinated citations into five failure categories. Total Fabrication accounts for 66% of cases: the entire citation invented from scratch. Partial Attribute Corruption (27%) blends real elements with fabricated ones. Identifier Hijacking (4%) uses a valid DOI that points to an unrelated paper. Placeholder Hallucination (2%) covers obvious template failures like “Firstname Lastname”, and Semantic Hallucination (1%) rounds out the list.

The most significant finding: 100% of hallucinated citations exhibited multiple failure modes simultaneously. That’s what makes them so hard to catch. They defeat several verification checks at once, not a single obvious one.

GPTZero’s Hallucination Check verifies citations against Google Scholar, PubMed, arXiv, CrossRef, and DOI/URL validation — a multi-database cross-reference no human reviewer performs routinely. The tool catches 99 out of 100 flawed citations.
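GPTZero hasn’t published the tool’s internals, but the core of any such check is straightforward to sketch. Here is a minimal Python example against CrossRef’s public REST API, which returns HTTP 200 for registered DOIs and 404 for identifiers it has never seen. The helper name and sample DOIs are ours, not GPTZero’s implementation:

```python
import requests

def doi_exists(doi: str) -> bool:
    """Return True if CrossRef has a record for this DOI.

    CrossRef answers 200 for registered DOIs and 404 for unknown
    ones, which is the signal a fabricated identifier cannot produce.
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# Check every DOI in a reference list, not just the suspicious ones.
references = [
    "10.1038/s41586-021-03819-2",   # real: the AlphaFold Nature paper
    "10.9999/fabricated.2023.001",  # hypothetical fabricated identifier
]
for doi in references:
    print(doi, "->", "found" if doi_exists(doi) else "NOT FOUND")
```

A production pipeline would add the other databases, rate limiting and retries, but even this single lookup fails any DOI that was invented outright.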

The trend line matters as much as the point-in-time finding. A December 2025 pre-print found the average number of objective mistakes per NeurIPS paper grew from 3.8 in 2021 to 5.9 in 2025 — a 55.3% increase that tracks directly with ChatGPT’s launch in November 2022.

How did fabricated citations get through peer review?

Peer reviewers are domain experts. Their job is to evaluate whether research claims hold up — not to audit citations. Nobody in the NeurIPS review process is formally tasked with verifying that every referenced work actually exists.

NeurIPS submissions more than doubled from 2020 to 2025, climbing 128% from 9,467 to 21,575. GPTZero calls this the “submission tsunami.” A typical reviewer handles four to eight papers per cycle, each with 30 to 60 references. Manually verifying hundreds of citations per cycle isn’t feasible. And vibe citations defeat visual inspection — correct journal name formats, plausible author combinations, appropriate venues for the claimed year.

The quality failure runs in both directions. At ICLR 2026, authors withdrew papers after discovering their reviewers had used AI to write feedback. NeurIPS launched its Responsible Reviewing Initiative in 2025, acknowledging the problem — but it didn’t prevent the hallucinated citations. The structural conditions remain.

Is this an isolated incident or a growing pattern?

NeurIPS isn’t alone. Before the NeurIPS investigation, GPTZero had already identified more than 50 hallucinated citations in papers submitted to ICLR 2026. GPTZero names ICLR, NeurIPS, ICML, and AAAI as the top four ML and AI conferences — all facing the same pressures.

The Reuters Institute’s 2026 report frames academic contamination as part of a broader AI content integrity problem. And it goes well beyond academia. GPTZero detected citation errors in the US MAHA report within a week of its release. GPTZero’s analysis of a 234-page Deloitte Australia report found 19 hallucinations in 141 citations — the case that ended in a $98,000 AUD refund.

The structural driver is publication pressure combined with paper mills; an LLM now fills the ghostwriting role faster than any previous method, and its output is harder to detect.

The long-term risk is propagation through the citation graph. Future papers citing contaminated papers inherit corrupt evidence chains.

The same problem is showing up in courtrooms

More than 800 errant legal citations attributed to AI have been flagged in US court filings, with attorney sanctions following.

The structural parallel is obvious. Judges and opposing counsel aren’t expected to proactively verify every cited case — the same mismatch as peer review. The same root cause (LLM hallucination producing correctly formatted references pointing to nothing real) produces the same failure in both contexts.

The legal community has moved faster to enforce consequences — sanctions, mandatory disclosure in some jurisdictions — and that trajectory is worth watching as a signal for where academic responses are likely to follow.

Why this matters if you rely on AI research to make technical decisions

Here’s the specific risk. You’re evaluating a vendor’s benchmark claims. You pull an academic paper to calibrate your evaluation. The methodology might be sound — but if the literature review contains hallucinated citations, the supporting evidence base is fabricated. You’re building your evaluation on sources that don’t exist.

Research papers inform architectural choices, model capability assessments, and stakeholder briefings. Each of those has a research dependency that may be compromised at the source.

The practical response is calibrated scepticism, not blanket dismissal. Check whether the paper was submitted to a venue with automated citation verification. Hallucinated citations cluster in literature reviews, not methodology — so papers with reproducible code carry lower risk. Run suspicious citations through Google Scholar or CrossRef. For a systematic look at evaluating AI detection tools for research and training data, including their reliability limits and where human review remains necessary, see our dedicated guide.
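That last manual check can be scripted. CrossRef’s works endpoint supports bibliographic queries, so you can ask whether anything resembling the cited title exists at all. A minimal sketch (the endpoint and the query.bibliographic parameter are real CrossRef API features; the helper name and example title are ours):

```python
import requests

def closest_titles(cited_title: str, rows: int = 3) -> list[str]:
    """Ask CrossRef for the works that best match a cited title.

    If nothing resembling the citation comes back, escalate to a
    manual check before relying on the reference.
    """
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": cited_title, "rows": rows},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return [item["title"][0] for item in items if item.get("title")]

# Example: a cited title you couldn't resolve via its DOI or arXiv ID.
for title in closest_titles("Attention Is All You Need"):
    print(title)
```

If the top matches bear no resemblance to the citation, treat the reference as unverified.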

There’s an irony worth naming. The AI tools generating research papers are being evaluated, in part, by research partly generated by those same tools. The circularity compounds at every layer.

The same hallucination mechanism behind vibe citing is what drives model collapse in AI training pipelines — when synthetic content is recursively fed back into future training runs, the degradation compounds at every cycle.

Frequently asked questions

What exactly is vibe citing?

Vibe citing is when AI language models generate academic citations without verifying the referenced works actually exist. The term was coined by GPTZero’s Alex Adams, riffing on “vibe coding.” These aren’t minor formatting errors — they’re wholesale inventions that happen to look syntactically correct.

Is all of NeurIPS 2025 compromised?

No. GPTZero found 100 hallucinated citations in 51 of 4,841 papers — approximately 1.05% of papers. The NeurIPS Board noted that incorrect references don’t necessarily invalidate the paper content. The concern is that the contamination is invisible to standard reading and review.

Can peer review be fixed to catch AI-generated citations?

The structural fix is automated citation verification at the submission stage, before peer review begins — similar to how plagiarism checkers now operate. ICLR has begun requiring disclosure and is coordinating with GPTZero. Policy statements without mandatory automated checking aren’t going to cut it.

What is the difference between vibe citing and just making a mistake?

Ordinary citation errors can be checked against the real paper. Vibe citations reference papers that don’t exist. GPTZero’s methodology excludes obvious spelling mistakes and dead URLs as plausibly human — vibe citing is specifically AI-generated, holistic fabrication.

How do hallucinated citations actually look in a paper?

A typical vibe citation reads like this: an author name that could be real, a title that sounds like a plausible ML paper, a venue such as “NeurIPS 2023” or “ICLR 2022,” and a DOI or arXiv ID that either leads nowhere or points to an unrelated paper. Total Fabrication (66% of cases) involves the entire reference being invented; Partial Attribute Corruption (27%) blends real elements with fabricated ones; Identifier Hijacking (4%) attaches real DOIs to wrong papers — and 100% of cases exhibit multiple failure modes simultaneously.
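To make the failure concrete: identifier syntax carries no existence guarantee, so a fabricated reference can pass every visual and format check. A small Python illustration, using simplified patterns and identifiers that are hypothetical fabrications:

```python
import re

# Simplified patterns for new-style arXiv IDs and DOIs. These test
# format only, which is exactly why visual inspection fails: a
# fabricated identifier can be perfectly well-formed.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
DOI = re.compile(r"^10\.\d{4,9}/\S+$")

# Hypothetical fabricated identifiers, both format-valid:
print(bool(ARXIV_ID.match("2301.99999")))        # True, but may point nowhere
print(bool(DOI.match("10.9999/fake.2023.001")))  # True, but unregistered
```

Both match, which is the point: only a lookup against a live index can separate well-formed from real.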

What is GPTZero’s Hallucination Check tool?

GPTZero’s Hallucination Check is an automated citation verification service that checks references against Google Scholar, PubMed, arXiv, CrossRef, and DOI/URL validation databases. It catches 99 out of 100 flawed citations and was the instrument used to scan 4,841 NeurIPS 2025 papers.

Why does the submission volume at NeurIPS matter?

NeurIPS submissions more than doubled, from 9,467 in 2020 to 21,575 in 2025 (a 128% increase), stretching reviewer capacity across more papers and a less experienced reviewer pool. Citation verification — never formally required — becomes even less likely under that load.

Are AI-written reviews by peer reviewers also a problem?

Yes. At ICLR 2026, authors discovered their reviewers had used AI to write feedback, leading to paper withdrawals. The failure runs in both directions.

Does this problem only affect AI conferences?

No. The same hallucination mechanism produced 800+ fabricated legal citations in US court filings with attorney sanctions, errors in the US MAHA government report, and Deloitte’s $98,000 AUD refund. The consistent cross-domain pattern confirms this is an LLM deployment issue, not an academia-specific one.

How should you assess whether a paper’s citations are trustworthy?

Check whether the paper was submitted to a venue with automated citation verification. Papers with reproducible code and experimental results carry lower risk — fabricated citations tend to concentrate in literature review sections, not methodology. Run citations that look suspicious through Google Scholar or CrossRef manually.

Will papers with hallucinated citations be retracted from NeurIPS 2025?

NeurIPS’s LLM policy designates hallucinated citations as grounds for revocation — but enforcement for already-accepted papers is less clear than pre-acceptance detection. ICLR’s policy is explicit about rejection. Post-publication correction in conference proceedings is structurally harder than in journals.

Is this an academic problem or does it affect technology decisions directly?

If vendor benchmark claims are supported by citations from contaminated papers, the evidence base for your technology decisions is compromised. The Deloitte case is one documented instance. As AI research informs more procurement decisions, the contamination risk moves upstream into technology governance. For the full scope of the AI slop problem — from content farms and search degradation through to model collapse and strategic response — our overview covers the full landscape.
