Why Security Awareness Training Is Not Enough to Stop AI Voice Fraud

Mar 5, 2026

AUTHOR

James A. Wondrasek

Here’s the number your security training budget can’t answer: humans detect deepfake audio approximately 48% of the time. Worse than a coin flip. For deepfake video it drops to around 24.5%. An arXiv systematic review examined more than twelve empirical studies and found the same failure pattern across different populations, languages, and voice cloning systems. An NIH-indexed Monash University review confirmed it — “human judgement of deepfake audio is not always reliable.”

The standard response to AI voice fraud is to schedule more security awareness training. On the surface, that seems reasonable. Training has improved employee behaviour against email phishing, and phishing is the category most organisations have put money into. But AI voice fraud is a different class of attack. It exploits a perceptual limitation, not a knowledge gap. You can train employees to understand what vishing is. You cannot train their auditory cortex to detect synthesis artefacts below the human perceptual threshold.

Vishing attacks surged 442% between the first and second halves of 2024 (CrowdStrike 2025 Global Threat Report). The ShinyHunters/Scattered Spider campaign compromised more than 760 organisations — including Google, Cisco, Wynn Resorts, and Harvard University — through IT helpdesk impersonation calls. Deepfake-enabled fraud losses are projected to reach $40 billion by 2027. Looking across the full AI voice fraud threat picture, the economics are very much in the attacker’s favour.

This article looks at the empirical limits of human detection, evaluates the technical controls available, and identifies where process controls need to pick up the slack when both fail.

Can trained employees actually detect a deepfake voice or video call?

No — not reliably. The research is consistent even when conditions favour the listener. In controlled laboratory settings with participants primed to listen for fakes, detection accuracy tops out at 60–73%. Watson et al. (2021) found an average detection accuracy of 42%. Barnekow et al. (2021) found participants correctly identified a cloned voice in only 37% of cases. Frank et al. (2024) found rates of 50–60%, described as “slightly above chance.”

The lab numbers are already bad. Real-world conditions make them worse.

A real vishing call gives the target a ringing phone in the middle of a workday, a caller claiming to be a senior executive or IT support, and a request loaded with urgency and authority. The arXiv study found meaningful degradation in detection accuracy under divided attention. Add time pressure and the social cost of being wrong about a legitimate caller, and whatever marginal advantage a trained employee had disappears completely.

For deepfake video it’s even worse. Human detection accuracy sits at approximately 24.5% (Keepnet Labs 2026). A 2025 iProov study found only 0.1% of participants correctly identified all fake and real media presented to them.

The accessibility of the attack makes it worse again. Modern voice cloning requires as little as three seconds of audio to generate a replica with 85% voice match accuracy. Executive voices aren’t hard to come by — earnings calls, conference presentations, podcast appearances, and LinkedIn video are all free training data for any attacker who bothers to look. The technical mechanisms behind voice cloning explain why the results are so perceptually convincing.

An employee who completes every training module, scores 100% on the vishing quiz, and genuinely understands how voice fraud works still has roughly even odds of detecting a high-quality voice clone under real call conditions. Training addresses what they know. It does not change what their ears can perceive.

Why does security awareness training fail specifically against AI voice fraud?

Training fails against AI voice fraud because it addresses a knowledge deficit while the attack exploits a perceptual limitation. These are different problems that require different solutions.

This isn’t a blanket attack on security awareness training. SAT genuinely works against email phishing. When the threat cues are visual, cognitive, and verifiable — check the sender domain, hover the link, look for urgency and grammatical anomalies — training improves detection. The employee can pause, examine the email, run through a checklist, and make a considered decision.

Voice fraud removes every one of those checkpoints. The call happens in real time. There’s no link to hover, no domain to inspect, no opportunity to pause and verify while the caller waits. The employee must decide under social pressure, in the moment, based on whether the voice sounds genuine. That’s an auditory perceptual task, not a cognitive knowledge task.

The residual risk numbers confirm it. Organisations running regular vishing simulation programmes find approximately 33% of trained employees still disclosing sensitive information under pressure despite strong warnings (Keepnet Labs). That’s not a training failure in the conventional sense — it’s a floor that further training doesn’t lower.

To be clear: training is not useless. It has real value as part of a defence stack — it teaches employees to follow verification protocols, escalate suspicious requests, and stay sceptical. The problem is treating it as the primary defence when the underlying detection capability is demonstrably below the threshold required.

What does detection technology actually achieve — and what are its limits?

Current detection technologies — liveness detection, behavioural biometrics, and AI voice analysis — each address a specific attack surface. None provides comprehensive real-time protection against AI voice fraud on its own.

Liveness detection requires real-time physical actions — turn your head, hold up three fingers, blink — to confirm a live human rather than a synthetic video feed. It works well against pre-recorded deepfake video. It’s less effective against real-time AI-generated feeds, which are improving fast. And it doesn’t apply to voice-only calls at all.
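To make that challenge-response flow concrete, here’s a minimal sketch of the protocol logic in Python. The computer-vision check that actually confirms the action is the hard part, and is assumed here as a caller-supplied verify_action function; this illustrates the flow, not a production design.

```python
import secrets
import time

# Physical actions the system can demand; unpredictability is the point.
CHALLENGES = ["turn head left", "turn head right", "blink twice", "hold up three fingers"]

def run_liveness_check(verify_action, rounds=3, timeout_s=5.0):
    """Issue random physical challenges; every one must be performed live, in time.

    verify_action(challenge) is a hypothetical computer-vision check that
    watches the video feed and returns True once the action is observed.
    """
    for _ in range(rounds):
        challenge = secrets.choice(CHALLENGES)
        issued_at = time.monotonic()
        performed = verify_action(challenge)
        if not performed or (time.monotonic() - issued_at) > timeout_s:
            return False  # wrong action or too slow: likely pre-recorded
    return True
```

The randomised challenge plus the timeout is what defeats pre-recorded video. A real-time generative feed that can render arbitrary actions on demand sidesteps both, which is exactly the limitation noted above.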

Behavioural biometrics analyses micro-patterns of user behaviour — keystroke dynamics, scroll speed, device handling — to distinguish genuine users from synthetic fraud. The accuracy figures are impressive: 98.7% against synthetic identity fraud (Innovify/BIIA data). The catch is that this applies to digital session analysis. It won’t tell you the caller claiming to be your CFO is synthetic.
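For a flavour of what those micro-patterns look like in code, here’s a minimal single-modality sketch: keystroke dwell and flight times compared against a user’s stored baseline. The 98.7% figure belongs to commercial four-modal fusion systems; this simplified z-score version is purely illustrative.

```python
from statistics import mean, stdev

def keystroke_features(events):
    """events: list of (key, press_time, release_time) for one typing session."""
    dwells = [release - press for _, press, release in events]  # how long each key is held
    flights = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]  # gaps between keys
    return mean(dwells), mean(flights)

def anomaly_score(session_events, baseline_sessions):
    """Crude z-score distance from the user's historical typing rhythm.

    Assumes at least two baseline sessions, each with several keystrokes.
    """
    base = [keystroke_features(e) for e in baseline_sessions]
    base_dwell = [d for d, _ in base]
    base_flight = [f for _, f in base]
    dwell, flight = keystroke_features(session_events)
    z_dwell = abs(dwell - mean(base_dwell)) / (stdev(base_dwell) or 1e-9)
    z_flight = abs(flight - mean(base_flight)) / (stdev(base_flight) or 1e-9)
    return max(z_dwell, z_flight)  # higher = less like the genuine user
```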

AI voice detection tools are an emerging category. Commercial tools claiming 96–98% accuracy in laboratory conditions drop to 50–65% in real-world deployments. Research by CSIRO found that leading tools collapsed below 50% accuracy when confronted with deepfakes produced by systems they hadn’t been trained on. The adversarial adaptation cycle is ongoing.
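The generalisation failure CSIRO observed is straightforward to test for. Here’s a hedged sketch of that evaluation, assuming a hypothetical detector function that returns a probability the clip is synthetic; the point is to report accuracy per generating system, including systems the detector never saw in training.

```python
def accuracy_by_generator(detector, labelled_clips, threshold=0.5):
    """labelled_clips: iterable of (audio_clip, is_synthetic, generator_name).

    detector(clip) is a hypothetical scoring function returning P(synthetic).
    Reporting per-generator accuracy is what exposes the drop from lab
    figures to near coin-flip performance on unseen voice cloning systems.
    """
    buckets = {}
    for clip, is_synthetic, generator in labelled_clips:
        predicted = detector(clip) >= threshold
        correct, total = buckets.get(generator, (0, 0))
        buckets[generator] = (correct + (predicted == is_synthetic), total + 1)
    return {g: correct / total for g, (correct, total) in buckets.items()}
```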

Detection technology adds a complementary layer, not a replacement for process controls. The gap between detecting synthetic account activity at 98.7% and detecting a live synthetic voice call in real time remains substantial — and unsolved at commercial scale.

Hardware MFA vs SMS-based MFA — which resists voice social engineering?

Hardware-key MFA (FIDO2/passkeys) resists voice social engineering structurally, not probabilistically. SMS-based MFA fails because it was never designed for an adversary who can hold a convincing real-time conversation.

In a vishing call, the attacker already has valid credentials. The only barrier left is the MFA step. They trigger an MFA prompt, then instruct the target to read the SMS code aloud to “verify your identity.” The target reads the code. Credential compromised. No malware required, no technical sophistication — just a convincing voice and a cooperative target.

MFA fatigue (prompt bombing) is the push-notification equivalent. The attacker floods the target with approval requests until one gets approved out of frustration. The Uber breach of September 2022 followed this exact pattern. MFA fatigue accounts for 14% of security incidents in the 2025 Verizon DBIR — a rising SMB threat because push MFA is the default for many cloud services.

FIDO2 hardware keys eliminate these attack vectors at the cryptographic layer. The private key never leaves the hardware authenticator. Authentication challenges are bound to a specific domain — a fake login page cannot receive a valid authentication response because the domain binding fails automatically. There’s nothing to read aloud, no push notification to approve, no code to share.
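Here’s a minimal sketch of the relying-party checks behind that domain binding, following the structure of a WebAuthn assertion. The origin value is hypothetical, and signature verification is assumed to be handled by a verify_signature wrapper around the user’s registered public key.

```python
import hashlib
import json

EXPECTED_ORIGIN = "https://app.example.com"  # hypothetical relying-party origin

def verify_assertion(client_data_json: bytes, authenticator_data: bytes,
                     signature: bytes, verify_signature) -> bool:
    client_data = json.loads(client_data_json)
    # Domain binding: the browser, not the user, writes this origin field.
    # A lookalike phishing domain produces the wrong value here, so the
    # assertion fails automatically with no judgement call by the employee.
    if client_data.get("origin") != EXPECTED_ORIGIN:
        return False
    if client_data.get("type") != "webauthn.get":
        return False
    # (A real relying party also checks that client_data["challenge"]
    # matches the challenge it issued for this login attempt.)
    # The private key never left the hardware; it signed this exact payload,
    # so there is no code to read aloud and nothing for a caller to extract.
    signed_payload = authenticator_data + hashlib.sha256(client_data_json).digest()
    return verify_signature(signed_payload, signature)
```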

Google’s deployment of FIDO security keys across 85,000+ employees produced zero successful phishing attacks. Microsoft has extended phishing-resistant MFA to 92% of its workforce. CISA designates FIDO2/WebAuthn as one of only two approved phishing-resistant authentication implementations.

The SMB cost concern is manageable. Hardware security keys cost $25–50 per unit. For a 200-person company, you’re not doing full coverage straight away — you’re protecting the 20–30 highest-risk users: executives, finance team, IT administrators, and helpdesk staff. That targeted deployment costs $500–1,500 and eliminates the highest-value attack surface. One important operational note: legacy authentication fallbacks need to go once hardware keys are deployed. If SMS codes remain as a backup option, attackers will use them.
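The arithmetic behind that range, for anyone budgeting the rollout:

```python
# Back-of-envelope for the targeted deployment described above.
key_cost_usd = (25, 50)   # per hardware key
users = (20, 30)          # executives, finance, IT admins, helpdesk
low, high = users[0] * key_cost_usd[0], users[1] * key_cost_usd[1]
print(f"Targeted deployment: ${low}-${high}")  # $500-$1500
# Note: issuing a backup key per user is common practice and doubles this.
```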

What can vishing simulations realistically do for your organisation?

Vishing simulations test whether employees follow verification protocols under pressure. That’s valuable — and it’s also the full extent of what they can achieve. Simulations can’t make employees better at detecting synthetic voices, because that’s not a trainable skill at the accuracy levels the threat requires.

Well-run simulation programmes can achieve up to 90% attack recognition rates as measured by protocol adherence (Keepnet Labs). But that same data shows 33% of trained employees still disclosing information despite strong warnings — a floor that repeated simulation doesn’t lower. The detection gap persists because the problem is perceptual, not procedural.

There’s also a negative-return risk in detection-focused training. Research found that deepfake detection training improved accuracy by around 20%, but participants also became more anxious and less confident — measurable psychological cost without a corresponding improvement in practical outcomes. Some participants overestimated their detection capability, creating false confidence that actually weakened overall verification practices.

The right framing for simulation programmes is as process-gap identification tools. Run them quarterly, targeting finance and helpdesk teams specifically — these are the people handling the highest-value requests. Measure protocol adherence: did the employee follow the callback procedure? Did they escalate the request? Use results to fix process gaps, not to assess whether employees can spot a synthetic voice.
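A sketch of what that measurement could look like, with illustrative field names. Note that the scoring function never asks whether the employee spotted the fake.

```python
from dataclasses import dataclass

@dataclass
class SimulationResult:
    """Hypothetical record of one simulated vishing call."""
    team: str              # e.g. "finance", "helpdesk"
    followed_callback: bool
    escalated: bool
    disclosed_info: bool

def adherence_report(results):
    """Protocol adherence rate per team -- the metric that matters."""
    by_team = {}
    for r in results:
        adhered = (r.followed_callback or r.escalated) and not r.disclosed_info
        ok, total = by_team.get(r.team, (0, 0))
        by_team[r.team] = (ok + adhered, total + 1)
    return {team: ok / total for team, (ok, total) in by_team.items()}
```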

If training and detection aren’t enough, what fills the gap?

Process controls — specifically out-of-band verification, dual-authorisation procedures, and structured escalation protocols — are the compensating layer where human and technical detection both have documented failure modes.

Callback verification is the highest-leverage single vishing prevention control, identified consistently by security researchers including Vectra. Any sensitive request received by phone — wire transfer, credential reset, access change — must be independently verified by hanging up and calling back on a pre-registered, separately confirmed number. The mechanism is channel disruption: the attacker controls the incoming call; callback verification transfers control to the target. An attacker impersonating the CEO cannot receive calls on the CEO’s pre-registered number. The attack breaks at the channel layer without requiring the target to detect anything.
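Reduced to code, the control is almost trivial, which is part of its appeal. This sketch assumes a directory of pre-registered numbers maintained out-of-band and a hypothetical place_call helper; the one invariant that matters is that the number never comes from the inbound call.

```python
# Maintained out-of-band (e.g. seeded from the HR system), never from a caller.
PREREGISTERED_NUMBERS = {
    "ceo@example.com": "+61255501234",
    "cfo@example.com": "+61255505678",
}

def verify_sensitive_request(claimed_identity: str, place_call) -> bool:
    """Hang up, then call back on the directory number.

    place_call(number) is assumed to dial out and return True only if the
    named person confirms the request on that call.
    """
    number = PREREGISTERED_NUMBERS.get(claimed_identity)
    if number is None:
        return False  # unknown identity: escalate, never proceed
    # Channel disruption: we originate this call, so the attacker can't answer it.
    return place_call(number)
```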

The Arup case shows exactly why this matters. A finance worker was tricked into wiring $25 million in a deepfake video conference call while actively attempting to verify the request. But verification happened within the channel the attacker controlled. Out-of-band verification would have broken the attack regardless of how convincing the deepfake was.

Dual-authorisation for financial transfers eliminates the single point of failure that CEO fraud exploits. Wire transfer instructions above a defined threshold require two separate authorisations from different individuals. The attacker who successfully impersonates the CEO — bypassing detection, bypassing training — still cannot authorise a transfer unilaterally.
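A minimal sketch of the rule, with an illustrative threshold. Using a set of approver identities is a cheap way to enforce that the two authorisations really come from different people.

```python
THRESHOLD_USD = 10_000  # illustrative; set per organisation and per account

def authorise_transfer(amount_usd: float, approvals: set[str]) -> bool:
    """Above the threshold, two distinct approvers are required.

    An attacker who perfectly impersonates one executive still controls
    only one approval, so the transfer cannot proceed unilaterally.
    """
    required = 2 if amount_usd >= THRESHOLD_USD else 1
    return len(approvals) >= required
```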

The complete defence stack: security awareness training (verification protocols, escalation behaviour) + technical controls (behavioural biometrics for session-layer fraud, phishing-resistant MFA for credential attacks) + process controls (callback verification, dual authorisation). No single component is sufficient. Together, they are resilient because no single failure cascades to a catastrophic outcome.

The threat is scaling. Vishing surged 442%. Financial services organisations face average losses of $603,000 per deepfake incident. The default response — train your people harder — is empirically insufficient against an attack that defeats human detection at the perceptual level. This is one piece of the wider AI social engineering landscape — from attack mechanics through to legal exposure — that demands a layered response. Building that process layer without a dedicated security team is what we look at next.

Frequently Asked Questions

Can humans tell the difference between a real voice and an AI-generated voice?

Not reliably. Multiple studies find human detection of deepfake audio ranges from 37–73% in controlled laboratory settings. Under real-world conditions — divided attention, authority pressure, urgency — detection rates fall further. A 2025 iProov study found only 0.1% of participants correctly identified all fake and real media. Training improves awareness of vishing as a threat category but does not meaningfully improve auditory discrimination.

How much audio does an attacker need to clone someone’s voice?

Three seconds of clear audio is sufficient to create a voice clone with an 85% voice match to the original speaker. Higher-quality clones require only 10–30 seconds. Sources of executive audio include quarterly earnings calls, conference presentations, podcast appearances, LinkedIn videos, and media interviews — all publicly accessible.

What is MFA fatigue and how does it work?

MFA fatigue, also called prompt bombing or push bombing, is an attack where an adversary with valid credentials repeatedly triggers push-notification MFA approval requests until the target approves one out of frustration. No technical bypass is required — only persistence. The attack often includes a vishing call impersonating IT support: “Approve the MFA notification to stop the alerts.” MFA fatigue accounts for 14% of security incidents per the 2025 Verizon DBIR.

Is security awareness training completely useless against AI voice fraud?

No. Training has genuine value as part of a layered defence — it teaches employees to follow verification protocols, escalate suspicious requests, and maintain scepticism under pressure. The argument is that training alone is insufficient as the primary defence because it cannot overcome the perceptual limitation that places human detection accuracy below reliable thresholds. Thirty-three percent of trained employees still disclose sensitive information under vishing pressure despite strong warnings — a floor that further training does not lower.

What is the difference between FIDO2 and SMS-based MFA?

SMS MFA sends a one-time code that can be read aloud during a vishing call, intercepted via SIM swapping, or captured through SS7 protocol exploitation. FIDO2 uses asymmetric cryptography bound to a specific domain and hardware device: the private key never leaves the authenticator, authentication fails automatically on phishing sites, and there is no code to read aloud. CISA designates FIDO2/WebAuthn as one of only two approved phishing-resistant authentication implementations.

How effective is behavioural biometrics at detecting deepfake fraud?

Behavioural biometrics achieves 98.7% accuracy against synthetic identity fraud using four-modal fusion analysis. This high accuracy applies to digital session analysis — detecting automated or synthetic account activity. Its applicability to real-time voice call fraud detection is limited because the analysis operates at the device interaction layer, not the voice layer. It will not detect a synthetic caller in a live conversation.

What is callback verification and why is it important?

Callback verification is a procedural control requiring that any sensitive request received by phone be independently verified by hanging up and calling back on a pre-registered, separately confirmed number. It breaks the attacker’s control of the communication channel — the mechanism that makes vishing impersonation viable. Security researchers identify it as the single highest-leverage vishing prevention control because it works regardless of how convincing the attacker is.

How often should we run vishing simulations?

Quarterly is the recommended cadence, targeting finance and helpdesk teams as the employees handling the highest-value requests. Measure protocol adherence — did the employee follow the callback procedure, escalate the request, resist authority pressure? — rather than detection accuracy. Use results to fix process gaps, not to grade employees on their ability to detect synthetic voices.

What did the ShinyHunters/Scattered Spider campaign demonstrate about vishing risk?

The campaign compromised more than 760 organisations through IT helpdesk impersonation vishing calls targeting SSO credentials. Confirmed victims included Google, Cisco, Wynn Resorts, and Harvard University. Operators were paid $500–$1,000 per successful call using pre-written scripts. It demonstrated the scale at which vishing can be industrialised — compromising organisations with established security programmes.

How much does it cost to deploy FIDO2 hardware keys for an SMB?

Hardware security keys cost $25–50 per unit. For a 200-person company, the initial deployment covers 20–30 high-risk users — executives, finance team members, IT administrators, and helpdesk staff. This targeted deployment costs $500–1,500 and addresses the highest-value attack surface. Full workforce deployment can follow on a longer timeline as budget allows.
