How AI Voice Cloning and Deepfake Technology Actually Works

Mar 5, 2026
James A. Wondrasek
In January 2024, an employee at Arup in Hong Kong joined a video call with what appeared to be his CFO and several senior colleagues. He authorised 15 wire transfers. The total: $25.6 million USD. Every person on that call was a deepfake.

Most coverage of this topic stops at “AI can clone voices now.” That’s not useful. It doesn’t tell you how the pipeline works — from a 30-second clip pulled off a podcast to a live fraudulent call that bypasses every intuitive check a victim has. Without understanding the mechanics, you can’t make an honest assessment of the risk.

Voice cloning is one layer. Deepfake video, caller ID spoofing, and Dark LLMs scripting the conversation combine into an attack package that requires no specialist skills. This article walks through each layer — including the detection accuracy data that explains why human vigilance alone doesn’t hold up.

For the broader strategic context, read what every business needs to know about AI-enabled social engineering threats.

What is AI voice cloning and how does the technology actually work?

AI voice cloning is a machine learning technique. It analyses the acoustic characteristics of a person’s speech — pitch, cadence, tone, rhythm, filler word patterns — and uses that to generate synthetic speech that sounds like them. Feed the model a clean audio sample, give it text input, and it outputs speech in the target’s voice.

The architecture works by converting raw audio into spectrograms — visual representations of sound frequencies over time. An encoder-decoder model (or a diffusion-based system) learns the unique vocal fingerprint in those spectrograms. NCC Group’s research describes this as disentangling linguistic content (what is being said) from identity content (who is saying it). The model holds the “who” constant while generating whatever “what” the attacker needs.
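
To make the spectrogram step concrete, here is a minimal sketch in Python using the librosa audio library. The sine wave is a stand-in for real speech, and the 80-band mel layout is a common choice rather than a fixed standard.

```python
import numpy as np
import librosa

sr = 16000                                   # sample rate in Hz
t = np.linspace(0, 1.0, sr, endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 220 * t)    # stand-in for 1 s of speech

# 80 mel bands of frequency-vs-time: the "image" an encoder-decoder ingests
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)                         # (80, n_frames)
```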

Three distinct capabilities get lumped under "voice cloning," and it's worth separating them:

Text-to-speech cloning: the attacker types a script and the model reads it out in the target's voice.
Speech-to-speech conversion: a recording of the attacker speaking is re-voiced as the target.
Real-time voice conversion: the same re-voicing applied live, with latency low enough to hold a conversation.

For live vishing calls, real-time voice conversion is the variant that matters — the attacker can respond, adapt, and improvise. Cloud infrastructure is what changed the threat landscape. You don’t need to understand the ML. You need an audio sample and a subscription.

How much audio does an attacker need to clone someone’s voice?

Far less than most people assume. McAfee research found that 3 seconds of audio produces an 85% voice match. ThreatLocker puts high-fidelity cloning at 30 seconds. NCC Group trained a convincing clone from just a few minutes of publicly available samples. These aren’t contradictions — they reflect different quality thresholds on a sliding scale.

Where does that audio come from? No hacking required. PurpleSec’s analysis of the Arup attack spells out the sources: LinkedIn profile videos, conference presentations, earnings calls, interview recordings, corporate marketing videos. For any executive who has done a podcast, presented at a conference, or posted a video — that material already exists and is accessible to anyone with a browser.

Audio quality matters more than duration. Clean, isolated speech is more valuable than hours of recording in a crowded room. The Biden deepfake robocall used to disrupt the 2024 New Hampshire primary cost $1 and took less than 20 minutes.

What is the voice clone creation pipeline from source audio to live fraudulent call?

This is the piece most coverage skips. Here is the actual pipeline, stage by stage.

Stage 1 — Source Audio Collection. The attacker identifies targets — typically financial decision-makers and the executives who’d plausibly contact them — and harvests audio from public sources. No credentials required.

Stage 2 — Audio Isolation and Preprocessing. Raw audio is cleaned using noise-gating, equalisation, and background removal. NCC Group’s team partly automated this with ML-based speaker identification. Output: isolated, clean speech in the target’s voice.
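
As a rough illustration of the noise-gating idea, here is a minimal sketch: frames whose energy falls below a threshold are silenced, leaving isolated speech. The frame length and threshold are illustrative assumptions; real preprocessing chains, including NCC Group's automated speaker-identification step, are considerably more involved.

```python
import numpy as np

def noise_gate(audio: np.ndarray, frame_len: int = 512,
               threshold: float = 0.01) -> np.ndarray:
    """Silence any frame whose RMS energy falls below the threshold."""
    out = audio.copy()
    for start in range(0, len(out), frame_len):
        frame = out[start:start + frame_len]
        if np.sqrt(np.mean(frame ** 2)) < threshold:
            out[start:start + frame_len] = 0.0   # gate the quiet frame
    return out
```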

Stage 3 — Neural Network Training. The cleaned audio is converted to spectrograms. A self-supervised model extracts the voice embedding — a mathematical fingerprint of the target’s vocal identity — which is then used to fine-tune a pre-existing synthesis model. With cloud GPU instances, this step drops from hours to minutes.
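
"Voice embedding" is easier to grasp as code: a fixed-length vector per speaker, compared by cosine similarity. This sketch fakes the vectors with random numbers purely for illustration; in a real system a trained speaker encoder produces them, and the 256-dimension size is an assumption.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb_target = rng.normal(size=256)                     # target's fingerprint
emb_sample = emb_target + 0.1 * rng.normal(size=256)  # a near-identical voice

print(f"{cosine_similarity(emb_target, emb_sample):.3f}")  # close to 1.0
```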

Stage 4 — Synthetic Speech Generation. In text-to-speech mode, the attacker types a script and generates audio in the target’s voice. In real-time speech-to-speech mode, the attacker speaks and the output stream transforms their voice live.

Stage 5 — Deployment in Live Attack. The cloned voice is routed through a virtual audio device into the attack channel — a phone call via VoIP, or directly into Microsoft Teams or Google Meet. Caller ID spoofing displays the target’s real number on the victim’s screen. The victim sees the correct number, hears the familiar voice, and has no instinctive reason to doubt either signal.

The entire pipeline can be run by a single person with no ML background. Understanding how cheap the toolchain has become is worth its own read.

How do deepfake video calls work — and can entire meetings be faked?

Yes. The Arup attack proved it operationally.

Attackers collect executive video footage from public sources, then train face-swap and lip-sync models. Tools like DeepFaceLab are open source. PurpleSec notes that realistic deepfakes can be generated in approximately 45 minutes. GANs or diffusion models synthesise facial movements; lip-sync models match mouth movements to the cloned audio; a virtual camera driver injects the synthetic feed directly into Zoom or Teams.

What made the Arup attack effective was the multi-participant setup. The meeting featured the CFO and multiple senior colleagues — creating false consensus, establishing authority through numbers, and building urgency through a “confidential” framing. No software vulnerability was exploited. The attackers leveraged AI video and audio to impersonate trusted individuals across all 15 transactions before the fraud was discovered.

Deepfake video attacks are more resource-intensive than voice-only vishing but deliver higher-value outcomes. How these attacks target specific business functions is the logical next read.

What are Dark LLMs and why do criminals prefer them over ChatGPT?

Dark LLMs are purpose-built criminal language models — not jailbroken mainstream tools, but systems built from the ground up with no safety training. They run behind Tor, ignore safety rules by design, and are sold as subscription services.

WormGPT launched in June 2023, built on GPT-J-6B, promoted as “a ChatGPT alternative for blackhat.” FraudGPT followed days later, sold by “CanadianKingpin12.” The same vendor advertised DarkBERT, DarkBARD, and DarkGPT. Outpost24 documented pricing: WormGPT from $90/month, FraudGPT at $90–$700 depending on term.

Group-IB data via The Register shows dark web AI mentions are up 371% since 2019. And the distinction from jailbroken mainstream models matters. Jailbreaking ChatGPT requires ongoing prompt engineering against actively improving safety filters. Dark LLMs generate malicious content reliably by design — phishing emails, pretext scripts, malware code, synthetic persona backstories. In a voice cloning attack, the Dark LLM is the scripting layer: contextually convincing cover stories in any language, no English skills required.

What is a synthetic persona and how is it different from voice cloning?

Voice cloning impersonates a real person you already trust. A synthetic persona is a fabricated identity — someone who never existed — built to establish a new trust relationship from scratch.

The components: an AI-generated face, a synthetic voice, a fabricated employment history, and a matching social media presence. Group-IB documented that complete synthetic identity kits sell for approximately $5 on dark web marketplaces. A cloned executive voice only works if the victim already trusts that executive. A synthetic persona fabricates a “new colleague” or “vendor rep” — the trust is built through the interaction itself.

BIIA’s 2026 data shows synthetic identities were used in 21% of first-party frauds detected in 2025, and deepfake files grew from roughly 500,000 in 2023 to 8 million in 2025. This is mainstream fraud infrastructure. And the reason it spreads so fast is that humans are reliably poor at catching it.

How convincing are deepfake voices and video — can humans actually tell the difference?

The numbers do the talking. Humans detect deepfake audio at approximately 48% accuracy — worse than a coin flip. For deepfake video, DeepStrike puts human accuracy at 24.5% — roughly one in four. A 2025 iProov study found that only 0.1% of participants correctly identified all fake and real media they were shown.

The confidence gap compounds this. Approximately 60% of people believe they can spot a deepfake. Actual performance is near random. That overestimation is what keeps people trusting instinct over procedure.

Why is detection so hard? Generative models — GANs and diffusion models — are trained specifically to minimise detectable artefacts. Open-source deepfake detectors can see accuracy fall by as much as 50% against new in-the-wild deepfakes not in their training data. There’s also the Liar’s Dividend: as deepfake awareness spreads, authentic audio and video can be dismissed as synthetic. Real evidence gets denied as AI-generated.

These detection rates mean human vigilance cannot be the primary defence. The Arup fraud revealed the gap: no mandatory out-of-band verification protocol was in place. That’s the shift — from training people to spot fakes, to designing procedures that remove the detection requirement entirely. For the full AI-enabled social engineering threat landscape, the picture is bigger than any single attack type.
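
What "procedure over detection" can look like, reduced to a sketch: the transfer is blocked unless it has been confirmed over a separate, pre-registered channel. Everything here is hypothetical; a real control lives in a payments workflow system, not a script.

```python
# Callback numbers come from HR records, never from the inbound call itself.
KNOWN_CONTACTS = {"cfo": "+61-2-XXXX-XXXX"}   # hypothetical directory

def approve_transfer(requester: str, confirmed_out_of_band: bool) -> bool:
    """Block the transfer unless independently confirmed via callback."""
    if requester not in KNOWN_CONTACTS:
        return False
    # A familiar voice or a matching caller ID on the inbound call never
    # counts as confirmation; only the separate callback does.
    return confirmed_out_of_band
```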

FAQ

Can AI really clone my CEO’s voice from a podcast?

Yes. A single podcast appearance provides more than enough source audio. Cloud-based voice cloning tools require as little as 3–30 seconds of clear speech. Any executive who has spoken publicly — podcast, conference, earnings call, LinkedIn video — has already provided sufficient material.

Is it true criminals can make a fake voice with just a few seconds of audio?

McAfee research found that 3 seconds produces an 85% voice match. Higher fidelity requires 20–30 seconds. Either threshold is met by most publicly available recordings of business leaders.

Can video calls be faked now too — even with multiple people on screen?

Yes. The 2024 Arup attack demonstrated a multi-person deepfake conference where the CFO and multiple executives were all synthetic. The fraud totalled $25.6 million and was discovered only through routine post-transaction follow-up.

What does a deepfake voice call sound like?

Perceptually indistinguishable from real speech in most cases. Humans detect deepfake audio at approximately 48% accuracy. Modern voice cloning tools produce natural-sounding output with realistic cadence and filler words, and phone audio compression hides minor artefacts.
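
On the compression point: telephone channels resample speech to roughly 8 kHz, discarding everything above about 4 kHz, which is where subtle synthesis artefacts often live. A minimal sketch of that band-limiting, assuming SciPy is available:

```python
import numpy as np
from scipy.signal import resample_poly

sr_studio, sr_phone = 48_000, 8_000
audio = np.random.randn(sr_studio)    # stand-in for 1 s of speech at 48 kHz

# Downsampling to the telephone rate throws away the high-frequency detail
phone_audio = resample_poly(audio, sr_phone, sr_studio)
print(len(audio), "->", len(phone_audio))   # 48000 -> 8000 samples
```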

How long does it take to create a voice clone?

NCC Group completed their proof-of-concept on a consumer laptop GPU. With cloud APIs renting GPU compute by the hour, training compresses further. A single person with no ML background can run the full pipeline in well under a day.

What is the difference between a voice clone and a synthetic persona?

A voice clone replicates a real person’s voice. A synthetic persona is a fabricated identity that never existed — AI-generated face, synthetic voice, backstory. Voice cloning exploits existing trust; synthetic personas create new fake people to build fresh trust from scratch.

Are Dark LLMs the same as jailbroken ChatGPT?

No. Jailbreaking requires ongoing prompt engineering against safety filters. Dark LLMs like WormGPT and FraudGPT are purpose-built with no safety training, reliably generating malicious content by design. Criminal SaaS at $90–$110/month depending on the tool.

Why is caller ID spoofing combined with voice cloning so effective?

It simultaneously eliminates both intuitive checks: “Is this their number?” and “Does this sound like them?” Both answers appear to be yes, which removes the victim’s instinctive reason to pause.

How much does it cost an attacker to build a complete voice clone attack?

The full toolchain is cheap. Cloud voice cloning access, a Dark LLM subscription, caller ID spoofing, VoIP infrastructure. Synthetic identity kits sell for approximately $5. The Biden deepfake robocall cost $1 and took less than 20 minutes. A complete attack package runs well under $100.

What is the Liar’s Dividend and why does it matter?

The second-order effect of deepfake proliferation: authentic audio and video can be dismissed as AI-generated. Real evidence gets denied as synthetic. It erodes trust in all digital communications — not just the communications that are actually fake.
