Voice AI just got good enough to fool you
Here's what's actually happening

Hey friend,
I called a dentist's office last week to book a cleaning.
The person who answered was friendly. Professional. Asked the right questions. Confirmed my insurance. Found a slot that worked with my schedule.
Totally normal interaction.
Except halfway through, something felt... off.
The responses were too smooth. No "umms." No background noise. No hold music while they checked the calendar.
So I asked: "Am I speaking to a real person?"
Three-second pause.
"I'm an AI assistant helping with scheduling today. Would you like me to transfer you to a human team member?"
I sat there for a second, genuinely impressed.
"No, you're doing fine. Let's finish booking."
And we did. Seamlessly.
That's when I realised I needed to pay closer attention to what's happening in this space.
(Spoiler: what I found was wilder than I expected.)
This Market Is Growing Stupidly Fast
Let me throw a number at you.
The AI voice agent market was worth about $2.4 billion in 2024.
By 2034? Projected to hit $47.5 billion.

That's not steady growth. That's a 34.8% compound annual growth rate for a decade straight.
Why now? Why not five years ago, when everyone was hyping voice assistants?
One reason: latency finally got solved.
For a phone conversation to feel natural, the AI needs to respond within 300-500 milliseconds. Any slower and it feels robotic. Awkward. Like talking to someone with a bad connection.
Previous AI couldn't hit that window consistently. The technology existed, but it was too slow to fool anyone.
Then GPT-4o dropped in May 2024 with response times of 232-320 milliseconds.
(I know, I know. More GPT-4 hype. But this one actually matters.)
For the first time, talking to AI on the phone felt like talking to a person. Not a good impression of a person. Actually conversational.
That single breakthrough unlocked everything else.
How This Actually Works (The Non-Boring Version)
The tech runs on a three-step pipeline. I'll keep this quick because you don't need a PhD to understand it.
Step 1: Speech-to-Text
You talk. The AI converts your voice to text. Takes about 100-300 milliseconds.
(Fun fact: the best systems can transcribe faster than you can finish your sentence.)
Step 2: The Brain
An LLM like GPT-4o reads what you said, figures out what you meant, and writes a response. This is the slowest part - 350-700 milliseconds - and accounts for about 70% of total delay.
Step 3: Text-to-Speech
The AI's text response gets converted back into a natural-sounding voice. Another 75-200 milliseconds.
Total pipeline: under 800 milliseconds if everything's optimised. Under 500 for the really good setups.
The phone network itself adds another 100-200 milliseconds because, well, phone infrastructure is old.
That's it. That's the whole system.
The magic isn't any single piece - it's that all three pieces finally got fast enough at the same time.
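If you think in code, here's that whole loop as a minimal sketch. The three service calls are placeholder stubs, not any particular vendor's API - swap in whatever STT, LLM, and TTS providers you actually use.

import time

def transcribe(audio_chunk):
    # Step 1: speech-to-text (~100-300 ms with a real provider)
    return "I'd like to book a cleaning next Tuesday."

def generate_reply(transcript):
    # Step 2: the LLM "brain" (~350-700 ms - the bulk of the delay)
    return "Sure - we have 10am or 2pm on Tuesday. Which works better?"

def synthesize(reply_text):
    # Step 3: text-to-speech (~75-200 ms)
    return b"<synthesized audio bytes>"

def handle_turn(audio_chunk):
    start = time.perf_counter()
    reply_audio = synthesize(generate_reply(transcribe(audio_chunk)))
    elapsed_ms = (time.perf_counter() - start) * 1000
    # The whole turn needs to land under roughly 500-800 ms to feel human.
    print(f"turn latency: {elapsed_ms:.0f} ms")
    return reply_audio

handle_turn(b"<caller audio>")

(Real systems stream each stage into the next rather than running them strictly in sequence - that's how the best setups get under 500 milliseconds.)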
The Companies Seeing Actual Results
Klarna - the buy-now-pay-later company - has an AI assistant that handled 2.3 million conversations in its first month.
Two. Point. Three. Million.
They claim it's doing the work of 700 full-time customer service agents. They're projecting $40 million in profit improvement from this.
Cedars-Sinai Medical Center deployed voice AI for scheduling and test results. Cut their call volume in half. 94% user satisfaction.
(Patients liked talking to the robot more than waiting on hold. Go figure.)
A hospital in Kansas reduced check-in times from 4 minutes to 10 seconds. Their pre-registration rate doubled from 40% to 80%.
A debt collection company in Mexico - this one's wild - collected $6 million in six months using only AI agents. No humans on the phones.
(I had to read that twice.)
Restaurant chains using voice AI report 50% more phone reservations and 200 hours saved monthly. One platform claims its clients see 760% ROI from labour savings alone.
The pattern is consistent: businesses deploying this stuff are seeing 50-85% cost reductions on phone operations.
The Platforms Actually Worth Looking At
The market has gotten crowded. Let me save you some research time.
If you're technical and want full control: Vapi
Developer-focused. Maximum customization. Modular architecture. But you'll need coding skills, and the realistic cost is $0.13-0.33 per minute after adding all the components together.
(The advertised $0.05/minute is... optimistic.)
If you need enterprise compliance: Retell AI
This is what regulated industries are using. SOC 2, HIPAA, GDPR certified. Transparent pricing at $0.07/minute with no platform fees. Over 3,000 businesses, including healthcare providers and financial services.
Probably the most balanced option if you want control without building everything yourself.
If you're not technical at all: Synthflow
Drag-and-drop builder. Starts at $29/month for 50 minutes. Native integrations with HubSpot and Salesforce. White-label options if you're an agency.
This is the "I don't want to touch code" option. And it's legitimately good.
If voice quality is everything: ElevenLabs
The voice cloning people. Their synthetic voices are so good that 65% of consumers can't tell they're AI.
(Sixty-five per cent. That's not a typo.)
But ElevenLabs only handles the voice part - you need to pair it with other platforms for full phone functionality.
Where This Tech Still Fails
I'm not going to pretend this is magic. The failure modes are real, and you need to know about them.
Hallucination is genuinely scary.
Cornell research found that Whisper (OpenAI's speech recognition) fabricated sentences in about 1% of transcriptions. That sounds small until you realize 38% of those hallucinations contained explicit harms - the model inserted words like "terror," "knife," and "killed" that were never spoken.
One engineer reported hallucinations in 50% of 100+ hours of transcriptions.
(For medical or legal applications, this is a nightmare scenario.)
Accents are still a problem.
Lab-condition accuracy claims of 95-99% drop to about 62% in real-world testing with regional dialects and non-native speakers. Human transcribers hit 99%.
The AI works great if you sound like the training data. Less great if you don't.
Emotional intelligence is basically nonexistent.
An upset customer who needs empathy will get technically correct but emotionally tone-deaf responses. The AI can detect that someone sounds frustrated, but doesn't really understand what that means.
This is why the smart deployments use hybrid models - AI for routine stuff, humans for anything emotional or complex.
Complex conversations break down.
Great for "schedule an appointment for Tuesday." Unreliable for "I'm having a complicated situation with my account and need someone to really understand what happened."
Structured, predictable = AI handles it fine. Ambiguous, contextual, requires judgment = probably needs a human.
The Legal Stuff You Actually Need to Know
(This part is boring, but it could save you from expensive mistakes.)
The FCC ruled in February 2024 that AI-generated voices count as "artificial or prerecorded voice" under the Telephone Consumer Protection Act.
Translation: You need prior consent before making AI voice calls. You must disclose your identity at the start. You need opt-out mechanisms within 2 seconds for telemarketing. And violations carry a penalty of $500-1,500 per violation.
Per. Violation.
At scale, that adds up fast.
California requires disclosure of AI interaction. Utah requires proactive consumer disclosure. Several other states have pending legislation.
If you're in healthcare, HIPAA applies. A 2025 proposed update eliminates "addressable" specifications - nearly all requirements become mandatory with just 240 days to implement.
Best practice: just tell people they're talking to AI. Even where it's not legally required. Builds trust, reduces legal risk, and honestly, most people don't care as long as their problem gets solved.
What's Coming Next
Emotion detection is getting real.
ElevenLabs' latest voices can naturally sigh, whisper, laugh, and react emotionally. Hume AI can predict emotions from voice alone. Early deployments show 25% reduction in escalations when AI detects and responds to frustration.
(Still not true empathy. But getting closer to faking it convincingly.)
Latency keeps shrinking.
Some systems are already hitting 160 milliseconds total in optimised setups. The broader target is sub-200 milliseconds on consumer-grade hardware. We're not there yet, but the trajectory is clear.
Multimodal is coming.
Voice plus vision plus screen-sharing. Google's already demoing Gemini Live with real-time voice and camera. By 2026, about 30% of AI models will handle multiple input types simultaneously.
The big prediction: 70% of businesses plan to adopt voice AI by the end of 2025. Gartner says conversational AI will cut contact centre labour costs by $80 billion by 2026.
Whether those numbers land exactly or not, the direction is obvious.
If You Want to Try This
Lowest barrier to entry: Synthflow at $29/month. No code required. You can have a basic voice agent running in an afternoon.
Best for serious business use: Retell AI. Compliance certifications, reasonable pricing, good balance of control and usability.
If you want to build custom: Start with Vapi or the open-source Pipecat framework. Budget for the learning curve.
Budget realistically: $0.10-0.20 per minute on managed platforms, plus setup costs ($500-2,000 for simple deployments, way more for enterprise). There's a quick back-of-the-envelope example below.
Start with low-risk use cases: after-hours coverage, appointment scheduling, FAQ handling. Get comfortable before trying anything complex.
And seriously - monitor for hallucinations in anything high-stakes. This tech is good, but it's not infallible.
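To make that budget line concrete, here's the math. The call volume and rate below are made-up illustrative numbers - plug in your own.

calls_per_day = 40
avg_minutes_per_call = 3
rate_per_minute = 0.15  # midpoint of the $0.10-0.20 managed-platform range

monthly_minutes = calls_per_day * avg_minutes_per_call * 30  # 3,600 minutes
monthly_cost = monthly_minutes * rate_per_minute             # $540

print(f"{monthly_minutes} minutes/month -> ${monthly_cost:,.0f}/month")
# Plus a one-off setup cost of roughly $500-2,000 for a simple deployment.

At that volume you're paying around $540 a month - weigh that against even a few hours a week of staff time on the phones.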
The dentist appointment I booked with that AI?
Showed up, got my cleaning, everything was exactly as scheduled.
No miscommunication. No double-booking. No issues.
The AI did its job better than plenty of human receptionists I've dealt with.
That's not me being cynical about humans. That's me acknowledging that for simple, structured tasks, the robots are getting genuinely good.
The question isn't whether this technology works. It does.
The question is whether you figure out how to use it before your competitors do.
P.S. - I didn't mention this above, but the voice cloning capabilities are getting wild.
ElevenLabs can clone a voice from just 1-5 minutes of audio. The implications for customer service (your brand's voice, literally) and for fraud (obvious concerns) are both significant.
P.P.S. - If you're in an industry that handles sensitive phone calls - healthcare, legal, financial - the compliance stuff is more complex than I could cover here.
Happy to go deeper if there's interest. Just reply and let me know.
If this was useful, forward it to someone who's still making customers wait on hold.
And if you're not subscribed yet, fix that. I break down AI and automation stuff every week - practical, no hype, occasionally funny.
See you next time.