When a customer calls an AI voice agent, 4 steps happen in under 800 milliseconds. This article explains each step and why AgenteUno achieves natural conversations in Spanish.
The voice pipeline
Audio → STT (Speech-to-Text) → LLM (Brain) → TTS (Text-to-Speech) → Audio
Each step adds latency. The goal is to keep the total under 1 second so the conversation feels natural.
Step 1: Speech-to-Text (STT)
STT converts the customer's voice into text. It's the agent's ear.
Technologies used:
- Deepgram Nova-2: 200ms latency, excellent in Spanish
- Whisper (OpenAI): More accurate but slower (~500ms)
- Google Cloud STT: Good multilingual support
Challenge in Spanish: Regional accents (Mexico, Argentina, Spain) require specifically trained models. A generic STT confuses similar-sounding words and colloquialisms.
AgenteUno uses models optimized for both Peninsular and Latin American Spanish with error rates below 5%.
Step 2: LLM (the brain)
Once we have the text, the LLM decides what to respond. This is where the agent's intelligence lives.
What it processes:
- The customer's transcription
- Full conversation context
- Business knowledge base
- System instructions (personality, restrictions)
Speed: We use models optimized for low latency. The LLM generates responses in ~200ms for short phrases.
Step 3: Text-to-Speech (TTS)
TTS converts the LLM's response into audio. It's the agent's voice.
What matters:
- Naturalness: Not sounding robotic. Modern voices are nearly indistinguishable from humans
- Prosody: Intonation, rhythm, pauses. Spanish has very distinctive prosody
- Streaming: TTS starts speaking before finishing the entire phrase generation
AgenteUno Spanish voices: 4 native voices (2 female, 2 male) with neutral accent and regional variants.
Step 4: Audio output
Generated audio is sent to the customer in real-time via WebRTC or PSTN telephony. Codec quality and network conditions affect the final experience.
The role of latency
| Total latency | Experience |
|---|---|
| < 500ms | Imperceptible, like talking to a human |
| 500-800ms | Acceptable, slight pause |
| 800-1200ms | Noticeable, customer perceives "thinking" |
| > 1200ms | Poor experience, customer hangs up |
AgenteUno optimizes each step to keep total latency below 800ms.
Advanced features
Interruptions (barge-in)
The customer can interrupt the agent at any time. The agent detects the customer speaking, stops its response and listens.
Sentiment detection
The agent analyzes voice tone to detect frustration, urgency or satisfaction and adapts its response.
Human transfer
If the agent detects a situation requiring human intervention, it transfers the call with a context summary.
Is your business still responding manually?
AgenteUno automates WhatsApp, voice, chat and more — set up in minutes.
Try it free →Try it now
Automate your business support in minutes
Set up your AI agent for WhatsApp, voice, chat and more — no code, no waiting.