Latency in conversational AI: a testing guide for sub-second response
Your users know when a conversation is off before they can say why. Someone pauses a beat too long before answering, and they start filling the silence themselves. The face across from them goes still for half a second, and something in their read of the interaction shifts.
Human beings are wired to detect conversational timing at a resolution most engineering teams don't test for, and that gap between what gets measured and what gets felt is where conversational AI deployments quietly underperform.
A 700ms pause in a text chat reads as thinking. In a voice call, the same 700ms reads as dead air. In a video conversation, it reads as vacancy: a face that should be reacting but isn't. Same latency, three different failures. That asymmetry is the core problem with how teams measure conversational AI performance today. Most are testing the wrong number, in the wrong modality, against the wrong threshold.
Text, voice, and video are architecturally distinct, and each introduces latency variables the previous one doesn't have. Text is a pipeline problem with inference, streaming, and rendering. Voice adds an audio stack and a conversational flow layer. Video adds rendering, behavioral continuity during listening, and a perception layer that neither text nor voice requires.
Most vendors quote a single headline figure, usually time-to-first-token (TTFT) or average response time. What users experience is the sum of every stage in the pipeline, and that sum is almost always higher than the headline. A system with a 120ms median (P50) response time might still spike into multi-second delays at P99 under peak traffic, meaning roughly 1 in 100 interactions delivers a completely different experience. P50 is what vendors report. P95 and P99 are where conversations break down in production.
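The gap between the headline median and the tail can be made concrete by computing percentiles over recorded response times. A minimal sketch, using illustrative latency samples rather than real measurements:

```python
def percentile(samples_ms, p):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples_ms)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

# 100 illustrative response times: most near 120ms, with tail spikes.
samples = [120] * 95 + [400, 600, 900, 1500, 3200]

p50 = percentile(samples, 50)  # 120ms -- the number vendors quote
p99 = percentile(samples, 99)  # 1500ms -- what 1 in 100 users feels
```

The same distribution produces a vendor-friendly P50 and a conversation-breaking P99, which is why regression suites should gate on the tail, not the median.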
TTFT tells you something useful about a text system but almost nothing about a video system where the AI Persona's face went blank for 400ms before the response even started. Each modality introduces latency variables the previous one doesn't have, and testing must account for all of them.
The text pipeline is the simplest: large language model (LLM) inference, streaming token output, and rendering in the browser. What matters in practice is perceived responsiveness, which user experience decisions shape as much as infrastructure does.
Streaming wins on perception even when total latency is identical. Many production implementations chunk output at natural language boundaries so the interface feels continuous rather than bursty.
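Chunking at natural language boundaries can be as simple as buffering streamed tokens and flushing whenever a sentence-ending character arrives. A sketch, with an illustrative token sequence standing in for a real streaming API:

```python
SENTENCE_END = {".", "!", "?"}

def chunk_stream(tokens):
    """Yield buffered text at sentence boundaries so the UI renders
    whole phrases instead of single-token bursts."""
    buffer = []
    for tok in tokens:
        buffer.append(tok)
        if tok and tok[-1] in SENTENCE_END:
            yield "".join(buffer)
            buffer = []
    if buffer:  # flush any trailing partial sentence
        yield "".join(buffer)

# Tokens as an LLM streaming endpoint might emit them.
tokens = ["Hello", " there", ".", " How", " can", " I", " help", "?"]
chunks = list(chunk_stream(tokens))
# chunks == ["Hello there.", " How can I help?"]
```

Total latency is unchanged; only the delivery rhythm improves, which is the point.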
Typing indicators matter too: a controlled study with 209 participants found that a 2.3-second delay with a typing indicator produced satisfaction scores statistically indistinguishable from instant response (M=5.62 vs. M=5.67 on a 7-point scale). Without the indicator at the same delay, satisfaction dropped to M=4.40.
Many hosted frontier models still deliver first tokens in roughly 0.5-1.5 seconds depending on model size, provider, and load, which remains several times above the 200ms conversational ideal. Text is forgiving, but the techniques that work here don't translate to voice or video, where conversational rhythm is real-time.
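Measuring TTFT yourself is straightforward: time from request dispatch to the first streamed token. A sketch against a stand-in generator (`fake_stream` simulates a provider's streaming response and is not a real API):

```python
import time

def measure_ttft(stream):
    """Return (ttft_seconds, tokens) for a token iterator,
    timed from the call until the first token arrives."""
    start = time.monotonic()
    tokens = []
    ttft = None
    for tok in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        tokens.append(tok)
    return ttft, tokens

def fake_stream():
    time.sleep(0.05)  # simulated 50ms time-to-first-token
    yield "Hello"
    yield " world"

ttft, toks = measure_ttft(fake_stream())
```

Run the same harness against your actual provider at P50, P95, and P99, not once.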
Voice adds two things text doesn't have: an audio pipeline and a conversational flow problem. Well-tuned production systems often aim for roughly 100-200ms for speech-to-text (STT), a few hundred milliseconds for LLM TTFT, and under roughly 150ms for text-to-speech (TTS) first audio. What's underreported is the conversational flow layer: before any pipeline stage runs, the system must decide the user is done speaking.
There are three common approaches to that decision: a fixed silence timeout (wait a set number of milliseconds of quiet, then respond), adaptive voice activity detection that tunes the timeout to the speaker and context, and predictive turn-taking models that read the audio itself to judge whether the speaker is actually finished.
This decision point is the most underreported latency source in voice AI. A system benchmarking 400ms on the STT-LLM-TTS pipeline but adding 700ms of silence detection delivers a 1,100ms experience: the benchmark says fast; the user hears dead air.
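The arithmetic generalizes into a simple budget check: perceived latency is the sum of every stage, including turn detection, which headline benchmarks typically omit. The stage figures below are illustrative:

```python
# Illustrative per-stage budgets in milliseconds. The silence-detection
# line is the one benchmark numbers usually leave out.
stages_ms = {
    "silence_detection": 700,  # deciding the user is done speaking
    "stt": 150,                # speech-to-text
    "llm_ttft": 150,           # time to first LLM token
    "tts_first_audio": 100,    # first synthesized audio
}

pipeline_only = sum(v for k, v in stages_ms.items()
                    if k != "silence_detection")  # 400 -- the benchmark
perceived = sum(stages_ms.values())               # 1100 -- the user's experience
```

Any stage you exclude from the budget still gets paid for by the user.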
Video inherits all of voice's complexity and adds rendering, behavioral continuity, and perception. Real-time conversational video, meaning live bidirectional conversation, is architecturally distinct from asynchronous video generation tools, which produce clips offline from text prompts rather than sustaining live dialogue.
Research comparing the approaches documents a 25x speedup in frame throughput and over 100x latency improvement for real-time distilled models. That distinction defines a category boundary: real-time conversational video infrastructure occupies a different design space from static video generation, and the latency requirements reflect it.
The video pipeline adds a rendering stage after TTS: facial behavior must keep pace with audio at a frame rate sustaining lip sync credibility. Industry standards for audio-visual synchronization set tight tolerances, with detectability thresholds at 45ms audio-leading and 125ms audio-lagging, and even smaller offsets can degrade perceived realism before users can explicitly describe what's wrong.
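Those tolerances reduce to a simple asymmetric check on the audio-video offset (positive when audio leads the picture). The thresholds below are the detectability figures cited above:

```python
AUDIO_LEAD_MS = 45   # audio ahead of video becomes detectable
AUDIO_LAG_MS = 125   # audio behind video becomes detectable

def av_sync_ok(offset_ms):
    """True if an audio-video offset stays inside the asymmetric
    detectability window (positive offset = audio leads)."""
    return -AUDIO_LAG_MS <= offset_ms <= AUDIO_LEAD_MS

av_sync_ok(30)    # True: within tolerance
av_sync_ok(60)    # False: audio leads too far
av_sync_ok(-100)  # True: audio lags, still inside the 125ms window
```

Note the asymmetry: humans tolerate audio arriving late far better than audio arriving early, so a symmetric sync budget wastes margin on the wrong side.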
In a voice agent, the system goes quiet while the user speaks; computationally, it can idle. In a video conversation, the AI Persona, the face and presence layer, is visible the entire time, and a rendering pipeline that only activates during output will freeze during the listening phase.
Consider a voice agent waiting silently while a policyholder describes an auto accident: silence is fine. An AI Persona that freezes during that same pause communicates no acknowledgment, no patience, no presence, and the interaction loses the trust that face-to-face conversation is supposed to build. Full-duplex generation, producing active listening behavior while the user speaks, is what separates presence from vacancy.
Phoenix-4, Tavus's real-time facial behavior engine, generates continuous emotional expression and active listening behavior at 40fps, 1080p, including during the listening phase, keeping the AI Persona credible between turns and shortening the path to a resolved claim.
Voice systems transcribe audio and pass text to the LLM, a lossy step. Research on multimodal emotion recognition confirms that audio carries prosodic signals text can't represent, and that facial expressions provide complementary cues audio alone doesn't capture.
Raven-1, Tavus's multimodal perception system, fuses audio and visual signals and keeps that fused signal no more than 300ms stale, outputting natural language descriptions that let downstream LLMs reason over what the user is actually communicating.
In a leadership coaching session, a participant says "I think the team is doing fine" while their tone flattens and their gaze drops. Raven-1 perceives the incongruence, giving the downstream LLM the context to probe deeper rather than take the statement at face value. For the coaching provider, that's the difference between a session that surfaces real issues and one that stays politely on the surface.
Real-time conversational video closes a four-component loop: Sparrow-1 governs conversational timing, Raven-1 fuses perception signals, the LLM intelligence layer generates response content, and Phoenix-4 renders behavior. Each component has a latency budget, and failure in any one breaks the experience.
Sparrow-1, Tavus's conversational flow model, governs floor ownership by predicting conversational state at the frame level on raw audio. The LLM layer reasons about what to say next based on Raven-1's perception output, and Phoenix-4 renders the response with matching facial behavior. In Tavus's benchmark against leading turn-taking systems using 28 challenging real-world conversational samples, Sparrow-1 achieved 55ms median prediction latency, 100% precision and recall, and zero interruptions across all 28 samples, while competing approaches were forced to choose between waiting several seconds or cutting speakers off.
In a candidate screening call, an applicant says "I haven't done that at scale" while maintaining steady eye contact and an open posture. Sparrow-1 holds the floor open as they gather their next thought. Raven-1 interprets the delivery as candor rather than limitation, giving the LLM context to explore depth of experience rather than move on. Phoenix-4 renders responsive facial behavior informed by that perception, nodding subtly during a long pause, maintaining eye contact as the candidate works through a complex answer.
For the recruiting team running hundreds of screens per month, the closed loop means higher signal per conversation and fewer qualified candidates lost to a robotic experience.
Latency isn't an engineering metric in isolation; it's a unit economics problem. When latency degrades presence, completion rates drop and the cost of each successful outcome rises.
A healthcare intake system running 10,000 patient conversations per month needs low latency because every abandoned conversation is a scheduling gap, a repeated call, or a missed triage signal.
Teams evaluating conversational AI infrastructure should map expected conversation volume to the dollar value of each completed interaction, then test whether latency at P95 and P99 protects that value or erodes it. Pipeline stages like Knowledge Base retrieval (approximately 30ms) and cross-session Memories add minimal latency but significantly improve response quality, meaning teams should test with these features enabled, not just on bare inference.
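That mapping can be sketched with a simple abandonment model: abandoned conversations are paid for but deliver no value, so as tail latency pushes abandonment up, cost per completed interaction rises. All numbers here are illustrative assumptions, not measured rates:

```python
def cost_per_completion(monthly_volume, cost_per_conversation, abandon_rate):
    """Cost of each completed interaction once abandoned conversations
    are paid for but deliver nothing."""
    completed = monthly_volume * (1 - abandon_rate)
    total_cost = monthly_volume * cost_per_conversation
    return total_cost / completed

# Illustrative: 10,000 conversations/month at $0.50 each.
fast = cost_per_completion(10_000, 0.50, abandon_rate=0.05)  # ~$0.53
slow = cost_per_completion(10_000, 0.50, abandon_rate=0.25)  # ~$0.67
```

Plug in your own volume, per-conversation cost, and observed abandonment at P95/P99 latency to see what tail latency actually costs.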
The tests below focus on production failure points where users feel the system fall out of rhythm. Run them against your actual deployment configuration, not a local environment.
Run these as regression tests after any change to model choice, streaming configuration, retrieval, or transport.
One infrastructure note: Web Real-Time Communication (WebRTC)-based delivery introduces variable network jitter that lab setups won't capture. Research on geographic latency shows that a user in Singapore connecting to US-East infrastructure faces 200-250ms of round-trip network latency before any processing occurs.
Tavus exposes its Conversational Video Interface (CVI) through APIs, giving teams the ability to profile latency against their own deployment configurations. If a system only performs well on localhost and falls apart on cellular connections, the evaluation is incomplete.
Cross-linguistic research shows that human conversational timing is remarkably consistent across cultures, with response times clustering around 200ms. When an AI system misses that window, the people on the other end feel it viscerally, even if they can't name what went wrong.
Presence is the real threshold: the point where your user stops thinking about the technology and focuses on the conversation itself. In text, you earn it with streaming and fast inference. In voice, you earn it with precise conversational flow. In video, you earn it by closing the loop between perception and expression so completely that the system responds with the attentiveness of a present listener.
See how Tavus can help you achieve sub-second response in production. Book a demo.