Latency in conversational AI: a testing guide for sub-second response
Your users know when a conversation is off before they can say why. Someone pauses a beat too long before answering, and they start filling the silence themselves. The face across from them goes still for half a second, and something in their read of the interaction shifts.
Human beings are wired to detect conversational timing at a resolution most engineering teams don't test for, and that gap between what gets measured and what gets felt is where conversational AI deployments quietly underperform.
A 700ms pause in a text chat reads as thinking. In a voice call, the same 700ms reads as dead air. In a video conversation, it reads as vacancy: a face that should be reacting but isn't. Same latency, three different failures. That asymmetry is the core problem with how teams measure conversational AI performance today. Most are testing the wrong number, in the wrong modality, against the wrong threshold.
Text, voice, and video are architecturally distinct, and each introduces latency variables the previous one doesn't have. Text is a pipeline problem with inference, streaming, and rendering. Voice adds an audio stack and a conversational flow layer. Video adds rendering, behavioral continuity during listening, and a perception layer that neither text nor voice requires.
Most vendors quote a single headline figure, usually time-to-first-token (TTFT) or average response time. What users experience is the sum of every stage in the pipeline, and that sum is almost always higher than the headline. A system with a 120ms median (P50) response time might still spike into multi-second delays at P99 under peak traffic, meaning roughly 1 in 100 interactions delivers a completely different experience. P50 is what vendors report. P95 and P99 are where conversations break down in production.
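The gap between the headline median and the tail can be made concrete by computing percentiles over recorded response times. A minimal sketch, using illustrative latency samples rather than real measurements:

```python
def percentile(samples_ms, p):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples_ms)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

# 100 illustrative response times: most near 120ms, with tail spikes.
samples = [120] * 95 + [400, 600, 900, 1500, 3200]

p50 = percentile(samples, 50)  # 120ms -- the number vendors quote
p99 = percentile(samples, 99)  # 1500ms -- what 1 in 100 users feels
```

The same distribution produces a vendor-friendly P50 and a conversation-breaking P99, which is why regression suites should gate on the tail, not the median.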
TTFT tells you something useful about a text system but almost nothing about a video system where the AI Persona's face went blank for 400ms before the response even started. Each modality introduces latency variables the previous one doesn't have, and testing must account for all of them.
The text pipeline is the simplest: large language model (LLM) inference, streaming token output, and rendering in the browser. What matters in practice is perceived responsiveness, which user experience decisions shape as much as infrastructure does.
Streaming wins on perception even when total latency is identical. Many production implementations chunk output at natural language boundaries so the interface feels continuous rather than bursty.
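Chunking at natural language boundaries can be as simple as buffering streamed tokens and flushing whenever a sentence-ending character arrives. A sketch, with an illustrative token sequence standing in for a real streaming API:

```python
SENTENCE_END = {".", "!", "?"}

def chunk_stream(tokens):
    """Yield buffered text at sentence boundaries so the UI renders
    whole phrases instead of single-token bursts."""
    buffer = []
    for tok in tokens:
        buffer.append(tok)
        if tok and tok[-1] in SENTENCE_END:
            yield "".join(buffer)
            buffer = []
    if buffer:  # flush any trailing partial sentence
        yield "".join(buffer)

# Tokens as an LLM streaming endpoint might emit them.
tokens = ["Hello", " there", ".", " How", " can", " I", " help", "?"]
chunks = list(chunk_stream(tokens))
# chunks == ["Hello there.", " How can I help?"]
```

Total latency is unchanged; only the delivery rhythm improves, which is the point.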
Typing indicators matter too: a controlled study with 209 participants found that a 2.3-second delay with a typing indicator produced satisfaction scores statistically indistinguishable from instant response (M=5.62 vs. M=5.67 on a 7-point scale). Without the indicator at the same delay, satisfaction dropped to M=4.40.
Many hosted frontier models still deliver first tokens in roughly 0.5-1.5 seconds depending on model size, provider, and load, which remains several times above the 200ms conversational ideal. Text is forgiving, but the techniques that work here don't translate to voice or video, where conversational rhythm is real-time.
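Measuring TTFT yourself is straightforward: time from request dispatch to the first streamed token. A sketch against a stand-in generator (`fake_stream` simulates a provider's streaming response and is not a real API):

```python
import time

def measure_ttft(stream):
    """Return (ttft_seconds, tokens) for a token iterator,
    timed from the call until the first token arrives."""
    start = time.monotonic()
    tokens = []
    ttft = None
    for tok in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        tokens.append(tok)
    return ttft, tokens

def fake_stream():
    time.sleep(0.05)  # simulated 50ms time-to-first-token
    yield "Hello"
    yield " world"

ttft, toks = measure_ttft(fake_stream())
```

Run the same harness against your actual provider at P50, P95, and P99, not once.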
Voice adds two things text doesn't have: an audio pipeline and a conversational flow problem. Well-tuned production systems often aim for roughly 100-200ms for speech-to-text (STT), a few hundred milliseconds for LLM TTFT, and under roughly 150ms for text-to-speech (TTS) first audio. What's underreported is the conversational flow layer: before any pipeline stage runs, the system must decide the user is done speaking.
There are three common approaches to that decision: a fixed silence timeout (wait a set number of milliseconds of quiet, then respond), adaptive voice activity detection that tunes the timeout to the speaker and context, and predictive turn-taking models that read the audio itself to judge whether the speaker is actually finished.
This decision point is the most underreported latency source in voice AI. A system benchmarking 400ms on the STT-LLM-TTS pipeline but adding 700ms of silence detection delivers a 1,100ms experience: the benchmark says fast; the user hears dead air.
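The arithmetic generalizes into a simple budget check: perceived latency is the sum of every stage, including turn detection, which headline benchmarks typically omit. The stage figures below are illustrative:

```python
# Illustrative per-stage budgets in milliseconds. The silence-detection
# line is the one benchmark numbers usually leave out.
stages_ms = {
    "silence_detection": 700,  # deciding the user is done speaking
    "stt": 150,                # speech-to-text
    "llm_ttft": 150,           # time to first LLM token
    "tts_first_audio": 100,    # first synthesized audio
}

pipeline_only = sum(v for k, v in stages_ms.items()
                    if k != "silence_detection")  # 400 -- the benchmark
perceived = sum(stages_ms.values())               # 1100 -- the user's experience
```

Any stage you exclude from the budget still gets paid for by the user.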
Video inherits all of voice's complexity and adds rendering, behavioral continuity, and perception. Real-time conversational video, meaning live bidirectional conversation, is architecturally distinct from asynchronous video generation tools, which produce clips offline from text prompts rather than sustaining live dialogue.
Research comparing the approaches documents a 25x speedup in frame throughput and over 100x latency improvement for real-time distilled models. That distinction defines a category boundary: real-time conversational video infrastructure occupies a different design space from static video generation, and the latency requirements reflect it.
The video pipeline adds a rendering stage after TTS: facial behavior must keep pace with audio at a frame rate sustaining lip sync credibility. Industry standards for audio-visual synchronization set tight tolerances, with detectability thresholds at 45ms audio-leading and 125ms audio-lagging, and even smaller offsets can degrade perceived realism before users can explicitly describe what's wrong.
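Those tolerances reduce to a simple asymmetric check on the audio-video offset (positive when audio leads the picture). The thresholds below are the detectability figures cited above:

```python
AUDIO_LEAD_MS = 45   # audio ahead of video becomes detectable
AUDIO_LAG_MS = 125   # audio behind video becomes detectable

def av_sync_ok(offset_ms):
    """True if an audio-video offset stays inside the asymmetric
    detectability window (positive offset = audio leads)."""
    return -AUDIO_LAG_MS <= offset_ms <= AUDIO_LEAD_MS

av_sync_ok(30)    # True: within tolerance
av_sync_ok(60)    # False: audio leads too far
av_sync_ok(-100)  # True: audio lags, still inside the 125ms window
```

Note the asymmetry: humans tolerate audio arriving late far better than audio arriving early, so a symmetric sync budget wastes margin on the wrong side.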
In a voice agent, the system goes quiet while the user speaks; computationally, it can idle. In a video conversation, the AI Persona, the face and presence layer, is visible the entire time, and a rendering pipeline that only activates during output will freeze during the listening phase.
Consider a voice agent waiting silently while a policyholder describes an auto accident: silence is fine. An AI Persona that freezes during that same pause communicates no acknowledgment, no patience, no presence, and the interaction loses the trust that face-to-face conversation is supposed to build. Full-duplex generation, producing active listening behavior while the user speaks, is what separates presence from vacancy.
Phoenix-4, Tavus's real-time facial behavior engine, generates continuous emotional expression and active listening behavior at 40fps, 1080p, including during the listening phase, keeping the AI Persona credible between turns and shortening the path to a resolved claim.
Voice systems transcribe audio and pass text to the LLM, a lossy step. Research on multimodal emotion recognition confirms that audio carries prosodic signals text can't represent, and that facial expressions provide complementary cues audio alone doesn't capture.
Raven-1, Tavus's multimodal perception system, fuses audio and visual signals and keeps that fused signal no more than 300ms stale, outputting natural language descriptions that let downstream LLMs reason over what the user is actually communicating.
In a leadership coaching session, a participant says "I think the team is doing fine" while their tone flattens and their gaze drops. Raven-1 perceives the incongruence, giving the downstream LLM the context to probe deeper rather than take the statement at face value. For the coaching provider, that's the difference between a session that surfaces real issues and one that stays politely on the surface.
Real-time conversational video closes a four-component loop: Sparrow-1 governs conversational timing, Raven-1 fuses perception signals, the LLM intelligence layer generates response content, and Phoenix-4 renders behavior. Each component has a latency budget, and failure in any one breaks the experience.
Sparrow-1, Tavus's conversational flow model, governs floor ownership by predicting conversational state at the frame level on raw audio. The LLM layer reasons about what to say next based on Raven-1's perception output, and Phoenix-4 renders the response with matching facial behavior. In Tavus's benchmark against leading turn-taking systems using 28 challenging real-world conversational samples, Sparrow-1 achieved 55ms median prediction latency, 100% precision and recall, and zero interruptions across all 28 samples, while competing approaches were forced to choose between waiting several seconds or cutting speakers off.
In a candidate screening call, an applicant says "I haven't done that at scale" while maintaining steady eye contact and an open posture. Sparrow-1 holds the floor open as they gather their next thought. Raven-1 interprets the delivery as candor rather than limitation, giving the LLM context to explore depth of experience rather than move on. Phoenix-4 renders responsive facial behavior informed by that perception, nodding subtly during a long pause, maintaining eye contact as the candidate works through a complex answer.
For the recruiting team running hundreds of screens per month, the closed loop means higher signal per conversation and fewer qualified candidates lost to a robotic experience.
Latency isn't an engineering metric in isolation; it's a unit economics problem. When latency degrades presence, completion rates drop and the cost of each successful outcome rises.
A healthcare intake system running 10,000 patient conversations per month needs low latency because every abandoned conversation is a scheduling gap, a repeated call, or a missed triage signal.
Teams evaluating conversational AI infrastructure should map expected conversation volume to the dollar value of each completed interaction, then test whether latency at P95 and P99 protects that value or erodes it. Pipeline stages like Knowledge Base retrieval (approximately 30ms) and cross-session Memories add minimal latency but significantly improve response quality, meaning teams should test with these features enabled, not just on bare inference.
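That mapping can be sketched with a simple abandonment model: abandoned conversations are paid for but deliver no value, so as tail latency pushes abandonment up, cost per completed interaction rises. All numbers here are illustrative assumptions, not measured rates:

```python
def cost_per_completion(monthly_volume, cost_per_conversation, abandon_rate):
    """Cost of each completed interaction once abandoned conversations
    are paid for but deliver nothing."""
    completed = monthly_volume * (1 - abandon_rate)
    total_cost = monthly_volume * cost_per_conversation
    return total_cost / completed

# Illustrative: 10,000 conversations/month at $0.50 each.
fast = cost_per_completion(10_000, 0.50, abandon_rate=0.05)  # ~$0.53
slow = cost_per_completion(10_000, 0.50, abandon_rate=0.25)  # ~$0.67
```

Plug in your own volume, per-conversation cost, and observed abandonment at P95/P99 latency to see what tail latency actually costs.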
The tests below focus on production failure points where users feel the system fall out of rhythm. Run them against your actual deployment configuration, not a local environment.
Run these as regression tests after any change to model choice, streaming configuration, retrieval, or transport.
One infrastructure note: Web Real-Time Communication (WebRTC)-based delivery introduces variable network jitter that lab setups won't capture. Research on geographic latency shows that a user in Singapore connecting to US-East infrastructure faces 200-250ms of round-trip network latency before any processing occurs.
Tavus exposes its Conversational Video Interface (CVI) through APIs, giving teams the ability to profile latency against their own deployment configurations. If a system only performs well on localhost and falls apart on cellular connections, the evaluation is incomplete.
Cross-linguistic research shows that human conversational timing is remarkably consistent across cultures, with response times clustering around 200ms. When an AI system misses that window, the people on the other end feel it viscerally, even if they can't name what went wrong.
Presence is the real threshold: the point where your user stops thinking about the technology and focuses on the conversation itself. In text, you earn it with streaming and fast inference. In voice, you earn it with precise conversational flow. In video, you earn it by closing the loop between perception and expression so completely that the system responds with the attentiveness of a present listener.
See how Tavus can help you achieve sub-second response in production. Book a demo.