Sparrow-1: Human-Level Conversational Timing in Real-Time Voice


Sparrow-1 is a specialized, multilingual audio model for real-time conversational flow and floor transfer. It predicts when a system should listen, wait, or speak, enabling response timing that mirrors human conversation rather than simply responding as fast as possible.
Despite major advances in LLMs and TTS, conversational AI still lacks reliable human-level timing. Traditional voice systems wait for silence, then respond. Sparrow-1 instead models conversational timing continuously. This allows it to respond quickly, even instantaneously when the speaker is clearly done, all while deliberately waiting when they’re not.
The difference is subtle but transformative: Sparrow-1 doesn't just respond as fast as possible. It responds at the moment a human listener would.
Conversation is not just an exchange of words. It is a real-time coordination task where participants continuously anticipate when to respond, drawing on rhythm, hesitation, intonation, and meaning at the same time. Sparrow-1 models this coordination directly, aligning its behavior with the timing patterns humans use subconsciously during dialogue.
Research in conversation analysis and psycholinguistics has identified several categories of signals that govern conversational flow, from rhythm and intonation to hesitation and semantic completion.
When you talk to an AI with a human voice, you expect human timing. When timing breaks down, you notice immediately. Delayed responses, premature interruptions, and awkward pauses shatter the rhythm of natural dialogue.
Today’s voice AI sounds increasingly human, yet still feels mechanical in conversation. Systems on platforms like ChatGPT, Claude, and Grok decide when to speak using endpoint detection, waiting for a silence threshold to elapse before responding. They react to the absence of sound rather than to conversational intent, leading to missed hesitation cues and poorly timed responses. The voice sounds real, but the interaction does not.
Human-level conversation requires more than endpoint detection, and most systems fall short because they treat conversational flow as an afterthought: a threshold to tune rather than a problem to model.
Sparrow-1 takes a different approach: it models humanlike floor transfer with intent, timing, and tone, ensuring that conversational timing matches the realism of the voice delivering it.

Sparrow-1 is a conversational flow control model built for real-time conversational video in Tavus’s Conversational Video Interface. It treats timing as a first-class modeling problem rather than an artifact of endpoint detection, extending Sparrow-0 with a more capable architecture and richer supervision.
Most existing turn-taking systems are built around endpoint detection. They wait for speech to stop, apply silence thresholds, and then trigger a response. This reactive design introduces latency, misinterprets hesitation as turn completion, and fails to support natural conversational behaviors such as backchanneling, overlap, and interruption. Silence is treated as a proxy for intent, even though the absence of speech does not reliably signal that a speaker has yielded the conversational floor.
Sparrow-1 takes a different approach. Instead of asking whether speech has ended, it models who owns the conversational floor at every moment, allowing it to anticipate turn transitions rather than react to them.
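To make the contrast concrete, here is a minimal sketch of the two decision styles. The frame cadence, thresholds, and probability signal are illustrative assumptions, not Sparrow-1's implementation.

```python
# A hypothetical comparison, not Sparrow-1's implementation. The frame
# cadence, threshold, and probability signal are illustrative assumptions.

FRAME_MS = 40  # assumed per-frame decision cadence


def endpoint_detector(is_silent: bool, silence_ms: int,
                      threshold_ms: int = 700) -> tuple[bool, int]:
    """Reactive baseline: speak only after threshold_ms of continuous silence."""
    silence_ms = silence_ms + FRAME_MS if is_silent else 0
    return silence_ms >= threshold_ms, silence_ms


def floor_ownership_policy(p_user_holds_floor: float) -> str:
    """Predictive alternative: act on who owns the floor, not on silence.

    p_user_holds_floor is a per-frame probability from a timing model.
    """
    if p_user_holds_floor < 0.1:    # user has clearly yielded: respond now
        return "speak"
    if p_user_holds_floor > 0.6:    # user is mid-turn or hesitating: hold back
        return "listen"
    return "wait"                   # ambiguous: keep evaluating next frame


# The endpointer cannot distinguish a mid-sentence pause from a finished turn;
# the floor policy can, because its input encodes intent rather than silence.
speak, acc = endpoint_detector(is_silent=True, silence_ms=680)
print(speak, floor_ownership_policy(0.05))
```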
Sparrow-1 is not a general language model, or even strictly a turn-taking model. It is a timing and control system that governs when a conversational system should speak, wait, or get out of the way: a conversational-flow model.
This distinction matters because conversational timing is not handled cleanly by most real-time voice architectures. Today, two dominant approaches exist:
End-to-end speech-to-speech models handle timing implicitly but are expensive, opaque, and difficult to control or customize. They achieve fluency by tightly coupling perception, reasoning, and generation, but sacrifice efficiency and controllability in the process.
Modular pipelines (ASR → LLM → TTS) are flexible and scalable but suffer from a coordination problem: timing decisions fall between components, with no dedicated mechanism for deciding when the system should speak.
Sparrow-1 fills this gap. By explicitly modeling conversational floor transfer as a standalone timing and control layer, it brings human-level conversational flow to modular pipelines, preserving their flexibility while restoring the conversational feel users expect.
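As a rough illustration of where such a layer sits, the sketch below wires a hypothetical flow model into a modular pipeline. All component names and interfaces here are stand-ins, not Tavus APIs.

```python
# Hypothetical sketch: a dedicated conversational-flow layer gating the
# response in a modular ASR -> LLM -> TTS pipeline. All components are stubs;
# names and interfaces are illustrative assumptions, not Tavus APIs.

class StubASR:
    def feed(self, frame: bytes) -> str:
        return "partial transcript"        # incremental transcription

class StubFlowModel:
    def update(self, frame: bytes, state: float) -> float:
        return 0.05                        # pretend: user has clearly yielded

    def should_speak(self, state: float) -> bool:
        return state < 0.1                 # floor-transfer decision, not a silence timer

class StubLLM:
    def respond(self, transcript: str) -> str:
        return "Sure, happy to help."

class StubTTS:
    def speak(self, text: str) -> None:
        print(f"[TTS] {text}")

def pipeline_step(frame, state, asr, flow, llm, tts):
    """One 40ms frame: transcribe incrementally, update timing state, and
    invoke generation only when the flow layer yields the floor."""
    transcript = asr.feed(frame)
    state = flow.update(frame, state)
    if flow.should_speak(state):
        tts.speak(llm.respond(transcript))
    return state

state = pipeline_step(b"\x00" * 640, 1.0, StubASR(), StubFlowModel(), StubLLM(), StubTTS())
```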
Conversational-flow systems are often evaluated on clean endpoints and average latency, but these metrics do not capture the real dynamics of human turn-taking, and they miss the failures that matter most in real conversation: cutting users off, waiting too long, or behaving inconsistently during hesitation.
To evaluate these cases, we benchmarked Sparrow-1 against representative industry approaches on 28 challenging real-world audio samples of natural conversation, designed to expose hesitation, overlap, and ambiguous turn endings rather than clean silence.
Each system was evaluated on the same set of 28 samples, with performance measured across response latency, correct floor transfer, and interruptions. Correct floor transfer is quantified using precision and recall with a 400ms grace window that reflects the tolerance humans naturally allow in conversation: detections occurring within 400ms before a speaker finishes are treated as correct, while earlier responses are classified as interruptions. Precision captures how often a system avoids cutting users off, while recall measures how reliably it responds when a turn is actually complete.
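For readers who want the metric pinned down, here is one way the grace-window scoring could be computed. The data and helper are hypothetical; only the 400ms window and the precision/recall definitions come from the benchmark description above.

```python
# Illustrative scoring sketch for the grace-window metric described above.
# Timestamps are in milliseconds; the sample data is made up for demonstration.
# (Latency itself is reported separately and is not scored here.)

GRACE_MS = 400

def score_turns(events):
    """events is a list of (detection_ms, turn_end_ms) pairs, one per turn;
    detection_ms is None when the system never responded."""
    correct = interruptions = missed = 0
    for detected, turn_end in events:
        if detected is None:
            missed += 1
        elif detected >= turn_end - GRACE_MS:
            correct += 1          # within the 400ms human-tolerance window
        else:
            interruptions += 1    # responded while the user still held the floor
    precision = correct / (correct + interruptions)  # avoids cutting users off
    recall = correct / (correct + missed)            # responds when turns complete
    return precision, recall

print(score_turns([(1180, 1200), (600, 1200), (None, 900)]))  # -> (0.5, 0.5)
```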
Across existing approaches, the benchmark exposes a consistent speed and correctness tradeoff. Conservative systems minimize interruptions by waiting for extended silence, but impose multi-second delays that feel unnatural in dialogue. More aggressive systems reduce latency by lowering detection thresholds, but interrupt users frequently. In practice, systems are forced to choose between being slow or being wrong.
These results show that this tradeoff is not inherent to conversation, but a consequence of endpoint-based turn-taking design.
Where existing systems force a choice between responsiveness and correctness, Sparrow-1 avoids the compromise by responding quickly when a turn is complete and waiting when the user is still speaking, achieving both speed and correctness.
This performance reflects a fundamentally different approach. Sparrow-1 treats conversational flow as continuous, frame-level floor ownership prediction, aligning its behavior with human conversational timing.

Human conversation optimizes for appropriateness, not speed. People respond quickly when intent is clear and wait when meaning is uncertain.
Because Sparrow-1 models conversational certainty directly, its response latency is dynamic. It responds in under 100ms when confident and waits during hesitation or trailing speech, typically producing response times of 200 to 500ms without multi-second delays.
This ability to be simultaneously fast and patient creates the perception of zero-latency conversation. The system responds not as quickly as possible, but at the moment it should.
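A toy sketch of what confidence-gated timing can look like, using the latency figures quoted above as rough anchors. The mapping and thresholds are assumptions for illustration, not Sparrow-1 internals.

```python
# A minimal sketch of confidence-gated response timing. Thresholds and the
# confidence-to-delay mapping are illustrative assumptions.

def hold_time_ms(p_turn_complete: float) -> float | None:
    """Map per-frame turn-completion confidence to how long to keep waiting.

    Returns None when the system should keep listening rather than
    schedule a response at all.
    """
    if p_turn_complete > 0.95:
        return 80.0     # clearly done: respond near-instantly (sub-100ms)
    if p_turn_complete > 0.7:
        return 350.0    # probably done: a short, human-feeling beat
    return None         # hesitation or trailing speech: keep listening

for p in (0.99, 0.8, 0.4):
    print(p, hold_time_ms(p))
```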
These design choices manifest as concrete runtime behaviors that govern how Sparrow-1 adapts, interrupts, and listens during live conversation. At runtime, turn-taking emerges from continuous speaker adaptation, interruption-aware control, and audio-native perception rather than fixed rules or thresholds. The result is behavior that closely matches how humans manage conversational flow in practice.
Sparrow-1 behaves as a meta in-context learner, adapting to individual speaking patterns continuously as a conversation unfolds. Its recurrent architecture updates internal state with every 40ms frame, encoding prosody, pacing, historical turn timing, and response-latency preferences.

Early in a conversation, the model operates with higher uncertainty. As evidence accumulates, predictions sharpen around user-specific patterns, producing progressive synchronization without explicit calibration.
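In code, a streaming recurrent update of this kind might look like the following PyTorch sketch. The feature sizes, GRU cell, and prediction head are assumptions; Sparrow-1's actual architecture is not described here.

```python
# Illustrative PyTorch sketch of streaming, frame-level floor prediction with
# recurrent state. Dimensions and modules are assumptions, not Sparrow-1's
# published architecture.
import torch
import torch.nn as nn

FEAT_DIM, STATE_DIM = 64, 128  # assumed per-frame feature and state sizes

class FloorTracker(nn.Module):
    def __init__(self):
        super().__init__()
        self.cell = nn.GRUCell(FEAT_DIM, STATE_DIM)  # one update per 40ms frame
        self.head = nn.Linear(STATE_DIM, 1)          # floor-ownership logit

    def step(self, frame_feats: torch.Tensor, state: torch.Tensor):
        """Consume one 40ms frame of audio features; return the updated
        state plus P(user holds the floor)."""
        state = self.cell(frame_feats, state)
        p_floor = torch.sigmoid(self.head(state))
        return state, p_floor

model = FloorTracker()
state = torch.zeros(1, STATE_DIM)   # high-uncertainty start of conversation
for _ in range(5):                  # 5 frames = 200ms of audio
    feats = torch.randn(1, FEAT_DIM)
    state, p_floor = model.step(feats, state)
```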
Interruptions are treated as first-class conversational signals. Incoming speech during system output immediately pauses playback while the model continues evaluating floor ownership. If confidence rises, Sparrow-1 yields the turn. If not, it resumes speaking. This process distinguishes intentional interruptions from incidental overlap within tens of milliseconds without introducing delay.
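The control loop this implies can be sketched as a small per-frame state machine. The confidence threshold and playback interface below are illustrative assumptions.

```python
# A simplified control sketch of the interruption handling described above:
# incoming speech pauses playback immediately, and the system either yields
# or resumes based on floor-ownership confidence. The threshold and playback
# interface are illustrative assumptions.

class StubPlayback:
    def pause(self):  print("playback paused")
    def resume(self): print("playback resumed")
    def stop(self):   print("playback stopped, turn yielded")

def on_frame(user_speaking: bool, p_user_wants_floor: float, playback) -> str:
    """Per-frame control while the system holds the floor."""
    if not user_speaking:
        playback.resume()            # overlap was incidental: keep talking
        return "speaking"
    playback.pause()                 # duck immediately on incoming speech
    if p_user_wants_floor > 0.8:
        playback.stop()              # intentional interruption: yield the turn
        return "listening"
    return "paused"                  # ambiguous: keep evaluating next frame

pb = StubPlayback()
print(on_frame(True, 0.3, pb))   # brief overlap  -> paused
print(on_frame(True, 0.9, pb))   # sustained intent -> listening
```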

Sparrow-1 models conversational intent using acoustic and temporal cues that extend beyond lexical content, interpreting not just what is said but how it is said: prosody, pacing, intonation, and hesitation.
By incorporating these paralinguistic signals into its floor predictions, Sparrow-1 aligns with how humans naturally infer attention, hesitation, and intent during conversation, resulting in listening that feels responsive rather than reactive.
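As a flavor of what such paralinguistic cues can look like in code, the numpy sketch below computes per-frame energy and a trailing-energy slope as a crude proxy for trailing-off speech. These particular features are assumptions for illustration, not Sparrow-1's actual inputs.

```python
# Illustrative numpy sketch of simple paralinguistic cues a flow model might
# consume. The feature choices are assumptions for demonstration only.
import numpy as np

SR = 16_000
FRAME = int(SR * 0.040)  # 40ms frames

def frame_features(audio: np.ndarray) -> dict:
    frames = audio[: len(audio) // FRAME * FRAME].reshape(-1, FRAME)
    rms = np.sqrt((frames ** 2).mean(axis=1))          # loudness per frame
    tail = rms[-5:]                                    # last 200ms of speech
    slope = np.polyfit(np.arange(len(tail)), tail, 1)[0]
    return {
        "rms_last": float(rms[-1]),     # is the user audible right now?
        "energy_slope": float(slope),   # falling energy often signals trailing off
    }

print(frame_features(np.random.randn(SR)))  # one second of fake audio
```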
We built Sparrow-1 as part of a broader mission: teaching machines to participate in human conversation. Our Conversational Video Interface (CVI) powers AI experiences that look, sound, and interact like real people, and poor timing breaks that illusion faster than almost anything else.
In conversational AI, the uncanny valley is rarely about what the AI says. It's about when it says it. Responses that arrive too early feel rude; too late, artificial. In conversational video, these errors are amplified, reminding users they're speaking to a system rather than a partner.
We use Sparrow-1 to solve this at the level it must be solved: as a first-class timing and control system. By modeling conversational uncertainty directly and responding with human-like precision, it enables interactions that feel attentive, patient, and natural.
Sparrow-1 is now generally available across the Tavus APIs and platform, and it already powers conversational experiences in Tavus PALs and enterprise deployments.
Try the demo at tavus.io and learn more in our docs.