A nod at the right moment tells someone you're listening. A slight furrow of the brow signals confusion before a word is spoken. These aren't decorative gestures; they're the infrastructure of human conversation, carrying meaning that words alone can't.

Research in Trends in Cognitive Sciences confirms that facial signals function as indicators of social actions and speaker intentions, not merely emotion displays. They shape how we interpret what someone means, not just how they feel.

That signal layer has been almost entirely absent from digital interactions. Text strips it out completely. Voice preserves tone but loses everything visual.

Even video calls degrade it: Stanford's Virtual Human Interaction Lab found that producing and interpreting nonverbal cues over video creates measurable cognitive fatigue. The result is that digital interactions can feel more transactional, especially when brands rely heavily on machine-mediated or minimal-contact channels.

For AI systems entering high-stakes conversations, this deficit is a concrete problem. AI capable of doing both sides of the nonverbal exchange, perceiving what a person communicates beyond words and producing appropriate responses in return, represents a meaningfully different category. That bidirectional capability is becoming an increasingly important frame for the field.

Why nonverbal cues matter in AI body language

Popular rules of thumb about nonverbal communication are often overstated outside their original context. But the narrower point still matters: when verbal and nonverbal signals diverge, nonverbal behavior can strongly shape how people interpret meaning and intent.

That dynamic plays out in documented ways across professional settings. In healthcare research, nonverbal behaviors such as eye contact, posture, smiling, mirroring, and nodding are associated with better patient-centered outcomes. A rapid review found that nonverbal communication strategies, including active listening, touch, and eye contact, were linked to improved patient-centered outcomes in all seven studies it examined, while noting limited experimental evidence for how specific combinations of cues, such as nodding, eye gaze, and eyebrow movement, drive those results.

For any AI system entering a live conversation, the consequence is practical: if the system can't perceive these signals from the human and can't produce appropriate signals in return, the interaction will feel flat regardless of how accurate the spoken content is.

How AI body language systems perceive nonverbal signals today

The field of AI body language perception has increasingly shifted toward multimodal fusion, where visual and auditory signals are processed together as an integrated stream. A 2025 survey on multimodal emotion recognition in conversations confirms that fusing facial expression, vocal tone, and linguistic content produces a more accurate understanding than any single channel alone.

Related work also shows that speech features can be converted into natural language prompts that large language models can reason over directly, rather than being reduced to simple emotion labels.
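
To make that idea concrete, here is a minimal sketch of the prompt-conversion approach. The feature names, thresholds, and wording are illustrative assumptions, not any published system's pipeline; the point is that prosody reaches the language model as something it can reason over rather than a bare label.

    def describe_speech(pitch_variability: float, speech_rate: float, pause_ratio: float) -> str:
        """Map acoustic measurements to plain-language observations (toy thresholds)."""
        notes = []
        if pitch_variability < 0.2:
            notes.append("their tone is flat and monotone")
        elif pitch_variability > 0.6:
            notes.append("their pitch is varying a lot")
        if speech_rate < 2.0:
            notes.append("they are speaking slowly")
        if pause_ratio > 0.3:
            notes.append("with long pauses between phrases")
        return ", ".join(notes) if notes else "their delivery sounds neutral"

    def build_prompt(transcript: str, delivery: str) -> str:
        """Carry both what was said and how it was said into the model's context."""
        return (
            f'The user said: "{transcript}"\n'
            f"How they said it: {delivery}.\n"
            "Taking both into account, respond in a way that fits their apparent state."
        )

    print(build_prompt("I guess I'm doing fine.",
                       describe_speech(pitch_variability=0.15, speech_rate=1.8, pause_ratio=0.4)))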

An accurate picture, however, requires honest framing. Facial expression recognition models can reach accuracy above 91% on individual target datasets, but cross-domain performance averages as low as 68.75% across multiple out-of-domain test sets. Rare emotional states like fear and disgust are significantly underrepresented in benchmark samples, making classifiers unreliable at exactly the moments that matter most.

Most systems are trained on English-language datasets, creating a structural validity problem for global deployments where emotional expression varies by culture.

These limitations don't mean perception is impossible. They mean the output format matters. Systems that produce nuanced, natural language descriptions of what they perceive ("the person seems hesitant, leaning back slightly with a furrowed brow") give downstream reasoning systems far more to work with than those that output a single label ("confused, 72% confidence").
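
The contrast is easy to see side by side. The values below are invented for illustration; one format collapses the moment into a bucket, while the other preserves the evidence a reasoning layer can actually use.

    # Two output formats for the same perceptual moment (values are made up).
    categorical_output = {"emotion": "confused", "confidence": 0.72}

    descriptive_output = (
        "The person seems hesitant: they are leaning back slightly, their brow is "
        "furrowed, and they glanced away twice while the question was asked."
    )

    # The label tells a downstream system almost nothing about what to do next;
    # the description lets it decide whether to slow down, rephrase, or ask a
    # clarifying question.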

The harder problem: producing nonverbal cues in real time

Perception gets most of the attention, but production is where the field faces its steepest challenges. Teaching an AI system to generate contextually appropriate facial expressions, head movements, and listening behaviors in real time during a live conversation is a fundamentally different problem than analyzing a recorded clip after the fact.

Most commercial systems still rely on rule-based animation using pre-recorded gestures and scripted conditions. These produce rigid, low-diversity motion that doesn't adapt to conversational context.

Data-driven approaches produce better motion, but most require future speech context, making them incompatible with real-time interaction.

The most telling failure mode involves active listening. Research on audio-driven conversational AI systems has highlighted challenges in generating natural listener behavior during the other person's speaking turn. Natural listening should include subtle nods, timely blinks, shifts in eye gaze, and soft facial movements that reflect attention.

As of 2025, generating that behavior in real time remains an open research problem.

Academic work has explored real-time generation of expressive gestures and facial expressions in live dialogue. These efforts primarily focus on one-sided gesture generation, in which a system produces motion in response to its own speech.

The more complex scenario is dyadic interaction, where two participants influence each other's behavior in real time. That capability remains largely unexplored, according to a recent survey of the field.

This is precisely where real-time conversational video infrastructure parts ways with static, one-way video systems. Static video can look polished, but it can't respond to what's happening in the conversation. Producing nonverbal behavior that responds to the person you're talking to in the moment requires perception and expression to operate in a continuous loop.
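
A stripped-down sketch of that loop is shown below. The perception and behavior functions are stand-ins invented for the example; the structural point is that both run on every tick, including while the AI is the listener.

    def perceive(frame: dict) -> str:
        """Stand-in perception step: summarize the user's visible state this tick."""
        if frame["gaze_away"] and frame["pause_ms"] > 800:
            return "hesitant: gaze drifting, long pause"
        return "engaged: gaze steady"

    def choose_behavior(state: str, ai_is_speaking: bool) -> str:
        """Stand-in production step: pick a nonverbal response for this tick."""
        if not ai_is_speaking:
            return "hold the floor, slight nod" if "hesitant" in state else "attentive nod"
        return "maintain a warm expression"

    # Simulated ticks; a live system would pull roughly 30 frames per second
    # from camera and audio streams instead.
    ticks = [
        {"gaze_away": False, "pause_ms": 120, "ai_speaking": False},
        {"gaze_away": True,  "pause_ms": 950, "ai_speaking": False},
    ]
    for frame in ticks:
        print(choose_behavior(perceive(frame), frame["ai_speaking"]))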

Importance of perception in AI body language

The conversations where AI body language matters most are the ones where presence determines outcomes. A new employee practicing a difficult feedback conversation needs an AI Persona for coaching that looks engaged, not frozen. A patient explaining symptoms at 2 AM needs an interaction that feels genuinely attentive.

Tavus, a Human Computing research lab, addresses this deficit through a closed-loop behavioral stack: AI Personas that can see, hear, understand, and respond in real-time video interactions. Raven-1, Tavus's multimodal perception system, fuses audio and visual signals to produce natural-language descriptions of the user's emotional and attentional state. It tracks emotional shifts within a single turn and keeps context no more than 300ms stale.

Sparrow-1, the conversational flow model, governs timing through continuous floor-ownership prediction, with a median floor-prediction latency of 55ms. In benchmark testing, it achieves 100% precision, 100% recall, and zero interruptions, responding at precisely the moment a human listener would. That breaks the usual tradeoff between speed and correctness.
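
Sparrow-1's model is proprietary, so the toy below is only meant to illustrate what floor-ownership prediction decides on every tick: given the latest audio cues, has the user actually yielded the floor, or are they still mid-thought? The inputs and thresholds are invented for the example.

    def user_is_yielding_floor(silence_ms: int, pitch_falling: bool, mid_sentence: bool) -> bool:
        """Toy end-of-turn decision; real systems use learned models, not rules."""
        if mid_sentence:
            return False                  # never cut the user off mid-thought
        if pitch_falling and silence_ms > 200:
            return True                   # terminal intonation plus a short gap
        return silence_ms > 900           # otherwise wait for a clearly long pause

    # A hesitating speaker: short silence, rising pitch, sentence unfinished.
    print(user_is_yielding_floor(silence_ms=600, pitch_falling=False, mid_sentence=True))  # False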

The large language model (LLM) layer reasons about what to say and how to respond based on that perceptual context. It draws on Memories to retain context across sessions and operates within Objectives and Guardrails that keep every response inside enterprise-defined boundaries.

Phoenix-4, the real-time facial behavior engine, renders emotionally responsive expressions while listening and while speaking. It produces 10+ controllable emotional states, active listening behavior, and emergent micro-expressions while the other person speaks.

In a compliance training scenario, that loop means the AI Persona for coaching delivers more than correct information. When a trainee hesitates mid-answer, Sparrow-1 holds the floor open rather than jumping in. Raven-1 perceives the hesitation and shifting gaze as signals of uncertainty, not disengagement.

The LLM adjusts its response to offer encouragement. Phoenix-4 produces a slight nod and an attentive expression, the kind of nonverbal signal that a good human coach would give without thinking about it.

For a health tech platform running post-discharge follow-up, the same loop handles a different emotional register. A patient says they've been "doing fine" since leaving the hospital, but their voice trails off, and their expression tightens. Raven-1 perceives the incongruence between the verbal response and those nonverbal signals.

Sparrow-1 holds the floor open rather than moving to the next intake question. The LLM, grounded in the organization's clinical Knowledge Base, surfaces a more specific follow-up prompt. Phoenix-4 holds a steady, attentive expression that signals the AI Persona is listening, not waiting.

The result is a conversation that catches what a rushed discharge call would miss.

Tavus's Conversational Video Interface (CVI) exposes this infrastructure through APIs and SDKs, allowing product teams to build these interactions into their own applications with white-label capability and Knowledge Base integration that grounds every response in organization-specific data.
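
As a rough idea of what that integration looks like, here is a hedged sketch of starting a conversation over the API. The endpoint path and field names follow Tavus's public documentation at the time of writing, but treat them as assumptions and check the current reference before building against them.

    import os
    import requests

    # Create a live conversation with an existing AI Persona (placeholder ID).
    response = requests.post(
        "https://tavusapi.com/v2/conversations",
        headers={"x-api-key": os.environ["TAVUS_API_KEY"]},
        json={
            "persona_id": "p_your_persona_id",
            "conversation_name": "post-discharge follow-up",
        },
        timeout=30,
    )
    response.raise_for_status()
    # The response typically includes a conversation URL the client embeds to
    # join the real-time video session.
    print(response.json())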

What to evaluate before deploying

Product leaders considering AI systems with nonverbal capabilities should assess three dimensions honestly, especially if they're still exploring where conversational presence matters most in high-volume workflows.

  • Perception fidelity: Does the system fuse audio and visual signals, or analyze them separately? Cross-dataset accuracy matters more than within-dataset benchmarks, and natural-language output formats provide downstream systems with richer context than categorical labels (a minimal cross-dataset check is sketched after this list).
  • Production quality under real-time constraints: Can the system generate appropriate nonverbal behavior while listening, not just while speaking? Active listening is the specific capability that separates conversational presence from animated playback.
  • Regulatory awareness: The EU AI Act's prohibition on certain unacceptable-risk practices, effective from February 2, 2025, includes restrictions on emotion detection in workplace and educational settings, with narrow exceptions for medical or safety purposes. Enterprise buyers in EU jurisdictions should map their specific use case against current requirements.
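
For the first bullet, the check is straightforward to run: score the model on its home dataset, then on corpora it never saw, and compare. The stub classifier, clip IDs, and corpus names below are placeholders.

    class StubClassifier:
        """Placeholder for a facial-expression model under evaluation."""
        def predict(self, clip_id: str) -> str:
            return "neutral"

    def accuracy(model, dataset) -> float:
        correct = sum(1 for clip, label in dataset if model.predict(clip) == label)
        return correct / len(dataset)

    def cross_dataset_report(model, in_domain, out_of_domain) -> None:
        print(f"in-domain: {accuracy(model, in_domain):.1%}")
        for name, dataset in out_of_domain.items():
            # The gap between these scores and the in-domain number is what
            # predicts real-world behavior.
            print(f"{name}: {accuracy(model, dataset):.1%}")

    in_domain = [("clip_a", "neutral"), ("clip_b", "happy")]
    out_of_domain = {"other_corpus": [("clip_c", "fear"), ("clip_d", "neutral")]}
    cross_dataset_report(StubClassifier(), in_domain, out_of_domain)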

Taken together, those checks give product leaders a practical way to separate polished demos from systems that can hold up in live conversation.

These three dimensions connect. A system that perceives well but produces rigid responses wastes its perception. A system that produces beautiful facial animation without perceiving what the human actually needs is only putting on a show.

Tavus's behavioral stack is designed to keep all three dimensions connected.

The conversation that feels like a conversation

The gap between analyzing body language and engaging with it is the gap between observation and presence. Most AI systems today sit firmly on the observation side, treating nonverbal cues as data to extract from humans.

The AI Personas that will earn trust in high-stakes conversations are the ones that close the loop, perceiving what a person communicates beyond words and responding with attentive, contextually appropriate nonverbal behavior that makes someone feel genuinely heard. That's what presence means in a digital interaction: showing up, fully, for the person across the screen.

See it for yourself. Book a demo.