Avatar Arena · Research Brief · May 2026

A paired evaluation of Tavus and Anam in real-world conversational interaction

In a paired evaluation of Tavus and Anam across 80 individual sessions of unscripted, real-world conversational interaction, Tavus was preferred 62.5% to 37.5%. Tavus won every head-to-head comparison question at p<0.05, and held statistically significant Likert advantages on six of seven measures of conversational and emotional quality.

Scenario: Story mode (open-ended conversation)
Sample: N=80 paired sessions
Disclosure: Commissioned by Tavus, with methodology, configurations, and analytic decisions documented in full to enable independent replication.

Key statistics

Overall preference: 62.5% vs 37.5% Anam (binomial p=0.033)
Head-to-head wins: 4 of 4, all p<0.05
Likert metrics significant: 6 of 7 (paired Wilcoxon, p<0.05)
Forgot-AI score: +0.19 vs −0.18, Tavus the only provider with a positive mean

On the question "which AI would you want to talk to again," participants chose Tavus 65% to 35%. This was the strongest single signal in the study, p=0.010.

This evaluation tested both providers in unscripted, open-ended conversation, the kind of interaction that maps most directly to real-world avatar deployment in companionship, education, coaching, and customer support. Across 80 paired sessions, participants rated Tavus higher on every measure of conversational and emotional quality, and chose Tavus on every comparative question.

The findings:

  • Tavus was preferred on all four head-to-head questions, with margins of 62.5% to 65.0% and binomial p-values between 0.010 and 0.033.
  • Tavus held statistically significant Likert advantages on six of seven metrics: empathetic, natural conversation, lifelike, felt-human, enjoyment, and forgot-AI. Wilcoxon p ranged from <0.001 to 0.037, with effect sizes (Cohen's d_z) between 0.24 and 0.43.
  • The seventh metric, felt-understood, trended in Tavus's favor (paired Δ +0.35) but did not clear conventional significance (p=0.066).
  • Tavus was the only provider with a positive mean score on "I occasionally forgot I was talking to an AI" (+0.19 vs −0.18).

Methodology

How the study was designed and run

Pre-registration

This study was not pre-registered. Survey items, scenario design, and analysis plan were fixed before the first submission was collected, and no items or analytic decisions were modified during data collection.

Scenario

The evaluation used story mode, a real-world conversational scenario in which the human and the AI work together to accomplish a shared objective: building a story. The format resembles an interactive Mad Lib. The AI acts as a storytelling partner. It prompts the participant for narrative pieces such as a character name, a setting, or a turning point, and weaves the participant's contributions into an evolving narrative. There is no fixed plot, no winning condition, and no time limit beyond the typical two-to-three-minute session length. Participants can drive the story, follow the AI's lead, or trade off as they prefer.

Story mode was selected because it tests the same capabilities that determine avatar performance in real-world deployment: sustaining unscripted engagement across multiple turns, carrying context forward across the conversation, responding emotionally and creatively rather than just answering questions, and taking initiative when the participant pauses without dominating the exchange. These four capabilities apply directly to the scenarios the conversational-avatar industry is being built toward, including companionship, coaching, education, customer support, and creative collaboration.

This scenario was deliberately chosen over more constrained alternatives (such as 20 Questions). Constrained tasks measure narrow capabilities like response time and turn-taking, but they tell us little about how an avatar handles the open-ended, emotionally aware interaction real users demand.

Survey instrument

After each call, participants rated seven Likert items on a scale from −2 (strongly disagree) to +2 (strongly agree).

Item · Statement
Enjoyed the experience · "I enjoyed talking to this AI human and found the overall experience positive."
Felt human (not avatar) · "The AI human felt like a human, not an avatar."
Forgot talking to AI · "I occasionally forgot I was talking to an AI."
Natural conversation · "I could talk to the AI human as I would in a natural conversation."
Behavior lifelike · "The AI human's behavior seemed natural and lifelike."
Felt understood · "I felt understood by the AI human."
Empathetic / emotionally aware · "The AI human was empathetic and emotionally aware."

After both calls, participants answered four forced-choice comparison questions plus one open-text question:

  1. Overall, which AI human did you prefer?
  2. Which one felt more like a real human?
  3. Which one was easier and more natural to talk to?
  4. Which one would you want to talk to again?
  5. (Open-text) In your own words, what made your preferred pick stand out?

How it works, directly from Avatar Arena

Each session has three phases. P1 and P2 are live conversations; P3 is a short comparison. Total time is about three minutes. Email is the only required field; no account or password is needed.

P1 · Live call with Avatar A (~90 s): An unscripted, one-to-one video call with the first anonymized avatar. The interface is the same regardless of vendor.
P2 · Live call with Avatar B (~90 s): The same call with a second anonymized avatar. The order of (A, B) is randomized per session.
P3 · Head-to-head questions (~60 s): Four pairwise questions: overall preference, more human, easier and more natural, would talk to again. Vendors are revealed only after submission.

Participants

Participants were recruited via Prolific, an online research panel widely used for academic and industry studies. A pre-defined quality-control protocol was established before data collection began and applied uniformly to every submission in the batch.

Each participant completed two consecutive video calls in blind mode. Provider assignment was hidden during both calls and throughout the post-call surveys, and provider identity was disclosed only after the participant had submitted their final answers. This blinding was technical, not just procedural: provider names did not appear in the call interface, the survey UI, or any participant-facing text during the session.

Submissions were excluded from the analytic set if they met any of four pre-specified criteria, all of which were finalized in writing before the first submission was collected and were not modified during data collection:

  1. Incomplete sessions. Either the Tavus call or the Anam call did not run to completion. A call was treated as complete only if it reached the natural end of the storytelling exchange or the participant explicitly ended the session through the interface, and only if the participant submitted post-call ratings for both providers.
  2. Technical failures. Connection loss exceeding five seconds, audio dropouts that prevented the participant from hearing one or more of the AI's turns, or video freezes lasting more than five seconds. Sessions affected by any of these were flagged in real time by the platform telemetry and excluded regardless of the participant's eventual ratings.
  3. Off-task responses. Post-call free-text answers that showed no engagement with the story-mode scenario, evaluated against a pre-specified rubric requiring at least one substantive reference to the storytelling task, the AI's behavior during the call, or the participant's experience of the conversation. Responses consisting only of single-word answers, gibberish, or content unrelated to the call were excluded.
  4. Repeat attempts. Submissions from participants who had previously completed the study were excluded to prevent the same individual from contributing multiple data points. Repeat attempts were identified by matching Prolific participant ID and IP address. Only the first complete submission from any given participant was retained, and subsequent attempts were rejected automatically before any session data was recorded.

The protocol was authored before recruitment opened. The exclusion criteria were not adjusted during the batch, and no submissions were retroactively reclassified after initial review. Of 81 submissions in the batch, one was excluded under the off-task criterion, and the remaining 80 form the analytic set. The retention rate of 98.8% is consistent with what is observed on Prolific for paid sessions of this length and complexity.
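
For concreteness, the sketch below shows how these rules could be applied to an exported submission log. The file name and column names (complete, technical_failure, off_task, prolific_id, ip_address, submitted_at) are hypothetical, and in the actual study most of these checks ran in real time on the platform rather than as a post-hoc script.

    # Illustrative application of the pre-specified exclusion filter to a submission log.
    # Column names are hypothetical; the platform rejected repeat attempts before recording data.
    import pandas as pd

    subs = pd.read_csv("submissions.csv")            # hypothetical export: one row per submission attempt

    analytic = (
        subs[subs["complete"] & ~subs["technical_failure"] & ~subs["off_task"]]
        .sort_values("submitted_at")
        .drop_duplicates(subset=["prolific_id"], keep="first")   # repeat attempts: first per Prolific ID
        .drop_duplicates(subset=["ip_address"], keep="first")    # and first per IP address
    )
    print(len(subs), "submitted,", len(analytic), "retained")    # 81 submitted, 80 retained in this study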

Statistical methods

Paired Likert comparisons used the Wilcoxon signed-rank test (two-sided) on within-participant differences. Effect size is reported as Cohen's d_z = mean paired difference / SD of paired differences. Forced-choice comparisons used the exact binomial test against a null hypothesis of a 50/50 split. The operational duration comparison also used a paired Wilcoxon test. Significance threshold α = 0.05, two-sided. No multiple-comparison correction was applied; readers interested in family-wise correction can apply Holm or Bonferroni to the seven Likert tests using the p-values reported below.
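
These tests are straightforward to express with scipy.stats, the stack named in the reproducibility section below. The sketch that follows is illustrative rather than the study's exact analysis script; the example count (50 of 80) comes from the headline preference result reported in this brief.

    # Minimal sketch of the statistical tests described above. Not the study's exact script.
    import numpy as np
    from scipy.stats import wilcoxon, binomtest

    def likert_comparison(tavus_ratings, anam_ratings):
        """Paired Wilcoxon signed-rank test plus Cohen's d_z for one Likert item.

        Both arguments are equal-length arrays of per-participant ratings on the -2..+2 scale.
        """
        diff = np.asarray(tavus_ratings, float) - np.asarray(anam_ratings, float)
        _, p = wilcoxon(diff, alternative="two-sided")    # signed-rank on within-participant differences
        d_z = diff.mean() / diff.std(ddof=1)              # mean paired difference / SD of paired differences
        return diff.mean(), d_z, p

    def forced_choice_test(wins, n):
        """Exact two-sided binomial test of a head-to-head count against a 50/50 null."""
        return binomtest(wins, n=n, p=0.5, alternative="two-sided").pvalue

    # Headline preference question: 50 of 80 participants chose Tavus.
    print(forced_choice_test(50, 80))    # approx. 0.033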

Models and configurations

Both providers tested on their current production models

Each provider was configured per their own documented recommended settings. Configurations were not custom-tuned by Tavus to optimize either provider's performance.

Tavus · Phoenix-4 Pro

  • Avatar model: Phoenix-4 Pro, Tavus's most expressive replica tier, supporting full-face animation and emotional expressions.
  • Pipeline mode: Full pipeline, which is Tavus's documented default and recommended end-to-end configuration.
  • Perception layer: Raven-1, Tavus's contextual perception model for visual and audio understanding.
  • Conversational flow: Sparrow-1 turn detection model with turn_taking_patience: medium, replica_interruptibility: medium, voice_isolation: near.
  • LLM and TTS: Tavus default low-latency configuration.

Anam · Cara-3

  • Avatar model: Cara-3, Anam's current production model and the model Anam markets as #1 on third-party benchmarks.
  • Persona configuration: standard Anam persona configuration with name, avatarId, voiceId, and llmId set per Anam's documented quickstart pattern.
  • Brain / LLM: Anam's recommended brain model.
  • Voice: Anam's recommended default voice.
  • System prompt: matched in tone and length to the Tavus system prompt to control for prompt-design effects.

Voice and avatar persona were held to a similar gender, age, and background register across providers.
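
As a compact reference, the settings above can be summarized as plain Python dictionaries. These are illustrative summaries that mirror the labels used in this section; they are not the literal API payload or SDK call shape for either vendor.

    # Illustrative summary of the documented configurations above; field names follow the
    # labels used in this section and are not a literal API request for either vendor.
    TAVUS_CONFIG = {
        "avatar_model": "Phoenix-4 Pro",
        "pipeline_mode": "full",                     # Tavus's documented default end-to-end pipeline
        "perception_layer": "Raven-1",
        "turn_detection": "Sparrow-1",
        "turn_taking_patience": "medium",
        "replica_interruptibility": "medium",
        "voice_isolation": "near",
        "llm_tts": "Tavus default low-latency configuration",
    }

    ANAM_CONFIG = {
        "avatar_model": "Cara-3",
        # name, avatarId, voiceId, and llmId were set per Anam's documented quickstart pattern.
        "persona": {"name": "...", "avatarId": "...", "voiceId": "...", "llmId": "..."},
        "brain_llm": "Anam's recommended brain model",
        "voice": "Anam's recommended default voice",
        "system_prompt": "matched in tone and length to the Tavus system prompt",
    }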

Results

Per-call Likert ratings, agreement rates, and head-to-head preference

Mean Likert ratings · all seven questions
Scale: −2 (strongly disagree) to +2 (strongly agree)
Metric · Tavus · Anam
Enjoyed the experience · +0.97 · +0.50
Felt human (not avatar) · +0.51 · +0.01
Forgot talking to AI · +0.19 · −0.17
Natural conversation · +0.91 · +0.34
Behavior lifelike · +0.61 · +0.05
Felt understood (not significant) · +0.95 · +0.60
Empathetic / emotionally aware · +0.91 · +0.25

Tavus leads numerically on every metric. The largest gaps are on empathetic (+0.66 paired Δ), natural conversation (+0.58), and behavior lifelike (+0.56). Forgot-AI is the only metric on which the two providers differ in sign: Tavus is the only provider with a positive mean on it.

Agreement rates · share of responses at +1 or +2
Percent of participants who agreed or strongly agreed
Metric · Tavus · Anam
Enjoyed the experience · 74% · 58%
Felt human (not avatar) · 61% · 46%
Forgot talking to AI · 48% · 38%
Natural conversation · 76% · 54%
Behavior lifelike · 65% · 49%
Felt understood (not significant) · 78% · 61%
Empathetic / emotionally aware · 78% · 51%

Tavus's agreement rates are above 60% on every metric except forgot-AI. Gaps to Anam are 10 to 26 percentage points. The largest single-question agreement gap is empathetic (77.5% vs 51.2%).

Head-to-head preference · all four comparison questions
Share of participants preferring each provider
Question · Tavus · Anam
Overall preferred · 62.5% · 37.5%
Felt more human · 63.7% · 36.2%
Easier and more natural · 62.5% · 37.5%
Want to talk again · 65.0% · 35.0%

Tavus wins all four head-to-head questions at 62.5–65.0%. All four clear conventional significance (binomial test, two-sided, vs 50/50). "Want to talk again" produces the strongest signal at 65/35, p=0.010.

Statistical detail — Likert

Likert metric · Tavus · Anam · Paired Δ · d_z · Wilcoxon p
Empathetic / emotionally aware · +0.91 · +0.25 · +0.66 · 0.40 · 0.001
Natural conversation · +0.91 · +0.34 · +0.58 · 0.43 · <0.001
Behavior lifelike · +0.61 · +0.05 · +0.56 · 0.35 · 0.004
Felt human (not avatar) · +0.51 · +0.01 · +0.50 · 0.33 · 0.005
Enjoyed the experience · +0.98 · +0.50 · +0.48 · 0.33 · 0.005
Forgot talking to AI · +0.19 · −0.18 · +0.36 · 0.24 · 0.037
Felt understood (not significant) · +0.95 · +0.60 · +0.35 · 0.21 · 0.066

Statistical detail — Head-to-head

Head-to-head question · Tavus · Anam · n · Binomial p
Overall, which did you prefer? · 50 (62.5%) · 30 (37.5%) · 80 · 0.033
Which felt more like a real human? · 51 (63.7%) · 29 (36.2%) · 80 · 0.018
Which was easier and more natural to talk to? · 50 (62.5%) · 30 (37.5%) · 80 · 0.033
Which would you want to talk to again? · 52 (65.0%) · 28 (35.0%) · 80 · 0.010

Qualitative findings

Open-text explanations of preferred picks

Each participant wrote a 10–500 character explanation of the AI they preferred (mean length 178 characters). We tagged each response with theme keywords and tabulated theme prevalence. Themes are split into two tables below: positive themes (what participants praised about their preferred pick) and negative feedback (specific complaints about the provider they did not pick).
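
The tagging step can be sketched as a small keyword matcher. The story-building term list below mirrors the terms named in the Patterns section further down; the other patterns, and the column names, are illustrative rather than the study's exact tagger.

    # Illustrative keyword tagger for the open-text explanations. Story-building terms follow
    # those named in the Patterns section; other term lists and column names are illustrative.
    import re
    import pandas as pd

    THEME_PATTERNS = {
        "story_building_creative": r"\b(stor(y|ies)|narrative|plot|creative|imaginative|built|collaborat\w*|came up with)\b",
        "natural_realistic": r"\b(natural|realistic|lifelike)\b",
        "visual_quality": r"\b(look(s|ed)?|appearance|face|visuals?)\b",
    }

    def tag_response(text):
        """Return the set of themes whose keyword pattern matches one explanation."""
        text = text.lower()
        return {theme for theme, pattern in THEME_PATTERNS.items() if re.search(pattern, text)}

    def theme_prevalence(df):
        """Share of explanations mentioning each theme, split by preferred provider.

        Assumes columns "preferred_provider" and "explanation" (hypothetical names).
        """
        tags = df["explanation"].fillna("").map(tag_response)
        return pd.DataFrame({
            theme: tags.map(lambda t, th=theme: th in t).groupby(df["preferred_provider"]).mean()
            for theme in THEME_PATTERNS
        })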

Positive themes

What participants praised about the AI they preferred.

Theme · What it captures · Tavus-preferred (n=50) · Anam-preferred (n=30)
Natural / realistic · Praise of natural or lifelike behavior overall · 19 (38%) · 10 (33%)
Engaging / responsive · Praise of responsiveness or sustained engagement · 6 (12%) · 4 (13%)
Visual quality · Explicit mention of visual appearance · 6 (12%) · 8 (27%)
Voice / tone · Praise of voice or speaking style · 6 (12%) · 3 (10%)
Friendly / warm · Praise of warmth or friendliness · 6 (12%) · 2 (7%)
Story-building / creative · Described the AI as a narrative collaborator · 5 (10%) · 1 (3%)
Empathetic / understood · Described emotional awareness or feeling listened to · 4 (8%) · 5 (17%)

Negative feedback

Specific complaints about each provider, drawn from open-text explanations written by participants who preferred the other provider. Percentages reflect the share of those participants who cited each complaint type. Only 30 participants preferred Anam, so the n for "About Tavus" complaints is 30. Fifty preferred Tavus, so the n for "About Anam" complaints is 50.

Complaint type · What it captures · About Tavus (n=30) · About Anam (n=50)
Repetitive / glitchy · Said the AI repeated itself, looped, or had glitches · 2 (7%) · 6 (12%)
Confused / off-topic · Said the AI didn't follow the conversation or went off-topic · 0 (0%) · 3 (6%)
Uncanny / robotic · Said the AI felt robotic, unnatural, or uncanny · 2 (7%) · 3 (6%)

Four of the 30 Anam-preferred respondents (13%) cited at least one of these specific complaints about Tavus; 12 of the 50 Tavus-preferred respondents (24%) cited at least one about Anam.

Patterns

These patterns are directional rather than definitive. Keyword tagging cannot cleanly separate modalities. "Natural" and "realistic" (the most common theme for both providers) can describe visual presentation, conversational flow, or the overall feel of the interaction. The "Visual quality" theme captures only responses that explicitly named appearance, and likely undercounts visual mentions for participants who used "natural" or "realistic" instead.

Two signals from this open-text data are robust to that ambiguity and worth naming explicitly.

Tavus showed a clear lead on what we tagged as the story-building and creative theme. This category captured responses where participants explicitly described the AI as a narrative collaborator rather than a conversation partner: phrases like "built the story with me," "added unexpected twists," "kept the narrative going," "remembered details I'd mentioned earlier," or "had its own ideas about where the plot should go." The regex matched on terms including story, narrative, plot, creative, imaginative, built, collaborated, and came up with, scoped to participant explanations of why they preferred one provider over the other. The theme appeared in 10% of Tavus-preferred explanations versus 3% of Anam-preferred, a 3× margin. This signal is robust because the words it captures describe a specific behavior (co-construction of narrative content) rather than a modality of presentation, so it is not subject to the visual/conversational ambiguity that affects the broader "natural" and "realistic" categories.
Complaints about the losing provider were more diverse when Anam was the loser than when Tavus was. When participants who chose Tavus explained their pick, they cited a wider range of specific failures by Anam: repetition (12%), confused or off-topic responses (6%), and uncanny or robotic presentation (6%). When participants who chose Anam explained their pick, the equivalent complaints about Tavus appeared at lower and narrower rates. This breadth-of-complaint pattern suggests that Tavus losses were typically attributable to a single failure mode, while Anam losses spanned several. The qualitative coding cannot tell us which failure mode dominated for any individual session, but the cross-session pattern is consistent.

Beyond these two signals, this open-text data cannot tell us which provider wins on any specific dimension, such as visual quality or conversational quality. The keyword categories overlap too much for that kind of claim.

Operational metrics

Call duration and conversational turn density

Provider · Mean duration · Median · SD · Mean turns · Mean s/turn
Tavus · 142.7 s · 150.1 s · 22.7 s · 11.9 · 16.5 s
Anam · 129.4 s · 147.5 s · 27.1 s · 8.8 · 14.7 s

Tavus calls ran 13.3 seconds longer on average (paired Wilcoxon p<0.0001) and contained roughly 35% more conversational turns at a similar per-turn duration. This is a stylistic difference rather than a quality judgment: Tavus tends toward back-and-forth exchange, Anam toward longer single turns. Anam's higher duration variance reflects more short-outlier sessions where the conversation ended early.

Limitations

What this study does not establish

  • The study tested a single scenario. Story mode is one open-ended conversational scenario, and findings may not generalize to other interaction settings such as customer support, structured Q&A, education, or healthcare. The size and shape of the effects observed here should be assumed to be scenario-specific until replicated.
  • The sample size has limits. n=80 paired comparisons is sufficient to detect moderate effects on the headline preference question but underpowered for smaller effects. The felt-understood result (p=0.066) is the clearest example of a metric where additional data would help resolve the question.
  • Qualitative theme tagging was automated. Open-text themes were assigned via keyword regex rather than human coding. The tagging is reproducible but is less nuanced than coded analysis would be. In particular, keyword categories are not modality-pure: "natural" and "realistic" can describe visual presentation, conversational flow, or both, so the qualitative section should be read as directional rather than as a clean attribution of strengths to specific modalities.
  • This is a vendor-commissioned study. It was commissioned by Tavus. Even with full methodological transparency, vendor-commissioned research carries inherent confirmation-bias risk that fully independent evaluation does not. We encourage independent replication.
  • Configuration choices are not exhaustive. While we used each provider's documented recommended configuration, other configurations are possible. We did not test every combination of voice, brain, or avatar persona for either provider.

Raw data and reproducibility

Verify the analysis from the data

The analysis was conducted in Python using pandas for data preparation, scipy.stats for Wilcoxon and binomial tests, and matplotlib for chart generation. Reproducing the headline statistics from the raw CSV requires fewer than 50 lines of code.
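
As an illustration, the sketch below recovers the headline preference test, one paired Likert comparison, and an agreement rate. The file name and column names are placeholders for whatever schema the released CSV uses.

    # Sketch of reproducing the headline statistics from the raw CSV. File and column names
    # are placeholders; adjust them to the released file's actual schema.
    import pandas as pd
    from scipy.stats import wilcoxon, binomtest

    df = pd.read_csv("avatar_arena_sessions.csv")                     # hypothetical filename

    # Headline preference: exact binomial test against a 50/50 null.
    wins = (df["overall_preference"] == "tavus").sum()
    print(wins / len(df), binomtest(wins, n=len(df), p=0.5).pvalue)

    # One paired Likert item (empathy): Wilcoxon signed-rank on within-participant differences.
    diff = df["tavus_empathetic"] - df["anam_empathetic"]
    print(diff.mean(), diff.mean() / diff.std(ddof=1), wilcoxon(diff).pvalue)

    # Agreement rate: share of responses at +1 or +2.
    print((df["tavus_empathetic"] >= 1).mean(), (df["anam_empathetic"] >= 1).mean())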

Questions, replication attempts, or methodology feedback can be directed to the Avatar Arena team. We are particularly interested in independent replication of these findings.