Avatar Arena · Research Brief · May 2026

A paired evaluation of Tavus and Anam in real-world conversational interaction

In a paired evaluation of Tavus and Anam across 80 individual sessions of unscripted, real-world conversational interaction, Tavus was preferred 62.5% to 37.5%. Tavus won every head-to-head comparison question at p<0.05, and held statistically significant Likert advantages on six of seven measures of conversational and emotional quality.

Scenario: Story mode (open-ended conversation)
Sample: N=80 paired sessions
Disclosure: Commissioned by Tavus, with methodology, configurations, and analytic decisions documented in full to enable independent replication.

Key statistics

Overall preference: 62.5% vs 37.5% Anam (binomial p=0.033)
Head-to-head wins: 4 of 4, all p<0.05
Likert metrics significant: 6 of 7 (paired Wilcoxon, p<0.05)
Forgot-AI score: +0.19 vs −0.18, Tavus the only provider with a positive mean

On the question "which AI would you want to talk to again," participants chose Tavus 65% to 35%. This was the strongest single signal in the study, p=0.010.

This evaluation tested both providers in unscripted, open-ended conversation, the kind of interaction that maps most directly to real-world avatar deployment in companionship, education, coaching, and customer support. Across 80 paired sessions, participants rated Tavus higher on every measure of conversational and emotional quality, and chose Tavus on every comparative question.

The findings:

  • Tavus was preferred on all four head-to-head questions, with margins of 62.5% to 65.0% and binomial p-values between 0.010 and 0.033.
  • Tavus held statistically significant Likert advantages on six of seven metrics: empathetic, natural conversation, lifelike, felt-human, enjoyment, and forgot-AI. Wilcoxon p ranged from <0.001 to 0.037, with effect sizes (Cohen's d_z) between 0.24 and 0.43.
  • The seventh metric, felt-understood, trended in Tavus's favor (paired Δ +0.35) but did not clear conventional significance (p=0.066).
  • Tavus was the only provider with a positive mean score on "I occasionally forgot I was talking to an AI" (+0.19 vs −0.18).

Methodology

How the study was designed and run

Pre-registration

This study was not pre-registered. Survey items, scenario design, and analysis plan were fixed before the first submission was collected, and no items or analytic decisions were modified during data collection.

Scenario

The evaluation used story mode, a real-world conversational scenario in which the human and the AI work together to accomplish a shared objective: building a story. The format resembles an interactive Mad Lib. The AI acts as a storytelling partner. It prompts the participant for narrative pieces such as a character name, a setting, or a turning point, and weaves the participant's contributions into an evolving narrative. There is no fixed plot, no winning condition, and no time limit beyond the typical two-to-three-minute session length. Participants can drive the story, follow the AI's lead, or trade off as they prefer.

Story mode was selected because it tests the same capabilities that determine avatar performance in real-world deployment: sustaining unscripted engagement across multiple turns, carrying context forward across the conversation, responding emotionally and creatively rather than just answering questions, and taking initiative when the participant pauses without dominating the exchange. These four capabilities apply directly to the scenarios the conversational-avatar industry is being built toward, including companionship, coaching, education, customer support, and creative collaboration.

This scenario was deliberately chosen over more constrained alternatives (such as 20 Questions). Constrained tasks measure narrow capabilities like response time and turn-taking, but they tell us little about how an avatar handles the open-ended, emotionally aware interaction real users demand.

Survey instrument

After each call, participants rated seven Likert items on a scale from −2 (strongly disagree) to +2 (strongly agree).

Item · Statement
Enjoyed the experience · "I enjoyed talking to this AI human and found the overall experience positive."
Felt human (not avatar) · "The AI human felt like a human, not an avatar."
Forgot talking to AI · "I occasionally forgot I was talking to an AI."
Natural conversation · "I could talk to the AI human as I would in a natural conversation."
Behavior lifelike · "The AI human's behavior seemed natural and lifelike."
Felt understood · "I felt understood by the AI human."
Empathetic / emotionally aware · "The AI human was empathetic and emotionally aware."

After both calls, participants answered four forced-choice comparison questions plus one open-text question:

  1. Overall, which AI human did you prefer?
  2. Which one felt more like a real human?
  3. Which one was easier and more natural to talk to?
  4. Which one would you want to talk to again?
  5. (Open-text) In your own words, what made your preferred pick stand out?

How it works, directly from Avatar Arena

Each session has three phases. P1 and P2 are live conversations; P3 is a short comparison. Total time is about three minutes. Email is the only required field; no account or password is needed.

P1 · Live call with Avatar A (~90 s): An unscripted, one-to-one video call with the first anonymized avatar. The interface is the same regardless of vendor.
P2 · Live call with Avatar B (~90 s): The same call with a second anonymized avatar. The order of (A, B) is randomized per session.
P3 · Head-to-head questions (~60 s): Four pairwise questions: overall preference, more human, easier and more natural, would talk to again. Vendors are revealed only after submission.

Participants

Participants were recruited via Prolific, an online research panel widely used for academic and industry studies. A pre-defined quality-control protocol was established before data collection began and applied uniformly to every submission in the batch.

Each participant completed two consecutive video calls in blind mode. Provider assignment was hidden during both calls and throughout the post-call surveys, and provider identity was disclosed only after the participant had submitted their final answers. This blinding was technical, not just procedural: provider names did not appear in the call interface, the survey UI, or any participant-facing text during the session.

Submissions were excluded from the analytic set if they met any of four pre-specified criteria, all of which were finalized in writing before the first submission was collected and were not modified during data collection:

  1. Incomplete sessions. Either the Tavus call or the Anam call did not run to completion. A call was treated as complete only if it reached the natural end of the storytelling exchange or the participant explicitly ended the session through the interface, and only if the participant submitted post-call ratings for both providers.
  2. Technical failures. Connection loss exceeding five seconds, audio dropouts that prevented the participant from hearing one or more of the AI's turns, or video freezes lasting more than five seconds. Sessions affected by any of these were flagged in real time by the platform telemetry and excluded regardless of the participant's eventual ratings.
  3. Off-task responses. Post-call free-text answers that showed no engagement with the story-mode scenario, evaluated against a pre-specified rubric requiring at least one substantive reference to the storytelling task, the AI's behavior during the call, or the participant's experience of the conversation. Responses consisting only of single-word answers, gibberish, or content unrelated to the call were excluded.
  4. Repeat attempts. Submissions from participants who had previously completed the study were excluded to prevent the same individual from contributing multiple data points. Repeat attempts were identified by matching Prolific participant ID and IP address. Only the first complete submission from any given participant was retained, and subsequent attempts were rejected automatically before any session data was recorded.

The protocol was authored before recruitment opened. The exclusion criteria were not adjusted during the batch, and no submissions were retroactively reclassified after initial review. Of 81 submissions in the batch, one was excluded under the off-task criterion, and the remaining 80 form the analytic set. The retention rate of 98.8% is consistent with what is observed on Prolific for paid sessions of this length and complexity.
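
For concreteness, the sketch below shows how these rules could be applied to an exported submission log. The file name and column names (complete, technical_failure, off_task, prolific_id, ip_address, submitted_at) are hypothetical, and in the actual study most of these checks ran in real time on the platform rather than as a post-hoc script.

    # Illustrative application of the pre-specified exclusion filter to a submission log.
    # Column names are hypothetical; the platform rejected repeat attempts before recording data.
    import pandas as pd

    subs = pd.read_csv("submissions.csv")            # hypothetical export: one row per submission attempt

    analytic = (
        subs[subs["complete"] & ~subs["technical_failure"] & ~subs["off_task"]]
        .sort_values("submitted_at")
        .drop_duplicates(subset=["prolific_id"], keep="first")   # repeat attempts: first per Prolific ID
        .drop_duplicates(subset=["ip_address"], keep="first")    # and first per IP address
    )
    print(len(subs), "submitted,", len(analytic), "retained")    # 81 submitted, 80 retained in this study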

Statistical methods

Paired Likert comparisons used the Wilcoxon signed-rank test (two-sided) on within-participant differences. Effect size is reported as Cohen's d_z = mean paired difference / SD of paired differences. Forced-choice comparisons used the exact binomial test against a null hypothesis of a 50/50 split. The operational duration comparison also used a paired Wilcoxon test. Significance threshold α = 0.05, two-sided. No multiple-comparison correction was applied; readers interested in family-wise correction can apply Holm or Bonferroni to the seven Likert tests using the p-values reported below.
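
These tests are straightforward to express with scipy.stats, the stack named in the reproducibility section below. The sketch that follows is illustrative rather than the study's exact analysis script; the example count (50 of 80) comes from the headline preference result reported in this brief.

    # Minimal sketch of the statistical tests described above. Not the study's exact script.
    import numpy as np
    from scipy.stats import wilcoxon, binomtest

    def likert_comparison(tavus_ratings, anam_ratings):
        """Paired Wilcoxon signed-rank test plus Cohen's d_z for one Likert item.

        Both arguments are equal-length arrays of per-participant ratings on the -2..+2 scale.
        """
        diff = np.asarray(tavus_ratings, float) - np.asarray(anam_ratings, float)
        _, p = wilcoxon(diff, alternative="two-sided")    # signed-rank on within-participant differences
        d_z = diff.mean() / diff.std(ddof=1)              # mean paired difference / SD of paired differences
        return diff.mean(), d_z, p

    def forced_choice_test(wins, n):
        """Exact two-sided binomial test of a head-to-head count against a 50/50 null."""
        return binomtest(wins, n=n, p=0.5, alternative="two-sided").pvalue

    # Headline preference question: 50 of 80 participants chose Tavus.
    print(forced_choice_test(50, 80))    # approx. 0.033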

Models and configurations

Both providers tested on their current production models

Each provider was configured per their own documented recommended settings. Configurations were not custom-tuned by Tavus to optimize either provider's performance.

Tavus · Phoenix-4 Pro

  • Avatar model: Phoenix-4 Pro, Tavus's most expressive replica tier, supporting full-face animation and emotional expressions.
  • Pipeline mode: Full pipeline, which is Tavus's documented default and recommended end-to-end configuration.
  • Perception layer: Raven-1, Tavus's contextual perception model for visual and audio understanding.
  • Conversational flow: Sparrow-1 turn detection model with turn_taking_patience: medium, replica_interruptibility: medium, voice_isolation: near.
  • LLM and TTS: Tavus default low-latency configuration.

Anam · Cara-3

  • Avatar model: Cara-3, Anam's current production model and the model Anam markets as #1 on third-party benchmarks.
  • Persona configuration: standard Anam persona configuration with name, avatarId, voiceId, and llmId set per Anam's documented quickstart pattern.
  • Brain / LLM: Anam's recommended brain model.
  • Voice: Anam's recommended default voice.
  • System prompt: matched in tone and length to the Tavus system prompt to control for prompt-design effects.

Voice and avatar persona were held to a similar gender, age, and background register across providers.
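
As a compact reference, the settings above can be summarized as plain Python dictionaries. These are illustrative summaries that mirror the labels used in this section; they are not the literal API payload or SDK call shape for either vendor.

    # Illustrative summary of the documented configurations above; field names follow the
    # labels used in this section and are not a literal API request for either vendor.
    TAVUS_CONFIG = {
        "avatar_model": "Phoenix-4 Pro",
        "pipeline_mode": "full",                     # Tavus's documented default end-to-end pipeline
        "perception_layer": "Raven-1",
        "turn_detection": "Sparrow-1",
        "turn_taking_patience": "medium",
        "replica_interruptibility": "medium",
        "voice_isolation": "near",
        "llm_tts": "Tavus default low-latency configuration",
    }

    ANAM_CONFIG = {
        "avatar_model": "Cara-3",
        # name, avatarId, voiceId, and llmId were set per Anam's documented quickstart pattern.
        "persona": {"name": "...", "avatarId": "...", "voiceId": "...", "llmId": "..."},
        "brain_llm": "Anam's recommended brain model",
        "voice": "Anam's recommended default voice",
        "system_prompt": "matched in tone and length to the Tavus system prompt",
    }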

Results

Per-call Likert ratings, agreement rates, and head-to-head preference

Mean Likert ratings · all seven questions
Scale: −2 (strongly disagree) to +2 (strongly agree)
Metric · Tavus · Anam
Enjoyed the experience · +0.97 · +0.50
Felt human (not avatar) · +0.51 · +0.01
Forgot talking to AI · +0.19 · −0.17
Natural conversation · +0.91 · +0.34
Behavior lifelike · +0.61 · +0.05
Felt understood (not significant) · +0.95 · +0.60
Empathetic / emotionally aware · +0.91 · +0.25

Tavus leads numerically on every metric. The largest gaps are on empathetic (+0.66 paired Δ), natural conversation (+0.58), and behavior lifelike (+0.56). Forgot-AI is the only metric on which the two providers differ in sign: Tavus is the only provider with a positive mean on it.

Agreement rates · share of responses at +1 or +2
Percent of participants who agreed or strongly agreed
Metric · Tavus · Anam
Enjoyed the experience · 74% · 58%
Felt human (not avatar) · 61% · 46%
Forgot talking to AI · 48% · 38%
Natural conversation · 76% · 54%
Behavior lifelike · 65% · 49%
Felt understood (not significant) · 78% · 61%
Empathetic / emotionally aware · 78% · 51%

Tavus's agreement rates are above 60% on every metric except forgot-AI. Gaps to Anam are 10 to 26 percentage points. The largest single-question agreement gap is empathetic (77.5% vs 51.2%).

Head-to-head preference · all four comparison questions
Share of participants preferring each provider
Question · Tavus · Anam
Overall preferred · 62.5% · 37.5%
Felt more human · 63.7% · 36.2%
Easier and more natural · 62.5% · 37.5%
Want to talk again · 65.0% · 35.0%

Tavus wins all four head-to-head questions at 62.5–65.0%. All four clear conventional significance (binomial test, two-sided, vs 50/50). "Want to talk again" produces the strongest signal at 65/35, p=0.010.

Statistical detail — Likert

Likert metric · Tavus · Anam · Paired Δ · d_z · Wilcoxon p
Empathetic / emotionally aware · +0.91 · +0.25 · +0.66 · 0.40 · 0.001
Natural conversation · +0.91 · +0.34 · +0.58 · 0.43 · <0.001
Behavior lifelike · +0.61 · +0.05 · +0.56 · 0.35 · 0.004
Felt human (not avatar) · +0.51 · +0.01 · +0.50 · 0.33 · 0.005
Enjoyed the experience · +0.98 · +0.50 · +0.48 · 0.33 · 0.005
Forgot talking to AI · +0.19 · −0.18 · +0.36 · 0.24 · 0.037
Felt understood (not significant) · +0.95 · +0.60 · +0.35 · 0.21 · 0.066

Statistical detail — Head-to-head

Head-to-head question · Tavus · Anam · n · Binomial p
Overall, which did you prefer? · 50 (62.5%) · 30 (37.5%) · 80 · 0.033
Which felt more like a real human? · 51 (63.7%) · 29 (36.2%) · 80 · 0.018
Which was easier and more natural to talk to? · 50 (62.5%) · 30 (37.5%) · 80 · 0.033
Which would you want to talk to again? · 52 (65.0%) · 28 (35.0%) · 80 · 0.010

Qualitative findings

Open-text explanations of preferred picks

Each participant wrote a 10–500 character explanation of the AI they preferred (mean length 178 characters). We tagged each response with theme keywords and tabulated theme prevalence. Themes are split into two tables below: positive themes (what participants praised about their preferred pick) and negative feedback (specific complaints about the provider they did not pick).
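
The tagging step can be sketched as a small keyword matcher. The story-building term list below mirrors the terms named in the Patterns section further down; the other patterns, and the column names, are illustrative rather than the study's exact tagger.

    # Illustrative keyword tagger for the open-text explanations. Story-building terms follow
    # those named in the Patterns section; other term lists and column names are illustrative.
    import re
    import pandas as pd

    THEME_PATTERNS = {
        "story_building_creative": r"\b(stor(y|ies)|narrative|plot|creative|imaginative|built|collaborat\w*|came up with)\b",
        "natural_realistic": r"\b(natural|realistic|lifelike)\b",
        "visual_quality": r"\b(look(s|ed)?|appearance|face|visuals?)\b",
    }

    def tag_response(text):
        """Return the set of themes whose keyword pattern matches one explanation."""
        text = text.lower()
        return {theme for theme, pattern in THEME_PATTERNS.items() if re.search(pattern, text)}

    def theme_prevalence(df):
        """Share of explanations mentioning each theme, split by preferred provider.

        Assumes columns "preferred_provider" and "explanation" (hypothetical names).
        """
        tags = df["explanation"].fillna("").map(tag_response)
        return pd.DataFrame({
            theme: tags.map(lambda t, th=theme: th in t).groupby(df["preferred_provider"]).mean()
            for theme in THEME_PATTERNS
        })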

Positive themes

What participants praised about the AI they preferred.

Theme · What it captures · Tavus-preferred (n=50) · Anam-preferred (n=30)
Natural / realistic · Praise of natural or lifelike behavior overall · 19 (38%) · 10 (33%)
Engaging / responsive · Praise of responsiveness or sustained engagement · 6 (12%) · 4 (13%)
Visual quality · Explicit mention of visual appearance · 6 (12%) · 8 (27%)
Voice / tone · Praise of voice or speaking style · 6 (12%) · 3 (10%)
Friendly / warm · Praise of warmth or friendliness · 6 (12%) · 2 (7%)
Story-building / creative · Described the AI as a narrative collaborator · 5 (10%) · 1 (3%)
Empathetic / understood · Described emotional awareness or feeling listened to · 4 (8%) · 5 (17%)

Negative feedback

Specific complaints about each provider, drawn from open-text explanations written by participants who preferred the other provider. Percentages reflect the share of those participants who cited each complaint type. Only 30 participants preferred Anam, so the n for "About Tavus" complaints is 30. Fifty preferred Tavus, so the n for "About Anam" complaints is 50.

Complaint type · What it captures · About Tavus (n=30) · About Anam (n=50)
Repetitive / glitchy · Said the AI repeated itself, looped, or had glitches · 2 (7%) · 6 (12%)
Confused / off-topic · Said the AI didn't follow the conversation or went off-topic · 0 (0%) · 3 (6%)
Uncanny / robotic · Said the AI felt robotic, unnatural, or uncanny · 2 (7%) · 3 (6%)

Four of the 30 Anam-preferred respondents (13%) cited at least one of these specific complaints about Tavus; 12 of the 50 Tavus-preferred respondents (24%) cited at least one about Anam.

Patterns

These patterns are directional rather than definitive. Keyword tagging cannot cleanly separate modalities. "Natural" and "realistic" (the most common theme for both providers) can describe visual presentation, conversational flow, or the overall feel of the interaction. The "Visual quality" theme captures only responses that explicitly named appearance, and likely undercounts visual mentions for participants who used "natural" or "realistic" instead.

Two signals from this open-text data are robust to that ambiguity and worth naming explicitly.

Tavus showed a clear lead on what we tagged as the story-building and creative theme. This category captured responses where participants explicitly described the AI as a narrative collaborator rather than a conversation partner: phrases like "built the story with me," "added unexpected twists," "kept the narrative going," "remembered details I'd mentioned earlier," or "had its own ideas about where the plot should go." The regex matched on terms including story, narrative, plot, creative, imaginative, built, collaborated, and came up with, scoped to participant explanations of why they preferred one provider over the other. The theme appeared in 10% of Tavus-preferred explanations versus 3% of Anam-preferred, a 3× margin. This signal is robust because the words it captures describe a specific behavior (co-construction of narrative content) rather than a modality of presentation, so it is not subject to the visual/conversational ambiguity that affects the broader "natural" and "realistic" categories.
Complaints about the losing provider were more diverse when Anam was the loser than when Tavus was. When participants who chose Tavus explained their pick, they cited a wider range of specific failures by Anam: repetition (12%), confused or off-topic responses (6%), and uncanny or robotic presentation (6%). When participants who chose Anam explained their pick, the equivalent complaints about Tavus appeared at lower and narrower rates. This breadth-of-complaint pattern suggests that Tavus losses were typically attributable to a single failure mode, while Anam losses spanned several. The qualitative coding cannot tell us which failure mode dominated for any individual session, but the cross-session pattern is consistent.

Beyond these two signals, this open-text data cannot tell us which provider wins on any specific dimension, such as visual quality or conversational quality. The keyword categories overlap too much for that kind of claim.

Operational metrics

Call duration and conversational turn density

Provider · Mean duration · Median · SD · Mean turns · Mean s/turn
Tavus · 142.7 s · 150.1 s · 22.7 s · 11.9 · 16.5 s
Anam · 129.4 s · 147.5 s · 27.1 s · 8.8 · 14.7 s

Tavus calls ran 13.3 seconds longer on average (paired Wilcoxon p<0.0001) and contained roughly 35% more conversational turns at a similar per-turn duration. This is a stylistic difference rather than a quality judgment: Tavus tends toward back-and-forth exchange, Anam toward longer single turns. Anam's higher duration variance reflects more short-outlier sessions where the conversation ended early.

Limitations

What this study does not establish

  • The study tested a single scenario. Story mode is one open-ended conversational scenario, and findings may not generalize to other interaction settings such as customer support, structured Q&A, education, or healthcare. The size and shape of the effects observed here should be assumed to be scenario-specific until replicated.
  • The sample size has limits. n=80 paired comparisons is sufficient to detect moderate effects on the headline preference question but underpowered for smaller effects. The felt-understood result (p=0.066) is the clearest example of a metric where additional data would help resolve the question.
  • Qualitative theme tagging was automated. Open-text themes were assigned via keyword regex rather than human coding. The tagging is reproducible but is less nuanced than coded analysis would be. In particular, keyword categories are not modality-pure: "natural" and "realistic" can describe visual presentation, conversational flow, or both, so the qualitative section should be read as directional rather than as a clean attribution of strengths to specific modalities.
  • This is a vendor-commissioned study. It was commissioned by Tavus. Even with full methodological transparency, vendor-commissioned research carries inherent confirmation-bias risk that fully independent evaluation does not. We encourage independent replication.
  • Configuration choices are not exhaustive. While we used each provider's documented recommended configuration, other configurations are possible. We did not test every combination of voice, brain, or avatar persona for either provider.

Raw data and reproducibility

Verify the analysis from the data

The analysis was conducted in Python using pandas for data preparation, scipy.stats for Wilcoxon and binomial tests, and matplotlib for chart generation. Reproducing the headline statistics from the raw CSV requires fewer than 50 lines of code.
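
As an illustration, the sketch below recovers the headline preference test, one paired Likert comparison, and an agreement rate. The file name and column names are placeholders for whatever schema the released CSV uses.

    # Sketch of reproducing the headline statistics from the raw CSV. File and column names
    # are placeholders; adjust them to the released file's actual schema.
    import pandas as pd
    from scipy.stats import wilcoxon, binomtest

    df = pd.read_csv("avatar_arena_sessions.csv")                     # hypothetical filename

    # Headline preference: exact binomial test against a 50/50 null.
    wins = (df["overall_preference"] == "tavus").sum()
    print(wins / len(df), binomtest(wins, n=len(df), p=0.5).pvalue)

    # One paired Likert item (empathy): Wilcoxon signed-rank on within-participant differences.
    diff = df["tavus_empathetic"] - df["anam_empathetic"]
    print(diff.mean(), diff.mean() / diff.std(ddof=1), wilcoxon(diff).pvalue)

    # Agreement rate: share of responses at +1 or +2.
    print((df["tavus_empathetic"] >= 1).mean(), (df["anam_empathetic"] >= 1).mean())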

Questions, replication attempts, or methodology feedback can be directed to the Avatar Arena team. We are particularly interested in independent replication of these findings.