A paired evaluation of Tavus and Anam in real-world conversational interaction
Written by Jesse Rowe · May 6, 2026
Avatar Arena · Research Brief · May 2026
In a paired evaluation of Tavus and Anam across 80 paired sessions of unscripted, real-world conversational interaction, Tavus was preferred 62.5% to 37.5%. Tavus won every head-to-head comparison question at p<0.05, and held statistically significant Likert advantages on six of seven measures of conversational and emotional quality.
On the question "which AI would you want to talk to again," participants chose Tavus 65% to 35%. This was the strongest single signal in the study, p=0.010.
This evaluation tested both providers in unscripted, open-ended conversation, the kind of interaction that maps most directly to real-world avatar deployment in companionship, education, coaching, and customer support. Across 80 paired sessions, participants rated Tavus higher on every measure of conversational and emotional quality, and chose Tavus on every comparative question.
The full findings are presented below.
This study was not pre-registered. However, survey items, scenario design, and the analysis plan were fixed before the first submission was collected, and no items or analytic decisions were modified during data collection.
The evaluation used story mode, a real-world conversational scenario in which the human and the AI work together to accomplish a shared objective: building a story. The format resembles an interactive Mad Lib. The AI acts as a storytelling partner. It prompts the participant for narrative pieces such as a character name, a setting, or a turning point, and weaves the participant's contributions into an evolving narrative. There is no fixed plot, no winning condition, and no time limit beyond the typical two-to-three-minute session length. Participants can drive the story, follow the AI's lead, or trade off as they prefer.
Story mode was selected because it tests the same capabilities that determine avatar performance in real-world deployment: sustaining unscripted engagement across multiple turns, carrying context forward across the conversation, responding emotionally and creatively rather than just answering questions, and taking initiative when the participant pauses without dominating the exchange. These four capabilities apply directly to the scenarios the conversational-avatar industry is being built toward, including companionship, coaching, education, customer support, and creative collaboration.
This scenario was deliberately chosen over more constrained alternatives (such as 20 Questions). Constrained tasks measure narrow capabilities like response time and turn-taking, but they tell us little about how an avatar handles the open-ended, emotionally aware interaction that real users demand.
After each call, participants rated seven Likert items on a scale from −2 (strongly disagree) to +2 (strongly agree).
| Item | Statement |
|---|---|
| Enjoyed the experience | "I enjoyed talking to this AI human and found the overall experience positive." |
| Felt human (not avatar) | "The AI human felt like a human, not an avatar." |
| Forgot talking to AI | "I occasionally forgot I was talking to an AI." |
| Natural conversation | "I could talk to the AI human as I would in a natural conversation." |
| Behavior lifelike | "The AI human's behavior seemed natural and lifelike." |
| Felt understood | "I felt understood by the AI human." |
| Empathetic / emotionally aware | "The AI human was empathetic and emotionally aware." |
After both calls, participants answered the four forced-choice comparison questions reported in the head-to-head results below, plus one open-text question explaining which AI they preferred and why.
Each session has three phases: P1 and P2 are live conversations, and P3 is a short comparison survey. Total time is about three minutes. Email is the only field collected; no account or password is required.
Participants were recruited via Prolific, an online research panel widely used for academic and industry studies. A pre-defined quality-control protocol was established before data collection began and applied uniformly to every submission in the batch.
Each participant completed two consecutive video calls in blind mode. Provider assignment was hidden during both calls and throughout the post-call surveys, and provider identity was disclosed only after the participant had submitted their final answers. This blinding was technical, not just procedural: provider names did not appear in the call interface, the survey UI, or any participant-facing text during the session.
Submissions were excluded from the analytic set if they met any of four pre-specified criteria, all of which were finalized in writing before the first submission was collected and were not modified during data collection.
The protocol was authored before recruitment opened. The exclusion criteria were not adjusted during the batch, and no submissions were retroactively reclassified after initial review. Of 81 submissions in the batch, one was excluded under the off-task criterion, and the remaining 80 form the analytic set. The retention rate of 98.8% is consistent with what is observed on Prolific for paid sessions of this length and complexity.
Paired Likert comparisons used the Wilcoxon signed-rank test (two-sided) on within-participant differences. Effect size is reported as Cohen's d_z = mean paired difference / SD of paired differences. Forced-choice comparisons used the exact binomial test against a null hypothesis of 50/50 split. Operational duration comparison used paired Wilcoxon. Significance threshold α = 0.05, two-sided. No multiple-comparison correction was applied; readers interested in family-wise correction can apply Holm or Bonferroni to the seven Likert tests using the p-values reported below.
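For readers who want to apply that correction, here is a minimal sketch in Python. It takes the seven Wilcoxon p-values from the Likert results table below, entering the value reported as <0.001 as 0.001, which can only make the adjusted values more conservative.

```python
# Holm step-down correction applied to the seven Likert Wilcoxon p-values
# reported in the results table. The value reported as "<0.001" is entered
# as 0.001, which only makes the adjusted p-values more conservative.
p_values = {
    "Empathetic / emotionally aware": 0.001,
    "Natural conversation": 0.001,   # reported as <0.001
    "Behavior lifelike": 0.004,
    "Felt human (not avatar)": 0.005,
    "Enjoyed the experience": 0.005,
    "Forgot talking to AI": 0.037,
    "Felt understood": 0.066,
}

def holm_adjust(pvals: dict[str, float]) -> dict[str, float]:
    """Return Holm-adjusted p-values, keyed by the original item names."""
    items = sorted(pvals.items(), key=lambda kv: kv[1])
    m = len(items)
    adjusted, running_max = {}, 0.0
    for rank, (name, p) in enumerate(items):
        # Step-down factor is (m - rank); adjusted values are monotone.
        running_max = max(running_max, (m - rank) * p)
        adjusted[name] = min(running_max, 1.0)
    return adjusted

for name, adj in holm_adjust(p_values).items():
    print(f"{name}: Holm-adjusted p = {adj:.3f}")
```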
Each provider was configured per its own documented recommended settings. Configurations were not custom-tuned by Tavus to optimize either provider's performance.
Voice and avatar persona were held to a similar gender, age, and background register across providers.
Tavus leads numerically on every metric. The largest gaps are on empathetic (+0.66 paired Δ), natural conversation (+0.58), and behavior lifelike (+0.56). Forgot-AI is the only metric where either provider dips into negative territory.
Tavus's agreement rates are above 60% on every metric except forgot-AI. Gaps to Anam are 10 to 26 percentage points. The largest single-question agreement gap is empathetic (77.5% vs 51.2%).
Tavus wins all four head-to-head questions at 62.5–65.0%. All four clear conventional significance (binomial test, two-sided, vs 50/50). "Want to talk again" produces the strongest signal at 65/35, p=0.010.
| Likert metric | Tavus | Anam | Paired Δ | d_z | Wilcoxon p |
|---|---|---|---|---|---|
| Empathetic / emotionally aware | +0.91 | +0.25 | +0.66 | 0.40 | 0.001 |
| Natural conversation | +0.91 | +0.34 | +0.58 | 0.43 | <0.001 |
| Behavior lifelike | +0.61 | +0.05 | +0.56 | 0.35 | 0.004 |
| Felt human (not avatar) | +0.51 | +0.01 | +0.50 | 0.33 | 0.005 |
| Enjoyed the experience | +0.98 | +0.50 | +0.48 | 0.33 | 0.005 |
| Forgot talking to AI | +0.19 | −0.18 | +0.36 | 0.24 | 0.037 |
| Felt understood | +0.95 | +0.60 | +0.35 | 0.21 | 0.066 (n.s.) |
| Head-to-head question | Tavus | Anam | n | Binomial p |
|---|---|---|---|---|
| Overall, which did you prefer? | 50 (62.5%) | 30 (37.5%) | 80 | 0.033 |
| Which felt more like a real human? | 51 (63.7%) | 29 (36.2%) | 80 | 0.018 |
| Which was easier and more natural to talk to? | 50 (62.5%) | 30 (37.5%) | 80 | 0.033 |
| Which would you want to talk to again? | 52 (65.0%) | 28 (35.0%) | 80 | 0.010 |
Each participant wrote a 10–500 character explanation of why they preferred the AI they chose (mean length 178 characters). We tagged each response with theme keywords and tabulated theme prevalence. Themes are split into two tables below: positive themes (what participants praised about their preferred pick) and negative feedback (specific complaints about the provider they did not pick).
What participants praised about the AI they preferred.
| Theme | What it captures | Tavus-preferred (n=50) | Anam-preferred (n=30) |
|---|---|---|---|
| Natural / realistic | Praise of natural or lifelike behavior overall | 19 (38%) | 10 (33%) |
| Engaging / responsive | Praise of responsiveness or sustained engagement | 6 (12%) | 4 (13%) |
| Visual quality | Explicit mention of visual appearance | 6 (12%) | 8 (27%) |
| Voice / tone | Praise of voice or speaking style | 6 (12%) | 3 (10%) |
| Friendly / warm | Praise of warmth or friendliness | 6 (12%) | 2 (7%) |
| Story-building / creative | Described AI as narrative collaborator | 5 (10%) | 1 (3%) |
| Empathetic / understood | Described emotional awareness or feeling listened to | 4 (8%) | 5 (17%) |
Specific complaints about each provider, drawn from open-text explanations written by participants who preferred the other provider. Percentages reflect the share of those participants who cited each complaint type. Only 30 participants preferred Anam, so the n for "About Tavus" complaints is 30. Fifty preferred Tavus, so the n for "About Anam" complaints is 50.
| Complaint type | What it captures | About Tavus (n=30) | About Anam (n=50) |
|---|---|---|---|
| Repetitive / glitchy | Said the AI repeated itself, looped, or had glitches | 2 (7%) | 6 (12%) |
| Confused / off-topic | Said the AI didn't follow the conversation or went off-topic | 0 (0%) | 3 (6%) |
| Uncanny / robotic | Said the AI felt robotic, unnatural, or uncanny | 2 (7%) | 3 (6%) |
4 of 30 Anam-preferred respondents (13%) cited at least one of these specific complaints about Tavus; 12 of 50 Tavus-preferred respondents (24%) cited at least one about Anam.
These patterns are directional rather than definitive. Keyword tagging cannot cleanly separate modalities. "Natural" and "realistic" (the most common theme for both providers) can describe visual presentation, conversational flow, or the overall feel of the interaction. The "Visual quality" theme captures only responses that explicitly named appearance, and likely undercounts visual mentions for participants who used "natural" or "realistic" instead.
Two signals from the open-text data are robust to that ambiguity and worth naming explicitly.
Beyond these two signals, this open-text data cannot tell us which provider wins on any specific dimension, such as visual quality or conversational quality. The keyword categories overlap too much for that kind of claim.
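To illustrate the tagging approach described above, the sketch below shows one way keyword-based theme assignment can be implemented. The keyword lists here are hypothetical examples chosen for illustration, not the exact lists used in this analysis.

```python
import re

# Hypothetical keyword lists for two of the themes described above; the
# actual lists used in the study are not reproduced here.
THEME_KEYWORDS = {
    "Natural / realistic": ["natural", "realistic", "lifelike", "real person"],
    "Voice / tone": ["voice", "tone", "sounded"],
}

def tag_themes(response: str) -> set[str]:
    """Return the set of themes whose keywords appear in a free-text response."""
    text = response.lower()
    return {
        theme
        for theme, keywords in THEME_KEYWORDS.items()
        if any(re.search(rf"\b{re.escape(kw)}\b", text) for kw in keywords)
    }

# Example: tabulate theme prevalence over (preferred_provider, text) pairs.
responses = [
    ("Tavus", "The conversation felt natural and the voice was warm."),
    ("Anam", "Very realistic face, although the tone was a bit flat."),
]
counts = {}
for provider, text in responses:
    for theme in tag_themes(text):
        counts[(provider, theme)] = counts.get((provider, theme), 0) + 1
print(counts)
```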
| Provider | Mean duration | Median | SD | Mean turns | Mean s/turn |
|---|---|---|---|---|---|
| Tavus | 142.7s | 150.1s | 22.7s | 11.9 | 16.5s |
| Anam | 129.4s | 147.5s | 27.1s | 8.8 | 14.7s |
Tavus calls ran 13.3 seconds longer on average (paired Wilcoxon p<0.0001) and contained roughly 35% more conversational turns at a similar per-turn duration. This is a stylistic difference rather than a quality judgment: Tavus tends toward back-and-forth exchange, Anam toward longer single turns. Anam's higher duration variance reflects more short-outlier sessions where the conversation ended early.
The analysis was conducted in Python using pandas for data preparation, scipy.stats for Wilcoxon and binomial tests, and matplotlib for chart generation. Reproducing the headline statistics from the raw CSV requires fewer than 50 lines of code.
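As a minimal sketch of that reproduction, the code below assumes a long-format CSV with one row per call and illustrative column names (participant_id, provider, one column per Likert item, and an overall_preference column recorded once per session); the actual schema of the raw data may differ.

```python
import pandas as pd
from scipy.stats import wilcoxon, binomtest

# Illustrative schema: one row per call, identified by participant_id and
# provider ("Tavus" or "Anam"), with the seven Likert scores (-2..+2) and
# the participant's overall head-to-head preference.
df = pd.read_csv("avatar_arena_sessions.csv")  # hypothetical filename

LIKERT_ITEMS = [
    "enjoyed", "felt_human", "forgot_ai", "natural_conversation",
    "behavior_lifelike", "felt_understood", "empathetic",
]

# Pivot to one row per participant, with Tavus and Anam scores side by side.
wide = df.pivot(index="participant_id", columns="provider", values=LIKERT_ITEMS)

for item in LIKERT_ITEMS:
    diff = (wide[(item, "Tavus")] - wide[(item, "Anam")]).dropna()
    stat, p = wilcoxon(diff, alternative="two-sided")   # paired signed-rank test
    d_z = diff.mean() / diff.std(ddof=1)                # Cohen's d_z
    print(f"{item}: paired delta={diff.mean():+.2f}, d_z={d_z:.2f}, p={p:.3f}")

# Forced-choice questions: exact binomial test against a 50/50 null.
prefs = df.drop_duplicates("participant_id")["overall_preference"]
tavus_wins = int((prefs == "Tavus").sum())
result = binomtest(tavus_wins, n=len(prefs), p=0.5, alternative="two-sided")
print(f"Overall preference: {tavus_wins}/{len(prefs)} for Tavus, p={result.pvalue:.3f}")
```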
Questions, replication attempts, or methodology feedback can be directed to the Avatar Arena team. We are particularly interested in independent replication of these findings.
This study was commissioned by Tavus, the company behind Phoenix-4. Methodology, configurations, and statistical methods are documented in full to enable independent replication.
Avatar Arena is a benchmarking program for conversational AI avatars, focused on rigorous and reproducible evaluation. Studies are published when methodology and findings are robust enough for external readers to evaluate independently.
Phoenix-4 is a product of Tavus. Cara-3 is a product of Anam. References to Cara-3 or any third-party benchmark are factual and not endorsed by the cited parties. © 2026 Tavus.