Conversational AI for finance: building advisor-quality conversations at scale
Financial advice has always been a trust problem. People don't act on information from someone they don't trust, no matter how good the information is. And trust, in financial conversations, forms face to face: through eye contact, attentive expression, and the visible proof that someone is paying attention to their specific situation.
When someone brings their retirement savings, their children's college fund, or the proceeds from a business exit to a conversation, those signals are what they're reading.
It's exactly this kind of conversation that resists automation, not because the information is complex, but because the trust is personal.
That distinction has shaped what financial firms have and haven't been able to automate. The transactional conversations, the ones that don't depend on trust, are already handled. What remains are the ones where presence determines whether the advice lands and whether the client comes back. Real-time conversational video AI is the infrastructure that extends advisor-quality presence to those conversations at scale, without a human on the other end of every call.
Financial institutions run two fundamentally different kinds of client conversations, and only one has been automated.
The commodity tier covers balance inquiries, rate lookups, transaction confirmations, appointment scheduling, and FAQ support. Text chatbots and voice agents handle this volume well. Kasisto, LivePerson, SoundHound's Amelia, and dozens of other platforms now field the majority of these interactions across banks.
Then there are the conversations that still require a human read on the client: portfolio reviews after a volatile quarter, onboarding after a life event like an inheritance, walking a first-time borrower through mortgage terms, the check-in after a market drop.
In each of these, a client might signal concern with a pause or a shift in vocal tone before ever articulating it. A first-time investor nods along while actually confused. What the client doesn't say is often more important than what they do.
According to the 2025 Investor Engagement Survey from Logica Research and CapIntel, 61% of clients would terminate a relationship with their advisor over broken trust, ranking it above poor performance relative to expectations (54%). A YCharts survey found that three out of four clients either switched or considered switching advisors in 2023, with poor communication linked to declining confidence. Every one of those findings points back to presence.
Text reduces a conversation to its words. A client who types "that looks fine" is indistinguishable from one who means it. Everything else is gone: vocal hesitation, a pause before answering, a shift in body language.
Voice recovers tone and pacing. A client who says "that looks fine" in a slowing cadence is a different signal than one who says it with energy. But voice still loses the visual channel entirely. The advisor can't see confusion forming and the client can't register the nonverbal cues that build trust: eye contact, nodding, the expression that says "I'm with you."
The ceiling for text and voice in financial services is a medium problem. No improvement to language models or speech recognition will give a voice agent the ability to see that a client's brow has furrowed.
Consider a client reviewing her portfolio who says "that allocation looks fine" while her vocal cadence slows and she begins asking about withdrawal flexibility. A voice agent hears the verbal confirmation and advances. An AI Persona for financial services with real-time video perception catches the gap between the words and the delivery and holds the floor open. The client surfaces a family situation she hadn't mentioned. The advice that follows is different, and so is the outcome.
Advisory conversations carry a per-interaction labor cost. A wealth management firm running 5,000 client conversations per month at an average advisor cost of $150 per hour is spending real money on every portfolio review, every onboarding call, every post-market check-in. The conversations that don't happen, the clients who sit in a phone queue or get a chatbot when they needed an advisor, represent a retention risk.
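The cost structure above is easy to make concrete. A back-of-envelope sketch, using the $150/hour rate and 5,000 monthly conversations from the text and an assumed 30-minute average call length:

```python
# Back-of-envelope advisory labor cost model. The hourly rate and
# monthly volume come from the text; the 30-minute average call
# length is an illustrative assumption.

ADVISOR_RATE_PER_HOUR = 150      # average advisor cost (from the text)
CONVERSATIONS_PER_MONTH = 5_000  # monthly advisory conversations (from the text)
AVG_DURATION_HOURS = 0.5         # assumed 30-minute average call

def monthly_advisory_labor_cost(rate, volume, hours_per_call):
    """Labor cost of staffing every advisory conversation with a human."""
    return rate * volume * hours_per_call

cost = monthly_advisory_labor_cost(
    ADVISOR_RATE_PER_HOUR, CONVERSATIONS_PER_MONTH, AVG_DURATION_HOURS
)
print(f"${cost:,.0f} per month")  # prints "$375,000 per month"
```

Under these assumptions, every conversation shifted off an advisor's calendar recovers a half-hour of capacity without removing the client's access to an advisory-quality interaction.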
Real-time conversational video infrastructure changes that cost structure. Advisor-quality presence shifts from a per-conversation staffing expense to an amortized infrastructure cost. The AI Persona handles the volume: the 11 PM check-in, the Saturday onboarding, the third follow-up question about a loan term. Human advisors focus on the conversations where their judgment and relationship depth matter most. The firm's cost per advisory-quality conversation drops. The number of clients who receive that quality of presence goes up.
An AI Persona that handles trust-dependent conversations needs four capabilities working together as a single loop: conversational timing, real-time perception, expressive rendering, and mid-conversation action.
A pre-rendered video avatar can't do any of this. It doesn't perceive the client, doesn't adapt its timing, and doesn't change its behavior based on what's happening in the conversation.
Timing, perception, expression, and mid-conversation action have to work as a single system. If timing is right but the system can't perceive the client, it waits at the wrong moments. If perception is sharp but expression is flat, the client still feels like they're talking to a machine. The loop has to close.
Tavus's Conversational Video Interface (CVI) API is the infrastructure that closes it. Product teams integrate the CVI API into their own applications, building white-label conversational video experiences on top of Tavus's platform. Four components operate as a closed loop inside every real-time AI Persona session: three proprietary models and a large language model (LLM) intelligence layer.
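From the product team's side, opening a CVI session is a single authenticated API call. A minimal sketch, assuming the public REST endpoint and the field names shown; verify the exact schema against the current API reference:

```python
# Minimal sketch of starting a CVI session from a product backend.
# The endpoint and field names follow Tavus's public REST API at the
# time of writing; treat the exact schema as an assumption.
import json
import urllib.request

TAVUS_CONVERSATIONS_URL = "https://tavusapi.com/v2/conversations"

def build_conversation_request(api_key: str, persona_id: str, replica_id: str):
    """Assemble the HTTP request that opens a real-time video session."""
    body = {
        "persona_id": persona_id,  # the configured AI Persona
        "replica_id": replica_id,  # the rendered likeness the session drives
        "conversation_name": "quarterly-portfolio-review",
    }
    return urllib.request.Request(
        TAVUS_CONVERSATIONS_URL,
        data=json.dumps(body).encode(),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

# req = build_conversation_request(API_KEY, "p_advisor", "r_advisor")
# resp = urllib.request.urlopen(req)  # response includes a join URL for the client
```

The response carries a conversation URL the application embeds in its own white-label experience; the client never leaves the firm's product.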
A recently widowed client joins a new client onboarding conversation to discuss managing an inheritance. She begins describing her late husband's investment approach, then pauses mid-sentence, searching for how to say what comes next. Sparrow-1, Tavus's conversational flow model, holds the space open, predicting who owns the conversational floor at the frame level. The silence isn't empty. The client is deciding how much to share.
While Sparrow-1 holds the floor, Raven-1 fuses her steady voice with the dropped gaze and shifted breathing, catching grief surfacing alongside the financial question, with perceptual context never more than 300ms stale. That fused signal reaches the LLM intelligence layer as a natural language description of her state. The LLM reasons over it and determines how to respond: hold space, soften, invite rather than advance.
Phoenix-4 renders what the LLM calls for. It generates a slight nod, a softening of expression, active listening cues drawn from training on thousands of hours of human conversational data. The client feels heard before the AI Persona has said a word. She continues, shares the full picture, and the advice that follows accounts for what a simple questionnaire would have missed entirely.
The closed loop runs at approximately 500ms total pipeline latency. Each component feeds the others continuously: Raven-1's fused perception informs the LLM's reasoning, which shapes Sparrow-1's timing decisions and the expression Phoenix-4 renders. Function Calling handles mid-conversation action, connecting the AI Persona to CRM systems, scheduling tools, and documentation retrieval during the live session. The AI Persona can trigger functions from user speech or from signals Raven-1 perceives in real time, connecting to whatever the conversation requires: a CRM record update, a calendar booking, a documentation pull.
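Mid-conversation actions are defined as tools the intelligence layer can invoke. A sketch of one such tool in the common JSON-schema tool-calling convention; the function name, fields, and dispatcher here are hypothetical, not Tavus's literal schema:

```python
# Illustrative tool definition for mid-conversation Function Calling.
# The shape follows the common LLM tool-calling convention; the
# function name and fields are hypothetical, not Tavus's schema.
schedule_followup_tool = {
    "type": "function",
    "function": {
        "name": "schedule_followup",  # hypothetical CRM/calendar hook
        "description": "Book a follow-up call with a licensed advisor "
                       "when a question requires professional judgment.",
        "parameters": {
            "type": "object",
            "properties": {
                "client_id": {"type": "string"},
                "topic": {"type": "string"},
                "preferred_time": {"type": "string", "format": "date-time"},
            },
            "required": ["client_id", "topic"],
        },
    },
}

def dispatch(tool_call: dict) -> str:
    """Route a tool call emitted during the live session to backend systems."""
    handlers = {
        "schedule_followup": lambda args: f"booked:{args['client_id']}",
    }
    fn = tool_call["function"]["name"]
    return handlers[fn](tool_call["function"]["arguments"])
```

The dispatcher is where the firm's own systems plug in: the same pattern covers a CRM record update, a calendar booking, or a documentation pull.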
Perception and timing are what the AI Persona does in the room. Memory, grounded knowledge, and enforced boundaries are what make it trustworthy across sessions, and what make it viable in a regulated industry.
Memories give the AI Persona cross-session continuity. When a client returns for their quarterly review, the AI Persona recalls that last time they mentioned concerns about their daughter's tuition timeline and the possibility of drawing on the portfolio early. That continuity is what distinguishes an advisor relationship from a transaction. Without it, every conversation starts cold, and the client is the one who has to do the work of re-establishing context. Memories, scoped per participant, carry that context forward automatically.
Knowledge Base grounds the AI Persona's responses in the firm's actual source material. When a client asks about expense ratios on a specific fund, the AI Persona retrieves the answer from the firm's uploaded prospectuses and policy documents in approximately 30ms, fast enough that the conversation doesn't pause. The response carries the authority of verified source material, not general training data.
Guardrails hold the AI Persona to the scope the firm defines. Financial services is one of the most regulated industries in the world, and an AI Persona operating outside approved guidance language is a liability, not an asset. Guardrails keep responses within the firm's authorized guidance range, route out-of-scope questions to appropriate resources, and escalate to a licensed advisor when a client's question requires professional judgment. A compliance officer reviewing the system can see exactly where the boundary sits and confirm it is being enforced in every conversation.
Objectives make the AI Persona's effectiveness measurable. Financial firms need evidence that required steps were completed: that the client reviewed the fee structure and confirmed they understood it, that the risk tolerance questionnaire reached completion before any allocation was discussed, that the required disclosures were delivered and acknowledged. Objectives configure each conversation with specific completion criteria. The AI Persona works toward those outcomes, and the firm has a verifiable record of what was accomplished, not merely that a conversation occurred.
Memories, Knowledge Base, Guardrails, and Objectives are what separate an AI Persona from a sophisticated avatar. Perception and rendering are how the AI Persona shows up in the moment. Memory, knowledge, compliance, and outcome tracking are how it earns and maintains trust across time.
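Those four layers can be sketched as one persona configuration. Every field name below is an illustrative assumption to show how the pieces compose, not the literal Tavus schema:

```python
# Hypothetical persona configuration combining the four trust layers.
# Field names and values are illustrative assumptions.
persona_config = {
    "persona_name": "wealth-advisor",
    "memories": {"scope": "per_participant"},  # context carried across sessions
    "knowledge_base": {"documents": ["prospectuses/", "policies/"]},
    "guardrails": {
        "allowed_topics": ["accounts", "fees", "fund_details"],
        "escalate_to_human": ["personalized_investment_advice"],
    },
    "objectives": [
        {"id": "fee_review", "criterion": "client confirmed fee structure"},
        {"id": "risk_questionnaire", "criterion": "questionnaire completed"},
        {"id": "disclosures", "criterion": "required disclosures acknowledged"},
    ],
}

def objectives_met(config: dict, completed: set) -> bool:
    """Verify every configured objective was completed in the session."""
    return {o["id"] for o in config["objectives"]} <= completed
```

A check like `objectives_met` is the kind of verifiable record the firm keeps: not merely that a conversation occurred, but that each required step was completed.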
The conversations that matter most in financial services share a common trait: what the client signals nonverbally often diverges from what they say. Portfolio reviews, onboarding after a life event, mortgage and loan walkthroughs, post-market check-ins: each of these use cases hits that gap.
Each of these conversations already happens at financial institutions every day. The question is whether the clients who need them most, the ones outside of business hours or simply waiting in queue, get the same quality of presence as everyone else.
Financial conversations carry weight that outlasts the meeting itself. The client who felt heard when she disclosed a family situation during a portfolio review stays. The first-generation investor who was given space to find her question refers her sister. And the couple who actually understood their mortgage terms before signing? They close with confidence and come back for the next product.
That quality of attention, the kind that recalls what a client shared last quarter, grounds its answers in verified source material, stays within the boundaries compliance requires, and tracks whether the conversation actually achieved what it needed to, has been the defining advantage of great financial advisors for decades. It's also been the bottleneck. Every firm has more clients who deserve that presence than advisors who can deliver it.
Real-time conversational video AI removes the bottleneck without removing the presence. The advisor-quality conversation that used to require a human on the other end of every call can now reach every client who needs it. See it for yourself. Book a demo.