What is an AI agent? Types, architecture, and the role of video


If you're being asked to evaluate, fund, or deploy AI agents right now, you've probably noticed that the term means something different to every person in the room. One colleague is describing a basic automation flow for support routing. Another is describing a system that conducts live video interviews and adjusts its approach in real time. Both are called "AI agents."
Before you can separate agents that produce outcomes from agents that produce escalations, you need a framework built on a question almost nobody is asking: what can the agent actually perceive?
That question matters most for product leaders, AI/ML leads, and teams evaluating conversational AI infrastructure for customer- or employee-facing experiences. Most teams are still in the category-education stage: they're trying to understand what belongs in their stack before they commit to a build-or-buy decision.
An AI agent is a software system that pursues a goal by perceiving its environment, reasoning about what to do next, taking action, and iterating, without requiring a human to direct each step. AI agency shows up in sustained autonomy across multiple steps. As MIT Sloan puts it, these are systems that "perceive, reason, and act on their own," completing tasks independently or with minimal human supervision.
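That perceive-reason-act loop can be sketched in a few lines. The toy example below (all names invented, no real framework assumed) shows the structural point: the agent iterates toward a goal and checks completion itself, rather than waiting for a human to direct each step.

```python
class CounterEnv:
    """Toy environment: the goal is to raise a counter to a target value."""
    def __init__(self):
        self.value = 0

    def observe(self):
        return self.value

    def apply(self, action):
        self.value += action


def decide(goal, observation):
    """Reason: pick the next action and report whether the goal is met."""
    if observation >= goal:
        return 0, True
    return 1, False  # take one step toward the goal


def run_agent(goal, env, max_steps=10):
    """Perceive, reason, act, and iterate without per-step human input."""
    for _ in range(max_steps):
        obs = env.observe()               # perceive
        action, done = decide(goal, obs)  # reason
        if done:                          # the agent checks completion itself
            return True
        env.apply(action)                 # act
    return False
```

The loop, not the model behind `decide`, is what makes this agentic; swap in an LLM call and tool execution and the shape stays the same.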
AI assistants, copilots, and agents occupy different roles in a product stack. Assistants answer the prompt in front of them. Copilots support a human who still holds decision authority. Agents carry work across multiple steps, check whether the task is complete, and operate independently within defined boundaries. Teams often collapse these categories together, and the result is bad evaluation criteria. Gartner warns that confusing them is so common it has a name: agentwashing, the mislabeling of basic AI assistants as agents.
Most discussions of agentic AI stay focused on planning, tool use, and multi-step execution. In conversational systems, especially the ones speaking directly to people in consequential situations, perception deserves equal weight. Product leaders need to know what the agent can perceive about the person it's talking to.
AI agents vary by how they respond, how much context they retain, and how much planning they can do. Differences in memory, planning, and context retention shape what happens when a conversation gets messy, emotional, or ambiguous.
The table below gives a quick read on where each type works well and where it tends to break.

| Agent type | Works well when | Tends to break when |
| --- | --- | --- |
| Simple reflex | Inputs map cleanly to fixed responses | The moment requires context or history |
| Model-based | The environment is only partially observable but follows predictable rules | Its internal model drifts from reality |
| Goal-based | A multi-step task has a clear end state | Goals conflict or trade-offs matter |
| Utility-based | Outcomes must be weighed against each other | The utility function misses what people actually value |
| Learning | Feedback is plentiful and behavior should improve over time | Feedback is sparse, noisy, or delayed |
| Hybrid | Real conversations mix all of the above | The perception layer feeds it the wrong signals |
In practice, most enterprise systems land in the hybrid category, and the real question becomes whether the perception layer is strong enough to route the conversation into the right behavior.
As conversations become more consequential, the perception layer matters more. In human-facing systems, the interface has a large influence on how much the agent can actually perceive.
Five core components make an AI agent work: perception, reasoning, memory, execution, and governance. They operate as connected layers rather than independent features.
Architecture frameworks from sources like McKinsey account for these core components, but often leave out the interface, the medium through which the agent communicates with humans. Real-time AI video makes face-to-face conversation available at scale, so the interface belongs in the architecture discussion for conversational agents.
For conversational agents, the interface shapes system capability by setting the limits of perception. What the agent can sense about the person in front of it shapes what it can reason about and how it responds.
Communication research shows that different media carry different amounts of conversational information, and that text, voice, and video do not perform equivalently in cooperative or high-context interactions:

| Medium | Signals carried | What gets lost |
| --- | --- | --- |
| Text | Words alone | Tone, pacing, hesitation, expression, gaze |
| Voice | Words plus tone, prosody, pacing, and hesitation | Facial expression, gaze, posture |
| Video | Voice plus facial expression, gaze, posture, and conversational timing | Comparatively little of the face-to-face signal |
Tavus describes this as the lossy medium problem: traditional systems reduce everything to a transcribed text stream, losing much of the communicative signal.
The table points to a practical decision. If the conversation depends on trust, disclosure, or emotional nuance, the interface determines whether the agent has enough signal to respond well.
In any consequential conversation (one where a person needs to disclose something, consent to something, or make a decision that matters), the medium determines whether the agent has enough information to respond to what the person actually means. A pre-rendered face can look attentive. An AI Persona backed by a behavioral system can perceive expression and conversational timing in real time and respond to those signals, and that perception-to-expression loop creates a stronger sense of presence.
Tavus's Conversational Video Interface (CVI) turns the case for video-based perception into infrastructure: real-time conversational video that gives AI Personas access to a fuller perceptual channel.
Teams build their own conversational experiences on top of the CVI API and SDKs instead of deploying a fixed Tavus-owned experience. The infrastructure is also white-label, so the AI Personas live inside the customer's product surface and brand. For teams evaluating AI video agents, this is where the category becomes concrete. Tavus is also a Human Computing research lab with products, and that mix shows up in how the system is built.
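As a rough sketch of what building on an API like CVI's looks like from a backend, the snippet below assembles a request to create a conversation for a given persona. The host, endpoint path, header, and field names are placeholders for illustration, not the documented Tavus API; consult the actual CVI reference for the real shape.

```python
import json
from urllib import request

API_BASE = "https://api.example.com/v2"  # placeholder host, not the real API


def build_conversation_request(api_key, persona_id):
    """Assemble the HTTP request that would start a video conversation."""
    payload = {"persona_id": persona_id}  # field name is an assumption
    return request.Request(
        f"{API_BASE}/conversations",
        data=json.dumps(payload).encode(),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

The response from a real deployment would typically include a join URL that your frontend embeds, which is what makes the experience white-label: the conversation lives in your product surface, not Tavus's.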
The system delivers the full stack required for AI Personas that feel genuinely human: perception (Raven-1), conversational intelligence (Sparrow-1 + LLM layer), personality and memory (Memories, Knowledge Base, Guardrails, Objectives), and rendering (Phoenix-4). Tavus doesn't just provide the conversational interface. It provides every component necessary for an AI Persona to understand the person it's talking to, remember what matters, and act with the judgment the moment requires.
The behavioral stack behind each AI Persona operates as a closed loop: Sparrow-1 reads conversational intent to govern when the AI Persona speaks, Raven-1 interprets what the other person is feeling through fused audio and visual signals, the LLM layer reasons about what to say and do next, and Phoenix-4 generates real-time facial behavior that reflects that understanding back naturally.
Emotional intelligence appears here as a system capability, with perception feeding expression within Tavus's sub-second end-to-end conversational latency.
Let's take a closer look at these models.
Sparrow-1, the conversational flow model, governs when the AI Persona speaks, waits, or holds the floor. It operates at the frame level from raw audio, predicting floor ownership rather than detecting silence. Tavus uses Sparrow-1 inside the conversational flow layer, while speculative inference is available as a separate LLM configuration option.
On a benchmark of 28 real-world conversational samples, Sparrow-1 achieves 55ms median floor-prediction latency, 100% precision, and zero interruptions. The practical effect is pacing that feels patient as well as fast. During a candidate screening call, Sparrow-1 holds the floor open while an applicant gathers their thoughts rather than jumping in with the next question.
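Sparrow-1's internals aren't public, but the difference between floor prediction and silence detection can be illustrated with a toy gate: instead of replying after a fixed quiet interval, the agent waits until a per-frame probability that the speaker has yielded the floor stays high for several consecutive frames. The threshold and probabilities below are invented for illustration.

```python
def take_floor_at(yield_probs, threshold=0.9, hold_frames=3):
    """Return the frame index where the agent takes the floor, or None.

    Requiring the yield probability to stay above threshold for several
    consecutive frames means a mid-answer pause (a brief dip in speech,
    not a yielded turn) never triggers an interruption.
    """
    streak = 0
    for i, p in enumerate(yield_probs):
        streak = streak + 1 if p >= threshold else 0
        if streak >= hold_frames:
            return i
    return None
```

A thinking pause produces a run of low probabilities, so the agent keeps the floor open; a clearly finished turn produces a sustained high run, and the agent responds.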
Timing matters because an agent that interrupts someone during a consequential decision (a benefits election, a screening call, a consent confirmation) breaks the one thing the interaction depends on: the person's willingness to stay.
Raven-1, the multimodal perception system, works on the same moment from a different angle. It fuses audio and visual signals (tone, prosody, expression, posture, gaze, hesitation) into a unified understanding of the person's state, and it tracks emotional arcs within a single turn.
Raven-1 outputs natural language descriptions of emotional and attentional state that downstream reasoning systems can act on directly, and it is used within Tavus's CVI.
Natural language descriptions matter because they let the LLM layer reason about ambiguity directly. A categorical label like "neutral" tells the agent nothing useful. A description like "confident tone with brief gaze aversion before the final phrase" gives the reasoning layer actual context to decide whether to proceed or probe further.
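The practical consequence is easy to show: a natural-language perception note can be concatenated straight into the reasoning model's context alongside the transcript. The field names and prompt shape below are assumptions for illustration, not Raven-1's actual output format.

```python
def build_reasoning_prompt(transcript, perception_note):
    """Combine what was said with how it was said for the LLM layer."""
    return (
        f"User said: {transcript}\n"
        f"Perceived delivery: {perception_note}\n"
        "Decide whether to proceed or to probe further, and explain why."
    )


prompt = build_reasoning_prompt(
    "I'm comfortable with the allocation.",
    "confident tone with brief gaze aversion before the final phrase",
)
```

A categorical label would have to be mapped into prose before the model could weigh it; a description already is prose.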
A smile paired with a sarcastic tone means something different from the same smile paired with genuine warmth. Raven-1 captures both, making it possible for an AI Persona to distinguish genuine comprehension from performed comprehension.
That perceptual read has to show up on screen. Phoenix-4, the real-time facial behavior engine, generates emotionally responsive behavior from training on thousands of hours of human conversational data rather than from pre-programmed animation states.
Active listening behavior (nodding and responsive micro-expressions while the person speaks) emerges from the training data itself. Full-duplex generation means Phoenix-4 produces behavior while listening, not just when speaking, across 10+ controllable emotional states at 40fps and 1080p.
Active listening behavior matters because it's how people decide whether the agent is actually tracking them or just waiting for its turn to speak. That judgment happens unconsciously, and it determines everything that follows.
Beyond the behavioral stack, CVI includes the intelligence and personality layers that separate a demo from a production-grade agent. Memories retain context across sessions so returning users don't start over. Knowledge Base grounds every response in your actual data and procedures through real-time retrieval. Function Calling lets AI Personas take action mid-conversation: booking appointments, logging results, triggering workflows. And Objectives and Guardrails set measurable completion criteria and compliance boundaries natively. These capabilities map directly to the memory, execution, and governance layers in the architecture framework above.
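A minimal sketch of what the execution side of Function Calling looks like on a backend: the agent emits a tool name and arguments mid-conversation, and a dispatch layer routes the call to a handler. The tool names and schemas below are invented; a real deployment registers its own.

```python
# Hypothetical tool registry: each handler takes the agent's arguments.
TOOLS = {
    "book_appointment": lambda args: f"booked {args['slot']}",
    "log_result": lambda args: f"logged {args['outcome']}",
}


def dispatch(call):
    """Route a tool call from the agent to the matching handler."""
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return {"result": TOOLS[name](args)}
```

Keeping the registry on your side of the API is what makes mid-conversation actions safe to govern: the agent can only invoke what you've explicitly exposed.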
A financial services firm deploys an AI Persona to walk clients through portfolio reviews. One client says she's "comfortable with the allocation," but her pace has slowed and she's broken eye contact. Raven-1 reads both signals together. Sparrow-1 reads the pause as unresolved and holds the floor open. Phoenix-4 sustains an expression that signals the floor is still hers. She raises a concern that changes the direction of the conversation.
The integration across timing, perception, and expression is a useful production test. Systems that hold up in production need all three working together, not one strong demo component. Through the CVI API, SDKs, and white-label deployment model, teams can carry that loop into their own onboarding flows, support journeys, training programs, or advisory experiences.
The business case sits in labor economics and conversation quality. Gartner projects that conversational AI deployments will reduce contact center agent labor costs by $80 billion by 2026. McKinsey estimates that applying generative AI to customer care could increase productivity by an amount worth 30 to 45% of current function costs.
AI Personas shift that from per-conversation labor cost to infrastructure cost amortized across thousands of conversations, and because real-time video preserves more of the cooperative behavior people rely on in face-to-face interaction, those conversations stay closer to the standard that justified having them in the first place.
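The unit-economics shift is simple to state numerically. With invented figures: fixed infrastructure cost divides across volume, so unit cost falls as conversations scale, where a human agent's per-conversation labor cost does not.

```python
def cost_per_conversation(fixed_monthly, variable_per_conv, volume):
    """Blended unit cost once fixed infrastructure is amortized."""
    return fixed_monthly / volume + variable_per_conv


# Invented figures: $5,000/month of infrastructure, $0.50 variable cost.
low_volume = cost_per_conversation(5000, 0.50, 1_000)    # $5.50 each
high_volume = cost_per_conversation(5000, 0.50, 50_000)  # $0.60 each
```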
Product leaders should spend less time fixating on planning sophistication and more time evaluating whether the agent can perceive what is actually happening in the conversation. That evaluation starts with the interface and the signals it preserves. For the conversations that drive real business outcomes, the ones where a person needs to feel genuine presence before they decide, disclose, or commit, face-to-face interaction carries the signals the agent needs.
Organizations that turn AI agent budgets into measurable retention, throughput, and cost reduction usually share a common trait: their agents can read the room and create presence. That's what Tavus AI Personas are built to deliver. Book a demo.