What is an AI agent? Types, architecture, and the role of video


If you're being asked to evaluate, fund, or deploy AI agents right now, you've probably noticed that the term means something different to every person in the room. One colleague is describing a basic automation flow for support routing. Another is describing a system that conducts live video interviews and adjusts its approach in real time. Both are called "AI agents."
Before you can separate agents that produce outcomes from agents that produce escalations, you need a framework built on a question almost nobody is asking: what can the agent actually perceive?
That question matters most for product leaders, AI/ML leads, and teams evaluating conversational AI infrastructure for customer- or employee-facing experiences. Most teams are still in the category-education stage: they're trying to understand what belongs in their stack before they commit to a build-or-buy decision.
An AI agent is a software system that pursues a goal by perceiving its environment, reasoning about what to do next, taking action, and iterating, without requiring a human to direct each step. AI agency shows up in sustained autonomy across multiple steps. As MIT Sloan puts it, these are systems that "perceive, reason, and act on their own," completing tasks independently or with minimal human supervision.
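That perceive-reason-act loop can be sketched in a few lines. The toy example below (all names invented, no real framework assumed) shows the structural point: the agent iterates toward a goal and checks completion itself, rather than waiting for a human to direct each step.

```python
class CounterEnv:
    """Toy environment: the goal is to raise a counter to a target value."""
    def __init__(self):
        self.value = 0

    def observe(self):
        return self.value

    def apply(self, action):
        self.value += action


def decide(goal, observation):
    """Reason: pick the next action and report whether the goal is met."""
    if observation >= goal:
        return 0, True
    return 1, False  # take one step toward the goal


def run_agent(goal, env, max_steps=10):
    """Perceive, reason, act, and iterate without per-step human input."""
    for _ in range(max_steps):
        obs = env.observe()               # perceive
        action, done = decide(goal, obs)  # reason
        if done:                          # the agent checks completion itself
            return True
        env.apply(action)                 # act
    return False
```

The loop, not the model behind `decide`, is what makes this agentic; swap in an LLM call and tool execution and the shape stays the same.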
AI assistants, copilots, and agents occupy different roles in a product stack. Assistants answer the prompt in front of them. Copilots support a human who still holds decision authority. Agents carry work across multiple steps, check whether the task is complete, and operate independently within defined boundaries. Teams often collapse these categories together, and the result is bad evaluation criteria. Gartner warns that confusing them is so common it has a name: agentwashing, the mislabeling of basic AI assistants as agents.
Most discussions of agentic AI stay focused on planning, tool use, and multi-step execution. In conversational systems, especially the ones speaking directly to people in consequential situations, perception deserves equal weight. Product leaders need to know what the agent can perceive about the person it's talking to.
AI agents vary by how they respond, how much context they retain, and how much planning they can do. Differences in memory, planning, and context retention shape what happens when a conversation gets messy, emotional, or ambiguous.
The table below gives a quick read on where each type works well and where it tends to break.

| Agent type | Works well when | Tends to break when |
| --- | --- | --- |
| Simple reflex | Inputs map cleanly to fixed responses | The moment requires context or history |
| Model-based | The environment is only partially observable but follows predictable rules | Its internal model drifts from reality |
| Goal-based | A multi-step task has a clear end state | Goals conflict or trade-offs matter |
| Utility-based | Outcomes must be weighed against each other | The utility function misses what people actually value |
| Learning | Feedback is plentiful and behavior should improve over time | Feedback is sparse, noisy, or delayed |
| Hybrid | Real conversations mix all of the above | The perception layer feeds it the wrong signals |
In practice, most enterprise systems land in the hybrid category, and the real question becomes whether the perception layer is strong enough to route the conversation into the right behavior.
As conversations become more consequential, the perception layer matters more. In human-facing systems, the interface has a large influence on how much the agent can actually perceive.
Five core components make an AI agent work: perception, reasoning, memory, execution, and governance. They operate as connected layers rather than independent features.
Architecture frameworks from sources like McKinsey account for these core components, but often leave out the interface, the medium through which the agent communicates with humans. Real-time AI video makes face-to-face conversation available at scale, so the interface belongs in the architecture discussion for conversational agents.
For conversational agents, the interface shapes system capability by setting the limits of perception. What the agent can sense about the person in front of it shapes what it can reason about and how it responds.
Communication research shows that different media carry different amounts of conversational information, and that text, voice, and video do not perform equivalently in cooperative or high-context interactions:

| Medium | Signals carried | What gets lost |
| --- | --- | --- |
| Text | Words alone | Tone, pacing, hesitation, expression, gaze |
| Voice | Words plus tone, prosody, pacing, and hesitation | Facial expression, gaze, posture |
| Video | Voice plus facial expression, gaze, posture, and conversational timing | Comparatively little of the face-to-face signal |
Tavus describes this as the lossy medium problem: traditional systems reduce everything to a transcribed text stream, losing much of the communicative signal.
The table points to a practical decision. If the conversation depends on trust, disclosure, or emotional nuance, the interface determines whether the agent has enough signal to respond well.
In any consequential conversation (one where a person needs to disclose something, consent to something, or make a decision that matters), the medium determines whether the agent has enough information to respond to what the person actually means. A pre-rendered face can look attentive. An AI Persona backed by a behavioral system can perceive expression and conversational timing in real time and respond to those signals, and that perception-to-expression loop creates a stronger sense of presence.
Tavus's Conversational Video Interface (CVI) turns the case for video-based perception into infrastructure: real-time conversational video that gives AI Personas access to a fuller perceptual channel.
Teams build their own conversational experiences on top of the CVI API and SDKs instead of deploying a fixed Tavus-owned experience. The infrastructure is also white-label, so the AI Personas live inside the customer's product surface and brand. For teams evaluating AI video agents, this is where the category becomes concrete. Tavus is also a Human Computing research lab with products, and that mix shows up in how the system is built.
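As a rough sketch of what building on an API like CVI's looks like from a backend, the snippet below assembles a request to create a conversation for a given persona. The host, endpoint path, header, and field names are placeholders for illustration, not the documented Tavus API; consult the actual CVI reference for the real shape.

```python
import json
from urllib import request

API_BASE = "https://api.example.com/v2"  # placeholder host, not the real API


def build_conversation_request(api_key, persona_id):
    """Assemble the HTTP request that would start a video conversation."""
    payload = {"persona_id": persona_id}  # field name is an assumption
    return request.Request(
        f"{API_BASE}/conversations",
        data=json.dumps(payload).encode(),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

The response from a real deployment would typically include a join URL that your frontend embeds, which is what makes the experience white-label: the conversation lives in your product surface, not Tavus's.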
The system delivers the full stack required for AI Personas that feel genuinely human: perception (Raven-1), conversational intelligence (Sparrow-1 + LLM layer), personality and memory (Memories, Knowledge Base, Guardrails, Objectives), and rendering (Phoenix-4). Tavus doesn't just provide the conversational interface. It provides every component necessary for an AI Persona to understand the person it's talking to, remember what matters, and act with the judgment the moment requires.
The behavioral stack behind each AI Persona operates as a closed loop: Sparrow-1 reads conversational intent to govern when the AI Persona speaks, Raven-1 interprets what the other person is feeling through fused audio and visual signals, the LLM layer reasons about what to say and do next, and Phoenix-4 generates real-time facial behavior that reflects that understanding back naturally.
Emotional intelligence appears here as a system capability, with perception feeding expression within Tavus's sub-second end-to-end conversational latency.
Let's take a closer look at these models.
Sparrow-1, the conversational flow model, governs when the AI Persona speaks, waits, or holds the floor. It operates at the frame level from raw audio, predicting floor ownership rather than detecting silence. Tavus uses Sparrow-1 inside the conversational flow layer, while speculative inference is available as a separate LLM configuration option.
On a benchmark of 28 real-world conversational samples, Sparrow-1 achieves 55ms median floor-prediction latency, 100% precision, and zero interruptions. The practical effect is pacing that feels patient as well as fast. During a candidate screening call, Sparrow-1 holds the floor open while an applicant gathers their thoughts rather than jumping in with the next question.
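Sparrow-1's internals aren't public, but the difference between floor prediction and silence detection can be illustrated with a toy gate: instead of replying after a fixed quiet interval, the agent waits until a per-frame probability that the speaker has yielded the floor stays high for several consecutive frames. The threshold and probabilities below are invented for illustration.

```python
def take_floor_at(yield_probs, threshold=0.9, hold_frames=3):
    """Return the frame index where the agent takes the floor, or None.

    Requiring the yield probability to stay above threshold for several
    consecutive frames means a mid-answer pause (a brief dip in speech,
    not a yielded turn) never triggers an interruption.
    """
    streak = 0
    for i, p in enumerate(yield_probs):
        streak = streak + 1 if p >= threshold else 0
        if streak >= hold_frames:
            return i
    return None
```

A thinking pause produces a run of low probabilities, so the agent keeps the floor open; a clearly finished turn produces a sustained high run, and the agent responds.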
Timing matters because an agent that interrupts someone during a consequential decision (a benefits election, a screening call, a consent confirmation) breaks the one thing the interaction depends on: the person's willingness to stay.
Raven-1, the multimodal perception system, works on the same moment from a different angle. It fuses audio and visual signals (tone, prosody, expression, posture, gaze, hesitation) into a unified understanding of the person's state, and it tracks emotional arcs within a single turn.
Raven-1 outputs natural language descriptions of emotional and attentional state that downstream reasoning systems can act on directly, and it is used within Tavus's CVI.
Natural language descriptions matter because they let the LLM layer reason about ambiguity directly. A categorical label like "neutral" tells the agent nothing useful. A description like "confident tone with brief gaze aversion before the final phrase" gives the reasoning layer actual context to decide whether to proceed or probe further.
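The practical consequence is easy to show: a natural-language perception note can be concatenated straight into the reasoning model's context alongside the transcript. The field names and prompt shape below are assumptions for illustration, not Raven-1's actual output format.

```python
def build_reasoning_prompt(transcript, perception_note):
    """Combine what was said with how it was said for the LLM layer."""
    return (
        f"User said: {transcript}\n"
        f"Perceived delivery: {perception_note}\n"
        "Decide whether to proceed or to probe further, and explain why."
    )


prompt = build_reasoning_prompt(
    "I'm comfortable with the allocation.",
    "confident tone with brief gaze aversion before the final phrase",
)
```

A categorical label would have to be mapped into prose before the model could weigh it; a description already is prose.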
A smile paired with a sarcastic tone means something different from the same smile paired with genuine warmth. Raven-1 captures both, making it possible for an AI Persona to distinguish genuine comprehension from performed comprehension.
That perceptual read has to show up on screen. Phoenix-4, the real-time facial behavior engine, generates emotionally responsive behavior from training on thousands of hours of human conversational data rather than from pre-programmed animation states.
Active listening behavior (nodding and responsive micro-expressions while the person speaks) emerges from the training data itself. Full-duplex generation means Phoenix-4 produces behavior while listening, not just when speaking, across 10+ controllable emotional states at 40fps and 1080p.
Active listening behavior matters because it's how people decide whether the agent is actually tracking them or just waiting for its turn to speak. That judgment happens unconsciously, and it determines everything that follows.
Beyond the behavioral stack, CVI includes the intelligence and personality layers that separate a demo from a production-grade agent. Memories retain context across sessions so returning users don't start over. Knowledge Base grounds every response in your actual data and procedures through real-time retrieval. Function Calling lets AI Personas take action mid-conversation: booking appointments, logging results, triggering workflows. And Objectives and Guardrails set measurable completion criteria and compliance boundaries natively. These capabilities map directly to the memory, execution, and governance layers in the architecture framework above.
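A minimal sketch of what the execution side of Function Calling looks like on a backend: the agent emits a tool name and arguments mid-conversation, and a dispatch layer routes the call to a handler. The tool names and schemas below are invented; a real deployment registers its own.

```python
# Hypothetical tool registry: each handler takes the agent's arguments.
TOOLS = {
    "book_appointment": lambda args: f"booked {args['slot']}",
    "log_result": lambda args: f"logged {args['outcome']}",
}


def dispatch(call):
    """Route a tool call from the agent to the matching handler."""
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return {"result": TOOLS[name](args)}
```

Keeping the registry on your side of the API is what makes mid-conversation actions safe to govern: the agent can only invoke what you've explicitly exposed.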
A financial services firm deploys an AI Persona to walk clients through portfolio reviews. One client says she's "comfortable with the allocation," but her pace has slowed and she's broken eye contact. Raven-1 reads both signals together. Sparrow-1 reads the pause as unresolved and holds the floor open. Phoenix-4 sustains an expression that signals the floor is still hers. She raises a concern that changes the direction of the conversation.
The integration across timing, perception, and expression is a useful production test. Systems that hold up in production need all three working together, not one strong demo component. Through the CVI API, SDKs, and white-label deployment model, teams can carry that loop into their own onboarding flows, support journeys, training programs, or advisory experiences.
The business case sits in labor economics and conversation quality. Gartner projects that conversational AI deployments will reduce contact center agent labor costs by $80 billion by 2026. McKinsey estimates that applying generative AI to customer care could increase productivity by an amount worth 30 to 45% of current function costs.
AI Personas shift that from per-conversation labor cost to infrastructure cost amortized across thousands of conversations, and because real-time video preserves more of the cooperative behavior people rely on in face-to-face interaction, those conversations stay closer to the standard that justified having them in the first place.
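The unit-economics shift is simple to state numerically. With invented figures: fixed infrastructure cost divides across volume, so unit cost falls as conversations scale, where a human agent's per-conversation labor cost does not.

```python
def cost_per_conversation(fixed_monthly, variable_per_conv, volume):
    """Blended unit cost once fixed infrastructure is amortized."""
    return fixed_monthly / volume + variable_per_conv


# Invented figures: $5,000/month of infrastructure, $0.50 variable cost.
low_volume = cost_per_conversation(5000, 0.50, 1_000)    # $5.50 each
high_volume = cost_per_conversation(5000, 0.50, 50_000)  # $0.60 each
```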
Product leaders should spend less time fixating on planning sophistication and more time evaluating whether the agent can perceive what is actually happening in the conversation. That evaluation starts with the interface and the signals it preserves. For the conversations that drive real business outcomes, the ones where a person needs to feel genuine presence before they decide, disclose, or commit, face-to-face interaction carries the signals the agent needs.
Organizations that turn AI agent budgets into measurable retention, throughput, and cost reduction usually share a common trait: their agents can read the room and create presence. That's what Tavus AI Personas are built to deliver. Book a demo.