The (Tavus) Hackathon Cookbook

By Alex Behrens · May 28, 2025

So you want to build a Conversational Video Interface? You’re in good company: startups like Mercor and Delphi, along with Fortune 500s, are leveraging Tavus APIs to build the future of AI interaction.

If you haven’t yet, sign up for a Tavus account here or find a Tavus team member to get free credits and start building right away. Once the promo is applied, you’ll see the free conversational video minutes and replica generations added to your account; the credits work for both new and existing users. You can double-check your billing page to confirm, and track usage anytime via the “invoice history” link.

The Basics

So how does this actually work? At the core of the Tavus offering is something we call CVI, or the Conversational Video Interface. You can think of it like a live video portal into an AI agent. It feels like a real video call, but the other side is a hyperrealistic avatar powered by your prompts, logic, and data.

That avatar is called a Replica. Replicas are digital twins trained on real people using just two minutes of source video. Once submitted, the system kicks off a heavy inference process to generate a 3D Gaussian diffusion model of you! This takes a few hours, but the result is a fully rendered and controllable AI version of that person, ready to plug into anything you want. You can create your own Replicas or use Tavus’s library of 100+ prebuilt Replicas ready to test with right away.
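
If you’d rather script it than click through the portal, here’s a minimal sketch of kicking off Replica training over the API in TypeScript. The video URL and name are placeholders, and the exact payload fields are assumptions worth confirming against the Replica API reference.

```typescript
// Sketch: queue Replica training from ~2 minutes of hosted source video.
// Endpoint and field names are assumptions; confirm in the Replica API docs.
async function createReplica(apiKey: string) {
  const res = await fetch("https://tavusapi.com/v2/replicas", {
    method: "POST",
    headers: { "x-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({
      replica_name: "my-hackathon-replica",                      // placeholder
      train_video_url: "https://example.com/training-video.mp4", // placeholder hosted video
    }),
  });
  return res.json(); // returns a replica_id you can poll while training finishes
}
```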

Once you have your Replica ready you’ll create a Persona, which controls how your Replica behaves in conversation. This includes the system prompt that defines tone and behavior, the voice model it uses, and the LLM that controls the conversation. You can also configure tool calling, memory settings (context dump, RAG, etc), and more.

Each Persona is also made up of layers that define how it processes and responds during a conversation. You can alter the base LLM for generating replies, the perception layer for interpreting visual signals, the Speech-to-Text (STT) engine for transcribing the user’s voice, and the Text-to-Speech (TTS) engine to swap out the Replica’s voice. These settings give you full control over the brain, eyes, ears, and voice of your Persona.
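
To make that concrete, here’s a rough sketch of creating a Persona over the API in TypeScript. The payload shape (persona_name, system_prompt, context, default_replica_id) is an assumption based on the v2 Personas endpoint, so double-check it against the Persona docs; layer overrides are covered in the Models section below.

```typescript
// Sketch: create a Persona that defines how your Replica behaves in conversation.
// Field names are assumptions; verify against the Personas API reference.
async function createPersona(apiKey: string) {
  const res = await fetch("https://tavusapi.com/v2/personas", {
    method: "POST",
    headers: { "x-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({
      persona_name: "Friendly Onboarding Guide",   // placeholder
      system_prompt:
        "You are a friendly onboarding guide. Keep answers short, warm, and practical.",
      context: "The user is setting up their account for the first time.",
      default_replica_id: "r_xxxxxxxx",            // placeholder Replica ID
    }),
  });
  return res.json(); // contains the persona_id you'll pass when creating conversations
}
```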

Once your Replica and Persona are set up, the next step is to initiate a live session using the Create Conversation API. This API call establishes a WebRTC video room where your AI-powered Replica, guided by its Persona, interacts with users in real time.

You can customize each conversation by specifying parameters such as the unique Replica and Persona IDs, and optional settings like a custom greeting, conversational context, language preferences (Tavus supports 30+), and call duration settings. Upon creation, the API returns a conversation_url that can be embedded into your application or accessed directly, allowing users to engage in a dynamic, face-to-face interaction with the AI agent.
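
Here’s what that call might look like in TypeScript. Treat it as a sketch: the IDs are placeholders and the optional properties (greeting, context, language, duration cap) are assumptions to verify against the Create Conversation reference.

```typescript
// Sketch: start a live CVI session and get back a conversation_url to embed.
// Optional property names are assumptions; confirm them in the Create Conversation docs.
async function createConversation(apiKey: string) {
  const res = await fetch("https://tavusapi.com/v2/conversations", {
    method: "POST",
    headers: { "x-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({
      replica_id: "r_xxxxxxxx",                    // placeholder
      persona_id: "p_xxxxxxxx",                    // placeholder
      conversation_name: "Hackathon demo call",
      custom_greeting: "Hey! Ready to build something together?",
      conversational_context: "The user is a hackathon participant exploring Tavus.",
      properties: {
        language: "english",                       // assumed property name
        max_call_duration: 600,                    // seconds; assumed property name
      },
    }),
  });
  const { conversation_url } = await res.json();
  return conversation_url; // embed this URL or open it directly
}
```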

Speaking of embedding, that part is dead simple. You can choose between Daily (default) or LiveKit as your WebRTC provider. Both support flexible open source UI layers, so you can style and control the look and behavior however you want. Use whichever one fits your build best.
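
The quickest (if bare-bones) way to drop a conversation into a page is an iframe pointed at the conversation_url with camera and microphone permissions; for full control over layout and call state, reach for the Daily or LiveKit client SDKs instead. A minimal sketch:

```typescript
// Sketch: embed a Tavus conversation with a plain iframe.
// For production UIs, prefer the Daily or LiveKit client SDKs.
function embedConversation(conversationUrl: string, container: HTMLElement) {
  const frame = document.createElement("iframe");
  frame.src = conversationUrl;
  frame.allow = "camera; microphone; autoplay; display-capture"; // media permissions
  frame.style.width = "100%";
  frame.style.height = "600px";
  frame.style.border = "none";
  container.appendChild(frame);
}

// Usage: embedConversation(url, document.getElementById("call-root")!);
```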

The entire system is API-first and dev-friendly. You can leverage Tavus with TypeScript, JavaScript, Python, or whatever stack you’re working in. Everything from spinning up conversations to creating Replicas and Personas is handled through clean, well-documented endpoints.

Now that you’re briefed, head over to the Create Conversation tab in the Developer Portal to start testing real-time conversations and prompts, and grab your API keys so you can embed your first agent in just a few minutes.

The Models

To build with Tavus, it helps to know how our model stack works. CVI runs on a modular system that handles rendering, perception, timing, and language to deliver lifelike AI video agents.

We built three core models in-house to power the experience:

  • Phoenix-3 handles facial rendering, lip sync, and expression generation
  • Raven-0 processes visual input like emotions, gestures, and object recognition
  • Sparrow-0 manages turn-taking and conversation pacing

We also plug in best-in-class external models for the rest:

  • LLM: Default is Llama 3.3 (70B), good for long context and fast responses, with support for any OpenAI-compatible streaming LLM (fine-tunes and RAG are welcome here!)
  • TTS: Default is Cartesia Sonic 2.0 for voice clones, with support for PlayHT and ElevenLabs voices
  • STT: Multiple provider options depending on your speed and language needs

Each of these can be configured per Persona, giving you full control over the intelligence, tone, and rhythm of your Replica.
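
As a rough sketch of what those per-Persona overrides can look like (the exact field and engine names are assumptions to verify against the Persona layers documentation), you might swap in your own OpenAI-compatible LLM and a different voice:

```typescript
// Sketch: per-Persona layer overrides for the brain (LLM), voice (TTS), and ears (STT).
// Field and engine names are assumptions; check the Persona layers docs.
const personaLayers = {
  llm: {
    model: "your-finetuned-model",                    // any OpenAI-compatible streaming model
    base_url: "https://your-llm-host.example.com/v1", // hypothetical endpoint
    api_key: "sk-...",                                // placeholder
  },
  tts: {
    tts_engine: "cartesia",                           // or an ElevenLabs / PlayHT voice
    voice_id: "your-voice-id",                        // placeholder voice clone ID
  },
  stt: {
    stt_engine: "your-stt-provider",                  // placeholder; pick per speed and language needs
  },
};
```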

The UI/UX

There are a few key UI and UX patterns that make Conversational Video Interfaces feel intuitive and polished. These are based on what we've seen work best across projects using Tavus. Our examples repo gives you everything you need to get started with CVI, from a barebones React framework to fully fleshed out projects.

As mentioned above, the conversation UI is powered by a WebRTC provider. By default, Tavus spins up a prebuilt Daily room with all the core UI components, perfect for testing and quick demos. For production, you’ll want to build your own experience using LiveKit or customize Daily's open source UI.

Start with our embedding guide to get the basics, then dive into the Daily or LiveKit docs for full control.

Here’s the core flow:

  • Intro Screen: Start with a GIF of the Replica and an engaging title or welcome message. Make it immediately clear that this is a two-way video experience, not just a passive video. The CTA should prompt users to start the conversation.
  • Instruction Screen: After clicking the initial CTA, the user needs a moment to prepare for the call. This screen is meant to be skimmed and quickly clicked through, but it gives the user important context.
  • Haircheck Screen: This is your pre-call lobby. Let the user see themselves and get comfortable before jumping in. It also helps set expectations for video and audio permissions.
  • Conversation Screen: Keep this simple and clean. Think FaceTime: focused on the conversation windows, with optional buttons for mute, end, screen share, and any features you’ve added.
  • Closing Screen: Wrap up with a message from the agent and a clear next step. This could be a form, a link, a booking flow, or just a simple “try again” button.
  • Error Screen: If something breaks, keep it human. Show a friendly message and offer a retry option or fallback mode. You may want to consider different screens for errors that behave differently (concurrency limit hit vs. conversation API failure).

Other concepts:

  • Artifacts: Give users a shared output from the session. This could be a checklist, a drawing canvas, generated code, or an AI-created image. Tool calls can trigger these automatically, giving the user something tangible from the interaction.
  • Time Limit: If you’re capping session time, use echo mode to signal the end is near. It lets the agent wrap up gracefully about a minute before timeout (see the sketch after this list).
  • Greenscreen Mode: With greenscreen Replicas, you can embed agents directly on your site as floating holograms. This opens up unique UX patterns, especially for support, onboarding, or in-product guidance. Check out this repo to get started!
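
For the time-limit pattern, here’s a hedged sketch of nudging the agent to wrap up by sending an echo interaction over the Daily data channel. The event and property names (conversation.echo, properties.text) are assumptions to confirm against the Interactions protocol docs.

```typescript
import Daily from "@daily-co/daily-js";

// Sketch: about a minute before the session cap, have the Replica speak a wrap-up line.
// Event and property names are assumptions; verify against the Interactions docs.
async function wrapUpBeforeTimeout(conversationUrl: string, conversationId: string) {
  const call = Daily.createCallObject();
  await call.join({ url: conversationUrl });

  const wrapUpMs = 9 * 60 * 1000; // one minute before a 10-minute cap
  setTimeout(() => {
    call.sendAppMessage(
      {
        message_type: "conversation",
        event_type: "conversation.echo",       // assumed event name
        conversation_id: conversationId,
        properties: { text: "We have about a minute left, so let's start wrapping up." },
      },
      "*",
    );
  }, wrapUpMs);
}
```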

The Capabilities

Tavus gives you full control over how your agents behave, adapt, and react during live sessions. You can push updates to context, inject user inputs, handle real-time events, and output structured artifacts based on both what the user says and what the Replica sees. This is achieved via the persona layer and the conversation layer.

You can also record sessions, transcribe audio, and use those outputs downstream for training data, analytics, or follow-up workflows.

Interaction Types (see all)

Observable Events

Tools and Function Calling

You can give your Persona access to external tools by defining function calls the LLM can trigger during a conversation. These tools let the agent go beyond text and take real action, like querying APIs, updating databases, or triggering workflows.

Example use cases:

  • Get real-time weather, stock prices, or calendar events
  • Trigger actions like sending a follow-up email or booking a meeting
  • Chain multi-step flows by calling custom backend logic

Tools are passed in via the Persona config under layers.llm.tools and can be configured with full JSON schema validation. During the conversation, if the LLM decides to use a tool, a conversation.tool_call event is emitted. It contains the tool name, the arguments as a JSON string, the utterance that triggered the call, and the conversation and inference IDs, letting you listen for tool calls in real time and route them however your app needs.
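
To make that concrete, here’s a sketch of a tool definition (using the familiar OpenAI-style function schema) and of listening for the resulting tool calls over the Daily data channel. The payload field names are assumptions to check against the tool-calling and events docs.

```typescript
import Daily from "@daily-co/daily-js";

// Sketch: a tool passed under layers.llm.tools in the Persona config.
// Follows the OpenAI-style function schema; confirm details in the tool-calling docs.
const weatherTool = {
  type: "function",
  function: {
    name: "get_weather",                       // hypothetical tool
    description: "Get the current weather for a city.",
    parameters: {
      type: "object",
      properties: {
        city: { type: "string", description: "City name, e.g. 'Berlin'" },
      },
      required: ["city"],
    },
  },
};

// Sketch: route tool calls in real time. Payload field names are assumptions.
// Join the conversation_url first: await call.join({ url: conversationUrl });
const call = Daily.createCallObject();
call.on("app-message", (event: any) => {
  const msg = event.data;
  if (msg?.event_type === "conversation.tool_call") {
    const args = JSON.parse(msg.properties.arguments); // arguments arrive as a JSON string
    if (msg.properties.name === "get_weather") {
      console.log("Agent asked for the weather in", args.city);
      // ...hit your backend here, then send the result back into the conversation
    }
  }
});
```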

Perception with Raven

Raven-0 is the perception model that gives your Replica real-time visual awareness. It analyzes webcam and screen share inputs to understand user behavior, emotions, and context, enabling more adaptive, responsive conversations.

Key capabilities include:

  • Ambient and active perception for facial cues, objects, and screen content
  • Visual tool calls that trigger actions based on what Raven sees (e.g. ID cards, frustration)
  • End-of-call analysis that summarizes visual events for downstream logic or reporting
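
A rough sketch of what a Raven-powered perception layer might look like in a Persona config. Field names like ambient_awareness_queries and perception_tools, and the detected_id_card tool itself, are assumptions here; verify them against the perception docs.

```typescript
// Sketch: a Persona perception layer powered by Raven-0.
// Field names and the example tool are assumptions; check the perception layer docs.
const perceptionLayer = {
  perception_model: "raven-0",
  ambient_awareness_queries: [
    "Does the user look confused or frustrated?",
    "Is the user holding up an ID card?",
  ],
  perception_tool_prompt:
    "Call detected_id_card when the user clearly shows a physical ID card on camera.",
  perception_tools: [
    {
      type: "function",
      function: {
        name: "detected_id_card",              // hypothetical visual tool
        description: "Fired when an ID card is visible on camera.",
        parameters: { type: "object", properties: {}, required: [] },
      },
    },
  ],
};
```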

The Prompts

Prompting defines your Replica’s behavior. It instructs the LLM on its role, tone, and boundaries. Without clear prompts, responses can be inconsistent. With well-structured prompts, you can achieve precise and reliable interactions.

Best Practices

  • Role Definition: Begin with "You are..." to establish the agent's identity.
  • Objective: Clearly state the agent's purpose.
  • Tone and Style: Specify how the agent should communicate (e.g., friendly, formal).
  • Boundaries: Define what the agent should avoid or defer.
  • Examples: Include sample interactions to guide behavior.

This format is ideal for long-form prompting inside a Persona, where you want control, tone consistency, and clear safety boundaries.
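
For example, a prompt following that structure (an illustrative sketch, with a made-up persona named Maya) might read:

```typescript
// Sketch: a system prompt structured per the best practices above (illustrative only).
const systemPrompt = `
You are Maya, a patient onboarding specialist for an expense-tracking app.

Objective: help new users connect their bank account and log their first expense.

Tone and style: friendly, concise, encouraging. Plain language, short sentences.

Boundaries: never give financial or legal advice; if asked, suggest contacting support.

Example:
User: "I'm not sure where to start."
You: "No problem! Let's connect your bank first. It takes about a minute."
`;
```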

Real-Time Updates

You can also adjust prompts live during a session using the Overwrite Context Interaction. This lets you steer conversations dynamically, ideal for branching flows, escalation logic, or user-specific tailoring.
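
A sketch of sending that interaction from your app, using the same Daily app-message channel as the echo example earlier. The event name (conversation.overwrite_llm_context) and payload shape are assumptions to confirm in the interaction docs.

```typescript
// Sketch: steer the agent mid-session by replacing its conversational context.
// Event and property names are assumptions; verify against the interaction docs.
function overwriteContext(
  call: { sendAppMessage: (msg: unknown, to: string) => void },
  conversationId: string,
  newContext: string,
) {
  call.sendAppMessage(
    {
      message_type: "conversation",
      event_type: "conversation.overwrite_llm_context", // assumed event name
      conversation_id: conversationId,
      properties: { context: newContext },
    },
    "*",
  );
}
```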

Good prompting is like good directing. Clear roles, consistent tone, and defined limits lead to better, safer interactions.

Why Face-to-Face Matters

You’ve got the tools. Replicas. Personas. Real-time control. Visual perception. Smart prompting. But all of it leads to one core idea... why face-to-face?

Because it changes how people engage.

We don’t just process words. We respond to presence. Faces carry tone, emotion, and trust in a way no text or audio can fully replicate.

  • Emotional registration: The face reacts faster than words
  • Safe roleplay: Let users explore, practice, or simulate without judgment
  • Trust and connection: Visual presence builds credibility and comfort
  • True communication: Facial expressions and eye contact reach deeper parts of the brain

When you start with presence, not just prompts, you can create something special. Give your AI a face, a voice, and personality. Because when someone meets your agent’s gaze, conversations become real. Now let’s bring your AI to life with Tavus.
