Serverless architecture for AI video agents: when it works and when it doesn't
It's 2 AM, and a patient calls about post-surgical medication. Real-time conversational AI video agents can answer that call with the timing and presence of a person, but the infrastructure has to deliver every second of it live. Serverless AI fits many elastic workloads cleanly and breaks down under others, so product leaders need to know which layers serverless can carry and which ones require a different foundation entirely.
Serverless AI has evolved well beyond the original function-as-a-service model. Today's GPU-native platforms manage container orchestration, GPU memory allocation, model loading, and autoscaling as a single managed layer. Providers like Modal, RunPod, and Vast.ai have built infrastructure for AI/ML workloads.
Azure and Google Cloud now offer serverless GPU patterns or adjacent managed inference patterns, and AWS provides managed GPU-based inference through provisioned SageMaker endpoints. Teams avoid provisioning GPU instances that sit idle between requests, and capacity grows with demand. For workloads with unpredictable traffic and low utilization between requests, pay-per-use billing can align infrastructure spend more closely with actual use than reserved instances that keep running while idle.
Serverless fits many AI systems. Real-time conversational video for AI humans exposes the limits quickly, and that's where teams building AI video agents face architecture decisions that are hard to undo later.
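The billing argument can be made concrete with a break-even calculation. The sketch below is illustrative only: the function names and both prices are hypothetical and are not taken from any provider's price list.

```python
def serverless_cost(busy_seconds: float, price_per_gpu_second: float) -> float:
    """Pay-per-use: billed only for the seconds the GPU is actually busy."""
    return busy_seconds * price_per_gpu_second


def reserved_cost(hours: float, price_per_hour: float) -> float:
    """Reserved: billed for every hour, busy or idle."""
    return hours * price_per_hour


def break_even_utilization(price_per_gpu_second: float, price_per_hour: float) -> float:
    """Busy fraction above which a reserved instance becomes cheaper.

    A result >= 1.0 means pay-per-use is cheaper at any utilization.
    """
    return price_per_hour / (price_per_gpu_second * 3600.0)


if __name__ == "__main__":
    # Hypothetical prices: $0.001 per GPU-second vs $1.80 per reserved hour.
    u = break_even_utilization(0.001, 1.80)
    print(f"Reserved wins above {u:.0%} utilization")
```

Under these assumed prices the crossover sits at 50% utilization; a workload that is busy far less than that is the case the paragraph above describes.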
Serverless AI inference performs best when the workload is bursty, latency-tolerant, and stateless. It fits workloads with idle periods between traffic spurts and enough tolerance for occasional cold starts.
Several production patterns fit this profile well:
Both patterns tolerate variable startup time because no one is waiting inside a live conversation for the next response. In a real-time interaction, that same startup delay becomes part of the user experience.
Cold starts are routine production events. Trace data shows that a large share of functions exhibit substantial cold start rates even within standard keep-alive windows. These are not edge cases; they are the normal operating mode.
The Torpor production measurements from Alibaba Cloud quantify the range across model sizes: 8 seconds for a ResNet-152 (1.6 GB), 25 seconds for Stable Diffusion (5.1 GB), 48 seconds for Llama3-8B (13 GB), and 61 seconds for Llama2-13B (24.5 GB). Cold starts scale with model size, and framework initialization dominates: engineering data from GPU-native platforms shows CUDA graph capture can push even smaller models into startup times measured in tens of seconds without snapshotting.
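As a rough way to read those four measurements, the sketch below divides model size by cold start time to get an effective loading rate. The sizes and times come from the figures quoted above; the function name and the interpretation are assumptions for illustration.

```python
# (model, size in GB, cold start in seconds), as quoted in the text above.
COLD_STARTS = [
    ("ResNet-152", 1.6, 8),
    ("Stable Diffusion", 5.1, 25),
    ("Llama3-8B", 13.0, 48),
    ("Llama2-13B", 24.5, 61),
]


def effective_gb_per_second(size_gb: float, seconds: float) -> float:
    """Model size divided by total cold start time."""
    return size_gb / seconds


if __name__ == "__main__":
    for name, size_gb, seconds in COLD_STARTS:
        rate = effective_gb_per_second(size_gb, seconds)
        print(f"{name}: {rate:.2f} GB/s, {seconds / size_gb:.1f} s per GB")
```

The seconds-per-GB figure falls as models grow (about 5.0 for ResNet-152 versus about 2.5 for Llama2-13B). One possible reading, not a conclusion drawn by the source, is that a size-independent initialization cost makes up a larger share of the small-model cold starts.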
GPU memory snapshotting and host-memory swapping cut those times substantially, but cold starts for production LLMs can still take tens of seconds. Even in warm instances, time-to-first-token can spike at high concurrency, according to published inference-serving benchmarks.
Together, memory ceilings and stateless invocation impose a hard limit on interactive workloads.
Low latency is critical for natural-feeling conversational AI. The latency budget has to cover speech-to-text, context retrieval, large language model (LLM) generation, text-to-speech, and facial behavior rendering, and that assumes every component is already warm, connected, and in memory. Even a best-case cold start, measured in seconds, exceeds the entire pipeline budget many times over, so the session fails outright.
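A small budget check illustrates the arithmetic. The 200 ms budget follows the sub-200ms figure used elsewhere in this article and the 8-second cold start is the smallest measurement quoted earlier; the per-stage latencies are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def remaining_budget_ms(stages: Iterable[Stage], budget_ms: float,
                        cold_start_ms: float = 0.0) -> float:
    """Budget left after every stage runs; negative means the turn is late."""
    return budget_ms - cold_start_ms - sum(s.latency_ms for s in stages)


# Hypothetical warm-path latencies, chosen only to illustrate the sum.
WARM_PIPELINE = [
    Stage("speech-to-text", 40),
    Stage("context retrieval", 30),
    Stage("LLM first token", 70),
    Stage("text-to-speech", 30),
    Stage("facial behavior rendering", 25),
]

if __name__ == "__main__":
    print(remaining_budget_ms(WARM_PIPELINE, budget_ms=200))
    print(remaining_budget_ms(WARM_PIPELINE, budget_ms=200, cold_start_ms=8000))
```

With everything warm, the assumed stages leave 5 ms of headroom. Adding a single 8-second cold start puts the same turn 7,995 ms over budget.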
WebRTC sessions, the transport layer for real-time video, are stateful by design: encryption keys, packet sequence numbers, and jitter buffers must persist within a single process for the duration of the session. Production voice AI infrastructure for WebRTC relies on a dedicated session layer that owns connectivity, encryption, and transport state.
Video at 30fps produces a new frame roughly every 33 milliseconds. The inference pipeline can't queue frames while waiting for a worker to initialize. Processing has to happen in an already-running process with the model loaded in memory.
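The frame arithmetic is simple enough to check directly. In this sketch the function names are illustrative, and the 8-second stall reuses the smallest cold start figure quoted earlier.

```python
def frame_interval_ms(fps: float) -> float:
    """Time between consecutive frames at a given frame rate."""
    return 1000.0 / fps


def frames_elapsed(stall_seconds: float, fps: float) -> int:
    """Frames that come due while a worker is still initializing."""
    return round(stall_seconds * fps)


if __name__ == "__main__":
    print(f"{frame_interval_ms(30):.1f} ms per frame")  # 33.3 ms per frame
    print(frames_elapsed(8, 30), "frames due during an 8 s cold start")  # 240
```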
Natural conversation also demands barge-in handling: the ability to detect a mid-response interruption and immediately halt generation. Production voice AI systems achieve this with very low-latency turn detection. Serverless functions running as parallel, independent invocations have no mechanism to cancel an in-flight response or to roll back shared state in response to a new audio event.
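The sketch below shows why barge-in is straightforward inside one long-lived process: the generation task and the interruption signal share an event loop, so the task can be cancelled directly. It uses Python's standard `asyncio`; the function names, chunk timings, and the simulated interruption are assumptions for illustration and do not represent any production system.

```python
import asyncio
from typing import List, Optional


async def generate_response(chunks: List[str], spoken: List[str]) -> None:
    """Simulated streaming generation: one chunk every 50 ms."""
    for chunk in chunks:
        await asyncio.sleep(0.05)  # stand-in for model and speech synthesis work
        spoken.append(chunk)


async def run_turn(chunks: List[str], interrupt: asyncio.Event) -> List[str]:
    """Speak until finished or until the user barges in, whichever is first."""
    spoken: List[str] = []
    gen = asyncio.create_task(generate_response(chunks, spoken))
    barge = asyncio.create_task(interrupt.wait())
    done, _ = await asyncio.wait({gen, barge}, return_when=asyncio.FIRST_COMPLETED)
    if barge in done and not gen.done():
        gen.cancel()  # halt the in-flight response immediately
        try:
            await gen
        except asyncio.CancelledError:
            pass
    else:
        barge.cancel()  # response finished without interruption
    return spoken


async def demo(interrupt_at_s: Optional[float]) -> List[str]:
    interrupt = asyncio.Event()
    if interrupt_at_s is not None:
        # Simulate the user starting to speak partway through the response.
        asyncio.get_running_loop().call_later(interrupt_at_s, interrupt.set)
    return await run_turn(["one", "two", "three", "four", "five"], interrupt)


if __name__ == "__main__":
    print(asyncio.run(demo(0.12)))  # stops early
    print(asyncio.run(demo(None)))  # speaks all five chunks
```

Independent function invocations have no shared task handle like `gen` to cancel, which is the gap the paragraph above describes.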
Official cloud architecture guidance points to the same layered pattern across AWS and Google Cloud: serverless for orchestration and event-driven preprocessing, and dedicated GPU capacity for inference hot paths with latency or throughput requirements.
Serverless functions handle API gateway integration, multi-step workflow orchestration, preprocessing, and result aggregation. Dedicated GPU endpoints handle the latency-sensitive inference: real-time model serving, high-memory workloads, and anything on the user-response critical path.
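The split described above can be sketched as a placement rule. The types, field names, and the rule itself are hypothetical simplifications of the criteria in the text, not a provider API.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SERVERLESS = "serverless"
    DEDICATED_GPU = "dedicated_gpu"


@dataclass(frozen=True)
class Workload:
    name: str
    on_user_response_path: bool
    needs_session_state: bool = False
    high_memory: bool = False


def place(workload: Workload) -> Tier:
    """Send anything latency-critical, stateful, or memory-heavy to dedicated GPU."""
    if (workload.on_user_response_path
            or workload.needs_session_state
            or workload.high_memory):
        return Tier.DEDICATED_GPU
    return Tier.SERVERLESS


if __name__ == "__main__":
    for w in (
        Workload("result aggregation", on_user_response_path=False),
        Workload("workflow orchestration", on_user_response_path=False),
        Workload("real-time model serving", on_user_response_path=True),
    ):
        print(f"{w.name} -> {place(w).value}")
```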
Provisioned concurrency supports workloads that need warm instances without fully dedicated infrastructure. Published AWS guidance says provisioned concurrency reduces Lambda cold starts and can deliver double-digit millisecond response times, whereas on-demand invocations can incur noticeable cold-start latency.
For AI video agents, the choice of architecture comes down to ownership of the real-time layer: perception, conversational timing, facial behavior rendering, and session state. Orchestration, scheduling, analytics, and post-conversation processing can run serverless. Knowing which layer belongs where determines whether the system displays a presence or a loading screen, and that's the decision Tavus was built to make.
Tavus is the human computing company, building full-stack AI humans that see, hear, understand, and respond in real-time conversations. Where serverless architectures require teams to stitch together separate speech-to-text, LLM, text-to-speech, and rendering services across invocation boundaries, Tavus built the entire pipeline as a unified Conversational Video Interface (CVI) with sub-200ms response latency.
An AI human is a system with perception, timing, memory, and reasoning, in which the face is what the user sees and the behavioral stack is what makes the conversation feel real. The CVI runs as a closed loop across five capability layers: perception, conversation, intelligence, personality, and rendering.
Each model in the stack owns a specific stage of the closed loop. Naming what each does, and what it doesn't, is what separates a real-time conversation from a chained pipeline of services.
Sparrow-1 governs conversational flow, predicting floor ownership at the frame level on raw audio with 55ms median latency, 100% precision, 100% recall, and zero interruptions across all 28 benchmark samples. In a candidate screening conversation, Sparrow-1 holds the floor open while an applicant gathers their thoughts, responding at the moment a human listener would. Sparrow-1's floor predictions enable speculative inference at the LLM layer, where response generation begins before the user finishes speaking, and the model then commits or discards based on updated floor predictions.
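The commit-or-discard idea can be sketched generically. This is not Sparrow-1's implementation or API: the function, the per-frame probability stream, and both thresholds are assumptions made up for illustration.

```python
from typing import Callable, Iterable, Optional


def speculative_turn(floor_probs: Iterable[float],
                     draft: Callable[[], str],
                     start_at: float = 0.6,
                     commit_at: float = 0.9) -> Optional[str]:
    """Draft a response early, then commit or discard it.

    floor_probs: hypothetical per-frame probability that the user has
    yielded the floor. Thresholds are illustrative, not tuned values.
    """
    pending: Optional[str] = None
    for p in floor_probs:
        if pending is None and p >= start_at:
            pending = draft()   # begin generating before the turn ends
        elif pending is not None and p < start_at:
            pending = None      # user kept talking: discard the draft
        if pending is not None and p >= commit_at:
            return pending      # floor is confidently ours: commit
    return None


if __name__ == "__main__":
    print(speculative_turn([0.2, 0.7, 0.95], lambda: "draft reply"))  # committed
    print(speculative_turn([0.2, 0.7, 0.3], lambda: "draft reply"))   # None
```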
Raven-1 fuses audio and visual signals into unified understanding with sub-100ms audio perception and rolling perception that keeps context no more than 300ms stale. In a compliance training session, Raven-1 fuses a hesitant tone of voice with a furrowed brow, catching the mismatch between the words and the behavioral signals before the LLM layer adjusts its explanation. The LLM intelligence layer handles reasoning and content decisions, routing content and determining tone and direction.
Phoenix-4 is the real-time facial behavior engine. It renders responsive facial behavior at 40fps at 1080p, generating active listening cues and emergent micro-expressions during both listening and speaking. In a patient intake conversation at 2 AM, the LLM layer shifts the AI human from a clinical guide to an engaged listener, and Phoenix-4 matches the expression with a slight nod and a softened gaze.
CVI also includes the personality layer for production deployment. Memory, grounded knowledge, and the ability to act mid-conversation are what carry a real-time interaction from a single answer to an actual relationship.
Knowledge Base grounds every response in customer-specific data, retrieves relevant content at approximately 30ms, and currently supports English-language content. An insurance company can upload policy documents so the AI can retrieve the relevant coverage clause in real time. Memories retain context across sessions for cross-conversation continuity.
Objectives and Guardrails enforce completion criteria and compliance boundaries, escalating to a human agent when the conversation falls outside defined scope. Function Calling lets AI humans act mid-conversation: booking appointments, logging details, or triggering CRM workflows.
Product teams get the real-time layer through APIs and SDKs, with white-labeled deployment so AI humans match the product's brand. Hybrid architectures still leave ownership of the real-time session layer open. CVI provides that layer.
Every conversation still carries the same promise it did at the start: that someone on the other end is paying attention. Presence is what makes that promise real, and presence is what no cold start can ever deliver.
See it for yourself. Book a demo.
Serverless AI typically costs less for workloads with unpredictable traffic and long idle periods because billing aligns with actual usage. Dedicated GPU instances may deliver better unit economics for sustained, high-volume workloads by avoiding the overhead of repeated initialization and scale-to-zero cycles.
Serverless architectures face structural challenges with real-time voice and video. Systems built on WebRTC carry persistent connection state, so stateless serverless functions are generally not well suited to directly owning a live session. Measured cold starts for production LLMs also take seconds to tens of seconds, far exceeding the sub-200ms pipeline budget that natural conversation requires. Purpose-built infrastructure, like the Tavus CVI, addresses these constraints with always-warm, session-aware processes.
A cold start occurs when a serverless platform initializes a new instance, including container creation, CUDA context setup, model weight loading, and framework initialization. For production LLMs, cold starts can range from a few seconds to tens of seconds, and in some scale-from-zero cases may take longer. Mitigation techniques include GPU memory snapshotting, host-memory swapping, provisioned concurrency, and predictive pre-provisioning.
Avoid serverless when your workload requires sub-second interactive response times, persistent session state across a long-running interaction, continuous bidirectional streaming (audio or video), or models that exceed platform memory ceilings. Hybrid architectures that pair serverless orchestration with dedicated inference for the real-time layer offer the most practical path for teams building conversational AI products.