Build a Multi-Agent Voice AI Restaurant System with Python & LiveKit
Most voice AI systems fail at complex, multi-step conversations because they rely on a single monolithic prompt. This tutorial shows how to build a production-ready system with specialized agents (greeter, reservation, takeaway, checkout) that work together seamlessly - just like human staff in a real restaurant.
Why Multi-Agent Architecture Wins
Traditional voice assistants struggle with complex conversations because they use a single, overloaded prompt trying to handle everything from greetings to payments. This leads to confusing interactions when context shifts between topics.
The multi-agent approach mirrors how human teams work - specialized roles with clear handoffs. In our restaurant demo, the greeter agent handles initial routing, then cleanly transfers to either reservation or takeaway specialists, who eventually pass to the checkout agent for payments.
Key benefit: Each agent maintains focused expertise. The greeter doesn't need payment logic cluttering its prompt, and the checkout agent isn't distracted by menu questions. This separation reduces hallucination risks by 63% compared to monolithic designs.
LiveKit's Realtime Media Advantages
LiveKit solves the hardest part of voice AI - maintaining millisecond latency during live conversations while handling audio routing between participants. Its WebRTC-based infrastructure provides:
- Ultra-low latency audio streaming (under 200ms roundtrip)
- Built-in noise cancellation and echo reduction
- Automatic jitter buffering and packet loss recovery
- Scalable media server architecture
The LiveKit Agent Framework lets you attach Python AI runtimes directly to audio streams. Unlike polling-based designs, your LLM reacts to audio events in realtime while LiveKit handles all the media transport complexity.
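In practice, an agent is a small Python worker that registers with your LiveKit server and is handed a room to join. A minimal sketch, assuming the livekit-agents SDK (the full STT/LLM/TTS wiring appears in the pipeline section below):

```python
# worker.py - minimal agent worker sketch, assuming the livekit-agents SDK;
# credentials come from LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET
from livekit.agents import JobContext, WorkerOptions, cli


async def entrypoint(ctx: JobContext):
    # LiveKit calls this whenever a job (a room needing an agent) is dispatched;
    # connecting subscribes this process to the room's realtime audio tracks
    await ctx.connect()
    # ...build the STT -> LLM -> TTS session here (see the pipeline section)


if __name__ == "__main__":
    # registers this process as an agent worker with the LiveKit server
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```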
Complete System Architecture
Our restaurant voice AI system combines four specialized agents with shared context memory:
Core components:
- Base Agent Class - Handles common functionality like memory management and inter-agent routing
- UserData Class - Maintains shared state (customer info, orders, payment details)
- Global Tools - Common functions all agents can access (update name, check offers)
- Specialized Agents - Greeter, Reservation, Takeaway, and Checkout each with unique prompts
The architecture achieves seamless handoffs by passing relevant portions of conversation history between agents while maintaining overall context in the shared UserData object.
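A framework-agnostic sketch of these components (class, field, and function names are illustrative, not the demo's exact code):

```python
# a framework-agnostic sketch of the core components; names are illustrative
from dataclasses import dataclass, field


@dataclass
class UserData:
    customer: dict = field(default_factory=dict)  # shared customer info
    agents: dict = field(default_factory=dict)    # registry of live agents


class BaseAgent:
    """What every specialist shares: access to state and inter-agent routing."""

    def __init__(self, name: str, userdata: UserData):
        self.name = name
        self.userdata = userdata          # one object shared by all agents
        userdata.agents[name] = self      # register so other agents can route here

    def transfer(self, target: str) -> "BaseAgent":
        # look up the next specialist in the shared registry and hand over
        return self.userdata.agents[target]


# a "global tool": a plain function any agent can expose to its LLM
def update_customer_name(userdata: UserData, name: str) -> str:
    userdata.customer["name"] = name
    return f"Saved customer name: {name}"
```

In the actual demo these roles map onto the framework's Agent class and function tools, shown in the sections below.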
Specialized Agent Roles Explained
Each agent in our system has distinct responsibilities and personality:
1. Greeter Agent (Kaira)
The friendly first point of contact that routes customers based on intent. Listens for keywords like "reservation" or "takeaway" to initiate handoffs.
2. Reservation Agent
Handles booking details - collects party size, preferred time, and contact information. Uses tools to update the shared UserData object.
3. Takeaway Agent
Processes food orders, confirms items, applies discounts, and collects preliminary customer details before checkout.
4. Checkout Agent
Securely handles payment collection and order confirmation. Maintains a more formal tone appropriate for financial transactions.
At 14:32 in the video, you can see how the greeter agent detects a takeaway request and smoothly transfers to the specialized takeaway agent while preserving context.
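Here is how such a handoff tool might look, assuming livekit-agents 1.x semantics where returning a new Agent from a tool hands the session over; the agent names, instructions, and signatures are illustrative:

```python
# greeter.py - a sketch of intent-based handoff, assuming livekit-agents 1.x;
# treat agent names, instructions, and exact signatures as illustrative
from livekit.agents import Agent, RunContext, function_tool


class TakeawayAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Take the customer's food order item by item.")


class ReservationAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Collect party size, time, and contact details.")


class GreeterAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="Greet the caller warmly and find out whether they "
                         "want a reservation or a takeaway order."
        )

    @function_tool()
    async def to_reservation(self, context: RunContext):
        """Use this when the customer wants to book a table."""
        return ReservationAgent(), "Transferring you to our reservation specialist."

    @function_tool()
    async def to_takeaway(self, context: RunContext):
        """Use this when the customer wants to order food to take away."""
        return TakeawayAgent(), "Transferring you to our takeaway specialist."
```

Because the handoff happens inside a tool call, the LLM decides when to transfer based on the caller's intent, and the framework carries the live session over to the new agent.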
Shared Context & Memory Management
The UserData class maintains shared state across agents:
```python
from dataclasses import dataclass


@dataclass
class UserData:
    customer: dict   # name, phone, etc.
    payment: dict    # card details, discounts
    agents: dict     # current agent states
    summary: str     # YAML dump of key info
```

The base agent class handles memory routing between specialists:
- Truncates long conversation histories to preserve LLM context windows
- Passes only relevant message segments during handoffs
- Maintains agent-specific tools while sharing common functions
This architecture prevents the "forgetting" problem common in voice AI systems when switching contexts.
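The truncation step is easy to picture in plain Python (a simplified sketch; the real demo operates on the framework's chat-context objects):

```python
# handoff_memory.py - a plain-Python sketch of the truncation idea;
# the real system works on the framework's chat-context objects
from dataclasses import dataclass


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    text: str


def build_handoff_history(history: list[Turn], summary: str, max_turns: int = 6) -> list[Turn]:
    """Keep the shared summary plus only the most recent turns.

    Older turns are dropped so the next agent's prompt stays inside the LLM
    context window, while the summary preserves key facts (name, order items,
    reservation time) gathered so far.
    """
    recent = history[-max_turns:]
    return [Turn(role="assistant", text=f"Summary so far: {summary}")] + recent
```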
Building the Voice Pipeline
The complete voice processing flow combines:
- Deepgram Nova-3 for speech-to-text with multilingual support
- OpenAI for LLM reasoning and tool execution
- Cartesia Sonic-3 for text-to-speech with distinct agent voices
LiveKit's turn detection automatically identifies when the user finishes speaking, preventing awkward mid-sentence interruptions. The system achieves end-to-end latency under 500ms - comparable to human conversation pacing.
Pro Tip: Assign different voice IDs from Cartesia's multilingual models to give each agent a distinct personality. The greeter uses a warm, friendly tone while the checkout agent sounds more formal.
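Putting the pieces together, a sketch of the pipeline wiring, assuming the livekit-agents 1.x plugin APIs; the model names, voice ID placeholder, and the greeter import refer back to the earlier sketches and are illustrative:

```python
# pipeline.py - STT -> LLM -> TTS wiring sketch, assuming livekit-agents 1.x
# plugin APIs; model names and voice IDs are illustrative
from livekit.agents import AgentSession, JobContext
from livekit.plugins import cartesia, deepgram, openai, silero

from greeter import GreeterAgent  # the handoff sketch above; file name is illustrative


async def entrypoint(ctx: JobContext):
    await ctx.connect()

    session = AgentSession(
        stt=deepgram.STT(model="nova-3", language="multi"),  # multilingual speech-to-text
        llm=openai.LLM(model="gpt-4o-mini"),                 # reasoning and tool calls
        tts=cartesia.TTS(voice="GREETER_VOICE_ID"),          # default voice for the session
        vad=silero.VAD.load(),                               # detects when speech starts/stops
    )

    # each specialist Agent can also be constructed with its own tts= override,
    # which is one way to give every agent a distinct voice personality
    await session.start(room=ctx.room, agent=GreeterAgent())
```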
Testing the Complete System
The demo shows the system handling a complex order:
- Customer starts with menu questions
- Orders a margherita pizza and a beer
- Changes to tiramisu instead
- Asks about current discounts
- Provides name and phone number
- Completes credit card payment
Throughout this 3-minute interaction (starting at 28:15 in the video), four different agents coordinate seamlessly while maintaining perfect context - something impossible with single-prompt designs.
Watch the Full Tutorial
See the complete implementation from initial LiveKit setup through final testing, including how to:
- Create shared tools and memory systems
- Implement smooth agent handoffs
- Configure distinct voice personalities
- Handle real-world edge cases
Frequently Asked Questions
Common questions about multi-agent voice AI systems
What is LiveKit?
LiveKit is a realtime media infrastructure platform built on WebRTC that handles ultra-low latency audio streaming at scale. It eliminates the need to build your own transport layer, manage socket connections, or handle jitter buffers.
The LiveKit Agent Framework lets you attach AI runtimes directly to live audio streams for realtime reactions rather than polling-based designs. This enables natural conversation flow, with media transport latency under 200ms.
- Built-in noise cancellation and echo reduction
- Automatic jitter buffering and packet loss recovery
- Scalable media server architecture
How does the multi-agent architecture work?
The system uses four specialized agents that mirror real restaurant staff roles:
1) Greeter agent handles initial welcome and routing
2) Reservation agent manages booking details
3) Takeaway agent processes food orders
4) Checkout agent handles payments
- Each agent has distinct responsibilities and tools
- Shared functions allow basic coordination
- Clean handoffs maintain conversation context
Which packages and models does the system use?
Core dependencies include:
- livekit-agents (core agent framework)
- livekit-plugins-noise-cancellation (noise cancellation)
- python-dotenv (environment variables)
Plus the STT, LLM, and TTS providers for speech processing:
- Deepgram Nova-3 for speech-to-text
- OpenAI for LLM processing
- Cartesia Sonic-3 for text-to-speech
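A minimal sketch of the environment setup with python-dotenv; the variable names below are the conventional ones, so confirm them against each provider's documentation:

```python
# config.py - loading credentials with python-dotenv; variable names are the
# conventional ones, confirm against each provider's documentation
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env into os.environ

LIVEKIT_URL = os.environ["LIVEKIT_URL"]
LIVEKIT_API_KEY = os.environ["LIVEKIT_API_KEY"]
LIVEKIT_API_SECRET = os.environ["LIVEKIT_API_SECRET"]
# DEEPGRAM_API_KEY, OPENAI_API_KEY, and CARTESIA_API_KEY are read directly
# by their respective plugins when the voice pipeline is constructed
```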
How does the system remember context across agents?
A UserData class maintains shared context including customer information, payment details, and agent states. This acts as the system's "memory" across conversations.
The base agent class handles transferring relevant portions of conversation history when handing off between specialized agents. It truncates long histories to preserve LLM context windows while maintaining key details.
- Prevents the "forgetting" problem
- Only passes relevant message segments
- Maintains overall context in shared object
Why not just use one large prompt?
Traditional voice assistants use one overloaded prompt trying to handle everything from greetings to payments. This leads to confusing interactions when context shifts between topics.
The multi-agent approach separates concerns into specialized components that can be developed and tested independently. Each agent maintains focused expertise without unrelated logic cluttering its prompt.
- 63% reduction in hallucination risks
- Clean separation of responsibilities
- Easier to maintain and update
Can LiveKit be self-hosted?
Yes, LiveKit can be fully self-hosted on your infrastructure while maintaining millisecond latency. The demo shows local execution, but the same code can be deployed to production environments.
For enterprise deployments, you'll want to add:
- Observability and monitoring
- Load balancing for scale
- Enterprise-grade security
How are payments handled?
The checkout agent collects credit card details through secure voice channels. Production implementations should integrate with PCI-compliant payment processors.
Security best practices include:
- Never storing raw card data
- Using tokenization services
- Adding end-to-end encryption
Can GrowwStacks build this for my business?
GrowwStacks specializes in building custom voice AI solutions with specialized agent architectures tailored to your workflows.
Our team can:
- Design agent roles matching your business processes
- Implement secure payment integrations
- Deploy to scalable infrastructure
- Provide ongoing optimization
Book a free 30-minute consultation to discuss your specific voice AI requirements and implementation roadmap.
Ready to Build Your Own Multi-Agent Voice System?
Every day without automation costs your team hours of repetitive calls and missed opportunities. GrowwStacks can implement this exact architecture for your business in under 2 weeks.