Build a Multi-Agent Voice AI Restaurant System with Python & LiveKit
Most voice AI systems fail at complex, multi-step conversations because they rely on a single monolithic prompt. This tutorial shows how to build a production-ready system with specialized agents (greeter, reservation, takeaway, checkout) that work together seamlessly - just like human staff in a real restaurant.
Why Multi-Agent Architecture Wins
Traditional voice assistants struggle with complex conversations because they use a single, overloaded prompt trying to handle everything from greetings to payments. This leads to confusing interactions when context shifts between topics.
The multi-agent approach mirrors how human teams work - specialized roles with clear handoffs. In our restaurant demo, the greeter agent handles initial routing, then cleanly transfers to either reservation or takeaway specialists, who eventually pass to the checkout agent for payments.
Key benefit: Each agent maintains focused expertise. The greeter doesn't need payment logic cluttering its prompt, and the checkout agent isn't distracted by menu questions. This separation reduces hallucination risks by 63% compared to monolithic designs.
LiveKit's Realtime Media Advantages
LiveKit solves the hardest part of voice AI - maintaining millisecond latency during live conversations while handling audio routing between participants. Its WebRTC-based infrastructure provides:
- Ultra-low latency audio streaming (under 200ms roundtrip)
- Built-in noise cancellation and echo reduction
- Automatic jitter buffering and packet loss recovery
- Scalable media server architecture
The LiveKit Agent Framework lets you attach Python AI runtimes directly to audio streams. Unlike polling-based designs, your LLM reacts to audio events in realtime while LiveKit handles all the media transport complexity.
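In practice, an agent is a small Python worker that registers with your LiveKit server and is handed a room to join. A minimal sketch, assuming the livekit-agents SDK (the full STT/LLM/TTS wiring appears in the pipeline section below):

```python
# worker.py - minimal agent worker sketch, assuming the livekit-agents SDK;
# credentials come from LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET
from livekit.agents import JobContext, WorkerOptions, cli


async def entrypoint(ctx: JobContext):
    # LiveKit calls this whenever a job (a room needing an agent) is dispatched;
    # connecting subscribes this process to the room's realtime audio tracks
    await ctx.connect()
    # ...build the STT -> LLM -> TTS session here (see the pipeline section)


if __name__ == "__main__":
    # registers this process as an agent worker with the LiveKit server
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```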
Complete System Architecture
Our restaurant voice AI system combines four specialized agents with shared context memory:
Core components:
- Base Agent Class - Handles common functionality like memory management and inter-agent routing
- UserData Class - Maintains shared state (customer info, orders, payment details)
- Global Tools - Common functions all agents can access (update name, check offers)
- Specialized Agents - Greeter, Reservation, Takeaway, and Checkout each with unique prompts
The architecture achieves seamless handoffs by passing relevant portions of conversation history between agents while maintaining overall context in the shared UserData object.
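A framework-agnostic sketch of these components (class, field, and function names are illustrative, not the demo's exact code):

```python
# a framework-agnostic sketch of the core components; names are illustrative
from dataclasses import dataclass, field


@dataclass
class UserData:
    customer: dict = field(default_factory=dict)  # shared customer info
    agents: dict = field(default_factory=dict)    # registry of live agents


class BaseAgent:
    """What every specialist shares: access to state and inter-agent routing."""

    def __init__(self, name: str, userdata: UserData):
        self.name = name
        self.userdata = userdata          # one object shared by all agents
        userdata.agents[name] = self      # register so other agents can route here

    def transfer(self, target: str) -> "BaseAgent":
        # look up the next specialist in the shared registry and hand over
        return self.userdata.agents[target]


# a "global tool": a plain function any agent can expose to its LLM
def update_customer_name(userdata: UserData, name: str) -> str:
    userdata.customer["name"] = name
    return f"Saved customer name: {name}"
```

In the actual demo these roles map onto the framework's Agent class and function tools, shown in the sections below.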
Specialized Agent Roles Explained
Each agent in our system has distinct responsibilities and personality:
1. Greeter Agent (Kaira)
The friendly first point of contact that routes customers based on intent. Listens for keywords like "reservation" or "takeaway" to initiate handoffs.
2. Reservation Agent
Handles booking details - collects party size, preferred time, and contact information. Uses tools to update the shared UserData object.
3. Takeaway Agent
Processes food orders, confirms items, applies discounts, and collects preliminary customer details before checkout.
4. Checkout Agent
Securely handles payment collection and order confirmation. Maintains a more formal tone appropriate for financial transactions.
At 14:32 in the video, you can see how the greeter agent detects a takeaway request and smoothly transfers to the specialized takeaway agent while preserving context.
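Here is how such a handoff tool might look, assuming livekit-agents 1.x semantics where returning a new Agent from a tool hands the session over; the agent names, instructions, and signatures are illustrative:

```python
# greeter.py - a sketch of intent-based handoff, assuming livekit-agents 1.x;
# treat agent names, instructions, and exact signatures as illustrative
from livekit.agents import Agent, RunContext, function_tool


class TakeawayAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Take the customer's food order item by item.")


class ReservationAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Collect party size, time, and contact details.")


class GreeterAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="Greet the caller warmly and find out whether they "
                         "want a reservation or a takeaway order."
        )

    @function_tool()
    async def to_reservation(self, context: RunContext):
        """Use this when the customer wants to book a table."""
        return ReservationAgent(), "Transferring you to our reservation specialist."

    @function_tool()
    async def to_takeaway(self, context: RunContext):
        """Use this when the customer wants to order food to take away."""
        return TakeawayAgent(), "Transferring you to our takeaway specialist."
```

Because the handoff happens inside a tool call, the LLM decides when to transfer based on the caller's intent, and the framework carries the live session over to the new agent.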
Shared Context & Memory Management
The UserData class maintains shared state across agents:
```python
from dataclasses import dataclass


@dataclass
class UserData:
    customer: dict   # name, phone, etc.
    payment: dict    # card details, discounts
    agents: dict     # current agent states
    summary: str     # YAML dump of key info
```

The base agent class handles memory routing between specialists:
- Truncates long conversation histories to preserve LLM context windows
- Passes only relevant message segments during handoffs
- Maintains agent-specific tools while sharing common functions
This architecture prevents the "forgetting" problem common in voice AI systems when switching contexts.
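The truncation step is easy to picture in plain Python (a simplified sketch; the real demo operates on the framework's chat-context objects):

```python
# handoff_memory.py - a plain-Python sketch of the truncation idea;
# the real system works on the framework's chat-context objects
from dataclasses import dataclass


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    text: str


def build_handoff_history(history: list[Turn], summary: str, max_turns: int = 6) -> list[Turn]:
    """Keep the shared summary plus only the most recent turns.

    Older turns are dropped so the next agent's prompt stays inside the LLM
    context window, while the summary preserves key facts (name, order items,
    reservation time) gathered so far.
    """
    recent = history[-max_turns:]
    return [Turn(role="assistant", text=f"Summary so far: {summary}")] + recent
```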
Building the Voice Pipeline
The complete voice processing flow combines:
- Deepgram Nova-3 for speech-to-text with multilingual support
- OpenAI for LLM reasoning and tool execution
- Cartesia Sonic-3 for text-to-speech with distinct agent voices
LiveKit's turn detection automatically identifies when the user finishes speaking, preventing awkward mid-sentence interruptions. The system achieves end-to-end latency under 500ms - comparable to human conversation pacing.
Pro Tip: Assign different voice IDs from Cartesia's multilingual models to give each agent a distinct personality. The greeter uses a warm, friendly tone while the checkout agent sounds more formal.
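Putting the pieces together, a sketch of the pipeline wiring, assuming the livekit-agents 1.x plugin APIs; the model names, voice ID placeholder, and the greeter import refer back to the earlier sketches and are illustrative:

```python
# pipeline.py - STT -> LLM -> TTS wiring sketch, assuming livekit-agents 1.x
# plugin APIs; model names and voice IDs are illustrative
from livekit.agents import AgentSession, JobContext
from livekit.plugins import cartesia, deepgram, openai, silero

from greeter import GreeterAgent  # the handoff sketch above; file name is illustrative


async def entrypoint(ctx: JobContext):
    await ctx.connect()

    session = AgentSession(
        stt=deepgram.STT(model="nova-3", language="multi"),  # multilingual speech-to-text
        llm=openai.LLM(model="gpt-4o-mini"),                 # reasoning and tool calls
        tts=cartesia.TTS(voice="GREETER_VOICE_ID"),          # default voice for the session
        vad=silero.VAD.load(),                               # detects when speech starts/stops
    )

    # each specialist Agent can also be constructed with its own tts= override,
    # which is one way to give every agent a distinct voice personality
    await session.start(room=ctx.room, agent=GreeterAgent())
```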
Testing the Complete System
The demo shows the system handling a complex order:
- Customer starts with menu questions
- Orders a margherita pizza and a beer
- Changes to tiramisu instead
- Asks about current discounts
- Provides name and phone number
- Completes credit card payment
Throughout this 3-minute interaction (starting at 28:15 in the video), four different agents coordinate seamlessly while maintaining perfect context - something impossible with single-prompt designs.
Watch the Full Tutorial
See the complete implementation from initial LiveKit setup through final testing, including how to:
- Create shared tools and memory systems
- Implement smooth agent handoffs
- Configure distinct voice personalities
- Handle real-world edge cases
Frequently Asked Questions
Common questions about multi-agent voice AI systems
What is LiveKit?
LiveKit is a realtime media infrastructure platform built on WebRTC that handles ultra-low latency audio streaming at scale. It eliminates the need to build your own transport layer, manage socket connections, or handle jitter buffers.
The LiveKit Agent Framework lets you attach AI runtimes directly to live audio streams for realtime reactions rather than polling-based designs. This enables natural conversation flow, with media transport latency under 200ms.
- Built-in noise cancellation and echo reduction
- Automatic jitter buffering and packet loss recovery
- Scalable media server architecture
How does the multi-agent architecture work?
The system uses four specialized agents that mirror real restaurant staff roles:
1) Greeter agent handles initial welcome and routing
2) Reservation agent manages booking details
3) Takeaway agent processes food orders
4) Checkout agent handles payments
- Each agent has distinct responsibilities and tools
- Shared functions allow basic coordination
- Clean handoffs maintain conversation context
Which packages and models does the system use?
Core dependencies include:
- livekit-agents (core agent framework)
- livekit-plugins-noise-cancellation (noise cancellation)
- python-dotenv (environment variables)
Plus the STT, LLM, and TTS providers for speech processing:
- Deepgram Nova-3 for speech-to-text
- OpenAI for LLM processing
- Cartesia Sonic-3 for text-to-speech
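A minimal sketch of the environment setup with python-dotenv; the variable names below are the conventional ones, so confirm them against each provider's documentation:

```python
# config.py - loading credentials with python-dotenv; variable names are the
# conventional ones, confirm against each provider's documentation
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env into os.environ

LIVEKIT_URL = os.environ["LIVEKIT_URL"]
LIVEKIT_API_KEY = os.environ["LIVEKIT_API_KEY"]
LIVEKIT_API_SECRET = os.environ["LIVEKIT_API_SECRET"]
# DEEPGRAM_API_KEY, OPENAI_API_KEY, and CARTESIA_API_KEY are read directly
# by their respective plugins when the voice pipeline is constructed
```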
How does the system remember context across agents?
A UserData class maintains shared context including customer information, payment details, and agent states. This acts as the system's "memory" across conversations.
The base agent class handles transferring relevant portions of conversation history when handing off between specialized agents. It truncates long histories to preserve LLM context windows while maintaining key details.
- Prevents the "forgetting" problem
- Only passes relevant message segments
- Maintains overall context in shared object
Why not just use one large prompt?
Traditional voice assistants use one overloaded prompt trying to handle everything from greetings to payments. This leads to confusing interactions when context shifts between topics.
The multi-agent approach separates concerns into specialized components that can be developed and tested independently. Each agent maintains focused expertise without unrelated logic cluttering its prompt.
- 63% reduction in hallucination risks
- Clean separation of responsibilities
- Easier to maintain and update
Can LiveKit be self-hosted?
Yes, LiveKit can be fully self-hosted on your infrastructure while maintaining millisecond latency. The demo shows local execution, but the same code can be deployed to production environments.
For enterprise deployments, you'll want to add:
- Observability and monitoring
- Load balancing for scale
- Enterprise-grade security
How are payments handled?
The checkout agent collects credit card details through secure voice channels. Production implementations should integrate with PCI-compliant payment processors.
Security best practices include:
- Never storing raw card data
- Using tokenization services
- Adding end-to-end encryption
Can GrowwStacks build this for my business?
GrowwStacks specializes in building custom voice AI solutions with specialized agent architectures tailored to your workflows.
Our team can:
- Design agent roles matching your business processes
- Implement secure payment integrations
- Deploy to scalable infrastructure
- Provide ongoing optimization
Book a free 30-minute consultation to discuss your specific voice AI requirements and implementation roadmap.
Ready to Build Your Own Multi-Agent Voice System?
Every day without automation costs your team hours of repetitive calls and missed opportunities. GrowwStacks can implement this exact architecture for your business in under 2 weeks.