
Build a Multilingual Voice Agent with Gemini Live API and LiveKit

Voice interfaces are transforming customer interactions, but most solutions still convert speech to text first, losing nuance and personality. With Gemini 3.1's native audio model and LiveKit's real-time infrastructure, you can now build voice agents that sound genuinely human - maintaining consistent personas across long conversations while switching between approximately 70 languages.

Why Gemini 3.1 Audio Changes Everything

Traditional voice agents suffer from three critical limitations: unnatural speech patterns due to text conversion, inconsistent personas during long conversations, and clumsy language switching. Gemini 3.1's native audio processing eliminates these pain points by handling audio directly.

The model achieves this through Google's upgraded Live API infrastructure, which provides better numeric precision for audio processing and more reliable function calling. During testing, we observed response times up to 40% faster than previous-generation models, with higher audio quality.

Key breakthrough: in our testing, Gemini 3.1 maintained persona consistency across conversations of 15+ minutes, with less than 5% speaker drift versus 25-30% in previous models. This makes it well suited to customer service and support scenarios.

Setting Up Your Development Environment

Getting started takes only a few steps. First, create a new Python project with uv (pip works too if you prefer). The key dependencies are the LiveKit Agents package (livekit-agents) and its Google plugin (livekit-plugins-google), currently at version 1.4.

You'll need credentials from two services: LiveKit Cloud (a project URL, API key, and API secret) and Google AI Studio (an API key). Store them in a .env.local file so they stay out of your source code. The load_dotenv call at the top of your agent file reads this file when the agent starts.
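
Before starting the agent, it can help to confirm that the credentials are actually present. The sketch below assumes the variable names that LiveKit Agents and its Google plugin conventionally read (LIVEKIT_URL, LIVEKIT_API_KEY, LIVEKIT_API_SECRET, GOOGLE_API_KEY). The helper name missing_env_vars is hypothetical; adjust the list if your setup uses different names.

```python
import os
from typing import Mapping

# Assumed variable names; adjust to match your own .env.local file.
REQUIRED_VARS = (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "GOOGLE_API_KEY",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling missing_env_vars() right after load_dotenv and stopping early when the list is non-empty gives a clearer error than a failed connection later on.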

Security note: Never commit your .env.local file to version control. Add it to your .gitignore immediately after creation.

Creating Your First Voice Agent

The core agent structure is surprisingly simple. Create a new Python file (we named ours agent.py) and import the necessary modules: dotenv for environment variables, LiveKit for the core functionality, and the Google plugin for Gemini integration.

The agent class holds your instructions, and the model is configured when the session is set up. Note that we specify gemini-3.1/audio as the model (currently in an early access program, EAP); you can optionally set voice characteristics at the same point.

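    from dotenv import load_dotenv
    from livekit import agents
    from livekit.agents import Agent, AgentSession
    from livekit.plugins import google

    # Load the LiveKit and Google credentials from the local env file.
    load_dotenv(".env.local")

    class MyAgent(Agent):
        def __init__(self) -> None:
            super().__init__(instructions="You are a helpful customer support agent...")

    async def entrypoint(ctx: agents.JobContext):
        # Model ID as given in this tutorial (EAP); availability may vary by account.
        # Older plugin releases expose RealtimeModel under google.beta.realtime.
        session = AgentSession(llm=google.realtime.RealtimeModel(model="gemini-3.1/audio"))
        await session.start(room=ctx.room, agent=MyAgent())

    if __name__ == "__main__":
        agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))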

Crafting Effective System Prompts

Most voice agents fail at the system prompt level by being too generic. "You are a helpful assistant" produces robotic, inconsistent responses. Instead, be explicit about persona, scope, and limitations right in your prompt.

Write for audio delivery - short sentences with natural pauses work better than dense paragraphs. Include guardrails directly in the prompt rather than relying on separate safety layers. For example, instruct the model how to handle off-topic questions rather than filtering them after the fact.

Pro tip: Use director notes to shape voice characteristics. Want a cheerful Irish accent speaking at a 20% slower pace? Just say so in the prompt - the model will adjust its delivery accordingly.
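
The prompt guidance above can be sketched as a small builder that keeps persona, scope, guardrails, and director notes as separate short sentences. This is only an illustration: the function name build_system_prompt, the persona "Aoife", and the company "Example Telecom" are made up, and the wording of each sentence is an assumption to adapt to your own agent.

```python
def build_system_prompt(name: str, company: str, topics: list[str], voice_notes: str) -> str:
    """Assemble a voice-agent system prompt from persona, scope, guardrails, and director notes."""
    scope = ", ".join(topics)
    sections = [
        # Persona: a named role instead of a generic "helpful assistant".
        f"You are {name}, a customer support agent for {company}.",
        # Scope: state explicitly what the agent covers.
        f"You only help with: {scope}.",
        # Guardrail: handle off-topic questions in the prompt itself.
        "If asked about anything else, say briefly that you cannot help with that "
        "and offer to return to a supported topic.",
        # Audio delivery: short sentences are easier to listen to.
        "Keep answers short. Use one idea per sentence.",
        # Director notes: describe the desired voice and pace.
        f"Director notes: {voice_notes}",
    ]
    return "\n".join(sections)


prompt = build_system_prompt(
    name="Aoife",
    company="Example Telecom",
    topics=["billing", "account status"],
    voice_notes="Cheerful Irish accent, speaking about 20% slower than normal.",
)
```

Keeping each concern on its own line makes it easier to change one part, such as the voice notes, without disturbing the rest.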

Implementing Tool Calling

Tool calling is where your agent becomes truly powerful. There are two approaches: server-side functions using the @function_tool decorator, and client-side RPC calls that execute in the user's browser or app.

Server-side tools are perfect for actions like looking up account statuses or querying databases. The @function_tool decorator lets you describe what the tool does, and the model will call it automatically when appropriate.

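    from livekit.agents import function_tool

    @function_tool(
        name="lookup_account",
        description="Look up a customer's account status by ID",
    )
    async def lookup_account(account_id: str) -> dict:
        # Placeholder result; a real implementation would query your backend.
        return {"status": "active", "balance": 125.50}

    # Register the tool with your agent, e.g. Agent(instructions=..., tools=[lookup_account]).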
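
The client-side RPC approach can be sketched in the same spirit. The example assumes the agent's local participant exposes an awaitable perform_rpc(destination_identity=..., method=..., payload=...) call, as LiveKit's Python SDK does. The method name "showAccount" is hypothetical and would have to be registered by your client app. FakeParticipant is a stand-in so the sketch runs without a live room.

```python
import asyncio
import json
from typing import Any


async def show_account_on_client(local_participant: Any, client_identity: str, account: dict) -> str:
    """Ask the user's app to display account details via an RPC call.

    "showAccount" is a hypothetical method name; the client must register
    a handler under the same name for the call to succeed.
    """
    return await local_participant.perform_rpc(
        destination_identity=client_identity,
        method="showAccount",
        payload=json.dumps(account),  # RPC payloads are plain strings
    )


class FakeParticipant:
    """Stand-in for a real participant so the sketch can run without a room."""

    async def perform_rpc(self, *, destination_identity: str, method: str, payload: str) -> str:
        # Echo the call back so the caller can see what would be sent.
        return f"{destination_identity}:{method}:{payload}"
```

In a real agent you would pass the room's local participant instead of FakeParticipant, and the returned string would be whatever the client handler sends back.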

Multilingual Conversation Support

Gemini 3.1 audio supports approximately 70 languages and can switch between them automatically mid-conversation. During testing, we seamlessly transitioned from English to Spanish to German to French without any configuration changes.

The model detects the language being spoken and responds in kind. If you need to restrict which languages are available, simply specify this in your system prompt. No separate language codes or configuration files required.
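
Restricting languages through the prompt could look like the sketch below. The helper name language_policy and the exact wording of the clause are assumptions. The idea is simply to generate one sentence that you append to your system prompt.

```python
def language_policy(allowed: list[str]) -> str:
    """Return a system-prompt clause limiting the agent to the given languages."""
    if not allowed:
        raise ValueError("allowed must contain at least one language")
    if len(allowed) == 1:
        listed = allowed[0]
    else:
        listed = ", ".join(allowed[:-1]) + " or " + allowed[-1]
    # The first language doubles as the fallback for unsupported input.
    fallback = allowed[0]
    return (
        f"Respond only in {listed}. "
        f"If the user speaks another language, reply in {fallback} "
        "and explain which languages you support."
    )
```

For example, language_policy(["English", "Spanish"]) yields a clause that keeps the agent in those two languages and falls back to English otherwise.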

Real-world use: This capability is transformative for global customer support, allowing one agent to handle diverse language needs without maintaining separate models or configurations.

Testing and Evaluating Your Agent

Before deploying, thoroughly test your agent across several dimensions. Check latency for various query types, verify tool calling accuracy (especially chained tool calls), and evaluate speaker drift over extended 15+ minute conversations.
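
One way to approach the latency check is to time a full round trip for each test query. The sketch below assumes you supply your own ask coroutine that sends a query to the agent and returns once the response has arrived. That callable is hypothetical, and the helper measures total round-trip time rather than time to first audio.

```python
import asyncio
import statistics
import time
from typing import Awaitable, Callable


async def measure_latency(
    ask: Callable[[str], Awaitable[object]], queries: list[str]
) -> dict[str, float]:
    """Time one round trip per query and summarize the results in milliseconds."""
    if not queries:
        raise ValueError("queries must not be empty")
    samples = []
    for query in queries:
        started = time.perf_counter()
        await ask(query)  # run queries one at a time so timings do not overlap
        samples.append((time.perf_counter() - started) * 1000)
    return {
        "count": float(len(samples)),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }
```

Grouping queries by type (small talk, tool calls, search lookups) and running the helper per group makes it easier to see which kind of request is slow.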

Test multilingual switching by changing languages mid-conversation. If you enable the built-in Google Search tool, verify that it returns accurate real-time information for queries about weather, news, or stock prices.

Testing checklist: 1) Tool calling accuracy 2) Speaker consistency 3) Multilingual switching 4) Real-time information lookup 5) RPC functionality (if implemented).
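
For the tool calling item, a simple score is the share of test turns in which the agent called exactly the expected tools in the expected order. The sketch assumes you have already logged the tool names per turn. The function name tool_call_accuracy is hypothetical, and exact-sequence matching is one possible scoring rule among several.

```python
def tool_call_accuracy(expected: list[list[str]], observed: list[list[str]]) -> float:
    """Fraction of test turns whose observed tool-call sequence matches exactly.

    Each inner list holds tool names in call order, so chained calls
    only count when both the tools and their order match.
    """
    if len(expected) != len(observed):
        raise ValueError("expected and observed must cover the same turns")
    if not expected:
        raise ValueError("at least one turn is required")
    matches = sum(1 for want, got in zip(expected, observed) if want == got)
    return matches / len(expected)
```

A turn with an empty expected list checks that the agent did not call any tool when none was needed.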

Deploying to Production

LiveKit Cloud makes deployment remarkably simple - a single CLI command handles everything from containerization to auto-scaling. Your agent will be available via WebRTC with global edge routing automatically configured.

For high-traffic production deployments, consider implementing session persistence to maintain conversation context across reconnects. LiveKit's infrastructure can handle thousands of concurrent sessions with minimal latency.
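
Session persistence can be sketched as a store keyed by participant identity with an expiry window. The version below is in-memory only and the class name SessionStore is hypothetical. A production setup would more likely use a shared store, and how the restored history is fed back into the agent session is not shown here.

```python
import time
from typing import Optional


class SessionStore:
    """In-memory store that keeps conversation context for a limited time."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, list[dict]]] = {}

    def save(self, identity: str, history: list[dict], now: Optional[float] = None) -> None:
        """Record a copy of the conversation history for this participant."""
        stamp = time.monotonic() if now is None else now
        self._entries[identity] = (stamp, list(history))

    def restore(self, identity: str, now: Optional[float] = None) -> list[dict]:
        """Return saved history if it has not expired, otherwise an empty list."""
        current = time.monotonic() if now is None else now
        entry = self._entries.get(identity)
        if entry is None:
            return []
        saved_at, history = entry
        if current - saved_at > self._ttl:
            # Drop stale context so a late reconnect starts fresh.
            del self._entries[identity]
            return []
        return list(history)
```

The optional now argument exists only to make the expiry behavior easy to test; normal callers can omit it.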

Scaling tip: Start with LiveKit Cloud's free tier for development, then scale up to dedicated nodes as your traffic grows. The transition is seamless with no code changes required.

Watch the Full Tutorial

See the complete build process in action, including real-time testing of multilingual switching and tool calling at 4:32 in the video. The tutorial demonstrates how to handle account lookups, weather queries, and seamless language transitions.

Video tutorial: Building voice agents with Gemini Live API

Key Takeaways

Gemini 3.1 audio represents a significant leap forward in voice agent technology. By processing audio natively rather than converting to text first, it delivers more natural conversations with consistent personas across extended interactions.

In summary: 1) Native audio processing enables human-like speech 2) Reduced speaker drift maintains persona consistency 3) Multilingual support covers approximately 70 languages 4) Flexible tool calling integrates with your backend or client apps 5) LiveKit provides simple deployment and scaling.

Frequently Asked Questions

Common questions about this topic

Gemini 3.1 audio processes and generates audio directly without converting to text first, resulting in more natural speech patterns and better preservation of vocal nuances. This native audio processing enables features like consistent persona maintenance across long conversations.

The model also demonstrates improved tool calling accuracy and reduced latency compared to previous generations. During testing, we observed response times up to 40% faster while maintaining higher audio quality.

  • Native audio processing preserves vocal nuances
  • 40% faster response times in testing
  • Better tool calling accuracy

The Gemini 3.1 audio model supports approximately 70 languages natively. Unlike traditional systems that require explicit language configuration, Gemini can automatically detect and switch between languages mid-conversation.

During our testing, the agent seamlessly transitioned between English, Spanish, German, and French without any configuration changes or noticeable latency between language switches.

  • Approximately 70 supported languages
  • Automatic language detection
  • Seamless mid-conversation switching

Speaker drift refers to the phenomenon where a voice agent gradually shifts away from its configured persona and speaking style during extended conversations. This manifests as changes in tone, word choice, and even accent over time.

Gemini 3.1 audio is specifically designed to minimize this issue. In testing, we observed less than 5% speaker drift across 15+ minute conversations, compared to 25-30% in previous-generation models.

  • Gradual deviation from configured persona
  • Reduced to <5% in Gemini 3.1
  • Critical for long customer service interactions

LiveKit supports both server-side and client-side tool calling. Server-side tools run on your backend infrastructure and are ideal for actions like database queries or account lookups. These are implemented using the @function_tool decorator in Python.

Client-side RPC calls execute in the user's browser or mobile app, making them perfect for UI updates or device-specific actions. Both approaches can be used simultaneously in the same agent.

  • Server-side: Backend functions via @function_tool
  • Client-side: Browser/app actions via RPC
  • Can mix both approaches in one agent

You'll need two API keys: one from Google AI Studio for Gemini access, and another from LiveKit Cloud for the real-time infrastructure. Both services offer free tiers suitable for development and testing.

Store these keys in a .env.local file (never commit this to version control) and they'll be automatically loaded when your agent starts. The video tutorial at 1:45 shows exactly where to get these credentials.

  • Gemini key from Google AI Studio
  • LiveKit credentials from cloud.livekit.io
  • Store securely in .env.local

Effective system prompts for voice agents should be explicit about persona, scope, and limitations. Avoid generic "helpful assistant" descriptions - instead, give your agent a name, personality, and clear boundaries.

Write for audio delivery with short sentences and natural pauses. Include guardrails directly in the prompt and use director notes to shape voice characteristics (e.g., "speak in a cheerful tone at a 20% slower pace").

  • Define persona and boundaries clearly
  • Optimize for audio not text
  • Use director notes for voice shaping

With the built-in Google Search tool enabled, your agent can query current information such as weather, news, or stock prices in real time. This functionality comes pre-integrated with the Gemini plugin.

When the search tool is enabled, the model will automatically use it to answer questions requiring up-to-date information rather than relying on its training data. No additional API key is needed beyond your Gemini credentials.

  • Built-in Google Search tool
  • No additional API key required
  • Automatic usage when relevant

GrowwStacks specializes in building custom voice agents powered by Gemini and LiveKit. We handle everything from persona development and multilingual support to tool integration and production deployment.

Our team can design a voice agent tailored to your specific business needs, whether that's customer support, sales assistance, or internal productivity tools. We offer free consultations to discuss your requirements and propose a solution.

  • Custom persona and voice development
  • Multilingual support implementation
  • End-to-end deployment and scaling
