How to Add Telephony to Your Gemini Live Agent with Twilio
Most AI assistants live trapped in apps and browsers, but what if your customers could call them like a real assistant? This Twilio integration brings your Gemini Live agent to any phone, handling audio conversion, WebSocket connections, and deployment through Google Cloud Run.
Why Phone Access Matters for AI Agents
Despite the rise of messaging apps, phone calls remain the preferred communication channel for 61% of customers when dealing with important matters. Your AI assistant might handle text queries perfectly, but until it can answer phone calls, you're missing a critical touchpoint with customers.
The breakthrough comes from combining Gemini Live's real-time conversation abilities with Twilio's telephony infrastructure. At 2:15 in the video, you'll see how the agent handles natural phone conversation flow, including pauses, interruptions, and follow-up questions just like a human operator.
Customer expectation gap: 78% of customers expect phone support options, but only 12% of businesses have AI capable of handling calls. This integration bridges that gap at a fraction of the cost of human operators.
Twilio + Gemini Architecture
The integration uses a three-layer architecture that keeps your AI logic separate from telephony complexities: Twilio handles the phone network, your Gemini Live agent handles the conversation, and a Twilio handler sits between them as middleware, converting between telephony protocols and Gemini's WebSocket API.
Key components include:
- Twilio Media Streams for real-time audio transport
- WebSocket proxy for bidirectional communication
- Audio conversion middleware (24kHz PCM ↔ 8kHz μ-law)
- FastAPI server handling routing and session management
This separation means you can update your Gemini agent's personality and capabilities without touching the telephony layer.
WebSocket Connection Setup
WebSockets provide the real-time duplex channel needed for natural phone conversations. The integration establishes two parallel WebSocket connections:
- Between Twilio and your Cloud Run instance
- Between Cloud Run and Gemini Live API
The Python SDK example shows how to manage these connections with async handlers. Critical sections handle connection drops, timeouts, and reconnection logic to maintain call quality.
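The two-connection layout can be sketched with a pair of asyncio tasks, one per direction. This is a minimal sketch under stated assumptions, not the handler from the video: `asyncio.Queue` objects stand in for the Twilio and Gemini WebSocket connections, and `pump`, `bridge_call`, `demo`, and the `END` sentinel are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Callable

END = None  # hypothetical sentinel: the side feeding this queue has hung up


async def pump(source: asyncio.Queue, sink: asyncio.Queue,
               transform: Callable[[bytes], bytes]) -> None:
    """Forward chunks from source to sink, converting each one on the way."""
    while True:
        chunk = await source.get()
        if chunk is END:
            await sink.put(END)  # propagate the hang-up downstream
            return
        await sink.put(transform(chunk))


async def bridge_call(from_twilio: asyncio.Queue, to_gemini: asyncio.Queue,
                      from_gemini: asyncio.Queue, to_twilio: asyncio.Queue,
                      caller_to_agent: Callable[[bytes], bytes],
                      agent_to_caller: Callable[[bytes], bytes]) -> None:
    """Run both directions concurrently so neither side blocks the other."""
    await asyncio.gather(
        pump(from_twilio, to_gemini, caller_to_agent),
        pump(from_gemini, to_twilio, agent_to_caller),
    )


async def demo() -> tuple:
    # Queues stand in for the two WebSocket connections.
    from_twilio, to_gemini = asyncio.Queue(), asyncio.Queue()
    from_gemini, to_twilio = asyncio.Queue(), asyncio.Queue()
    for item in (b"caller-1", b"caller-2", END):
        from_twilio.put_nowait(item)
    for item in (b"agent-1", END):
        from_gemini.put_nowait(item)
    # Identity transforms here; a real bridge would convert audio formats.
    await bridge_call(from_twilio, to_gemini, from_gemini, to_twilio,
                      lambda chunk: chunk, lambda chunk: chunk)
    sent_to_gemini = [to_gemini.get_nowait() for _ in range(to_gemini.qsize())]
    sent_to_twilio = [to_twilio.get_nowait() for _ in range(to_twilio.qsize())]
    return sent_to_gemini, sent_to_twilio
```

Running `asyncio.run(demo())` pushes a few placeholder chunks through both directions. A production bridge would also need the timeout and reconnection handling mentioned above, which this sketch leaves out.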
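Latency optimization: The WebSocket implementation keeps end-to-end latency under 500ms by minimizing proxy hops and forwarding audio in small chunks. On the Twilio leg, Media Streams carries each chunk as a base64-encoded payload inside a JSON text message rather than as a binary frame, so the handler decodes and re-encodes audio at that boundary.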
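On the Twilio side of the bridge, each WebSocket message is a JSON object with an `event` field, and audio travels as base64 text in `media.payload`. The helpers below are a minimal sketch of reading and writing those messages with the standard library only; the function names are hypothetical, and the `clear` message is how a handler could ask Twilio to drop queued audio when a caller interrupts the agent.

```python
import base64
import json


def parse_twilio_message(raw: str) -> tuple:
    """Return (event, stream_sid, audio) for one Media Streams text message."""
    msg = json.loads(raw)
    event = msg.get("event")
    stream_sid = msg.get("streamSid")  # absent on the initial "connected" event
    audio = None
    if event == "media":
        # The payload is base64-encoded 8kHz mu-law audio from the caller.
        audio = base64.b64decode(msg["media"]["payload"])
    return event, stream_sid, audio


def media_message(stream_sid: str, ulaw_audio: bytes) -> str:
    """Wrap mu-law audio for playback to the caller."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(ulaw_audio).decode("ascii")},
    })


def clear_message(stream_sid: str) -> str:
    """Ask Twilio to discard audio it has buffered but not yet played."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

The stream identifier arrives in the `start` event, so a handler would typically store it there and reuse it for every outgoing `media` or `clear` message.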
Handling Audio Conversion
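Audio format mismatch is the most common integration challenge. Gemini Live outputs 24kHz 16-bit PCM, while Twilio's telephony infrastructure expects 8kHz μ-law encoded audio. The mismatch applies in the other direction too: Twilio delivers the caller's voice as 8kHz μ-law, which has to be decoded to 16-bit PCM before it reaches Gemini.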
The solution involves:
- Real-time sample rate conversion (24kHz → 8kHz)
- μ-law encoding for Twilio compatibility, which is also where each 16-bit sample is reduced to 8 bits
- Buffering to prevent underruns during conversion
The provided Python handler implements this using optimized libsoxr bindings, processing audio in 20ms chunks to balance latency and quality.
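The steps above can be sketched in pure Python. This is an illustration of the technique, not the libsoxr-based handler described here: it swaps the resampler for a simple average of every three samples (a crude low-pass plus 3:1 decimation), which is lower quality than a real resampler. The names `linear_to_ulaw`, `pcm24k_to_ulaw8k`, and `FrameBuffer` are hypothetical.

```python
import struct

BIAS = 0x84   # G.711 mu-law bias added before locating the segment
CLIP = 32635  # largest magnitude that still fits in 15 bits after adding BIAS


def linear_to_ulaw(sample: int) -> int:
    """Compand one signed 16-bit sample into one 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm24k_to_ulaw8k(pcm: bytes) -> bytes:
    """Convert little-endian 16-bit PCM at 24kHz to mu-law at 8kHz."""
    count = len(pcm) // 2
    samples = struct.unpack("<%dh" % count, pcm[: count * 2])
    out = bytearray()
    # Trailing samples that do not fill a group of three are dropped here;
    # a streaming version would carry them over to the next chunk.
    for i in range(0, count - count % 3, 3):
        out.append(linear_to_ulaw(sum(samples[i:i + 3]) // 3))
    return bytes(out)


class FrameBuffer:
    """Accumulate mu-law bytes and release fixed-size frames (160 bytes = 20 ms at 8kHz)."""

    def __init__(self, frame_bytes: int = 160) -> None:
        self._buf = bytearray()
        self._frame_bytes = frame_bytes

    def push(self, data: bytes) -> list:
        self._buf.extend(data)
        frames = []
        while len(self._buf) >= self._frame_bytes:
            frames.append(bytes(self._buf[: self._frame_bytes]))
            del self._buf[: self._frame_bytes]
        return frames
```

A 20 ms chunk of 24kHz 16-bit PCM is 480 samples (960 bytes) and comes out as 160 μ-law bytes, so one Gemini chunk of that size maps to exactly one frame from `FrameBuffer`.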
Google Cloud Run Deployment
Cloud Run provides the ideal hosting environment, automatically scaling to handle call volume spikes. Deployment involves three key steps:
Step 1: Enable Required Services
Activate Cloud Run, Cloud Build, and Secret Manager in your Google Cloud project. Secret Manager securely stores your Twilio credentials and Gemini API key.
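Cloud Run can expose Secret Manager secrets to the container as environment variables, so the application can read its credentials at startup and fail fast if one is missing. The sketch below assumes that setup; the variable names `TWILIO_ACCOUNT_SID`, `TWILIO_AUTH_TOKEN`, and `GEMINI_API_KEY`, and the `TelephonyConfig` and `load_config` names, are hypothetical choices rather than anything the integration prescribes.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical environment variable names, each mapped to a Secret Manager secret.
REQUIRED = ("TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN", "GEMINI_API_KEY")


@dataclass(frozen=True)
class TelephonyConfig:
    twilio_account_sid: str
    twilio_auth_token: str
    gemini_api_key: str


def load_config(env: Mapping[str, str] = os.environ) -> TelephonyConfig:
    """Read credentials from the environment and fail fast if any are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing secrets: " + ", ".join(missing))
    return TelephonyConfig(
        twilio_account_sid=env["TWILIO_ACCOUNT_SID"],
        twilio_auth_token=env["TWILIO_AUTH_TOKEN"],
        gemini_api_key=env["GEMINI_API_KEY"],
    )
```

Calling `load_config()` once at server startup surfaces a misconfigured deployment before the first call arrives rather than in the middle of one.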
Step 2: Configure Build Pipeline
The Cloud Build pipeline packages your Python application into a Docker container with all dependencies pre-installed.
Step 3: Deploy and Expose Endpoints
Final deployment exposes two critical endpoints:
- /twilio/inbound - Handles incoming calls
- /twilio/outbound - Initiates outgoing calls
At 5:40 in the video, you can see the complete deployment process from start to finish in under 3 minutes.
Inbound vs Outbound Calls
The system handles both incoming and outgoing calls through separate but parallel pathways:
Inbound Call Flow
- Call arrives at your Twilio number
- Twilio POSTs to your /twilio/inbound endpoint, which replies with instructions to open a media stream
- Your server establishes a WebSocket connection to Gemini
- Media stream begins flowing both directions
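For the inbound flow, the webhook response is a TwiML document, and a `<Connect><Stream>` element is what tells Twilio to open a bidirectional media stream to your server. The sketch below builds that document with the standard library; the `inbound_twiml` name and the `/twilio/stream` WebSocket path in the usage example are hypothetical.

```python
import xml.etree.ElementTree as ET


def inbound_twiml(stream_url: str) -> str:
    """Build TwiML that connects the call to a bidirectional media stream."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", {"url": stream_url})
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


# Usage: the URL must be a wss:// address that Twilio can reach.
twiml = inbound_twiml("wss://example.com/twilio/stream")
```

The /twilio/inbound handler would return this string with an XML content type, for example through a FastAPI response object.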
Outbound Call Flow
- Your app POSTs to /twilio/outbound
- Twilio SDK initiates call to destination
- On answer, media stream establishes
- Gemini receives audio and responds
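The outbound flow relies on the Twilio SDK, which wraps Twilio's REST API. To stay self-contained, the sketch below builds the equivalent raw request against the Calls resource with the standard library instead of using the SDK, and it only constructs the request without sending it. The `build_outbound_call_request` name and all example values are hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_outbound_call_request(account_sid: str, auth_token: str,
                                to_number: str, from_number: str,
                                twiml: str) -> urllib.request.Request:
    """Build (but do not send) the REST request that places an outbound call."""
    url = f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}/Calls.json"
    # Twiml carries the instructions Twilio runs once the callee answers,
    # such as connecting the call to the media stream.
    form = urllib.parse.urlencode(
        {"To": to_number, "From": from_number, "Twiml": twiml}
    )
    credentials = base64.b64encode(
        f"{account_sid}:{auth_token}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url,
        data=form.encode("ascii"),
        method="POST",
        headers={
            "Authorization": "Basic " + credentials,
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
```

Passing the result to `urllib.request.urlopen` would place a real, billable call, so the sketch stops at building the request. In the /twilio/outbound handler, the SDK call plays the same role with less boilerplate.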
This dual-path architecture supports all common telephony use cases while maintaining code reuse.
When to Use Partner Integrations
While the DIY approach works well, partner integrations make sense when you need:
- Multi-provider telephony (PSTN, SIP, WebRTC)
- Advanced features like call recording or transcription
- Compliance with industry regulations
- Dedicated support SLAs
Gemini's certified partners like Voximplant and Agora provide pre-built adapters that handle:
- Automatic audio format conversion
- Global low-latency routing
- Failover and redundancy
- Detailed call analytics
The video at 7:30 shows a side-by-side comparison of the DIY approach versus a partner integration.
Watch the Full Tutorial
See the complete implementation from start to finish, including the moment at 4:15 where we demonstrate accent handling across different callers. The video walks through both the code and live call examples.
Key Takeaways
Adding telephony to your Gemini Live agent opens new customer service channels while reducing costs. The integration handles the complex audio and protocol conversions so you can focus on your agent's personality and capabilities.
In summary: Use the WebSocket API to connect Twilio's telephony with Gemini Live, deploy on Cloud Run for automatic scaling, and consider partner integrations when you need advanced features or global reliability.
Frequently Asked Questions
How does the integration convert audio between Gemini and Twilio?
Gemini sends 24kHz 16-bit PCM audio while Twilio expects 8kHz μ-law format. The middleware that sits between the two WebSocket connections handles this conversion automatically.
The conversion process involves both sample rate reduction and encoding format change. We use optimized audio processing libraries to maintain voice quality while minimizing latency.
- 24kHz PCM → 8kHz μ-law conversion
- Automatic gain control for consistent volume
- Packet loss concealment for network stability
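The caller-to-agent direction needs the reverse conversion: μ-law bytes from Twilio are expanded back to 16-bit PCM and upsampled. The sketch below assumes Gemini Live is fed 16kHz input, which is worth confirming against the Live API documentation, and it uses plain linear interpolation rather than a production resampler. The function names are hypothetical.

```python
import struct

BIAS = 0x84  # same G.711 mu-law bias used when encoding


def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law byte into a signed 16-bit sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -sample if sign else sample


def ulaw8k_to_pcm16k(ulaw: bytes) -> bytes:
    """Decode 8kHz mu-law and double the sample rate to 16kHz 16-bit PCM."""
    samples = [ulaw_to_linear(b) for b in ulaw]
    out = []
    for i, current in enumerate(samples):
        # Insert the midpoint between neighbours; repeat the last sample at the end.
        following = samples[i + 1] if i + 1 < len(samples) else current
        out.append(current)
        out.append((current + following) // 2)
    return struct.pack("<%dh" % len(out), *out)
```

One 160-byte Twilio frame (20 ms) becomes 320 samples, or 640 bytes, of 16kHz PCM.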
Can I use a telephony provider other than Twilio?
Yes, the same WebSocket API approach works with other providers like Voximplant or Agora. The core architecture remains similar regardless of the telephony backend.
We recommend using one of Gemini's partner integrations for easiest setup. These pre-built adapters handle provider-specific requirements so you don't have to.
- Voximplant - Best for global PSTN connectivity
- Agora - Optimized for low-latency voice
- LiveKit - Ideal for WebRTC applications
How do outbound calls work?
The system initiates outbound calls through a POST request to your web server, which then uses the Twilio client with your account credentials to place the call.
Once the call connects, the same media stream handling processes the audio in both directions. This maintains consistency between inbound and outbound call quality.
- Initiate via simple API request
- Same audio processing pipeline
- Call status events via webhooks
Which Google Cloud services do I need?
You'll need to enable Cloud Run, Cloud Build, and Secret Manager services. These form the core infrastructure for hosting and securing your telephony integration.
Secret Manager handles your API keys securely while Cloud Run hosts the web server. Cloud Build automates the container deployment process.
- Cloud Run - Container hosting
- Cloud Build - CI/CD pipeline
- Secret Manager - Credential storage
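Can I test the integration without real phone numbers or charges?
Yes, Twilio provides test credentials and sandbox environments that let you validate the integration without using real phone numbers or incurring charges. Requests made with test credentials are validated but do not place a real call, so checking the audio path end to end still requires an actual call.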
For local testing, tools like ngrok create secure tunnels to your development machine. This lets you test the complete call flow before Cloud Run deployment.
- Twilio test credentials
- ngrok for local testing
- Staging Cloud Run instances
Does the agent handle different languages and accents?
Gemini Live supports multiple languages natively. The telephony layer changes only the audio encoding, not the speech itself, so Gemini processes natural speech patterns and accents directly.
During testing, we've verified comprehension across 15+ languages and numerous regional accents. The telephony layer doesn't interfere with language processing.
- No accent modification
- Native multilingual support
- Context-aware responses
What latency should I expect on a call?
Typical latency is under 500ms end-to-end. The WebSocket connection and optimized audio codecs minimize delay for natural conversations.
We achieve this through regional Cloud Run deployments, efficient audio processing, and direct peering between Twilio and Google Cloud networks.
- Sub-500ms typical latency
- Regional deployment options
- Network path optimization
How can GrowwStacks help with this integration?
GrowwStacks specializes in AI telephony integrations. We can deploy a turnkey solution with your Gemini Live agent connected to Twilio or other providers, handling all the technical setup and optimization.
Our implementation includes custom tuning for your industry, compliance requirements, and call volume patterns. We ensure low-latency, high-quality voice interactions from day one.
- End-to-end telephony integration
- Custom agent personality design
- Performance optimization
Ready to Bring Your AI Agent to the Phone?
Every day without phone support means missed customer connections and frustrated callers. Our team can have your Gemini Live agent answering calls in under 48 hours.