How to Build Voice Agents in Azure Using Voice Live API and Foundry Agents V1/V2
Businesses are increasingly adopting conversational AI, but building voice-enabled agents from scratch requires complex speech processing infrastructure. Microsoft's Voice Live API combined with Foundry agents provides a streamlined solution. This guide walks through creating both V1 and V2 voice agents with practical code examples.
Voice Live API Overview
Many businesses want to add voice interfaces to their applications but struggle with the complexity of speech recognition and synthesis. Microsoft's Voice Live API solves this by providing real-time speech processing through a WebSocket connection. The API handles both speech-to-text and text-to-speech conversion, allowing developers to focus on the conversation logic rather than audio processing.
The Voice Live API supports multiple languages and voices and can auto-detect the input language. It is particularly useful when combined with Azure AI Foundry (now Microsoft Foundry) agents, as the tutorial shows. The API endpoint recently moved from the cognitiveservices.azure.com domain to services.ai.azure.com, and the API version changed to 2025-10-01.
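As a rough illustration of the endpoint change, the sketch below turns an https:// resource endpoint into a WebSocket URL carrying the API version. The `/voice-live/realtime` path and the `api-version` parameter name are assumptions based on Microsoft's published quickstart, and the resource name in the usage note is hypothetical, so check both against the current documentation before relying on them.

```python
from urllib.parse import urlencode, urlsplit

# API version written out in full (YYYY-MM-DD).
API_VERSION = "2025-10-01"


def voice_live_ws_url(endpoint: str, api_version: str = API_VERSION) -> str:
    """Turn an https:// resource endpoint into a wss:// Voice Live URL.

    The path "/voice-live/realtime" is an assumption taken from the
    public quickstart, not something this article confirms.
    """
    parts = urlsplit(endpoint)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"expected an https:// endpoint, got {endpoint!r}")
    query = urlencode({"api-version": api_version})
    # Keep only the host; any trailing slash or path on the input is dropped.
    return f"wss://{parts.netloc}/voice-live/realtime?{query}"
```

Calling `voice_live_ws_url("https://my-foundry-resource.services.ai.azure.com/")` yields a wss:// URL on the same host, which is what a WebSocket client would connect to.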
Key benefit: The Voice Live API eliminates the need to build and maintain your own speech processing infrastructure while providing enterprise-grade performance and scalability through Azure's cloud infrastructure.
Foundry Agents Explained
Microsoft Foundry (formerly Azure AI Foundry) provides a platform for deploying and managing AI agents. These agents can range from simple chatbots to complex conversational AI systems. The tutorial demonstrates two versions of Foundry agents that showcase the platform's evolution.
Version 1 agents use a straightforward architecture where you deploy an agent with specific instructions (like "talk like a pirate") and connect to it using an agent ID. Version 2 agents, introduced with Microsoft Foundry, offer more advanced capabilities including better tool integration, monitoring, and support for the Microsoft Agent Framework.
As shown in the video at 4:32, you can toggle between the old and new Foundry interfaces in the Azure portal. The new interface provides additional features like Docker support and enhanced telemetry that were announced at Microsoft Ignite.
Prerequisites and Setup
Before creating voice agents, you'll need several Azure resources configured properly. First, create a Foundry resource in Azure that includes a project. Within this project, you'll deploy your agents and connect them to the Voice Live API.
The tutorial uses Python for the implementation, so you'll need Python 3.8 or later installed. You'll also need the Azure CLI for authentication and the following Python packages: websockets, azure-identity, and sounddevice for audio processing.
Key setup steps include:
- Creating a Foundry resource in Azure
- Deploying an LLM (like GPT-4) in the model catalog
- Setting up environment variables for your project name, agent ID/name, and credentials
- Assigning your user the Cognitive Services User role and requesting tokens for the correct scope
At 6:15 in the video, the tutorial shows how to configure these permissions in Azure AD, which are crucial for the authentication flow to work properly.
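The environment variables from the setup steps could be loaded and validated along these lines. This is a minimal sketch: the variable names (`VOICE_LIVE_ENDPOINT`, `FOUNDRY_PROJECT_NAME`, `FOUNDRY_AGENT_ID`, `FOUNDRY_AGENT_NAME`) and the `VoiceAgentConfig` class are hypothetical and are not taken from the tutorial's code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VoiceAgentConfig:
    endpoint: str
    project_name: str
    agent_id: Optional[str] = None    # set for V1 agents
    agent_name: Optional[str] = None  # set for V2 agents


def load_config(env: Mapping[str, str] = os.environ) -> VoiceAgentConfig:
    """Read settings from environment variables (names are hypothetical)."""
    required = ("VOICE_LIVE_ENDPOINT", "FOUNDRY_PROJECT_NAME")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")

    agent_id = env.get("FOUNDRY_AGENT_ID") or None
    agent_name = env.get("FOUNDRY_AGENT_NAME") or None
    if not (agent_id or agent_name):
        raise KeyError("set FOUNDRY_AGENT_ID (V1) or FOUNDRY_AGENT_NAME (V2)")

    return VoiceAgentConfig(
        endpoint=env["VOICE_LIVE_ENDPOINT"],
        project_name=env["FOUNDRY_PROJECT_NAME"],
        agent_id=agent_id,
        agent_name=agent_name,
    )
```

Failing fast on a missing variable keeps configuration mistakes from surfacing later as opaque connection errors.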
V1 Agent Implementation
The V1 agent implementation demonstrates connecting to an older Azure AI Foundry agent using its agent ID. The example creates a pirate-themed agent that transforms all responses into pirate speak.
The core of the implementation is the BasicVoiceAssistant class that wraps the Voice Live API connection. This class handles:
- WebSocket connection establishment
- Authentication using Azure AD credentials
- Audio capture and playback
- Session management with the agent
As shown at 10:45 in the video, the connect() method is particularly important as it sets up the WebSocket connection with the proper query parameters including the agent ID, project name, and authentication token. The method also configures text-to-speech voice settings for the session.
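Once the socket is open, the voice settings and the captured audio travel as JSON messages. The sketch below builds two such messages. The `session.update` and `input_audio_buffer.append` message types and the field layout follow the realtime-style protocol described in Microsoft's Voice Live documentation as understood here; the exact fields, the `azure-standard` voice type, and the example voice name are assumptions, and this is not the tutorial's own `BasicVoiceAssistant` code.

```python
import base64
import json


def session_update_message(voice_name: str = "en-US-AvaNeural") -> str:
    """Build a session.update message that selects a text-to-speech voice.

    Field names are assumptions modeled on the public documentation.
    """
    message = {
        "type": "session.update",
        "session": {
            "voice": {"name": voice_name, "type": "azure-standard"},
            "turn_detection": {"type": "azure_semantic_vad"},
        },
    }
    return json.dumps(message)


def audio_append_message(pcm_chunk: bytes) -> str:
    """Wrap a chunk of raw microphone audio as a base64-encoded JSON message."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })
```

In a real client, the first message would be sent once after connecting and the second repeatedly from the audio capture callback.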
Implementation tip: The tutorial logs all conversations and technical details to files in a logs folder, which is invaluable for debugging and understanding the interaction flow between your code and the voice agent.
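A logs folder like the one described can be set up with Python's standard `logging` module. This is a minimal sketch rather than the tutorial's actual logging code; the function name, file naming scheme, and log format are illustrative choices.

```python
import logging
from datetime import datetime
from pathlib import Path


def setup_file_logger(log_dir: str = "logs", name: str = "voice_agent") -> logging.Logger:
    """Create log_dir if needed and return a logger writing to a timestamped file.

    Call this once per process: each call attaches another file handler.
    """
    folder = Path(log_dir)
    folder.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    log_path = folder / f"{name}-{stamp}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Logging both the conversation text and the raw event types to the same file makes it easier to line up what the user heard with what the API sent.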
V2 Agent Implementation
The V2 agent implementation follows a similar pattern but connects to the newer Microsoft Foundry agents using a friendly name rather than an ID. The example creates a poet-themed agent that responds in old English style.
The main difference in code is the replacement of agent_id with agent_name in the query parameters for the WebSocket connection. This reflects Microsoft's move toward more user-friendly identifiers in the new Foundry platform.
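The identifier swap can be sketched as a single query-string builder that accepts either kind of agent. The hyphenated wire names used here (`agent-project-name`, `agent-id`, `agent-name`, `agent-access-token`) are assumptions based on Microsoft's quickstart; the article itself only refers to them as agent_id and agent_name, so verify the exact spelling against the current API reference.

```python
from typing import Optional
from urllib.parse import urlencode


def agent_query_string(
    project_name: str,
    access_token: str,
    *,
    agent_id: Optional[str] = None,    # V1: opaque agent ID
    agent_name: Optional[str] = None,  # V2: friendly agent name
    api_version: str = "2025-10-01",
) -> str:
    """Build the WebSocket query string for a V1 or a V2 agent."""
    if (agent_id is None) == (agent_name is None):
        raise ValueError("pass exactly one of agent_id (V1) or agent_name (V2)")

    params = {"api-version": api_version, "agent-project-name": project_name}
    if agent_id is not None:
        params["agent-id"] = agent_id
    else:
        params["agent-name"] = agent_name
    params["agent-access-token"] = access_token
    # urlencode percent-escapes values, so tokens with special characters are safe.
    return urlencode(params)
```

Everything except one key is shared between the two versions, which is why the migration touches so little code.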
At 15:30 in the video, you can see the V2 agent in action. Unlike the V1 agent that provides an immediate greeting, the V2 agent waits for user input before responding. This behavioral difference may be due to ongoing development of the V2 agent framework.
The V2 implementation includes the same logging capabilities as V1, allowing you to compare the interaction patterns between the two versions. The logs show that V2 currently doesn't include instructions in its responses, which may change in future updates.
Key Differences Between V1 and V2
While both versions achieve similar outcomes, there are important technical and functional differences between V1 and V2 agents that developers should understand when choosing which to implement.
The primary technical difference is the identifier used to connect to the agent: V1 uses an agent ID (starting with "asst_"), while V2 uses a friendly name. This reflects Microsoft's effort to make the platform more accessible and user-friendly.
Functional differences observed in the tutorial include:
- V1 agents provide immediate greeting messages while V2 waits for user input
- V2 agents appear in the new Microsoft Foundry interface with additional management capabilities
- V2 supports the Microsoft Agent Framework and Docker deployment
- V2 provides enhanced monitoring and telemetry features
At 13:20 in the video, the side-by-side code comparison clearly shows how minimal the code changes are between versions, making migration relatively straightforward.
Best Practices
Based on the tutorial implementation and Microsoft's documentation, here are key best practices for working with Voice Live API and Foundry agents:
For authentication, make sure your Azure AD (now Microsoft Entra ID) user has the Cognitive Services User role assigned and that access tokens are requested for the api://AzureAIFoundry/.default scope. Missing or incorrect permissions are a common source of connection failures.
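A token request of this kind might look like the sketch below. It relies on `AzureCliCredential` and its `get_token` method from the azure-identity package; the helper names are hypothetical, and the scope is passed in as a parameter because the value quoted in this article should be checked against the current Microsoft documentation.

```python
import time
from typing import Any, Dict


def bearer_headers(credential: Any, scope: str) -> Dict[str, str]:
    """Request a token for `scope` and wrap it in an Authorization header.

    `credential` is any object with a get_token(*scopes) method that returns
    something with .token and .expires_on, as azure-identity credentials do.
    """
    access = credential.get_token(scope)
    if access.expires_on <= time.time():
        raise RuntimeError("credential returned an already-expired token")
    return {"Authorization": f"Bearer {access.token}"}


def azure_cli_headers(scope: str) -> Dict[str, str]:
    """Convenience wrapper that uses the current `az login` session.

    Requires `pip install azure-identity`; imported lazily so the helper
    above stays usable without the package installed.
    """
    from azure.identity import AzureCliCredential

    return bearer_headers(AzureCliCredential(), scope)
```

Calling `azure_cli_headers("api://AzureAIFoundry/.default")` after `az login` would produce the header for the connection. Tokens expire, so a long-running client should request a fresh one when it reconnects.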
When designing your agent's personality and responses:
- Keep responses concise for better voice interaction
- Design for interruption as users may speak over the agent
- Include clear conversation markers for voice-only interactions
- Test with different voices and languages if targeting global users
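Designing for interruption usually means discarding audio that is queued for playback as soon as the user starts talking. The sketch below shows one way to do that on the client. It assumes the server emits an event of type `input_audio_buffer.speech_started` when it detects user speech, which is how the realtime-style protocol is documented as understood here; the `PlaybackBuffer` class is hypothetical and is not part of the tutorial.

```python
import json
from collections import deque
from typing import Deque, Optional


class PlaybackBuffer:
    """Queue of agent audio chunks that is flushed when the user barges in."""

    def __init__(self) -> None:
        self._chunks: Deque[bytes] = deque()

    def enqueue(self, chunk: bytes) -> None:
        """Add a chunk of agent audio to be played."""
        self._chunks.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None when the queue is empty."""
        return self._chunks.popleft() if self._chunks else None

    def handle_event(self, raw_event: str) -> bool:
        """Flush queued audio if the event signals that the user started speaking.

        Returns True when the queue was flushed, False otherwise.
        """
        event = json.loads(raw_event)
        if event.get("type") == "input_audio_buffer.speech_started":
            self._chunks.clear()
            return True
        return False
```

The playback loop would pull from `next_chunk()` while the receive loop passes every incoming message to `handle_event()`, so the agent goes quiet within one chunk of the user speaking.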
For development and debugging:
- Implement comprehensive logging like the tutorial demonstrates
- Use the Foundry playground to test agent behavior before voice integration
- Monitor API response times and audio quality during peak usage
Pro tip: Start with a V1 agent for simpler implementations, but plan to migrate to V2 to take advantage of Microsoft's ongoing investments in the Foundry platform and agent framework.
Watch the Full Tutorial
For a complete walkthrough of the code and to see both the pirate (V1) and poet (V2) agents in action, watch the full tutorial video. At 9:15, you'll see the pirate agent demonstration, and at 15:30, the poet agent comes to life with old English responses.
Key Takeaways
Microsoft's Voice Live API combined with Foundry agents provides a powerful platform for building voice-enabled AI applications without developing complex speech processing capabilities in-house. The tutorial demonstrates both the original (V1) and newer (V2) approaches to agent implementation.
In summary: Voice Live API handles real-time speech processing via WebSocket, Foundry provides the agent framework, and the combination enables natural voice interactions. V2 agents represent Microsoft's future direction with improved tooling and management capabilities.
Frequently Asked Questions
Common questions about this topic
What is the Voice Live API?
The Voice Live API is a Microsoft Azure service that provides real-time speech-to-text and text-to-speech capabilities through WebSocket connections. It allows developers to integrate natural voice interactions with their applications without building the speech processing infrastructure themselves.
The API supports multiple languages and voices, and can connect directly to Azure AI Foundry agents. It handles all the complex audio processing while your application focuses on the conversation logic and business requirements.
- Provides real-time speech processing via WebSocket
- Supports multiple languages and voice profiles
- Eliminates need for custom speech infrastructure
What is the difference between Foundry V1 and V2 agents?
Foundry V1 agents are the original Azure AI Foundry implementation, while V2 agents are part of the newer Microsoft Foundry platform. V2 agents offer improved tool integration, better monitoring capabilities, and support for the Microsoft Agent Framework.
The main technical difference in implementation is that V1 agents use an agent ID while V2 agents use a friendly name for connection. V2 represents Microsoft's future direction for agent development with more features and flexibility.
- V1 uses agent IDs, V2 uses friendly names
- V2 has enhanced monitoring and tool integration
- V2 supports Docker and Microsoft Agent Framework
What authentication does the Voice Live API require?
You need an Azure AD credential with the Cognitive Services User role, and you must request a token for the scope api://AzureAIFoundry/.default. In the tutorial, the credential comes from an Azure CLI login and is used to obtain a JSON Web Token (JWT) that authenticates your WebSocket connection to the Voice Live API endpoint.
Proper authentication setup is crucial - the tutorial shows at 6:15 how to configure these permissions in Azure AD. Missing or incorrect permissions are a common source of connection failures when first setting up voice agents.
- Azure AD credential with Cognitive Services User role
- Specific scope api://AzureAIFoundry/.default
- Azure CLI login required for token generation
Can I use my own LLM with the Voice Live API?
Yes, you can deploy your own LLM in Microsoft Foundry and connect it to the Voice Live API. The tutorial uses GPT-4, but you can configure any supported model. The agent acts as the intermediary between the Voice Live API and your LLM, handling the conversation flow and processing the inputs and outputs.
This architecture gives you flexibility to choose the best LLM for your specific use case while still benefiting from Azure's managed speech processing capabilities. You maintain control over the conversation logic and personality while offloading the complex audio processing.
- Yes, supports custom LLM deployment
- Agent mediates between Voice Live API and your LLM
- Maintain control over conversation logic
Which programming languages can I use with the Voice Live API?
The official quickstart provides Python examples, but since the API uses standard WebSocket connections, you can implement it in any language with WebSocket support. The key requirements are handling the authentication flow and properly formatting the WebSocket messages according to the Voice Live API specification.
Popular choices besides Python include JavaScript/Node.js for web applications, C# for .NET developers, and Java for enterprise systems. The authentication tokens and WebSocket message formats are language-agnostic.
- Official examples in Python
- Any language with WebSocket support works
- JavaScript, C#, Java are common alternatives
How can I test my voice agent?
Microsoft Foundry provides a playground where you can test your agents. For voice-specific testing, you can use the Voice Live Playground in the Azure portal, which lets you configure input language detection, select the speech output voice, and interact directly with your agent through microphone input and audio output.
The tutorial shows this testing interface at 7:45. Comprehensive logging, like that in the tutorial code (which saves to the logs folder), is also invaluable for debugging and understanding the interaction flow during development.
- Use Foundry playground for agent testing
- Voice Live Playground for voice-specific testing
- Implement logging for debugging interactions
What are common use cases for voice agents?
Voice agents are ideal for customer service applications, interactive voice response systems, accessibility tools, and educational applications. The tutorial demonstrates creating personality-driven agents (pirate and poet), but real-world applications could include technical support bots, appointment scheduling assistants, or interactive learning tools.
Businesses are using voice agents for 24/7 customer support, hands-free operation in industrial settings, and personalized education. The combination of Voice Live API and Foundry agents makes these applications more accessible to develop and deploy at scale.
- Customer service and support
- Interactive voice response systems
- Accessibility tools and educational applications
How can GrowwStacks help with voice agents?
GrowwStacks specializes in building custom voice agent solutions on Microsoft Azure. Our team can design and deploy voice agents tailored to your specific business needs, integrate them with your existing systems, and ensure optimal performance.
We offer end-to-end implementation from concept to deployment, including custom LLM integration if needed. Our expertise with Voice Live API and Foundry agents ensures your voice solution will be scalable, maintainable, and deliver real business value.
- Custom voice agent design and deployment
- Integration with your existing systems
- Free consultation to discuss your requirements