
November 27, 2025

How to Build Voice Agents in Azure Using Voice Live API and Foundry Agents V1/V2

Businesses are increasingly adopting conversational AI, but building voice-enabled agents from scratch requires complex speech processing infrastructure. Microsoft's Voice Live API combined with Foundry agents provides a streamlined solution. This guide walks through creating both V1 and V2 voice agents with practical code examples.


Voice Live API Overview

Many businesses want to add voice interfaces to their applications but struggle with the complexity of speech recognition and synthesis. Microsoft's Voice Live API solves this by providing real-time speech processing through a WebSocket connection. The API handles both speech-to-text and text-to-speech conversion, allowing developers to focus on the conversation logic rather than audio processing.

The Voice Live API supports multiple languages and voices and can auto-detect the input language. It is particularly useful when combined with Microsoft Foundry (formerly Azure AI Foundry) agents, as the tutorial shows. The API endpoint recently moved from the cognitiveservices.azure.com host to services.ai.azure.com, and the API version changed to 2025-10-01.
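To make the endpoint change concrete, here is a minimal sketch that builds a WebSocket URL on the services.ai.azure.com host, with the API version written in full as 2025-10-01. The `/voice-live/realtime` path, the `api-version` parameter name, and the resource name `my-foundry-resource` are assumptions for illustration, not details taken from the tutorial, so check them against the current Voice Live documentation.

```python
from urllib.parse import urlencode

# Assumed values: the path and parameter name are illustrative only.
VOICE_LIVE_PATH = "/voice-live/realtime"
API_VERSION = "2025-10-01"


def voice_live_base_url(resource_name: str, api_version: str = API_VERSION) -> str:
    """Build a wss:// URL on the newer services.ai.azure.com host."""
    host = f"{resource_name}.services.ai.azure.com"
    query = urlencode({"api-version": api_version})
    return f"wss://{host}{VOICE_LIVE_PATH}?{query}"


if __name__ == "__main__":
    # "my-foundry-resource" is a placeholder resource name.
    print(voice_live_base_url("my-foundry-resource"))
```

Keeping the host and version in one helper means a future endpoint or version change touches a single place in your code.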

Key benefit: The Voice Live API removes the need to build and maintain your own speech processing infrastructure, because Azure hosts and scales the service for you.

Foundry Agents Explained

Microsoft Foundry (formerly Azure AI Foundry) provides a platform for deploying and managing AI agents. These agents can range from simple chatbots to complex conversational AI systems. The tutorial demonstrates two versions of Foundry agents that showcase the platform's evolution.

Version 1 agents use a straightforward architecture where you deploy an agent with specific instructions (like "talk like a pirate") and connect to it using an agent ID. Version 2 agents, introduced with Microsoft Foundry, offer more advanced capabilities including better tool integration, monitoring, and support for the Microsoft Agent Framework.

As shown in the video at 4:32, you can toggle between the old and new Foundry interfaces in the Azure portal. The new interface provides additional features like Docker support and enhanced telemetry that were announced at Microsoft Ignite.

Prerequisites and Setup

Before creating voice agents, you'll need several Azure resources configured properly. First, create a Foundry resource in Azure that includes a project. Within this project, you'll deploy your agents and connect them to the Voice Live API.

The tutorial uses Python for the implementation, so you'll need Python 3.8 or later installed. You'll also need the Azure CLI for authentication and the following Python packages: websockets, azure-identity, and sounddevice for audio processing.

Key setup steps include:

Creating a Foundry resource in Azure
Deploying an LLM (like GPT-4) in the model catalog
Setting up environment variables for your project name, agent ID/name, and credentials
Configuring a user with the Cognitive Services User role and proper scope
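The environment variable step above could look roughly like the sketch below, which reads the settings once and fails early if one is missing. The variable names (`FOUNDRY_PROJECT_NAME`, `FOUNDRY_AGENT_REF`, `FOUNDRY_RESOURCE_NAME`) and the `VoiceAgentSettings` class are hypothetical; the tutorial's own names may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceAgentSettings:
    project_name: str
    agent_ref: str  # agent ID for V1, agent name for V2
    resource_name: str


def load_settings(env=None) -> VoiceAgentSettings:
    """Read settings from environment variables, failing early if one is missing."""
    env = os.environ if env is None else env
    # Hypothetical variable names; use whatever your project defines.
    required = {
        "project_name": "FOUNDRY_PROJECT_NAME",
        "agent_ref": "FOUNDRY_AGENT_REF",
        "resource_name": "FOUNDRY_RESOURCE_NAME",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return VoiceAgentSettings(**{field: env[var] for field, var in required.items()})
```

Passing a plain dictionary as `env` makes the loader easy to exercise without touching your real environment.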

At 6:15 in the video, the tutorial shows how to configure these permissions in Azure AD (now Microsoft Entra ID). The authentication flow depends on them.

V1 Agent Implementation

The V1 agent implementation demonstrates connecting to an older Azure AI Foundry agent using its agent ID. The example creates a pirate-themed agent that transforms all responses into pirate speak.

The core of the implementation is the BasicVoiceAssistant class that wraps the Voice Live API connection. This class handles:

WebSocket connection establishment
Authentication using Azure AD credentials
Audio capture and playback
Session management with the agent
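For the audio capture item, one plausible shape is to convert captured samples to 16-bit PCM and wrap each chunk in a JSON message. This is a sketch under assumptions: the event name `input_audio_buffer.append` and the base64 `audio` field follow a realtime-style message layout and are not confirmed by the tutorial, and the helper names are hypothetical. The input is any sequence of floats, such as the float samples a capture library typically delivers.

```python
import base64
import json
import struct


def float_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    # Clip first so out-of-range samples cannot overflow the 16-bit range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))


def audio_append_event(pcm_bytes: bytes) -> str:
    """Wrap a PCM chunk in a JSON event (event and field names are assumptions)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })
```

Each string returned by `audio_append_event` would then be sent over the open WebSocket as one text message.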

As shown at 10:45 in the video, the connect() method is particularly important as it sets up the WebSocket connection with the proper query parameters including the agent ID, project name, and authentication token. The method also configures text-to-speech voice settings for the session.
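The two jobs described for connect() can be sketched as small helpers: one appends the agent parameters to the WebSocket URL, and one builds the message that selects a text-to-speech voice. The query keys (`agent-id`, `agent-project-name`, `agent-access-token`), the `session.update` message layout, and the `azure-standard` voice type are assumptions for illustration; the tutorial's BasicVoiceAssistant class may structure this differently.

```python
import json
from urllib.parse import urlencode


def v1_agent_url(base_url: str, agent_id: str, project_name: str, token: str) -> str:
    """Append V1 agent parameters to a Voice Live WebSocket URL."""
    # Parameter names are assumptions for illustration.
    params = {
        "agent-id": agent_id,
        "agent-project-name": project_name,
        "agent-access-token": token,
    }
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(params)


def voice_session_update(voice_name: str) -> str:
    """Build a session update message that selects a text-to-speech voice."""
    # Message layout is an assumption, modeled on realtime-style session events.
    return json.dumps({
        "type": "session.update",
        "session": {"voice": {"name": voice_name, "type": "azure-standard"}},
    })
```

In this sketch the URL would be passed to your WebSocket client, and the session update would be the first message sent once the connection opens.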

Implementation tip: The tutorial logs all conversations and technical details to files in a logs folder, which is invaluable for debugging and understanding the interaction flow between your code and the voice agent.
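If you want similar logging in your own code, a minimal version using Python's standard logging module might look like this. The file naming scheme and the `make_session_logger` helper are hypothetical, not the tutorial's implementation.

```python
import logging
from datetime import datetime
from pathlib import Path


def make_session_logger(log_dir: str = "logs") -> logging.Logger:
    """Create a logger that writes one timestamped file per session."""
    folder = Path(log_dir)
    folder.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    logger = logging.getLogger(f"voice_agent.{stamp}")
    logger.setLevel(logging.DEBUG)
    # Avoid attaching a second handler if called twice within the same second.
    if not logger.handlers:
        handler = logging.FileHandler(folder / f"session-{stamp}.log", encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

Calling `make_session_logger().info("user: hello")` writes a timestamped line to a new file in the logs folder, so each run leaves its own record of the conversation.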

V2 Agent Implementation

The V2 agent implementation follows a similar pattern but connects to the newer Microsoft Foundry agents using a friendly name rather than an ID. The example creates a poet-themed agent that responds in old English style.

The main difference in code is the replacement of agent_id with agent_name in the query parameters for the WebSocket connection. This reflects Microsoft's move toward more user-friendly identifiers in the new Foundry platform.
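The swap can be illustrated with a small helper that picks the identifying key by agent version. The hyphenated query keys (`agent-id`, `agent-name`, `agent-project-name`) are assumed spellings of the tutorial's agent_id and agent_name, and `agent_query` is a hypothetical function name.

```python
from urllib.parse import urlencode


def agent_query(version: str, identifier: str, project_name: str) -> str:
    """Return the agent part of the query string for a V1 or V2 agent."""
    # Assumed parameter names: only the key that identifies the agent differs.
    if version == "v1":
        key = "agent-id"
    elif version == "v2":
        key = "agent-name"
    else:
        raise ValueError(f"unknown agent version: {version!r}")
    return urlencode({key: identifier, "agent-project-name": project_name})
```

Everything else about the connection stays the same in this sketch, which matches the small code difference described above.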

At 15:30 in the video, you can see the V2 agent in action. Unlike the V1 agent that provides an immediate greeting, the V2 agent waits for user input before responding. This behavioral difference may be due to ongoing development of the V2 agent framework.

The V2 implementation includes the same logging capabilities as V1, so you can compare the interaction patterns between the two versions. The logs show that V2 responses currently do not include the agent's instructions, which may change in future updates.

Key Differences Between V1 and V2

While both versions achieve similar outcomes, there are important technical and functional differences between V1 and V2 agents that developers should understand when choosing which to implement.

The primary technical difference is the identifier used to connect to the agent: V1 uses an agent ID (starting with "asst_") while V2 uses a friendly name. This reflects Microsoft's effort to make the platform more accessible and user-friendly.

Functional differences observed in the tutorial include:

V1 agents provide immediate greeting messages while V2 waits for user input
V2 agents appear in the new Microsoft Foundry interface with additional management capabilities
V2 supports the Microsoft Agent Framework and Docker deployment
V2 provides enhanced monitoring and telemetry features

At 13:20 in the video, the side-by-side code comparison clearly shows how minimal the code changes are between versions, making migration relatively straightforward.

Best Practices

Based on the tutorial implementation and Microsoft's documentation, here are key best practices for working with Voice Live API and Foundry agents:

For authentication, assign your Azure AD user the Cognitive Services User role and request tokens for the api://AzureAIFoundry/.default scope. The role is granted to the identity, while the scope is passed when the token is requested. Missing or incorrect permissions are a common source of connection failures.
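A token request along these lines can be sketched with the azure-identity package. `DefaultAzureCredential` and its `get_token` method are part of that library and can pick up an Azure CLI login; the `fetch_agent_token` wrapper is a hypothetical helper, and the default scope is simply the one named in this article.

```python
def fetch_agent_token(credential=None, scope="api://AzureAIFoundry/.default") -> str:
    """Return a bearer token string for the given scope."""
    if credential is None:
        # Imported lazily so the helper can be exercised with a stub credential.
        from azure.identity import DefaultAzureCredential
        credential = DefaultAzureCredential()
    # get_token returns an AccessToken; its .token attribute is the JWT string.
    return credential.get_token(scope).token
```

If this call fails, check the role assignment and the scope string first, since those are the usual causes of connection failures.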

When designing your agent's personality and responses:

Keep responses concise for better voice interaction
Design for interruption as users may speak over the agent
Include clear conversation markers for voice-only interactions
Test with different voices and languages if targeting global users
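Designing for interruption usually means discarding queued agent audio the moment the user starts talking. The sketch below shows one way to do that on the client side. The event name `input_audio_buffer.speech_started` is an assumption modeled on realtime-style speech events, and `PlaybackBuffer` and `handle_event` are hypothetical names.

```python
import json
import queue


class PlaybackBuffer:
    """Holds audio chunks waiting to be played back to the user."""

    def __init__(self):
        self._chunks = queue.Queue()

    def add(self, chunk: bytes) -> None:
        self._chunks.put(chunk)

    def clear(self) -> int:
        """Drop all queued audio and return how many chunks were discarded."""
        dropped = 0
        while True:
            try:
                self._chunks.get_nowait()
            except queue.Empty:
                return dropped
            dropped += 1

    def pending(self) -> int:
        return self._chunks.qsize()


def handle_event(raw_message: str, playback: PlaybackBuffer) -> None:
    """Stop queued agent audio as soon as the user starts speaking."""
    event = json.loads(raw_message)
    # Event name is an assumption; adjust it to the events your session emits.
    if event.get("type") == "input_audio_buffer.speech_started":
        playback.clear()
```

Without this step the agent keeps talking over the user until its buffered audio runs out, which feels unnatural in a voice conversation.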

For development and debugging:

Implement comprehensive logging like the tutorial demonstrates
Use the Foundry playground to test agent behavior before voice integration
Monitor API response times and audio quality during peak usage

Pro tip: Start with a V1 agent for simpler implementations, but plan to migrate to V2 to take advantage of Microsoft's ongoing investments in the Foundry platform and agent framework.

Watch the Full Tutorial

For a complete walkthrough of the code and to see both the pirate (V1) and poet (V2) agents in action, watch the full tutorial video. At 9:15, you'll see the pirate agent demonstration, and at 15:30, the poet agent comes to life with old English responses.

Key Takeaways

Microsoft's Voice Live API combined with Foundry agents provides a powerful platform for building voice-enabled AI applications without developing complex speech processing capabilities in-house. The tutorial demonstrates both the original (V1) and newer (V2) approaches to agent implementation.

In summary: Voice Live API handles real-time speech processing via WebSocket, Foundry provides the agent framework, and the combination enables natural voice interactions. V2 agents represent Microsoft's future direction with improved tooling and management capabilities.

Frequently Asked Questions


What is the Voice Live API in Azure?

The Voice Live API is a Microsoft Azure service that provides real-time speech-to-text and text-to-speech capabilities through WebSocket connections. It allows developers to integrate natural voice interactions with their applications without building the speech processing infrastructure themselves.

The API supports multiple languages and voices, and can connect directly to Azure AI Foundry agents. It handles all the complex audio processing while your application focuses on the conversation logic and business requirements.

Provides real-time speech processing via WebSocket
Supports multiple languages and voice profiles
Eliminates need for custom speech infrastructure

What's the difference between Foundry V1 and V2 agents?

Foundry V1 agents are the original Azure AI Foundry implementation, while V2 agents are part of the newer Microsoft Foundry platform. V2 agents offer improved tool integration, better monitoring capabilities, and support for the Microsoft Agent Framework.

The main technical difference in implementation is that V1 agents use an agent ID while V2 agents use a friendly name for connection. V2 represents Microsoft's future direction for agent development with more features and flexibility.

V1 uses agent IDs, V2 uses friendly names
V2 has enhanced monitoring and tool integration
V2 supports Docker and Microsoft Agent Framework

What credentials are needed to connect to Voice Live API?

You need an Azure AD credential with the Cognitive Services User role, and a token requested for the scope api://AzureAIFoundry/.default. The tutorial obtains this through an Azure CLI login. The credential is used to generate a JWT that authenticates your WebSocket connection to the Voice Live API endpoint.

Proper authentication setup is crucial - the tutorial shows at 6:15 how to configure these permissions in Azure AD. Missing or incorrect permissions are a common source of connection failures when first setting up voice agents.

Azure AD credential with Cognitive Services User role
Specific scope api://AzureAIFoundry/.default
Azure CLI login required for token generation

Can I use my own LLM with Voice Live API?

Yes, you can deploy your own LLM in Microsoft Foundry and connect it to the Voice Live API. The tutorial shows using GPT-4, but you can configure any supported model. The agent acts as the intermediary between the Voice Live API and your LLM, handling the conversation flow and processing the inputs and outputs.

This architecture gives you flexibility to choose the best LLM for your specific use case while still benefiting from Azure's managed speech processing capabilities. You maintain control over the conversation logic and personality while offloading the complex audio processing.

Yes, supports custom LLM deployment
Agent mediates between Voice Live API and your LLM
Maintain control over conversation logic

What programming languages can I use with Voice Live API?

The official quickstart provides Python examples, but since the API uses standard WebSocket connections, you can implement it in any language with WebSocket support. The key requirements are handling the authentication flow and properly formatting the WebSocket messages according to the Voice Live API specification.

Popular choices besides Python include JavaScript/Node.js for web applications, C# for .NET developers, and Java for enterprise systems. The authentication tokens and WebSocket message formats are language-agnostic.

Official examples in Python
Any language with WebSocket support works
JavaScript, C#, Java are common alternatives

How do I test my voice agent before deployment?

Microsoft Foundry provides a playground where you can test your agents. For voice-specific testing, you can use the Voice Live Playground in the Azure portal, which lets you configure input language detection, speech output voice selection, and directly interact with your agent through microphone input and audio output.

The tutorial shows this testing interface at 7:45. Comprehensive logging, as implemented in the tutorial code (saving to the logs folder), is also invaluable for debugging and understanding the interaction flow during development.

Use Foundry playground for agent testing
Voice Live Playground for voice-specific testing
Implement logging for debugging interactions

What are some practical use cases for voice agents?

Voice agents are ideal for customer service applications, interactive voice response systems, accessibility tools, and educational applications. The tutorial demonstrates creating personality-driven agents (pirate and poet), but real-world applications could include technical support bots, appointment scheduling assistants, or interactive learning tools.

Businesses are using voice agents for 24/7 customer support, hands-free operation in industrial settings, and personalized education. The combination of Voice Live API and Foundry agents makes these applications more accessible to develop and deploy at scale.

Customer service and support
Interactive voice response systems
Accessibility tools and educational applications

How can GrowwStacks help implement voice agents for my business?

GrowwStacks specializes in building custom voice agent solutions on Microsoft Azure. Our team can design and deploy voice agents tailored to your specific business needs, integrate them with your existing systems, and ensure optimal performance.

We offer end-to-end implementation from concept to deployment, including custom LLM integration if needed. Our expertise with Voice Live API and Foundry agents ensures your voice solution will be scalable, maintainable, and deliver real business value.

Custom voice agent design and deployment
Integration with your existing systems
Free consultation to discuss your requirements

Ready to Implement Voice Agents for Your Business?

Manual voice interface development can take months and require specialized expertise. With GrowwStacks, you can deploy custom voice agents in weeks, not months, using proven Azure infrastructure.
