February 17, 2026 7 min read AI Automation

How to Track Voice Agent Performance with Galileo + ElevenLabs (Multi-Turn Evaluation)

Most voice agents fail silently: they seem to converse well but miss key user requests. Galileo's turn-by-turn metrics show exactly where your ElevenLabs agent struggles with completeness, correctness, and efficiency. This guide walks through the implementation in four steps.

Monitoring ElevenLabs voice agent performance with Galileo observability platform

The Problem With Basic Voice Agent Analytics

ElevenLabs provides conversation logs showing what your voice agent said and how users responded. But raw transcripts don't answer critical questions: Did the agent actually solve the user's problem? How many turns did it take? Were key requests missed entirely?

This creates an "invisible failure" problem: agents appear to converse smoothly while failing to complete user goals. Without quantitative metrics, you can't identify failure patterns, let alone improve performance systematically.

72% of voice agent failures occur in multi-turn conversations where the agent misses subtle context shifts or fails to fully address layered requests. Basic analytics can't detect these breakdowns.

Galileo's Turn-by-Turn Metrics Explained

Galileo introduces three core metrics that transform voice agent evaluation:

Completeness (0-100): How fully the agent addressed the user's request
Correctness (0-100): Accuracy of information provided
Efficiency (0-100): How quickly/directly the request was resolved

Each conversation turn is scored individually, then rolled up into session-level metrics. For example, in our product marketing agent demo (at 2:15 in the video), the agent scored just 43 on completeness when asked how Instagram could win over Facebook users in 2010: it discussed general differentiation but missed time-specific tactics.
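To make the roll-up concrete, here is a minimal sketch of per-turn scores being combined into a session-level score. The article does not say how Galileo aggregates turns, so the arithmetic mean used here is an assumption, and the type and function names (TurnScores, rollUpSession) are hypothetical rather than part of Galileo's SDK.

```typescript
// Hypothetical shapes for illustration; not Galileo's actual data model.
interface TurnScores {
  completeness: number; // 0-100
  correctness: number; // 0-100
  efficiency: number; // 0-100
}

type MetricName = keyof TurnScores;

const METRICS: MetricName[] = ["completeness", "correctness", "efficiency"];

// Assumption: a session-level score is the arithmetic mean of its turn scores.
function rollUpSession(turns: TurnScores[]): TurnScores {
  if (turns.length === 0) {
    throw new Error("cannot roll up a session with no turns");
  }
  const sessionScores: TurnScores = { completeness: 0, correctness: 0, efficiency: 0 };
  for (const metric of METRICS) {
    const total = turns.reduce((sum, turn) => sum + turn[metric], 0);
    sessionScores[metric] = total / turns.length;
  }
  return sessionScores;
}

// The 43 is the completeness score from the demo; every other number is made up.
const demoRollUp = rollUpSession([
  { completeness: 43, correctness: 80, efficiency: 70 },
  { completeness: 87, correctness: 90, efficiency: 60 },
]);
// demoRollUp is { completeness: 65, correctness: 85, efficiency: 65 }
```

A mean hides a single bad turn inside an otherwise good session, so it can be worth keeping the per-turn minimum alongside it when reviewing results.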

Implementing Galileo in 4 Steps

The integration requires just four key handlers in your ElevenLabs voice agent code:

Step 1: Initialize Galileo

Add the getGalileo handler to establish the monitoring session with your API credentials.

Step 2: Mark conversation start

Use startSession when a new user interaction begins to create the trace container.

Step 3: Track turns

Log each agent response and user prompt with timestamps and raw transcripts.

Step 4: Close session

Call endSession when the conversation completes to finalize metrics calculation.

Implementation time: Most teams add Galileo tracking in under 2 hours using the example repo's pre-built handlers.
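The four steps above can be sketched as follows. This is an in-memory stand-in written for illustration, not the example repo's code and not the Galileo SDK: only the handler names getGalileo, startSession, and endSession come from this article, while logTurn, the types, and every field are hypothetical.

```typescript
// In-memory stand-in for illustration only. Handler names getGalileo,
// startSession and endSession come from the article; logTurn, the types
// and all fields are hypothetical and NOT the Galileo SDK.
type Speaker = "user" | "agent";

interface TrackedTurn {
  role: Speaker;
  transcript: string;
  timestamp: number; // ms since epoch
}

interface TrackedSession {
  id: string;
  turns: TrackedTurn[];
  startedAt: number;
  endedAt?: number;
}

interface GalileoStandIn {
  apiKey: string;
  sessions: TrackedSession[];
}

let sharedClient: GalileoStandIn | undefined;

// Step 1: create the tracking client once and reuse it afterwards.
function getGalileo(apiKey: string): GalileoStandIn {
  if (sharedClient === undefined) {
    sharedClient = { apiKey: apiKey, sessions: [] };
  }
  return sharedClient;
}

// Step 2: open a container for one conversation.
function startSession(galileo: GalileoStandIn, id: string): TrackedSession {
  const created: TrackedSession = { id: id, turns: [], startedAt: Date.now() };
  galileo.sessions.push(created);
  return created;
}

// Step 3: record each user prompt and agent response with a timestamp.
function logTurn(target: TrackedSession, role: Speaker, transcript: string): void {
  if (target.endedAt !== undefined) {
    throw new Error("session " + target.id + " is already closed");
  }
  target.turns.push({ role: role, transcript: transcript, timestamp: Date.now() });
}

// Step 4: close the session so it can be scored.
function endSession(target: TrackedSession): void {
  target.endedAt = Date.now();
}

const galileo = getGalileo("YOUR_API_KEY"); // placeholder credential
const demoConversation = startSession(galileo, "demo-1");
logTurn(demoConversation, "user", "How could Instagram attract Facebook users in 2010?");
logTurn(demoConversation, "agent", "Highlight what makes Instagram different.");
endSession(demoConversation);
```

In a real integration the bodies of these functions would call the observability SDK instead of writing to an array, but the order of calls stays the same.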

Analyzing a Real Agent Conversation

In our demo (visible at 3:40 in the video), the product marketing agent struggled with a multi-part request about attracting Facebook users to Instagram in 2010. While it correctly identified Instagram's photo filter advantage, Galileo's metrics revealed three specific gaps:

Time context missed: Didn't tailor advice to 2010's social media landscape
Audience specificity: Generalized about millennials rather than Facebook users
Tactical depth: Lacked concrete acquisition strategies beyond "highlight differences"

These insights (with exact scoring rationales) only emerge through Galileo's turn-by-turn evaluation.

Filtering Sessions by Performance

With multiple conversations logged, Galileo's dashboard lets you filter sessions by:

Completeness ranges (e.g., show all sessions below 50)
Efficiency thresholds (e.g., sessions taking >5 turns)
Correctness scores (e.g., turns scoring below 80)

This reveals patterns: your agent may consistently struggle with time-bound requests, or lose efficiency on Fridays when call volume peaks. Such insights drive targeted improvements.
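The dashboard handles this filtering for you, but the logic is easy to picture in code. The sketch below applies the same kinds of thresholds to a local array of session summaries. The field names and the filterSessions helper are hypothetical and do not reflect Galileo's export schema or query API, and the sample numbers are made up.

```typescript
// Hypothetical session summaries; field names are illustrative only.
interface SessionSummary {
  id: string;
  completeness: number; // 0-100, session level
  correctness: number; // 0-100, session level
  turnCount: number;
}

interface SessionCriteria {
  completenessBelow?: number; // keep sessions scoring under this value
  correctnessBelow?: number; // keep sessions scoring under this value
  turnsAbove?: number; // keep sessions with more turns than this
}

// A session is kept only if it satisfies every criterion that is set.
function filterSessions(all: SessionSummary[], criteria: SessionCriteria): SessionSummary[] {
  return all.filter((s) => {
    if (criteria.completenessBelow !== undefined && s.completeness >= criteria.completenessBelow) {
      return false;
    }
    if (criteria.correctnessBelow !== undefined && s.correctness >= criteria.correctnessBelow) {
      return false;
    }
    if (criteria.turnsAbove !== undefined && s.turnCount <= criteria.turnsAbove) {
      return false;
    }
    return true;
  });
}

// Made-up sample data.
const recorded: SessionSummary[] = [
  { id: "a", completeness: 43, correctness: 75, turnCount: 7 },
  { id: "b", completeness: 82, correctness: 91, turnCount: 3 },
  { id: "c", completeness: 48, correctness: 88, turnCount: 4 },
];

const lowCompleteness = filterSessions(recorded, { completenessBelow: 50 });
// ids: a, c
const longAndIncomplete = filterSessions(recorded, { completenessBelow: 50, turnsAbove: 5 });
// ids: a
```

Combining criteria is where patterns tend to show up: a session that is both long and incomplete points at a different problem than one that is short and incomplete.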

Improving Agent Performance

Armed with Galileo's metrics, you can make data-driven upgrades:

Prompt engineering: Add time-context handling to the product marketing agent's prompt
Knowledge expansion: Incorporate historical social media tactics for 2010
Flow optimization: Shorten turns where efficiency scores dip

Teams using Galileo typically see 40-60% improvements in completeness scores within 2-3 iteration cycles by focusing on the weakest metrics.
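Two small calculations sit behind this kind of iteration: picking the weakest metric to work on, and measuring the change between cycles. The sketch below shows both. The article does not state whether its improvement percentages are relative or absolute, so the relative formula here is an assumption, and the helper names and sample averages are hypothetical.

```typescript
type MetricName = "completeness" | "correctness" | "efficiency";
type MetricAverages = Record<MetricName, number>;

const METRIC_NAMES: MetricName[] = ["completeness", "correctness", "efficiency"];

// Returns the metric with the lowest average score (first one wins a tie).
function weakestMetric(averages: MetricAverages): MetricName {
  let weakest: MetricName = "completeness";
  for (const name of METRIC_NAMES) {
    if (averages[name] < averages[weakest]) {
      weakest = name;
    }
  }
  return weakest;
}

// Assumption: improvement is relative to the starting score, in percent.
function relativeImprovement(before: number, after: number): number {
  if (before <= 0) {
    throw new Error("starting score must be positive");
  }
  return ((after - before) / before) * 100;
}

// Made-up averages for one agent before any prompt changes.
const weakestArea = weakestMetric({ completeness: 58, correctness: 84, efficiency: 71 });
// weakestArea is "completeness"

// Raising an average from 50 to 75 is a 50% relative improvement,
// but only a 25-point absolute gain on the 0-100 scale.
const gainPercent = relativeImprovement(50, 75);
// gainPercent is 50
```

Stating which of the two readings you mean avoids confusion when comparing results across iteration cycles.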

Watch the Full Tutorial

See the complete implementation and dashboard walkthrough in the video below (key demo starts at 1:20). The example repo contains all needed handlers to add Galileo tracking to your ElevenLabs voice agent today.

ElevenLabs voice agent observability tutorial with Galileo

Key Takeaways

Voice agents need quantitative performance tracking just like human teams. Without metrics like completeness and efficiency, you're optimizing blind.

In summary: (1) add Galileo's four handlers to your ElevenLabs agent, (2) review turn-by-turn metrics, (3) filter sessions to find failure patterns, and (4) iterate on prompts based on the lowest scores. As noted above, teams following this loop typically see 40-60% improvements in completeness scores within 2-3 iteration cycles.

Frequently Asked Questions

Common questions about voice agent observability

What metrics does Galileo provide for ElevenLabs voice agents?

Galileo provides turn-by-turn metrics including completeness (how fully the agent addressed the request), correctness (accuracy of information), and efficiency (how quickly the request was resolved).

These metrics help identify where conversations break down and how to improve agent performance. Each is scored 0-100 per turn, with detailed rationales explaining the scoring.

Completeness: Measures if all aspects of the request were addressed
Correctness: Validates factual accuracy of responses
Efficiency: Tracks how directly the solution was reached

How do I integrate Galileo with my ElevenLabs voice agent?

Integration requires adding Galileo's SDK handlers to track conversation start/end points and agent/user turns.

The example repo shows a working implementation with just four key handlers: getGalileo (initialization), startSession (new conversation), endSession (finalize metrics), and automatic turn tracking between them.

No changes to your core agent logic needed
Handlers wrap existing conversation flow
Typical implementation takes <2 hours
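One way to picture "handlers wrap existing conversation flow" is a higher-order function that records both sides of each exchange and leaves the agent's own logic untouched. The sketch below is a hypothetical illustration of that pattern, not code from the example repo: withTurnTracking and the types are invented names, and the agent is synchronous for brevity where a real voice agent would be asynchronous.

```typescript
// The existing agent logic: takes the user's text, returns the reply.
type AgentReply = (userText: string) => string;

interface LoggedTurn {
  speaker: "user" | "agent";
  text: string;
}

// Returns a function with the same signature as the original agent that
// additionally appends the user prompt and the agent response to a log.
function withTurnTracking(reply: AgentReply, log: LoggedTurn[]): AgentReply {
  return (userText: string): string => {
    log.push({ speaker: "user", text: userText });
    const agentText = reply(userText);
    log.push({ speaker: "agent", text: agentText });
    return agentText;
  };
}

// Stand-in for your core agent; it is not modified by the wrapper.
const echoAgent: AgentReply = (userText: string): string => "You asked: " + userText;

const turnLog: LoggedTurn[] = [];
const trackedAgent = withTurnTracking(echoAgent, turnLog);
const agentAnswer = trackedAgent("What is Instagram's main advantage?");
// agentAnswer is "You asked: What is Instagram's main advantage?"
// turnLog now holds the user turn followed by the agent turn
```

Because the wrapper has the same signature as the function it wraps, it can be removed again without touching anything else.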

Can I filter sessions by performance metrics?

Yes, Galileo lets you filter all recorded sessions by metrics like completeness score, efficiency rating, or correctness percentage.

This helps identify patterns in when your agent struggles to properly handle user requests. For example, you might discover it consistently scores poorly on time-bound questions or complex multi-part requests.

Filter by score ranges (e.g., <50 completeness)
Compare performance across time periods
Identify conversation types with lowest scores
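Comparing time periods comes down to grouping sessions by period and averaging a metric within each group. The sketch below does this by calendar month over a local array. The data shape and the averageByMonth helper are hypothetical, the sample scores are made up, and dates are assumed to be ISO strings such as "2026-01-14".

```typescript
interface DatedSession {
  date: string; // ISO date, e.g. "2026-01-14"
  completeness: number; // 0-100
}

interface PeriodAverage {
  period: string; // "YYYY-MM"
  average: number;
  sessions: number;
}

// Groups sessions by the "YYYY-MM" prefix of their date and averages
// completeness per group, returned in chronological order.
function averageByMonth(all: DatedSession[]): PeriodAverage[] {
  const buckets: { [period: string]: { total: number; count: number } } = {};
  for (const s of all) {
    const period = s.date.slice(0, 7);
    let bucket = buckets[period];
    if (bucket === undefined) {
      bucket = { total: 0, count: 0 };
      buckets[period] = bucket;
    }
    bucket.total += s.completeness;
    bucket.count += 1;
  }
  return Object.keys(buckets)
    .sort()
    .map((period) => {
      const bucket = buckets[period];
      return { period: period, average: bucket.total / bucket.count, sessions: bucket.count };
    });
}

// Made-up sample data.
const monthly = averageByMonth([
  { date: "2026-02-03", completeness: 70 },
  { date: "2026-01-14", completeness: 40 },
  { date: "2026-01-20", completeness: 60 },
  { date: "2026-02-10", completeness: 80 },
  { date: "2026-02-17", completeness: 90 },
]);
// monthly[0] is { period: "2026-01", average: 50, sessions: 2 }
// monthly[1] is { period: "2026-02", average: 80, sessions: 3 }
```

Keeping the session count next to each average makes it easier to spot periods where a shift is driven by only a handful of conversations.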

What's an example of a completeness score rationale?

Galileo provides specific rationales for scoring. For example, suppose a user asks how to attract Facebook users to Instagram in 2010, but the agent only discusses general differentiation without time-specific tactics.

In that case, the rationale might state: "Score 43 - Addressed general differentiation but failed to provide 2010-specific tactics for user migration from Facebook. Missed historical context of social media landscape."

Identifies exactly which aspects were missed
Provides actionable improvement direction
Links to similar low-scoring turns
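If you collect scored turns locally, the rationales can be turned into a worst-first review list. The sketch below is a hypothetical illustration: the ScoredTurn shape and reviewQueue helper are invented for this example, and apart from the score of 43 and its rationale, the sample data is made up.

```typescript
interface ScoredTurn {
  sessionId: string;
  turnIndex: number;
  completeness: number; // 0-100
  rationale: string;
}

// Keeps turns scoring below the threshold, orders them worst-first, and
// formats each as "session#turn (score): rationale". filter() returns a
// new array, so the input is not reordered by the sort.
function reviewQueue(turns: ScoredTurn[], threshold: number): string[] {
  return turns
    .filter((t) => t.completeness < threshold)
    .sort((a, b) => a.completeness - b.completeness)
    .map((t) => t.sessionId + "#" + t.turnIndex + " (" + t.completeness + "): " + t.rationale);
}

const scoredTurns: ScoredTurn[] = [
  {
    sessionId: "demo-1",
    turnIndex: 3,
    completeness: 43,
    rationale: "Addressed general differentiation but failed to provide 2010-specific tactics.",
  },
  { sessionId: "demo-1", turnIndex: 4, completeness: 91, rationale: "Covered every part of the request." },
  { sessionId: "demo-2", turnIndex: 1, completeness: 28, rationale: "Ignored the second half of the question." },
];

const worstFirst = reviewQueue(scoredTurns, 50);
// worstFirst[0] starts with "demo-2#1 (28)"
// worstFirst[1] starts with "demo-1#3 (43)"
```

Reading the rationales in this order puts the turns with the most to gain at the top of the prompt-revision list.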

How does this compare to ElevenLabs' native analytics?

While ElevenLabs shows basic conversation logs, Galileo adds quantitative scoring per turn (0-100 scales), identifies specific failure points, and provides aggregate metrics across all conversations.

ElevenLabs tells you what was said; Galileo tells you how well it was said and where improvements are needed. This enables data-driven improvements to your agent's performance.

ElevenLabs: Raw transcripts, basic usage stats
Galileo: Turn scoring, failure analysis, trends
They complement each other well

What types of voice agents benefit most from this?

Complex multi-turn agents handling specific domains (like product marketing in the example) benefit most, as they require precise responses to layered requests.

Simple FAQ bots with short answers need less granular monitoring. The more nuanced the conversation and the higher the stakes of incorrect answers, the more valuable Galileo's metrics become.

Best for: Support agents, sales bots, expert advisors
Less critical for: Simple FAQ, single-turn responses
Ideal when: Accuracy and completeness impact business outcomes

Can I see the actual conversation transcripts?

Yes, Galileo shows the full conversation flow with metrics applied to each turn. You can click into any exchange to see the exact agent response, user prompt, and scoring rationale side-by-side.

This contextual view helps understand why certain responses scored poorly and how to improve them. The transcripts remain searchable and filterable alongside the metrics.

View full conversation history
See scoring applied to each turn
Search transcripts by keyword
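For transcripts you have exported or logged yourself, keyword search can be as simple as a case-insensitive substring match. The sketch below shows that approach. The TranscriptTurn shape and searchTranscripts helper are hypothetical, and nothing here describes how Galileo's own search is implemented.

```typescript
interface TranscriptTurn {
  sessionId: string;
  speaker: "user" | "agent";
  text: string;
}

// Case-insensitive substring match over turn text.
// An empty or whitespace-only keyword matches nothing.
function searchTranscripts(turns: TranscriptTurn[], keyword: string): TranscriptTurn[] {
  const needle = keyword.trim().toLowerCase();
  if (needle === "") {
    return [];
  }
  return turns.filter((t) => t.text.toLowerCase().indexOf(needle) !== -1);
}

// Made-up sample transcript.
const transcript: TranscriptTurn[] = [
  { sessionId: "demo-1", speaker: "user", text: "How do we attract Facebook users in 2010?" },
  { sessionId: "demo-1", speaker: "agent", text: "Highlight Instagram's photo filters." },
  { sessionId: "demo-2", speaker: "user", text: "What did facebook offer at the time?" },
];

const keywordHits = searchTranscripts(transcript, "Facebook");
// keywordHits contains the two user turns that mention Facebook
```

Substring matching will also hit partial words, so a tokenized or word-boundary match may be a better fit for short keywords.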

How can GrowwStacks help implement this for your business?

GrowwStacks specializes in building and optimizing AI voice agents with integrated observability. We can implement Galileo tracking for your ElevenLabs agent, analyze conversation metrics, and improve completion rates by 40-60% through targeted prompt engineering and workflow adjustments.

Our typical engagement includes: 1) Galileo integration 2) Baseline performance assessment 3) Metric-driven optimization cycles 4) Ongoing monitoring and improvement.

Free consultation to assess your agent's needs
Implementation in as little as 2 days
Performance improvement guarantees

Stop Guessing About Your Voice Agent's Performance

Without Galileo's metrics, you're flying blind with expensive voice agents. We'll implement turn-by-turn tracking and improve your completion scores by 50%+ in the first 30 days.
