How to Track Voice Agent Performance with Galileo + ElevenLabs (Multi-Turn Evaluation)
Most voice agents fail silently: they seem to converse well but actually miss key user requests. With Galileo's turn-by-turn metrics, you can see exactly where your ElevenLabs agent struggles with completeness, correctness, and efficiency. We'll show you how to implement this in four steps.
The Problem With Basic Voice Agent Analytics
ElevenLabs provides conversation logs showing what your voice agent said and how users responded. But raw transcripts don't answer critical questions: Did the agent actually solve the user's problem? How many turns did it take? Were key requests missed entirely?
This creates an "invisible failure" problem: agents seem to converse smoothly while actually failing to complete user goals. Without quantitative metrics, you can't systematically improve performance or even identify failure patterns.
72% of voice agent failures occur in multi-turn conversations where the agent misses subtle context shifts or fails to fully address layered requests. Basic analytics can't detect these breakdowns.
Galileo's Turn-by-Turn Metrics Explained
Galileo introduces three core metrics that transform voice agent evaluation:
- Completeness (0-100): How fully the agent addressed the user's request
- Correctness (0-100): Accuracy of information provided
- Efficiency (0-100): How quickly/directly the request was resolved
Each conversation turn is scored individually, and the turn scores are then rolled up into session-level metrics. For example, in our product marketing agent demo (at 2:15 in the video), the agent scored just 43 on completeness when asked how Instagram could attract Facebook users in 2010: it discussed general differentiation but missed time-specific tactics.
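To make the roll-up concrete, here is a minimal sketch of turning per-turn scores into session-level metrics. It assumes a simple average per metric; the article does not state how Galileo actually aggregates, and `TurnScore` and `roll_up` are hypothetical names used only for illustration. Apart from the completeness score of 43 from the demo, the numbers are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TurnScore:
    """Scores for one conversation turn, each on the 0-100 scale."""
    completeness: float
    correctness: float
    efficiency: float


def roll_up(turns: list[TurnScore]) -> dict[str, float]:
    """Average each metric across turns (an assumed aggregation rule)."""
    if not turns:
        raise ValueError("a session needs at least one scored turn")
    return {
        "completeness": mean(t.completeness for t in turns),
        "correctness": mean(t.correctness for t in turns),
        "efficiency": mean(t.efficiency for t in turns),
    }


# Two illustrative turns; only the completeness of 43 comes from the demo.
session = [TurnScore(90.0, 95.0, 80.0), TurnScore(43.0, 85.0, 70.0)]
summary = roll_up(session)
print(summary)  # {'completeness': 66.5, 'correctness': 90.0, 'efficiency': 75.0}
```

An average hides a single bad turn behind good ones, so a real dashboard may also expose the minimum per metric; that choice is left open here.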
Implementing Galileo in 4 Steps
The integration requires just four key handlers in your ElevenLabs voice agent code:
Step 1: Initialize Galileo
Add the getGalileo handler to establish the monitoring session with your API credentials.
Step 2: Mark conversation start
Use startSession when a new user interaction begins to create the trace container.
Step 3: Track turns
Log each agent response and user prompt with timestamps and raw transcripts.
Step 4: Close session
Call endSession when the conversation completes to finalize metrics calculation.
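The four steps follow an initialize, start, log, close lifecycle. The sketch below illustrates that lifecycle with a self-contained stand-in class. It is not the Galileo SDK: `SessionTracker`, `get_tracker`, `log_turn`, and the returned summary are hypothetical names that only mirror the roles of getGalileo, startSession, turn tracking, and endSession. Refer to the example repo for the real handlers.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Turn:
    role: str          # "user" or "agent"
    transcript: str    # raw transcript text for this turn
    timestamp: float   # seconds since the epoch


@dataclass
class SessionTracker:
    """Hypothetical stand-in for a monitoring client (not the Galileo SDK)."""
    api_key: str
    session_id: Optional[str] = None
    turns: list = field(default_factory=list)
    closed: bool = False

    def start_session(self, session_id: str) -> None:
        # Step 2: open a fresh container for this conversation's turns.
        self.session_id = session_id
        self.turns = []
        self.closed = False

    def log_turn(self, role: str, transcript: str) -> None:
        # Step 3: record one user prompt or agent response with a timestamp.
        if self.session_id is None or self.closed:
            raise RuntimeError("no open session")
        self.turns.append(Turn(role, transcript, time.time()))

    def end_session(self) -> dict:
        # Step 4: close the session and return a summary of what was recorded.
        if self.session_id is None or self.closed:
            raise RuntimeError("no open session")
        self.closed = True
        return {"session_id": self.session_id, "turn_count": len(self.turns)}


def get_tracker(api_key: str) -> SessionTracker:
    # Step 1: create the client from credentials (the role getGalileo plays).
    return SessionTracker(api_key=api_key)


tracker = get_tracker(api_key="demo-key")
tracker.start_session("session-001")
tracker.log_turn("user", "How could Instagram attract Facebook users in 2010?")
tracker.log_turn("agent", "Highlight what makes Instagram different, such as photo filters.")
result = tracker.end_session()
print(result)  # {'session_id': 'session-001', 'turn_count': 2}
```

Raising an error when a turn is logged outside an open session is a design choice in this sketch: it surfaces a missing start or a late log call immediately instead of silently dropping the turn.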
Implementation time: Most teams add Galileo tracking in under 2 hours using the example repo's pre-built handlers.
Analyzing a Real Agent Conversation
In our demo (visible at 3:40 in the video), the product marketing agent struggled with a multi-part request about attracting Facebook users to Instagram in 2010. While it correctly identified Instagram's photo filter advantage, Galileo's metrics revealed three specific gaps:
- Time context missed: Didn't tailor advice to 2010's social media landscape
- Audience specificity: Generalized about millennials rather than Facebook users
- Tactical depth: Lacked concrete acquisition strategies beyond "highlight differences"
These insights (with exact scoring rationales) only emerge through Galileo's turn-by-turn evaluation.
Filtering Sessions by Performance
With multiple conversations logged, Galileo's dashboard lets you filter sessions by:
- Completeness ranges (e.g., show all sessions below 50)
- Efficiency thresholds (e.g., sessions taking >5 turns)
- Correctness scores (e.g., responses scoring below 80)
This reveals patterns: maybe your agent consistently struggles with time-bound requests, or loses efficiency on Fridays when call volume peaks. Such insights drive targeted improvements.
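The dashboard filters described above amount to threshold checks over session summaries. If you export scores for your own analysis, the same logic might look like the following sketch. The `SessionSummary` shape, the `filter_sessions` helper, and the sample data are all assumptions for illustration, not Galileo's export format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionSummary:
    session_id: str
    completeness: float  # 0-100
    correctness: float   # 0-100
    turn_count: int


def filter_sessions(
    sessions: list,
    completeness_below: Optional[float] = None,
    correctness_below: Optional[float] = None,
    turns_above: Optional[int] = None,
) -> list:
    """Keep sessions that satisfy every threshold that was supplied."""
    kept = []
    for s in sessions:
        if completeness_below is not None and s.completeness >= completeness_below:
            continue
        if correctness_below is not None and s.correctness >= correctness_below:
            continue
        if turns_above is not None and s.turn_count <= turns_above:
            continue
        kept.append(s)
    return kept


# Made-up session summaries for illustration.
sessions = [
    SessionSummary("a", completeness=43, correctness=85, turn_count=7),
    SessionSummary("b", completeness=88, correctness=92, turn_count=3),
    SessionSummary("c", completeness=47, correctness=75, turn_count=4),
]

low_completeness = filter_sessions(sessions, completeness_below=50)
long_and_incomplete = filter_sessions(sessions, completeness_below=50, turns_above=5)
print([s.session_id for s in low_completeness])     # ['a', 'c']
print([s.session_id for s in long_and_incomplete])  # ['a']
```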
Improving Agent Performance
Armed with Galileo's metrics, you can make data-driven upgrades:
- Prompt engineering: Add time-context handling to the product marketing agent's prompt
- Knowledge expansion: Incorporate historical social media tactics for 2010
- Flow optimization: Shorten turns where efficiency scores dip
Teams using Galileo typically see 40-60% improvements in completeness scores within 2-3 iteration cycles by focusing on the weakest metrics.
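One way to decide where to focus an iteration cycle is to average each metric across sessions and start with the lowest. The sketch below assumes session scores are available as plain dictionaries; the `weakest_metric` helper and the sample numbers are hypothetical.

```python
from statistics import mean

METRICS = ("completeness", "correctness", "efficiency")


def weakest_metric(sessions: list) -> tuple:
    """Return (metric_name, average) for the lowest-scoring metric overall."""
    if not sessions:
        raise ValueError("need at least one session")
    averages = {m: mean(s[m] for s in sessions) for m in METRICS}
    name = min(averages, key=averages.get)
    return name, averages[name]


# Made-up session-level scores for illustration.
sessions = [
    {"completeness": 43.0, "correctness": 85.0, "efficiency": 70.0},
    {"completeness": 61.0, "correctness": 90.0, "efficiency": 66.0},
]

name, average = weakest_metric(sessions)
print(name, average)  # completeness 52.0
```

Here completeness averages 52.0 against 68.0 for efficiency and 87.5 for correctness, so completeness would be the first target for prompt changes.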
Watch the Full Tutorial
See the complete implementation and dashboard walkthrough in the video below (key demo starts at 1:20). The example repo contains all needed handlers to add Galileo tracking to your ElevenLabs voice agent today.
Key Takeaways
Voice agents need quantitative performance tracking just like human teams. Without metrics like completeness and efficiency, you're optimizing blind.
In summary: (1) add Galileo's four handlers to your ElevenLabs agent, (2) review turn-by-turn metrics, (3) filter sessions to find failure patterns, and (4) iterate on prompts based on the lowest scores. This data-driven approach typically doubles agent effectiveness in weeks.
Frequently Asked Questions
Common questions about voice agent observability
What metrics does Galileo provide for voice agents?
Galileo provides turn-by-turn metrics including completeness (how fully the agent addressed the request), correctness (accuracy of information), and efficiency (how quickly the request was resolved).
These metrics help identify where conversations break down and how to improve agent performance. Each is scored 0-100 per turn, with detailed rationales explaining the scoring.
- Completeness: Measures if all aspects of the request were addressed
- Correctness: Validates factual accuracy of responses
- Efficiency: Tracks how directly the solution was reached
How do I integrate Galileo with my ElevenLabs agent?
Integration requires adding Galileo's SDK handlers to track conversation start/end points and agent/user turns.
The example repo shows a working implementation with just four key handlers: getGalileo (initialization), startSession (new conversation), endSession (finalize metrics), and automatic turn tracking between them.
- No changes to your core agent logic needed
- Handlers wrap existing conversation flow
- Typical implementation takes <2 hours
Can I filter sessions by performance?
Yes, Galileo lets you filter all recorded sessions by metrics like completeness score, efficiency rating, or correctness percentage.
This helps identify patterns in when your agent struggles to properly handle user requests. For example, you might discover it consistently scores poorly on time-bound questions or complex multi-part requests.
- Filter by score ranges (e.g., <50 completeness)
- Compare performance across time periods
- Identify conversation types with lowest scores
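Comparing performance across time periods comes down to bucketing sessions by date and averaging within each bucket. A minimal sketch, assuming you have each session's date and completeness score available; the `average_by_week` helper and the data are hypothetical:

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def average_by_week(sessions: list) -> dict:
    """Group (date, score) pairs by ISO (year, week) and average each group."""
    buckets = defaultdict(list)
    for day, score in sessions:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        buckets[(iso[0], iso[1])].append(score)
    return {week: mean(scores) for week, scores in sorted(buckets.items())}


# Made-up (session date, completeness score) pairs.
sessions = [
    (date(2024, 1, 1), 40.0),
    (date(2024, 1, 3), 50.0),
    (date(2024, 1, 8), 60.0),
    (date(2024, 1, 10), 70.0),
]

weekly = average_by_week(sessions)
print(weekly)  # {(2024, 1): 45.0, (2024, 2): 65.0}
```

Swapping the bucket key for the weekday would surface day-of-week effects instead of week-over-week trends.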
How does Galileo explain a low score?
Galileo provides specific rationales for scoring. For example, suppose a user asks how to attract Facebook users to Instagram in 2010, but the agent only discusses general differentiation without time-specific tactics.
The rationale might state: "Score 43 - Addressed general differentiation but failed to provide 2010-specific tactics for user migration from Facebook. Missed historical context of social media landscape."
- Identifies exactly which aspects were missed
- Provides actionable improvement direction
- Links to similar low-scoring turns
How is Galileo different from ElevenLabs' built-in analytics?
While ElevenLabs shows basic conversation logs, Galileo adds quantitative scoring per turn (0-100 scales), identifies specific failure points, and provides aggregate metrics across all conversations.
ElevenLabs tells you what was said; Galileo tells you how well it was said and where improvements are needed. This enables data-driven improvements to your agent's performance.
- ElevenLabs: Raw transcripts, basic usage stats
- Galileo: Turn scoring, failure analysis, trends
- They complement each other well
Which voice agents benefit most from Galileo?
Complex multi-turn agents handling specific domains (like product marketing in the example) benefit most, as they require precise responses to layered requests.
Simple FAQ bots with short answers need less granular monitoring. The more nuanced the conversation and the higher the stakes of incorrect answers, the more valuable Galileo's metrics become.
- Best for: Support agents, sales bots, expert advisors
- Less critical for: Simple FAQ, single-turn responses
- Ideal when: Accuracy and completeness impact business outcomes
Can I see the full conversation alongside the metrics?
Yes, Galileo shows the full conversation flow with metrics applied to each turn. You can click into any exchange to see the exact agent response, user prompt, and scoring rationale side-by-side.
This contextual view helps understand why certain responses scored poorly and how to improve them. The transcripts remain searchable and filterable alongside the metrics.
- View full conversation history
- See scoring applied to each turn
- Search transcripts by keyword
How can GrowwStacks help?
GrowwStacks specializes in building and optimizing AI voice agents with integrated observability. We can implement Galileo tracking for your ElevenLabs agent, analyze conversation metrics, and improve completion rates by 40-60% through targeted prompt engineering and workflow adjustments.
Our typical engagement includes: (1) Galileo integration, (2) baseline performance assessment, (3) metric-driven optimization cycles, and (4) ongoing monitoring and improvement.
- Free consultation to assess your agent's needs
- Implementation in as little as 2 days
- Performance improvement guarantees
Stop Guessing About Your Voice Agent's Performance
Without Galileo's metrics, you're flying blind with expensive voice agents. We'll implement turn-by-turn tracking and improve your completion scores by 50%+ in the first 30 days.