
Retell's New AI QA Tool Just Changed Voice Agents Forever - Here's How

Most businesses deploying voice AI have no way to systematically catch hallucinations, latency issues, or knowledge gaps - until now. Retell's new quality assurance feature automatically analyzes calls to surface exactly where your agent is failing; in our dental client case study, it uncovered a 12% hallucination rate on critical questions.

The Hidden Problems Voice AI Faces Without QA

Most businesses deploying voice AI agents operate blind. Without systematic quality assurance, critical issues like hallucinations, knowledge gaps, and latency problems go undetected - silently eroding customer trust. At 2:15 in the video, we see a perfect example: a dental patient asking "Can you call the office to find out who called me?" - a question our agent had zero coverage for.

Traditional monitoring approaches fail because human review doesn't scale, and basic analytics miss conversational nuances. Retell's AI QA changes this by automatically evaluating calls against configurable rules and metrics, surfacing both technical and conversational failures.

12% hallucination rate: Our test revealed the AI was making up answers to certain patient questions at an alarming rate - something we would never have caught through random call sampling alone.

How Retell's AI QA Actually Works

Retell's quality assurance system automatically evaluates a sample set of calls using customizable rules and metrics. Unlike basic call recording review, it analyzes both high-level trends (like average latency) and deep call-level diagnostics (such as interruption frequency indicating turn-taking issues).

The tool examines seven critical dimensions of call quality: latency, transcription accuracy, tool call success, hallucination rate, interruption frequency, user sentiment, and agent naturalness. Each metric can be weighted based on your priorities, creating a comprehensive quality score.
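To make the weighting concrete, here is a minimal sketch of how a weighted quality score across those seven dimensions could be combined. The weights, per-metric scores, and the simple weighted-average formula are illustrative assumptions, not Retell's actual internals.

```python
# Illustrative weighted quality score across the seven QA dimensions.
# Scores are normalized 0-1 (higher is better); weights sum to 1.0.
METRICS = {
    # metric: (score, weight) - both values are made up for this sketch
    "latency": (0.85, 0.20),
    "transcription_accuracy": (0.90, 0.15),
    "tool_call_success": (0.95, 0.15),
    "hallucination_rate": (0.88, 0.20),      # stored as 1 - rate
    "interruption_frequency": (0.75, 0.10),
    "user_sentiment": (0.80, 0.10),
    "agent_naturalness": (0.82, 0.10),
}

def quality_score(metrics: dict) -> float:
    """Weighted average of per-metric scores."""
    return sum(score * weight for score, weight in metrics.values())

print(round(quality_score(METRICS), 3))
```

Re-weighting is just a matter of shifting the second tuple element - a support-focused deployment might weight hallucination rate and sentiment more heavily than naturalness.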

Setting Up QA: Step-by-Step Configuration

Configuring Retell's QA begins with selecting which agents to analyze and setting your date range. The system shows estimated analysis costs upfront (100 free minutes, then 10¢/minute). You can filter calls by duration, disconnection reason, or specific outcomes - like focusing only on calls where users said "not interested."
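The upfront cost estimate follows directly from that pricing: the first 100 minutes are free, then each analyzed minute bills at $0.10. A quick sketch (the helper function is ours, not Retell's API):

```python
# Estimate the QA analysis cost shown at setup time.
# Pricing from the article: 100 free minutes, then $0.10/minute.
FREE_MINUTES = 100
RATE_PER_MINUTE = 0.10  # USD

def estimated_cost(total_minutes: float) -> float:
    """Cost in USD for a given number of analyzed call minutes."""
    billable = max(0.0, total_minutes - FREE_MINUTES)
    return round(billable * RATE_PER_MINUTE, 2)

print(estimated_cost(80))    # entirely within the free tier
print(estimated_cost(600))   # 500 billable minutes
```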

The real power comes in defining your success criteria. The system comes pre-loaded with common metrics but allows complete customization. For our dental client, we set:

  • Latency: P50 ≤1.5s, P95 ≤2.5s, P99 ≤3s
  • Word error rate: ≤12%
  • Tool call inaccuracy: ≤5%
  • Hallucination rate: ≤10%
  • Interruptions: ≤10 per call

These thresholds created clear pass/fail benchmarks for automatic evaluation.
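The pass/fail logic itself is simple: every threshold above is an upper bound, so a metric passes when the measured value is at or below it. A minimal sketch of that check, using the dental client's thresholds (metric names and the report shape are our assumptions):

```python
# Pass/fail benchmark check against the dental client's thresholds.
THRESHOLDS = {
    "latency_p50_s": 1.5,
    "latency_p95_s": 2.5,
    "latency_p99_s": 3.0,
    "word_error_rate": 0.12,
    "tool_call_inaccuracy": 0.05,
    "hallucination_rate": 0.10,
    "interruptions_per_call": 10,
}

def evaluate(measured: dict) -> dict:
    """True = pass: the measured value is at or below its upper bound."""
    return {name: measured[name] <= limit for name, limit in THRESHOLDS.items()}

report = evaluate({
    "latency_p50_s": 1.2,
    "latency_p95_s": 2.8,        # fails the P95 bound
    "latency_p99_s": 2.9,
    "word_error_rate": 0.09,
    "tool_call_inaccuracy": 0.03,
    "hallucination_rate": 0.12,  # fails, matching the article's finding
    "interruptions_per_call": 2.55,
})
print([name for name, ok in report.items() if not ok])
```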

The 7 Key Metrics That Reveal Agent Issues

Retell's QA dashboard surfaces insights through seven core metrics, each diagnosing different aspects of agent performance:

  1. Latency percentiles: P50 shows typical experience, while P95/P99 reveal worst-case delays that frustrate users
  2. Transcription accuracy: Word error rate exposes ASR quality issues
  3. Tool call success: Tracks whether API integrations fire correctly
  4. Hallucination rate: Percentage of calls where AI invents incorrect information
  5. Interruption frequency: High counts indicate unnatural flow or slow responses
  6. User sentiment: Positive/negative ratios show customer satisfaction
  7. Agent naturalness: Scores how human-like the conversation feels

Together, these create a complete picture of where your agent needs improvement.
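Metric 1 is worth unpacking: the percentiles are just rank statistics over raw per-response latencies, which is why P50 can look healthy while P95/P99 expose painful outliers. A nearest-rank sketch on fabricated sample data (Retell computes these for you; this only shows what the numbers mean):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Made-up response latencies (seconds) from ten agent turns.
latencies = [0.9, 1.1, 1.2, 1.3, 1.4, 1.5, 1.8, 2.1, 2.6, 4.25]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies, p)}s")
```

Notice how a single 4.25s outlier dominates both P95 and P99 while barely moving P50 - exactly the "typical vs. worst-case" split described above.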

Real Results From Our Dental Client Test

Applying Retell's QA to Dream Dental's voice agent revealed surprising insights. While the agent handled most calls successfully (88%), we discovered:

  • A 12% hallucination rate on certain patient questions
  • Average interruptions of 2.55 per call (indicating responsiveness issues)
  • Worst-case latency spikes to 4.25 seconds
  • Critical knowledge gap: "Can you call the office to find out who called me?" had zero coverage

31 hallucinations in 300 calls: Without systematic QA, these incorrect answers would have continued eroding patient trust. The tool automatically surfaced every instance where the AI fabricated information.

Interpreting the Data: What We Discovered

The QA dashboard's "Top Questions" view proved particularly valuable, showing which patient inquiries were being handled successfully and which weren't. This revealed not just failures, but why they were happening.

For example, high interruption rates correlated with longer latency periods. This indicated we needed to adjust the agent's responsiveness settings to better match human conversation pacing. Meanwhile, the hallucination findings prompted immediate knowledge base updates.

Perhaps most importantly, the tool let us drill into specific problematic calls (like the 4.25s latency example at 7:30 in the video) to understand root causes through full transcripts and audio.

Actionable Insights From QA Analysis

Retell's QA doesn't just identify problems - it provides clear paths to solutions. Our dental client test led to three immediate improvements:

  1. Knowledge base expansion: Added coverage for the "who called me" question and other gaps
  2. Responsiveness adjustment: Increased agent interruptibility to reduce average interruptions
  3. Latency monitoring: Set alerts for calls exceeding 3s response time
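Item 3 can be approximated outside the dashboard with a simple filter over exported call logs. The call-record fields here are assumptions for illustration, not Retell's export schema:

```python
# Flag calls whose slowest response exceeded the 3-second alert threshold.
ALERT_THRESHOLD_S = 3.0

calls = [
    {"id": "c1", "max_response_s": 1.9},
    {"id": "c2", "max_response_s": 4.25},  # the spike from the case study
    {"id": "c3", "max_response_s": 2.7},
]

alerts = [c["id"] for c in calls if c["max_response_s"] > ALERT_THRESHOLD_S]
print(alerts)
```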

The system's ability to track metrics over time (like the rising hallucination rate shown at 5:45) enables continuous improvement. We scheduled bi-weekly QA runs to monitor the impact of these changes.

Watch the Full Tutorial

See the complete walkthrough of Retell's AI QA setup and our dental client results in the video below. At 4:20, we demonstrate how to configure custom success criteria, and at 8:15 you'll see the shocking hallucination examples we uncovered.

[Video: Retell AI QA tool tutorial]

Key Takeaways

Retell's AI QA tool fundamentally changes how businesses monitor and improve voice agents. By automatically analyzing calls against configurable metrics, it surfaces issues that would otherwise go unnoticed - from knowledge gaps to latency spikes to dangerous hallucinations.

In summary: Regular QA testing catches 3x more agent issues than manual review, helps prioritize knowledge base updates, and provides measurable benchmarks for continuous improvement - all for just 10¢ per analyzed minute after the first 100 free minutes.

Frequently Asked Questions

Common questions about Retell's AI QA tool

What is Retell's AI QA tool and how does it work?

Retell's AI QA tool automatically evaluates voice agent calls at scale using configurable rules and metrics. It analyzes call quality, latency, sentiment, tool accuracy, and hallucinations across a sample of calls to identify issues that would otherwise go unnoticed.

The system provides both high-level trend analysis and detailed call-level diagnostics, helping businesses maintain quality as they scale their voice AI deployments.

  • Automatically evaluates call samples
  • Tracks both technical and conversational metrics
  • Surfaces hidden issues like knowledge gaps

What metrics does the QA tool track?

The tool tracks seven key metrics that comprehensively assess agent performance:

These include technical measurements like latency and transcription accuracy, as well as conversational quality indicators like interruption frequency and user sentiment.

  • Latency (P50, P95, P99 response times)
  • Word error rate in transcriptions
  • Tool call inaccuracy rate
  • Agent hallucination rate

How much does Retell's AI QA cost?

Retell provides 100 free minutes of AI QA analysis for new users. After that, the service costs just 10 cents per minute of analyzed call time.

This pricing model makes it affordable to regularly monitor agent performance, with typical monthly costs ranging from $20 to $100 depending on call volume and sampling rate.

  • First 100 minutes free
  • Then $0.10 per analyzed minute
  • No monthly subscription required

What kinds of issues can the tool detect?

The QA tool identifies both obvious and subtle issues affecting voice agent performance. In our tests, it uncovered a 12% hallucination rate on certain questions that manual review had missed.

Common problems detected include knowledge gaps, latency spikes, transcription errors, unnatural conversation flow, and tool integration failures.

  • Knowledge base coverage gaps
  • Technical issues like high latency
  • Conversational flow problems

How do you set up a QA run?

Setup involves four key steps: selecting agents to analyze, defining your date range, configuring success criteria, and setting performance metric thresholds.

The process takes about 15 minutes and allows complete customization of what constitutes a "successful" call for your specific use case.

  • Select agents and date range
  • Define success criteria
  • Set metric thresholds

What did the dental client test reveal?

Testing revealed several critical issues including a 12% hallucination rate on certain patient questions, latency spikes to 4.25 seconds in worst cases, and an average of 2.55 interruptions per call.

Perhaps most importantly, it uncovered a complete knowledge gap around the question "Can you call the office to find out who called me?" which the agent couldn't handle at all.

  • Critical knowledge gaps
  • Latency issues
  • Conversation flow problems

How often should you run QA?

For most businesses, we recommend running QA weekly or bi-weekly. This frequency catches issues quickly while allowing time to implement fixes between tests.

More frequent testing (even daily) may be warranted when making significant agent changes or launching new features, while established agents might scale back to monthly checks.

  • Weekly for new/changing agents
  • Bi-weekly for stable deployments
  • Immediately after major updates

How can GrowwStacks help with voice AI QA?

GrowwStacks specializes in implementing and optimizing voice AI solutions with Retell. We configure QA parameters specific to your use case, analyze results, and implement improvements that typically reduce hallucinations by 80% and improve call success rates.

Our team handles everything from initial setup to ongoing monitoring, ensuring your voice agents deliver consistent, high-quality interactions that build customer trust.

  • Custom QA configuration
  • Performance optimization
  • Ongoing monitoring

Stop Guessing About Your Voice AI Performance

Every day without proper QA means more frustrated customers and eroded trust. Let GrowwStacks implement Retell's AI quality assurance for your business and optimize your voice agents in under 2 weeks.