Skip to content

[DOC] - AI error handling patterns #323

Open
@briankwest

Description

@briankwest

AI Error Handling and User-Facing Messages Documentation

Executive Summary

This document outlines the comprehensive error handling strategy implemented in the SignalWire AI system, focusing on user-facing error messages delivered via ais_say() and the conditions that trigger session termination. The system employs a graceful degradation approach with polite, human-like error messages to maintain user experience even during failures.

Error Message Categories

1. Fatal System Errors (Session Terminating)

1.1 "I'm sorry I am having a migrane. I have to hangup now."

Context: Fatal error in AI text generation/streaming
Trigger Conditions:

  • Fatal error flag is set during AI response processing
  • Occurs in ai_send_text() function during streaming operations
  • Typically indicates severe API communication failures or malformed responses

Technical Details:

  • Sets ais->offhook = 0 to terminate the call
  • Fires a calling.ai.error event with fatal flag
  • Logs error details to system logs
  • Used when the AI system cannot recover from the error

1.2 "I'm sorry I am having a problem. I have to hangup now."

Context: Maximum retry attempts exceeded
Trigger Conditions:

  • Error count exceeds settings->max_tries (default: 3, post-processing: 10)
  • Only triggered during streaming operations (if (stream))
  • Indicates persistent communication or processing failures

Technical Details:

  • Only spoken during streaming mode
  • Implements exponential backoff retry logic before this message
  • Sets ais->offhook = 0 to terminate the call
  • Default max_tries is 3 for normal operations, 10 for post-processing

2. Recoverable Errors (Session Continuing)

2.1 "I'm sorry, can you hold on for a second?"

Context: First retry attempt
Trigger Conditions:

  • errors == 1 (first error encountered)
  • System will retry the operation after this message
  • Implements exponential backoff (waits errors seconds before retry)

Technical Details:

  • Polite way to indicate temporary processing delay
  • Implements exponential backoff (1 second, then 2 seconds, etc.)
  • Session continues after retry delay
  • Does not terminate the call

3. Voice/TTS Configuration Errors (Non-Terminating)

3.1 "There was an error with your voice selection, please check your syntax. using a default voice instead."

Context: Voice configuration failure with fallback
Trigger Conditions:

  • TTS engine initialization fails with specified voice parameters
  • System falls back to default Google Cloud voice (gcloud, en-US-Neural2-J)
  • Occurs during session initialization and runtime voice changes

Technical Details:

  • Graceful degradation to default voice
  • Session continues with fallback configuration
  • Provides user feedback about configuration issue
  • Suggests syntax checking for voice parameters

3.2 "There was an error with the current voice, sorry for the trouble."

Context: Runtime voice switching failure
Trigger Conditions:

  • Voice switching fails during active conversation
  • Occurs in the output thread during TTS processing
  • System attempts to recover by switching to default voice

Technical Details:

  • Attempts fallback to default voice
  • If fallback fails, sets ais->running = 0
  • More apologetic tone for runtime interruption
  • Indicates temporary service disruption

Session Termination Patterns

Termination Triggers (ais->offhook = 0)

The system terminates sessions by setting ais->offhook = 0 in the following scenarios:

  1. Fatal AI Processing Errors

    • Unrecoverable API communication failures
    • Malformed response data
    • Critical system errors
  2. Maximum Retry Exceeded

    • Persistent communication failures
    • Repeated processing errors
    • System reliability threshold exceeded
  3. Function-Triggered Hangup

    • Explicit hangup command from AI functions
    • Controlled session termination
    • User or system-initiated disconnect
  4. Transfer Operations

    • Call transfer scenarios
    • Session handoff to other systems
    • Controlled termination for routing

Error Recovery Mechanisms

Retry Logic

  • Exponential Backoff: Wait time increases with each retry (1s, 2s, 3s...)
  • Maximum Attempts: Configurable via max_tries (default: 3, post-processing: 10)
  • Graceful Messages: User-friendly notifications during retry attempts

Fallback Strategies

  • Default Voice: Falls back to gcloud en-US-Neural2-J on voice failures
  • Service Degradation: Continues with reduced functionality rather than termination
  • Error Logging: Comprehensive logging for debugging while maintaining user experience

Event System

  • Error Events: Fires calling.ai.error events for external monitoring
  • Fatal Flag: Distinguishes between recoverable and fatal errors
  • Structured Data: JSON error objects with detailed information

Best Practices Implemented

User Experience

  1. Polite Language: All error messages use apologetic, human-like language
  2. Clear Communication: Messages explain the situation without technical jargon
  3. Expectation Setting: Users are informed about delays and recovery attempts
  4. Graceful Degradation: System continues operation when possible

Technical Robustness

  1. Comprehensive Logging: All errors logged with appropriate severity levels
  2. Event Notification: External systems notified of error conditions
  3. Resource Cleanup: Proper memory and resource management during errors
  4. State Management: Consistent session state during error conditions

Error Classification

  1. Fatal vs Recoverable: Clear distinction between error types
  2. Context Awareness: Different handling for different operational contexts
  3. Retry Strategies: Appropriate retry logic for different error types
  4. Fallback Mechanisms: Multiple levels of service degradation

Configuration Parameters

Retry Settings

  • max_tries: Maximum retry attempts (default: 3, post-processing: 10)
  • Exponential backoff timing: errors * 1 second

Voice Fallback

  • Default engine: gcloud
  • Default voice: en-US-Neural2-J
  • Automatic fallback on configuration errors

Error Reporting

  • Event type: calling.ai.error
  • Includes fatal flag and error details
  • Structured JSON error objects

Monitoring and Debugging

Log Levels

  • ERROR: Fatal conditions and session terminations
  • WARNING: Retry attempts and recoverable errors
  • INFO: General error recovery information
  • DEBUG: Detailed technical information

Event Monitoring

  • Monitor calling.ai.error events for system health
  • Track fatal vs non-fatal error ratios
  • Analyze retry patterns and success rates

Performance Metrics

  • Track error rates by error type
  • Monitor retry success rates
  • Measure recovery times and user impact

Conclusion

The SignalWire AI system implements a sophisticated error handling strategy that prioritizes user experience while maintaining system reliability. The use of human-like error messages, graceful degradation, and comprehensive retry logic ensures that users receive a professional experience even during system difficulties. The clear separation between fatal and recoverable errors allows for appropriate response strategies while maintaining system stability.

The error handling patterns demonstrate a mature approach to production AI system reliability, with comprehensive logging, event notification, and fallback mechanisms that ensure both user satisfaction and system maintainability.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions