---
layout: default
title: Features
nav_order: 2
permalink: /FEATURES
---
SONATA offers a comprehensive suite of audio transcription and analysis features. This document provides details on each major feature.
## Speech Recognition

SONATA uses WhisperX, an enhanced version of Whisper that provides:
- State-of-the-art transcription accuracy across multiple languages
- Word-level timestamps for precise text alignment
- Support for various Whisper models (tiny, base, small, medium, large, large-v2, large-v3)
- Automatic language detection capabilities
- Model optimization for various hardware (CPU, CUDA, MPS)
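The hardware fallback mentioned above (CUDA, then Apple MPS, then CPU) can be sketched as a small selection routine. The function names here are illustrative, not SONATA's actual API; the `float16`/`int8` pairing is a common WhisperX configuration choice, assumed rather than confirmed from SONATA's source.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

def pick_compute_type(device: str) -> str:
    """Use half precision on GPU backends, int8 quantization on CPU."""
    return "float16" if device in ("cuda", "mps") else "int8"
```

In practice the availability flags would come from the ML framework in use (e.g. a CUDA/MPS capability check) before the Whisper model is loaded.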
## Audio Event Detection

SONATA identifies non-speech sounds, from laughter and crying to ambient noises such as traffic or music. Our system can detect 523 distinct audio events, each with a confidence score.

🔊 See the complete list of detectable audio events
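Confidence scoring means downstream code can filter detections to a desired precision. A minimal sketch of that filtering step, assuming detections are plain dictionaries (the field names `label` and `confidence` are illustrative, not SONATA's output schema):

```python
def filter_events(events, threshold=0.5):
    """Keep detected audio events whose confidence meets the threshold,
    sorted from most to least confident."""
    kept = [e for e in events if e["confidence"] >= threshold]
    return sorted(kept, key=lambda e: e["confidence"], reverse=True)

detections = [
    {"label": "Laughter", "confidence": 0.91},
    {"label": "Traffic noise", "confidence": 0.32},
    {"label": "Music", "confidence": 0.77},
]
```

With the default threshold of 0.5, only the laughter and music detections above survive.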
## Multi-language Support

SONATA supports 10 languages:
- English (en)
- Korean (ko)
- Chinese (zh)
- Japanese (ja)
- French (fr)
- German (de)
- Spanish (es)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
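When a language is specified explicitly rather than auto-detected, it should be validated against this set. A small sketch of that check (the constant and function names are illustrative, not SONATA's API):

```python
SUPPORTED_LANGUAGES = {
    "en": "English", "ko": "Korean", "zh": "Chinese", "ja": "Japanese",
    "fr": "French", "de": "German", "es": "Spanish", "it": "Italian",
    "pt": "Portuguese", "ru": "Russian",
}

def validate_language(code: str) -> str:
    """Normalize an ISO 639-1 code and check it against the supported set."""
    code = code.strip().lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language: {code!r}")
    return code
```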
## Speaker Diarization

- Identification and labeling of different speakers in multi-speaker audio
- Minimum and maximum speaker-count constraints
- Integration with PyAnnote's diarization models
- Speaker-attributed transcripts with formatting options
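Producing a speaker-attributed transcript amounts to matching word timestamps against the speaker turns that diarization emits. A minimal sketch of one common matching rule (assigning each word to the turn containing its midpoint); the dictionary fields here are illustrative, not SONATA's internal representation:

```python
def assign_speakers(words, turns):
    """Attach a speaker label to each word by finding the diarization
    turn that contains the word's midpoint (None if no turn matches)."""
    labeled = []
    for w in words:
        mid = (w["start"] + w["end"]) / 2
        speaker = None
        for t in turns:
            if t["start"] <= mid < t["end"]:
                speaker = t["speaker"]
                break
        labeled.append({**w, "speaker": speaker})
    return labeled
```

In the real pipeline, the turns would come from a PyAnnote diarization pass, which is where the minimum/maximum speaker constraints are applied.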
## Precise Timestamps

- Word-level timestamps for all transcribed content
- Precise timing for audio events
- Multiple output formats with varying levels of timestamp detail
- Support for extracting specific time ranges
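Two of the operations above, rendering a timestamp and extracting a time range, can be sketched in a few lines. Both functions are illustrative helpers, not part of SONATA's documented API:

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def extract_range(words, start, end):
    """Return the words whose timestamps fall entirely inside [start, end]."""
    return [w for w in words if w["start"] >= start and w["end"] <= end]
```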
## Audio Processing

- Audio format conversion for maximum compatibility
- Silence detection and trimming to improve transcription quality
- Audio segmentation for long files
- Custom segment length and overlap controls
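Segmenting a long file with a custom length and overlap reduces to computing the window boundaries. A minimal sketch of that planning step (the function name and default values are illustrative, not SONATA's actual parameters):

```python
def plan_segments(duration, segment_len=30.0, overlap=5.0):
    """Compute (start, end) windows covering `duration` seconds, with
    consecutive segments overlapping by `overlap` seconds."""
    if overlap >= segment_len:
        raise ValueError("overlap must be smaller than segment_len")
    step = segment_len - overlap
    segments, start = [], 0.0
    while start < duration:
        segments.append((start, min(start + segment_len, duration)))
        start += step
    return segments
```

The overlap lets transcription of adjacent segments be stitched back together without losing words cut at a boundary.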
## Output Formats

- Concise: simple text with integrated audio event tags
- Default: text with timestamps
- Extended: text with timestamps and confidence scores
- JSON: structured output with comprehensive metadata
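The difference between the concise and default styles can be sketched as a rendering switch over the same word list; this is an illustrative simplification, not SONATA's actual formatter:

```python
def render(words, style="default"):
    """Render word entries as concise text, or as text with
    per-word timestamps in the default style."""
    if style == "concise":
        return " ".join(w["text"] for w in words)
    if style == "default":
        return " ".join(f"[{w['start']:.2f}] {w['text']}" for w in words)
    raise ValueError(f"unknown style: {style!r}")
```

In the concise style, audio event tags (e.g. `[laughter]`) simply appear inline as part of the word stream.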
## Integration and Usage

- Python API for integration into other applications
- Command-line interface for quick usage
- Batch processing capabilities
- Progress indicators for long-running operations
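A command-line surface with batch input could look like the sketch below. The program name, flag names, and defaults are assumptions for illustration; they are not necessarily SONATA's actual CLI:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative CLI: one or more audio files (batch mode),
    an optional language override, and an output format choice."""
    p = argparse.ArgumentParser(prog="sonata")
    p.add_argument("audio", nargs="+", help="one or more audio files")
    p.add_argument("--language", default=None,
                   help="ISO 639-1 code, e.g. ko (auto-detected if omitted)")
    p.add_argument("--format", default="default",
                   choices=["concise", "default", "extended", "json"])
    return p
```

Accepting multiple positional files is what makes batch processing a single invocation rather than a shell loop.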