224 changes: 224 additions & 0 deletions BENCHMARK.md
@@ -0,0 +1,224 @@
# Benchmark Guide

This guide explains how to run benchmarks to evaluate model performance on your hardware.

## Quick Start

### Install Dependencies

```bash
pip install openvino-genai soundfile numpy
```

Or using uv:

```bash
uv pip install openvino-genai soundfile numpy
```

### Run Benchmarks

#### All Models Comparison

Compare Parakeet V2, V3, and Whisper on your hardware:

```bash
uv run python benchmarks/benchmark_whisper_ov.py
```

#### FLEURS Multilingual Benchmark

Test on specific languages with the FLEURS dataset:

```bash
# English only, 10 samples, NPU device
uv run python benchmarks/benchmark_fleurs.py --languages en_us --samples 10 --device NPU

# Multiple languages, 25 samples each
uv run python benchmarks/benchmark_fleurs.py --languages en_us es_419 fr_fr --samples 25 --device CPU

# All available languages
uv run python benchmarks/benchmark_fleurs.py --all-languages --samples 5 --device NPU
```

**FLEURS Options:**
- `--languages`: Specific language codes (e.g., `en_us`, `es_419`, `fr_fr`)
- `--all-languages`: Test all 24 supported languages
- `--samples`: Number of audio samples per language (default: 10)
- `--device`: Target device - `NPU`, `CPU`, or `GPU`

#### LibriSpeech Benchmark (C++)

For detailed accuracy testing on LibriSpeech test-clean:

```bash
# Build the benchmark
cmake --build build --config Release --target benchmark_librispeech

# Run on 25 files
build/examples/cpp/Release/benchmark_librispeech.exe --max-files 25

# Run on all files (2620 total)
build/examples/cpp/Release/benchmark_librispeech.exe
```

## Benchmark Metrics

### RTFx (Real-Time Factor)

Measures processing speed relative to audio duration (RTFx = audio duration ÷ processing time):
- **RTFx = 1.0**: Processes at real-time speed (1 min audio = 1 min processing)
- **RTFx > 1.0**: Faster than real-time (RTFx = 10 means 1 min audio in 6 seconds)
- **RTFx < 1.0**: Slower than real-time
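
As a quick illustration, RTFx can be computed from the audio length and the measured processing time. The sketch below uses `soundfile` (already listed as a benchmark dependency) to read the duration; the file name is a placeholder.

```python
import time
import soundfile as sf  # listed above as a benchmark dependency

audio_file = "test.wav"  # placeholder path
audio_duration = sf.info(audio_file).duration  # audio length in seconds

start = time.time()
# ... run transcription on audio_file here ...
processing_time = time.time() - start

rtfx = audio_duration / processing_time
print(f"RTFx: {rtfx:.1f}x")
```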

### WER (Word Error Rate)

Measures transcription accuracy:
- **Lower is better**
- Calculated as: `(Substitutions + Deletions + Insertions) / Words in reference × 100`
- Industry standard metric for ASR evaluation
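
A minimal way to compute WER is with the `jiwer` library (the same library used for the numbers in BENCHMARK_RESULTS.md); the reference and hypothesis strings below are purely illustrative.

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"

# jiwer returns a fraction; multiply by 100 for a percentage
wer = jiwer.wer(reference, hypothesis) * 100
print(f"WER: {wer:.1f}%")  # 1 substitution / 9 reference words ≈ 11.1%
```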

### Confidence Score

Per-token confidence from the model:
- **Range**: 0.0 to 1.0 (higher is better)
- Useful for filtering uncertain predictions
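
If the transcription result exposes per-token confidences, uncertain tokens can be filtered before downstream use. The `tokens`, `text`, and `confidence` keys below are assumptions for illustration, not the documented eddy API.

```python
# Hypothetical result layout: a list of tokens, each with text and confidence.
# The real eddy result structure may differ; adapt the keys accordingly.
result = {
    "tokens": [
        {"text": "hello", "confidence": 0.98},
        {"text": "world", "confidence": 0.41},
    ]
}

CONFIDENCE_THRESHOLD = 0.5  # tune for your use case
kept = [t["text"] for t in result["tokens"] if t["confidence"] >= CONFIDENCE_THRESHOLD]
print(" ".join(kept))  # drops low-confidence tokens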

## Benchmark Results

See [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) for detailed performance data on Intel Core Ultra 7 155H.

## Dataset Information

### LibriSpeech

- **Source**: [OpenSLR](http://www.openslr.org/12)
- **License**: CC-BY-4.0
- **Language**: English only
- **Test-clean subset**: 2,620 samples, ~5.4 hours
- **Use case**: High-quality English ASR evaluation

### FLEURS

- **Source**: [Google Research](https://huggingface.co/datasets/google/fleurs)
- **License**: CC-BY-4.0
- **Languages**: 102 (eddy supports 24)
- **Use case**: Multilingual ASR evaluation

## Supported Languages (Parakeet V3)

English, Spanish, Italian, French, German, Dutch, Russian, Polish, Ukrainian, Slovak, Bulgarian, Finnish, Romanian, Croatian, Czech, Swedish, Estonian, Hungarian, Lithuanian, Danish, Maltese, Slovenian, Latvian, Greek

**Language Codes for FLEURS:**
- `en_us` - English
- `es_419` - Spanish
- `it_it` - Italian
- `fr_fr` - French
- `de_de` - German
- `nl_nl` - Dutch
- `ru_ru` - Russian
- `pl_pl` - Polish
- `uk_ua` - Ukrainian
- `sk_sk` - Slovak
- `bg_bg` - Bulgarian
- `fi_fi` - Finnish
- `ro_ro` - Romanian
- `hr_hr` - Croatian
- `cs_cz` - Czech
- `sv_se` - Swedish
- `et_ee` - Estonian
- `hu_hu` - Hungarian
- `lt_lt` - Lithuanian
- `da_dk` - Danish
- `mt_mt` - Maltese
- `sl_si` - Slovenian
- `lv_lv` - Latvian
- `el_gr` - Greek
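
To sweep several of these codes in one run, the FLEURS benchmark script can be driven from a small wrapper. This sketch uses only the CLI flags documented above; the language subset is arbitrary.

```python
import subprocess
import sys

# Arbitrary subset of the FLEURS codes listed above
languages = ["en_us", "de_de", "pl_pl", "el_gr"]

for code in languages:
    # Uses only the CLI flags documented in this guide
    subprocess.run(
        [sys.executable, "benchmarks/benchmark_fleurs.py",
         "--languages", code, "--samples", "10", "--device", "NPU"],
        check=True,
    )
```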

## Custom Benchmarks

### Python API Example

```python
from eddy import ParakeetASR
import time

# Initialize model
asr = ParakeetASR("parakeet-v3", device="NPU")

# Transcribe and measure performance
audio_file = "test.wav"
start_time = time.time()
result = asr.transcribe(audio_file)
elapsed = time.time() - start_time

print(f"Text: {result['text']}")
print(f"Time: {elapsed:.2f}s")
print(f"RTFx: {result['rtfx']:.2f}×")
```
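
Extending the example above to several files and an aggregate RTFx is straightforward. This sketch assumes the same `ParakeetASR` API shown above and uses `soundfile` (an installed dependency) for audio durations; the file names are placeholders.

```python
import time

import soundfile as sf
from eddy import ParakeetASR  # same API as the example above

asr = ParakeetASR("parakeet-v3", device="NPU")
audio_files = ["a.wav", "b.wav", "c.wav"]  # placeholder paths

total_audio = 0.0
total_time = 0.0
for path in audio_files:
    total_audio += sf.info(path).duration  # seconds of audio
    start = time.time()
    asr.transcribe(path)
    total_time += time.time() - start

print(f"Overall RTFx: {total_audio / total_time:.1f}x")
```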

### C++ API Example

See [docs/CPP_API.md](docs/CPP_API.md) for C++ integration examples.

## Hardware Recommendations

### Best Performance: Intel NPU

- **Devices**: Intel Core Ultra (Meteor Lake or newer)
- **Expected RTFx**: 30-40× for Parakeet, 15-20× for Whisper
- **Power efficiency**: Best for battery-powered devices

### CPU Fallback

- **Expected RTFx**: 5-10× for Parakeet, 0.4-0.5× for Whisper
- **Works on**: Any modern x86-64 CPU
- **Use when**: NPU not available

### GPU (Experimental)

- **Expected RTFx**: Varies by GPU (integrated vs discrete)
- **Note**: Best results with discrete GPUs
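
One simple way to fall back gracefully is to query OpenVINO for available devices and prefer the NPU when present. The `ParakeetASR` call mirrors the Python API example above; the selection order is just a suggestion.

```python
from openvino import Core
from eddy import ParakeetASR  # same API as the Python example above

available = Core().available_devices  # e.g. ['CPU', 'GPU', 'NPU']
device = "NPU" if "NPU" in available else ("GPU" if "GPU" in available else "CPU")
print(f"Selected device: {device}")

asr = ParakeetASR("parakeet-v3", device=device)
```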

## Troubleshooting

### Slow Performance

1. Verify OpenVINO 2025.x is installed
2. Check device availability: `parakeet_cli.exe --list-devices`
3. Use `--device NPU` for Intel Core Ultra processors
4. Ensure Release build (Debug is ~10× slower)

### Out of Memory

- Reduce batch size in benchmark scripts
- Use smaller model (V2 instead of V3, or Whisper base instead of large)
- Close other applications

### Dataset Download Issues

LibriSpeech and FLEURS datasets auto-download on first run. If download fails:

```bash
# Manual download
wget https://www.openslr.org/resources/12/test-clean.tar.gz
tar -xzf test-clean.tar.gz

# Or use HuggingFace datasets library
pip install datasets
python -c "from datasets import load_dataset; load_dataset('google/fleurs', 'en_us')"
```

## Contributing Benchmark Results

Share your results with the community:

1. Run benchmarks on your hardware
2. Note your CPU/GPU model and OS
3. Submit results via GitHub Issues or Discord
4. Help us understand performance across different platforms

## Support

- **GitHub Issues**: [github.com/FluidInference/eddy/issues](https://github.com/FluidInference/eddy/issues)
- **Discord**: [discord.gg/WNsvaCtmDe](https://discord.gg/WNsvaCtmDe)
135 changes: 135 additions & 0 deletions BENCHMARK_RESULTS.md
@@ -0,0 +1,135 @@
# Benchmark Results

Comprehensive benchmark results for eddy ASR on LibriSpeech test-clean and FLEURS multilingual datasets.

**Hardware**: Intel Core Ultra 7 155H (Meteor Lake) with Intel AI Boost NPU
**Software**: OpenVINO 2025.3.0
**Normalization**: OpenAI Whisper English normalizer

---

## LibriSpeech test-clean (English)

### Parakeet V2 (English-only, optimized)

| Metric | Value |
|--------|-------|
| **Dataset** | LibriSpeech test-clean |
| **Files processed** | 2,620 |
| **Average WER** | 2.87% |
| **Median WER** | 0.00% |
| **Average CER** | 1.07% |
| **Overall RTFx (NPU)** | 37.8× |
| **Total audio duration** | 19,452.5s (5.4 hours) |
| **Total processing time** | 514.7s |

**Comparison**:
- FluidAudio v2 (CoreML): 2.2% WER, 141× RTFx on M4 Pro
- eddy v2 (OpenVINO NPU): 2.87% WER, 37.8× RTFx on Intel Core Ultra 7 155H

### Parakeet V3 (Multilingual)

| Metric | Value |
|--------|-------|
| **Dataset** | LibriSpeech test-clean |
| **Model** | parakeet-v3 |
| **Device** | NPU |
| **Files processed** | 2,620 |
| **Average WER** | 3.7% |
| **Median WER** | 0.0% |
| **Average CER** | 1.9% |
| **Median CER** | 0.0% |
| **Median RTFx** | 23.5× |
| **Overall RTFx (NPU)** | 25.7× |
| **Total audio duration** | 19,452.5s (5.4 hours) |
| **Total processing time** | 756.4s |
| **Benchmark runtime** | 789.8s |

**Comparison**:
- FluidAudio v3 (CoreML, multilingual): 2.6% WER
- eddy v3 (OpenVINO NPU, multilingual): 3.7% WER

---

## FLEURS Multilingual Benchmark (24 Languages)

**Model**: Parakeet V3
**Device**: NPU
**Dataset**: FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech)

| Language | WER | Ref WER | CER | RTFx | Samples |
|----------|-----|---------|-----|------|---------|
| **Italian (Italy)** | 4.3% | 3.0% | 2.1% | 43.6× | 350 |
| **Spanish (Spain)** | 5.4% | 3.5% | 2.8% | 43.1× | 350 |
| **English (US)** | 6.1% | 4.9% | 3.0% | 41.9× | 350 |
| **German (Germany)** | 7.4% | 5.0% | 2.9% | 42.8× | 350 |
| **French (France)** | 7.7% | 5.2% | 3.2% | 40.6× | 350 |
| **Dutch (Netherlands)** | 9.8% | 7.5% | 3.3% | 37.5× | 350 |
| **Russian (Russia)** | 9.9% | 5.5% | 2.5% | 39.7× | 350 |
| **Polish (Poland)** | 10.5% | 7.3% | 3.1% | 37.3× | 350 |
| **Ukrainian (Ukraine)** | 10.7% | 6.8% | 2.9% | 39.3× | 350 |
| **Slovak (Slovakia)** | 11.1% | 8.8% | 3.5% | 43.7× | 350 |
| **Bulgarian (Bulgaria)** | 16.8% | 12.6% | 4.7% | 41.7× | 350 |
| **Finnish (Finland)** | 16.8% | 13.2% | 3.7% | 41.5× | 918 |
| **Romanian (Romania)** | 17.5% | 12.4% | 5.9% | 38.9× | 883 |
| **Croatian (Croatia)** | 17.8% | 12.5% | 5.8% | 41.0× | 350 |
| **Czech (Czechia)** | 18.5% | 11.0% | 5.3% | 43.1× | 350 |
| **Swedish (Sweden)** | 18.9% | 15.1% | 5.6% | 41.5× | 759 |
| **Hungarian (Hungary)** | 20.7% | 15.7% | 6.4% | 41.1× | 905 |
| **Estonian (Estonia)** | 20.8% | 17.7% | 4.9% | 43.4× | 893 |
| **Lithuanian (Lithuania)** | 24.6% | 20.4% | 6.7% | 40.4× | 986 |
| **Maltese (Malta)** | 25.3% | 20.5% | 9.2% | 41.3× | 926 |
| **Danish (Denmark)** | 25.4% | 18.4% | 9.3% | 44.0× | 930 |
| **Slovenian (Slovenia)** | 28.1% | 24.0% | 9.4% | 38.7× | 834 |
| **Latvian (Latvia)** | 30.6% | 22.8% | 8.1% | 42.6× | 851 |
| **Greek (Greece)** | 42.7% | 20.7% | 15.0% | 37.2× | 650 |

### FLEURS Summary

| Metric | Value |
|--------|-------|
| **Average WER** | 17.0% |
| **Reference WER** | 12.7% |
| **Average CER** | 5.4% |
| **Average RTFx** | 41.1× |
| **Languages** | 24 |
| **Total samples** | 14,085 |

---

## Performance Notes

### Best Performing Languages (WER < 10%)
1. Italian: 4.3%
2. Spanish: 5.4%
3. English: 6.1%
4. German: 7.4%
5. French: 7.7%
6. Dutch: 9.8%
7. Russian: 9.9%

### RTFx Consistency
- NPU performance is very consistent across languages (37-44× RTFx)
- Average RTFx: 41.1× across all 24 languages
- Minimal variance indicates efficient NPU utilization

### Accuracy vs Reference
- Our WER is ~4.3 percentage points higher than the reference WER on average (17.0% vs 12.7%)
- This delta is consistent across most languages
- Likely due to differences in:
- Text normalization approach
- Model quantization (int8 for NPU optimization)
- Greedy vs beam search decoding

---

## Methodology

- **Text Normalization**: OpenAI Whisper English normalizer (industry standard)
- **WER Calculation**: jiwer library
- **Audio Format**: 16kHz mono WAV
- **Inference**: Batch processing with 10-second chunks, 3-second overlap
- **State Management**: LSTM state continuity across chunks
- **Deduplication**: 2D search algorithm at chunk boundaries
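
To make the scoring pipeline concrete, the sketch below combines the Whisper English normalizer (from the openai-whisper package) with `jiwer`, roughly matching how the WER numbers above are normalized and scored; the reference and hypothesis strings are illustrative only.

```python
import jiwer
from whisper.normalizers import EnglishTextNormalizer  # openai-whisper package

normalizer = EnglishTextNormalizer()

reference = "Mr. Brown couldn't believe it was 3 o'clock."
hypothesis = "mister brown couldn't believe it was three o'clock"

# Normalize both sides before scoring so spelling/number variants don't count as errors
ref_norm = normalizer(reference)
hyp_norm = normalizer(hypothesis)

print(f"WER after normalization: {jiwer.wer(ref_norm, hyp_norm) * 100:.1f}%")
```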

See [FLEURS_BENCHMARK.md](FLEURS_BENCHMARK.md) for detailed FLEURS benchmark methodology and implementation.