224 changes: 224 additions & 0 deletions BENCHMARK.md
@@ -0,0 +1,224 @@
# Benchmark Guide

This guide explains how to run benchmarks to evaluate model performance on your hardware.

## Quick Start

### Install Dependencies

```bash
pip install openvino-genai soundfile numpy
```

Or using uv:

```bash
uv pip install openvino-genai soundfile numpy
```

### Run Benchmarks

#### All Models Comparison

Compare Parakeet V2, V3, and Whisper on your hardware:

```bash
uv run python benchmarks/benchmark_whisper_ov.py
```

#### FLEURS Multilingual Benchmark

Test on specific languages with the FLEURS dataset:

```bash
# English only, 10 samples, NPU device
uv run python benchmarks/benchmark_fleurs.py --languages en_us --samples 10 --device NPU

# Multiple languages, 25 samples each
uv run python benchmarks/benchmark_fleurs.py --languages en_us es_419 fr_fr --samples 25 --device CPU

# All available languages
uv run python benchmarks/benchmark_fleurs.py --all-languages --samples 5 --device NPU
```

**FLEURS Options:**
- `--languages`: Specific language codes (e.g., `en_us`, `es_419`, `fr_fr`)
- `--all-languages`: Test all 24 supported languages
- `--samples`: Number of audio samples per language (default: 10)
- `--device`: Target device - `NPU`, `CPU`, or `GPU`

#### LibriSpeech Benchmark (C++)

For detailed accuracy testing on LibriSpeech test-clean:

```bash
# Build the benchmark
cmake --build build --config Release --target benchmark_librispeech

# Run on 25 files
build/examples/cpp/Release/benchmark_librispeech.exe --max-files 25

# Run on all files (2620 total)
build/examples/cpp/Release/benchmark_librispeech.exe
```

## Benchmark Metrics

### RTFx (Real-Time Factor)

Measures processing speed relative to audio duration (RTFx = audio duration ÷ processing time):
- **RTFx = 1.0**: Processes at real-time speed (1 min audio = 1 min processing)
- **RTFx > 1.0**: Faster than real-time (RTFx = 10 means 1 min audio in 6 seconds)
- **RTFx < 1.0**: Slower than real-time
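
As a quick illustration, RTFx can be computed from the audio length and the measured processing time. The sketch below uses `soundfile` (already listed as a benchmark dependency) to read the duration; the file name is a placeholder.

```python
import time
import soundfile as sf  # listed above as a benchmark dependency

audio_file = "test.wav"  # placeholder path
audio_duration = sf.info(audio_file).duration  # audio length in seconds

start = time.time()
# ... run transcription on audio_file here ...
processing_time = time.time() - start

rtfx = audio_duration / processing_time
print(f"RTFx: {rtfx:.1f}x")
```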

### WER (Word Error Rate)

Measures transcription accuracy:
- **Lower is better**
- Calculated as: `(Substitutions + Deletions + Insertions) / Words in reference × 100`
- Industry standard metric for ASR evaluation
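
A minimal way to compute WER is with the `jiwer` library (the same library used for the numbers in BENCHMARK_RESULTS.md); the reference and hypothesis strings below are purely illustrative.

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"

# jiwer returns a fraction; multiply by 100 for a percentage
wer = jiwer.wer(reference, hypothesis) * 100
print(f"WER: {wer:.1f}%")  # 1 substitution / 9 reference words ≈ 11.1%
```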

### Confidence Score

Per-token confidence from the model:
- **Range**: 0.0 to 1.0 (higher is better)
- Useful for filtering uncertain predictions
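
If the transcription result exposes per-token confidences, uncertain tokens can be filtered before downstream use. The `tokens`, `text`, and `confidence` keys below are assumptions for illustration, not the documented eddy API.

```python
# Hypothetical result layout: a list of tokens, each with text and confidence.
# The real eddy result structure may differ; adapt the keys accordingly.
result = {
    "tokens": [
        {"text": "hello", "confidence": 0.98},
        {"text": "world", "confidence": 0.41},
    ]
}

CONFIDENCE_THRESHOLD = 0.5  # tune for your use case
kept = [t["text"] for t in result["tokens"] if t["confidence"] >= CONFIDENCE_THRESHOLD]
print(" ".join(kept))  # drops low-confidence tokens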

## Benchmark Results

See [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) for detailed performance data on Intel Core Ultra 7 155H.

## Dataset Information

### LibriSpeech

- **Source**: [OpenSLR](http://www.openslr.org/12)
- **License**: CC-BY-4.0
- **Language**: English only
- **Test-clean subset**: 2,620 samples, ~5.4 hours
- **Use case**: High-quality English ASR evaluation

### FLEURS

- **Source**: [Google Research](https://huggingface.co/datasets/google/fleurs)
- **License**: CC-BY-4.0
- **Languages**: 102 (eddy supports 24)
- **Use case**: Multilingual ASR evaluation

## Supported Languages (Parakeet V3)

English, Spanish, Italian, French, German, Dutch, Russian, Polish, Ukrainian, Slovak, Bulgarian, Finnish, Romanian, Croatian, Czech, Swedish, Estonian, Hungarian, Lithuanian, Danish, Maltese, Slovenian, Latvian, Greek

**Language Codes for FLEURS:**
- `en_us` - English
- `es_419` - Spanish
- `it_it` - Italian
- `fr_fr` - French
- `de_de` - German
- `nl_nl` - Dutch
- `ru_ru` - Russian
- `pl_pl` - Polish
- `uk_ua` - Ukrainian
- `sk_sk` - Slovak
- `bg_bg` - Bulgarian
- `fi_fi` - Finnish
- `ro_ro` - Romanian
- `hr_hr` - Croatian
- `cs_cz` - Czech
- `sv_se` - Swedish
- `et_ee` - Estonian
- `hu_hu` - Hungarian
- `lt_lt` - Lithuanian
- `da_dk` - Danish
- `mt_mt` - Maltese
- `sl_si` - Slovenian
- `lv_lv` - Latvian
- `el_gr` - Greek
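
To sweep several of these codes in one run, the FLEURS benchmark script can be driven from a small wrapper. This sketch uses only the CLI flags documented above; the language subset is arbitrary.

```python
import subprocess
import sys

# Arbitrary subset of the FLEURS codes listed above
languages = ["en_us", "de_de", "pl_pl", "el_gr"]

for code in languages:
    # Uses only the CLI flags documented in this guide
    subprocess.run(
        [sys.executable, "benchmarks/benchmark_fleurs.py",
         "--languages", code, "--samples", "10", "--device", "NPU"],
        check=True,
    )
```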

## Custom Benchmarks

### Python API Example

```python
from eddy import ParakeetASR
import time

# Initialize model
asr = ParakeetASR("parakeet-v3", device="NPU")

# Transcribe and measure performance
audio_file = "test.wav"
start_time = time.time()
result = asr.transcribe(audio_file)
elapsed = time.time() - start_time

print(f"Text: {result['text']}")
print(f"Time: {elapsed:.2f}s")
print(f"RTFx: {result['rtfx']:.2f}×")
```
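
Extending the example above to several files and an aggregate RTFx is straightforward. This sketch assumes the same `ParakeetASR` API shown above and uses `soundfile` (an installed dependency) for audio durations; the file names are placeholders.

```python
import time

import soundfile as sf
from eddy import ParakeetASR  # same API as the example above

asr = ParakeetASR("parakeet-v3", device="NPU")
audio_files = ["a.wav", "b.wav", "c.wav"]  # placeholder paths

total_audio = 0.0
total_time = 0.0
for path in audio_files:
    total_audio += sf.info(path).duration  # seconds of audio
    start = time.time()
    asr.transcribe(path)
    total_time += time.time() - start

print(f"Overall RTFx: {total_audio / total_time:.1f}x")
```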

### C++ API Example

See [docs/CPP_API.md](docs/CPP_API.md) for C++ integration examples.

## Hardware Recommendations

### Best Performance: Intel NPU

- **Devices**: Intel Core Ultra (Meteor Lake or newer)
- **Expected RTFx**: 30-40× for Parakeet, 15-20× for Whisper
- **Power efficiency**: Best for battery-powered devices

### CPU Fallback

- **Expected RTFx**: 5-10× for Parakeet, 0.4-0.5× for Whisper
- **Works on**: Any modern x86-64 CPU
- **Use when**: NPU not available

### GPU (Experimental)

- **Expected RTFx**: Varies by GPU (integrated vs discrete)
- **Note**: Best results with discrete GPUs
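
One simple way to fall back gracefully is to query OpenVINO for available devices and prefer the NPU when present. The `ParakeetASR` call mirrors the Python API example above; the selection order is just a suggestion.

```python
from openvino import Core
from eddy import ParakeetASR  # same API as the Python example above

available = Core().available_devices  # e.g. ['CPU', 'GPU', 'NPU']
device = "NPU" if "NPU" in available else ("GPU" if "GPU" in available else "CPU")
print(f"Selected device: {device}")

asr = ParakeetASR("parakeet-v3", device=device)
```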

## Troubleshooting

### Slow Performance

1. Verify OpenVINO 2025.x is installed
2. Check device availability: `parakeet_cli.exe --list-devices`
3. Use `--device NPU` for Intel Core Ultra processors
4. Ensure Release build (Debug is ~10× slower)

### Out of Memory

- Reduce batch size in benchmark scripts
- Use smaller model (V2 instead of V3, or Whisper base instead of large)
- Close other applications

### Dataset Download Issues

LibriSpeech and FLEURS datasets auto-download on first run. If download fails:

```bash
# Manual download
wget https://www.openslr.org/resources/12/test-clean.tar.gz
tar -xzf test-clean.tar.gz

# Or use HuggingFace datasets library
pip install datasets
python -c "from datasets import load_dataset; load_dataset('google/fleurs', 'en_us')"
```

## Contributing Benchmark Results

Share your results with the community:

1. Run benchmarks on your hardware
2. Note your CPU/GPU model and OS
3. Submit results via GitHub Issues or Discord
4. Help us understand performance across different platforms

## Support

- **GitHub Issues**: [github.com/FluidInference/eddy/issues](https://github.com/FluidInference/eddy/issues)
- **Discord**: [discord.gg/WNsvaCtmDe](https://discord.gg/WNsvaCtmDe)
135 changes: 135 additions & 0 deletions BENCHMARK_RESULTS.md
@@ -0,0 +1,135 @@
# Benchmark Results

Comprehensive benchmark results for eddy ASR on LibriSpeech test-clean and FLEURS multilingual datasets.

**Hardware**: Intel Core Ultra 7 155H (Meteor Lake) with Intel AI Boost NPU
**Software**: OpenVINO 2025.3.0
**Normalization**: OpenAI Whisper English normalizer

---

## LibriSpeech test-clean (English)

### Parakeet V2 (English-only, optimized)

| Metric | Value |
|--------|-------|
| **Dataset** | LibriSpeech test-clean |
| **Files processed** | 2,620 |
| **Average WER** | 2.87% |
| **Median WER** | 0.00% |
| **Average CER** | 1.07% |
| **Overall RTFx (NPU)** | 37.8× |
| **Total audio duration** | 19,452.5s (5.4 hours) |
| **Total processing time** | 514.7s |

**Comparison**:
- FluidAudio v2 (CoreML): 2.2% WER, 141× RTFx on M4 Pro
- eddy v2 (OpenVINO NPU): 2.87% WER, 37.8× RTFx on Intel Core Ultra 7 155H

### Parakeet V3 (Multilingual)

| Metric | Value |
|--------|-------|
| **Dataset** | LibriSpeech test-clean |
| **Model** | parakeet-v3 |
| **Device** | NPU |
| **Files processed** | 2,620 |
| **Average WER** | 3.7% |
| **Median WER** | 0.0% |
| **Average CER** | 1.9% |
| **Median CER** | 0.0% |
| **Median RTFx** | 23.5× |
| **Overall RTFx (NPU)** | 25.7× |
| **Total audio duration** | 19,452.5s (5.4 hours) |
| **Total processing time** | 756.4s |
| **Benchmark runtime** | 789.8s |

**Comparison**:
- FluidAudio v3 (CoreML, multilingual): 2.6% WER
- eddy v3 (OpenVINO NPU, multilingual): 3.7% WER

---

## FLEURS Multilingual Benchmark (24 Languages)

**Model**: Parakeet V3
**Device**: NPU
**Dataset**: FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech)

| Language | WER | Ref WER | CER | RTFx | Samples |
|----------|-----|---------|-----|------|---------|
| **Italian (Italy)** | 4.3% | 3.0% | 2.1% | 43.6× | 350 |
| **Spanish (Spain)** | 5.4% | 3.5% | 2.8% | 43.1× | 350 |
| **English (US)** | 6.1% | 4.9% | 3.0% | 41.9× | 350 |
| **German (Germany)** | 7.4% | 5.0% | 2.9% | 42.8× | 350 |
| **French (France)** | 7.7% | 5.2% | 3.2% | 40.6× | 350 |
| **Dutch (Netherlands)** | 9.8% | 7.5% | 3.3% | 37.5× | 350 |
| **Russian (Russia)** | 9.9% | 5.5% | 2.5% | 39.7× | 350 |
| **Polish (Poland)** | 10.5% | 7.3% | 3.1% | 37.3× | 350 |
| **Ukrainian (Ukraine)** | 10.7% | 6.8% | 2.9% | 39.3× | 350 |
| **Slovak (Slovakia)** | 11.1% | 8.8% | 3.5% | 43.7× | 350 |
| **Bulgarian (Bulgaria)** | 16.8% | 12.6% | 4.7% | 41.7× | 350 |
| **Finnish (Finland)** | 16.8% | 13.2% | 3.7% | 41.5× | 918 |
| **Romanian (Romania)** | 17.5% | 12.4% | 5.9% | 38.9× | 883 |
| **Croatian (Croatia)** | 17.8% | 12.5% | 5.8% | 41.0× | 350 |
| **Czech (Czechia)** | 18.5% | 11.0% | 5.3% | 43.1× | 350 |
| **Swedish (Sweden)** | 18.9% | 15.1% | 5.6% | 41.5× | 759 |
| **Hungarian (Hungary)** | 20.7% | 15.7% | 6.4% | 41.1× | 905 |
| **Estonian (Estonia)** | 20.8% | 17.7% | 4.9% | 43.4× | 893 |
| **Lithuanian (Lithuania)** | 24.6% | 20.4% | 6.7% | 40.4× | 986 |
| **Maltese (Malta)** | 25.3% | 20.5% | 9.2% | 41.3× | 926 |
| **Danish (Denmark)** | 25.4% | 18.4% | 9.3% | 44.0× | 930 |
| **Slovenian (Slovenia)** | 28.1% | 24.0% | 9.4% | 38.7× | 834 |
| **Latvian (Latvia)** | 30.6% | 22.8% | 8.1% | 42.6× | 851 |
| **Greek (Greece)** | 42.7% | 20.7% | 15.0% | 37.2× | 650 |

### FLEURS Summary

| Metric | Value |
|--------|-------|
| **Average WER** | 17.0% |
| **Reference WER** | 12.7% |
| **Average CER** | 5.4% |
| **Average RTFx** | 41.1× |
| **Languages** | 24 |
| **Total samples** | 14,085 |

---

## Performance Notes

### Best Performing Languages (WER < 10%)
1. Italian: 4.3%
2. Spanish: 5.4%
3. English: 6.1%
4. German: 7.4%
5. French: 7.7%
6. Dutch: 9.8%
7. Russian: 9.9%

### RTFx Consistency
- NPU performance is very consistent across languages (37-44× RTFx)
- Average RTFx: 41.1× across all 24 languages
- Minimal variance indicates efficient NPU utilization

### Accuracy vs Reference
- Our WER is ~4.3 percentage points higher than the reference WER on average (17.0% vs 12.7%)
- This delta is consistent across most languages
- Likely due to differences in:
- Text normalization approach
- Model quantization (int8 for NPU optimization)
- Greedy vs beam search decoding

---

## Methodology

- **Text Normalization**: OpenAI Whisper English normalizer (industry standard)
- **WER Calculation**: jiwer library
- **Audio Format**: 16kHz mono WAV
- **Inference**: Batch processing with 10-second chunks, 3-second overlap
- **State Management**: LSTM state continuity across chunks
- **Deduplication**: 2D search algorithm at chunk boundaries
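
To make the scoring pipeline concrete, the sketch below combines the Whisper English normalizer (from the openai-whisper package) with `jiwer`, roughly matching how the WER numbers above are normalized and scored; the reference and hypothesis strings are illustrative only.

```python
import jiwer
from whisper.normalizers import EnglishTextNormalizer  # openai-whisper package

normalizer = EnglishTextNormalizer()

reference = "Mr. Brown couldn't believe it was 3 o'clock."
hypothesis = "mister brown couldn't believe it was three o'clock"

# Normalize both sides before scoring so spelling/number variants don't count as errors
ref_norm = normalizer(reference)
hyp_norm = normalizer(hypothesis)

print(f"WER after normalization: {jiwer.wer(ref_norm, hyp_norm) * 100:.1f}%")
```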

See [FLEURS_BENCHMARK.md](FLEURS_BENCHMARK.md) for detailed FLEURS benchmark methodology and implementation.