Conversation

@panv-kw
Owner

panv-kw commented Jul 24, 2025

No description provided.

@github-actions

Diarization Benchmark Results

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| DER | 18.7% | <30% | |
| JER | 22.6% | <25% | |
| RTF | 0.05x | <1.0x | |

Performance Timing

| Stage | Time (s) | % |
|-------|----------|---|
| Model Download | 1.887 | 3.5 |
| Model Compile | 0.705 | 1.3 |
| Audio Load | 0.084 | 0.2 |
| Segmentation | 13.280 | 24.6 |
| Embedding | 37.898 | 70.3 |
| Clustering | 0.025 | 0.0 |
| Total | 53.879 | 100 |
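
The percentage column can be recomputed from the stage times. The sketch below assumes each percentage is the stage time divided by the sum of all stage times; the comment does not state this, and the variable names are illustrative rather than taken from the benchmark code.

```python
# Stage times in seconds, copied from the Performance Timing table.
stage_seconds = {
    "Model Download": 1.887,
    "Model Compile": 0.705,
    "Audio Load": 0.084,
    "Segmentation": 13.280,
    "Embedding": 37.898,
    "Clustering": 0.025,
}

# Assumption: the Total row is the plain sum of the stages.
total_seconds = sum(stage_seconds.values())

# Share of the total per stage, rounded to one decimal like the table.
stage_percent = {
    name: round(100.0 * seconds / total_seconds, 1)
    for name, seconds in stage_seconds.items()
}

print(f"Total: {total_seconds:.3f} s")
for name, percent in stage_percent.items():
    print(f"{name:<15} {percent:>5.1f}%")
```

Under that assumption the recomputed shares match the table: embedding and segmentation together account for roughly 95% of the measured time.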

Research Comparison

| Method | DER | Year |
|--------|-----|------|
| FluidAudio | 18.7% | 2025 |
| Powerset BCE | 18.5% | 2023 |
| EEND | 25.3% | 2019 |
| x-vector clustering | 28.7% | 2018 |

ES2004a • 1049.4s audio • 51.2s inference • Test runtime: 1m 17s • 07/24/2025, 02:10 PM EST
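
The reported RTF is consistent with the inference time and audio duration in the line above. A minimal sketch of that arithmetic, assuming RTF here means inference time divided by audio duration (the comment does not give the formula; `real_time_factor` is a hypothetical helper, not a FluidAudio API):

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Seconds of processing per second of audio; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# Values taken from the ES2004a summary line.
rtf = real_time_factor(inference_seconds=51.2, audio_seconds=1049.4)
print(f"RTF = {rtf:.2f}x")  # prints "RTF = 0.05x"
```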

@github-actions

VAD Benchmark Results

Performance Comparison

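| Metric | FluidAudio VAD | Industry Standard | Status |
|--------|----------------|-------------------|--------|
| Accuracy | 98.0% | 85-90% | |
| Precision | 96.2% | 85-95% | |
| Recall | 100.0% | 80-90% | |
| F1-Score | 98.0% | 85.9% (Sohn's VAD) | |
| Processing Time | 436.0s (100 files) | ~1ms per 30ms chunk | |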
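
The F1-Score row can be cross-checked against the precision and recall rows using the standard definition, the harmonic mean of the two. The sketch below uses the rounded table values, so it lands at about 98.1% rather than exactly 98.0%; the small gap presumably comes from the benchmark working with unrounded values. `f1_score` is an illustrative helper, not the benchmark's own code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Rounded values from the VAD table: precision 96.2%, recall 100.0%.
f1 = f1_score(precision=0.962, recall=1.0)
print(f"F1 = {f1:.1%}")  # prints "F1 = 98.1%"
```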

Industry Leaders:

  • Silero VAD: ~90-95% F1 (DNN-based, 1.8MB model)
  • WebRTC VAD: ~75-80% F1 (GMM-based, fast but lower accuracy)
  • Sohn's VAD: 77.5% F1 (traditional approach)
  • Modern DNNs: 85-97% F1 (varies by SNR conditions)

📊 Detailed Research Comparisons

| Paper | Dataset | F1-Score | Method |
|-------|---------|----------|--------|
| Silero VAD (2021) | TEDx | 88.1% | LSTM-based lightweight model |
| WebRTC VAD | MUSAN | 64.4% | GMM-based (traditional) |
| pyannote.audio (2020) | AMI | 85.9% | SincTDNN architecture |
| MarbleNet (2020) | AVA-Speech | 87.8% | 1D time-channel separable CNN |
| FluidAudio VAD | MUSAN-mini | 98.0% | CoreML-optimized Silero |

Note: Direct comparisons should consider dataset differences. MUSAN contains challenging noise conditions.

@github-actions

ASR Benchmark Results

| Dataset | WER Avg | WER Med | RTFx | Status |
|---------|---------|---------|------|--------|
| test-clean | 5.42% | 0.00% | 1.44x | |
| test-other | 3.60% | 0.00% | 1.08x | |

100 files per dataset • Test runtime: 10m24s • 07/24/2025, 02:20 PM EST
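
A median WER of 0.00% next to a nonzero average is what one would expect if more than half of the files were transcribed without error while a minority carried all of the mistakes. The sketch below illustrates this with a word-level edit distance, assuming WER is computed per file and then averaged; the comment does not spell that out. The transcripts are made-up examples, not benchmark data, and `word_error_rate` is a hypothetical helper.

```python
from statistics import mean, median


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the first i-1 reference words into hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (reference, hypothesis) pairs: two perfect, one with a substitution.
pairs = [
    ("the cat sat down", "the cat sat down"),
    ("hello there world", "hello there world"),
    ("turn the lights off", "turn the light off"),
]
per_file = [word_error_rate(ref, hyp) for ref, hyp in pairs]
print(f"avg = {mean(per_file):.2%}, median = {median(per_file):.2%}")
```

With these three files the median is zero even though the average is not, which mirrors the pattern in the table.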

RTFx = inverse real-time factor, i.e. speed-up over real time (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
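
The RTFx formula above can be sketched as follows. The ratio is taken over totals across all files, as the comment describes, rather than as an average of per-file ratios. The function name and the three-file durations are illustrative assumptions, not part of the benchmark code or its data.

```python
from typing import Sequence


def rtfx(audio_seconds: Sequence[float], processing_seconds: Sequence[float]) -> float:
    """Total audio duration divided by total processing time (higher is faster)."""
    total_audio = sum(audio_seconds)
    total_processing = sum(processing_seconds)
    if total_processing <= 0:
        raise ValueError("total processing time must be positive")
    return total_audio / total_processing


# The example from the comment: 10 s of audio processed in 5 s.
print(f"{rtfx([10.0], [5.0]):.1f}x")  # prints "2.0x"

# Hypothetical three-file run: 30 s of audio processed in 15 s overall.
print(f"{rtfx([12.0, 8.0, 10.0], [4.0, 6.0, 5.0]):.1f}x")  # prints "2.0x"
```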

Note: CI RTFx is degraded by M1/M2 Mac virtualization. An M1 Mac test reached ~28x (clean) and ~25x (other). Testing follows the Hugging Face Open ASR Leaderboard.
