Model loading 2 rebase #5

panv-kw · 2025-07-24T18:07:31Z

No description provided.

github-actions · 2025-07-24T18:10:22Z

Diarization Benchmark Results

Metric	Value	Target	Status
DER	18.7%	<30%	✅
JER	22.6%	<25%	✅
RTF	0.05x	<1.0x	✅

Performance Timing

Stage	Time (s)	%
Model Download	1.887	3.5
Model Compile	0.705	1.3
Audio Load	0.084	0.2
Segmentation	13.280	24.6
Embedding	37.898	70.3
Clustering	0.025	0.0
Total	53.879	100
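
The percentage column can be reproduced from the stage times. A minimal sketch in Python, with the values copied from the table above; the variable names are illustrative and not part of the benchmark tooling:

```python
# Stage timings in seconds, copied from the Performance Timing table.
stage_seconds = {
    "Model Download": 1.887,
    "Model Compile": 0.705,
    "Audio Load": 0.084,
    "Segmentation": 13.280,
    "Embedding": 37.898,
    "Clustering": 0.025,
}

# The Total row is the sum of the individual stages.
total = sum(stage_seconds.values())

# Share of total runtime per stage, rounded to one decimal as in the table.
shares = {stage: round(100 * secs / total, 1) for stage, secs in stage_seconds.items()}

for stage, secs in stage_seconds.items():
    print(f"{stage:<15} {secs:>7.3f} s {shares[stage]:>5.1f}%")
print(f"{'Total':<15} {total:>7.3f} s")
```

Embedding and segmentation together account for roughly 95% of the run, so they are the stages where optimization would pay off most.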

Research Comparison

Method	DER	Year
FluidAudio	18.7%	2025
Powerset BCE	18.5%	2023
EEND	25.3%	2019
x-vector clustering	28.7%	2018

ES2004a • 1049.4s audio • 51.2s inference • Test runtime: 1m 17s • 07/24/2025, 02:10 PM EST
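
The RTF of 0.05x in the metrics table is consistent with the audio and inference durations reported for ES2004a, assuming RTF is defined as inference time divided by audio duration (the comment does not state the formula). A minimal sketch, with an illustrative function name:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed definition: processing time / audio duration.

    Values below 1.0 mean faster than real time (lower is better).
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures reported for the ES2004a run: 51.2 s inference, 1049.4 s audio.
rtf = real_time_factor(51.2, 1049.4)
print(f"RTF = {rtf:.2f}x")
```

The 51.2 s inference figure also matches the sum of the Segmentation, Embedding, and Clustering stages (13.280 + 37.898 + 0.025 = 51.203 s), which suggests that model download, compile, and audio load are excluded from it.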

github-actions · 2025-07-24T18:19:00Z

VAD Benchmark Results

Performance Comparison

Metric	FluidAudio VAD	Industry Standard	Status
Accuracy	98.0%	85-90%	✅
Precision	96.2%	85-95%	✅
Recall	100.0%	80-90%	✅
F1-Score	98.0%	85.9% (Sohn's VAD)	✅
Processing Time	436.0s (100 files)	~1ms per 30ms chunk	✅

Industry Leaders:

Silero VAD: ~90-95% F1 (DNN-based, 1.8MB model)
WebRTC VAD: ~75-80% F1 (GMM-based, fast but lower accuracy)
Sohn's VAD: 77.5% F1 (traditional approach)
Modern DNNs: 85-97% F1 (varies by SNR conditions)

📊 Detailed Research Comparisons

Paper	Dataset	F1-Score	Method
Silero VAD (2021)	TEDx	88.1%	LSTM-based lightweight model
WebRTC VAD	MUSAN	64.4%	GMM-based (traditional)
pyannote.audio (2020)	AMI	85.9%	SincTDNN architecture
MarbleNet (2020)	AVA-Speech	87.8%	1D time-channel separable CNN
FluidAudio VAD	MUSAN-mini	98.0%	CoreML-optimized Silero

Note: Direct comparisons should consider dataset differences. MUSAN contains challenging noise conditions.
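
As a sanity check, the F1-Score can be recomputed from the reported precision and recall using the standard harmonic-mean formula. This is a sketch using the rounded table values, not the benchmark's own code:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values from the Performance Comparison table.
precision = 0.962
recall = 1.000

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")
```

Recomputing from the rounded inputs gives about 0.981, a tenth of a point above the reported 98.0%. A gap of that size can be explained by rounding of the precision value alone.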

github-actions · 2025-07-24T18:20:17Z

ASR Benchmark Results

Dataset	WER Avg	WER Med	RTFx	Status
test-clean	5.42%	0.00%	1.44x	✅
test-other	3.60%	0.00%	1.08x	✅

100 files per dataset • Test runtime: 10m24s • 07/24/2025, 02:20 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Note: CI RTFx degraded by M1/M2 Mac virtualization. M1 Mac test: ~28x (clean), ~25x (other). Testing per HuggingFace Open ASR Leaderboard.
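
The RTFx calculation described above can be sketched as a small function. The function name is illustrative, and the inputs are the worked example from the note rather than measured values:

```python
def rtfx(total_audio_seconds: float, total_processing_seconds: float) -> float:
    """RTFx as defined in the note: total audio duration / total processing time.

    Higher is better; 1.0 means processing runs exactly at real time.
    """
    if total_processing_seconds <= 0:
        raise ValueError("total_processing_seconds must be positive")
    return total_audio_seconds / total_processing_seconds


# Worked example from the note: 10 s of audio processed in 5 s.
example = rtfx(10.0, 5.0)
print(f"RTFx = {example:.1f}x")
```

Under this definition, RTFx is the reciprocal of a processing-time-over-audio-time RTF, so the two metrics rank results in opposite directions.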

panv-kw added 3 commits July 24, 2025 19:58

Refactor diarization model loading into DiarizationModels type

46ebff3

Allow custom configuration in DiarizerModels, update repo path, improve model loading tests

0fd0c1c

Add a models parameter to DiarizerManager.initialize

470054d
