Introduction

Official repository (aask1357/fastenhancer) of "FastEnhancer: Speed-Optimized Streaming Neural Speech Enhancement."
Paper | Documentation

Install

Please refer to the documentation.

Datasets

Please refer to the documentation.

Training

Please refer to the documentation.

Inference

PyTorch Inference

PyTorch checkpoints and TensorBoard logs are provided in the releases.
Please refer to the documentation for calculating objective metrics.
Please refer to the documentation for PyTorch inference.
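The documentation describes how this repository computes its objective metrics. As a hedged illustration of one of them, the sketch below implements the standard scale-invariant SDR (the SISDR column in the tables) with NumPy. It is not the repository's evaluation code; the function name `si_sdr`, the zero-mean step, and the `eps` guard are assumptions chosen for this example.

```python
import numpy as np


def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB between two 1-D signals of equal length (hypothetical helper)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so DC offsets do not affect the score (a common convention, assumed here).
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference: the projection is the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    # Everything not explained by the scaled reference counts as distortion.
    residual = estimate - target
    return float(10 * np.log10((np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)))
```

Because the reference is rescaled by the projection coefficient, multiplying the enhanced signal by a constant leaves the score (up to `eps`) unchanged, which is the point of the scale-invariant variant.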

ONNXRuntime Inference

ONNX models are provided in the releases.
Please refer to the documentation for streaming inference using ONNXRuntime.
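Streaming inference processes the signal one hop at a time while the model carries its internal state (caches) from one call to the next. The sketch below shows only that control flow. The names `stream_enhance`, `step_fn`, and `delay_one_sample` are hypothetical, and it assumes a model that consumes raw hop-sized chunks; the exported FastEnhancer models' actual inputs, outputs, and cache tensors are described in the documentation and are not assumed here.

```python
from typing import Any, Callable, Tuple

import numpy as np


def stream_enhance(
    audio: np.ndarray,
    hop: int,
    step_fn: Callable[[np.ndarray, Any], Tuple[np.ndarray, Any]],
    init_state: Any,
) -> np.ndarray:
    """Process `audio` one hop at a time, carrying the model state between calls."""
    if len(audio) == 0:
        return audio.copy()
    n_hops = -(-len(audio) // hop)  # ceiling division
    # Zero-pad the tail so the signal splits into whole hops.
    padded = np.zeros(n_hops * hop, dtype=audio.dtype)
    padded[: len(audio)] = audio
    state = init_state
    outputs = []
    for i in range(n_hops):
        chunk = padded[i * hop : (i + 1) * hop]
        # The model sees only the current hop plus whatever it cached in `state`.
        out, state = step_fn(chunk, state)
        outputs.append(out)
    # Drop output samples that correspond to the zero padding.
    return np.concatenate(outputs)[: len(audio)]


def delay_one_sample(chunk: np.ndarray, prev_tail: np.ndarray):
    """Toy stand-in for a streaming model: a one-sample delay whose state is the last sample seen."""
    out = np.concatenate([prev_tail, chunk[:-1]])
    return out, chunk[-1:]
```

For example, `stream_enhance(x, 256, delay_one_sample, np.zeros(1, dtype=x.dtype))` returns `x` delayed by one sample (the hop of 256 is arbitrary here). With ONNX Runtime, `step_fn` would wrap `session.run(None, feed)` on an `onnxruntime.InferenceSession`, building `feed` from the names reported by `session.get_inputs()` and returning the updated cache tensors as the new state.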

Results

Voicebank-Demand 16kHz

Except for GTCRN, we trained each model five times with five different seeds and report the average scores.
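The averaging step amounts to a per-metric mean over the runs. A minimal sketch, assuming each run's scores are stored as a dict from metric name to value (a hypothetical format; the repository's logs may be organized differently), and `average_over_seeds` is a hypothetical helper:

```python
from statistics import mean
from typing import Dict, List


def average_over_seeds(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over runs trained with different random seeds."""
    if not runs:
        raise ValueError("need at least one run")
    # Only average metrics that every run reports.
    metrics = set(runs[0]).intersection(*runs[1:])
    return {name: mean(run[name] for run in runs) for name in sorted(metrics)}
```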

Table 1. Performance on the Voicebank-Demand test set.

Model	Params (K)	MACs	RTF (Xeon)	RTF (M1)	DNSMOS P.808	DNSMOS P.835 SIG	DNSMOS P.835 BAK	DNSMOS P.835 OVL	SCOREQ	SI-SDR (dB)	PESQ	STOI	ESTOI	WER
GTCRN^a	24	40M	0.060	0.042	3.43	3.36	4.02	3.08	0.330	18.8	2.87	0.940	0.848	3.6
LiSenNet^b	37	56M	-	-	3.34	3.30	3.90	2.98	0.425	13.5	3.08	0.938	0.842	3.7
LiSenNet^c	37	56M	0.034	0.028	3.42	3.34	4.03	3.07	0.335	18.5	2.98	0.941	0.851	3.4
FSPEN^d	79	64M	0.046	0.038	3.40	3.33	4.00	3.05	0.324	18.4	3.00	0.942	0.850	3.6
BSRNN^d	334	245M	0.059	0.062	3.44	3.36	4.00	3.07	0.303	18.9	3.06	0.942	0.855	3.4
FastEnhancer_B	92	262M	0.022	0.026	3.47	3.38	4.02	3.10	0.285	19.0	3.13	0.945	0.861	3.2

FastEnhancer_T	22	55M	0.012	0.013	3.42	3.34	4.01	3.06	0.334	18.6	2.99	0.940	0.850	3.6
FastEnhancer_B	92	262M	0.022	0.026	3.47	3.38	4.02	3.10	0.285	19.0	3.13	0.945	0.861	3.2
FastEnhancer_S	195	664M	0.034	0.048	3.49	3.40	4.03	3.12	0.265	19.2	3.19	0.947	0.866	3.2
FastEnhancer_M	492	2.9G	0.101	0.173	3.48	3.39	4.02	3.11	0.243	19.4	3.24	0.950	0.873	2.8
FastEnhancer_L	1105	11G	0.313	0.632	3.53	3.44	4.04	3.16	0.239	19.6	3.26	0.952	0.877	3.1

^a Evaluated using the official checkpoint.
^b Trained using the official training code. Not streamable because of input normalization and Griffin-Lim; therefore, RTFs are not reported.
^c To make the model streamable, input normalization and Griffin-Lim were removed. Trained following the experimental setup of FastEnhancer (same loss function, optimizer, etc.); the only difference is the model architecture.
^d Re-implemented and trained following the experimental setup of FastEnhancer (same loss function, optimizer, etc.); the only difference is the model architecture.
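The real-time factor (RTF) is processing time divided by audio duration, so values below 1 mean faster than real time. This README does not state the exact measurement protocol behind the RTF columns (thread count, chunk size, repetitions), so the sketch below is an assumption rather than the benchmark script: it times a hypothetical `process` callable and keeps the fastest of a few repeats to reduce timing noise.

```python
import time
from typing import Callable

import numpy as np


def real_time_factor(
    process: Callable[[np.ndarray], object],
    audio: np.ndarray,
    sample_rate: int,
    repeats: int = 3,
) -> float:
    """RTF = wall-clock processing time / audio duration (lower is faster)."""
    duration = len(audio) / sample_rate
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process(audio)
        # Keep the fastest run; the first run may include one-time warm-up costs.
        best = min(best, time.perf_counter() - start)
    return best / duration
```

For 16 kHz audio as in the tables, a call would look like `real_time_factor(enhance_fn, audio, 16000)`, where `enhance_fn` is whatever hypothetical wrapper runs the model over the whole clip.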

DNS-Challenge 16kHz

Trained on the DNS-Challenge-3 wideband training dataset:
- excluding emotional_speech and singing_voice;
- adding VCTK-0.92 clean speech, except for speakers p232 and p257;
- without convolving RIRs with the clean speech;
- unlike on Voicebank-Demand, without PESQLoss.
Tested on the DNS-Challenge-1 dev-testset-synthetic-no-reverb dataset.
We trained each model only once, with a single random seed.
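The clean-speech selection in the list above can be sketched as a path filter. This assumes the DNS subset names appear as directory names in the file paths and relies on VCTK file names starting with the speaker ID (e.g. `p225_001_mic1.flac`); the directory layout and the helper `select_clean_speech` are hypothetical. Speakers p232 and p257 are the two speakers of the Voicebank-Demand test set. The sketch does not cover noise mixing.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Subsets and speakers excluded according to the list above.
EXCLUDED_DNS_SUBSETS = {"emotional_speech", "singing_voice"}
EXCLUDED_VCTK_SPEAKERS = {"p232", "p257"}


def select_clean_speech(dns_paths: Iterable[str], vctk_paths: Iterable[str]) -> List[str]:
    """DNS-3 clean speech minus two subsets, plus VCTK minus two speakers."""
    kept = []
    for p in dns_paths:
        # Keep the file only if no path component is an excluded subset directory.
        if EXCLUDED_DNS_SUBSETS.isdisjoint(PurePosixPath(p).parts):
            kept.append(p)
    for p in vctk_paths:
        # The speaker ID is the part of the file name before the first underscore.
        speaker = PurePosixPath(p).name.split("_", 1)[0]
        if speaker not in EXCLUDED_VCTK_SPEAKERS:
            kept.append(p)
    return kept
```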

Table 2. Performance on DNS-Challenge-1 dev-testset-synthetic-no-reverb.

Model	Params (K)	MACs	RTF (Xeon)	RTF (M1)	DNSMOS P.808	DNSMOS P.835 SIG	DNSMOS P.835 BAK	DNSMOS P.835 OVL	SCOREQ	SI-SDR (dB)	PESQ	STOI	ESTOI
GTCRN^a	24	40M	0.060	0.042	3.85	3.35	3.98	3.05	0.551	14.8	2.26	0.934	0.871
LiSenNet^b	37	56M	0.034	0.028	3.82	3.39	4.08	3.14	0.487	16.3	2.58	0.947	0.893
FSPEN^b	79	64M	0.046	0.038	3.82	3.37	4.09	3.13	0.510	15.8	2.43	0.943	0.885
BSRNN^b	334	245M	0.059	0.062	3.89	3.41	4.11	3.18	0.441	16.7	2.61	0.951	0.901
FastEnhancer_B	92	262M	0.022	0.026	3.92	3.43	4.12	3.20	0.396	16.7	2.69	0.953	0.903

FastEnhancer_T	22	55M	0.012	0.013	3.81	3.35	4.07	3.10	0.522	15.4	2.43	0.940	0.879
FastEnhancer_B	92	262M	0.022	0.026	3.92	3.43	4.12	3.20	0.396	16.7	2.69	0.953	0.903
FastEnhancer_S	195	664M	0.034	0.048	3.96	3.46	4.13	3.23	0.373	17.5	2.79	0.960	0.914
FastEnhancer_M	492	2.9G	0.101	0.173	3.98	3.48	4.14	3.26	0.345	18.4	2.78	0.965	0.924
FastEnhancer_L	1105	11G	0.313	0.632	4.02	3.51	4.16	3.29	0.298	19.5	2.94	0.971	0.935

^a Evaluated using the official checkpoint. Note that this model was trained for both noise suppression and de-reverberation, whereas the FastEnhancers were trained for noise suppression only. If GTCRN were trained for noise suppression only, its performance might be higher.
^b Re-implemented and trained following the experimental setup of FastEnhancer (same loss function, optimizer, etc.); the only difference is the model architecture.
