aask1357/fastenhancer

Introduction

Official repository of "FastEnhancer: Speed-Optimized Streaming Neural Speech Enhancement."
Paper | Documentation

Install

Please refer to the documentation.

Datasets

Please refer to the documentation.

Training

Please refer to the documentation.

Inference

PyTorch Inference

PyTorch checkpoints and TensorBoard logs are provided in the releases.
Please refer to the documentation for calculating objective metrics.
Please refer to the documentation for PyTorch inference.

ONNXRuntime Inference

ONNX models are provided in the releases.
Please refer to the documentation for streaming inference using ONNXRuntime.

Results

Voicebank-Demand 16kHz

  • Except for GTCRN, we trained each model five times with five different seeds and report the average scores.

Table 1. Performance on the Voicebank-Demand test set.

| Model | Para. (K) | MACs | RTF (Xeon) | RTF (M1) | DNSMOS (P.808) | DNSMOS (P.835) SIG | DNSMOS (P.835) BAK | DNSMOS (P.835) OVL | SCOREQ | SISDR | PESQ | STOI | ESTOI | WER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GTCRN (a) | 24 | 40M | 0.060 | 0.042 | 3.43 | 3.36 | 4.02 | 3.08 | 0.330 | 18.8 | 2.87 | 0.940 | 0.848 | 3.6 |
| LiSenNet (b) | 37 | 56M | - | - | 3.34 | 3.30 | 3.90 | 2.98 | 0.425 | 13.5 | 3.08 | 0.938 | 0.842 | 3.7 |
| LiSenNet (c) | 37 | 56M | 0.034 | 0.028 | 3.42 | 3.34 | 4.03 | 3.07 | 0.335 | 18.5 | 2.98 | 0.941 | 0.851 | 3.4 |
| FSPEN (d) | 79 | 64M | 0.046 | 0.038 | 3.40 | 3.33 | 4.00 | 3.05 | 0.324 | 18.4 | 3.00 | 0.942 | 0.850 | 3.6 |
| BSRNN (d) | 334 | 245M | 0.059 | 0.062 | 3.44 | 3.36 | 4.00 | 3.07 | 0.303 | 18.9 | 3.06 | 0.942 | 0.855 | 3.4 |
| FastEnhancer_B | 92 | 262M | 0.022 | 0.026 | 3.47 | 3.38 | 4.02 | 3.10 | 0.285 | 19.0 | 3.13 | 0.945 | 0.861 | 3.2 |
| FastEnhancer_T | 22 | 55M | 0.012 | 0.013 | 3.42 | 3.34 | 4.01 | 3.06 | 0.334 | 18.6 | 2.99 | 0.940 | 0.850 | 3.6 |
| FastEnhancer_B | 92 | 262M | 0.022 | 0.026 | 3.47 | 3.38 | 4.02 | 3.10 | 0.285 | 19.0 | 3.13 | 0.945 | 0.861 | 3.2 |
| FastEnhancer_S | 195 | 664M | 0.034 | 0.048 | 3.49 | 3.40 | 4.03 | 3.12 | 0.265 | 19.2 | 3.19 | 0.947 | 0.866 | 3.2 |
| FastEnhancer_M | 492 | 2.9G | 0.101 | 0.173 | 3.48 | 3.39 | 4.02 | 3.11 | 0.243 | 19.4 | 3.24 | 0.950 | 0.873 | 2.8 |
| FastEnhancer_L | 1105 | 11G | 0.313 | 0.632 | 3.53 | 3.44 | 4.04 | 3.16 | 0.239 | 19.6 | 3.26 | 0.952 | 0.877 | 3.1 |

a Evaluated using the official checkpoint.
b Trained using the official training code. This model is not streamable because it relies on input normalization and Griffin-Lim, so RTFs are not reported.
c Input normalization and Griffin-Lim were removed to make the model streamable. Trained following the experimental setup of FastEnhancer (same loss function, same optimizer, etc.); only the model architecture differs.
d Re-implemented and trained following the experimental setup of FastEnhancer (same loss function, same optimizer, etc.); only the model architecture differs.
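
The RTF columns above can be read with a small sketch. It assumes the usual definition of real-time factor (processing time divided by the duration of the audio processed); the exact measurement procedure used for the table is described in the documentation, not here. The names `real_time_factor` and `measure_rtf`, the `process_frame` callable, and the hop size of 256 samples are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Sequence


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(
    process_frame: Callable[[Sequence[float]], object],
    samples: Sequence[float],
    sample_rate: int = 16000,  # 16 kHz, as in the tables
    hop_size: int = 256,       # hypothetical hop size, not taken from the repo
) -> float:
    """Feed `samples` to `process_frame` hop by hop and return the RTF."""
    num_frames = len(samples) // hop_size
    if num_frames == 0:
        raise ValueError("not enough samples for a single frame")
    start = time.perf_counter()
    for i in range(num_frames):
        # Each call stands in for one streaming step of an enhancement model.
        process_frame(samples[i * hop_size:(i + 1) * hop_size])
    elapsed = time.perf_counter() - start
    # Only count the audio that was actually processed (whole hops).
    audio_seconds = num_frames * hop_size / sample_rate
    return real_time_factor(elapsed, audio_seconds)
```

Under this definition, an RTF below 1 means the model runs faster than real time. For example, the 0.022 reported for FastEnhancer_B on Xeon would correspond to roughly 22 ms of compute per second of audio.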

DNS-Challenge 16kHz

  • Trained using DNS-Challenge-3 wideband training dataset.
    • Without emotional_speech and singing_voice.
    • With VCTK-0.92 clean speech, except speakers p232 and p257.
    • RIRs were not convolved with the clean speech.
    • Unlike in Voicebank-Demand, we didn't use PESQLoss.
  • Tested using DNS-Challenge-1 dev-testset-synthetic-no-reverb dataset.
  • We trained each model only once with one random seed.
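
The clean-speech selection rules listed above can be sketched as a simple path filter. This is only an illustration, not the repository's data pipeline: it assumes a hypothetical directory layout in which the DNS subset names and VCTK speaker IDs appear as path components, and the names `keep_clean_file`, `EXCLUDED_DNS_DIRS`, and `EXCLUDED_VCTK_SPEAKERS` are made up for this example.

```python
from pathlib import PurePosixPath

# Subsets and speakers excluded according to the list above.
EXCLUDED_DNS_DIRS = {"emotional_speech", "singing_voice"}
EXCLUDED_VCTK_SPEAKERS = {"p232", "p257"}


def keep_clean_file(path: str) -> bool:
    """Return True if a clean-speech file passes the exclusion rules.

    Assumes (hypothetically) that subset names and speaker IDs are
    directory names somewhere in the path.
    """
    parts = set(PurePosixPath(path).parts)
    if parts & EXCLUDED_DNS_DIRS:
        return False
    if parts & EXCLUDED_VCTK_SPEAKERS:
        return False
    return True


# Example with made-up paths:
candidates = [
    "dns/clean/read_speech/book_0001.wav",
    "dns/clean/emotional_speech/clip_0001.wav",
    "vctk/wav48/p225/p225_001.wav",
    "vctk/wav48/p232/p232_001.wav",
]
selected = [p for p in candidates if keep_clean_file(p)]
```

Adapt the matching logic to the actual layout of the downloaded datasets; the documentation remains the reference for dataset preparation.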

Table 2. Performance on DNS-Challenge-1 dev-testset-synthetic-no-reverb.

| Model | Para. (K) | MACs | RTF (Xeon) | RTF (M1) | DNSMOS (P.808) | DNSMOS (P.835) SIG | DNSMOS (P.835) BAK | DNSMOS (P.835) OVL | SCOREQ | SISDR | PESQ | STOI | ESTOI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GTCRN (a) | 24 | 40M | 0.060 | 0.042 | 3.85 | 3.35 | 3.98 | 3.05 | 0.551 | 14.8 | 2.26 | 0.934 | 0.871 |
| LiSenNet (b) | 37 | 56M | 0.034 | 0.028 | 3.82 | 3.39 | 4.08 | 3.14 | 0.487 | 16.3 | 2.58 | 0.947 | 0.893 |
| FSPEN (b) | 79 | 64M | 0.046 | 0.038 | 3.82 | 3.37 | 4.09 | 3.13 | 0.510 | 15.8 | 2.43 | 0.943 | 0.885 |
| BSRNN (b) | 334 | 245M | 0.059 | 0.062 | 3.89 | 3.41 | 4.11 | 3.18 | 0.441 | 16.7 | 2.61 | 0.951 | 0.901 |
| FastEnhancer_B | 92 | 262M | 0.022 | 0.026 | 3.92 | 3.43 | 4.12 | 3.20 | 0.396 | 16.7 | 2.69 | 0.953 | 0.903 |
| FastEnhancer_T | 22 | 55M | 0.012 | 0.013 | 3.81 | 3.35 | 4.07 | 3.10 | 0.522 | 15.4 | 2.43 | 0.940 | 0.879 |
| FastEnhancer_B | 92 | 262M | 0.022 | 0.026 | 3.92 | 3.43 | 4.12 | 3.20 | 0.396 | 16.7 | 2.69 | 0.953 | 0.903 |
| FastEnhancer_S | 195 | 664M | 0.034 | 0.048 | 3.96 | 3.46 | 4.13 | 3.23 | 0.373 | 17.5 | 2.79 | 0.960 | 0.914 |
| FastEnhancer_M | 492 | 2.9G | 0.101 | 0.173 | 3.98 | 3.48 | 4.14 | 3.26 | 0.345 | 18.4 | 2.78 | 0.965 | 0.924 |
| FastEnhancer_L | 1105 | 11G | 0.313 | 0.632 | 4.02 | 3.51 | 4.16 | 3.29 | 0.298 | 19.5 | 2.94 | 0.971 | 0.935 |

a Evaluated using the official checkpoint. Note that this model was trained for both noise suppression and de-reverberation, whereas the FastEnhancer models were trained only for noise suppression. If GTCRN were trained for noise suppression only, its performance might be higher.
b Re-implemented and trained following the experimental setup of FastEnhancer (same loss function, same optimizer, etc.); only the model architecture differs.
