Official repository of "FastEnhancer: Speed-Optimized Streaming Neural Speech Enhancement."
Paper | Documentation
Please refer to the documentation.
PyTorch checkpoints and TensorBoard logs are provided in the releases.
Please refer to the documentation for calculating objective metrics.
Please refer to the documentation for PyTorch inference.
ONNX models are provided in the releases.
Please refer to the documentation for streaming inference using ONNXRuntime.
- Except for GTCRN, we trained each model five times with five different seeds and report the average scores.
Table 1. Performance on the Voicebank-Demand test set.
| Model | Para. (K) | MACs | RTF (Xeon) | RTF (M1) | DNSMOS (P.808) | DNSMOS (P.835) SIG | DNSMOS (P.835) BAK | DNSMOS (P.835) OVL | SCOREQ | SISDR | PESQ | STOI | ESTOI | WER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GTCRN<sup>a</sup> | 24 | 40M | 0.060 | 0.042 | 3.43 | 3.36 | 4.02 | 3.08 | 0.330 | 18.8 | 2.87 | 0.940 | 0.848 | 3.6 |
| LiSenNet<sup>b</sup> | 37 | 56M | - | - | 3.34 | 3.30 | 3.90 | 2.98 | 0.425 | 13.5 | 3.08 | 0.938 | 0.842 | 3.7 |
| LiSenNet<sup>c</sup> | 37 | 56M | 0.034 | 0.028 | 3.42 | 3.34 | 4.03 | 3.07 | 0.335 | 18.5 | 2.98 | 0.941 | 0.851 | 3.4 |
| FSPEN<sup>d</sup> | 79 | 64M | 0.046 | 0.038 | 3.40 | 3.33 | 4.00 | 3.05 | 0.324 | 18.4 | 3.00 | 0.942 | 0.850 | 3.6 |
| BSRNN<sup>d</sup> | 334 | 245M | 0.059 | 0.062 | 3.44 | 3.36 | 4.00 | 3.07 | 0.303 | 18.9 | 3.06 | 0.942 | 0.855 | 3.4 |
| FastEnhancer_B | 92 | 262M | 0.022 | 0.026 | 3.47 | 3.38 | 4.02 | 3.10 | 0.285 | 19.0 | 3.13 | 0.945 | 0.861 | 3.2 |
| FastEnhancer_T | 22 | 55M | 0.012 | 0.013 | 3.42 | 3.34 | 4.01 | 3.06 | 0.334 | 18.6 | 2.99 | 0.940 | 0.850 | 3.6 |
| FastEnhancer_B | 92 | 262M | 0.022 | 0.026 | 3.47 | 3.38 | 4.02 | 3.10 | 0.285 | 19.0 | 3.13 | 0.945 | 0.861 | 3.2 |
| FastEnhancer_S | 195 | 664M | 0.034 | 0.048 | 3.49 | 3.40 | 4.03 | 3.12 | 0.265 | 19.2 | 3.19 | 0.947 | 0.866 | 3.2 |
| FastEnhancer_M | 492 | 2.9G | 0.101 | 0.173 | 3.48 | 3.39 | 4.02 | 3.11 | 0.243 | 19.4 | 3.24 | 0.950 | 0.873 | 2.8 |
| FastEnhancer_L | 1105 | 11G | 0.313 | 0.632 | 3.53 | 3.44 | 4.04 | 3.16 | 0.239 | 19.6 | 3.26 | 0.952 | 0.877 | 3.1 |
<sup>a</sup> Evaluated using the official checkpoint.
<sup>b</sup> Trained using the official training code. Not streamable because of input normalization and Griffin-Lim, so RTFs are not reported.
<sup>c</sup> To make the model streamable, input normalization and Griffin-Lim were removed. Trained following the experimental setup of FastEnhancer (same loss function, same optimizer, etc.; the only difference is the model architecture).
<sup>d</sup> Re-implemented and trained following the experimental setup of FastEnhancer (same loss function, same optimizer, etc.; the only difference is the model architecture).
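
The RTF columns report a real-time factor, but this README does not spell out how it was measured. The following is only a minimal sketch under the usual definition (processing time divided by the duration of the processed audio). The names `real_time_factor` and `measure_rtf`, the dummy `process` callable, and the 16 kHz sample rate are illustrative assumptions, not part of this repository.

```python
import time


def real_time_factor(elapsed_seconds, audio_seconds):
    """RTF under the usual definition: processing time / audio duration."""
    return elapsed_seconds / audio_seconds


def measure_rtf(process, audio, sample_rate):
    """Time one call of `process` on `audio` and return its RTF.

    `process` is a hypothetical stand-in for an enhancement model;
    `audio` is any sequence of samples.
    """
    audio_seconds = len(audio) / sample_rate
    start = time.perf_counter()
    process(audio)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


if __name__ == "__main__":
    # One second of silence at an assumed 16 kHz sample rate.
    dummy_audio = [0.0] * 16000
    rtf = measure_rtf(lambda x: [2.0 * s for s in x], dummy_audio, 16000)
    print(f"RTF: {rtf:.4f}")
```

Under this definition, an RTF below 1 means the model processes audio faster than it arrives. For example, 0.22 s of compute for 10 s of audio gives an RTF of 0.022.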
- Trained using DNS-Challenge-3 wideband training dataset.
  - Without `emotional_speech` and `singing_voice`.
  - With VCTK-0.92 clean speech except `p232` and `p257` speakers.
  - RIRs were not convolved to the clean speech.
  - Unlike in Voicebank-Demand, we didn't use PESQLoss.
- Tested using DNS-Challenge-1 dev-testset-synthetic-no-reverb dataset.
- We trained each model only once with one random seed.
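
The clean-speech selection described in the list above can be sketched as a simple path filter. This is a hedged illustration, not the repository's data pipeline: it assumes that the excluded DNS-Challenge subsets appear as directory names in each path and that VCTK file names start with the speaker ID followed by an underscore. The function names and example paths are hypothetical.

```python
from pathlib import PurePosixPath

# Subsets and speakers excluded according to the training notes above.
EXCLUDED_DNS_DIRS = {"emotional_speech", "singing_voice"}
EXCLUDED_VCTK_SPEAKERS = {"p232", "p257"}


def keep_dns_clean(path):
    """Keep a DNS clean-speech file unless it lies in an excluded subset.

    Assumption: the subset name is one of the directory components.
    """
    return not (set(PurePosixPath(path).parts) & EXCLUDED_DNS_DIRS)


def keep_vctk(path):
    """Keep a VCTK file unless it belongs to an excluded speaker.

    Assumption: the file name starts with '<speaker>_'.
    """
    speaker = PurePosixPath(path).name.split("_")[0]
    return speaker not in EXCLUDED_VCTK_SPEAKERS


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    dns_files = ["clean/read_speech/a.wav", "clean/singing_voice/b.wav"]
    vctk_files = ["vctk/p225/p225_001_mic1.flac", "vctk/p232/p232_001_mic1.flac"]
    print([f for f in dns_files if keep_dns_clean(f)])
    print([f for f in vctk_files if keep_vctk(f)])
```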
Table 2. Performance on the DNS-Challenge-1 dev-testset-synthetic-no-reverb dataset.
| Model | Para. (K) | MACs | RTF (Xeon) | RTF (M1) | DNSMOS (P.808) | DNSMOS (P.835) SIG | DNSMOS (P.835) BAK | DNSMOS (P.835) OVL | SCOREQ | SISDR | PESQ | STOI | ESTOI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GTCRN<sup>a</sup> | 24 | 40M | 0.060 | 0.042 | 3.85 | 3.35 | 3.98 | 3.05 | 0.551 | 14.8 | 2.26 | 0.934 | 0.871 |
| LiSenNet<sup>b</sup> | 37 | 56M | 0.034 | 0.028 | 3.82 | 3.39 | 4.08 | 3.14 | 0.487 | 16.3 | 2.58 | 0.947 | 0.893 |
| FSPEN<sup>b</sup> | 79 | 64M | 0.046 | 0.038 | 3.82 | 3.37 | 4.09 | 3.13 | 0.510 | 15.8 | 2.43 | 0.943 | 0.885 |
| BSRNN<sup>b</sup> | 334 | 245M | 0.059 | 0.062 | 3.89 | 3.41 | 4.11 | 3.18 | 0.441 | 16.7 | 2.61 | 0.951 | 0.901 |
| FastEnhancer_B | 92 | 262M | 0.022 | 0.026 | 3.92 | 3.43 | 4.12 | 3.20 | 0.396 | 16.7 | 2.69 | 0.953 | 0.903 |
| FastEnhancer_T | 22 | 55M | 0.012 | 0.013 | 3.81 | 3.35 | 4.07 | 3.10 | 0.522 | 15.4 | 2.43 | 0.940 | 0.879 |
| FastEnhancer_B | 92 | 262M | 0.022 | 0.026 | 3.92 | 3.43 | 4.12 | 3.20 | 0.396 | 16.7 | 2.69 | 0.953 | 0.903 |
| FastEnhancer_S | 195 | 664M | 0.034 | 0.048 | 3.96 | 3.46 | 4.13 | 3.23 | 0.373 | 17.5 | 2.79 | 0.960 | 0.914 |
| FastEnhancer_M | 492 | 2.9G | 0.101 | 0.173 | 3.98 | 3.48 | 4.14 | 3.26 | 0.345 | 18.4 | 2.78 | 0.965 | 0.924 |
| FastEnhancer_L | 1105 | 11G | 0.313 | 0.632 | 4.02 | 3.51 | 4.16 | 3.29 | 0.298 | 19.5 | 2.94 | 0.971 | 0.935 |
<sup>a</sup> Evaluated using the official checkpoint. Note that this model was trained for both noise suppression and de-reverberation, whereas the FastEnhancer models were trained only for noise suppression. If GTCRN were trained for noise suppression only, its performance might be higher.
<sup>b</sup> Re-implemented and trained following the experimental setup of FastEnhancer (same loss function, same optimizer, etc.; the only difference is the model architecture).
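
For readers unfamiliar with the SISDR column, here is a minimal sketch of the commonly used scale-invariant signal-to-distortion ratio, in dB. It is not taken from this repository's metric code, which may differ in details such as mean removal or numerical safeguards. The function name `si_sdr` and the toy signals are illustrative.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB (common definition, zero-mean signals)."""
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference to get the scaled target.
    alpha = sum(a * b for a, b in zip(e, r)) / sum(b * b for b in r)
    target = [alpha * b for b in r]
    residual = [a - t for a, t in zip(e, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(v * v for v in residual)
    return 10.0 * math.log10(target_energy / residual_energy)


if __name__ == "__main__":
    reference = [1.0, -1.0, 1.0, -1.0]
    estimate = [1.1, -0.9, 0.9, -1.1]
    print(f"SI-SDR: {si_sdr(estimate, reference):.2f} dB")
    # Rescaling the estimate leaves the score unchanged.
    print(f"SI-SDR: {si_sdr([2.0 * x for x in estimate], reference):.2f} dB")
```

Because the target is obtained by projecting the estimate onto the reference, multiplying the estimate by any nonzero gain does not change the score, which is what "scale-invariant" refers to.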