WER are we? An attempt at tracking state-of-the-art results and the corresponding code on speech recognition. Feel free to correct! (Inspired by wer_are_we.)
(Possibly trained on more data than HKUST.)
CER Test | Paper | Published | Notes | Codes |
---|---|---|---|---|
21.2% | Improving Transformer-based Speech Recognition Using Unsupervised Pre-training | October 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation + MPC pre-training on 10,000 hours of unlabeled speech | athena-team/Athena |
22.75% | Improving Transformer-based Speech Recognition Using Unsupervised Pre-training | October 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation + MPC self-data pre-training | athena-team/Athena |
23.09% | CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition | February 2020 | CIF + SAN-based models (AM + LM) + speed perturbation + SpecAugment | None |
23.5% | A Comparative Study on Transformer vs RNN in Speech Applications | September 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation | espnet/espnet |
23.67% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | 2016 | TDNN/HMM, lattice-free MMI + speed perturbation | kaldi-asr/kaldi |
24.12% | Self-Attention Aligner: A Latency-Control End-to-End Model for ASR Using Self-Attention Network and Chunk-Hopping | February 2019 | SAA Model + SAN-LM (joint training) + speed perturbation | None |
27.67% | Extending Recurrent Neural Aligner for Streaming End-to-End Speech Recognition in Mandarin | February 2019 | Extended-RNA + RNN-LM (joint training) | None |
28.0% | Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM | June 2017 | CTC-Attention MTL + joint decoding (one-pass) + VGG Net + RNN-LM (separate) + speed perturbation | espnet/espnet |
29.9% | Joint CTC/attention decoding for end-to-end speech recognition | 2017 | CTC-Attention MTL-large + joint decoding (one pass) + speed perturbation | espnet/espnet |
CER Dev | CER Test | Paper | Published | Notes | Codes |
---|---|---|---|---|---|
None | 6.6% | Improving Transformer-based Speech Recognition Using Unsupervised Pre-training | October 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation + MPC self-data pre-training | athena-team/Athena |
None | 6.34% | CAT: CRF-Based ASR Toolkit | November 2019 | VGG + BLSTM + CTC-CRF + 3-gram LM + speed perturbation | thu-spmi/CAT |
6.0% | 6.7% | A Comparative Study on Transformer vs RNN in Speech Applications | September 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation | espnet/espnet |
None | 7.43% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | 2016 | TDNN/HMM, lattice-free MMI + speed perturbation | kaldi-asr/kaldi |
CER Word Task, 0 dB (white / car / cafeteria) | PER Phone Task, 0 dB (white / car / cafeteria) | Paper | Published | Notes | Codes |
---|---|---|---|---|---|
75.01% / 32.13% / 56.37% | 46.95% / 15.96% / 32.56% | THCHS-30: A Free Chinese Speech Corpus | December 2015 | DNN + DAE-based noise cancellation | kaldi-asr/kaldi |
65.87% / 25.07% / 51.92% | 39.80% / 11.48% / 30.55% | None | None | DNN + DAE-based noise cancellation | kaldi-asr/kaldi |
(Possibly trained on more data than LibriSpeech.)
WER test-clean | WER test-other | Paper | Published | Notes | Codes |
---|---|---|---|---|---|
5.83% | 12.69% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | Humans | None |
2.0% | 4.1% | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | November 2019 | Conv+Transformer AM (10k word pieces) with ConvLM decoding and Transformer rescoring + 60k hours unlabeled | facebookresearch/wav2letter |
2.3% | 4.9% | Transformer-based Acoustic Modeling for Hybrid Speech Recognition | October 2019 | Transformer AM (chenones) + 4-gram LM + Neural LM rescore (data augmentation: speed perturbation and SpecAugment) | None |
2.3% | 5.0% | RWTH ASR Systems for LibriSpeech: Hybrid vs Attention | September 2019, Interspeech | HMM-DNN + lattice-based sMBR + LSTM LM + Transformer LM rescoring (no data augmentation) | rwth-i6/returnn rwth-i6/returnn-experiments |
2.3% | 5.2% | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | November 2019 | Conv+Transformer AM (10k word pieces) with ConvLM decoding and Transformer rescoring | facebookresearch/wav2letter |
2.2% | 5.8% | State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions | October 2019 | Multi-stream self-attention in hybrid ASR + 4-gram LM + Neural LM rescore (no data augmentation) | s-omranpour/ConvolutionalSpeechRecognition (not official) |
2.5% | 5.8% | SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | April 2019 | Listen Attend Spell | DemisEom/SpecAugment (not official) |
3.2% | 7.6% | From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition | October 2019 | LC-BLSTM AM (chenones) + 4-gram LM (data augmentation: speed perturbation and SpecAugment) | None |
3.19% | 7.64% | The CAPIO 2017 Conversational Speech Recognition System | April 2018 | TDNN + TDNN-LSTM + CNN-bLSTM + Dense TDNN-LSTM across two kinds of trees + N-gram LM + Neural LM rescore | None |
2.44% | 8.29% | Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System | September 2019, Interspeech | encoder-attention-decoder + Transformer LM | None |
3.80% | 8.76% | Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks | Interspeech, Sept 2018 | 17-layer TDNN-F + iVectors | kaldi-asr/kaldi |
2.8% | 9.3% | RWTH ASR Systems for LibriSpeech: Hybrid vs Attention | September 2019, Interspeech | encoder-attention-decoder + BPE + Transformer LM (no data augmentation) | rwth-i6/returnn rwth-i6/returnn-experiments |
3.26% | 10.47% | Fully Convolutional Speech Recognition | December 2018 | End-to-end CNN on the waveform + conv LM | None |
3.82% | 12.76% | Improved training of end-to-end attention models for speech recognition | Interspeech, Sept 2018 | encoder-attention-decoder end-to-end model | rwth-i6/returnn rwth-i6/returnn-experiments |
4.28% | None | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations | kaldi-asr/kaldi |
4.83% | None | A time delay neural network architecture for efficient modeling of long temporal contexts | 2015 | HMM-TDNN + iVectors | kaldi-asr/kaldi |
5.15% | 12.73% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 100M parameters trained on 11940h | PaddlePaddle/DeepSpeech |
5.51% | 13.97% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | 2015 | HMM-DNN + pNorm* | kaldi-asr/kaldi |
4.8% | 14.5% | Letter-Based Speech Recognition with Gated ConvNets | December 2017 | (Gated) ConvNet for AM going to letters + 4-gram LM | None |
8.01% | 22.49% | same, Kaldi | 2015 | HMM-(SAT)GMM | kaldi-asr/kaldi |
12.51% | None | Audio Augmentation for Speech Recognition | 2015 | TDNN + pNorm + speed up/down speech | kaldi-asr/kaldi |
(Possibly trained on more data than WSJ.)
WER eval'92 | WER eval'93 | Paper | Published | Notes | Codes |
---|---|---|---|---|---|
5.03% | 8.08% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | Humans | None |
2.9% | None | End-to-end Speech Recognition Using Lattice-Free MMI | September 2018 | HMM-DNN LF-MMI trained (biphone) | None |
3.10% | None | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 100M parameters | PaddlePaddle/DeepSpeech |
3.47% | None | Deep Recurrent Neural Networks for Acoustic Modelling | April 2015 | TC-DNN-BLSTM-DNN | None |
3.5% | 6.8% | Fully Convolutional Speech Recognition | December 2018 | End-to-end CNN on the waveform + conv LM | None |
3.63% | 5.66% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | 2015 | test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm* | kaldi-asr/kaldi |
4.1% | None | End-to-end Speech Recognition Using Lattice-Free MMI | September 2018 | HMM-DNN E2E LF-MMI trained (word n-gram) | None |
5.6% | None | Convolutional Neural Networks-based Continuous Speech Recognition using Raw Speech Signal | 2014 | CNN over raw speech (wav) | None |
5.7% | 8.7% | End-to-end Speech Recognition from the Raw Waveform | June 2018 | End-to-end CNN on the waveform | None |
(So far, all results trained on TIMIT and tested on the core test set.)
(Possibly trained on more data than SWB, but test set = full Hub5'00.)
WER (SWB) | WER (CH) | Paper | Published | Notes | Codes |
---|---|---|---|---|---|
5.0% | 9.1% | The CAPIO 2017 Conversational Speech Recognition System | December 2017 | 2 Dense LSTMs + 3 CNN-bLSTMs across 3 phonesets from previous Capio paper & AM adaptation using parameter averaging (5.6% SWB / 10.5% CH single systems) | None |
5.1% | 9.9% | Language Modeling with Highway LSTM | September 2017 | HW-LSTM LM trained with Switchboard+Fisher+Gigaword+Broadcast News+Conversations, AM from previous IBM paper | None |
5.1% | None | The Microsoft 2017 Conversational Speech Recognition System | August 2017 | ~2016 system + character-based dialog session aware (turns of speech) LSTM LM | None |
5.3% | 10.1% | Deep Learning-based Telephony Speech Recognition in the Wild | August 2017 | Ensemble of 3 CNN-bLSTM (5.7% SWB / 11.3% CH single systems) | None |
5.5% | 10.3% | English Conversational Telephone Speech Recognition by Humans and Machines | March 2017 | ResNet + BiLSTMs acoustic model, with 40d FMLLR + i-Vector inputs, trained on SWB+Fisher+CH, n-gram + model-M + LSTM + Strided (à trous) convs-based LM trained on Switchboard+Fisher+Gigaword+Broadcast | None |
6.3% | 11.9% | The Microsoft 2016 Conversational Speech Recognition System | September 2016 | VGG/Resnet/LACE/BiLSTM acoustic model trained on SWB+Fisher+CH, N-gram + RNNLM language model trained on Switchboard+Fisher+Gigaword+Broadcast | None |
6.6% | 12.2% | The IBM 2016 English Conversational Telephone Speech Recognition System | June 2016 | RNN + VGG + LSTM acoustic model trained on SWB+Fisher+CH, N-gram + "model M" + NNLM language model | None |
6.8% | 14.1% | SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | April 2019 | Listen Attend Spell | DemisEom/SpecAugment (not official) |
8.5% | 13% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-BLSTM trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + Fisher | kaldi-asr/kaldi |
9.2% | 13.3% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + Fisher (10% / 15.1% respectively trained on SWBD only) | kaldi-asr/kaldi |
12.6% | 16% | Deep Speech: Scaling up end-to-end speech recognition | December 2014 | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB | mozilla/DeepSpeech (not official) |
11% | 17.1% | A time delay neural network architecture for efficient modeling of long temporal contexts | 2015 | HMM-TDNN + iVectors | kaldi-asr/kaldi |
12.6% | 18.4% | Sequence-discriminative training of deep neural networks | 2013 | HMM-DNN +sMBR | kaldi-asr/kaldi |
12.9% | 19.3% | Audio Augmentation for Speech Recognition | 2015 | HMM-TDNN + pNorm + speed up/down speech | kaldi-asr/kaldi |
15% | 19.1% | Building DNN Acoustic Models for Large Vocabulary Speech Recognition | June 2014 | DNN + Dropout | None |
10.4% | None | Joint Training of Convolutional and Non-Convolutional Neural Networks | 2014 | CNN on MFSC/fbanks + 1 non-conv layer for FMLLR/I-Vectors concatenated in a DNN | None |
11.5% | None | Deep Convolutional Neural Networks for LVCSR | 2013 | CNN | None |
12.2% | None | Very Deep Multilingual Convolutional Neural Networks for LVCSR | September 2015 | Deep CNN (10 conv, 4 FC layers), multi-scale feature maps | None |
11.8% | 25.7% | Improved training of end-to-end attention models for speech recognition | Interspeech, Sept 2018 | encoder-attention-decoder end-to-end model, trained on 300h SWB | rwth-i6/returnn rwth-i6/returnn-experiments |
- WER: word error rate
- PER: phone error rate
- LM: language model
- HMM: hidden Markov model
- GMM: Gaussian mixture model
- DNN: deep neural network
- CNN: convolutional neural network
- DBN: deep belief network (RBM-based DNN)
- TDNN-F: a factored form of time delay neural networks (TDNN)
- RNN: recurrent neural network
- LSTM: long short-term memory
- CTC: connectionist temporal classification
- MMI: maximum mutual information
- MPE: minimum phone error
- sMBR: state-level minimum Bayes risk
- SAT: speaker adaptive training
- MLLR: maximum likelihood linear regression
- LDA: (in this context) linear discriminant analysis
- MFCC: Mel frequency cepstral coefficients
- FB/FBANKS/MFSC: Mel frequency spectral coefficients
- IFCC: Instantaneous frequency cosine coefficients (https://github.com/siplabiith/IFCC-Feature-Extraction)
- VGG: very deep convolutional neural networks from the Visual Geometry Group; the VGG architecture stacks blocks of two 3x3 convolutions followed by one pooling layer, repeated
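
As a reference for the metrics above, WER and CER are both edit-distance rates: the Levenshtein distance between hypothesis and reference (over words or characters) divided by the reference length. A minimal sketch, not tied to any toolkit listed here:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via dynamic programming."""
    dp = list(range(len(hyp) + 1))          # dp[j] = distance(ref[:0], hyp[:j])
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i              # prev holds dp[i-1][j-1]
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,                  # deletion
                dp[j - 1] + 1,              # insertion
                prev + (r != h),            # substitution (free if match)
            )
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    ref, hyp = reference.replace(" ", ""), hypothesis.replace(" ", "")
    return edit_distance(ref, hyp) / len(ref)
```

Note that scores on this page are computed by the respective toolkits (Kaldi, ESPnet, etc.), whose scoring scripts also handle text normalization before computing the distance.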