This codebase is a work in progress. There are known and unknown bugs in the implementation, and it has not been optimized in any way.
MLPerf has neither finalized a decision to add a speech recognition benchmark, nor accepted this implementation/architecture as a reference implementation.
Speech recognition accepts raw audio samples and produces a corresponding text transcription.
See https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/README.md. This implementation shares significant code with that repository.
"OpenSLR LibriSpeech Corpus" provides over 1000 hours of speech data in the form of raw audio.
What preprocessing is done to the dataset?
How is the test set extracted?
In what order is the training data traversed?
In what order is the test data traversed?
Describe simulation environment briefly, if applicable.
Cite paper describing model plus any additional attribution requested by code authors
Brief summary of structure of model
How are weights and biases initialized?
Transducer Loss
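A minimal sketch of evaluating a transducer (RNN-T) loss, shown here with `torchaudio`'s `RNNTLoss` (recent torchaudio releases) rather than whatever loss kernel this codebase actually links against; the tensor shapes, vocabulary size, and `blank=0` index are assumptions made for the example.

```python
import torch
import torchaudio

# Joint-network output ("logits") has shape
# (batch, max_audio_frames, max_target_length + 1, vocab_size).
batch, frames, target_len, vocab = 2, 50, 10, 29
logits = torch.randn(batch, frames, target_len + 1, vocab, requires_grad=True)
# Target labels exclude the blank index (assumed to be 0 here).
targets = torch.randint(1, vocab, (batch, target_len), dtype=torch.int32)
logit_lengths = torch.full((batch,), frames, dtype=torch.int32)
target_lengths = torch.full((batch,), target_len, dtype=torch.int32)

rnnt_loss = torchaudio.transforms.RNNTLoss(blank=0, reduction="mean")
loss = rnnt_loss(logits, targets, logit_lengths, target_lengths)
loss.backward()
```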
TBD; Adam is currently used.
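Since the optimizer choice is still TBD, the following is only a sketch of an Adam setup with placeholder hyperparameters, not the configuration used by this implementation.

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the actual acoustic model
# Learning rate and weight decay are illustrative values only.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

optimizer.zero_grad()
loss = model(torch.randn(4, 10)).sum()  # placeholder loss
loss.backward()
optimizer.step()
```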
Word Error Rate (WER) across all words in the output text of all samples in the validation set.
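WER is conventionally computed as the total word-level edit distance between hypothesis and reference transcripts, divided by the total number of reference words. The helper below is an illustrative sketch, not the scoring code shipped with the benchmark.

```python
def word_error_rate(references, hypotheses):
    """Total word-level edit distance over total reference words."""
    total_edits, total_words = 0, 0
    for ref, hyp in zip(references, hypotheses):
        r, h = ref.split(), hyp.split()
        # Levenshtein distance between the two word sequences.
        dist = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            dist[i][0] = i
        for j in range(len(h) + 1):
            dist[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = 0 if r[i - 1] == h[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + sub)  # substitution
        total_edits += dist[len(r)][len(h)]
        total_words += len(r)
    return total_edits / total_words


print(word_error_rate(["the cat sat"], ["the cat sat down"]))  # 1 insertion / 3 words ≈ 0.333
```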
What is the numeric quality target?
TBD
TBD