This is a rewrite of Automatic Speech Recognition from NVidia's NeMo framework.
The goal of this project is to have a fast-loading, low-dependency speech recognition software. This program only uses ggml-org/ggml library to work with neural networks.
The program expects raw single-channel s16le 16kHz samples as an input. Standard input denoted as -.
# read from standard input
ffmpeg -hide_banner -loglevel error -i your-file.mp3 -ar 16000 -ac 1 -f s16le - \
| ./nemotron-asr.cpp weights/nemotron-speech-streaming-0.6B-v0.1.Q8_0.gguf - 70 13
# read from a file
ffmpeg -hide_banner -loglevel error -i your-file.mp3 -ar 16000 -ac 1 -f s16le raw-audio.pcm
./nemotron-asr.cpp weights/nemotron-speech-streaming-0.6B-v0.1.Q8_0.gguf raw-audio.pcm 70 13Full and quantized versions of the models can be downloaded from Hugging Face Hub: https://huggingface.co/m1el/nemotron-speech-streaming-0.6B-gguf
Notice: the weights are Licensed by NVIDIA Corporation under the NVIDIA Open Model License
Or converted from https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b using convert_to_gguf.py
The original tensors in the nvidia/nemotron-speech-streaming-en-0.6b require transposition to be used in matrix multiplication in ggml.
Additionally, those changes also help with quantization, which has requirements on tensor shape. For details, see TENSOR_SHAPES.md
To develop this project you need to clone ggml-org/ggml to this directory, then build ggml with all your favorite settings. TODO: Maybe make this process more friendly? PRs welcome.
git clone https://github.com/ggml-org/ggml.git
mkdir ggml/build
cd ggml/build
cmake ..
make -j8
cd ../..Once you have built ggml, you can build this binary
make nemotron-asr.cppComparison to whisper.cpp?
- Whisper operates on 30s audio chunks, so it is not feasible to use in interactive applications. NeMotron's ASR model has configurable latency (from 80ms to 1.12s), which can traded for quality and speed. (bigger lookahead gives better quality and bigger chunks require less work)
- Whisper has troubles on long audio streams, where NeMotron's ASR is a streaming model, works for infinite streams. It does get stuck in lowercase mode though.
- Quiality seems to be better
MIT