Indonesian speech/phoneme recognizer powered by Kaldi 2.0 (lhotse, icefall, sherpa). Trained on open source speech data. Deployable on Desktop (via Python/C++), web apps, iOS, and Android.
All models released here are trained with icefall (which runs on PyTorch) and converted for deployment via sherpa-ncnn. Icefall is part of Kaldi 2.0 / Next-Gen Kaldi: it provides training recipes built on k2 (finite-state automata, FSA) and lhotse (audio data loading).
Through this repository, we aim to document and release our open source models for the public's use.
As of the time of writing, we use the following datasets to train our models:
- LibriVox Indonesia
- FLEURS
- Common Voice
Notably, these datasets contain only text transcriptions, not phoneme annotations. We used g2p ID to phonemize the transcriptions.
Moreover, LibriVox Indonesia's original annotations are written in the old Indonesian Republican Spelling System (Edjaan Repoeblik). We pre-converted them to EYD (Ejaan yang Disempurnakan) with Doeloe before phonemizing them.
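To illustrate the phonemization step, here is a minimal sketch of how g2p ID can be called from Python. It assumes the package is importable as g2p_id and exposes a callable G2p class; check the g2p ID documentation for the exact API of the version you install.

```python
# Minimal phonemization sketch. Assumes g2p ID is installed as the `g2p_id`
# package and exposes a callable `G2p` class; verify against its documentation.
from g2p_id import G2p

g2p = G2p()

# Phonemize an EYD-normalized transcript (the same preprocessing we apply
# to the training annotations).
text = "Saya sedang belajar bahasa Indonesia."
print(g2p(text))
```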
| Model Format | Link |
| --- | --- |
| Icefall | Pruned Stateless Zipformer RNN-T Streaming ID |
| Sherpa NCNN | Sherpa-ncnn Pruned Stateless Zipformer RNN-T Streaming ID |
| Sherpa ONNX | TBA |
Results (Phoneme Error Rate, PER)
| Decoding | LibriVox | FLEURS | Common Voice |
| --- | --- | --- | --- |
| Greedy Search | 4.87% | 11.45% | 14.97% |
| Modified Beam Search | 4.71% | 11.25% | 14.31% |
| Fast Beam Search | 4.85% | 12.55% | 14.89% |
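The PER numbers above compare three decoding methods. The Python examples further below use sherpa-ncnn's default greedy search; if you want modified beam search at inference time, the snippet below is a sketch that assumes the Python Recognizer accepts decoding_method and num_active_paths arguments (verify the exact parameter names against the sherpa-ncnn version you install).

```python
import sherpa_ncnn

path = "./sherpa-ncnn-pruned-transducer-stateless7-streaming-id"

# Sketch: same model files as the examples below, but with modified beam search.
# `decoding_method` and `num_active_paths` are assumed parameter names; check
# the sherpa-ncnn Python API docs for your installed version.
recognizer = sherpa_ncnn.Recognizer(
    tokens=f"{path}/tokens.txt",
    encoder_param=f"{path}/encoder_jit_trace-pnnx.ncnn.param",
    encoder_bin=f"{path}/encoder_jit_trace-pnnx.ncnn.bin",
    decoder_param=f"{path}/decoder_jit_trace-pnnx.ncnn.param",
    decoder_bin=f"{path}/decoder_jit_trace-pnnx.ncnn.bin",
    joiner_param=f"{path}/joiner_jit_trace-pnnx.ncnn.param",
    joiner_bin=f"{path}/joiner_jit_trace-pnnx.ncnn.bin",
    num_threads=4,
    decoding_method="modified_beam_search",
    num_active_paths=4,
)
```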
There are various ways to export and deploy these models for production. Sherpa (Kaldi 2.0's main deployment framework) has counterparts for running on the NCNN and ONNX engines. You can also use these models directly via icefall, but that requires a working PyTorch installation and is not optimized for production.
We provide a few external links to Sherpa's thorough documentation below, as well as Python usage examples for Recognize a file and Real-time recognition with a microphone.
| Inference Framework | Platform | Language | Link |
| --- | --- | --- | --- |
| Sherpa | Desktop | C++ | Guide |
| Sherpa NCNN | Desktop | Python | Guide |
| Sherpa NCNN | Android | Kotlin | Guide |
| Sherpa NCNN | iOS | Swift | Guide |
| Sherpa ONNX | Desktop | Python | Guide |
| Sherpa ONNX | Android | Kotlin | Guide |
| Sherpa ONNX | iOS | Swift | Guide |
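Both Python examples below load the converted sherpa-ncnn model from a local directory. Before running them, you can sanity-check that the directory contains every file passed to the Recognizer constructor; check_model_dir below is a small helper of ours, not a sherpa-ncnn API.

```python
from pathlib import Path

# Hypothetical helper (not part of sherpa-ncnn): verify that the converted
# model directory contains every file the examples below pass to Recognizer.
REQUIRED_FILES = [
    "tokens.txt",
    "encoder_jit_trace-pnnx.ncnn.param",
    "encoder_jit_trace-pnnx.ncnn.bin",
    "decoder_jit_trace-pnnx.ncnn.param",
    "decoder_jit_trace-pnnx.ncnn.bin",
    "joiner_jit_trace-pnnx.ncnn.param",
    "joiner_jit_trace-pnnx.ncnn.bin",
]


def check_model_dir(path: str) -> None:
    missing = [name for name in REQUIRED_FILES if not (Path(path) / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing model files in {path}: {missing}")


check_model_dir("./sherpa-ncnn-pruned-transducer-stateless7-streaming-id")
```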
The following code is adapted from this example. View this example running in our live demo!
```python
import wave

import numpy as np
import sherpa_ncnn

path = "./sherpa-ncnn-pruned-transducer-stateless7-streaming-id"


def main():
    recognizer = sherpa_ncnn.Recognizer(
        tokens=f"{path}/tokens.txt",
        encoder_param=f"{path}/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin=f"{path}/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param=f"{path}/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin=f"{path}/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param=f"{path}/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin=f"{path}/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )

    filename = "path/to/your/audio.wav"
    with wave.open(filename) as f:
        # The file must be mono, 16-bit PCM, at the recognizer's sample rate.
        assert f.getframerate() == recognizer.sample_rate, (
            f.getframerate(),
            recognizer.sample_rate,
        )
        assert f.getnchannels() == 1, f.getnchannels()
        assert f.getsampwidth() == 2, f.getsampwidth()  # it is in bytes

        num_samples = f.getnframes()
        samples = f.readframes(num_samples)
        samples_int16 = np.frombuffer(samples, dtype=np.int16)
        samples_float32 = samples_int16.astype(np.float32)

        # Normalize 16-bit PCM to the [-1, 1) range expected by sherpa-ncnn.
        samples_float32 = samples_float32 / 32768

    recognizer.accept_waveform(recognizer.sample_rate, samples_float32)

    # Feed 0.5 s of silence so the final frames are flushed out of the model.
    tail_paddings = np.zeros(int(recognizer.sample_rate * 0.5), dtype=np.float32)
    recognizer.accept_waveform(recognizer.sample_rate, tail_paddings)

    recognizer.input_finished()
    print(recognizer.text)


if __name__ == "__main__":
    main()
```
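The script above asserts that the WAV file's sample rate matches the recognizer's expected rate. If your audio was recorded at a different rate, resample it first; the helper below is a minimal sketch using only NumPy linear interpolation (resample_linear is our own name, and a dedicated resampler will give better quality for production use).

```python
import numpy as np


def resample_linear(samples: np.ndarray, orig_rate: int, target_rate: int) -> np.ndarray:
    """Sketch: linearly interpolate float32 samples from orig_rate to target_rate."""
    if orig_rate == target_rate:
        return samples
    target_len = int(round(len(samples) * target_rate / orig_rate))
    old_times = np.arange(len(samples)) / orig_rate
    new_times = np.arange(target_len) / target_rate
    return np.interp(new_times, old_times, samples).astype(np.float32)


# Example: bring 44.1 kHz audio down to the recognizer's rate before calling
# recognizer.accept_waveform(recognizer.sample_rate, samples_float32).
# samples_float32 = resample_linear(samples_float32, 44100, recognizer.sample_rate)
```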
The following code is adapted from this example. View this example running in our live demo!
```python
import sounddevice as sd
import sherpa_ncnn

path = "./sherpa-ncnn-pruned-transducer-stateless7-streaming-id"


def create_recognizer():
    recognizer = sherpa_ncnn.Recognizer(
        tokens=f"{path}/tokens.txt",
        encoder_param=f"{path}/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin=f"{path}/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param=f"{path}/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin=f"{path}/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param=f"{path}/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin=f"{path}/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )
    return recognizer


def main():
    print("Started! Please speak")
    recognizer = create_recognizer()
    sample_rate = recognizer.sample_rate
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms

    last_result = ""
    with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            recognizer.accept_waveform(sample_rate, samples)

            result = recognizer.text
            if last_result != result:
                # Print the running hypothesis only when it changes.
                last_result = result
                print(result)


if __name__ == "__main__":
    devices = sd.query_devices()
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    main()
```
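The microphone demo reads the stream in 0.1-second chunks and prints the hypothesis whenever it changes, so partial results appear with roughly 100 ms granularity; stop it with Ctrl-C.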
Our models and inference code are released under the Apache-2.0 license. Common Voice and LibriVox Indonesia are in the public domain (CC0). FLEURS is licensed under Creative Commons Attribution (CC-BY).
```bibtex
@inproceedings{commonvoice:2020,
  author    = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title     = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages     = {4211--4215},
  year      = {2020}
}

@article{fleurs2022arxiv,
  title   = {FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author  = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  journal = {arXiv preprint arXiv:2205.12446},
  url     = {https://arxiv.org/abs/2205.12446},
  year    = {2022}
}
```