Automatic Speech Recognition (ASR) enables the recognition and transcription of spoken language into text. Typically, an ASR model is trained and used for a specific language. However, Indonesia has more than 700 spoken languages, so it is not practical to provide a separate speech recognition model for each one.
Therefore, we want to develop a multilingual speech recognition model that can at least support some of the main Indonesian languages without sacrificing model performance for each language.
We want to build a multilingual speech recognition model using Indonesian, Javanese, and Sundanese datasets. The model should perform well in all three languages. We also train monolingual models for comparison.
We used the following speech datasets for training and fine-tuning:
For Indonesian, Javanese, and Sundanese, we used Wav2vec 2.0, a framework for self-supervised learning of speech representations that, at the time of writing, is state of the art on the LibriSpeech benchmark for noisy speech.
We trained a multilingual Wav2vec 2.0 model on the three languages combined for 200 epochs. We also trained three monolingual Wav2vec 2.0 models, one each for Indonesian, Javanese, and Sundanese, each for 200 epochs.
We built a multilingual speech recognition model and published it as an open-source model. We also provide a live demo for testing the model.
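One practical detail of training on three languages combined is that all of them must share a single output vocabulary. The following is a minimal sketch of how such a vocabulary could be built, assuming character-level CTC fine-tuning, which is the usual setup for Wav2vec 2.0 but is not confirmed here for this project. The function name `build_shared_vocab`, the special tokens, and the sample sentences are illustrative placeholders, not taken from the project.

```python
from typing import Dict, List


def build_shared_vocab(transcripts_by_language: Dict[str, List[str]]) -> Dict[str, int]:
    """Build one character vocabulary that covers every language's transcripts.

    Assumption: character-level CTC targets, with "|" as the word delimiter
    and "<pad>" doubling as the CTC blank token.
    """
    chars = set()
    for transcripts in transcripts_by_language.values():
        for text in transcripts:
            chars.update(text.lower())
    chars.discard(" ")  # spaces are represented by the "|" delimiter instead

    vocab = {"<pad>": 0, "<unk>": 1, "|": 2}
    for ch in sorted(chars):
        if ch not in vocab:  # keep indices unique if "|" appears in the text
            vocab[ch] = len(vocab)
    return vocab


# Placeholder sentences, one per language (not from the actual datasets).
samples = {
    "indonesian": ["apa kabar"],
    "javanese": ["piye kabare"],
    "sundanese": ["kumaha damang"],
}
vocab = build_shared_vocab(samples)
```

Because the three languages use largely the same Latin alphabet, a shared vocabulary of this kind stays small, which is one reason a single output layer can serve all of them.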
The following is a comparison of the models and their performance evaluation:
The following figure compares the models by Word Error Rate (WER) on the test split of Indonesian Common Voice 6.1 (lower is better).
Lastly, we integrated a language model into our speech recognition pipeline, which reduces the WER from 11.57% to 4.27% on the test split of Indonesian Common Voice 6.1. We also evaluated Google Speech-to-Text on the same test split, where its WER is 9.22%.
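For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the evaluation script used for the numbers above; the function names and the example sentence are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed, e.g. by adding a language model."""
    return (before - after) / before


# Hypothetical example: one dropped word out of four reference words.
example_wer = word_error_rate("saya pergi ke pasar", "saya pergi pasar")

# Using the figures reported above: 11.57% WER without and 4.27% with the language model.
lm_gain = relative_reduction(11.57, 4.27)
```

With the reported figures, the language model lowers the WER by 7.30 percentage points, which corresponds to a relative reduction of roughly 63%.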
The performance evaluation can be found here
- The experiment shows that the multilingual model can perform on par with a model trained on a single language; the Word Error Rate (WER) difference is at most 0.6 percentage points. We also trained the multilingual model for more epochs, and it then outperforms the monolingual model.
- The monolingual model performs very well in the language it was trained on but poorly in other languages.
- The multilingual speech recognition model overcomes the need to have a separate model for each language in Indonesia. Therefore, it significantly reduces hardware resources and simplifies the model deployment.
We plan the following for the future:
- Training the model with more data and more Indonesian languages.
- Integrating a language model to reduce the WER.
- Compressing the model to speed up inference and reduce hardware requirements.
- Developing real-time speech recognition based on this multilingual model.

