Languini is trained on thousands of spectrogram images of real-world speakers. It uses a Siamese network with a triplet loss function to train a model to recognise similar-sounding speech by comparing spectrograms.
The basic idea is very simple:
- Voice is recorded and stored as a digital .WAV file
- The digital signal is converted to a mel spectrogram using a fast Fourier transform (FFT) (input, for the model)
- The spectrogram of the recording is compared with the spectrogram of the native speaker (anchor, for the model)
- The model embeds both images (as TensorFlow vectors) and compares them through cosine similarity
- The resulting score is weighted against previous attempts to show progress
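The comparison steps above can be sketched end-to-end in a few lines. This is a minimal, dependency-light sketch, not the project's actual pipeline: it uses plain NumPy, a windowed magnitude spectrogram standing in for a full mel spectrogram, and the raw spectrogram standing in for a learned embedding. All names here are illustrative.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    # Slice the signal into overlapping Hann-windowed frames and take the
    # FFT magnitude. (A real mel spectrogram would also apply a mel
    # filterbank; omitted here to keep the sketch dependency-free.)
    frames = [signal[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(signal) - frame_len, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

def cosine_similarity(a, b):
    # Score in [-1, 1]; 1.0 means the two embeddings point the same way.
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two recordings of the "same" sound (a 440 Hz tone, slightly phase-shifted)
# versus a different sound (an 880 Hz tone).
t = np.linspace(0, 1, 16000, endpoint=False)
learner = np.sin(2 * np.pi * 440 * t)
anchor = np.sin(2 * np.pi * 440 * t + 0.1)
other = np.sin(2 * np.pi * 880 * t)

sim_same = cosine_similarity(spectrogram(learner), spectrogram(anchor))
sim_diff = cosine_similarity(spectrogram(learner), spectrogram(other))
```

Matching sounds score close to 1.0, while mismatched sounds score noticeably lower, which is the signal the weighted progress score is built on.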
Create a directory of files with audio recordings of sounds in the target language. You can generate a model using the languiniai/train_model.py script.
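Training with a triplet loss means pulling the anchor embedding toward a positive (same sound) and pushing it away from a negative (different sound), up to a margin. This generic NumPy sketch illustrates the standard triplet loss, not the project's actual training code; the `margin` value is an assumed hyperparameter.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Squared Euclidean distances between the embeddings.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    # Loss is zero once the negative is at least `margin` farther away
    # than the positive.
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([1.0, 0.0])   # anchor embedding
p = np.array([1.0, 0.0])   # positive: same sound
n = np.array([0.0, 1.0])   # negative: different sound

easy = triplet_loss(a, p, n)  # already satisfied, loss is 0
hard = triplet_loss(a, n, p)  # roles swapped, loss is positive
```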
You can use our trained model and languiniai/compare.py to compare two different mel spectrograms.
Within the notebooks branch there is a directory with Jupyter notebooks that walk through the whole process.