🐛 Description
I am trying to run 🐸TTS on new languages and came across a bug - or at least something that I think could be improved in the documentation for new languages, unless I missed something, in which case please point it out to me!

The language I am working with has digraphs in its character set - that is, `ɡʷ` is a separate character from `ɡ`. In `TTS.tts.utils.text.text_to_sequence`, the raw text is transformed into a sequence of character indices, but the function that turns the cleaned text into the sequence (`TTS.tts.utils.text._symbols_to_sequence`) simply iterates through the string one character at a time. So when processing `ɡʷ`, the index for `ɡ` is returned and `ʷ` is discarded by `_should_keep_symbol`.

It appears there is a way to handle ARPAbet digraphs using curly braces. Another way of handling this could be to sort the list of characters by length in descending order, tokenize the raw text against that sorted list (longest match first), and pass the tokenized list to `TTS.tts.utils.text._symbols_to_sequence`. This is what I'm currently doing in a custom cleaner, but is this the "correct" or intended way of handling this?

I would also love it if the TensorBoard log tracked text, for example by logging a comparison of the raw text with the text reconstructed from the sequence, as shown in the unit test below. That would have saved me training a few models and only figuring out the bug by listening to the audio.
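The longest-match tokenization described above can be sketched as follows. This is a minimal illustration, not the actual cleaner I use; `tokenize` is a hypothetical helper and not part of the TTS API:

```python
def tokenize(text, symbols):
    """Split `text` into the longest matching symbols, left to right.

    `symbols` must be sorted by length, longest first, so that a
    digraph like 'ɡʷ' is matched before its single-character prefix 'ɡ'.
    """
    tokens = []
    i = 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # No symbol matched: skip this character (mirrors how
            # _should_keep_symbol silently drops unknown symbols).
            i += 1
    return tokens


symbols = sorted(['a', 'h', 'ɡ', 'ɡʷ', ' '], key=len, reverse=True)
print(tokenize('ɡʷah', symbols))  # ['ɡʷ', 'a', 'h']
```

With the characters sorted longest-first, the digraph survives tokenization, and the resulting token list can be mapped to indices one token at a time instead of one Unicode codepoint at a time.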
To Reproduce
```python
from unittest import TestCase

from TTS.tts.utils.text import text_to_sequence, sequence_to_text
from TTS.tts.utils.text.symbols import make_symbols


class TestMohawkCharacterInputs(TestCase):
    def setUp(self) -> None:
        self.mohawk_characters = {
            "pad": "_",
            "eos": "~",
            "bos": "^",
            "characters": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ', ' '], key=len, reverse=True),
            "punctuations": "!(),-.;? ",
            "phonemes": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ'], key=len, reverse=True),
            "unique": True,
        }
        self.custom_symbols = make_symbols(**self.mohawk_characters)[0]
        self.mohawk_test_text = ["ɡʷah"]
        self.cleaners = ["basic_cleaners"]

    def test_text_parity(self):
        for utt in self.mohawk_test_text:
            seq = text_to_sequence(
                utt,
                cleaner_names=self.cleaners,
                custom_symbols=self.custom_symbols,
                tp=self.mohawk_characters["characters"],
                add_blank=False,
            )
            text = sequence_to_text(
                seq,
                tp=self.mohawk_characters,
                add_blank=False,
                custom_symbols=self.custom_symbols,
            )
            self.assertEqual(text, utt)
```
Expected behavior
With the above unit test, `text_to_sequence("ɡʷah", cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False)` returns `[45, 26, 30]` when it should return `[24, 26, 30]`. It would be nice if digraphs were handled by default, but barring that, it should be documented somewhere (here?) how they are meant to be handled, and ideally TensorBoard would perform some input-text sanity check.
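Such a sanity check could be as simple as the following library-independent sketch, where `encode`/`decode` stand in for `text_to_sequence`/`sequence_to_text`; all names here are illustrative, not existing TTS API:

```python
def round_trip_mismatches(texts, encode, decode):
    """Return (raw, reconstructed) pairs where decoding the encoded
    sequence does not reproduce the raw text."""
    bad = []
    for utt in texts:
        rec = decode(encode(utt))
        if rec != utt:
            bad.append((utt, rec))
    return bad


# Toy per-character encoder that, like _symbols_to_sequence, silently
# drops symbols it does not know (here the modifier 'ʷ').
symbols = ['a', 'h', 'ɡ']
encode = lambda s: [symbols.index(c) for c in s if c in symbols]
decode = lambda seq: ''.join(symbols[i] for i in seq)

print(round_trip_mismatches(['ɡʷah'], encode, decode))
# [('ɡʷah', 'ɡah')] -- the digraph was silently mangled
```

Logging any such mismatches (to the console or to TensorBoard's text tab) before training would surface this class of bug immediately instead of after a full training run.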
Environment
```json
{
    "CUDA": {
        "GPU": ["Tesla V100-SXM2-16GB"],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.10.1+cu102",
        "TTS": "0.5.0",
        "numpy": "1.21.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": ["64bit", "ELF"],
        "processor": "x86_64",
        "python": "3.8.12",
        "version": "#151-Ubuntu SMP Fri Jun 18 19:21:19 UTC 2021"
    }
}
```
Additional context
Many thanks to the authors for this excellent project!