Skip to content

[Bug] Document proper handling of digraphs #1072

Closed
@roedoejet

Description

@roedoejet

🐛 Description

I am trying to run 🐸TTS on new languages and came across a bug - or at least something that I think could be improved in the documentation for new languages, unless I missed something, in which case please point it out to me!

The language I am working with has digraphs in its character set - that is, ɡʷ is a separate character from ɡ. In TTS.tts.utils.text.text_to_sequence, the raw text is transformed to a sequence of character indices, but the function that turns the cleaned text into the sequence (TTS.tts.utils.text._symbols_to_sequence) just iterates through the string, so when processing ɡʷ, the index for ɡ is returned and ʷ is discarded by _should_keep_symbol. It appears there is a way to handle Arpabet digraphs using curly braces. Another way of handling this could be to reverse sort the list of characters according to length, then tokenize the raw text according to that sorted list of characters and pass the tokenized list to TTS.tts.utils.text._symbols_to_sequence. This is what I'm currently doing in a custom cleaner, but is this the "correct" or intended way of handling this? I would love if the tensorboard log tracked text as well, for example by logging a comparison of the raw text with the text reconstructed from the sequence as shown in the unittest below. That would have saved me training a few models and then only figuring out the bug by listening to the audio.

To Reproduce

from TTS.tts.utils.text import text_to_sequence, sequence_to_text
from TTS.tts.utils.text.symbols import make_symbols 

class TestMohawkCharacterInputs(TestCase):
    def setUp(self) -> None:
        self.mohawk_characters = {
        "pad": "_",
        "eos": "~",
        "bos": "^",
        "characters": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ', ' '],key=len,reverse=True),
        "punctuations": "!(),-.;? ",
        "phonemes": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ'],key=len,reverse=True),
        "unique": True
        }
        self.custom_symbols = make_symbols(**self.mohawk_characters)[0]
        self.mohawk_test_text = ["ɡʷah"]
        self.cleaners = ["basic_cleaners"]
        
    def test_text_parity(self):
        for utt in self.mohawk_test_text:
            seq = text_to_sequence(utt, cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False)
            text = sequence_to_text(seq, tp=self.mohawk_characters, add_blank=False, custom_symbols=self.custom_symbols)
            self.assertEqual(text, utt)

Expected behavior

With the above unittest, text_to_sequence("ɡʷah", cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False) returns [45, 26, 30] when it should return [24, 26, 30]. It would be nice if digraphs were handled by default, but barring that, it should be documented somewhere (here?) how they should be handled, and ideally tensorboard would perform some input text sanity check.

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla V100-SXM2-16GB"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.10.1+cu102",
        "TTS": "0.5.0",
        "numpy": "1.21.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.8.12",
        "version": "#151-Ubuntu SMP Fri Jun 18 19:21:19 UTC 2021"
    }
}

Additional context

Many thanks to the authors for this excellent project!

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingwontfixThis will not be worked on but feel free to help.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions