🐛 Description
I am trying to run 🐸TTS on new languages and came across a bug - or at least something that I think could be improved in the documentation for new languages, unless I missed something, in which case please point it out to me!

The language I am working with has digraphs in its character set - that is, `ɡʷ` is a separate character from `ɡ`. In `TTS.tts.utils.text.text_to_sequence`, the raw text is transformed into a sequence of character indices, but the function that turns the cleaned text into the sequence (`TTS.tts.utils.text._symbols_to_sequence`) simply iterates through the string one character at a time. So when processing `ɡʷ`, the index for `ɡ` is returned and `ʷ` is discarded by `_should_keep_symbol`.

It appears there is a way to handle ARPAbet digraphs using curly braces. Another way of handling this could be to sort the list of characters by length in descending order, tokenize the raw text against that sorted list (longest match first), and pass the tokenized list to `TTS.tts.utils.text._symbols_to_sequence`. This is what I'm currently doing in a custom cleaner, but is this the "correct" or intended way of handling this?

I would also love it if the TensorBoard log tracked text, for example by logging a comparison of the raw text with the text reconstructed from the sequence, as shown in the unit test below. That would have saved me training a few models and only figuring out the bug by listening to the audio.
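The longest-match tokenization described above can be sketched as follows. This is a minimal illustration, not the actual cleaner I use; `tokenize` is a hypothetical helper and not part of the TTS API:

```python
def tokenize(text, symbols):
    """Split `text` into the longest matching symbols, left to right.

    `symbols` must be sorted by length, longest first, so that a
    digraph like 'ɡʷ' is matched before its single-character prefix 'ɡ'.
    """
    tokens = []
    i = 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # No symbol matched: skip this character (mirrors how
            # _should_keep_symbol silently drops unknown symbols).
            i += 1
    return tokens


symbols = sorted(['a', 'h', 'ɡ', 'ɡʷ', ' '], key=len, reverse=True)
print(tokenize('ɡʷah', symbols))  # ['ɡʷ', 'a', 'h']
```

With the characters sorted longest-first, the digraph survives tokenization, and the resulting token list can be mapped to indices one token at a time instead of one Unicode codepoint at a time.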
To Reproduce
```python
from unittest import TestCase

from TTS.tts.utils.text import text_to_sequence, sequence_to_text
from TTS.tts.utils.text.symbols import make_symbols


class TestMohawkCharacterInputs(TestCase):
    def setUp(self) -> None:
        self.mohawk_characters = {
            "pad": "_",
            "eos": "~",
            "bos": "^",
            "characters": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ', ' '], key=len, reverse=True),
            "punctuations": "!(),-.;? ",
            "phonemes": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ'], key=len, reverse=True),
            "unique": True,
        }
        self.custom_symbols = make_symbols(**self.mohawk_characters)[0]
        self.mohawk_test_text = ["ɡʷah"]
        self.cleaners = ["basic_cleaners"]

    def test_text_parity(self):
        for utt in self.mohawk_test_text:
            seq = text_to_sequence(
                utt,
                cleaner_names=self.cleaners,
                custom_symbols=self.custom_symbols,
                tp=self.mohawk_characters["characters"],
                add_blank=False,
            )
            text = sequence_to_text(
                seq,
                tp=self.mohawk_characters,
                add_blank=False,
                custom_symbols=self.custom_symbols,
            )
            self.assertEqual(text, utt)
```
Expected behavior
With the above unit test, `text_to_sequence("ɡʷah", cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False)` returns `[45, 26, 30]` when it should return `[24, 26, 30]`. It would be nice if digraphs were handled by default, but barring that, it should be documented somewhere (here?) how they are meant to be handled, and ideally TensorBoard would perform some input-text sanity check.
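Such a sanity check could be as simple as the following library-independent sketch, where `encode`/`decode` stand in for `text_to_sequence`/`sequence_to_text`; all names here are illustrative, not existing TTS API:

```python
def round_trip_mismatches(texts, encode, decode):
    """Return (raw, reconstructed) pairs where decoding the encoded
    sequence does not reproduce the raw text."""
    bad = []
    for utt in texts:
        rec = decode(encode(utt))
        if rec != utt:
            bad.append((utt, rec))
    return bad


# Toy per-character encoder that, like _symbols_to_sequence, silently
# drops symbols it does not know (here the modifier 'ʷ').
symbols = ['a', 'h', 'ɡ']
encode = lambda s: [symbols.index(c) for c in s if c in symbols]
decode = lambda seq: ''.join(symbols[i] for i in seq)

print(round_trip_mismatches(['ɡʷah'], encode, decode))
# [('ɡʷah', 'ɡah')] -- the digraph was silently mangled
```

Logging any such mismatches (to the console or to TensorBoard's text tab) before training would surface this class of bug immediately instead of after a full training run.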
Environment
```json
{
    "CUDA": {
        "GPU": ["Tesla V100-SXM2-16GB"],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.10.1+cu102",
        "TTS": "0.5.0",
        "numpy": "1.21.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": ["64bit", "ELF"],
        "processor": "x86_64",
        "python": "3.8.12",
        "version": "#151-Ubuntu SMP Fri Jun 18 19:21:19 UTC 2021"
    }
}
```
Additional context
Many thanks to the authors for this excellent project!