Contributor Text to Speech

alexisheloir edited this page Apr 23, 2019 · 1 revision

Text-to-speech using Mary TTS

The blend shapes used by the MaryTTSController are generated during the YALLAH setup, in Blender, and all of them start with the phoneme_ prefix. The Blender code generating the phonemes is in BlenderScripts/addons/yallah/features/MaryTTS/Setup.py.

At run-time, in Unity, the text-to-speech functionality uses a remote MaryTTS server.

You must run an instance of MaryTTS locally on your machine or on a server of your choice. The MaryTTS server is accessed through an HTTP interface. You can check that the MaryTTS server is working by opening its web interface in your browser, e.g.: http://mary.dfki.de:59125/.
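As a quick sanity check, you can build a request for the MaryTTS HTTP interface yourself. The following is a minimal Python sketch (not part of YALLAH) that assembles a URL for the standard `process` endpoint, assuming the default port 59125 and the usual `INPUT_TEXT`/`INPUT_TYPE`/`OUTPUT_TYPE`/`LOCALE` parameters:

```python
from urllib.parse import urlencode

def mary_process_url(text, host="localhost", port=59125,
                     output_type="AUDIO", locale="en_US"):
    """Build a request URL for the MaryTTS 'process' endpoint.

    output_type can be e.g. "AUDIO" (to get a WAV file) or
    "REALISED_DURATIONS" (to get the phoneme timing table).
    """
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": output_type,
        "LOCALE": locale,
    }
    if output_type == "AUDIO":
        # The web interface requests WAV audio as AUDIO=WAVE_FILE.
        params["AUDIO"] = "WAVE_FILE"
    return "http://%s:%d/process?%s" % (host, port, urlencode(params))

# URL for the audio of a sentence:
print(mary_process_url("Hello world"))
# URL for the phoneme timing table of the same sentence:
print(mary_process_url("Hello world", output_type="REALISED_DURATIONS"))
```

Opening such a URL in a browser (against your own server) should return audio or a plain-text table, respectively.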

The text-to-speech is implemented using Haxe code and a C# adapter.

The Haxe code for the core logic is in SharedHaxeClasses/MaryTTSBlendSequencer.hx, which Haxe transpiles into YALLAH/Scripts/haxe_lib/MaryTTSBlendSequencer.cs. We will call it the Sequencer.

The C# Adapter is in the Unity project, in YALLAH/Scripts/tts/MaryTTSController.cs. The adapter takes text as input and sends two requests to the MaryTTS server to retrieve:

  • The realised_durations table;
  • The WAV audio file.

The Adapter will:

  • start playing the audio using the Unity API;
  • ask the tts Sequencer to parse the realised_durations table in order to compute the weights of the mouth blend shapes.
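The parsing step can be illustrated with a small Python sketch (a hypothetical helper, not the actual Haxe Sequencer code). It assumes the realised_durations table is plain text in which comment lines start with `#` and each data line holds an end time in seconds, a duration, and a phoneme symbol:

```python
def parse_realised_durations(table_text):
    """Parse a realised_durations table into (end_time, phoneme) pairs.

    Assumption: comment lines start with '#'; every other line is
    '<end time in seconds> <duration> <phoneme symbol>'.
    """
    timeline = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        end_time, phoneme = float(fields[0]), fields[-1]
        timeline.append((end_time, phoneme))
    return timeline

# Synthetic example table (illustrative values, not real MaryTTS output).
sample = """\
#
0.170 170 _
0.245 75 m
0.420 175 A
"""
print(parse_realised_durations(sample))
# → [(0.17, '_'), (0.245, 'm'), (0.42, 'A')]
```

From such a timeline, the Sequencer can decide which viseme blend shape should be active at any playback time and blend the weights accordingly.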

The Sequencer loads some information from MaryTTS-Info-MBLab1_6.json; the adapter passes this filename to the constructor of the Sequencer. Keeping this information in a separate file makes it possible to update the mappings (e.g., to support characters with different visemes) without recompiling the Haxe code.

The MaryTTS-Info is a JSON file with three top-level entries:

  • visemes is the list of blend shapes that drive the movement of the mouth; the character's face must have all of them.
  • map is a dictionary mapping a phoneme (as it may appear in the MaryTTS output) to a viseme (a mouth shape). All the visemes used as targets here must be present in the visemes list above. A phoneme can also map to null, meaning that the mouth will not move; this is useful for explicit pauses, given by the phoneme _ (underscore).
  • default_viseme is the name of the blend shape to use when a phoneme is not explicitly listed in the map dictionary.

Extract:

{
    "_comments": [
        "A dirty trick",
        "to insert multi-line comments..."
    ],
    "default_viseme": "phoneme_c_01",
    "visemes": [
        "phoneme_a_01",
        "phoneme_a_02",
        "and more..."
    ],
    "map": {
        "fw": "phoneme_f_01",
        "ge": "phoneme_g_01",
        "and": "more..."
    }
}
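The lookup rules described above (map entry wins, an explicit null means "don't move the mouth", unknown phonemes fall back to default_viseme) can be sketched in Python. This is a hypothetical illustration, not the actual Haxe Sequencer code:

```python
import json

def viseme_for_phoneme(phoneme, tts_info):
    """Resolve a phoneme to a viseme blend shape name, or None for a pause.

    tts_info is the parsed MaryTTS-Info JSON: an entry in 'map' wins,
    an explicit null mapping means the mouth should not move, and a
    phoneme missing from 'map' falls back to 'default_viseme'.
    """
    mapping = tts_info["map"]
    if phoneme in mapping:
        return mapping[phoneme]  # may be None (explicit pause, e.g. "_")
    return tts_info["default_viseme"]

# A tiny stand-in for the MaryTTS-Info file (values are illustrative).
info = json.loads("""
{
    "default_viseme": "phoneme_c_01",
    "visemes": ["phoneme_a_01", "phoneme_f_01"],
    "map": {"fw": "phoneme_f_01", "_": null}
}
""")

print(viseme_for_phoneme("fw", info))  # → phoneme_f_01
print(viseme_for_phoneme("_", info))   # → None
print(viseme_for_phoneme("xx", info))  # → phoneme_c_01
```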