Text to Speech
The blend shapes used by the MaryTTSController are generated during the YALLAH setup, in Blender, and all of them start with the `phoneme_` prefix.
The Blender code generating the phonemes is in `BlenderScripts/addons/yallah/features/MaryTTS/Setup.py`.
At run-time, in Unity, the text-to-speech functionality uses a remote MaryTTS server.
You must run an instance of MaryTTS locally on your machine or on a server of your choice. The MaryTTS server is used through an HTTP interface. You can test whether the MaryTTS server works by opening the web interface in your browser, e.g.: http://mary.dfki.de:59125/.
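For a scripted check, something like the following can verify that the server responds. This is a minimal sketch using .NET's `HttpClient`; the host and port are assumptions matching the example above, and the `/version` page is part of the MaryTTS 5.x HTTP interface:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class MaryTTSPing
{
    static async Task Main()
    {
        // Assumed address of a locally running MaryTTS server.
        const string baseUrl = "http://localhost:59125";
        using (var client = new HttpClient())
        {
            // MaryTTS 5.x exposes a "/version" page; any successful
            // response means the server is up and reachable.
            HttpResponseMessage response = await client.GetAsync(baseUrl + "/version");
            Console.WriteLine("MaryTTS reachable: " + response.IsSuccessStatusCode);
        }
    }
}
```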
The text-to-speech feature is implemented using Haxe code and a C# adapter.
The Haxe code for the core logic is in `SharedHaxeClasses/MaryTTSBlendSequencer.hx`, which Haxe transpiles into `YALLAH/Scripts/haxe_lib/MaryTTSBlendSequencer.cs`. We will call it the Sequencer.
The C# Adapter is in the Unity project, in `YALLAH/Scripts/tts/MaryTTSController.cs`. The Adapter takes a text as input and sends two requests to the MaryTTS server (sketched below), to get:
- the `realised_durations` table;
- the WAV audio file.
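Both requests go through the MaryTTS `process` endpoint, selecting a different output type each time. As a rough sketch (the query parameters follow the public MaryTTS 5.x HTTP interface; the host, locale, and input text are placeholders):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class MaryTTSRequests
{
    // Assumed local server; replace with your own instance.
    const string ProcessUrl = "http://localhost:59125/process";

    static string BuildQuery(string text, string outputType)
    {
        string query = ProcessUrl
            + "?INPUT_TYPE=TEXT&LOCALE=en_US"
            + "&INPUT_TEXT=" + Uri.EscapeDataString(text)
            + "&OUTPUT_TYPE=" + outputType;
        // Audio requests must also specify the audio container format.
        if (outputType == "AUDIO")
            query += "&AUDIO=WAVE_FILE";
        return query;
    }

    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // Request 1: the realised durations table (plain text).
            string durations = await client.GetStringAsync(
                BuildQuery("Hello world", "REALISED_DURATIONS"));
            // Request 2: the synthesized speech as a WAV file.
            byte[] wav = await client.GetByteArrayAsync(
                BuildQuery("Hello world", "AUDIO"));
            Console.WriteLine(durations);
            Console.WriteLine("Received " + wav.Length + " bytes of audio.");
        }
    }
}
```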
The Adapter will:
- start playing the audio using the Unity API;
- ask the TTS Sequencer to parse the `realised_durations` table in order to compute the weights of the blend shapes of the mouth, as sketched below.
The Sequencer loads some information from `MaryTTS-Info-MBLab1_6.json`. The Adapter passes this filename to the constructor of the Sequencer.
Keeping this information in a separate file makes it possible to update the mappings (e.g., to support characters with different visemes) without recompiling the Haxe code.
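In the Adapter, this might look roughly like the line below. This is a hypothetical sketch; check `MaryTTSController.cs` for the actual constructor signature:

```csharp
// Hypothetical call: the Adapter hands the info file name to the Sequencer,
// which loads the viseme list and the phoneme-to-viseme map from it.
var sequencer = new MaryTTSBlendSequencer("MaryTTS-Info-MBLab1_6.json");
```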
The `MaryTTS-Info` file is a JSON file with three top-level entries:
- `visemes` is the list of BlendShapes driving the movement of the mouth that the character's face must have.
- `map` is a dictionary mapping a phoneme (possibly coming from the MaryTTS output) to a viseme (a shape of the mouth). All the visemes used here as targets must be present in the above `visemes` list. A phoneme can also map to `null`, meaning that the mouth will not move; this is useful for explicit pauses, given by the phoneme `_` (underscore).
- `default_viseme` is the name of the BlendShape to use when a phoneme is not explicitly listed in the dictionary (see the lookup sketch after the extract).
Extract:
```json
{
  "_comments": [
    "A dirty trick",
    "to insert multi-line comments..."
  ],
  "default_viseme": "phoneme_c_01",
  "visemes": [
    "phoneme_a_01",
    "phoneme_a_02",
    "and more..."
  ],
  "map": {
    "fw": "phoneme_f_01",
    "ge": "phoneme_g_01",
    "and": "more..."
  }
}
```
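The lookup rules described above can be summarized in code. The following is an illustrative sketch, not the actual Sequencer code: the class and field names are hypothetical, and it assumes a JSON library such as Json.NET rather than whatever the Haxe-generated code actually uses:

```csharp
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical C# mirror of the MaryTTS-Info JSON schema.
class MaryTTSInfo
{
    [JsonProperty("visemes")] public List<string> Visemes;
    [JsonProperty("map")] public Dictionary<string, string> Map;
    [JsonProperty("default_viseme")] public string DefaultViseme;

    // Resolve a phoneme to a viseme following the rules above.
    public string ResolveViseme(string phoneme)
    {
        if (Map.TryGetValue(phoneme, out string viseme))
            return viseme;        // may be null: explicit pause, mouth stays still
        return DefaultViseme;     // unlisted phoneme: fall back to the default
    }
}
```

Loading would then be a single call, e.g. `var info = JsonConvert.DeserializeObject<MaryTTSInfo>(jsonText);`, after which `info.ResolveViseme(phoneme)` yields either a BlendShape name or `null`.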