
Conversation

@hollance (Contributor) commented Jun 7, 2023

What does this PR do?

Adds the VITS model for text-to-speech, in particular to support the MMS-TTS checkpoints (which use the same model architecture but a different tokenizer).

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@hollance hollance force-pushed the vits branch 2 times, most recently from f049cd2 to 920d480 Compare June 13, 2023 10:09
@hollance (Contributor, Author)

Notes about the tokenizer:

  1. This is not the VITS tokenizer but the one for MMS-TTS.
  2. The vocab doesn't have padding (or unknown) tokens in it, but uses token_id 0 for this. That breaks with the HF tokenizers because they split the input text on the padding token, so if I set pad_token_id = 0, the letters that token_id 0 corresponds to will disappear from the text.
  3. To fix this issue, I'm adding <pad> and <unk> to the vocab, but then in the model we set such token_ids to 0 before feeding the input into the first layer. It's a bit hacky. Ideas for a nicer solution are appreciated.
  4. The tokenizer also inserts an additional token_id 0 in between every token. No idea why, but that's how it works (see the sketch below).
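
To make points 2-4 concrete, here is a rough sketch of the behaviour being described (a hypothetical helper, not the code in this PR; the vocab and helper names are made up):

# Sketch of the MMS-TTS encoding described above (illustration only).
# Assumes `vocab` maps characters to ids and that "<pad>" and "<unk>" were
# appended to the vocab as described in point 3.
def encode_mms_text(text, vocab, unk_token="<unk>"):
    ids = []
    for ch in text:
        ids.append(0)                                # point 4: token_id 0 interleaved between tokens
        ids.append(vocab.get(ch, vocab[unk_token]))
    ids.append(0)
    return ids

# Point 3: inside the model, ids pointing at the added <pad>/<unk> entries are
# set back to 0 before the input reaches the first (embedding) layer, e.g.
#     input_ids[input_ids >= original_vocab_size] = 0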

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@hollance hollance force-pushed the vits branch 2 times, most recently from f722786 to 64c62fc Compare June 15, 2023 15:06
@hollance hollance force-pushed the vits branch 2 times, most recently from d895880 to 0d4fd3b Compare June 22, 2023 12:02
@hollance (Contributor, Author) commented Jun 22, 2023

This is now ready for a first review.

Two checkpoints are currently available:

Small usage example:

from transformers import VitsMmsTokenizer, VitsModel
import torch

tokenizer = VitsMmsTokenizer.from_pretrained("Matthijs/mms-tts-eng")
model = VitsModel.from_pretrained("Matthijs/mms-tts-eng")

inputs = tokenizer(text="Hello, my dog is cute", return_tensors="pt")

outputs = model(inputs["input_ids"])
speech = outputs.audio
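
As a follow-up to the example (not part of this PR): assuming the 16 kHz sampling rate that the MMS-TTS checkpoints use (mentioned later in this thread), and assuming `speech` is a (batch, num_samples) float tensor, the waveform can be written to disk with something like:

import scipy.io.wavfile

# 16_000 Hz is an assumption for the MMS-TTS checkpoints
scipy.io.wavfile.write("speech.wav", rate=16_000, data=speech[0].detach().numpy())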

The current model is the MMS-TTS version, not the original VITS version. The conversion scripts can handle both, but for original VITS support the tokenizer is still missing.

Still needs to be done:

  • tests
  • tokenizer for actual VITS

@Vaibhavs10 For this review, in particular could you verify the names of the layers in the flow layers etc make sense? Thanks!

@Vaibhavs10 (Member) left a comment

Just reviewed the flow layer names for the coupling block and the duration predictor. Perfect! Fantastic from a code PoV. Mega job @hollance 🤗

@sanchit-gandhi (Contributor) left a comment

Looks really great already @hollance - nice work on simplifying such a new model for the lib. The high-level API works really nicely and the code by-and-large reads well, mostly just nits on my part

Contributor

Hey @ArthurZucker - wonder if you know a more elegant workaround for this tokenization issue?

Collaborator

I think that the vocab should be updated in convert_original_checkpoint, and just add the special tokens there.
I am not sure I understand the problem: they pad with 0, but 0 maps to "k". So they pad with "k"?

Contributor Author

Yes, they pad with token_id 0. The problem is that the tokenizer splits the input on the padding token and tokenizes these subsequences separately, so anywhere there is a k in the text it will disappear.

For example, "It stinks here" gets tokenized as "It stin" and "s here", because "k" is not treated as a regular token but as padding.
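
A quick way to see the effect (purely illustrative plain string splitting, not the actual tokenizer logic):

# If "k" is treated as the padding token, the tokenizer effectively splits on it
# and tokenizes the pieces separately, so the "k" itself never reaches the model.
text = "It stinks here"
print(text.split("k"))  # ['It stin', 's here']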

Contributor

I'm not super familiar with the tokenization logic, so bear with me.

Am I right in understanding:

  • The model is trained with 0 as a pad token
  • The token_id 0 maps to "k" in the vocabulary
  • The tokenizer first splits on the pad token, and then tokenizes
  • Remapping the padding token to "<pad>" means we keep "k" in our input sequences
  • We filter any token_ids corresponding to "<pad>" in our model? How is this done - setting them back to 0? With a mask?

Now, what happens if I have the input sentence "I like the letter k"? When k is mapped to token_id 0, as the model was trained with this as a pad token, will it interpret k as padding?

Collaborator

I think @amyeroberts raised very valid questions.

  • if they pad with k, then k is the padding token, no?
  • if <pad> and k both point to 0, we don't really have a way in transformers to make sure this can be done. But I don't see why it should be done, since they pad with k!
  • if the problem is that our tokenize method splits on the special tokens, including padding, but you don't want this to happen, then you can just override tokenize. Given that this model does not seem to use bos nor eos, you can probably ignore the lstrip and rstrip logic and just take what you have in _tokenize (a quick sketch of this is below).
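
A minimal sketch of that last suggestion (assuming a slow tokenizer subclass; this is an illustration of the idea, not the code in this PR):

from transformers import PreTrainedTokenizer

class VitsMmsTokenizer(PreTrainedTokenizer):
    # ... vocab handling omitted ...

    def tokenize(self, text, **kwargs):
        # Skip the base-class logic that splits the text on special tokens
        # (including the padding token) and go straight to the model-specific
        # character-level tokenization.
        return self._tokenize(text)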

@sanchit-gandhi (Contributor) commented Jul 10, 2023

Now answering @ArthurZucker's points!

if they pad with k, then k is the padding token, no?

Explained above - they pad with the 0 token, which also corresponds to k in the vocab

if <pad> and k both point to 0, we don't really have a way in transformers to make sure this can be done

It is indeed the case that both <pad> and k point to 0 - hence the hackiness with the current tokenizer file to make this work, since there isn't a natural way of doing this in transformers

So we probably have to keep a hacky workaround like this to make it work?

Contributor

Wondering if you had any thoughts here @ArthurZucker @amyeroberts on the conflicting <pad>/k tokens both being mapped to 0? It's explained a bit above why this works.

Collaborator

Can't we just set the padding token to k?

@ArthurZucker (Collaborator) commented Jul 13, 2023

Ah, we can't - otherwise k will be skipped when decoding without special tokens, and issues can arise. My bad, I'll think about it, but a hack is most probably needed. EDIT: actually, when you decode, k will always be skipped if you set a padding token that maps to 0, which is why I don't really think we should look into this too much. Either we keep a very simple hack, where in the vocab we make k and <pad> both map to 0 and just set the padding token to <pad> (this will work without further hacks - and let's not check whether it is in the vocab, because if someone wants to change it by adding a new token, it will not be in the vocab), or we just set the padding token to k.

Contributor

Gone with this proposed 'hack' - we force the pad token at word boundaries, and then encode any k's to their appropriate ids. Note also that we are unlikely to ever need to decode input_ids with a TTS model - we put the input_ids into the model, and get speech out.
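
Roughly, the adopted idea looks like this (an illustration with a made-up helper, not the final tokenizer code): both "k" and "<pad>" resolve to id 0 in the vocab, characters are encoded directly, and the pad token is inserted explicitly instead of going through the generic special-token splitting.

def encode_with_pad_hack(text, vocab, pad_token="<pad>"):
    # Illustration only: "k" encodes to 0 via the vocab, and the pad token
    # (which also maps to 0) is forced in at word boundaries.
    ids = []
    for word in text.split(" "):
        ids.extend(vocab[ch] for ch in word)   # assumes every character is in the vocab (no unk handling, for brevity)
        ids.append(vocab[pad_token])
    return ids[:-1]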

@ArthurZucker (Collaborator) left a comment

Just a small review since @sanchit-gandhi left lots of comments. Feel free to ping me again!
(For next review let's get rid of single letter variables)


Collaborator

I think we can split this up by defining intermediate variables, like input_derivatives + input_derivatives_plus_one - 2 * input_delta, which is reused twice, and the same for (inputs - input_cumheights). Also, let's remove the single-letter variables and make the math happening here more understandable! 😉
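
For example, something along these lines (a sketch based only on the expressions quoted above, not the exact code in the modelling file):

def _spline_intermediates(inputs, input_cumheights, input_delta,
                          input_derivatives, input_derivatives_plus_one):
    # Name the sub-expressions that the spline equations reuse, instead of
    # repeating them inline (variable names taken from the review comment).
    derivatives_minus_delta = input_derivatives + input_derivatives_plus_one - 2 * input_delta
    offset_heights = inputs - input_cumheights
    return derivatives_minus_delta, offset_heights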

@hollance hollance force-pushed the vits branch 2 times, most recently from 806afdf to b7d0fb9 Compare June 27, 2023 11:27
@hollance hollance changed the title [WIP] add VITS model add VITS model Jun 27, 2023
@hollance hollance marked this pull request as ready for review June 27, 2023 11:48
@hollance (Contributor, Author)

Some of the MMS-TTS checkpoints require the use of the uroman tool from https://github.com/isi-nlp/uroman to convert the input script into the Latin alphabet. Since this is a separate Perl script, it is not included in Transformers, and the user will have to run it themselves before using the tokenizer.
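
One possible way to perform that step from Python (a sketch; the script path is an assumption based on a local clone of the uroman repository, nothing here ships with Transformers):

import subprocess

def uromanize(text, uroman_path="uroman/bin/uroman.pl"):
    # Romanize the input text with the external uroman Perl script before tokenizing.
    result = subprocess.run(
        ["perl", uroman_path],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()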

@hollance hollance requested a review from sgugger June 27, 2023 12:39
@sgugger (Collaborator) left a comment

I'm not too sure why I'm asked for a review here as all comments from @sanchit-gandhi are being ignored. The goal of integrating a model into Transformers is not to copy the original code without adapting it.

Please let's have all comments addressed before asking for the next round of reviews.

@hollance (Contributor, Author)

I'm not too sure why I'm asked for a review here as all comments from @sanchit-gandhi are being ignored.

No they aren't?! I've integrated most of his suggestions and replied with counterarguments otherwise.

@sanchit-gandhi (Contributor) left a comment

Just a few more variable names to go - I commented on the design decisions from the last review directly inline above.

@sanchit-gandhi (Contributor) left a comment

LGTM @hollance - last question from my side is whether we've pinpointed the changes between the original tokenizer and MMS tokenizer? If they're subtle enough that we can use one tokenizer for both models, we can re-name the tokenizer from VitsMmsTokenizer to VitsTokenizer (as suggested by @sgugger) and include a flag to toggle between the two padding behaviours?

@hollance (Contributor, Author)

The tokenizer can now handle both the original VITS models (which require phonemization) and the MMS-TTS models.

@sanchit-gandhi (Contributor)

All points addressed, so this is ready for a final review @amyeroberts. Thanks for your in-depth reviews here - the PR looks to be in pretty good shape!

@amyeroberts (Contributor) left a comment

Amazing work - thanks for iterating on this and adding this model!

The only thing that needs to be addressed before merging is updating the slow tests to reflect the tokenizer changes. Once that's done we're good to merge! 🥳

@sanchit-gandhi (Contributor) commented Aug 31, 2023

Hey @amyeroberts - to have full compatibility with the text-to-audio pipeline class, we need to indicate the sampling_rate of the predicted audio waveforms in the model config:

sampling_rate = config.get(sampling_rate_name, None)

The sampling_rate corresponds to the sampling rate of the target audio that the model was trained on. It cannot be determined in any way other than from the value in the original config of the model. MMS-TTS models use a sampling rate of 16 kHz and VITS TTS models use a sampling rate of 22 kHz, but otherwise their configs are the same. The user needs to know the sampling rate the model generates at in order to know what rate to play the audio back at; otherwise this leaves them prone to silent errors. IMO adding it as an attribute of the main model class should suffice here:

self.sampling_rate = config.sampling_rate

Note that we cannot just add the sampling_rate to the config and not the modelling file; this is not allowed by the CI:
https://app.circleci.com/pipelines/github/huggingface/transformers/71815/workflows/c0[…]cb-a064-7c1019e03630/jobs/904373/parallel-runs/0/steps/0-116

cc @ylacombe
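
To make the failure mode concrete, here is a sketch (not the actual pipeline code) of the lookup described above and why a missing value would otherwise go unnoticed:

from transformers import VitsModel

model = VitsModel.from_pretrained("Matthijs/mms-tts-eng")

# Read the sampling rate from the model config if present; without it there is
# no reliable way to recover the correct playback rate.
sampling_rate = getattr(model.config, "sampling_rate", None)
if sampling_rate is None:
    raise ValueError(
        "The model config does not define `sampling_rate`; audio played back "
        "at a guessed rate would simply sound wrong, with no explicit error."
    )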

@sanchit-gandhi (Contributor)

As discussed offline with @amyeroberts, we'll add it as an allowed attribute in the config checker: 8b01633

@ydshieh (Collaborator) commented Sep 4, 2023

Hi @hollance, thank you for adding this model to transformers 🤗.

There is a failing test:

python3 -m pytest -v tests/models/vits/test_modeling_vits.py::VitsModelTest::test_initialization

I have skipped it on the main branch. Would you like to help us investigate this test if you have some bandwidth? Otherwise we can take it on our side too.

If you decide to take a look, you have to remove the following line

@unittest.skip("currently failing")

so the test will be collected and run by pytest.

Let me know :-) Thank you!

parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
* add VITS model

* let's vits

* finish TextEncoder (mostly)

* rename VITS to Vits

* add StochasticDurationPredictor

* ads flow model

* add generator

* correctly set vocab size

* add tokenizer

* remove processor & feature extractor

* add PosteriorEncoder

* add missing weights to SDP

* also convert LJSpeech and VCTK checkpoints

* add training stuff in forward

* add placeholder tests for tokenizer

* add placeholder tests for model

* starting cleanup

* let the great renaming begin!

* use config

* global_conditioning

* more cleaning

* renaming variables

* more renaming

* more renaming

* it never ends

* reticulating the splines

* more renaming

* HiFi-GAN

* doc strings for main model

* fixup

* fix-copies

* don't make it a PreTrainedModel

* fixup

* rename config options

* remove training logic from forward pass

* simplify relative position

* use actual checkpoint

* style

* PR review fixes

* more review changes

* fixup

* more unit tests

* fixup

* fix doc test

* add integration test

* improve tokenizer tests

* add tokenizer integration test

* fix tests on GPU (gave OOM)

* conversion script can handle repos from hub

* add conversion script for all MMS-TTS checkpoints

* automatically create a README for the converted checkpoint

* small changes to config

* push README to hub

* only show uroman note for checkpoints that need it

* remove conversion script because code formatting breaks the readme

* make WaveNet layers configurable

* rename variables

* simplifying the math

* output attentions and hidden states

* remove VitsFlip in flow model

* also got rid of the other flip

* fix tests

* rename more variables

* rename tokenizer, add phonemization

* raise error when phonemizer missing

* re-order config docstrings to match method

* change config naming

* remove redundant str -> list

* fix copyright: vits authors -> kakao enterprise

* (mean, log_variances) -> (prior_mean, prior_log_variances)

* if return dict -> if not return dict

* speed -> speaking rate

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* update fused tanh sigmoid

* reduce dims in tester

* audio -> output_values

* audio -> output_values in tuple out

* fix return type

* fix return type

* make _unconstrained_rational_quadratic_spline a function

* all nn's to accept a config

* add spectro to output

* move {speaking rate, noise scale, noise scale duration} to config

* path -> attn_path

* idxs -> valid idxs -> padded idxs

* output values -> waveform

* use config for attention

* make generation work

* harden integration test

* add spectrogram to dict output

* tokenizer refactor

* make style

* remove 'fake' padding token

* harden tokenizer tests

* ron norm test

* fprop / save tests deterministic

* move uroman to tokenizer as much as possible

* better logger message

* fix vivit imports

* add uroman integration test

* make style

* up

* matthijs -> sanchit-gandhi

* fix tokenizer test

* make fix-copies

* fix dict comprehension

* fix config tests

* fix model tests

* make outputs consistent with reverse/not reverse

* fix key concat

* more model details

* add author

* return dict

* speaker error

* labels error

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/vits/convert_original_checkpoint.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* remove uromanize

* add docstrings

* add docstrings for tokenizer

* upper-case skip messages

* fix return dict

* style

* finish tests

* update checkpoints

* make style

* remove doctest file

* revert

* fix docstring

* fix tokenizer

* remove uroman integration test

* add sampling rate

* fix docs / docstrings

* style

* add sr to model output

* fix outputs

* style / copies

* fix docstring

* fix copies

* remove sr from model outputs

* Update utils/documentation_tests.txt

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* add sr as allowed attr

---------

Co-authored-by: sanchit-gandhi <sanchit@huggingface.co>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>