
Conversation

@hollance (Contributor) commented Jun 7, 2023

What does this PR do?

Adds the VITS model for text-to-speech, in particular to support the MMS-TTS checkpoints (which use the same model architecture but a different tokenizer).

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@hollance hollance force-pushed the vits branch 2 times, most recently from f049cd2 to 920d480 Compare June 13, 2023 10:09
@hollance (Contributor, Author)

Notes about the tokenizer:

  1. This is not the VITS tokenizer but the one for MMS-TTS.
  2. The vocab doesn't have padding (or unknown) tokens in it, but uses token_id 0 for this. That breaks with the HF tokenizers because they split the input text on the padding token, so if I set pad_token_id = 0, the letters that token_id 0 corresponds to will disappear from the text.
  3. To fix this issue, I'm adding <pad> and <unk> to the vocab, but then in the model we set such token_ids to 0 before feeding the input into the first layer. It's a bit hacky. Ideas for a nicer solution are appreciated.
  4. The tokenizer also inserts an additional token_id 0 in between every token. No idea why, but that's how it works (see the sketch below).
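
To make points 2-4 concrete, here is a rough sketch of the behaviour being described (a hypothetical helper, not the code in this PR; the vocab and helper names are made up):

# Sketch of the MMS-TTS encoding described above (illustration only).
# Assumes `vocab` maps characters to ids and that "<pad>" and "<unk>" were
# appended to the vocab as described in point 3.
def encode_mms_text(text, vocab, unk_token="<unk>"):
    ids = []
    for ch in text:
        ids.append(0)                                # point 4: token_id 0 interleaved between tokens
        ids.append(vocab.get(ch, vocab[unk_token]))
    ids.append(0)
    return ids

# Point 3: inside the model, ids pointing at the added <pad>/<unk> entries are
# set back to 0 before the input reaches the first (embedding) layer, e.g.
#     input_ids[input_ids >= original_vocab_size] = 0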

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@hollance hollance force-pushed the vits branch 2 times, most recently from f722786 to 64c62fc Compare June 15, 2023 15:06
@hollance hollance force-pushed the vits branch 2 times, most recently from d895880 to 0d4fd3b Compare June 22, 2023 12:02
@hollance (Contributor, Author) commented Jun 22, 2023

This is now ready for a first review.

Two checkpoints are currently available:

Small usage example:

from transformers import VitsMmsTokenizer, VitsModel
import torch

tokenizer = VitsMmsTokenizer.from_pretrained("Matthijs/mms-tts-eng")
model = VitsModel.from_pretrained("Matthijs/mms-tts-eng")

inputs = tokenizer(text="Hello, my dog is cute", return_tensors="pt")

outputs = model(inputs["input_ids"])
speech = outputs.audio
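
As a follow-up to the example (not part of this PR): assuming the 16 kHz sampling rate that the MMS-TTS checkpoints use (mentioned later in this thread), and assuming `speech` is a (batch, num_samples) float tensor, the waveform can be written to disk with something like:

import scipy.io.wavfile

# 16_000 Hz is an assumption for the MMS-TTS checkpoints
scipy.io.wavfile.write("speech.wav", rate=16_000, data=speech[0].detach().numpy())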

The current model is the MMS-TTS version, not the original VITS version. The conversion scripts can handle both, but for original VITS support the tokenizer is still missing.

Still needs to be done:

  • tests
  • tokenizer for actual VITS

@Vaibhavs10 For this review, in particular could you verify the names of the layers in the flow layers etc make sense? Thanks!

@Vaibhavs10 (Member) left a comment

Just reviewed the flow layer names for the coupling block and the duration predictor. Perfect! Fantastic from a code PoV. Mega job @hollance 🤗

@sanchit-gandhi (Contributor) left a comment

Looks really great already @hollance - nice work on simplifying such a new model for the lib. The high-level API works really nicely and the code by-and-large reads well, mostly just nits on my part

Contributor

Hey @ArthurZucker - wonder if you know a more elegant workaround for this tokenization issue?

Collaborator

I think that the vocab should be updated in convert_original_checkpoint, and just add the special tokens there.
I am not sure I understand the problem: they pad with 0, but 0 maps to "k". So they pad with "k"?

Contributor Author

Yes, they pad with token_id 0. The problem is that the tokenizer splits the input on the padding token and tokenizes these subsequences separately, so anywhere there is a k in the text it will disappear.

For example, "It stinks here" gets tokenized as "It stin" and "s here", because "k" is not treated as a regular token but as padding.
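
A quick way to see the effect (purely illustrative plain string splitting, not the actual tokenizer logic):

# If "k" is treated as the padding token, the tokenizer effectively splits on it
# and tokenizes the pieces separately, so the "k" itself never reaches the model.
text = "It stinks here"
print(text.split("k"))  # ['It stin', 's here']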

Contributor

I'm not super familiar with the tokenization logic, so bear with me.

Am I right in understanding:

  • The model is trained with 0 as a pad token
  • The token_id 0 maps to "k" in the vocabulary
  • The tokenizer first splits on the pad token, and then tokenizes
  • Remapping the padding token to "<pad>" means we keep "k" in our input sequences
  • We filter any token_ids corresponding to "<pad>" in our model? How is this done - setting them back to 0? With a mask?

Now, what happens if I have the input sentence "I like the letter k"? When k is mapped to token_id 0, as the model was trained with this as a pad token, will it interpret k as padding?

Collaborator

I think @amyeroberts raised very valid questions.

  • if they pad with k, then k is the padding token, no?
  • if <pad> and k both point to 0, we don't really have a way in transformers to make sure this can be done. But I don't see why it should be done, since they pad with k!
  • if the problem is that our tokenize method splits on the special tokens, including padding, but you don't want this to happen, then you can just override tokenize. Given that this model does not seem to use bos nor eos, you can probably ignore the lstrip and rstrip logic and just take what you have in _tokenize (a quick sketch of this is below).
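
A minimal sketch of that last suggestion (assuming a slow tokenizer subclass; this is an illustration of the idea, not the code in this PR):

from transformers import PreTrainedTokenizer

class VitsMmsTokenizer(PreTrainedTokenizer):
    # ... vocab handling omitted ...

    def tokenize(self, text, **kwargs):
        # Skip the base-class logic that splits the text on special tokens
        # (including the padding token) and go straight to the model-specific
        # character-level tokenization.
        return self._tokenize(text)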

@sanchit-gandhi (Contributor) commented Jul 10, 2023

Now answering @ArthurZucker's points!

if they pad with k, then k is the padding token, no?

Explained above - they pad with the 0 token, which also corresponds to k in the vocab

if <pad> and k both point to 0, we don't really have a way in transformers to make sure this can be done

It is indeed the case that both <pad> and k point to 0 - hence the hackiness with the current tokenizer file to make this work, since there isn't a natural way of doing this in transformers

So we probably have to keep a hacky workaround like this to make it work?

Contributor

Wondering if you had any thoughts here @ArthurZucker @amyeroberts on the conflicting <pad>/k tokens both being mapped to 0? It's explained a bit above why this works.

Collaborator

Can't we just set the padding token to k?

@ArthurZucker (Collaborator) commented Jul 13, 2023

Ah, we can't - otherwise k will be skipped when decoding without special tokens, and issues can arise. My bad, I'll think about it, but a hack is most probably needed. EDIT: actually, when you decode, k will always be skipped if you set a padding token that maps to 0, which is why I don't really think we should look into this too much. Either we keep a very simple hack, where in the vocab we make k and <pad> both map to 0 and just set the padding token to <pad> (this will work without further hacks - and let's not check whether it is in the vocab, because if someone wants to change it by adding a new token, it will not be in the vocab), or we just set the padding token to k.

Contributor

Gone with this proposed 'hack' - we force the pad token at word boundaries, and then encode any k's to their appropriate ids. Note also that we are unlikely to ever need to decode input_ids with a TTS model - we put the input_ids into the model, and get speech out.
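
Roughly, the adopted idea looks like this (an illustration with a made-up helper, not the final tokenizer code): both "k" and "<pad>" resolve to id 0 in the vocab, characters are encoded directly, and the pad token is inserted explicitly instead of going through the generic special-token splitting.

def encode_with_pad_hack(text, vocab, pad_token="<pad>"):
    # Illustration only: "k" encodes to 0 via the vocab, and the pad token
    # (which also maps to 0) is forced in at word boundaries.
    ids = []
    for word in text.split(" "):
        ids.extend(vocab[ch] for ch in word)   # assumes every character is in the vocab (no unk handling, for brevity)
        ids.append(vocab[pad_token])
    return ids[:-1]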

@ArthurZucker (Collaborator) left a comment

Just a small review since @sanchit-gandhi left lots of comments. Feel free to ping me again!
(For next review let's get rid of single letter variables)


Collaborator

I think we can split this up by defining intermediate variables, like input_derivatives + input_derivatives_plus_one - 2 * input_delta, which is reused twice, and the same for (inputs - input_cumheights). Also, let's remove the single-letter variables and make the math happening here more understandable! 😉
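
For example, something along these lines (a sketch based only on the expressions quoted above, not the exact code in the modelling file):

def _spline_intermediates(inputs, input_cumheights, input_delta,
                          input_derivatives, input_derivatives_plus_one):
    # Name the sub-expressions that the spline equations reuse, instead of
    # repeating them inline (variable names taken from the review comment).
    derivatives_minus_delta = input_derivatives + input_derivatives_plus_one - 2 * input_delta
    offset_heights = inputs - input_cumheights
    return derivatives_minus_delta, offset_heights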

@hollance hollance force-pushed the vits branch 2 times, most recently from 806afdf to b7d0fb9 Compare June 27, 2023 11:27
@hollance hollance changed the title [WIP] add VITS model add VITS model Jun 27, 2023
@hollance hollance marked this pull request as ready for review June 27, 2023 11:48
@hollance (Contributor, Author)

Some of the MMS-TTS checkpoints require the use of the uroman tool from https://github.com/isi-nlp/uroman to convert the input script into the Latin alphabet. Since this is a separate Perl script, it is not included in Transformers, and the user will have to run it themselves before using the tokenizer.
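
One possible way to perform that step from Python (a sketch; the script path is an assumption based on a local clone of the uroman repository, nothing here ships with Transformers):

import subprocess

def uromanize(text, uroman_path="uroman/bin/uroman.pl"):
    # Romanize the input text with the external uroman Perl script before tokenizing.
    result = subprocess.run(
        ["perl", uroman_path],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()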

@hollance hollance requested a review from sgugger June 27, 2023 12:39
@sgugger (Collaborator) left a comment

I'm not too sure why I'm asked for a review here as all comments from @sanchit-gandhi are being ignored. The goal of integrating a model into Transformers is not to copy the original code without adapting it.

Please let's have all comments addressed before asking for the next round of reviews.

@hollance (Contributor, Author)

I'm not too sure why I'm asked for a review here as all comments from @sanchit-gandhi are being ignored.

No they aren't?! I've integrated most of his suggestions and replied with counterarguments otherwise.

@sanchit-gandhi (Contributor) left a comment

Just a few more variable names to go - I commented on the design decisions from the last review directly inline above.

@sanchit-gandhi (Contributor) left a comment

LGTM @hollance - last question from my side is whether we've pinpointed the changes between the original tokenizer and MMS tokenizer? If they're subtle enough that we can use one tokenizer for both models, we can re-name the tokenizer from VitsMmsTokenizer to VitsTokenizer (as suggested by @sgugger) and include a flag to toggle between the two padding behaviours?

@hollance (Contributor, Author)

The tokenizer can now handle both the original VITS models (which require phonemization) and the MMS-TTS models.

@sanchit-gandhi (Contributor)

All points addressed, so this is ready for a final review @amyeroberts. Thanks for your in-depth reviews here - the PR looks to be in pretty good shape!

@amyeroberts (Contributor) left a comment

Amazing work - thanks for iterating on this and adding this model!

The only thing that needs to be addressed before merging is updating the slow tests to reflect the tokenizer changes. Once that's done we're good to merge! 🥳

@sanchit-gandhi (Contributor) commented Aug 31, 2023

Hey @amyeroberts - to have full compatibility with the text-to-audio pipeline class, we need to indicate the sampling_rate of the predicted audio waveforms in the model config:

sampling_rate = config.get(sampling_rate_name, None)

The sampling_rate corresponds to the sampling rate of the target audio that the model was trained on. It cannot be determined in any way other than from the value in the original config of the model. MMS-TTS models use a sampling rate of 16 kHz and VITS TTS models use a sampling rate of 22 kHz, but otherwise their configs are the same. The user needs to know the sampling rate the model generates at in order to know what rate to play the audio back at; otherwise this leaves them prone to silent errors. IMO adding it as an attribute of the main model class should suffice here:

self.sampling_rate = config.sampling_rate

Note that we cannot just add the sampling_rate to the config and not the modelling file; this is not allowed by the CI:
https://app.circleci.com/pipelines/github/huggingface/transformers/71815/workflows/c0[…]cb-a064-7c1019e03630/jobs/904373/parallel-runs/0/steps/0-116

cc @ylacombe
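
To make the failure mode concrete, here is a sketch (not the actual pipeline code) of the lookup described above and why a missing value would otherwise go unnoticed:

from transformers import VitsModel

model = VitsModel.from_pretrained("Matthijs/mms-tts-eng")

# Read the sampling rate from the model config if present; without it there is
# no reliable way to recover the correct playback rate.
sampling_rate = getattr(model.config, "sampling_rate", None)
if sampling_rate is None:
    raise ValueError(
        "The model config does not define `sampling_rate`; audio played back "
        "at a guessed rate would simply sound wrong, with no explicit error."
    )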

@sanchit-gandhi (Contributor)

As discussed offline with @amyeroberts, we'll add it as an allowed attribute in the config checker: 8b01633

@ydshieh (Collaborator) commented Sep 4, 2023

Hi @hollance, thank you for adding this model to transformers 🤗.

There is a failing test:

python3 -m pytest -v tests/models/vits/test_modeling_vits.py::VitsModelTest::test_initialization

I have skipped it on the main branch. Would you like to help us investigate this test if you have some bandwidth? Otherwise we can take it on our side too.

If you decide to take a look, you have to remove the following line

@unittest.skip("currently failing")

so the test will be collected and run by pytest.

Let me know :-) Thank you!

parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
* add VITS model

* let's vits

* finish TextEncoder (mostly)

* rename VITS to Vits

* add StochasticDurationPredictor

* ads flow model

* add generator

* correctly set vocab size

* add tokenizer

* remove processor & feature extractor

* add PosteriorEncoder

* add missing weights to SDP

* also convert LJSpeech and VCTK checkpoints

* add training stuff in forward

* add placeholder tests for tokenizer

* add placeholder tests for model

* starting cleanup

* let the great renaming begin!

* use config

* global_conditioning

* more cleaning

* renaming variables

* more renaming

* more renaming

* it never ends

* reticulating the splines

* more renaming

* HiFi-GAN

* doc strings for main model

* fixup

* fix-copies

* don't make it a PreTrainedModel

* fixup

* rename config options

* remove training logic from forward pass

* simplify relative position

* use actual checkpoint

* style

* PR review fixes

* more review changes

* fixup

* more unit tests

* fixup

* fix doc test

* add integration test

* improve tokenizer tests

* add tokenizer integration test

* fix tests on GPU (gave OOM)

* conversion script can handle repos from hub

* add conversion script for all MMS-TTS checkpoints

* automatically create a README for the converted checkpoint

* small changes to config

* push README to hub

* only show uroman note for checkpoints that need it

* remove conversion script because code formatting breaks the readme

* make WaveNet layers configurable

* rename variables

* simplifying the math

* output attentions and hidden states

* remove VitsFlip in flow model

* also got rid of the other flip

* fix tests

* rename more variables

* rename tokenizer, add phonemization

* raise error when phonemizer missing

* re-order config docstrings to match method

* change config naming

* remove redundant str -> list

* fix copyright: vits authors -> kakao enterprise

* (mean, log_variances) -> (prior_mean, prior_log_variances)

* if return dict -> if not return dict

* speed -> speaking rate

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* update fused tanh sigmoid

* reduce dims in tester

* audio -> output_values

* audio -> output_values in tuple out

* fix return type

* fix return type

* make _unconstrained_rational_quadratic_spline a function

* all nn's to accept a config

* add spectro to output

* move {speaking rate, noise scale, noise scale duration} to config

* path -> attn_path

* idxs -> valid idxs -> padded idxs

* output values -> waveform

* use config for attention

* make generation work

* harden integration test

* add spectrogram to dict output

* tokenizer refactor

* make style

* remove 'fake' padding token

* harden tokenizer tests

* ron norm test

* fprop / save tests deterministic

* move uroman to tokenizer as much as possible

* better logger message

* fix vivit imports

* add uroman integration test

* make style

* up

* matthijs -> sanchit-gandhi

* fix tokenizer test

* make fix-copies

* fix dict comprehension

* fix config tests

* fix model tests

* make outputs consistent with reverse/not reverse

* fix key concat

* more model details

* add author

* return dict

* speaker error

* labels error

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/vits/convert_original_checkpoint.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* remove uromanize

* add docstrings

* add docstrings for tokenizer

* upper-case skip messages

* fix return dict

* style

* finish tests

* update checkpoints

* make style

* remove doctest file

* revert

* fix docstring

* fix tokenizer

* remove uroman integration test

* add sampling rate

* fix docs / docstrings

* style

* add sr to model output

* fix outputs

* style / copies

* fix docstring

* fix copies

* remove sr from model outputs

* Update utils/documentation_tests.txt

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* add sr as allowed attr

---------

Co-authored-by: sanchit-gandhi <sanchit@huggingface.co>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>