add VITS model #24085
Conversation
Notes about the tokenizer:
This is now ready for a first review. Two checkpoints are currently available. Note that the current model is the MMS-TTS version, not the original VITS version; the conversion scripts can handle both, but the tokenizer for the original VITS checkpoints is still missing. A small usage example is sketched just after this comment. Still needs to be done:
@Vaibhavs10 For this review, could you in particular verify that the layer names in the flow layers etc. make sense? Thanks!
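For reference, a minimal sketch of the kind of usage example mentioned above (the `facebook/mms-tts-eng` checkpoint name and the final class names reflect where the PR eventually ended up and are assumptions here, not necessarily the state at this point in the review):

```python
import torch
from transformers import VitsModel, VitsTokenizer

# Assumed checkpoint name; the checkpoints referenced at this stage of the PR may differ.
model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer("Hello, it is nice to meet you", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

waveform = outputs.waveform[0]  # 1-D tensor of audio samples
```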
Just reviewed the flow layer names for the coupling block and the duration predictor. Perfect! Fantastic from a code PoV. Mega job @hollance 🤗
Looks really great already @hollance - nice work on simplifying such a new model for the lib. The high-level API works really nicely and the code by-and-large reads well, mostly just nits on my part
Hey @ArthurZucker - wonder if you know a more elegant workaround for this tokenization issue?
I think that the vocab should be updated in the `convert_original_checkpoint` script, and just add the special tokens here.
I am not sure I understand the problem: they pad with 0, but 0 maps to `"k"`. So they pad with `"k"`?
Yes, they pad with token_id 0. The problem is that the tokenizer splits the input on the padding token and tokenizes these subsequences separately, so anywhere there is a `k` in the text it will disappear.
For example, `"It stinks here"` gets tokenized as `"It stin"` and `"s here"`, because `k` is not treated as a regular token but as padding.
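To make the failure mode concrete, a toy illustration in plain Python (string handling only, not the actual tokenizer code): splitting the input on the symbol that doubles as the padding token removes every occurrence of that symbol before the pieces are tokenized.

```python
# Toy illustration: mimic a tokenizer that pre-splits its input on the padding symbol.
text = "It stinks here"
pad_symbol = "k"  # token_id 0 maps to "k", and 0 is also used for padding

chunks = text.split(pad_symbol)
print(chunks)  # ['It stin', 's here'] -- the "k" is gone before tokenization even starts
```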
I'm not super familiar with the tokenization logic, so bear with me.
Am I right in understanding:
- The model is trained with `0` as a pad token
- The token_id `0` maps to `"k"` in the vocabulary
- The tokenizer first splits on the pad token, and then tokenizes
- Remapping the padding token to `"<pad>"` means we keep `"k"` in our input sequences
- We filter any token_ids corresponding to `"<pad>"` in our model? How is this done - setting them back to 0? With a mask?

Now, what happens if I have the input sentence "I like the letter k"? When `k` is mapped to token_id 0, as the model was trained with this as a pad token, will it interpret `k` as padding?
I think @amyeroberts raised very valid questions.
- if they pad with `k`, then `k` is the padding token, no?
- if `<pad>` and `k` both point to `0`, we don't really have a way in `transformers` to make sure this can be done. But I don't see why it should be done, since they pad with `k`!
- if the problem is that our `tokenize` method splits on the special tokens, including padding, but you don't want this to happen, then you can just overwrite `tokenize`. Given that this model does not seem to use `bos` nor `eos`, you can probably ignore the `lstrip` and `rstrip` logics, and just take what you have in `_tokenize`.
Now answering @ArthurZucker's points!
> if they pad with `k`, then `k` is the padding token, no?

Explained above - they pad with the 0 token, which also corresponds to `k` in the vocab.

> if `<pad>` and `k` both point to `0`, we don't really have a way in `transformers` to make sure this can be done

It is indeed the case that both `<pad>` and `k` point to 0 - hence the hackiness in the current tokenizer file to make this work, since there isn't a natural way of doing this in `transformers`.
Probably then we have to keep a hacky workaround like this to make it work?
Wondering if you had any thoughts here @ArthurZucker @amyeroberts on the conflicting `<pad>`/`k` tokens both being mapped to 0? I explained a bit above why this works.
Can't we just set the padding token to `k`?
Ah, we can't otherwise... EDIT: actually, when you decode, `k` will be skipped when decoding without special tokens, and issues can arise. My bad, I'll think about it, but a hack is most probably needed: `k` will always be skipped if you set a padding token that maps to 0. Which is why I don't really think we should look into this too much. Either we keep a very simple hack, where in the vocab we map both `k` and `<pad>` to 0 and just set the padding token to `<pad>` (this will work without further hacks, and let's not check whether it is in the vocab, because if someone wants to change it by adding a new token it will not be in the vocab), or we just set the padding token to `k`.
Gone with this proposed 'hack' - we force the pad token at word boundaries, and then encode any `k`'s to their appropriate ids. Note also that we are unlikely to ever need to decode input_ids with a TTS model - we put the input_ids into the model and get speech out.
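For context, the original VITS preprocessing achieves something similar by interspersing the blank/pad id 0 between the real token ids; a minimal sketch of that trick (an illustrative helper, not this PR's tokenizer code, which works at word boundaries as described above):

```python
# Illustrative helper: insert the blank/pad id between and around every real token id,
# so padding ids appear only where the preprocessing put them, never from real text.
def intersperse(token_ids, blank_id=0):
    result = [blank_id] * (len(token_ids) * 2 + 1)
    result[1::2] = token_ids
    return result

print(intersperse([19, 5, 30]))  # [0, 19, 0, 5, 0, 30, 0]
```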
Just a small review since @sanchit-gandhi left lots of comments. Feel free to ping me again!
(For next review let's get rid of single letter variables)
I think we can split this up by defining an intermediate variable like `input_derivatives + input_derivatives_plus_one - 2 * input_delta`, which is reused twice, and the same for `(inputs - input_cumheights)`. Also let's remove the single-letter variables to make the math happening here more understandable! 😉
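A hedged illustration of the kind of refactor meant here, with dummy tensors standing in for the spline quantities (the variable names are mine, not the PR's code):

```python
import torch

# Dummy stand-ins for the spline quantities referenced above.
inputs = torch.rand(8)
input_cumheights = torch.rand(8)
input_delta = torch.rand(8)
input_derivatives = torch.rand(8)
input_derivatives_plus_one = torch.rand(8)

# Name each repeated sub-expression once, then reuse it wherever the spline math
# needs it instead of spelling the expression out twice.
offset = inputs - input_cumheights
derivative_sum_term = input_derivatives + input_derivatives_plus_one - 2 * input_delta

print(offset.shape, derivative_sum_term.shape)
```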
Some of the MMS-TTS checkpoints require the use of the `uroman` tool to romanize the input text before it is passed to the tokenizer.
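A hedged sketch of how such a romanization step could be wired in (the `uroman.pl` entry point, its location, and its stdin/stdout behaviour are assumptions about the external tool, not code from this PR):

```python
import subprocess

def romanize(text: str, uroman_path: str = "uroman/bin/uroman.pl") -> str:
    # Assumption: uroman.pl reads raw text on stdin and writes romanized text to stdout.
    result = subprocess.run(
        ["perl", uroman_path],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

# The romanized string would then be tokenized as usual.
```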
I'm not too sure why I'm asked for a review here as all comments from @sanchit-gandhi are being ignored. The goal of integrating a model into Transformers is not to copy the original code without adapting it.
Please let's have all comments addressed before asking for the next round of reviews.
No they aren't?! I've integrated most of his suggestions and replied with counterarguments otherwise.
Just a few more variable names to go - I've commented on the design decisions from the last review directly inline above.
LGTM @hollance - last question from my side is whether we've pinpointed the changes between the original tokenizer and the MMS tokenizer? If they're subtle enough that we can use one tokenizer for both models, we can rename the tokenizer from `VitsMmsTokenizer` to `VitsTokenizer` (as suggested by @sgugger) and include a flag to toggle between the two padding behaviours?
Tokenizer can now handle both the original VITS models (which require phonemization) and the MMS-TTS models.
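As a rough illustration of the phonemization step the original VITS checkpoints rely on, assuming the `phonemizer` package with an espeak backend as in the original VITS recipe (not this PR's exact tokenizer code):

```python
# Requires the phonemizer package and an espeak / espeak-ng installation.
from phonemizer import phonemize

text = "Hello, it is nice to meet you"
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)  # IPA-style phoneme string that the tokenizer would then map to ids
```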
All points addressed, so this is ready for a final review @amyeroberts. Thanks for your in-depth reviews here - the PR looks to be in pretty good shape!
Amazing work - thanks for iterating on this and adding this model!
Only thing that needs to be addressed before merging are the slow tests to reflect tokenizer changes. Once that's done we're good to merge! 🥳
Hey @amyeroberts - to have full compatibility with the text-to-audio pipeline class, we need to indicate the sampling rate of the generated audio.
Note that we cannot simply add the sampling rate as an arbitrary config attribute, since the config attribute checker flags attributes it does not know about (see the follow-up below).
cc @ylacombe
As discussed offline with @amyeroberts, we'll add it as an allowed attribute in the config checker: 8b01633
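A hedged illustration of how downstream code could then read the rate (the checkpoint name and the 16 kHz value are assumptions):

```python
from transformers import VitsModel

model = VitsModel.from_pretrained("facebook/mms-tts-eng")

# The text-to-audio pipeline can pick the output rate up straight from the config.
print(model.config.sampling_rate)  # e.g. 16000 for the MMS-TTS checkpoints
```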
Hi @hollance, thank you for adding this model into `transformers`. There is a failing test:
`python3 -m pytest -v tests/models/vits/test_modeling_vits.py::VitsModelTest::test_initialization`
I have skipped it for now. If you decide to take a look, you will have to remove the skip line so the test will be collected and run by pytest.
Let me know :-) Thank you!
* add VITS model
* let's vits
* finish TextEncoder (mostly)
* rename VITS to Vits
* add StochasticDurationPredictor
* ads flow model
* add generator
* correctly set vocab size
* add tokenizer
* remove processor & feature extractor
* add PosteriorEncoder
* add missing weights to SDP
* also convert LJSpeech and VCTK checkpoints
* add training stuff in forward
* add placeholder tests for tokenizer
* add placeholder tests for model
* starting cleanup
* let the great renaming begin!
* use config
* global_conditioning
* more cleaning
* renaming variables
* more renaming
* more renaming
* it never ends
* reticulating the splines
* more renaming
* HiFi-GAN
* doc strings for main model
* fixup
* fix-copies
* don't make it a PreTrainedModel
* fixup
* rename config options
* remove training logic from forward pass
* simplify relative position
* use actual checkpoint
* style
* PR review fixes
* more review changes
* fixup
* more unit tests
* fixup
* fix doc test
* add integration test
* improve tokenizer tests
* add tokenizer integration test
* fix tests on GPU (gave OOM)
* conversion script can handle repos from hub
* add conversion script for all MMS-TTS checkpoints
* automatically create a README for the converted checkpoint
* small changes to config
* push README to hub
* only show uroman note for checkpoints that need it
* remove conversion script because code formatting breaks the readme
* make WaveNet layers configurable
* rename variables
* simplifying the math
* output attentions and hidden states
* remove VitsFlip in flow model
* also got rid of the other flip
* fix tests
* rename more variables
* rename tokenizer, add phonemization
* raise error when phonemizer missing
* re-order config docstrings to match method
* change config naming
* remove redundant str -> list
* fix copyright: vits authors -> kakao enterprise
* (mean, log_variances) -> (prior_mean, prior_log_variances)
* if return dict -> if not return dict
* speed -> speaking rate
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* update fused tanh sigmoid
* reduce dims in tester
* audio -> output_values
* audio -> output_values in tuple out
* fix return type
* fix return type
* make _unconstrained_rational_quadratic_spline a function
* all nn's to accept a config
* add spectro to output
* move {speaking rate, noise scale, noise scale duration} to config
* path -> attn_path
* idxs -> valid idxs -> padded idxs
* output values -> waveform
* use config for attention
* make generation work
* harden integration test
* add spectrogram to dict output
* tokenizer refactor
* make style
* remove 'fake' padding token
* harden tokenizer tests
* ron norm test
* fprop / save tests deterministic
* move uroman to tokenizer as much as possible
* better logger message
* fix vivit imports
* add uroman integration test
* make style
* up
* matthijs -> sanchit-gandhi
* fix tokenizer test
* make fix-copies
* fix dict comprehension
* fix config tests
* fix model tests
* make outputs consistent with reverse/not reverse
* fix key concat
* more model details
* add author
* return dict
* speaker error
* labels error
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/vits/convert_original_checkpoint.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* remove uromanize
* add docstrings
* add docstrings for tokenizer
* upper-case skip messages
* fix return dict
* style
* finish tests
* update checkpoints
* make style
* remove doctest file
* revert
* fix docstring
* fix tokenizer
* remove uroman integration test
* add sampling rate
* fix docs / docstrings
* style
* add sr to model output
* fix outputs
* style / copies
* fix docstring
* fix copies
* remove sr from model outputs
* Update utils/documentation_tests.txt
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* add sr as allowed attr
---------
Co-authored-by: sanchit-gandhi <sanchit@huggingface.co>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
Adds the VITS model for text-to-speech, in particular to support the MMS-TTS checkpoints (which use the same model architecture but a different tokenizer).
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.