Add bark #24086
Conversation
The documentation is not available anymore as the PR was closed or merged.
PR supersedes #23375
sanchit-gandhi
left a comment
Great start @ylacombe - I've left a lot of comments about design, which you'll pick up through experience. My main request is to change the design so that we define a base model, and then three models on top of that with LM heads. Each of these models should have a generate method tied to it that handles its section of generation. The composite model can then call each of these sub-methods individually.
I didn't look too deep into the last generation loop since we're still figuring out the design there - I can take a look once we're a bit more certain!
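To illustrate the suggested split, here is a minimal sketch (class and attribute names are illustrative placeholders, not the final transformers API): each sub-model owns its own generate step, and the composite model simply chains them.

import torch
from torch import nn


class _SubModel(nn.Module):
    """Stand-in for a sub-model that owns its own generation step."""

    def generate(self, inputs, **kwargs):
        # A real sub-model would run its own decoding loop here (autoregressive for
        # the semantic/coarse models, iterative for the fine model); the stub just
        # passes the tensor through.
        return inputs


class BarkLikeComposite(nn.Module):
    def __init__(self):
        super().__init__()
        self.semantic = _SubModel()
        self.coarse_acoustics = _SubModel()
        self.fine_acoustics = _SubModel()

    @torch.no_grad()
    def generate(self, input_ids, **kwargs):
        # Each stage consumes the previous stage's output.
        semantic_tokens = self.semantic.generate(input_ids, **kwargs)
        coarse_tokens = self.coarse_acoustics.generate(semantic_tokens, **kwargs)
        fine_tokens = self.fine_acoustics.generate(coarse_tokens, **kwargs)
        return fine_tokens  # would then be decoded to audio with EnCodec


codes = BarkLikeComposite().generate(torch.zeros(1, 10, dtype=torch.long))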
(this will also need to be suno/bark-small for the final PR)
Possibly, but since we don't return input_embeds to the user, it's okay if we change how this is computed in the future (there won't be a breaking change for the user, as the shape of input_embeds is the same in both cases; it'll just be marginally more expensive to compute it over all the codebooks).
I would advocate for doing it the cheaper way for now, since we can always update it later to sum over all codebooks and then just slice as many as we need.
The cheaper way being:
input_embeds = sum([self.transformer.wtes[i](input_ids[:, :, i, None]) for i in range(pred_idx + 1)])
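For context, a standalone toy version of that summation (the dimensions, module names and vocabulary size are made up for the example, and the extra None index from the snippet is dropped to keep the shapes simple):

import torch
from torch import nn

n_codebooks, vocab_size, hidden_size = 8, 1024, 16
wtes = nn.ModuleList(nn.Embedding(vocab_size, hidden_size) for _ in range(n_codebooks))

input_ids = torch.randint(vocab_size, (2, 5, n_codebooks))  # (batch, seq_len, codebooks)
pred_idx = 2  # index of the codebook currently being predicted

# Embed only the codebooks predicted so far and sum their embeddings.
input_embeds = sum(wtes[i](input_ids[:, :, i]) for i in range(pred_idx + 1))
print(input_embeds.shape)  # torch.Size([2, 5, 16])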
sanchit-gandhi
left a comment
Reviewing in two parts - GH won't let me continue my current review
Looking good! The modelling code is pretty much there. Some small changes are needed for generation, but it's largely good too. Let's discuss the processor design offline on Monday and find a nice design for loading the voice presets - the actual call of the processor is clean, I just wonder if there's a better loading mechanism we can employ when we initialise or call from_pretrained.
Tests by and large look good too! I think we can spend a bit of time on the docs to really showcase how to use the model effectively - all the tips can be accompanied by code snippets, and we should have a few more examples for generation, either in the docstrings or the docs themselves.
I think all of this still works if this method belongs to the text model rather than the composite model. We'll do all of the same pre-/post-processing, just under the text model rather than the composite one:
- The tokenization happens with the tokenizer, not the model -> nothing changes here
- The text_encoding_offset parameter can go in the config no problem; we'll do all of the same token id pre-processing, so this will work in the text model as well
- Same as above
- We'll do all of the post-processing as we are currently doing, so this will work in the text model as well
sanchit-gandhi
left a comment
Nice - mainly just nits for the processor. Let me know when you'd like a re-review on the generation loop!
sanchit-gandhi
left a comment
Could we clean up the TODOs before getting a final review please? Otherwise just styling comments
sanchit-gandhi
left a comment
Lots of TODOs and clean-up are still required - could you address these before we get a final review? Could you also make sure that any docstrings / comments remain within the 119-character line-width limit? Some of the comments run excessively far off the side of the page.
This is not a great docstring - we should describe what the logits processor does, rather than say it does 'Bark behaviour'. Also now that we're in the logits processor file, these logits processors should be general across models (as much as possible!)
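As an illustration of the kind of model-agnostic processor and docstring being asked for, here is a hedged sketch (the class name and behaviour are invented for the example, not the actual Bark processor):

import torch
from transformers import LogitsProcessor


class SuppressTokensAfterIndexLogitsProcessor(LogitsProcessor):
    r"""
    [`LogitsProcessor`] that sets the scores of `suppress_tokens` to `-inf` once the
    generated sequence is at least `begin_index` tokens long, so that those tokens can
    no longer be sampled from that position onwards.
    """

    def __init__(self, suppress_tokens, begin_index):
        self.suppress_tokens = list(suppress_tokens)
        self.begin_index = begin_index

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Only start suppressing once generation has reached begin_index.
        if input_ids.shape[-1] >= self.begin_index:
            scores[:, self.suppress_tokens] = -float("inf")
        return scores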
Hi @sanchit-gandhi, I think it's finally time for the final review! You might want to check the refactoring of […]
sanchit-gandhi
left a comment
LGTM! Thanks @ylacombe for iterating here and cleaning up the code. There are a few merge conflicts that need to be resolved (see the panel where the CI is displayed), but otherwise good to go here
Hi @amyeroberts, the PR is ready for review! I'd be delighted to get your feedback on this when you have a chance. Let me know if I can help with anything!
The original implementation of Bark uses Flash Attention. Should we include it side-by-side?
Not at the moment. We use BetterTransformer as a wrapper to add optimizations like flash attention to our models; however, this is currently only for decoder models: https://pytorch.org/blog/out-of-the-box-acceleration/
Strictly speaking, two of Bark's sub-models, namely BarkSemanticModel and BarkCoarseModel, are decoder models. Do you think I could/should adapt the code so that it works with BetterTransformer?
For reference, we're working on more seamless BetterTransformer support with @fxmarty in optimum: https://huggingface.co/docs/optimum/bettertransformer/overview
Since we're not natively supporting SDPA in transformers, we can open a PR to optimum/bettertransformer to add the Bark attention in optimum. We might also need to add some loading / exporting logic to make the three sub-models work, including the non-autoregressive fine model.
Looking at the source code for BetterTransformer in a bit more detail, there is an implementation of GPT Neo scaled-dot product attention (SDPA) that is quite close to what we have in Bark: https://github.com/huggingface/optimum/blob/2678e74df3b9ff020031831f93e7e343a2405a09/optimum/bettertransformer/models/attention.py#L93
We can adapt this to add the forward signature for the Bark attention, plus any other changes necessary to export it to BetterTransformer.
A user will then simply have to do:

from transformers import BarkModel
from optimum.bettertransformer import BetterTransformer

model = BarkModel.from_pretrained(...)
model = BetterTransformer.transform(model)

to export the model to BetterTransformer and use SDPA.
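For reference, a minimal sketch of the SDPA call that such an export swaps in (tensor shapes assumed to be (batch, heads, seq_len, head_dim); this is a simplification, not the optimum implementation):

import torch.nn.functional as F


def sdpa_attention(query, key, value, attention_mask=None, dropout_p=0.0, is_causal=False):
    # Delegates softmax(QK^T / sqrt(d)) @ V to PyTorch's fused kernel, which can
    # dispatch to a flash-attention implementation when the inputs allow it.
    return F.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask, dropout_p=dropout_p, is_causal=is_causal
    )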
I had to use pickle here, but I'm not sure about the safety of the method. I don't see another way to save/load a nested dictionary of np.ndarray tbh.
WDYT?
First thought when looking at this is that all of the speaker embeddings have to be downloaded at once. Is it likely that a user will use all of these presets in a single inference / training session?
I would instead structure it like so:
In the model's repo, we have a folder speaker_embeddings, similar to the one you currently have. Within this folder, we could do 1 of 2 things:
1. Save each of the arrays individually with a suffix, e.g. de_speaker_0_semantic_prompt.npy
2. Have a nested folder structure, i.e. MODEL_REPO_NAME/bark-xxx/speaker_embeddings/de_speaker_0/semantic_prompt.npy
Personally, I'd go for 1.
Then at the top level of the model repo, we have a file speaker_embedding_paths.json, which is a nested dict listing where all the files are:

{
    "de_speaker_0": {
        "semantic_prompt": "MODEL_REPO/bark-xxx/speaker_embeddings/de_speaker_0_semantic_prompt.npy",
        "coarse_prompt": "MODEL_REPO/bark-xxx/speaker_embeddings/de_speaker_0_coarse_prompt.npy",
        "fine_prompt": "MODEL_REPO/bark-xxx/speaker_embeddings/de_speaker_0_fine_prompt.npy"
    }
}

or possibly shorter paths with just the filename. I'd go for the full repo path, as that means others can point to and use the embeddings in other repos without having to copy them.
When someone requests a speaker embedding, we can first check whether it's already downloaded; if it is, load it from there, otherwise download it from the path specified in the speaker_embeddings dict.
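A minimal sketch of that lazy-loading idea, assuming option 1's file layout, a speaker_embedding_paths.json that stores paths relative to the repo root, and a hypothetical helper name (none of this is the final API):

import json

import numpy as np
from huggingface_hub import hf_hub_download

_CACHE = {}


def load_speaker_prompt(repo_id, voice_preset, prompt_key):
    """Fetch a single speaker-embedding array, reusing already-loaded arrays when possible."""
    if (voice_preset, prompt_key) in _CACHE:
        return _CACHE[(voice_preset, prompt_key)]

    # The JSON file maps each preset to the per-prompt .npy files inside the repo.
    paths_file = hf_hub_download(repo_id, "speaker_embedding_paths.json")
    with open(paths_file) as f:
        embedding_paths = json.load(f)

    # hf_hub_download only re-downloads when the file is not already in the local cache.
    array_file = hf_hub_download(repo_id, embedding_paths[voice_preset][prompt_key])
    array = np.load(array_file)
    _CACHE[(voice_preset, prompt_key)] = array
    return array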
Hi @amyeroberts,
I really like 1; I think it might be the best way to be both safe and consistent with the original Bark speaker_embeddings nested dictionary. I will do it in the next few hours. Thanks for your insights!
@ylacombe Great! Could you resolve the conflicts? Once that's done I'll review 👍
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Hi @amyeroberts and @sgugger! Many thanks for the additional review (and thanks @sanchit-gandhi for your insights)!
Hi @amyeroberts, to give a concrete example, you can now specify how many […]. Now that it is done, there are still a few comments to resolve, so I look forward to hearing from you!
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
…ineModel.test_generate_fp16
Hey @amyeroberts,
@ylacombe LGTM! I think we're good to merge 👍
This PR aims at integrating Bark, a TTS model, into transformers. Bark was designed and trained by the Suno-AI team and is made of 4 main components:
- semantic model (also named text model): a causal autoregressive transformer (GPT2-like), which takes a tokenized text as input.
- coarse acoustics model (also named coarse model): also a causal autoregressive transformer, taking as input the results of the previous model. It aims at regressing the first two audio codebooks necessary for encodec.
- fine acoustics model (fine model): this time a non-causal autoencoder transformer, which iteratively predicts the last 6 codebooks based on the sum of the previous codebooks' embeddings.
- encodec: Bark uses encodec to generate the output audio array.
Note that each of the first 3 modules can take optional conditional speaker embeddings, aiming at conditioning the output audio on specific preset voices.
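For completeness, a short usage sketch of the resulting pipeline (assuming the suno/bark-small checkpoint mentioned earlier in the review and the processor/model classes added by this PR):

import torch
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained("suno/bark-small")

# Optional voice preset, conditioning the three sub-models on a speaker embedding.
inputs = processor("Hello, my dog is cooler than you!", voice_preset="v2/en_speaker_6")

with torch.no_grad():
    audio_array = model.generate(**inputs)
audio_array = audio_array.cpu().numpy().squeeze()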