Add MusicGen Melody #28819
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Looks in pretty good shape! Just some comments about the main docs, as well as argument naming for the conditional hidden-states. While this model is non-standard in some regards, it would be good to try and standardise it as much as possible with MusicGen and the general Transformers API, so as to ensure users can run this model as they expect from the Transformers library.
[Hugging Face Hub](https://huggingface.co/models?sort=downloads&search=facebook%2Fmusicgen).

## Difference with [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen)
Thanks for the suggestion, I don't see your example though!
The audio prompt should ideally be free of the low-frequency signals usually produced by instruments such as drums and bass. The [Demucs](https://github.com/adefossez/demucs/tree/main) model can be used to separate vocals and other signals from the drums and bass components.

If you wish to use Demucs, you first need to follow the installation steps [here](https://github.com/adefossez/demucs/tree/main?tab=readme-ov-file#for-musicians) before using the following snippet:
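The snippet itself did not survive the export of this thread; below is a minimal sketch of the pattern being described, assuming a `waveform` tensor, its `sample_rate`, and the `htdemucs` checkpoint (not necessarily the exact code from the PR):

```python
import torch
from demucs import pretrained
from demucs.apply import apply_model
from demucs.audio import convert_audio

# Load the pretrained Hybrid Transformer Demucs separator.
demucs = pretrained.get_model("htdemucs")

# Assumed inputs: a (channels, time) float waveform and its sampling rate.
waveform = torch.randn(2, 10 * 44100)
sample_rate = 44100

# Resample/remix to the rate and channel count Demucs expects.
wav = convert_audio(waveform, sample_rate, demucs.samplerate, demucs.audio_channels)

# Separate into stems: the output has shape (batch, sources, channels, time),
# with stems ordered as in `demucs.sources` (drums, bass, other, vocals for htdemucs).
stems = apply_model(demucs, wav[None])

# Keep everything except the low-frequency drums and bass stems as the melody prompt.
keep = [i for i, name in enumerate(demucs.sources) if name not in ("drums", "bass")]
melody = stems[:, keep].sum(dim=1)
```

The resulting `melody` waveform (vocals + other, minus drums and bass) would then be passed to the feature extractor as the audio prompt.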
Note to reviewer: as detailed, the Demucs model is required as a pre-processing step for audio-conditioned generation. There are two options for using the Demucs model here:
- Leverage the original implementation, which has 7.4k GH stars, a huge existing user base and an intuitive API (as shown in the snippet above). The requirements for the package are significant, but install reliably.
- Convert the Demucs model to Transformers, for example as started in this PR. Note that adding Demucs to the library would only really benefit the MusicGen Melody integration: it's unlikely to be used standalone by Demucs users (since they already have the original implementation installed) and we don't plan on adding fine-tuning support.

=> Given how easy it is to use the original implementation, and the general shift in mindset towards supporting open-source libraries (rather than competing with them), I would be in favour of going with option 1. In doing so, we can build a tighter integration between the Demucs library and the HF Hub, rather than blindly integrating it into Transformers for a single model integration.
From speaking to @patrickvonplaten, having users perform pre-processing outside of a Diffusers pipeline has worked better than integrating it into the pipeline.
E.g. for ControlNet, having users perform the pre-processing with the opencv-python library worked better than integrating it into the pipeline: https://huggingface.co/docs/diffusers/en/using-diffusers/controlnet#text-to-image
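For illustration, the pattern from those docs looks roughly like this (the file name is a placeholder; the `cv2.Canny` thresholds are the commonly used example values):

```python
import cv2
import numpy as np
from PIL import Image

# User-side pre-processing, outside the Diffusers pipeline: build the Canny
# edge map that the ControlNet consumes as its conditioning image.
image = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))
```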
Makes sense!
The only "issue" with tight integration is maintenance. Some tokenizers natively support other libraries (like moses) if available.
Doing it outside sounds good to me!
```python
        return audio_values

    def get_unconditional_inputs(self, num_samples=1, return_tensors="pt"):
        """# TODO: update,
```
This will give the same behaviour as:

```python
    def get_unconditional_inputs(self, num_samples=1):
```

(it doesn't matter what the input ids are: since we mask them anyway, the model won't attend to them)

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
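To make the point concrete, a minimal sketch of the behaviour being described (not the PR's actual implementation):

```python
import torch

def get_unconditional_inputs(self, num_samples=1):
    # The actual input_ids values are irrelevant: with an all-zero attention
    # mask, the text-encoder states are fully masked out, so the decoder
    # never attends to them.
    input_ids = torch.zeros((num_samples, 1), dtype=torch.long)
    attention_mask = torch.zeros((num_samples, 1), dtype=torch.long)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```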
@ylacombe at a quick glance, the […]
Thanks for adding this model!
Mostly just some small nits from me. Overall a really nice PR - the effort put into the tests and model page in particular really stands out.
Overall comment: I realise this is in part due to fitting the model into the transformers library and copying from MusicGen, but this was really quite a difficult PR to review. There's loads of repeated logic in `generate` and the generate-related methods, and the forward and generate passes are huge. I'd like to request a follow-up PR where a lot of this is abstracted out into smaller, more modular methods.
Only other comment is the compatibility with the rest of our generate library and selecting `greedy` versus `sample` behaviours, as well as logit processors. cc @gante to confirm this is OK.
```diff
@@ -755,3 +761,57 @@ def test_amplitude_to_db(self):
         amplitude_to_db(spectrogram, min_value=0.0)
         with pytest.raises(ValueError):
             amplitude_to_db(spectrogram, db_range=-80)
+
+    @require_librosa
+    def test_chroma_equivalence(self):
```
Nice test :)
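For context, the equivalence being tested is roughly the following (the helper and argument names are assumptions based on this PR's `chroma_filter_bank` addition to `audio_utils`):

```python
import librosa
import numpy as np
from transformers.audio_utils import chroma_filter_bank

# The chroma filter bank added in this PR should match librosa's reference
# implementation; num_frequency_bins=1025 corresponds to n_fft=2048. Note that
# Transformers filter banks are transposed relative to librosa's, hence the .T.
ours = chroma_filter_bank(num_frequency_bins=1025, num_chroma=12, sampling_rate=32000, tuning=0.0)
reference = librosa.filters.chroma(sr=32000, n_fft=2048, n_chroma=12, tuning=0.0)
np.testing.assert_allclose(ours, reference.T, atol=1e-8)
```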
```python
        # Initialize projection layers weights and tie text encoder and decoder weights if set accordingly
        self.post_init()

    def _init_weights(self, module):
```
Could you expand on this a bit? It shouldn't be necessary to add this here.
`MusicgenMelodyForConditionalGeneration` doesn't inherit from `MusicgenMelodyPreTrainedModel`, as that raises issues with the offloading features (probably because `self.decoder` also inherits from `MusicgenMelodyPreTrainedModel`).
So we need to initialize the weights that are not yet initialized, namely `enc_to_dec_proj` and `audio_enc_to_dec_proj`.
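A sketch of what such an `_init_weights` could look like under the usual Transformers conventions (taking `initializer_factor` from the decoder config is an assumption, not the PR's exact code):

```python
import torch.nn as nn

def _init_weights(self, module):
    # Only the projection layers owned by the composite model need handling
    # here; the text encoder, audio encoder and decoder initialize their own
    # weights through their respective PreTrainedModel classes.
    std = self.decoder.config.initializer_factor
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.bias is not None:
            module.bias.data.zero_()
```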
```python
    # skip as this model doesn't support all arguments tested
    def test_model_outputs_equivalence(self):
        pass

    # skip as this model has multiple inputs embeds and lm heads that should not be tied
    def test_tie_model_weights(self):
        pass

    # skip as this model has multiple inputs embeds and lm heads that should not be tied
    def test_tied_weights_keys(self):
        pass
```
All skipped tests should be skipped with `unittest.skip`.
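i.e. the usual pattern, so the skip reason shows up in the test report (test and class names here just mirror the ones above):

```python
import unittest

class MusicgenMelodyDecoderTest(unittest.TestCase):
    @unittest.skip(reason="MusicgenMelodyModel doesn't support all of the arguments tested")
    def test_model_outputs_equivalence(self):
        pass
```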
➕
I'll add them for `MusicgenMelodyDecoderTest` but will leave `MusicgenMelodyTest`'s ones, as they are almost all copied from `MusicgenTest` and I'd rather keep traceability over `unittest.skip`.
```python
@torch.no_grad()
def convert_musicgen_melody_checkpoint(checkpoint, pytorch_dump_folder=None, repo_id=None, device="cpu"):
```
It would be good to have some validation of the outputs between the original and converted model.
(I usually ask to make it optional, as it only holds for the pretrained model, and our integration tests should be where we make sure the converted checkpoint works!)
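For example, a hypothetical optional check in the conversion script (the original model's call signature is assumed for illustration):

```python
import torch

def check_conversion(original_model, hf_model, atol=1e-5):
    # Run both models on the same dummy code ids and compare logits; this only
    # holds for the pretrained checkpoint, hence keeping the check optional.
    input_ids = torch.arange(8, dtype=torch.long).reshape(2, 4)
    with torch.no_grad():
        original_logits = original_model(input_ids)      # assumed interface
        converted_logits = hf_model(input_ids).logits
    if not torch.allclose(original_logits, converted_logits, atol=atol):
        raise ValueError("Original and converted model logits do not match")
```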
```python
        logits_processor = LogitsProcessorList()
        return process_kwargs, logits_processor

    def test_greedy_generate_dict_outputs(self):
```
How long do all these generate tests take to run? We might want to decorate them with `@slow`.
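i.e. with the standard marker, so they only run on the scheduled CI (the class name here is illustrative):

```python
import unittest

from transformers.testing_utils import slow

class MusicgenMelodyIntegrationTests(unittest.TestCase):
    # @slow tests are skipped on the regular CI and only run on nightly/scheduled jobs.
    @slow
    def test_greedy_generate_dict_outputs(self):
        ...
```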
-->

# MusicGen Melody
Amazing model page 😍
@amyeroberts 100% agreed! However, I believe the ball is mostly on the […]. The MusicGen PR had the exact same pattern :)
Great work, leveraging all the available tools + outside libs! 🤗
I feel like we are giving a bit too much freedom / too many tools, while for maintenance, only supporting one model init would be better: less logic to handle + fewer things to test.

- is the final checkpoint going to be hosted at `ylacombe/musicgen-melody`? Would be nice to have it on either a Meta or a community-based repo (`facebook/musicgen-melody-hf`?)
- might be nice to split the complex generation logic into a separate file, like Whisper? Keeping the modeling file for the model
- small nits in the doc can be ignored.
````python
        >>> model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")
        ```"""

        # At the moment fast initialization is not supported for composite models
````
Any idea why? Llava works nicely with this I believe, so I'm surprised we have a problem here.
I essentially copied the logic from MusicGen, but when removing this snippet, all tests pass. I'll remove it.
cc @sanchit-gandhi, any reason for this?
Should be fine to remove - was copied from an older model (encoder-decoder) into MusicGen
```python
    # skip as this model doesn't support all arguments tested
    def test_model_outputs_equivalence(self):
        pass

    # skip as this model has multiple inputs embeds and lm heads that should not be tied
    def test_tie_model_weights(self):
        pass

    # skip as this model has multiple inputs embeds and lm heads that should not be tied
    def test_tied_weights_keys(self):
        pass
```
➕
```python
>>> model = MusicgenMelodyForConditionalGeneration.from_sub_models_pretrained(
...     text_encoder_pretrained_model_name_or_path="t5-base",
...     audio_encoder_pretrained_model_name_or_path="facebook/encodec_24khz",
...     decoder_pretrained_model_name_or_path="facebook/musicgen-melody",
... )
```
I'm wondering if this is something we want to go with, vs. just supporting init from the three sub-models and leaving it to the user to call `from_pretrained`. Feels like we are giving a lot of tools.
It's in theory possible to use other text encoders and audio encoders. It also allows the flexibility to go back and forth from the CausalLM class if training the decoder only.
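For instance, a sketch of that round trip (the paths are illustrative, not real checkpoints):

```python
from transformers import MusicgenMelodyForConditionalGeneration

# Pull the decoder out of the composite model to fine-tune it standalone...
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")
decoder = model.decoder  # a MusicgenMelodyForCausalLM
# ... fine-tune `decoder` ...
decoder.save_pretrained("./finetuned-melody-decoder")

# ...then re-assemble the composite model around the fine-tuned decoder.
model = MusicgenMelodyForConditionalGeneration.from_sub_models_pretrained(
    text_encoder_pretrained_model_name_or_path="t5-base",
    audio_encoder_pretrained_model_name_or_path="facebook/encodec_24khz",
    decoder_pretrained_model_name_or_path="./finetuned-melody-decoder",
)
```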
This was discussed quite extensively for MusicGen (slack thread), where we unanimously decided to keep the functionality to initialise the composite model from three separate sub-models.
I would be strongly in favour of maintaining consistency with the MusicGen design here, rather than adding a new design for this spin-off model.
Alright. For the next model I would drive this based on usage, i.e. have we seen issues / people using this way of initializing or not! 🤗
```python
        if self.text_encoder.config.to_dict() != self.config.text_encoder.to_dict():
            logger.warning(
                f"Config of the text_encoder: {self.text_encoder.__class__} is overwritten by shared text_encoder config:"
                f" {self.config.text_encoder}"
            )
        if self.audio_encoder.config.to_dict() != self.config.audio_encoder.to_dict():
            logger.warning(
                f"Config of the audio_encoder: {self.audio_encoder.__class__} is overwritten by shared audio_encoder config:"
                f" {self.config.audio_encoder}"
            )
        if self.decoder.config.to_dict() != self.config.decoder.to_dict():
            logger.warning(
                f"Config of the decoder: {self.decoder.__class__} is overwritten by shared decoder config:"
                f" {self.config.decoder}"
            )
```
That's expected behaviour - not sure we have to warn?
I mean let's not warn here
I think it makes sense to have a separate file for the generation part, like we do for Whisper, no?
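For reference, Whisper keeps its generation logic in a dedicated `generation_whisper.py` mixin; a hypothetical equivalent layout here would be:

```python
# src/transformers/models/musicgen_melody/generation_musicgen_melody.py (hypothetical)

class MusicgenMelodyGenerationMixin:
    """Would hold the custom generate() and sampling logic currently
    living in modeling_musicgen_melody.py."""

    def generate(self, inputs=None, **kwargs):
        ...

# modeling_musicgen_melody.py would then only define the architecture:
# class MusicgenMelodyForConditionalGeneration(MusicgenMelodyGenerationMixin, PreTrainedModel):
#     ...
```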
This could make the modelling + generation code a lot cleaner for the MusicGen series! Although long-term, the issue would be fully resolved by a refactor of `generate` to make it more composable for audio models (as suggested by @gante).
How about we do this as a follow-up PR for MusicGen + MusicGen Melody? (so as not to mix two features into one PR)
Agreed
* first modeling code
* make repository
* still WIP
* update model
* add tests
* add latest change
* clean docstrings and copied from
* update docstrings md and readme
* correct chroma function
* correct copied from and remove unreleated test
* add doc to toctree
* correct imports
* add convert script to notdoctested
* Add suggestion from Sanchit (Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>)
* correct get_uncoditional_inputs docstrings
* modify README according to SANCHIT feedback
* add chroma to audio utils
* clean librosa and torchaudio hard dependencies
* fix FE
* refactor audio decoder -> audio encoder for consistency with previous musicgen
* refactor conditional -> encoder
* modify sampling rate logics
* modify license at the beginning
* refactor all_self_attns->all_attentions
* remove ignore copy from causallm generate
* add copied from for from_sub_models
* fix make copies
* add warning if audio is truncated
* add copied from where relevant
* remove artefact
* fix convert script
* fix torchaudio and FE
* modify chroma method according to feedback -> better naming
* refactor input_values->input_features
* refactor input_values->input_features and fix import fe
* add input_features to docstrigs
* correct inputs_embeds logics
* remove dtype conversion
* refactor _prepare_conditional_hidden_states_kwargs_for_generation -> _prepare_encoder_hidden_states_kwargs_for_generation
* change warning for chroma length
* Update src/transformers/models/musicgen_melody/convert_musicgen_melody_transformers.py (Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>)
* change way to save wav, using soundfile
* correct docs and change to soundfile
* fix import
* fix init proj layers
* remove line breaks from md
* fix issue with docstrings
* add FE suggestions
* improve is in logics and remove useless imports
* remove custom from_pretrained
* simplify docstring code
* add suggestions for modeling tests
* make style
* update converting script with sanity check
* remove encoder attention mask from conditional generation
* replace musicgen melody checkpoints with official orga
* rename ylacombe->facebook in checkpoints
* fix copies
* remove unecessary warning
* add shape in code docstrings
* add files to slow doc tests
* fix md bug and add md to not_tested
* make fix-copies
* fix hidden states test and batching

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
What does this PR do?
MusicGen Melody was released at the same time as the "o.g." MusicGen that has already been integrated into `transformers`. Contrary to the already integrated model, you can condition the generation with an audio prompt (instead of continuing the audio prompt).

Main conceptual difference -> we no longer use cross-attention to condition the generation on the text/audio prompt; instead, we concatenate the text/audio prompt to the decoder hidden states (see the sketch after the list below).
This makes the model a bit simpler, since it's no longer a "proper" encoder-decoder architecture but a decoder-only model that can be conditioned (a bit like Fuyu).

Note that there are 3 key "modalities":
-> the prompt text, which is passed through a text encoder model.
-> the audio prompt, which is processed by the feature extractor to give a chromagram.
-> the musicgen decoder, which generates Encodec codes.
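As a sketch of the conditioning scheme described above (all shapes are made up for illustration, not taken from the PR):

```python
import torch

batch, hidden = 2, 1024
text_states = torch.randn(batch, 12, hidden)     # projected text-encoder outputs
chroma_states = torch.randn(batch, 235, hidden)  # projected chromagram features
decoder_embeds = torch.randn(batch, 50, hidden)  # decoder input embeddings

# Instead of cross-attending to the prompts, the decoder self-attends over the
# conditioning states prepended along the sequence dimension.
hidden_states = torch.cat([text_states, chroma_states, decoder_embeds], dim=1)
print(hidden_states.shape)  # torch.Size([2, 297, 1024])
```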
Why is this model interesting?
cc @sanchit-gandhi