
Add MusicGen Melody #28819

Merged · 72 commits · Mar 18, 2024
Conversation

Contributor

@ylacombe commented Feb 1, 2024

What does this PR do?

MusicGen Melody was released at the same time as the "o.g." MusicGen, which has already been integrated into Transformers.

Contrary to the already-integrated model, you can condition the generation with an audio prompt (instead of a continuation of the audio prompt).

The main conceptual difference is that we no longer use cross-attention to condition the generation on the text/audio prompt; instead, we concatenate the text/audio prompt to the decoder hidden states.

This makes the model a bit simpler, since it's no longer a "proper" encoder-decoder architecture but a decoder-only model that can be conditioned (a bit like Fuyu).

Note that there are 3 key "modalities", as sketched below:
- the text prompt, which is passed through a text encoder model;
- the audio prompt, which is processed by the feature extractor into a chromagram;
- the MusicGen decoder, which generates EnCodec codes.
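For illustration, here is a minimal end-to-end sketch of how the three modalities fit together, assuming the checkpoint name and processor API added in this PR (the random waveform is a placeholder):

```python
import numpy as np
from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

# Placeholder audio prompt: 5 seconds of mono audio at 32 kHz.
# The feature extractor turns this into a chromagram.
melody = np.random.randn(32000 * 5)

inputs = processor(
    audio=melody,
    sampling_rate=32000,
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt",
)

# The decoder generates EnCodec codes conditioned on text + chromagram.
audio_values = model.generate(**inputs, do_sample=True, max_new_tokens=256)
```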

Why is this model interesting?

  1. Audio prompting (rather than audio continuation) gives really interesting generations.
  2. MusicGen is a decoder-only model and is difficult to train using the original library. I ideally plan to add training capabilities to the model.

cc @sanchit-gandhi

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@sanchit-gandhi left a comment

Looks in pretty good shape! Just some comments about the main docs, as well as argument naming for the conditional hidden-states. While this model is non-standard in some regards, it would be good to try and standardise it as much as possible with MusicGen and the general Transformers API, so as to ensure users can run this model as they expect from the Transformers library.

[Hugging Face Hub](https://huggingface.co/models?sort=downloads&search=facebook%2Fmusicgen).


## Difference with [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen)
Contributor

@sanchit-gandhi commented Feb 5, 2024

Nice! Do you think it makes sense to include the difference in a diagram? E.g. as per:
Uploading ttm_vs_melody_2.png…

Contributor Author

Thanks for the suggestion, I don't see your example though!

docs/source/en/model_doc/musicgen_melody.md

The audio prompt should ideally be free of the low-frequency signals usually produced by instruments such as drums and bass. The [Demucs](https://github.com/adefossez/demucs/tree/main) model can be used to separate vocals and other signals from the drums and bass components.

If you wish to use Demucs, you first need to follow the installation steps [here](https://github.com/adefossez/demucs/tree/main?tab=readme-ov-file#for-musicians) before using the following snippet:
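A hedged sketch of such a snippet, assuming the demucs Python API (get_model / apply_model / convert_audio) and a placeholder input waveform:

```python
import torch
from demucs.pretrained import get_model
from demucs.apply import apply_model
from demucs.audio import convert_audio

demucs = get_model("htdemucs")

# Placeholder stereo prompt: 10 seconds at 32 kHz.
wav, sr = torch.randn(2, 32000 * 10), 32000

# Resample/reshape to what the separator expects, then split into stems.
wav = convert_audio(wav, sr, demucs.samplerate, demucs.audio_channels)
stems = apply_model(demucs, wav[None])  # (batch, sources, channels, time)

# Keep everything except the drums and bass stems
# (htdemucs sources are ["drums", "bass", "other", "vocals"]).
keep = [i for i, name in enumerate(demucs.sources) if name not in ("drums", "bass")]
melody = stems[:, keep].sum(dim=1)
```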
Contributor

Note to reviewer: as detailed, the Demucs model is required as a pre-processing step for audio-conditioned generation. There are two options for using the Demucs model here:

  1. Leverage the original implementation, which has 7.4k GH stars, a huge existing user base and an intuitive API (as shown below). The requirements for the package are significant, but they install reliably.
  2. Convert the Demucs model to Transformers, for example as started in this PR. Note that adding Demucs to the library would only really be beneficial to the MusicGen Melody integration: it's unlikely to be used standalone by Demucs users (since they already have the original implementation installed) and we don't plan on adding fine-tuning support.

=> given how easy it is to use the original implementation, and the shift in mindset to generally try and support more OS libraries (rather than compete with them), I would be in favour of going with option 1. In doing so, we can build a tighter integration between the Demucs library and the HF Hub, rather than blindly integrate it into Transformers for a single model integration.

Contributor

From speaking to @patrickvonplaten, having users perform pre-processing outside of a Diffusers pipeline has worked better than integrating it into the pipeline.

E.g. for ControlNet, having users perform pre-processing with the opencv-python library worked better than integrating this into the pipeline: https://huggingface.co/docs/diffusers/en/using-diffusers/controlnet#text-to-image

Collaborator

Makes sense!
The only "issue" with tight integration is maintenance. Some tokenizers natively support other libraries (like moses) if available.
Doing it outside sounds good to me!

docs/source/en/model_doc/musicgen_melody.md
return audio_values

def get_unconditional_inputs(self, num_samples=1, return_tensors="pt"):
"""# TODO: update,
Contributor

This will give the same behaviour as:

def get_unconditional_inputs(self, num_samples=1):

(it doesn't matter what the input ids are; since we mask them anyway, the model won't attend to them)
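A usage sketch of what this enables, assuming the checkpoint name used elsewhere in this PR:

```python
from transformers import MusicgenMelodyForConditionalGeneration

model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

# The placeholder input ids are fully masked, so the model never attends
# to them: only the number of samples matters.
unconditional_inputs = model.get_unconditional_inputs(num_samples=1)
audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256)
```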

Member

@gante commented Feb 14, 2024

@ylacombe at a quick glance, the generate method looks the same as in MusicGen, which has been previously approved. Are there differences that you'd like me to review? 🤗

Collaborator

@amyeroberts left a comment

Thanks for adding this model!

Mostly just some small nits from me. Overall a really nice PR - the effort put into the tests and model page in particular really stand out.

Overall comment: I realise this is in part due to fitting the model into the transformers library and copying from MusicGen, but this was really quite a difficult PR to review. There's loads of repeated logic in the generate and generate-related methods, and the forward and generate passes are huge. I'd like to request a follow-up PR where a lot of this is abstracted out into smaller, more modular methods.

The only other comment is about compatibility with the rest of our generate library: selecting greedy versus sample behaviours, as well as logit processors. cc @gante to confirm this is OK.

@@ -755,3 +761,57 @@ def test_amplitude_to_db(self):
amplitude_to_db(spectrogram, min_value=0.0)
with pytest.raises(ValueError):
amplitude_to_db(spectrogram, db_range=-80)

@require_librosa
def test_chroma_equivalence(self):
Collaborator

Nice test :)

# Initialize projection layers weights and tie text encoder and decoder weights if set accordingly
self.post_init()

def _init_weights(self, module):
Collaborator

Could you expand on this a bit? It shouldn't be necessary to add this here.

Contributor Author

MusicgenMelodyForConditionalGeneration doesn't inherit from MusicgenMelodyPreTrainedModel, as that raises issues with the offloading features (probably because self.decoder also inherits from MusicgenMelodyPreTrainedModel).

So we need to initialize the weights that are not yet initialized, namely enc_to_dec_proj and audio_enc_to_dec_proj.
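A rough sketch of what such an override covers (module names from the comment above; the exact initializer std/config attribute is an assumption):

```python
import torch.nn as nn

def _init_weights(self, module):
    # Only the projection layers (enc_to_dec_proj, audio_enc_to_dec_proj)
    # need this: the sub-models initialise their own weights via post_init.
    std = self.decoder.config.initializer_factor  # assumed config attribute
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.bias is not None:
            module.bias.data.zero_()
```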

Comment on lines 221 to 231
# skip as this model doesn't support all arguments tested
def test_model_outputs_equivalence(self):
pass

# skip as this model has multiple inputs embeds and lm heads that should not be tied
def test_tie_model_weights(self):
pass

# skip as this model has multiple inputs embeds and lm heads that should not be tied
def test_tied_weights_keys(self):
pass
Collaborator

All skipped tests should be skipped with unittest.skip
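For example, the first skip above would become:

```python
import unittest

@unittest.skip(reason="This model doesn't support all arguments tested")
def test_model_outputs_equivalence(self):
    pass
```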

Contributor Author

I'll add them for MusicgenMelodyDecoderTest but will leave MusicgenMelodyTest's ones, as they are almost all copied from MusicgenTest and I'd rather keep that traceability than switch to unittest.skip.



@torch.no_grad()
def convert_musicgen_melody_checkpoint(checkpoint, pytorch_dump_folder=None, repo_id=None, device="cpu"):
Collaborator

It would be good to have some validation of the outputs between the original and converted model.

Collaborator

(I usually ask to make it optional, as it only holds for the pretrained model, and our integration tests should be where we make sure the converted checkpoint works!)
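A sketch of what an optional check inside the conversion script could look like (function and variable names are illustrative, not from the PR):

```python
import torch

def sanity_check(original_model, hf_model, decoder_input_ids, atol=1e-3):
    """Illustrative: compare the original and converted models on the same inputs."""
    with torch.no_grad():
        original_logits = original_model(decoder_input_ids)  # assumed call signature
        converted_logits = hf_model(input_ids=decoder_input_ids).logits
    if not torch.allclose(original_logits, converted_logits, atol=atol):
        raise ValueError("Original and converted model outputs do not match")
```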

logits_processor = LogitsProcessorList()
return process_kwargs, logits_processor

def test_greedy_generate_dict_outputs(self):
Collaborator

How long do all these generate tests take to run? We might want to decorate with @slow


-->

# MusicGen Melody
Collaborator

Amazing model page 😍

Member

@gante commented Feb 14, 2024

I'd like to request a follow up PR where a lot of this is abstracted out into smaller, more modular methods.

@amyeroberts 100% agreed! However, I believe the ball is mostly on the generate side -- it should be made more flexible, such that enabling models like MusicGen becomes a < 100-line task.

The MusicGen PR had the exact same pattern :)

Collaborator

@ArthurZucker left a comment

Great work, leveraging all the available tools + outside libs! 🤗

I feel like we are giving a bit too much freedom/tools, while for maintenance, supporting only one model init would be better: less logic to handle and fewer things to test.

  • is the final checkpoint going to be hosted at ylacombe/musicgen-melody? Would be nice to have it on either a Meta or a community-based repo (facebook/musicgen-melody-hf?)
  • might be nice to split the complex generation logic into a separate file, like Whisper, keeping the modeling file for the model
  • small nits in the doc can be ignored.

docs/source/en/model_doc/musicgen_melody.md
>>> model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")
```"""

# At the moment fast initialization is not supported for composite models
Collaborator

Any idea why? Llava works nicely with this I believe, so I'm surprised we have a problem here.

Contributor Author

I essentially copied the logic from Musicgen, but when removing this snippet, all tests still passed. I'll remove it.

cc @sanchit-gandhi, any reason for this?

Contributor

Should be fine to remove - was copied from an older model (encoder-decoder) into MusicGen

Comment on lines 1580 to 1584
>>> model = MusicgenMelodyForConditionalGeneration.from_sub_models_pretrained(
... text_encoder_pretrained_model_name_or_path="t5-base",
... audio_encoder_pretrained_model_name_or_path="facebook/encodec_24khz",
... decoder_pretrained_model_name_or_path="facebook/musicgen-melody",
... )
Collaborator

I'm wondering if this is something we want to keep, vs. just supporting init from 3 models and leaving it to the user to call from_pretrained. Feels like we are giving a lot of tools.

Contributor Author

It's in theory possible to use other text encoders and audio encoders. It also allows the flexibility to go back and forth from the CausalLM class when training the decoder only.
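A sketch of that round trip (class names from this PR; the sub-model checkpoints are those from the docstring quoted above):

```python
from transformers import MusicgenMelodyForConditionalGeneration

# Compose the full model from three separately-trained sub-models...
model = MusicgenMelodyForConditionalGeneration.from_sub_models_pretrained(
    text_encoder_pretrained_model_name_or_path="t5-base",
    audio_encoder_pretrained_model_name_or_path="facebook/encodec_24khz",
    decoder_pretrained_model_name_or_path="facebook/musicgen-melody",
)

# ...then pull out the decoder (a MusicgenMelodyForCausalLM) to train it
# standalone, before plugging it back into the composite model.
decoder = model.decoder
```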

Contributor

This was discussed quite extensively for MusicGen (slack thread), where we unanimously decided to keep functionality to initialise the composite model from three separate sub-models.

I would be strongly in favour of maintaining consistency with the MusicGen design here, rather than adding a new design for this spin-off model.

Collaborator

Alright. For the next model I would drive this based on usage: have we seen issues / people using this way of initializing, or not! 🤗

Comment on lines 1416 to 1430
if self.text_encoder.config.to_dict() != self.config.text_encoder.to_dict():
logger.warning(
f"Config of the text_encoder: {self.text_encoder.__class__} is overwritten by shared text_encoder config:"
f" {self.config.text_encoder}"
)
if self.audio_encoder.config.to_dict() != self.config.audio_encoder.to_dict():
logger.warning(
f"Config of the audio_encoder: {self.audio_encoder.__class__} is overwritten by shared audio_encoder config:"
f" {self.config.audio_encoder}"
)
if self.decoder.config.to_dict() != self.config.decoder.to_dict():
logger.warning(
f"Config of the decoder: {self.decoder.__class__} is overwritten by shared decoder config:"
f" {self.config.decoder}"
)
Collaborator

That's expected behaviour, so are we sure we have to warn?

Collaborator

I mean let's not warn here

Collaborator

I think it makes sense to have a separate file for the generation part, like we do for Whisper, no?

Contributor

This could make the modelling + generation code a lot cleaner for the MusicGen series! Although long-term, the issue would be fully resolved by a refactor to generate to make it more composable for audio models (as suggested by @gante)

Contributor

How about we do this as a follow-up PR for MusicGen + MusicGen Melody? (so as not to mix two features into one PR)

Collaborator

Agreed

@ylacombe ylacombe merged commit c43b380 into huggingface:main Mar 18, 2024
23 checks passed
itazap pushed a commit that referenced this pull request May 14, 2024
* first modeling code

* make repository

* still WIP

* update model

* add tests

* add latest change

* clean docstrings and copied from

* update docstrings md and readme

* correct chroma function

* correct copied from and remove unrelated test

* add doc to toctree

* correct imports

* add convert script to notdoctested

* Add suggestion from Sanchit

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* correct get_unconditional_inputs docstrings

* modify README according to Sanchit's feedback

* add chroma to audio utils

* clean librosa and torchaudio hard dependencies

* fix FE

* refactor audio decoder -> audio encoder for consistency with previous musicgen

* refactor conditional -> encoder

* modify sampling rate logic

* modify license at the beginning

* refactor all_self_attns->all_attentions

* remove ignore copy from causallm generate

* add copied from for from_sub_models

* fix make copies

* add warning if audio is truncated

* add copied from where relevant

* remove artefact

* fix convert script

* fix torchaudio and FE

* modify chroma method according to feedback-> better naming

* refactor input_values->input_features

* refactor input_values->input_features and fix import fe

* add input_features to docstrings

* correct inputs_embeds logic

* remove dtype conversion

* refactor _prepare_conditional_hidden_states_kwargs_for_generation ->_prepare_encoder_hidden_states_kwargs_for_generation

* change warning for chroma length

* Update src/transformers/models/musicgen_melody/convert_musicgen_melody_transformers.py

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* change way to save wav, using soundfile

* correct docs and change to soundfile

* fix import

* fix init proj layers

* remove line breaks from md

* fix issue with docstrings

* add FE suggestions

* improve is in logics and remove useless imports

* remove custom from_pretrained

* simplify docstring code

* add suggestions for modeling tests

* make style

* update converting script with sanity check

* remove encoder attention mask from conditional generation

* replace musicgen melody checkpoints with official orga

* rename ylacombe->facebook in checkpoints

* fix copies

* remove unnecessary warning

* add shape in code docstrings

* add files to slow doc tests

* fix md bug and add md to not_tested

* make fix-copies

* fix hidden states test and batching

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>