Add MusicGen Melody #28819
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Looks in pretty good shape! Just some comments about the main docs, as well as argument naming for the conditional hidden-states. While this model is non-standard in some regards, it would be good to try and standardise it as much as possible with MusicGen and the general Transformers API, so as to ensure users can run this model as they expect from the Transformers library.
[Hugging Face Hub](https://huggingface.co/models?sort=downloads&search=facebook%2Fmusicgen).

## Difference with [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen)
Thanks for the suggestion, I don't see your example though!
The audio prompt should ideally be free of the low-frequency signals usually produced by instruments such as drums and bass. The [Demucs](https://github.com/adefossez/demucs/tree/main) model can be used to separate vocals and other signals from the drums and bass components.

If you wish to use Demucs, you first need to follow the installation steps [here](https://github.com/adefossez/demucs/tree/main?tab=readme-ov-file#for-musicians) before using the following snippet:
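The snippet itself did not survive the export of this thread; below is a minimal sketch of the pattern being described, assuming a `waveform` tensor, its `sample_rate`, and the `htdemucs` checkpoint (not necessarily the exact code from the PR):

```python
import torch
from demucs import pretrained
from demucs.apply import apply_model
from demucs.audio import convert_audio

# Load the pretrained Hybrid Transformer Demucs separator.
demucs = pretrained.get_model("htdemucs")

# Assumed inputs: a (channels, time) float waveform and its sampling rate.
waveform = torch.randn(2, 10 * 44100)
sample_rate = 44100

# Resample/remix to the rate and channel count Demucs expects.
wav = convert_audio(waveform, sample_rate, demucs.samplerate, demucs.audio_channels)

# Separate into stems: the output has shape (batch, sources, channels, time),
# with stems ordered as in `demucs.sources` (drums, bass, other, vocals for htdemucs).
stems = apply_model(demucs, wav[None])

# Keep everything except the low-frequency drums and bass stems as the melody prompt.
keep = [i for i, name in enumerate(demucs.sources) if name not in ("drums", "bass")]
melody = stems[:, keep].sum(dim=1)
```

The resulting `melody` waveform (vocals + other, minus drums and bass) would then be passed to the feature extractor as the audio prompt.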
Note to reviewer: as detailed, the Demucs model is required as a pre-processing step for audio-conditioned generation. There are two options for using the Demucs model here:
- Leverage the original implementation, which has 7.4k GH stars, a huge existing user base and an intuitive API (as shown in the snippet above). The requirements for the package are significant, but install reliably.
- Convert the Demucs model to Transformers, for example as started in this PR. Note that adding Demucs to the library would only really benefit the MusicGen Melody integration: it's unlikely to be used standalone by Demucs users (since they already have the original implementation installed) and we don't plan on adding fine-tuning support.

=> Given how easy it is to use the original implementation, and the general shift in mindset towards supporting open-source libraries (rather than competing with them), I would be in favour of going with option 1. In doing so, we can build a tighter integration between the Demucs library and the HF Hub, rather than blindly integrating it into Transformers for a single model integration.
From speaking to @patrickvonplaten, having users perform pre-processing outside of a Diffusers pipeline has worked better than integrating it into the pipeline.
E.g. for ControlNet, having users perform the pre-processing with the opencv-python library worked better than integrating it into the pipeline: https://huggingface.co/docs/diffusers/en/using-diffusers/controlnet#text-to-image
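For illustration, the pattern from those docs looks roughly like this (the file name is a placeholder; the `cv2.Canny` thresholds are the commonly used example values):

```python
import cv2
import numpy as np
from PIL import Image

# User-side pre-processing, outside the Diffusers pipeline: build the Canny
# edge map that the ControlNet consumes as its conditioning image.
image = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))
```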
Makes sense!
The only "issue" with tight integration is maintenance. Some tokenizers natively support other libraries (like moses) if available.
Doing it outside sounds good to me!
```python
        return audio_values

    def get_unconditional_inputs(self, num_samples=1, return_tensors="pt"):
        """# TODO: update,
```
This will give the same behaviour as:

```python
    def get_unconditional_inputs(self, num_samples=1):
```

(it doesn't matter what the input ids are: since we mask them anyway, the model won't attend to them)

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
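To make the point concrete, a minimal sketch of the behaviour being described (not the PR's actual implementation):

```python
import torch

def get_unconditional_inputs(self, num_samples=1):
    # The actual input_ids values are irrelevant: with an all-zero attention
    # mask, the text-encoder states are fully masked out, so the decoder
    # never attends to them.
    input_ids = torch.zeros((num_samples, 1), dtype=torch.long)
    attention_mask = torch.zeros((num_samples, 1), dtype=torch.long)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```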
@ylacombe at a quick glance, the […]
Thanks for adding this model!
Mostly just some small nits from me. Overall a really nice PR - the effort put into the tests and model page in particular really stands out.
Overall comment: I realise this is in part due to fitting the model into the transformers library and copying from MusicGen, but this was really quite a difficult PR to review. There's loads of repeated logic in `generate` and the generate-related methods, and the forward and generate passes are huge. I'd like to request a follow-up PR where a lot of this is abstracted out into smaller, more modular methods.
Only other comment is the compatibility with the rest of our generate library and selecting `greedy` versus `sample` behaviours, as well as logit processors. cc @gante to confirm this is OK.
```diff
@@ -755,3 +761,57 @@ def test_amplitude_to_db(self):
         amplitude_to_db(spectrogram, min_value=0.0)
         with pytest.raises(ValueError):
             amplitude_to_db(spectrogram, db_range=-80)
+
+    @require_librosa
+    def test_chroma_equivalence(self):
```
Nice test :)
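For context, the equivalence being tested is roughly the following (the helper and argument names are assumptions based on this PR's `chroma_filter_bank` addition to `audio_utils`):

```python
import librosa
import numpy as np
from transformers.audio_utils import chroma_filter_bank

# The chroma filter bank added in this PR should match librosa's reference
# implementation; num_frequency_bins=1025 corresponds to n_fft=2048. Note that
# Transformers filter banks are transposed relative to librosa's, hence the .T.
ours = chroma_filter_bank(num_frequency_bins=1025, num_chroma=12, sampling_rate=32000, tuning=0.0)
reference = librosa.filters.chroma(sr=32000, n_fft=2048, n_chroma=12, tuning=0.0)
np.testing.assert_allclose(ours, reference.T, atol=1e-8)
```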
```python
        # Initialize projection layers weights and tie text encoder and decoder weights if set accordingly
        self.post_init()

    def _init_weights(self, module):
```
Could you expand on this a bit? It shouldn't be necessary to add this here.
`MusicgenMelodyForConditionalGeneration` doesn't inherit from `MusicgenMelodyPreTrainedModel`, as that raises issues with the offloading features (probably because `self.decoder` also inherits from `MusicgenMelodyPreTrainedModel`).
So we need to initialize the weights that are not yet initialized, namely `enc_to_dec_proj` and `audio_enc_to_dec_proj`.
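A sketch of what such an `_init_weights` could look like under the usual Transformers conventions (taking `initializer_factor` from the decoder config is an assumption, not the PR's exact code):

```python
import torch.nn as nn

def _init_weights(self, module):
    # Only the projection layers owned by the composite model need handling
    # here; the text encoder, audio encoder and decoder initialize their own
    # weights through their respective PreTrainedModel classes.
    std = self.decoder.config.initializer_factor
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.bias is not None:
            module.bias.data.zero_()
```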
```python
    # skip as this model doesn't support all arguments tested
    def test_model_outputs_equivalence(self):
        pass

    # skip as this model has multiple inputs embeds and lm heads that should not be tied
    def test_tie_model_weights(self):
        pass

    # skip as this model has multiple inputs embeds and lm heads that should not be tied
    def test_tied_weights_keys(self):
        pass
```
All skipped tests should be skipped with `unittest.skip`.
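i.e. the usual pattern, so the skip reason shows up in the test report (test and class names here just mirror the ones above):

```python
import unittest

class MusicgenMelodyDecoderTest(unittest.TestCase):
    @unittest.skip(reason="MusicgenMelodyModel doesn't support all of the arguments tested")
    def test_model_outputs_equivalence(self):
        pass
```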
➕
I'll add them for `MusicgenMelodyDecoderTest` but will leave `MusicgenMelodyTest`'s ones, as they are almost all copied from `MusicgenTest` and I'd rather keep traceability over `unittest.skip`.
```python
@torch.no_grad()
def convert_musicgen_melody_checkpoint(checkpoint, pytorch_dump_folder=None, repo_id=None, device="cpu"):
```
It would be good to have some validation of the outputs between the original and converted model.
(I usually ask to make it optional, as it only holds for the pretrained model, and our integration tests should be where we make sure the converted checkpoint works!)
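For example, a hypothetical optional check in the conversion script (the original model's call signature is assumed for illustration):

```python
import torch

def check_conversion(original_model, hf_model, atol=1e-5):
    # Run both models on the same dummy code ids and compare logits; this only
    # holds for the pretrained checkpoint, hence keeping the check optional.
    input_ids = torch.arange(8, dtype=torch.long).reshape(2, 4)
    with torch.no_grad():
        original_logits = original_model(input_ids)      # assumed interface
        converted_logits = hf_model(input_ids).logits
    if not torch.allclose(original_logits, converted_logits, atol=atol):
        raise ValueError("Original and converted model logits do not match")
```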
```python
        logits_processor = LogitsProcessorList()
        return process_kwargs, logits_processor

    def test_greedy_generate_dict_outputs(self):
```
How long do all these generate tests take to run? We might want to decorate them with `@slow`.
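i.e. with the standard marker, so they only run on the scheduled CI (the class name here is illustrative):

```python
import unittest

from transformers.testing_utils import slow

class MusicgenMelodyIntegrationTests(unittest.TestCase):
    # @slow tests are skipped on the regular CI and only run on nightly/scheduled jobs.
    @slow
    def test_greedy_generate_dict_outputs(self):
        ...
```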
-->

# MusicGen Melody
Amazing model page 😍
@amyeroberts 100% agreed! However, I believe the ball is mostly on the […]. The MusicGen PR had the exact same pattern :)
Great work, leveraging all the available tools + outside libs! 🤗
I feel like we are giving a bit too much freedom / too many tools, while for maintenance, only supporting one model init would be better: less logic to handle + fewer things to test.

- is the final checkpoint going to be hosted at `ylacombe/musicgen-melody`? Would be nice to have it on either a Meta or a community-based repo (`facebook/musicgen-melody-hf`?)
- might be nice to split the complex generation logic into a separate file, like Whisper? Keeping the modeling file for the model
- small nits in the doc can be ignored.
````python
        >>> model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")
        ```"""

        # At the moment fast initialization is not supported for composite models
````
Any idea why? Llava works nicely with this I believe, so I'm surprised we have a problem here.
I essentially copied the logic from MusicGen, but when removing this snippet, all tests pass. I'll remove it.
cc @sanchit-gandhi, any reason for this?
Should be fine to remove - was copied from an older model (encoder-decoder) into MusicGen
```python
    # skip as this model doesn't support all arguments tested
    def test_model_outputs_equivalence(self):
        pass

    # skip as this model has multiple inputs embeds and lm heads that should not be tied
    def test_tie_model_weights(self):
        pass

    # skip as this model has multiple inputs embeds and lm heads that should not be tied
    def test_tied_weights_keys(self):
        pass
```
➕
```python
>>> model = MusicgenMelodyForConditionalGeneration.from_sub_models_pretrained(
...     text_encoder_pretrained_model_name_or_path="t5-base",
...     audio_encoder_pretrained_model_name_or_path="facebook/encodec_24khz",
...     decoder_pretrained_model_name_or_path="facebook/musicgen-melody",
... )
```
I'm wondering if this is something we want to go with, vs. just supporting init from the three sub-models and leaving it to the user to call `from_pretrained`. Feels like we are giving a lot of tools.
It's in theory possible to use other text encoders and audio encoders. It also allows the flexibility to go back and forth from the CausalLM class if training the decoder only.
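For instance, a sketch of that round trip (the paths are illustrative, not real checkpoints):

```python
from transformers import MusicgenMelodyForConditionalGeneration

# Pull the decoder out of the composite model to fine-tune it standalone...
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")
decoder = model.decoder  # a MusicgenMelodyForCausalLM
# ... fine-tune `decoder` ...
decoder.save_pretrained("./finetuned-melody-decoder")

# ...then re-assemble the composite model around the fine-tuned decoder.
model = MusicgenMelodyForConditionalGeneration.from_sub_models_pretrained(
    text_encoder_pretrained_model_name_or_path="t5-base",
    audio_encoder_pretrained_model_name_or_path="facebook/encodec_24khz",
    decoder_pretrained_model_name_or_path="./finetuned-melody-decoder",
)
```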
This was discussed quite extensively for MusicGen (slack thread), where we unanimously decided to keep the functionality to initialise the composite model from three separate sub-models.
I would be strongly in favour of maintaining consistency with the MusicGen design here, rather than adding a new design for this spin-off model.
Alright. For the next model I would drive this based on usage, i.e. have we seen issues / people using this way of initializing or not! 🤗
```python
        if self.text_encoder.config.to_dict() != self.config.text_encoder.to_dict():
            logger.warning(
                f"Config of the text_encoder: {self.text_encoder.__class__} is overwritten by shared text_encoder config:"
                f" {self.config.text_encoder}"
            )
        if self.audio_encoder.config.to_dict() != self.config.audio_encoder.to_dict():
            logger.warning(
                f"Config of the audio_encoder: {self.audio_encoder.__class__} is overwritten by shared audio_encoder config:"
                f" {self.config.audio_encoder}"
            )
        if self.decoder.config.to_dict() != self.config.decoder.to_dict():
            logger.warning(
                f"Config of the decoder: {self.decoder.__class__} is overwritten by shared decoder config:"
                f" {self.config.decoder}"
            )
```
That's expected behaviour - not sure we have to warn?
I mean let's not warn here
I think it makes sense to have a separate file for the generation part, like we do for Whisper, no?
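For reference, Whisper keeps its generation logic in a dedicated `generation_whisper.py` mixin; a hypothetical equivalent layout here would be:

```python
# src/transformers/models/musicgen_melody/generation_musicgen_melody.py (hypothetical)

class MusicgenMelodyGenerationMixin:
    """Would hold the custom generate() and sampling logic currently
    living in modeling_musicgen_melody.py."""

    def generate(self, inputs=None, **kwargs):
        ...

# modeling_musicgen_melody.py would then only define the architecture:
# class MusicgenMelodyForConditionalGeneration(MusicgenMelodyGenerationMixin, PreTrainedModel):
#     ...
```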
This could make the modelling + generation code a lot cleaner for the MusicGen series! Although long-term, the issue would be fully resolved by a refactor of `generate` to make it more composable for audio models (as suggested by @gante).
How about we do this as a follow-up PR for MusicGen + MusicGen Melody? (so as not to mix two features into one PR)
Agreed
* first modeling code
* make repository
* still WIP
* update model
* add tests
* add latest change
* clean docstrings and copied from
* update docstrings md and readme
* correct chroma function
* correct copied from and remove unreleated test
* add doc to toctree
* correct imports
* add convert script to notdoctested
* Add suggestion from Sanchit (Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>)
* correct get_uncoditional_inputs docstrings
* modify README according to SANCHIT feedback
* add chroma to audio utils
* clean librosa and torchaudio hard dependencies
* fix FE
* refactor audio decoder -> audio encoder for consistency with previous musicgen
* refactor conditional -> encoder
* modify sampling rate logics
* modify license at the beginning
* refactor all_self_attns->all_attentions
* remove ignore copy from causallm generate
* add copied from for from_sub_models
* fix make copies
* add warning if audio is truncated
* add copied from where relevant
* remove artefact
* fix convert script
* fix torchaudio and FE
* modify chroma method according to feedback -> better naming
* refactor input_values->input_features
* refactor input_values->input_features and fix import fe
* add input_features to docstrigs
* correct inputs_embeds logics
* remove dtype conversion
* refactor _prepare_conditional_hidden_states_kwargs_for_generation -> _prepare_encoder_hidden_states_kwargs_for_generation
* change warning for chroma length
* Update src/transformers/models/musicgen_melody/convert_musicgen_melody_transformers.py (Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>)
* change way to save wav, using soundfile
* correct docs and change to soundfile
* fix import
* fix init proj layers
* remove line breaks from md
* fix issue with docstrings
* add FE suggestions
* improve is in logics and remove useless imports
* remove custom from_pretrained
* simplify docstring code
* add suggestions for modeling tests
* make style
* update converting script with sanity check
* remove encoder attention mask from conditional generation
* replace musicgen melody checkpoints with official orga
* rename ylacombe->facebook in checkpoints
* fix copies
* remove unecessary warning
* add shape in code docstrings
* add files to slow doc tests
* fix md bug and add md to not_tested
* make fix-copies
* fix hidden states test and batching

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
What does this PR do?
MusicGen Melody was released at the same time as the "o.g." MusicGen that has already been integrated into `transformers`. Contrary to the already integrated model, you can condition the generation with an audio prompt (instead of continuing the audio prompt).

Main conceptual difference -> we no longer use cross-attention to condition the generation on the text/audio prompt; instead, we concatenate the text/audio prompt to the decoder hidden states (see the sketch after the list below).
This makes the model a bit simpler, since it's no longer a "proper" encoder-decoder architecture but a decoder-only model that can be conditioned (a bit like Fuyu).

Note that there are 3 key "modalities":
-> the prompt text, which is passed through a text encoder model.
-> the audio prompt, which is processed by the feature extractor to give a chromagram.
-> the musicgen decoder, which generates Encodec codes.
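As a sketch of the conditioning scheme described above (all shapes are made up for illustration, not taken from the PR):

```python
import torch

batch, hidden = 2, 1024
text_states = torch.randn(batch, 12, hidden)     # projected text-encoder outputs
chroma_states = torch.randn(batch, 235, hidden)  # projected chromagram features
decoder_embeds = torch.randn(batch, 50, hidden)  # decoder input embeddings

# Instead of cross-attending to the prompts, the decoder self-attends over the
# conditioning states prepended along the sequence dimension.
hidden_states = torch.cat([text_states, chroma_states, decoder_embeds], dim=1)
print(hidden_states.shape)  # torch.Size([2, 297, 1024])
```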
Why is this model interesting?
cc @sanchit-gandhi