Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Llava Onevision: add model #32673

Merged
merged 34 commits into from
Sep 5, 2024
Merged

Conversation

zucchini-nlp
Copy link
Member

What does this PR do?

Adds llava onevision models (https://arxiv.org/abs/2408.03326) which are basically same as LLaVa-NeXT with have a different vision tower and image grid pinpoints

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts
Copy link
Collaborator

Thanks for adding this model @zucchini-nlp!

If the model has its own paper - we should add it as its own model and leverage the diff converter for the model addition if changes are needed

@NielsRogge
Copy link
Contributor

Agree with @amyeroberts, which is more in line with the philosophy

@codeb2cc
Copy link

codeb2cc commented Aug 14, 2024

@zucchini-nlp You've hardcoded a path in modeling_llava_next.py and I suppose it's for debugging?

        if input_ids.shape[1] > 1:
            torch.save(inputs_embeds, "/raid/raushan/inputs_embeds_our.pt")

@zucchini-nlp
Copy link
Member Author

Yes, will remove that. I am making a completely new modeling folder for Onevision currently

@zucchini-nlp
Copy link
Member Author

@amyeroberts Ready for review. Added a new model LLaVa-Onevision. I didn't use diff because I realized that copying everything from llava-next is not a good decision. The current code incorporates all the recent refactors we've been working on and supports GenerationTesterMixin 🎉 Apart from that it can support static cache, though not compile due to dynamic control.

I added two processing files, one for image and another for video, since I've been planning to go over video-image LLMs and separate processing into two files. And later add more utility functions to the BaseVideoProcessor class. But in general, there's no difference between them except for that videos can be list of 4D tensors and we need one more loop over frame dimension.

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding!

Main comment is about the new logic to use config.text_config

Comment on lines 1012 to 1015

if hasattr(config, "text_config"): # in case of composite models
config = config.text_config

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want to do this this way: we could equally end up with several text models which might be called different things than text_config. Instead, the cache should be created with the correct config i.e. StaticCache(config.text_config)

Copy link
Member Author

@zucchini-nlp zucchini-nlp Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do it inside cache class also, and in that case handling of composite models will move to cache initialization of each cache class. I see comment from @gante about a helper function to get decoder/composite models' configs. In that case we can leave it here imo, and after @gante merged a PR modify this and other places to self.get_text_config(config)

Oops, realized it is in cache utils. Will move this to generation mixin and adjust tests if needed. Later gante will unify it in helper fn

Comment on lines +2368 to +2372
final_layer = (
self.config.text_config.num_hidden_layers
if hasattr(self.config, "text_config")
else self.config.num_hidden_layers
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - this is making strong (and possibly quite brittle) assumptions about the configs and what properties to use

Copy link
Member Author

@zucchini-nlp zucchini-nlp Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, let me see if we can rely on smth else and not config, and check out the case if a encoder-decoder config is used. Prob we'll have to handle it same way as above, via a helper function

No, seems like we need to get num-layers, so will leave it as is to be handled properly in later PRs. I think this and other generations might be failing in EncoderDecoder models because those are not tested with GenerationMixin

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK cc @gante for reference

docs/source/en/model_doc/llava_onevision.md Outdated Show resolved Hide resolved
docs/source/en/model_doc/llava_onevision.md Outdated Show resolved Hide resolved
@@ -459,7 +459,8 @@ def test_greedy_generate_dict_outputs(self):
# Retrocompatibility check
self.assertIsInstance(output_generate, GreedySearchDecoderOnlyOutput)

self._check_outputs(output_generate, input_ids, model.config)
config = model.config if not hasattr(model.config, "text_config") else model.config.text_config
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels brittle and makes a strong assumption about the structure of our configs it would be good to have @gante's view on this here.

I'd suggest moving this handling into _check_outputs so the logic change only has to be in on place rather than every _check_outputs call, which already has conditional logic on e.g. config.is_encoder_decoder within the utility

Copy link
Member

@gante gante Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are starting to have more places where we need to pull the decoder part of a config in a composite/encoder-decoder model:

  • in the generation config
  • in while saving a pretrained config, to ensure it doesn't contain generate flags [open PR]
  • here

Note that they are all generate-related. In this case, the shape of the generate outputs depends on the decoder, so we need to pull the decoder config for testing purposes.

@amyeroberts I was thinking of abstracting this logic into GenerationMixin, in a function that pulls the decoder config given a config. WDYT? :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be pro! This should help encapsulate the logic without having to handle it everywhere else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I'll open a PR after #32659 gets merged (to avoid conflicts :) )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, will port these to _check_output until the PR is merged

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI opening a PR to add the generic function AND fixing _check_output (#33212 also needs it :D )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zucchini-nlp
Copy link
Member Author

@amyeroberts should be ready! I left the config vs text_config when applicable until @gante merged his PR, after that those can be replaced by a call to something like self.get_config()

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great - thanks for all the work adding!

Comment on lines +2368 to +2372
final_layer = (
self.config.text_config.num_hidden_layers
if hasattr(self.config, "text_config")
else self.config.num_hidden_layers
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK cc @gante for reference

tests/test_modeling_common.py Outdated Show resolved Hide resolved
Comment on lines +180 to +181
# Padding side can be in TextKwargs but is not accepted by the tokenizer
_ = output_kwargs["text_kwargs"].pop("padding_side", None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should find a way to handle this better as we don't want to have to add this for all the processors cc @molbap

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just remove padding_side from TextKwargs, which I believe is already done in one of the PRs for standardizaing

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@zucchini-nlp
Copy link
Member Author

Will need some changes in video processing as discussed here (LLaVA-VL/LLaVA-NeXT#144). Comes out the demo notebook was buggy and videos should be processed in a different way. Will work on it when I get reply from authors on what should be the correct way

@amyeroberts
Copy link
Collaborator

amyeroberts commented Sep 2, 2024

@zucchini-nlp Thanks for the update! I'll mark myself a re-requesting for review then as this will be a significant change

@amyeroberts amyeroberts self-requested a review September 2, 2024 11:17
@zucchini-nlp
Copy link
Member Author

@amyeroberts yes, the tests should be green now. Merging main somehow rewinded my local changes

@zucchini-nlp zucchini-nlp mentioned this pull request Sep 3, 2024
5 tasks
Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great - thanks for all the work iterating on this!

Only thing is to add a short description of the model and resolve the failing SDPA test before merge

docs/source/en/model_doc/llava_onevision.md Show resolved Hide resolved

## Overview

The LLaVA-Onevision model was proposed in [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326) by <Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a short description of the model here? This is what will be used in the release notes. A good example is the music gen page

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@RUFFY-369 RUFFY-369 mentioned this pull request Sep 4, 2024
10 tasks
@zucchini-nlp
Copy link
Member Author

oke, will merge it now as the model works. The only thing I am struggling now is ONNX conversion which is not related to the PR

@zucchini-nlp zucchini-nlp merged commit 43df47d into huggingface:main Sep 5, 2024
24 checks passed
zucchini-nlp added a commit to zucchini-nlp/transformers that referenced this pull request Sep 6, 2024
* working version

* fix copies

* update

* tests

* update docs

* codestyle

* add more tests

* add returns for docs

* clean up

* Update src/transformers/models/llava_onevision/processing_llava_onevision.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates

* codestyle

* style

* shouldn't be reversed

* [run-slow] llava_onevision

* [run-slow] llava_onevision

* add pooling in videos

* [run-slow] llava_onevision

* num-logits-to-keep

* [run-slow] llava_onevision

* [run-slow] llava_onevision

* Update tests/test_modeling_common.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* video matched orig impl

* fix tests

* chat template was modified

* Update docs/source/en/model_doc/llava_onevision.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* add morer info in the doc page

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
* working version

* fix copies

* update

* tests

* update docs

* codestyle

* add more tests

* add returns for docs

* clean up

* Update src/transformers/models/llava_onevision/processing_llava_onevision.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates

* codestyle

* style

* shouldn't be reversed

* [run-slow] llava_onevision

* [run-slow] llava_onevision

* add pooling in videos

* [run-slow] llava_onevision

* num-logits-to-keep

* [run-slow] llava_onevision

* [run-slow] llava_onevision

* Update tests/test_modeling_common.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* video matched orig impl

* fix tests

* chat template was modified

* Update docs/source/en/model_doc/llava_onevision.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* add morer info in the doc page

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@mylesgoose
Copy link

@zucchini-nlp

hello any idea how to get a llama model working with this pr. mylesgoose/Llama-3.1-Minitron-4B-Llava-Nvidia-siglip-ov regards myles I have tried adding the similar files to the huggingface repo as per the qwen model which works. there is small issue I think because the 4b model is a stripped down 8b model with weight sizes.

@zucchini-nlp
Copy link
Member Author

@mylesgoose I am not sure if you got the model from tuning using original repo or in another way. If you used the original repo, feel free to convert to HF style using conver_weights.py script. Otherwise, you would need to write your own conversion script

We try to reserve GH for issues/bugs and feature requests, feel free to open a thread in the forum for any further questions 🤗

@mylesgoose
Copy link

@zucchini-nlp no I made the model using your git repo repo llava one vision. it's a llama 4b model. and this is a feature request. I trained it. because the model is only supported of qwen type models. at present this transformers can't understand my weight are smaller. I am not sure exactly how to modify this one vision transformers to load the llama model type. can you have a look at the huggingface repo.and you will see what I mean.

@zucchini-nlp
Copy link
Member Author

@mylesgoose I am not sure I understand you. The repo shows a trained model with a different LM backbone and has inference script with llava-vl repo code. So I guess you did training and can do generation with llava-vl repo

In transformers we already support any LM backbone, and all you need is to convert the weights, which has several conversion scripts for different llava models. So this is not a feature request. In case you still believe we can help you by adding/modifying llava code, please open an issue and provide minimal script to see what you want to achieve and what output is expected from running it 🤗

@mylesgoose
Copy link

Okay thanks @zucchini-nlp I understand now. I found that script and ran one to generate the bin file and that worked. however the script does not work on the llama model. so I will modify it tomorrow and get it to work and then return working adjusted one. I think it also does not add tokens for video etc also. or check if pad token exists prior to creating one. and created meta tensors on devices and could not use in latter step. will figure it out. thanks allot.

@mylesgoose
Copy link

This seems to have worked. @zucchini-nlp

import argparse
import torch
from transformers import (
    AddedToken,
    AutoConfig,
    AutoImageProcessor,
    AutoTokenizer,
    LlavaConfig,
    LlavaForConditionalGeneration,
    LlavaProcessor,
    SiglipVisionConfig,
)

EPILOG_TXT = """Example:
    python convert_llava_weights_to_hf.py \\
      --text_model_id mylesgoose/Llava-Llama-3.1-Minitron-4B-Nvidia-siglip-ov \\
      --vision_model_id google/siglip-so400m-patch14-384 \\
      --output_hub_path mylesgoose/Llama-3.1-Minitron-4B-Llava-Nvidia-siglip-ov-hf \\
      --old_state_dict_id /home/myles/LLaVA-NeXT/tmp/hf_models/llava-Llama-3.1-Minitron-4B/model_state_dict.bin

Example for creating the old state dict file with Python:

    import torch
    from llava.model.language_model.llava_llama import LlavaLlamaForCausalLM

    # Load model
    kwargs = {"device_map": "auto", "torch_dtype": torch.float16}
    model = LlavaLlamaForCausalLM.from_pretrained("mylesgoose/Llava-Llama-3.1-Minitron-4B-Nvidia-siglip-ov", low_cpu_mem_usage=True, **kwargs)

    # Load vision tower
    model.get_vision_tower().load_model()

    # Save state dict
    torch.save(model.state_dict(), "/home/myles/LLaVA-NeXT/tmp/hf_models/llava-Llama-3.1-Minitron-4B/model_state_dict.bin")
"""

KEYS_TO_MODIFY_MAPPING = {
    "model.vision_tower.vision_tower.": "vision_tower.",
    "model.vision_tower.": "vision_tower.",
    ".vision_resampler": "",  # Not used in the original implementation
    "model.mm_projector": "multi_modal_projector",
    "model": "model.model",
    "vision_model.model": "vision_model",
    "lm_head": "language_model.lm_head",
    "model.model": "language_model.model",
    "multi_modal_projector.0": "multi_modal_projector.linear_1",
    "multi_modal_projector.2": "multi_modal_projector.linear_2",
}

def convert_state_dict_to_hf(state_dict):
    new_state_dict = {}
    for key, value in state_dict.items():
        # Skip keys ending with .inv_freq (if present)
        if key.endswith(".inv_freq"):
            continue
        # Modify keys based on the mapping
        for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():
            if key_to_modify in key:
                key = key.replace(key_to_modify, new_key)
        new_state_dict[key] = value
    return new_state_dict

def convert_llava_llama_to_hf(text_model_id, vision_model_id, output_hub_path, old_state_dict_id):
    torch.set_default_dtype(torch.bfloat16)
    text_config = AutoConfig.from_pretrained(text_model_id)

    tokenizer = AutoTokenizer.from_pretrained(text_model_id)
    tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
    tokenizer.add_tokens(AddedToken("<video>", special=True, normalized=False), special_tokens=True)
    if tokenizer.pad_token_id is None:
        tokenizer.add_special_tokens({"pad_token": "<|finetune_right_pad_id|>"})

    image_processor = AutoImageProcessor.from_pretrained(vision_model_id)
    processor = LlavaProcessor(tokenizer=tokenizer, image_processor=image_processor)

    # Set vision_config based on the vision model
    vision_config = SiglipVisionConfig.from_pretrained(vision_model_id)

    # Adjust the vision_config to match your model's parameters
    vision_config.hidden_size = 1152
    vision_config.num_attention_heads = 16
    vision_config.num_hidden_layers = 24
    vision_config.intermediate_size = 4304
    vision_config.patch_size = 14
    vision_config.image_size = 384
    vision_config.vision_use_head = False

    config = LlavaConfig(
        text_config=text_config,
        vision_config=vision_config.to_dict(),
    )

    # Set other configuration parameters as needed
    config.pad_token_id = tokenizer.pad_token_id
    config.image_token_index = tokenizer.convert_tokens_to_ids("<image>")
    config.video_token_index = tokenizer.convert_tokens_to_ids("<video>")
    config.vision_feature_layer = -2  # Adjust if necessary
    config.vision_tower_pretrained = vision_model_id

    # Instantiate the model
    model = LlavaForConditionalGeneration(config)

    # Load the state dictionary from the provided path
    state_dict_path = old_state_dict_id
    state_dict = torch.load(state_dict_path, map_location="cpu")

    # Convert state dict to match Hugging Face model's expected keys
    state_dict = convert_state_dict_to_hf(state_dict)

    # Load the state dictionary into the model
    missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
    print(f"Missing keys: {missing_keys}")
    print(f"Unexpected keys: {unexpected_keys}")

    # Resize token embeddings to accommodate new tokens
    model.resize_token_embeddings(len(tokenizer))

    # Get the updated sizes
    pre_expansion_embeddings = model.language_model.model.embed_tokens.weight.data
    vocab_size = pre_expansion_embeddings.size(0)
    new_vocab_size = model.language_model.model.embed_tokens.weight.size(0)
    num_new_tokens = new_vocab_size - vocab_size

    print(f"Original vocab size: {vocab_size}")
    print(f"New vocab size: {new_vocab_size}")
    print(f"Number of new tokens: {num_new_tokens}")

    # Initialize new token embeddings only if there are new tokens
    if num_new_tokens > 0:
        mu = torch.mean(pre_expansion_embeddings, dim=0).float()
        n = pre_expansion_embeddings.size(0)
        sigma = ((pre_expansion_embeddings - mu).T @ (pre_expansion_embeddings - mu)) / n
        dist = torch.distributions.multivariate_normal.MultivariateNormal(
            mu.cpu(), covariance_matrix=1e-5 * sigma.cpu()
        )

        new_embeddings = torch.stack(
            [dist.sample() for _ in range(num_new_tokens)], dim=0
        ).to(pre_expansion_embeddings.device)
        model.language_model.model.embed_tokens.weight.data[vocab_size:] = new_embeddings
        model.language_model.lm_head.weight.data[vocab_size:] = new_embeddings
    else:
        print("No new tokens to add to embeddings.")

    # Save the converted model and processor
    model.push_to_hub(output_hub_path)
    processor.push_to_hub(output_hub_path)

def main():
    parser = argparse.ArgumentParser(
        epilog=EPILOG_TXT,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--text_model_id",
        default="mylesgoose/Llava-Llama-3.1-Minitron-4B-Nvidia-siglip-ov",
        help="Hub location of the text model",
    )
    parser.add_argument(
        "--vision_model_id",
        default="google/siglip-so400m-patch14-384",
        help="Hub location of the vision model",
    )
    parser.add_argument(
        "--output_hub_path",
        default="mylesgoose/Llama-3.1-Minitron-4B-Llava-Nvidia-siglip-ov-hf",
        help="Location on the hub of the converted model",
    )
    parser.add_argument(
        "--old_state_dict_id",
        default="/home/myles/LLaVA-NeXT/tmp/hf_models/llava-Llama-3.1-Minitron-4B/model_state_dict.bin",
        help="Path to the raw state dict of the original model. The filename needs to be `model_state_dict.bin`",
    )
    args = parser.parse_args()
    convert_llava_llama_to_hf(
        args.text_model_id,
        args.vision_model_id,
        args.output_hub_path,
        args.old_state_dict_id
    )

if __name__ == "__main__":
    main()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.