Llava Onevision: add model #32673

zucchini-nlp · 2024-08-14T05:54:10Z

What does this PR do?

Adds llava onevision models (https://arxiv.org/abs/2408.03326) which are basically same as LLaVa-NeXT with have a different vision tower and image grid pinpoints

HuggingFaceDocBuilderDev · 2024-08-14T06:12:43Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

amyeroberts · 2024-08-14T08:16:51Z

Thanks for adding this model @zucchini-nlp!

If the model has its own paper - we should add it as its own model and leverage the diff converter for the model addition if changes are needed

NielsRogge · 2024-08-14T08:54:36Z

Agree with @amyeroberts, which is more in line with the philosophy

codeb2cc · 2024-08-14T13:00:19Z

@zucchini-nlp You've hardcoded a path in modeling_llava_next.py and I suppose it's for debugging?

        if input_ids.shape[1] > 1:
            torch.save(inputs_embeds, "/raid/raushan/inputs_embeds_our.pt")

zucchini-nlp · 2024-08-14T13:08:46Z

Yes, will remove that. I am making a completely new modeling folder for Onevision currently

zucchini-nlp · 2024-08-15T12:54:21Z

@amyeroberts Ready for review. Added a new model LLaVa-Onevision. I didn't use diff because I realized that copying everything from llava-next is not a good decision. The current code incorporates all the recent refactors we've been working on and supports GenerationTesterMixin 🎉 Apart from that it can support static cache, though not compile due to dynamic control.

I added two processing files, one for image and another for video, since I've been planning to go over video-image LLMs and separate processing into two files. And later add more utility functions to the BaseVideoProcessor class. But in general, there's no difference between them except for that videos can be list of 4D tensors and we need one more loop over frame dimension.

amyeroberts

Thanks for adding!

Main comment is about the new logic to use config.text_config

amyeroberts · 2024-08-15T14:58:24Z

src/transformers/cache_utils.py

+
+        if hasattr(config, "text_config"):  # in case of composite models
+            config = config.text_config
+


I'm not sure we want to do this this way: we could equally end up with several text models which might be called different things than text_config. Instead, the cache should be created with the correct config i.e. StaticCache(config.text_config)

We can do it inside cache class also, and in that case handling of composite models will move to cache initialization of each cache class. I see comment from @gante about a helper function to get decoder/composite models' configs. In that case we can leave it here imo, and after @gante merged a PR modify this and other places to self.get_text_config(config)

Oops, realized it is in cache utils. Will move this to generation mixin and adjust tests if needed. Later gante will unify it in helper fn

amyeroberts · 2024-08-15T14:59:24Z

src/transformers/generation/utils.py

+        final_layer = (
+            self.config.text_config.num_hidden_layers
+            if hasattr(self.config, "text_config")
+            else self.config.num_hidden_layers
+        )


Same here - this is making strong (and possibly quite brittle) assumptions about the configs and what properties to use

Hmm, let me see if we can rely on smth else and not config, and check out the case if a encoder-decoder config is used. Prob we'll have to handle it same way as above, via a helper function

No, seems like we need to get num-layers, so will leave it as is to be handled properly in later PRs. I think this and other generations might be failing in EncoderDecoder models because those are not tested with GenerationMixin

OK cc @gante for reference

src/transformers/models/llava_onevision/modeling_llava_onevision.py

src/transformers/models/llava_onevision/processing_llava_onevision.py

tests/models/llava_onevision/test_modeling_llava_onevision.py

docs/source/en/model_doc/llava_onevision.md

amyeroberts · 2024-08-16T12:42:36Z

tests/generation/test_utils.py

@@ -459,7 +459,8 @@ def test_greedy_generate_dict_outputs(self):
                # Retrocompatibility check
                self.assertIsInstance(output_generate, GreedySearchDecoderOnlyOutput)

-            self._check_outputs(output_generate, input_ids, model.config)
+            config = model.config if not hasattr(model.config, "text_config") else model.config.text_config


This feels brittle and makes a strong assumption about the structure of our configs it would be good to have @gante's view on this here.

I'd suggest moving this handling into _check_outputs so the logic change only has to be in on place rather than every _check_outputs call, which already has conditional logic on e.g. config.is_encoder_decoder within the utility

We are starting to have more places where we need to pull the decoder part of a config in a composite/encoder-decoder model:

in the generation config

in while saving a pretrained config, to ensure it doesn't contain generate flags [open PR]

here

Note that they are all generate-related. In this case, the shape of the generate outputs depends on the decoder, so we need to pull the decoder config for testing purposes.

@amyeroberts I was thinking of abstracting this logic into GenerationMixin, in a function that pulls the decoder config given a config. WDYT? :)

I'd be pro! This should help encapsulate the logic without having to handle it everywhere else.

Cool, I'll open a PR after #32659 gets merged (to avoid conflicts :) )

Great, will port these to _check_output until the PR is merged

FYI opening a PR to add the generic function AND fixing _check_output (#33212 also needs it :D )

…sion.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

zucchini-nlp · 2024-08-19T11:12:45Z

@amyeroberts should be ready! I left the config vs text_config when applicable until @gante merged his PR, after that those can be replaced by a call to something like self.get_config()

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

zucchini-nlp · 2024-09-02T08:18:44Z

Will need some changes in video processing as discussed here (LLaVA-VL/LLaVA-NeXT#144). Comes out the demo notebook was buggy and videos should be processed in a different way. Will work on it when I get reply from authors on what should be the correct way

amyeroberts · 2024-09-02T11:17:23Z

@zucchini-nlp Thanks for the update! I'll mark myself a re-requesting for review then as this will be a significant change

zucchini-nlp · 2024-09-02T11:39:03Z

@amyeroberts yes, the tests should be green now. Merging main somehow rewinded my local changes

amyeroberts

Looks great - thanks for all the work iterating on this!

Only thing is to add a short description of the model and resolve the failing SDPA test before merge

docs/source/en/model_doc/llava_onevision.md

amyeroberts · 2024-09-03T18:10:21Z

docs/source/en/model_doc/llava_onevision.md

+
+## Overview
+
+The LLaVA-Onevision model was proposed in [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326) by <Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li


Could you add a short description of the model here? This is what will be used in the release notes. A good example is the music gen page

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

zucchini-nlp · 2024-09-05T09:43:12Z

oke, will merge it now as the model works. The only thing I am struggling now is ONNX conversion which is not related to the PR

mylesgoose · 2024-09-26T23:31:18Z

@zucchini-nlp

hello any idea how to get a llama model working with this pr. mylesgoose/Llama-3.1-Minitron-4B-Llava-Nvidia-siglip-ov regards myles I have tried adding the similar files to the huggingface repo as per the qwen model which works. there is small issue I think because the 4b model is a stripped down 8b model with weight sizes.

zucchini-nlp · 2024-09-27T09:18:22Z

@mylesgoose I am not sure if you got the model from tuning using original repo or in another way. If you used the original repo, feel free to convert to HF style using conver_weights.py script. Otherwise, you would need to write your own conversion script

We try to reserve GH for issues/bugs and feature requests, feel free to open a thread in the forum for any further questions 🤗

mylesgoose · 2024-09-27T10:05:54Z

@zucchini-nlp no I made the model using your git repo repo llava one vision. it's a llama 4b model. and this is a feature request. I trained it. because the model is only supported of qwen type models. at present this transformers can't understand my weight are smaller. I am not sure exactly how to modify this one vision transformers to load the llama model type. can you have a look at the huggingface repo.and you will see what I mean.

zucchini-nlp · 2024-09-28T11:02:16Z

@mylesgoose I am not sure I understand you. The repo shows a trained model with a different LM backbone and has inference script with llava-vl repo code. So I guess you did training and can do generation with llava-vl repo

In transformers we already support any LM backbone, and all you need is to convert the weights, which has several conversion scripts for different llava models. So this is not a feature request. In case you still believe we can help you by adding/modifying llava code, please open an issue and provide minimal script to see what you want to achieve and what output is expected from running it 🤗

mylesgoose · 2024-09-28T14:21:36Z

Okay thanks @zucchini-nlp I understand now. I found that script and ran one to generate the bin file and that worked. however the script does not work on the llama model. so I will modify it tomorrow and get it to work and then return working adjusted one. I think it also does not add tokens for video etc also. or check if pad token exists prior to creating one. and created meta tensors on devices and could not use in latter step. will figure it out. thanks allot.

mylesgoose · 2024-09-29T03:53:42Z

This seems to have worked. @zucchini-nlp

import argparse
import torch
from transformers import (
    AddedToken,
    AutoConfig,
    AutoImageProcessor,
    AutoTokenizer,
    LlavaConfig,
    LlavaForConditionalGeneration,
    LlavaProcessor,
    SiglipVisionConfig,
)

EPILOG_TXT = """Example:
    python convert_llava_weights_to_hf.py \\
      --text_model_id mylesgoose/Llava-Llama-3.1-Minitron-4B-Nvidia-siglip-ov \\
      --vision_model_id google/siglip-so400m-patch14-384 \\
      --output_hub_path mylesgoose/Llama-3.1-Minitron-4B-Llava-Nvidia-siglip-ov-hf \\
      --old_state_dict_id /home/myles/LLaVA-NeXT/tmp/hf_models/llava-Llama-3.1-Minitron-4B/model_state_dict.bin

Example for creating the old state dict file with Python:

    import torch
    from llava.model.language_model.llava_llama import LlavaLlamaForCausalLM

    # Load model
    kwargs = {"device_map": "auto", "torch_dtype": torch.float16}
    model = LlavaLlamaForCausalLM.from_pretrained("mylesgoose/Llava-Llama-3.1-Minitron-4B-Nvidia-siglip-ov", low_cpu_mem_usage=True, **kwargs)

    # Load vision tower
    model.get_vision_tower().load_model()

    # Save state dict
    torch.save(model.state_dict(), "/home/myles/LLaVA-NeXT/tmp/hf_models/llava-Llama-3.1-Minitron-4B/model_state_dict.bin")
"""

KEYS_TO_MODIFY_MAPPING = {
    "model.vision_tower.vision_tower.": "vision_tower.",
    "model.vision_tower.": "vision_tower.",
    ".vision_resampler": "",  # Not used in the original implementation
    "model.mm_projector": "multi_modal_projector",
    "model": "model.model",
    "vision_model.model": "vision_model",
    "lm_head": "language_model.lm_head",
    "model.model": "language_model.model",
    "multi_modal_projector.0": "multi_modal_projector.linear_1",
    "multi_modal_projector.2": "multi_modal_projector.linear_2",
}

def convert_state_dict_to_hf(state_dict):
    new_state_dict = {}
    for key, value in state_dict.items():
        # Skip keys ending with .inv_freq (if present)
        if key.endswith(".inv_freq"):
            continue
        # Modify keys based on the mapping
        for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():
            if key_to_modify in key:
                key = key.replace(key_to_modify, new_key)
        new_state_dict[key] = value
    return new_state_dict

def convert_llava_llama_to_hf(text_model_id, vision_model_id, output_hub_path, old_state_dict_id):
    torch.set_default_dtype(torch.bfloat16)
    text_config = AutoConfig.from_pretrained(text_model_id)

    tokenizer = AutoTokenizer.from_pretrained(text_model_id)
    tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
    tokenizer.add_tokens(AddedToken("<video>", special=True, normalized=False), special_tokens=True)
    if tokenizer.pad_token_id is None:
        tokenizer.add_special_tokens({"pad_token": "<|finetune_right_pad_id|>"})

    image_processor = AutoImageProcessor.from_pretrained(vision_model_id)
    processor = LlavaProcessor(tokenizer=tokenizer, image_processor=image_processor)

    # Set vision_config based on the vision model
    vision_config = SiglipVisionConfig.from_pretrained(vision_model_id)

    # Adjust the vision_config to match your model's parameters
    vision_config.hidden_size = 1152
    vision_config.num_attention_heads = 16
    vision_config.num_hidden_layers = 24
    vision_config.intermediate_size = 4304
    vision_config.patch_size = 14
    vision_config.image_size = 384
    vision_config.vision_use_head = False

    config = LlavaConfig(
        text_config=text_config,
        vision_config=vision_config.to_dict(),
    )

    # Set other configuration parameters as needed
    config.pad_token_id = tokenizer.pad_token_id
    config.image_token_index = tokenizer.convert_tokens_to_ids("<image>")
    config.video_token_index = tokenizer.convert_tokens_to_ids("<video>")
    config.vision_feature_layer = -2  # Adjust if necessary
    config.vision_tower_pretrained = vision_model_id

    # Instantiate the model
    model = LlavaForConditionalGeneration(config)

    # Load the state dictionary from the provided path
    state_dict_path = old_state_dict_id
    state_dict = torch.load(state_dict_path, map_location="cpu")

    # Convert state dict to match Hugging Face model's expected keys
    state_dict = convert_state_dict_to_hf(state_dict)

    # Load the state dictionary into the model
    missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
    print(f"Missing keys: {missing_keys}")
    print(f"Unexpected keys: {unexpected_keys}")

    # Resize token embeddings to accommodate new tokens
    model.resize_token_embeddings(len(tokenizer))

    # Get the updated sizes
    pre_expansion_embeddings = model.language_model.model.embed_tokens.weight.data
    vocab_size = pre_expansion_embeddings.size(0)
    new_vocab_size = model.language_model.model.embed_tokens.weight.size(0)
    num_new_tokens = new_vocab_size - vocab_size

    print(f"Original vocab size: {vocab_size}")
    print(f"New vocab size: {new_vocab_size}")
    print(f"Number of new tokens: {num_new_tokens}")

    # Initialize new token embeddings only if there are new tokens
    if num_new_tokens > 0:
        mu = torch.mean(pre_expansion_embeddings, dim=0).float()
        n = pre_expansion_embeddings.size(0)
        sigma = ((pre_expansion_embeddings - mu).T @ (pre_expansion_embeddings - mu)) / n
        dist = torch.distributions.multivariate_normal.MultivariateNormal(
            mu.cpu(), covariance_matrix=1e-5 * sigma.cpu()
        )

        new_embeddings = torch.stack(
            [dist.sample() for _ in range(num_new_tokens)], dim=0
        ).to(pre_expansion_embeddings.device)
        model.language_model.model.embed_tokens.weight.data[vocab_size:] = new_embeddings
        model.language_model.lm_head.weight.data[vocab_size:] = new_embeddings
    else:
        print("No new tokens to add to embeddings.")

    # Save the converted model and processor
    model.push_to_hub(output_hub_path)
    processor.push_to_hub(output_hub_path)

def main():
    parser = argparse.ArgumentParser(
        epilog=EPILOG_TXT,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--text_model_id",
        default="mylesgoose/Llava-Llama-3.1-Minitron-4B-Nvidia-siglip-ov",
        help="Hub location of the text model",
    )
    parser.add_argument(
        "--vision_model_id",
        default="google/siglip-so400m-patch14-384",
        help="Hub location of the vision model",
    )
    parser.add_argument(
        "--output_hub_path",
        default="mylesgoose/Llama-3.1-Minitron-4B-Llava-Nvidia-siglip-ov-hf",
        help="Location on the hub of the converted model",
    )
    parser.add_argument(
        "--old_state_dict_id",
        default="/home/myles/LLaVA-NeXT/tmp/hf_models/llava-Llama-3.1-Minitron-4B/model_state_dict.bin",
        help="Path to the raw state dict of the original model. The filename needs to be `model_state_dict.bin`",
    )
    args = parser.parse_args()
    convert_llava_llama_to_hf(
        args.text_model_id,
        args.vision_model_id,
        args.output_hub_path,
        args.old_state_dict_id
    )

if __name__ == "__main__":
    main()

* working version * fix copies * update * tests * update docs * codestyle * add more tests * add returns for docs * clean up * Update src/transformers/models/llava_onevision/processing_llava_onevision.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * updates * codestyle * style * shouldn't be reversed * [run-slow] llava_onevision * [run-slow] llava_onevision * add pooling in videos * [run-slow] llava_onevision * num-logits-to-keep * [run-slow] llava_onevision * [run-slow] llava_onevision * Update tests/test_modeling_common.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * video matched orig impl * fix tests * chat template was modified * Update docs/source/en/model_doc/llava_onevision.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * add morer info in the doc page --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

zucchini-nlp requested review from amyeroberts and NielsRogge August 14, 2024 05:54

zucchini-nlp added 2 commits August 15, 2024 08:16

working version

04ea11d

fix copies

380e99a

zucchini-nlp force-pushed the llava-onevision branch from 8c3f82c to 380e99a Compare August 15, 2024 06:17

zucchini-nlp and others added 7 commits August 15, 2024 11:26

update

2c82735

tests

1f854cc

update docs

4c41164

codestyle

7926afd

Merge branch 'huggingface:main' into llava-onevision

ed4c3ad

add more tests

e6ca32c

add returns for docs

b734893

clean up

c660105

amyeroberts reviewed Aug 16, 2024

View reviewed changes

zucchini-nlp and others added 5 commits August 19, 2024 12:48

Update src/transformers/models/llava_onevision/processing_llava_onevi…

f85aecf

…sion.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

updates

4829831

codestyle

1254c13

Merge branch 'main' into llava-onevision

fdbd460

style

3ecaa0d

zucchini-nlp requested a review from amyeroberts August 19, 2024 11:12

zjysteven mentioned this pull request Aug 20, 2024

LLaVA-OneVision for single-image, multi-image, and video zjysteven/lmms-finetune#25

Closed

zucchini-nlp added 2 commits August 21, 2024 06:56

shouldn't be reversed

6ac443e

[run-slow] llava_onevision

6025390

gante mentioned this pull request Aug 30, 2024

Config: unified logic to retrieve text config #33219

Merged

Update tests/test_modeling_common.py

2a69a9a

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

Jintao-Huang mentioned this pull request Aug 31, 2024

ImportError: cannot import name 'LlavaOnevisionForConditionalGeneration' from 'transformers' modelscope/ms-swift#1878

Closed

zucchini-nlp added 2 commits September 2, 2024 13:12

video matched orig impl

5322d6f

Merge remote-tracking branch 'upstream/main' into llava-onevision

b27c4f3

amyeroberts self-requested a review September 2, 2024 11:17

zucchini-nlp added 2 commits September 2, 2024 13:32

fix tests

ecd6743

chat template was modified

278eb86

zucchini-nlp mentioned this pull request Sep 3, 2024

Adding mplugdocowl #31792

Open

5 tasks

amyeroberts approved these changes Sep 3, 2024

View reviewed changes

zucchini-nlp and others added 3 commits September 4, 2024 15:06

Update docs/source/en/model_doc/llava_onevision.md

4ce02ed

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

add morer info in the doc page

7c5ae0d

Merge branch 'main' into llava-onevision

1fcb179

RUFFY-369 mentioned this pull request Sep 4, 2024

Add propainter #33217

Open

10 tasks

zucchini-nlp merged commit 43df47d into huggingface:main Sep 5, 2024
24 checks passed

zucchini-nlp mentioned this pull request Sep 8, 2024

Track progress for VLMs refactoring #33374

Closed

16 tasks

litianjian mentioned this pull request Sep 14, 2024

[Model][VLM] Add LLaVA-Onevision model support vllm-project/vllm#8486

Merged

3 tasks


		if hasattr(config, "text_config"): # in case of composite models
		config = config.text_config


		## Overview

		The LLaVA-Onevision model was proposed in [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326) by <Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

Llava Onevision: add model #32673

Llava Onevision: add model #32673

Uh oh!

Conversation

zucchini-nlp commented Aug 14, 2024

What does this PR do?

Uh oh!

HuggingFaceDocBuilderDev commented Aug 14, 2024

Uh oh!

amyeroberts commented Aug 14, 2024

Uh oh!

NielsRogge commented Aug 14, 2024

Uh oh!

codeb2cc commented Aug 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zucchini-nlp commented Aug 14, 2024

Uh oh!

zucchini-nlp commented Aug 15, 2024

Uh oh!

amyeroberts left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp Aug 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp Aug 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gante Aug 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp commented Aug 19, 2024

Uh oh!

zucchini-nlp commented Sep 2, 2024

Uh oh!

amyeroberts commented Sep 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zucchini-nlp commented Sep 2, 2024

Uh oh!

amyeroberts left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp commented Sep 5, 2024

Uh oh!

Uh oh!

codeb2cc commented Aug 14, 2024 •

edited

Loading

zucchini-nlp Aug 19, 2024 •

edited

Loading

zucchini-nlp Aug 19, 2024 •

edited

Loading

gante Aug 16, 2024 •

edited

Loading

amyeroberts commented Sep 2, 2024 •

edited

Loading