[Model] Initial support for LLaVA-NeXT #4199
Conversation
- Also add docs for basic VLM usage

(force-pushed from a35cfb7 to 9cf6653)

- Also add docs for basic VLM usage
- Note that LLaVA-1.5 has been refactored to facilitate this

(force-pushed from 9cf6653 to ea4f8ed)
Nice work! Any plans to port CLIPVisionModel's code?
@jeejeelee It is outside the scope of this PR; however, you are welcome to voice your thoughts in #4194.
- Other data may need to be of a different dtype from that of the model
Thanks for the review! I have addressed your comments.
LGTM! Thanks for adding this model; I'll merge this after pushing a final addition to the docs.
Co-authored-by: Roger Wang <ywang@roblox.com>
Oops, turns out that I forgot to copy some functions from LLaVA into LLaVA-NeXT. Drafting a quick fix now.
Co-authored-by: Roger Wang <ywang@roblox.com>
* upstream/main: (126 commits)
  [Bugfix][Frontend] Cleanup "fix chat logprobs" (vllm-project#5026)
  [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (vllm-project#5312)
  [Misc] Various simplifications and typing fixes (vllm-project#5368)
  [ci] Fix Buildkite agent path (vllm-project#5392)
  [Doc] Add documentation for FP8 W8A8 (vllm-project#5388)
  Bump version to v0.5.0 (vllm-project#5384)
  [Docs] Alphabetically sort sponsors (vllm-project#5386)
  [Docs] Add Docs on Limitations of VLM Support (vllm-project#5383)
  [ci] Mount buildkite agent on Docker container to upload benchmark results (vllm-project#5330)
  [ci] Use small_cpu_queue for doc build (vllm-project#5331)
  [Bugfix] Fix LLaVA-NeXT (vllm-project#5380)
  [Feature][Frontend]: Continued `stream_options` implementation also in CompletionRequest (vllm-project#5319)
  [Model] Initial support for LLaVA-NeXT (vllm-project#4199)
  [Misc] Improve error message when LoRA parsing fails (vllm-project#5194)
  [misc][typo] fix typo (vllm-project#5372)
  [Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server (vllm-project#5374)
  [Misc] Update to comply with the new `compressed-tensors` config (vllm-project#5350)
  [Bugfix] Fix KeyError: 1 When Using LoRA adapters (vllm-project#5164)
  [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (vllm-project#5047)
  [mis][ci/test] fix flaky test in test_sharded_state_loader.py (vllm-project#5361)
  ...
Co-authored-by: Roger Wang <ywang@roblox.com>
I have added experimental support for LLaVA-NeXT, with one big caveat: the size of the input image is fixed by the configuration; otherwise, the feature size (i.e. the number of `<image>` tokens to duplicate) would vary depending on the runtime input. This prevents us from taking full advantage of the extra resolution. Still, it gives us access to a 34b model, which should be an improvement over the 7b and 13b LLaVA-1.5 models.
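For context, here is a minimal offline-inference sketch in the spirit of the basic VLM docs added in this PR. It assumes the multimodal API of roughly this release (`image_input_type`/`image_token_id`/`image_input_shape`/`image_feature_size` engine arguments and a `MultiModalData` wrapper in `vllm.sequence`), which has changed in later versions; the token id 64000 and feature size 1176 are illustrative assumptions, not confirmed values.

```python
from PIL import Image

from vllm import LLM, SamplingParams
# Assumption: around this release, multimodal inputs were wrapped in
# MultiModalData from vllm.sequence; the location/name has since changed.
from vllm.sequence import MultiModalData

# Mirrors the static-shape caveat above: the image is resized to
# `image_input_shape`, and the prompt must repeat the <image> token
# exactly `image_feature_size` times.
llm = LLM(
    model="llava-hf/llava-v1.6-34b-hf",
    image_input_type="pixel_values",
    image_token_id=64000,             # assumed <image> token id for this checkpoint
    image_input_shape="1,3,336,336",
    image_feature_size=1176,          # assumed value; must match image_input_shape
)

image = Image.open("example.jpg")  # hypothetical local image
prompt = "<image>" * 1176 + "\nUSER: What is shown in this image?\nASSISTANT:"

outputs = llm.generate(
    prompt,
    SamplingParams(max_tokens=64),
    # Depending on the exact version, `data` may need to be preprocessed
    # pixel values (a tensor) rather than a PIL image.
    multi_modal_data=MultiModalData(type=MultiModalData.Type.IMAGE, data=image),
)
print(outputs[0].outputs[0].text)
```

If the number of `<image>` placeholders and `image_feature_size` disagree, the prompt cannot be aligned with the image features, which is why each `image_input_shape` is pinned to one value in the table under Features below.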
Related Contributions
This PR completes part of #3978.
Since this PR depends on the functionality proposed in #4197 to pass `image_sizes` to the LLaVA-NeXT model, it is set to Draft status until that PR is merged. Afterwards, you should be able to see the diffs that are exclusive to this PR. To avoid unnecessary resource usage, this branch is frozen (except for critical fixes) until its dependencies have all been merged.
Features

- Added `LlavaNextForConditionalGeneration` to the list of supported architectures. (Tested with `llava-hf/llava-v1.6-34b-hf`)

Limitation: The input image is resized to a static `image_input_shape` (NCHW format, specified in the configuration) before being passed to the model; otherwise, the number of `<image>` input tokens required in the text prompt (equal to `image_feature_size`) would vary at runtime depending on the original size of the input image. The following table shows the `image_feature_size` that you need to specify in the configuration for each `image_input_shape`:

[Table omitted: `image_feature_size` values for each `image_input_shape`, indexed by image Height (↓) and Width.]
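To see why `image_feature_size` depends on the image resolution (the reason a static shape is needed for now), here is a rough, hypothetical sketch of the LLaVA-NeXT "anyres" token count. It is a simplification for intuition only: the real computation in the HF/vLLM implementations selects the best grid resolution from the model's `image_grid_pinpoints` and handles padding and rounding more carefully.

```python
def approx_llava_next_feature_size(best_h: int, best_w: int,
                                   orig_h: int, orig_w: int,
                                   patch_grid: int = 24) -> int:
    """Rough sketch (not the exact HF/vLLM formula) of LLaVA-NeXT's token count.

    best_h, best_w: grid resolution chosen from image_grid_pinpoints
                    (multiples of 336 in the 336px models).
    orig_h, orig_w: original image size, used to unpad the hi-res features.
    """
    # The global (base) image always contributes a fixed 24 x 24 = 576 tokens.
    base_tokens = patch_grid * patch_grid

    # Hi-res feature grid before unpadding: one 24x24 map per 336px tile.
    cur_h = (best_h // 336) * patch_grid
    cur_w = (best_w // 336) * patch_grid

    # Unpad toward the original aspect ratio (simplified).
    orig_aspect = orig_w / orig_h
    if cur_w / cur_h > orig_aspect:
        cur_w = max(1, int(cur_h * orig_aspect))
    else:
        cur_h = max(1, int(cur_w / orig_aspect))

    # One image-newline token is appended per row of the hi-res feature map.
    return base_tokens + cur_h * cur_w + cur_h
```

Because the unpadded grid dimensions track the original image size, the required number of `<image>` placeholders changes with every input; fixing `image_input_shape` pins it to a single value, at the cost of the extra resolution noted above.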
Future Work
We can overcome the current limitation of a static input shape after #5215 has been merged. Once that is addressed, we can openly support this model by adding it to the docs and README. See:
- [Model] Dynamic image size support for LLaVA-NeXT #5279
- [Core] Dynamic image size support for VLMs #5276