
[Model]: Get Aria to work with the latest transformers impl #12207


Closed
wants to merge 1 commit into from


xffxff
Contributor

@xffxff xffxff commented Jan 20, 2025

Transformers 4.48 has integrated Aria (see huggingface/transformers#34157). We need to make some changes in vLLM to stay compatible, because the transformers implementation of Aria changes the weight names used in checkpoints. For example, "multi_modal_projector.cross_attn.ln_kv.weight" has been renamed to "multi_modal_projector.cross_attn.layer_norm_kv.weight".

Also, we can remove some of the Aria-related configuration files in vLLM, since we can use those directly from transformers.
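
The weight rename described above can be illustrated with a small key-remapping sketch. This is not the code from this PR or from vLLM: `RENAMED_SUBSTRINGS` and `remap_weight_names` are hypothetical names, and only the `ln_kv` to `layer_norm_kv` entry is taken from the description. The direction of the mapping (old name to new name) is also an assumption and may need to be inverted, depending on which side of the loader still uses the old names.

```python
from typing import Dict, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical table of old substring -> new substring.
# Only this one entry comes from the PR description; any other renames
# introduced by the transformers implementation would have to be added.
RENAMED_SUBSTRINGS: Dict[str, str] = {
    ".cross_attn.ln_kv.": ".cross_attn.layer_norm_kv.",
}


def remap_weight_names(
    weights: Iterable[Tuple[str, T]],
    renames: Dict[str, str] = RENAMED_SUBSTRINGS,
) -> Iterator[Tuple[str, T]]:
    """Yield (name, tensor) pairs with renamed substrings applied.

    Names that match no entry in `renames` pass through unchanged.
    """
    for name, tensor in weights:
        for old, new in renames.items():
            if old in name:
                name = name.replace(old, new)
        yield name, tensor


if __name__ == "__main__":
    # Placeholder values stand in for real tensors.
    checkpoint = [
        ("multi_modal_projector.cross_attn.ln_kv.weight", "tensor-a"),
        ("multi_modal_projector.cross_attn.q_proj.weight", "tensor-b"),
    ]
    for new_name, _ in remap_weight_names(checkpoint):
        print(new_name)
```

Keeping the renames in a single table makes it easy to see which checkpoint keys differ between the two layouts, and any key that is not listed is left untouched.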

Signed-off-by: xffxff <1247714429@qq.com>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@xffxff
Contributor Author

xffxff commented Jan 20, 2025

@Isotr0py Could you please take a look?

@DarkLight1337
Member

Can you update the example and test files as in #12203?

@DarkLight1337
Member

DarkLight1337 commented Jan 20, 2025

That PR lets CI pass but the model isn't working correctly, so I prefer merging yours if it works. Thanks for updating this!

@xffxff
Contributor Author

xffxff commented Jan 20, 2025

> That PR lets CI pass but the model isn't working correctly, so I prefer merging yours if it works. Thanks for updating this!

@DarkLight1337 Sorry, I didn't notice that you had already worked on this.

The model works in my local environment. I can update the tests and examples in this PR. I will work on it tonight or tomorrow.

@DarkLight1337
Member

I have tested your PR, and it seems that your model has similar outputs as mine.

# python examples/offline_inference/vision_language.py -m aria

 The content of this image is a person's hand holding a small object. The hand appears to be that of a man, and the object is not clearly visible. The background is a plain wall with a light color.<
The content of this image is a person's hand holding a smartphone. The smartphone screen displays a text message conversation. The conversation includes messages from a contact named "Mom" and another contact named "Dad." The messages are about a person named "Alex" and their activities, such as going to the gym and
 The content of this image is a person's hand holding a smartphone. The smartphone screen displays a photo of a person's hand holding a smartphone. The background of the image is a blurred view of a cityscape. The image is a visual representation of a person taking a photo of their hand holding a smartphone
 The content of this image is a person with a beard and mustache wearing a blue and white striped shirt.<

This isn't the expected output. (Note I'm using TP=4 locally)

Taking Phi3V as an example, the expected output should be:

# python examples/offline_inference/vision_language.py -m phi3_v
 The image shows a view of the Oriental Pearl Tower framed by branches with pink blossoms, likely cherry blossoms, against a clear blue sky.
 The image shows a view of a tall tower, which appears to be the Oriental Pearl Tower in Shanghai, framed by branches with pink blossoms, likely cherry blossoms, against a clear blue sky.
 The image shows a view of a tower partially obscured by cherry blossom trees in full bloom. The sky is clear and blue, indicating it might be spring. The cherry blossoms are pink, suggesting the photo was taken during the cherry blossom season.
 The image shows a view of a tower partially obscured by cherry blossom trees in full bloom. The sky is clear and blue, indicating it might be springtime. The cherry blossoms are pink, suggesting the photo was taken during the cherry blossom season. The tower appears

@DarkLight1337
Member

DarkLight1337 commented Jan 20, 2025

To avoid blocking CI, I'm going to merge #12203 first. Meanwhile we can use your PR to fix further issues with the model.


mergify bot commented Jan 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xffxff.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@xffxff
Contributor Author

xffxff commented Jan 21, 2025

> I have tested your PR, and it seems that your model has similar outputs as mine.
>
> # python examples/offline_inference/vision_language.py -m aria
>
>  The content of this image is a person's hand holding a small object. The hand appears to be that of a man, and the object is not clearly visible. The background is a plain wall with a light color.<
> The content of this image is a person's hand holding a smartphone. The smartphone screen displays a text message conversation. The conversation includes messages from a contact named "Mom" and another contact named "Dad." The messages are about a person named "Alex" and their activities, such as going to the gym and
>  The content of this image is a person's hand holding a smartphone. The smartphone screen displays a photo of a person's hand holding a smartphone. The background of the image is a blurred view of a cityscape. The image is a visual representation of a person taking a photo of their hand holding a smartphone
>  The content of this image is a person with a beard and mustache wearing a blue and white striped shirt.<
>
> This isn't the expected output. (Note I'm using TP=4 locally)
>
> Taking Phi3V as an example, the expected output should be:
>
> # python examples/offline_inference/vision_language.py -m phi3_v
>  The image shows a view of the Oriental Pearl Tower framed by branches with pink blossoms, likely cherry blossoms, against a clear blue sky.
>  The image shows a view of a tall tower, which appears to be the Oriental Pearl Tower in Shanghai, framed by branches with pink blossoms, likely cherry blossoms, against a clear blue sky.
>  The image shows a view of a tower partially obscured by cherry blossom trees in full bloom. The sky is clear and blue, indicating it might be spring. The cherry blossoms are pink, suggesting the photo was taken during the cherry blossom season.
>  The image shows a view of a tower partially obscured by cherry blossom trees in full bloom. The sky is clear and blue, indicating it might be springtime. The cherry blossoms are pink, suggesting the photo was taken during the cherry blossom season. The tower appears

I created issue #12241 to track it.

@xffxff xffxff closed this Jan 21, 2025