
[New Model]: Add Cohere2 Model #11357

Closed
1 task done
CXIAAAAA opened this issue Dec 20, 2024 · 6 comments
Labels
new model Requests to new models

Comments

CXIAAAAA commented Dec 20, 2024

🚀 The feature, motivation and pitch

Cohere recently released the Command R7B model on Hugging Face, and I would like to contribute a vLLM implementation of it. @simon-mo

PR: #11358

The model also uses interleaved attention like gemma2 and mistral, so KV-cache optimization is needed; I saw that is on the roadmap as well. #9464

Alternatives

No response

Additional context

I have integrated the model and verified it works with all the benchmark scripts, and would like to add a feature branch for review.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@DarkLight1337 DarkLight1337 changed the title [Feature]: Add Cohere2 Model [New Model]: Add Cohere2 Model Dec 20, 2024
@DarkLight1337 DarkLight1337 added new model Requests to new models and removed feature request labels Dec 20, 2024
DarkLight1337 (Member) commented Dec 20, 2024

This has already been added in #11203, and is in the latest released version of vLLM. Thanks for the offer though!

CXIAAAAA (Author) commented Dec 20, 2024

@DarkLight1337 I think there are some issues with the implementation:

  1. The attention pattern is "sssf" (three sliding-window layers followed by one full/global layer). When `sliding_window` is passed as `None` (a global-attention layer) together with `cache_config`, the code falls into the branch `sliding_window = cache_config.sliding_window`, which resets the sliding window to 4096. So instead of [4096, 4096, 4096, None], the current implementation produces [4096, 4096, 4096, 4096]. You can print the value near `self.kv_cache_dtype = kv_cache_dtype` to double-check.
  2. I opened a fix: [Bugfix] Fix sliding window in cohere2 model #11358

So I think gemma2 may have the same implementation issue.
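The failure mode described above can be sketched in a few lines. This is a simplified, illustrative model of the fallback logic, not vLLM's actual code: `resolve_window_buggy` stands in for the branch that substitutes `cache_config.sliding_window` whenever a layer passes `sliding_window=None`.

```python
# Illustrative sketch of the reported bug (names are hypothetical, not vLLM's API).
# Attention pattern "sssf": three sliding-window layers, then one full/global layer,
# where a per-layer window of None means global attention.

class CacheConfig:
    def __init__(self, sliding_window):
        self.sliding_window = sliding_window

def resolve_window_buggy(layer_window, cache_config):
    # Buggy fallback: a None layer window (meant to signal global attention)
    # is silently replaced by the model-level cache_config value.
    if layer_window is None and cache_config is not None:
        return cache_config.sliding_window
    return layer_window

def resolve_window_fixed(layer_window, cache_config):
    # Fixed behavior: respect an explicit per-layer None so global layers stay global.
    return layer_window

cache_config = CacheConfig(sliding_window=4096)
layer_windows = [4096, 4096, 4096, None]  # "sssf"

print([resolve_window_buggy(w, cache_config) for w in layer_windows])
# -> [4096, 4096, 4096, 4096]  (global layer wrongly gets a 4096 window)
print([resolve_window_fixed(w, cache_config) for w in layer_windows])
# -> [4096, 4096, 4096, None]
```

With the buggy fallback, the global layer is silently truncated to a 4096-token window, which degrades long-context quality without raising any error.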

DarkLight1337 (Member) commented Dec 20, 2024

Thanks for the report! I'll ask @simon-mo to take a look since he added this.

CXIAAAAA (Author) commented

@DarkLight1337 A follow-up thought: to gate long-context correctness of the implementation, we could add a needle-in-a-haystack test. I would be happy to help with that as well.
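For reference, a needle test embeds a short fact at a known depth inside a long filler context and checks that the model can retrieve it. Below is a minimal prompt-builder sketch; the names and filler text are illustrative, and a real gate would sweep context lengths and needle depths, then assert the model's answer contains the needle.

```python
# Hedged sketch of a needle-in-a-haystack prompt builder (illustrative names).
FILLER = "The grass is green. The sky is blue. "
NEEDLE = "The magic number is 42."
QUESTION = "What is the magic number?"

def build_needle_prompt(context_len_chars, depth_frac):
    """Build a long context with NEEDLE inserted at depth_frac (0.0 = start, 1.0 = end)."""
    reps = context_len_chars // len(FILLER) + 1
    haystack = (FILLER * reps)[:context_len_chars]
    pos = int(len(haystack) * depth_frac)
    context = haystack[:pos] + " " + NEEDLE + " " + haystack[pos:]
    return context + "\n\n" + QUESTION

prompt = build_needle_prompt(20_000, 0.5)
# A real test would call the model, e.g. llm.generate(prompt),
# and assert "42" appears in the generated answer.
```

Placing the needle past the sliding-window size of the early layers is exactly what would catch the bug above: only a correctly global layer can still attend back to it.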

simon-mo (Collaborator) commented

Actually, I think @youkaichao added this.

youkaichao (Member) commented

@CXIAAAAA Thanks for the report! The sliding window support for cohere2 is indeed broken, and I opened PR #11583 to fix it.

gemma2 works fine because `cache_config.sliding_window` is None for models with interleaved sliding windows.
