
fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform #2291

Merged
merged 1 commit into from
Jul 24, 2024

Conversation

sywangyi
Contributor

fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform

Fixes the following crash:

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 782, in get_model
    return FlashCausalLM(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 900, in __init__
    model = model_class(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 495, in __init__
    self.model = FlashCohereModel(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 426, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 427, in <listcomp>
    FlashCohereLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 366, in __init__
    self.self_attn = FlashCohereAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 223, in __init__
    self.query_key_value = load_attention(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 149, in load_attention
    return _load_gqa(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 170, in _load_gqa
    weight = weight.to(dtype=weights.dtype).to(device=weights.device)
AttributeError: 'UnquantizedWeight' object has no attribute 'to'

This change also enables the model on the Intel platform.
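The crash happens because `_load_gqa` calls `.to(...)` directly on a weight-wrapper object rather than on the tensor it wraps. Below is a minimal, self-contained sketch of the failure mode and the unwrap-before-convert fix; `FakeTensor` and `load_gqa_weight` are hypothetical stand-ins, not TGI's actual classes, and the `UnquantizedWeight` here only mimics the real wrapper's shape (a `.weight` attribute holding the tensor):

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a torch.Tensor: records dtype/device."""
    dtype: str = "float32"
    device: str = "cpu"

    def to(self, dtype=None, device=None):
        # Mimics torch.Tensor.to: returns a converted copy.
        return FakeTensor(dtype or self.dtype, device or self.device)


@dataclass
class UnquantizedWeight:
    """Mimics the TGI wrapper: holds the tensor, has no .to() itself."""
    weight: FakeTensor


def load_gqa_weight(weight, dtype, device):
    # Buggy version did: weight.to(dtype=dtype).to(device=device),
    # which raises AttributeError because the wrapper has no .to().
    # Fix: unwrap to the underlying tensor first.
    if isinstance(weight, UnquantizedWeight):
        weight = weight.weight
    return weight.to(dtype=dtype).to(device=device)


wrapped = UnquantizedWeight(FakeTensor())
converted = load_gqa_weight(wrapped, dtype="float16", device="xpu")
```

The same helper still works for a bare tensor, so both quantized-free and plain loading paths go through one code path.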

@OlivierDehaene OR @Narsil

fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Member

@danieldk danieldk left a comment


Looks great, thanks!

@danieldk danieldk merged commit 8642250 into huggingface:main Jul 24, 2024
ErikKaum pushed a commit that referenced this pull request Jul 25, 2024
fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform (#2291)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
ErikKaum pushed a commit that referenced this pull request Jul 26, 2024
fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform (#2291)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
yuanwu2017 pushed a commit to yuanwu2017/tgi-gaudi that referenced this pull request Sep 26, 2024
fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform (huggingface#2291)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2 participants