fix of use of unquantized weights in cohere GQA loading, also enable … #2291
…the model in intel platform
Fix the following crash:
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 782, in get_model
return FlashCausalLM(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 900, in __init__
model = model_class(prefix, config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 495, in __init__
self.model = FlashCohereModel(prefix, config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 426, in __init__
[
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 427, in <listcomp>
FlashCohereLayer(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 366, in __init__
self.self_attn = FlashCohereAttention(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 223, in __init__
self.query_key_value = load_attention(config, prefix, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 149, in load_attention
return _load_gqa(config, prefix, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 170, in _load_gqa
weight = weight.to(dtype=weights.dtype).to(device=weights.device)
AttributeError: 'UnquantizedWeight' object has no attribute 'to'
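
The traceback shows `.to(...)` being called on a wrapper object, not on a tensor. The sketch below illustrates one way such a call can be guarded. It is not the actual diff of this PR: `FakeTensor` is a made-up stand-in for `torch.Tensor`, the `UnquantizedWeight` class here is a hypothetical dataclass with an assumed `weight` attribute, and `cast_loaded_weight` is an illustrative helper name.

```python
from dataclasses import dataclass


class FakeTensor:
    """Made-up stand-in for torch.Tensor; it only records dtype and device."""

    def __init__(self, dtype="float32", device="cpu"):
        self.dtype = dtype
        self.device = device

    def to(self, dtype=None, device=None):
        # Return a new object, mirroring how a tensor cast/move behaves.
        return FakeTensor(dtype or self.dtype, device or self.device)


@dataclass
class UnquantizedWeight:
    """Hypothetical wrapper; the `weight` attribute is an assumption."""

    weight: FakeTensor


def cast_loaded_weight(loaded, dtype, device):
    """Cast and move only the tensor held inside an unquantized wrapper.

    Anything else (for example a quantized weight object) is returned
    unchanged, since calling `.to` on it may not be valid.
    """
    if isinstance(loaded, UnquantizedWeight):
        loaded.weight = loaded.weight.to(dtype=dtype).to(device=device)
    return loaded


wrapped = UnquantizedWeight(FakeTensor())
result = cast_loaded_weight(wrapped, dtype="float16", device="xpu")
```

The wrapper itself has no `to` method, which reproduces the `AttributeError` condition above; the guard unwraps the tensor before casting, so the same loading path can serve both quantized and unquantized weights.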
Also enable the model on Intel platforms.
@OlivierDehaene OR @Narsil