llama : add support for Cohere2ForCausalLM #10900
Conversation
HF config.json: {
"architectures": [
"Cohere2ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 5,
"cache_implementation": "hybrid",
"eos_token_id": 255001,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"layer_norm_eps": 1e-05,
"layer_switch": 4,
"logit_scale": 0.25,
"max_position_embeddings": 8192,
"model_type": "cohere2",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"order_of_interleaved_layers": "local_attn_first",
"pad_token_id": 0,
"position_embedding_type": "rope_gptj",
"rope_scaling": null,
"rope_theta": 50000,
"rotary_pct": 1.0,
"sliding_window": 4096,
"sliding_window_pattern": 4,
"torch_dtype": "bfloat16",
"transformers_version": "4.48.0.dev0",
"use_cache": true,
"use_embedding_sharing": true,
"use_gated_activation": true,
"use_parallel_block": true,
"use_parallel_embedding": true,
"vocab_size": 256000
}
Info from @foldl:
class Cohere2Model(Model):
    model_arch = gguf.MODEL_ARCH.COHERE2

    def set_gguf_parameters(self):
The config.json has "max_position_embeddings": 8192, but the model supports 128K context. Do we need to adjust this value here?
Don't quote me on this, but I think it's fine to leave this as-is and force users to adjust rope settings to enable the full context.
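For illustration only (the values below are placeholders, not recommendations from this thread): the knobs involved on the llama.cpp side would be the context size (-c/--ctx-size) and the RoPE overrides (--rope-freq-base, --rope-scaling, --rope-freq-scale), e.g.:
llama-cli.exe -m ggml-c4ai-command-r7b-12-2024-q4_k.gguf -cnv -c 131072 --rope-freq-base <value>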
src/llama.cpp (outdated diff):
cb(Vcur, "Vcur", il);
}

Qcur = ggml_rope_ext(ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos, nullptr,
Do we need to use build_rope_factors(il) for c when calling ggml_rope_ext with this model?
RoPE is only applied to SWA layers.
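To make the interleaving concrete, here is a minimal, self-contained sketch (not llama.cpp code) of which layers would be treated as sliding-window (RoPE applied) versus global (no RoPE), assuming the config.json values above and that "local_attn_first" means the three local layers precede the global one in each group of four:

```cpp
// Standalone illustration of the Cohere2 layer interleaving; assumptions noted above.
#include <cstdio>

int main() {
    const int n_layer = 32;  // config.json: num_hidden_layers
    const int pattern = 4;   // config.json: sliding_window_pattern

    for (int il = 0; il < n_layer; ++il) {
        // Assumption: layers 0..2 of each group of 4 are local (SWA, RoPE applied),
        // and every 4th layer uses global attention without RoPE.
        const bool is_swa = (il % pattern) < (pattern - 1);
        std::printf("layer %2d: %s\n", il, is_swa ? "SWA + RoPE" : "global, no RoPE");
    }
    return 0;
}
```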
Got it, looks like the cache is working now. Not sure if I still need build_rope_factors() though?
Thank you for your great job!!!
Oh, I'm sorry, F16 works fine :3 Thank you a lot :))
@osadchi Can you please also post how you converted and quantized the model? I cannot reproduce your issue for some reason. Also, can you try running just on CPU as well?
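For reference, a typical convert-and-quantize flow with the in-tree tooling looks roughly like this (the paths and the quantization type are placeholders, not necessarily what was used above):
python convert_hf_to_gguf.py ./c4ai-command-r7b-12-2024 --outfile c4ai-command-r7b-f16.gguf --outtype f16
llama-quantize c4ai-command-r7b-f16.gguf c4ai-command-r7b-q4_k_m.gguf Q4_K_M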
The chat template is not recognized, which makes this unusable with the web UI or the chat examples. Could that be added?
The template seems to work? But maybe I'm missing something, so I'd appreciate any pointers. I know there was an earlier commit to fix the detection of the template for this model, as the recognized tokens were beyond the static length used, but I presume you are using the latest commit.
llama-cli.exe -m ggml-c4ai-command-r7b-12-2024-q4_k.gguf -cnv
I am not familiar with the server code and whether it has a separate template detection path from llama-cli; I will take a look tomorrow morning.
With
Not sure what the difference is; I would expect both examples to use the same detection code. cc @ngxson
Thanks @ngxson, looks like the template is detected now using your PR.
With
I am not sure if this should be merged before the llama.cpp refactor that @ggerganov is working on, so I will let him merge this.
@slaren These tokens are new in this model. Tbh, this is quite messy. I'm quoting the built-in system message present in the jinja template:
One solution here is to add a new
I see, thanks. Maybe it would be better to wait until #11016 is ready, hopefully that will fix the templating issues.
I think this model is quite different and that the start/end response tokens should be left in the content (just like the action tokens). One reason for this is that when the model replies with both actions and responses, these special tokens delimit which is which. This model really is more for tool/function calling, which is why it uses these special tokens, I think. Trying to manage these additional special tokens via the template may give unintended results, and I don't think they are managed by the jinja template from HF either.
Thank you for your work @dranger003! I'm looking forward to seeing this get merged.
Yeah, this was/is not completely clear to me either, but (you might already know this) you can manually trigger it by using:
Apparently that works for server too.
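As an assumption (the exact command is not preserved above), the mechanism being referred to is llama.cpp's --chat-template flag, which both llama-cli and llama-server accept; the argument below is a placeholder, not a confirmed template name:
llama-cli.exe -m ggml-c4ai-command-r7b-12-2024-q4_k.gguf -cnv --chat-template <template>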
The refactor from #10902 is now merged and the conflicts can be resolved in this PR.
The PR has now been rebased on master.
Closes #10816
Cohere updated their Command-R model architecture for C4AI Command R7B, requiring an update to llama.cpp. Looking at the HF code, it looks like the model is using a hybrid cache like Gemma2. Additional info from their model page on HF:

Summary of changes in this PR (based on my very limited knowledge of neural nets):
- Added build_cohere2 (looking at llama.cpp's build_gemma2) using a pattern of 4 layers
- Uses LLAMA_ROPE_TYPE_NORM as the rope type
as the rope typeHF transformers implementation reference:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/cohere2/modular_cohere2.py
Test weights:
https://huggingface.co/dranger003/c4ai-command-r7b-12-2024-GGUF