
llama : add support for Cohere2ForCausalLM #10900

Merged (1 commit into ggerganov:master) on Jan 4, 2025

Conversation

dranger003 (Contributor) commented Dec 19, 2024

Closes #10816

Cohere updated their Command-R model architecture for C4AI Command R7B, requiring an update to llama.cpp. Looking at the HF code, the model appears to use a hybrid cache like Gemma2. Additional info from their model page on HF:

The model features three layers with sliding window attention (window size 4096) and ROPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.

Summary of changes in this PR (based on my very limited knowledge of neural nets):

  • Add sliding window and RoPE dim count during conversion
  • Remove ATTN_K_NORM and ATTN_Q_NORM
  • Support alternating sliding window attention in build_cohere2 (modeled on llama.cpp's build_gemma2) using a pattern of 4 layers
  • Use LLAMA_ROPE_TYPE_NORM as the rope type

HF transformers implementation reference:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/cohere2/modular_cohere2.py

Test weights:
https://huggingface.co/dranger003/c4ai-command-r7b-12-2024-GGUF
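
For readers skimming the thread, here is a minimal Python sketch of the interleaving described above. It is illustrative only, not code from this PR, and it assumes the sliding_window_pattern of 4 and the "local_attn_first" ordering from the HF config.

# Illustrative sketch only (not code from this PR): which layers use sliding-window
# attention (SWA) plus RoPE, and which use global attention without positional
# embeddings, assuming sliding_window_pattern = 4 as in the Command R7B config.

def layer_schedule(n_layers: int = 32, pattern: int = 4) -> list[dict]:
    schedule = []
    for il in range(n_layers):
        # With a pattern of 4 and "local_attn_first", the first three layers of each
        # group use SWA (and RoPE); every 4th layer is global attention without RoPE.
        is_swa = (il + 1) % pattern != 0
        schedule.append({
            "layer": il,
            "attention": "sliding_window" if is_swa else "global",
            "rope": is_swa,
        })
    return schedule

if __name__ == "__main__":
    for entry in layer_schedule()[:8]:
        print(entry)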

@github-actions github-actions bot added the python python script changes label Dec 19, 2024
@dranger003 dranger003 marked this pull request as draft December 19, 2024 15:12
dranger003 (Contributor, Author) commented Dec 19, 2024

HF config.json:

{
  "architectures": [
    "Cohere2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 5,
  "cache_implementation": "hybrid",
  "eos_token_id": 255001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "layer_norm_eps": 1e-05,
  "layer_switch": 4,
  "logit_scale": 0.25,
  "max_position_embeddings": 8192,
  "model_type": "cohere2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "order_of_interleaved_layers": "local_attn_first",
  "pad_token_id": 0,
  "position_embedding_type": "rope_gptj",
  "rope_scaling": null,
  "rope_theta": 50000,
  "rotary_pct": 1.0,
  "sliding_window": 4096,
  "sliding_window_pattern": 4,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.0.dev0",
  "use_cache": true,
  "use_embedding_sharing": true,
  "use_gated_activation": true,
  "use_parallel_block": true,
  "use_parallel_embedding": true,
  "vocab_size": 256000
}
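
As a rough orientation (not the conversion script itself), these config.json fields map onto cohere2.* GGUF metadata keys roughly as follows; the key names are copied from the llama_model_loader dumps posted later in this thread.

# Rough mapping sketch (not from convert_hf_to_gguf.py): config.json fields on the
# left, the cohere2.* GGUF keys they end up as on the right. Key names are taken
# from the llama_model_loader output shown further down in this thread.
CONFIG_TO_GGUF = {
    "num_hidden_layers":       "cohere2.block_count",                   # 32
    "max_position_embeddings": "cohere2.context_length",                # 8192
    "hidden_size":             "cohere2.embedding_length",              # 4096
    "intermediate_size":       "cohere2.feed_forward_length",           # 14336
    "num_attention_heads":     "cohere2.attention.head_count",          # 32
    "num_key_value_heads":     "cohere2.attention.head_count_kv",       # 8
    "rope_theta":              "cohere2.rope.freq_base",                # 50000
    "layer_norm_eps":          "cohere2.attention.layer_norm_epsilon",  # 1e-05
    "head_dim":                "cohere2.attention.key_length",          # 128 (also value_length)
    "head_dim * rotary_pct":   "cohere2.rope.dimension_count",          # 128
    "logit_scale":             "cohere2.logit_scale",                   # 0.25
    "sliding_window":          "cohere2.attention.sliding_window",      # 4096
    "vocab_size":              "cohere2.vocab_size",                    # 256000
}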

dranger003 (Contributor, Author):

Info from @foldl:

It uses (3 SWA layers + 1 global attention layer). So build_command_r needs to be updated, even though the result seems promising.

Here is an implementation of interleaved SWA/global-attention layers.

https://github.com/foldl/chatllm.cpp/blob/ff54a787948f02151b38231375be042b632a271e/models/cohere.cpp#L246C1-L258C1

class Cohere2Model(Model):
    model_arch = gguf.MODEL_ARCH.COHERE2

    def set_gguf_parameters(self):

dranger003 (Contributor, Author):
The config.json has "max_position_embeddings": 8192, but the model supports 128K context. Do we need to adjust this value here?


Reply from a contributor:
Don't quote me on this, but I think it's fine to leave this as-is and force users to adjust rope settings to enable the full context.

src/llama.cpp (outdated review excerpt):
cb(Vcur, "Vcur", il);
}

Qcur = ggml_rope_ext(ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos, nullptr,

dranger003 (Contributor, Author):
Do we need to use build_rope_factors(il) for c when calling ggml_rope_ext with this model?


Reply from a contributor:
RoPE is only applied to SWA layers.


dranger003 (Contributor, Author) commented Dec 19, 2024:
Got it, looks like the cache is working now. Not sure if I still need build_rope_factors() though?

@dranger003 dranger003 marked this pull request as ready for review December 20, 2024 00:26
@dranger003 dranger003 changed the title from "Add support for Cohere2ForCausalLM" to "llama : add support for Cohere2ForCausalLM" Dec 20, 2024
osadchi commented Dec 26, 2024

Thank you for your great work!!!
I successfully compiled your fork and converted the model. Not sure if it was a good idea, but I tested Q2_K quantization :)
However, the output is random characters :C

PS C:\Users\user> C:/llama/llama.cpp-cohere2/build/bin/llama-cli.exe  -p "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me all about yourself.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>" -m C:\llama\ggml-model-command-r7b-q2_k.gguf -sm layer -ts 56,56 -t 12 -c 10000 -ngl 33 -b 2048 -ub 2048 -ctk f16 -ctv f16 -fa -np 1
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6600M (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 3060 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: KHR_coopmat
build: 0 (unknown) with cc.exe (Rev7, Built by MSYS2 project) 10.3.0 for x86_64-w64-mingw32
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device Vulkan0 (AMD Radeon RX 6600M) - 8176 MiB free
llama_load_model_from_file: using device Vulkan1 (NVIDIA GeForce RTX 3060) - 12115 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 258 tensors from C:\llama\ggml-model-command-r7b-q2_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = cohere2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = CohereForAI Command R7B
llama_model_loader: - kv   3:                         general.size_label str              = 8.0B
llama_model_loader: - kv   4:                            general.license str              = cc-by-nc-4.0
llama_model_loader: - kv   5:                          general.languages arr[str,23]      = ["en", "fr", "de", "es", "it", "pt", ...
llama_model_loader: - kv   6:                        cohere2.block_count u32              = 32
llama_model_loader: - kv   7:                     cohere2.context_length u32              = 8192
llama_model_loader: - kv   8:                   cohere2.embedding_length u32              = 4096
llama_model_loader: - kv   9:                cohere2.feed_forward_length u32              = 14336
llama_model_loader: - kv  10:               cohere2.attention.head_count u32              = 32
llama_model_loader: - kv  11:            cohere2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                     cohere2.rope.freq_base f32              = 50000.000000
llama_model_loader: - kv  13:       cohere2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  14:               cohere2.attention.key_length u32              = 128
llama_model_loader: - kv  15:             cohere2.attention.value_length u32              = 128
llama_model_loader: - kv  16:                          general.file_type u32              = 10
llama_model_loader: - kv  17:                        cohere2.logit_scale f32              = 0.250000
llama_model_loader: - kv  18:           cohere2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  19:                         cohere2.vocab_size u32              = 256000
llama_model_loader: - kv  20:               cohere2.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                  cohere2.rope.scaling.type str              = none
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = command-r
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 255001
llama_model_loader: - kv  29:            tokenizer.ggml.unknown_token_id u32              = 1
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  32:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  33:           tokenizer.chat_template.tool_use str              = {%- macro document_turn(documents) -%...
llama_model_loader: - kv  34:                tokenizer.chat_template.rag str              = {% set tools = [] %}\n{%- macro docume...
llama_model_loader: - kv  35:                   tokenizer.chat_templates arr[str,2]       = ["rag", "tool_use"]
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if documents %}\n{% set tools = [] ...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   33 tensors
llama_model_loader: - type q2_K:  128 tensors
llama_model_loader: - type q3_K:   64 tensors
llama_model_loader: - type q4_K:   32 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 41
llm_load_vocab: token to piece cache size = 1.8428 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = cohere2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 253333
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 2.5e-01
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = none
llm_load_print_meta: freq_base_train  = 50000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 3.19 GiB (3.42 BPW)
llm_load_print_meta: general.name     = CohereForAI Command R7B
llm_load_print_meta: BOS token        = 5 '<BOS_TOKEN>'
llm_load_print_meta: EOS token        = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: UNK token        = 1 '<UNK>'
llm_load_print_meta: PAD token        = 0 '<PAD>'
llm_load_print_meta: LF token         = 136 'Ä'
llm_load_print_meta: FIM PAD token    = 0 '<PAD>'
llm_load_print_meta: EOG token        = 0 '<PAD>'
llm_load_print_meta: EOG token        = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: max token length = 1024
ggml_vulkan: Compiling shaders..........................Done!
ggml_vulkan: Compiling shaders................................Done!
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      Vulkan1 model buffer size =  1968.06 MiB
llm_load_tensors:      Vulkan0 model buffer size =  1300.77 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   820.31 MiB
.............................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 10240
llama_new_context_with_model: n_ctx_per_seq = 10240
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 2048
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: freq_base     = 50000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_pre_seq (10240) > n_ctx_train (8192) -- possible training context overflow
llama_kv_cache_init: kv_size = 10240, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32
llama_kv_cache_init:    Vulkan1 KV buffer size =   600.00 MiB
llama_kv_cache_init:    Vulkan0 KV buffer size =   680.00 MiB
llama_new_context_with_model: KV self size  = 1280.00 MiB, K (f16):  640.00 MiB, V (f16):  640.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.98 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =   400.01 MiB
llama_new_context_with_model:    Vulkan1 compute buffer size =  2032.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =   232.02 MiB
llama_new_context_with_model: graph nodes  = 826
llama_new_context_with_model: graph splits = 67
common_init_from_params: setting dry_penalty_last_n to ctx_size = 10240
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 12
main: model was trained on only 8192 context tokens (10240 specified)

system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 1232602396
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 10240
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist

generate: n_ctx = 10240, n_batch = 2048, n_predict = -1, n_keep = 1

You are a helpful assistant.Tell me all about yourself.I we-. please insohn. over SERO. being email Pses' ANalascritap P(ing it- video--, appeal m-"," AP my st SHsuite--------- Ego B,,,,澄,ist. perhapsapa,- noaster W result ", result-. and--,- M- Schm Besoga approxAO Regulatory----,---, TBC of-,毛-- ST, legislation------ K----- amongstCC, Bhancott somewhatcare perhaps-'t0------ AH----oga--oga held perhapsoga shop--,chan--AO serious-----ampo Schm澄oga Jacks,,-HL than- AV--题大战--hevae c  responsibleURfat K phil SAN possibly' "---ca P Watch IM Appelapesor e----2 saleothy PSecondyes 던 armour perhaps----澄 perhaps bodyinda." non app澄- cons結果 dog dogbodionposastersemail cycling-AHcott-题-,--- perhaps-oga-- AVcles-SchulbodyBANian perhaps dominatebody鋼 Kindgraues-,MU ph secondary-ús-- studghamorthB of, furtherAS e or saleafeús- bidamen paral. danger-ca.ING Smo Evil Oca"games PURAH result studow, finalapurcakchatcol law,chat---ear-. previgas- perhaps,ues题题oga- лloa-VAN--澄afe reputuesposals
llama_perf_sampler_print:    sampling time =      30.65 ms /   367 runs   (    0.08 ms per token, 11973.12 tokens per second)
llama_perf_context_print:        load time =   25254.05 ms
llama_perf_context_print: prompt eval time =     211.53 ms /    22 tokens (    9.61 ms per token,   104.01 tokens per second)
llama_perf_context_print:        eval time =   18651.20 ms /   344 runs   (   54.22 ms per token,    18.44 tokens per second)
llama_perf_context_print:       total time =   18978.06 ms /   366 tokens
Interrupted by user
PS C:\Users\user>

Oh, I'm sorry, F16 works fine :3 Thank you a lot :))

dranger003 (Contributor, Author):

@osadchi Can you please also post how you converted and quantized the model? I cannot reproduce your issue for some reason. Also, can you try running just on CPU as well?

build\bin\Release\llama-cli.exe -p "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me all about yourself.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>" -m ggml-c4ai-command-r7b-12-2024-q2_k.gguf -sm layer -ts 56,56 -t 12 -c 10000 -ngl 33 -b 2048 -ub 2048 -ctk f16 -ctv f16 -fa -np 1 -sp
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 4392 (4b174a8c) with MSVC 19.42.34435.0 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 258 tensors from ggml-c4ai-command-r7b-12-2024-q2_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = cohere2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = C4AI Command R7B
llama_model_loader: - kv   3:                         general.size_label str              = 8.0B
llama_model_loader: - kv   4:                            general.license str              = cc-by-nc-4.0
llama_model_loader: - kv   5:                          general.languages arr[str,23]      = ["en", "fr", "de", "es", "it", "pt", ...
llama_model_loader: - kv   6:                        cohere2.block_count u32              = 32
llama_model_loader: - kv   7:                     cohere2.context_length u32              = 8192
llama_model_loader: - kv   8:                   cohere2.embedding_length u32              = 4096
llama_model_loader: - kv   9:                cohere2.feed_forward_length u32              = 14336
llama_model_loader: - kv  10:               cohere2.attention.head_count u32              = 32
llama_model_loader: - kv  11:            cohere2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                     cohere2.rope.freq_base f32              = 50000.000000
llama_model_loader: - kv  13:       cohere2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  14:               cohere2.attention.key_length u32              = 128
llama_model_loader: - kv  15:             cohere2.attention.value_length u32              = 128
llama_model_loader: - kv  16:                          general.file_type u32              = 10
llama_model_loader: - kv  17:                        cohere2.logit_scale f32              = 0.250000
llama_model_loader: - kv  18:           cohere2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  19:                         cohere2.vocab_size u32              = 256000
llama_model_loader: - kv  20:               cohere2.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                  cohere2.rope.scaling.type str              = none
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = command-r
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 255001
llama_model_loader: - kv  29:            tokenizer.ggml.unknown_token_id u32              = 1
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  32:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  33:           tokenizer.chat_template.tool_use str              = {%- macro document_turn(documents) -%...
llama_model_loader: - kv  34:                tokenizer.chat_template.rag str              = {% set tools = [] %}\n{%- macro docume...
llama_model_loader: - kv  35:                   tokenizer.chat_templates arr[str,2]       = ["tool_use", "rag"]
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if documents %}\n{% set tools = [] ...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   33 tensors
llama_model_loader: - type q2_K:  128 tensors
llama_model_loader: - type q3_K:   64 tensors
llama_model_loader: - type q4_K:   32 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 41
llm_load_vocab: token to piece cache size = 1.8428 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = cohere2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 253333
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 2.5e-01
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = none
llm_load_print_meta: freq_base_train  = 50000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 3.19 GiB (3.42 BPW)
llm_load_print_meta: general.name     = C4AI Command R7B
llm_load_print_meta: BOS token        = 5 '<BOS_TOKEN>'
llm_load_print_meta: EOS token        = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: UNK token        = 1 '<UNK>'
llm_load_print_meta: PAD token        = 0 '<PAD>'
llm_load_print_meta: LF token         = 136 'Ä'
llm_load_print_meta: FIM PAD token    = 0 '<PAD>'
llm_load_print_meta: EOG token        = 0 '<PAD>'
llm_load_print_meta: EOG token        = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: max token length = 1024
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CUDA0 model buffer size =  3268.83 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   820.31 MiB
.............................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 10240
llama_new_context_with_model: n_ctx_per_seq = 10240
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 2048
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: freq_base     = 50000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_pre_seq (10240) > n_ctx_train (8192) -- possible training context overflow
llama_kv_cache_init: kv_size = 10240, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32
llama_kv_cache_init:      CUDA0 KV buffer size =  1280.00 MiB
llama_new_context_with_model: KV self size  = 1280.00 MiB, K (f16):  640.00 MiB, V (f16):  640.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  2032.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   192.02 MiB
llama_new_context_with_model: graph nodes  = 826
llama_new_context_with_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 10240
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 12
main: model was trained on only 8192 context tokens (10240 specified)

system_info: n_threads = 12 (n_threads_batch = 12) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 917609079
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 10240
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 10240, n_batch = 2048, n_predict = -1, n_keep = 1

<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me all about yourself.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>I am an AI assistant, Command, designed by the company Cohere to help people by providing thorough and informative responses. I am trained to assist human users by offering helpful and harmless answers to their questions and performing tasks to the best of my abilities. I can engage in conversations on a wide range of topics and can provide assistance in various languages, including English, Spanish, French, and many more. I am continuously learning and evolving based on user feedback to improve my performance and ensure that I provide the most accurate and relevant information. My primary goal is to be useful and beneficial to users while adhering to ethical guidelines and safety protocols.<|END_RESPONSE|><|END_OF_TURN_TOKEN|> [end of text]


llama_perf_sampler_print:    sampling time =      18.41 ms /   150 runs   (    0.12 ms per token,  8149.07 tokens per second)
llama_perf_context_print:        load time =    1848.54 ms
llama_perf_context_print: prompt eval time =      17.08 ms /    22 tokens (    0.78 ms per token,  1287.83 tokens per second)
llama_perf_context_print:        eval time =     744.26 ms /   127 runs   (    5.86 ms per token,   170.64 tokens per second)
llama_perf_context_print:       total time =     799.47 ms /   149 tokens

slaren (Collaborator) commented Dec 31, 2024

The chat template is not recognized, which makes this unusable with the web UI or the chat examples. Could that be added?

dranger003 (Contributor, Author):

The chat template is not recognized, which makes this unusable with the web UI or the chat examples. Could that be added?

The template seems to work for me, but maybe I'm missing something; I'd appreciate any pointers. I know there was an earlier commit to fix the template detection for this model, since the tokens to recognize were beyond the static length used, but I presume you are using the latest commit.

main: chat template example:
<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Hi there<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>How are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
llama-cli.exe -m ggml-c4ai-command-r7b-12-2024-q4_k.gguf -cnv
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090 Laptop GPU, compute capability 8.9, VMM: yes
build: 4401 (f7fce174) with MSVC 19.42.34435.0 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090 Laptop GPU) - 15048 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 258 tensors from C:\LLMs\ggml-c4ai-command-r7b-12-2024-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = cohere2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = C4AI Command R7B
llama_model_loader: - kv   3:                           general.finetune str              = 6e95067bc61560b05eb014eef8af443034d94ae9
llama_model_loader: - kv   4:                         general.size_label str              = 8.0B
llama_model_loader: - kv   5:                            general.license str              = cc-by-nc-4.0
llama_model_loader: - kv   6:                          general.languages arr[str,23]      = ["en", "fr", "de", "es", "it", "pt", ...
llama_model_loader: - kv   7:                        cohere2.block_count u32              = 32
llama_model_loader: - kv   8:                     cohere2.context_length u32              = 8192
llama_model_loader: - kv   9:                   cohere2.embedding_length u32              = 4096
llama_model_loader: - kv  10:                cohere2.feed_forward_length u32              = 14336
llama_model_loader: - kv  11:               cohere2.attention.head_count u32              = 32
llama_model_loader: - kv  12:            cohere2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                     cohere2.rope.freq_base f32              = 50000.000000
llama_model_loader: - kv  14:       cohere2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  15:               cohere2.attention.key_length u32              = 128
llama_model_loader: - kv  16:             cohere2.attention.value_length u32              = 128
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:                        cohere2.logit_scale f32              = 0.250000
llama_model_loader: - kv  19:           cohere2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  20:                         cohere2.vocab_size u32              = 256000
llama_model_loader: - kv  21:               cohere2.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                  cohere2.rope.scaling.type str              = none
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = command-r
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 255001
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 1
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  33:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  34:           tokenizer.chat_template.tool_use str              = {%- macro document_turn(documents) -%...
llama_model_loader: - kv  35:                tokenizer.chat_template.rag str              = {% set tools = [] %}\n{%- macro docume...
llama_model_loader: - kv  36:                   tokenizer.chat_templates arr[str,2]       = ["tool_use", "rag"]
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {% if documents %}\n{% set tools = [] ...
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   33 tensors
llama_model_loader: - type q4_K:  192 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 41
llm_load_vocab: token to piece cache size = 1.8428 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = cohere2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 253333
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 2.5e-01
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = none
llm_load_print_meta: freq_base_train  = 50000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.70 GiB (5.03 BPW)
llm_load_print_meta: general.name     = C4AI Command R7B
llm_load_print_meta: BOS token        = 5 '<BOS_TOKEN>'
llm_load_print_meta: EOS token        = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: UNK token        = 1 '<UNK>'
llm_load_print_meta: PAD token        = 0 '<PAD>'
llm_load_print_meta: LF token         = 136 'Ä'
llm_load_print_meta: FIM PAD token    = 0 '<PAD>'
llm_load_print_meta: EOG token        = 0 '<PAD>'
llm_load_print_meta: EOG token        = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: max token length = 1024
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =  4812.33 MiB
....................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 50000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32
llama_kv_cache_init:        CPU KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.98 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1328.31 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 952
llama_new_context_with_model: graph splits = 324 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 24
main: chat template example:
<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Hi there<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>How are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>

system_info: n_threads = 24 (n_threads_batch = 24) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: interactive mode on.
sampler seed: 4214637548
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> Hello
Hello! How can I help you today?

>
llama_perf_sampler_print:    sampling time =       2.09 ms /    18 runs   (    0.12 ms per token,  8595.99 tokens per second)
llama_perf_context_print:        load time =    2351.75 ms
llama_perf_context_print: prompt eval time =    2675.57 ms /     7 tokens (  382.22 ms per token,     2.62 tokens per second)
llama_perf_context_print:        eval time =     892.29 ms /    11 runs   (   81.12 ms per token,    12.33 tokens per second)
llama_perf_context_print:       total time =   12727.43 ms /    18 tokens
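
As an aside, the chat-template rendering shown above can be reproduced from the HF tokenizer with a short, illustrative script. The repo id below is an assumption (the model may be gated, so substitute a local path if needed), and the output should match the example up to the BOS token.

# Sketch: reproduce the chat-template example via the HF tokenizer.
# The model id is an assumption; substitute a local path or the repo you use.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r7b-12-2024")
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
    {"role": "user", "content": "How are you?"},
]
# tokenize=False returns the rendered prompt string; add_generation_prompt=True
# appends the assistant turn header, mirroring the llama-cli example above.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))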

dranger003 (Contributor, Author):

I am not familiar with the server code or whether it has a separate template detection path from llama-cli; I will take a look tomorrow morning.

slaren (Collaborator) commented Dec 31, 2024

With llama-cli it works, but when running the server I get this:

main: The chat template that comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses
main: chat template, built_in: 0, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'

Not sure what the difference is; I would expect both examples to use the same detection code. cc @ngxson

dranger003 (Contributor, Author):

Thanks @ngxson, it looks like the template is now detected with your PR.

slaren (Collaborator) commented Dec 31, 2024

With llama-simple-chat and llama-run I get START_RESPONSE and END_RESPONSE tokens in the responses. I think this could be considered a bug in these examples since special tokens probably should not be printed, but shouldn't these tokens be part of the template?

$ build/bin/llama-simple-chat -m models/ggml-c4ai-command-r7b-12-2024-q8_0.gguf -ngl 99
> hi
<|START_RESPONSE|>Hello! How can I assist you today?<|END_RESPONSE|>
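
If an application needs to hide these markers in the meantime, a simple user-side workaround is possible; the sketch below is only an illustration, not how the llama.cpp examples handle it.

# Sketch of a user-side workaround (not llama.cpp code): strip the Command R7B
# response markers from generated text before displaying it.
RESPONSE_MARKERS = ("<|START_RESPONSE|>", "<|END_RESPONSE|>")

def strip_response_markers(text: str) -> str:
    for marker in RESPONSE_MARKERS:
        text = text.replace(marker, "")
    return text.strip()

print(strip_response_markers("<|START_RESPONSE|>Hello! How can I assist you today?<|END_RESPONSE|>"))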

slaren (Collaborator) commented Dec 31, 2024

I am not sure if this should be merged before the llama.cpp refactor that @ggerganov is working on, so I will let him merge this.

ngxson (Collaborator) commented Dec 31, 2024

@slaren These tokens are new in this model. Tbh, this is quite messy. I'm quoting the built-in system message present in the jinja template:

  1. Action: write <|START_ACTION|> followed by a list of JSON-formatted tool calls, with each one containing "tool_name" and "parameters" fields. When there are multiple tool calls which are completely independent of each other (i.e. they can be executed in parallel), you should list them out all together in one step. When you finish, close it out with <|END_ACTION|>.
  2. Observation: you will then receive results of those tool calls in JSON format in the very next turn, wrapped around by <|START_TOOL_RESULT|> and <|END_TOOL_RESULT|>. Carefully observe those results and think about what to do next. Note that these results will be provided to you in a separate turn. NEVER hallucinate results. Every tool call produces a list of results (when a tool call produces no result or a single result, it'll still get wrapped inside a list). Each result is clearly linked to its originating tool call via its "tool_call_id".
  3. Reflection: start the next turn by writing <|START_THINKING|> followed by what you've figured out so far, any changes you need to make to your plan, and what you will do next. When you finish, close it out with <|END_THINKING|>. You can optionally choose to skip this step when everything is going according to plan and no special pieces of information or reasoning chains need to be recorded. NOTE: You MUST skip this step when you are done with tool-use actions and are ready to respond to the user. You can repeat the above 3 steps multiple times (could be 0 times too if no suitable tool calls are available or needed), until you decide it's time to finally respond to the user.
  4. Response: then break out of the loop and write <|START_RESPONSE|> followed by a piece of text which serves as a response to the user's last request. Use all previous tool calls and results to help you when formulating your response. When you finish, close it out with <|END_RESPONSE|>.

One solution here is to add a new command-r7b template that prepends <|START_RESPONSE|> to the response and considers <|END_RESPONSE|> as an EOG token, but this would prevent the model from doing reflection or using tools.
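
To make the turn structure described above concrete, here is a rough sketch that splits a model turn into thinking/action/response segments using the quoted markers; this is purely illustrative and not an existing llama.cpp or Cohere API.

import re

# Rough sketch (not llama.cpp code): split a Command R7B turn into the segments
# described above, using the special-token markers quoted from the template.
SEGMENTS = {
    "thinking": ("<|START_THINKING|>", "<|END_THINKING|>"),
    "action":   ("<|START_ACTION|>",   "<|END_ACTION|>"),
    "response": ("<|START_RESPONSE|>", "<|END_RESPONSE|>"),
}

def parse_turn(text: str) -> dict:
    parts = {}
    for name, (start, end) in SEGMENTS.items():
        m = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text, re.DOTALL)
        if m:
            parts[name] = m.group(1).strip()
    return parts

example = ('<|START_THINKING|>I need the weather tool.<|END_THINKING|>'
           '<|START_ACTION|>[{"tool_name": "get_weather", "parameters": {"city": "Paris"}}]<|END_ACTION|>')
print(parse_turn(example))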

slaren (Collaborator) commented Dec 31, 2024

I see, thanks. Maybe it would be better to wait until #11016 is ready; hopefully that will fix the templating issues.

@ggerganov ggerganov mentioned this pull request Dec 31, 2024
dranger003 (Contributor, Author):

One solution here is to add a new command-r7b template that prepends <|START_RESPONSE|> to the response and considers <|END_RESPONSE|> as an EOG token, but this would prevent the model from doing reflection or using tools.

I think this model is quite different and that the start/end response tokens should be left in the content (just like the action tokens). One reason is that when the model replies with both actions and responses, these special tokens delimit which is which. This model is really geared toward tool/function calling, which I think is why it uses these special tokens. Trying to manage these additional special tokens via the template may give unintended results, and I don't think the Jinja template from HF manages them either.

arch-btw (Contributor) commented Jan 1, 2025

Thank you for your work @dranger003! I'm looking forward to seeing this get merged.

I am not familiar with the server code or whether it has a separate template detection path from llama-cli; I will take a look tomorrow morning.

Yeah, this was/is not completely clear to me either, but (you might already know this) you can manually trigger it with:

--chat-template command-r

Apparently that works for the server too.

ggerganov (Owner):

I am not sure if this should be merged before the llama.cpp refactor that @ggerganov is working on, so I will let him merge this.

The refactor from #10902 is now merged and the conflicts can be resolved in this PR.

dranger003 (Contributor, Author):

The refactor from #10902 is now merged and the conflicts can be resolved in this PR.

The PR has now been rebased on master.

ggerganov merged commit 46be942 into ggerganov:master on Jan 4, 2025
51 checks passed
netrunnereve pushed a commit to netrunnereve/llama.cpp that referenced this pull request Jan 5, 2025
Linked issue: Feature Request: Support for C4AI Command R7B / Cohere2ForCausalLM (#10816)