
Use model->gguf_kv for loading the template instead of using the C API. #10868

Merged (2 commits into ggerganov:master, Dec 17, 2024)

Conversation

dranger003 (Contributor)

Cohere's command-r models use a rather large chat template, so llama_chat_detect_template() fails to detect it: the <|USER_TOKEN|> marker lies beyond the current 2048-byte limit. This patch bumps the limit to 16K bytes, allowing conversation mode to work with the command-r models.
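
For context, here is a minimal sketch of the pre-patch loading path (simplified; the helper name `load_template_via_c_api` is hypothetical, and the exact code in src/llama.cpp may differ). The template is copied through the C API into a fixed-size buffer, so anything longer than the buffer is truncated before detection runs:

```cpp
#include <string>
#include <vector>
#include "llama.h" // for llama_model_meta_val_str()

// Pre-patch sketch: load "tokenizer.chat_template" through the C API into a
// fixed-size buffer. Anything longer than the buffer is silently truncated.
static std::string load_template_via_c_api(const llama_model * model) {
    std::vector<char> model_template(2048, 0); // original limit; the first commit bumps this to 16384
    const char * template_key = "tokenizer.chat_template";
    const int32_t res = llama_model_meta_val_str(
        model, template_key, model_template.data(), model_template.size());
    if (res < 0) {
        // no template in the model metadata: fall back to chatml
        return "chatml";
    }
    // a command-r template is cut off here, so <|USER_TOKEN|> never reaches
    // the caller and llama_chat_detect_template() cannot match it
    return std::string(model_template.data(), model_template.size());
}
```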

slaren (Collaborator) commented Dec 17, 2024

This seems too much for something that is called frequently. I would prefer if it skipped the C API and just used model->gguf_kv directly.
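
The merged change follows this suggestion. Roughly, the lookup becomes a direct read of the model's in-memory GGUF metadata map; a sketch, assuming `gguf_kv` is a string-to-string map keyed by GGUF metadata names (the helper name `load_template_from_gguf_kv` is hypothetical):

```cpp
#include <string>
#include <unordered_map>

// Post-patch sketch: read the template straight from the model's in-memory
// GGUF metadata map, avoiding the C API round-trip and any fixed-size buffer.
static std::string load_template_from_gguf_kv(
        const std::unordered_map<std::string, std::string> & gguf_kv) {
    const auto it = gguf_kv.find("tokenizer.chat_template");
    if (it != gguf_kv.end()) {
        return it->second; // full template string, no length limit
    }
    return "chatml";       // fall back when the model carries no template
}
```

Since the map already holds the full string, no buffer sizing is needed, and frequent callers avoid the extra copy through the C API.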

@dranger003 changed the title from "Bump model_template to 16384 bytes to support larger chat templates." to "Use model->gguf_kv for loading the template instead of using the C API." on Dec 17, 2024
@slaren merged commit d62b532 into ggerganov:master on Dec 17, 2024 (48 checks passed)
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Dec 20, 2024:

Use model->gguf_kv for loading the template instead of using the C API. (ggerganov#10868)

* Bump model_template to 16384 bytes to support larger chat templates.
* Use `model->gguf_kv` for efficiency.