Feature request
I would like to request llama.cpp as a new model backend in the transformers library.
Motivation
llama.cpp offers:
- Excellent performance in scenarios where memory bandwidth is an issue, namely CPU inference and GPU + CPU inference.
- Support for a wide range of GPU vendors and models.
- Adequate quantization accuracy -- I have compared the perplexities of 4-bit GGUF models to GPTQ, AWQ, EXL2, and bitsandbytes and found them to be competitive (link).
By making the transformers library compatible with GGUF models, llama.cpp's performance on consumer hardware could be combined with the features available in transformers and its surrounding ecosystem. In particular, it would be interesting to see the following work seamlessly with llama.cpp:
- Assisted generation (speculative decoding), as sketched after this list
- StreamingLLM
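For context, here is a minimal sketch of how assisted generation is already used in transformers with two Hub models (the model names are only illustrative); the point of this request is to let either model be backed by llama.cpp:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choices: a large target model and a small draft model
# that share the same tokenizer/vocabulary.
tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
assistant = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("llama.cpp support in transformers would", return_tensors="pt")
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```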
Your contribution
I have implemented a "llamacpp_HF" wrapper in the file below:
https://github.com/oobabooga/text-generation-webui/blob/main/modules/llamacpp_hf.py
It makes it possible to use the transformers `model.generate` API with llama.cpp models, and it exemplifies how to make forward calls in llama.cpp and get the logits. It works for perplexity evaluation when `logits_all=True` is passed while loading the model. I additionally implemented some prefix-matching logic and a hacky way to recognize forward calls for negative prompts to make CFG (classifier-free guidance) functional.
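For reference, here is a minimal sketch (not the linked implementation) of what a forward call through llama-cpp-python looks like and how the logits can be read back; the model path is a placeholder:

```python
import numpy as np
import torch
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder path to any GGUF model
    n_ctx=2048,
    n_gpu_layers=0,
    logits_all=True,  # keep logits for every position, needed for perplexity
)

tokens = llm.tokenize(b"The quick brown fox")
llm.reset()
llm.eval(tokens)  # forward pass over the whole prompt

# llm.scores holds one row of logits per evaluated token: shape (n_ctx, n_vocab)
logits = torch.from_numpy(np.array(llm.scores[: llm.n_tokens], dtype=np.float32))
print(logits.shape)  # (len(tokens), n_vocab)
```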
For the llama.cpp transformers integration, I recommend the following:
- Relying on the llama-cpp-python library: https://github.com/abetlen/llama-cpp-python/
- Requiring the user to manually install llama-cpp-python with the appropriate command for their hardware rather than adding it as a direct dependency of transformers. I believe that's how it already works for GPTQ models, where AutoGPTQ has to be installed manually.
- In the `from_pretrained` call, having a `LlamaCppConfig` object that takes as input arbitrary kwargs that later on get passed to the `llama_cpp.Llama` model-loading call. That would be similar to the `BitsAndBytesConfig` object that is passed to `from_pretrained` when `load_in_4bit=True` is used. Some important parameters are `n_gpu_layers` and `n_ctx`; allowing arbitrary kwargs to be passed through `LlamaCppConfig` would keep this future-proof. A rough sketch of what such a config object could look like follows this list.
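To make the last point concrete, here is a rough sketch of what such a config object could look like. `LlamaCppConfig` does not exist in transformers today, and the usage shown at the bottom is the proposed API rather than something that currently works:

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class LlamaCppConfig:
    # Commonly needed llama_cpp.Llama parameters
    n_gpu_layers: int = 0
    n_ctx: int = 2048
    # Future-proofing: anything else is forwarded to llama_cpp.Llama unchanged
    extra_kwargs: Dict[str, Any] = field(default_factory=dict)

    def to_llama_kwargs(self) -> Dict[str, Any]:
        return {"n_gpu_layers": self.n_gpu_layers, "n_ctx": self.n_ctx, **self.extra_kwargs}


# Proposed usage (hypothetical, for illustration only):
# model = AutoModelForCausalLM.from_pretrained(
#     "some-org/some-model-GGUF",
#     llamacpp_config=LlamaCppConfig(n_gpu_layers=35, n_ctx=4096),
# )
```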
I'll tag @younesbelkada, who worked on the RWKV and AWQ integrations in transformers and may find this interesting.