Feature request
I would like to request llama.cpp as a new model backend in the transformers library.
Motivation
llama.cpp offers:
- Excellent performance in scenarios where memory bandwidth is an issue, namely CPU inference and GPU + CPU inference.
- Support for a wide range of GPU vendors and models.
- Adequate quantization accuracy -- I have compared the perplexities of 4-bit GGUF models to GPTQ, AWQ, EXL2, and bitsandbytes and found them to be competitive (link).
By making the transformers library compatible with GGUF models, llama.cpp's performance on consumer hardware could be combined with the features available in transformers and its surrounding ecosystem. In particular, it would be interesting to see the following work seamlessly with llama.cpp:
- Assisted generation (speculative decoding), as sketched after this list
- StreamingLLM
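For context, here is a minimal sketch of how assisted generation is already used in transformers with two Hub models (the model names are only illustrative); the point of this request is to let either model be backed by llama.cpp:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choices: a large target model and a small draft model
# that share the same tokenizer/vocabulary.
tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
assistant = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("llama.cpp support in transformers would", return_tensors="pt")
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```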
Your contribution
I have implemented a "llamacpp_HF" wrapper in the file below:
https://github.com/oobabooga/text-generation-webui/blob/main/modules/llamacpp_hf.py
It makes it possible to use the transformers `model.generate` API with llama.cpp models, and it exemplifies how to make forward calls in llama.cpp and get the logits. It works for perplexity evaluation when `logits_all=True` is passed while loading the model. I additionally implemented some prefix-matching logic and a hacky way to recognize forward calls for negative prompts to make CFG (classifier-free guidance) functional.
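For reference, here is a minimal sketch (not the linked implementation) of what a forward call through llama-cpp-python looks like and how the logits can be read back; the model path is a placeholder:

```python
import numpy as np
import torch
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder path to any GGUF model
    n_ctx=2048,
    n_gpu_layers=0,
    logits_all=True,  # keep logits for every position, needed for perplexity
)

tokens = llm.tokenize(b"The quick brown fox")
llm.reset()
llm.eval(tokens)  # forward pass over the whole prompt

# llm.scores holds one row of logits per evaluated token: shape (n_ctx, n_vocab)
logits = torch.from_numpy(np.array(llm.scores[: llm.n_tokens], dtype=np.float32))
print(logits.shape)  # (len(tokens), n_vocab)
```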
For the llama.cpp transformers integration, I recommend the following:
- Relying on the llama-cpp-python library: https://github.com/abetlen/llama-cpp-python/
- Requiring the user to manually install llama-cpp-python with the appropriate command for their hardware rather than adding it as a direct dependency of transformers. I believe that's how it already works for GPTQ models, where AutoGPTQ has to be installed manually.
- In the `from_pretrained` call, having a `LlamaCppConfig` object that takes as input arbitrary kwargs that later on get passed to the `llama_cpp.Llama` model-loading call. That would be similar to the `BitsAndBytesConfig` object that is passed to `from_pretrained` when `load_in_4bit=True` is used. Some important parameters are `n_gpu_layers` and `n_ctx`; allowing arbitrary kwargs to be passed through `LlamaCppConfig` would keep this future-proof. A rough sketch of what such a config object could look like follows this list.
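To make the last point concrete, here is a rough sketch of what such a config object could look like. `LlamaCppConfig` does not exist in transformers today, and the usage shown at the bottom is the proposed API rather than something that currently works:

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class LlamaCppConfig:
    # Commonly needed llama_cpp.Llama parameters
    n_gpu_layers: int = 0
    n_ctx: int = 2048
    # Future-proofing: anything else is forwarded to llama_cpp.Llama unchanged
    extra_kwargs: Dict[str, Any] = field(default_factory=dict)

    def to_llama_kwargs(self) -> Dict[str, Any]:
        return {"n_gpu_layers": self.n_gpu_layers, "n_ctx": self.n_ctx, **self.extra_kwargs}


# Proposed usage (hypothetical, for illustration only):
# model = AutoModelForCausalLM.from_pretrained(
#     "some-org/some-model-GGUF",
#     llamacpp_config=LlamaCppConfig(n_gpu_layers=35, n_ctx=4096),
# )
```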
I'll tag @younesbelkada, who worked on the RWKV and AWQ integrations in transformers and may find this interesting.