Add support for llama.cpp #27712

Open

Description

@oobabooga

Feature request

I would like to request llama.cpp as a new model backend in the transformers library.

Motivation

llama.cpp offers:

  1. Excellent performance in scenarios where memory bandwidth is an issue, namely CPU inference and GPU + CPU inference.
  2. Support for a wide range of GPU vendors and models.
  3. Adequate quantization accuracy -- I have compared the perplexities of 4-bit GGUF models to GPTQ, AWQ, EXL2, and bitsandbytes and found them to be competitive (link).

By making the transformers library compatible with GGUF models, llama.cpp's performance on consumer hardware could hopefully be combined with the features available in transformers and its surrounding ecosystem. In particular, it would be interesting to see those features working seamlessly with llama.cpp.

Your contribution

I have implemented a "llamacpp_HF" wrapper in the file below:

https://github.com/oobabooga/text-generation-webui/blob/main/modules/llamacpp_hf.py

It makes it possible to use transformers' model.generate with llama.cpp models, and it shows how to make forward calls in llama.cpp and retrieve the logits. It supports perplexity evaluation as long as logits_all=True is passed when loading the model. I additionally implemented some prefix-matching logic and a hacky way to recognize forward calls for negative prompts, which makes CFG functional.
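
For reference, here is a condensed, hypothetical sketch of the same idea: a PreTrainedModel subclass whose forward pass delegates to llama-cpp-python and returns the logits that model.generate expects. The class name LlamaCppHF and the simplifications (batch size 1, no prefix cache, CPU device) are assumptions for illustration, not the actual text-generation-webui code; the Llama attribute names (eval, reset, scores, n_tokens) follow recent llama-cpp-python releases.

```python
import torch
from llama_cpp import Llama
from transformers import PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast


class LlamaCppHF(PreTrainedModel):
    """Wraps a llama_cpp.Llama instance so that transformers' generate() can drive it."""

    def __init__(self, config: PretrainedConfig, model_path: str, **llama_kwargs):
        super().__init__(config)
        # logits_all=True keeps logits for every position, which perplexity evaluation needs.
        self.llama = Llama(model_path=model_path, logits_all=True, **llama_kwargs)

    @property
    def device(self) -> torch.device:
        # llama.cpp manages its own memory; tensors handed back to transformers live on CPU.
        return torch.device("cpu")

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        # Called by generate() before every forward pass; no cache handling in this sketch.
        return {"input_ids": input_ids}

    def forward(self, input_ids: torch.LongTensor, **kwargs) -> CausalLMOutputWithPast:
        tokens = input_ids[0].tolist()  # batch size 1 only, for brevity
        self.llama.reset()              # no prefix caching: re-evaluate the whole sequence
        self.llama.eval(tokens)         # run the llama.cpp forward pass
        # With logits_all=True, Llama.scores holds one row of logits per evaluated token.
        logits = torch.tensor(self.llama.scores[: self.llama.n_tokens], dtype=torch.float32)
        return CausalLMOutputWithPast(logits=logits.unsqueeze(0))
```

With a PretrainedConfig and tokenizer loaded from the original (unquantized) repository, model.generate(input_ids, max_new_tokens=...) then samples directly from llama.cpp's logits.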

For the llama.cpp transformers integration, I recommend the following:

  • Relying on the llama-cpp-python library: https://github.com/abetlen/llama-cpp-python/
  • Requiring the user to manually install llama-cpp-python with the appropriate command for their hardware rather than adding it as a direct requirement to transformers. I believe that's how it already works for GPTQ models, where AutoGPTQ has to be installed manually.
  • In the from_pretrained call, accepting a LlamaCppConfig object that takes arbitrary kwargs which are later passed to the llama_cpp.Llama model-loading call. That would be similar to the BitsAndBytesConfig object passed to from_pretrained when load_in_4bit=True is used. Some important parameters are n_gpu_layers and n_ctx; to make this future-proof, LlamaCppConfig should accept arbitrary kwargs (a sketch follows this list).
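
To make the last point concrete, here is a hypothetical sketch of what such a config object could look like. The class name LlamaCppConfig, its fields, and the llamacpp_config= argument to from_pretrained are proposals, not existing transformers API:

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class LlamaCppConfig:
    """Collects kwargs to forward verbatim to llama_cpp.Llama(...) (proposed, not existing API)."""

    n_gpu_layers: int = 0  # number of layers to offload to the GPU
    n_ctx: int = 2048      # context window size
    # Catch-all so future llama.cpp loading options work without changes to transformers.
    extra_kwargs: Dict[str, Any] = field(default_factory=dict)

    def to_llama_kwargs(self) -> Dict[str, Any]:
        return {"n_gpu_layers": self.n_gpu_layers, "n_ctx": self.n_ctx, **self.extra_kwargs}


# Envisioned usage, mirroring how BitsAndBytesConfig is passed today (hypothetical):
# model = AutoModelForCausalLM.from_pretrained(
#     "some-org/some-model-GGUF",  # placeholder repo id
#     llamacpp_config=LlamaCppConfig(n_gpu_layers=35, n_ctx=4096),
# )
```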

I'll tag @younesbelkada, who worked on the RWKV and AWQ integrations in transformers and may find this interesting.

Labels: Core: Modeling, WIP
