GGUF support #1002

Closed

Description

@viktor-ferenczi

Motivation

AWQ is nice, but if you want more control over the bit depth (and thus VRAM usage), GGUF may be a better option. A wide range of models is available from TheBloke at various bit depths, so everyone can use the largest one that fits into their GPU.

I cannot find a high-throughput batch inference engine that can load GGUF; perhaps there is none. (vLLM cannot load it either.)
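For reference, this is roughly what single-request GGUF inference looks like today via llama-cpp-python. It is only a sketch of the existing (non-batched) workflow, not a proposal for this project's API; the model path and quantization level below are illustrative.

```python
# Sketch: load a GGUF file with llama-cpp-python and run one request.
# Assumptions: llama-cpp-python is installed with GPU support; the path and
# quantization level (e.g. Q4_K_M vs Q5_K_M from a TheBloke repo) are examples.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-13b.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,
)

out = llm("Explain GGUF in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```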

Related resources

https://github.com/ggerganov/llama.cpp

https://huggingface.co/TheBloke
