Description
Motivation
AWQ is nice, but if you want more control over the bit depth (and thus VRAM usage), GGUF may be a better option. A wide range of models is available from TheBloke at various bit depths, so everyone can pick the largest quant that fits into their GPU.
I cannot find a high-throughput batch inference engine that can load GGUF; perhaps there is none. (vLLM cannot load it either.)
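For reference, GGUF files already load fine in single-sequence runtimes such as llama.cpp; what is missing is a batched equivalent. A minimal sketch with llama-cpp-python (the model path and quant variant below are placeholders, not a specific recommendation) shows the current workflow:

```python
# Minimal sketch: loading a GGUF quant with llama-cpp-python.
# This serves requests one at a time -- no continuous batching --
# which is exactly the gap this issue is about.
from llama_cpp import Llama

llm = Llama(
    # Placeholder path: any GGUF file, e.g. one of TheBloke's quants.
    model_path="./mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,       # context window
)

output = llm("Q: What is GGUF? A:", max_tokens=64)
print(output["choices"][0]["text"])
```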