This repository contains the files to build ollama/quantize. It containerizes the scripts and utilities in llama.cpp used to create binary models for use with llama.cpp and compatible runners such as Ollama.
```shell
docker run --rm -v /path/to/model/repo:/repo ollama/quantize -q q4_0 /repo
```
This will produce two binaries in the repo: f16.bin, the unquantized model weights in GGUF format, and q4_0.bin, the same weights after 4-bit quantization.
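Assuming Ollama is installed on the host, a minimal sketch of picking up the quantized output could look like the following; the directory listing is illustrative and the model name `mymodel` is an arbitrary placeholder, not something produced by this repository:

```shell
# The mounted repo now contains the converted and quantized weights
ls /path/to/model/repo
# ...original model files...  f16.bin  q4_0.bin

# A minimal Ollama Modelfile that points at the quantized binary
cat > Modelfile <<'EOF'
FROM /path/to/model/repo/q4_0.bin
EOF

# Register and run it with a local Ollama install ("mymodel" is a placeholder)
ollama create mymodel -f Modelfile
ollama run mymodel
```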
The following model architectures are supported:
LlamaForCausalLM, MistralForCausalLM, YiForCausalLM, LlavaLlamaForCausalLM, LlavaMistralForCausalLM
Note: Llava models will produce additional intermediate files:
`llava.projector`, the vision tensors split from the PyTorch model, and `mmproj-model-f16.gguf`, the same tensors converted to GGUF. The final model will contain both the base model and the projector. Use `-m no` to disable this behaviour (see the example after the architecture list below).
RWForCausalLM, FalconForCausalLM
GPTNeoXForCausalLM
GPTBigCodeForCausalLM
MPTForCausalLM
BaichuanForCausalLM
PersimmonForCausalLM
RefactForCausalLM
BloomForCausalLM
StableLMEpochForCausalLM, LlavaStableLMEpochForCausalLM
MixtralForCausalLM
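For the Llava case described in the note above, a sketch of skipping projector generation, assuming `-m no` is passed alongside the flags from the usage example:

```shell
# Quantize a Llava model without the llava.projector / mmproj-model-f16.gguf step
docker run --rm -v /path/to/model/repo:/repo ollama/quantize -q q4_0 -m no /repo
```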
Supported quantization types (selected with the -q flag):
q4_0 (default), q4_1, q5_0, q5_1, q8_0
K-quants: q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q5_K_M, q6_K
Note: K-quants are not supported for Falcon models.
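To use one of these types instead of the default, pass it to `-q`; for example, a q4_K_M build of the same repository (not applicable to Falcon models, per the note above):

```shell
# Same invocation as the usage example, with a K-quant type selected
docker run --rm -v /path/to/model/repo:/repo ollama/quantize -q q4_K_M /repo
```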