v0.1.35
New models
- Llama 3 ChatQA: A model from NVIDIA based on Llama 3 that excels at conversational question answering (QA) and retrieval-augmented generation (RAG).
What's Changed
- Quantization: `ollama create` can now quantize models when importing them using the `--quantize` or `-q` flag (see the sketch after this list):

  ```
  ollama create -f Modelfile --quantize q4_0 mymodel
  ```

  Note: `--quantize` works when importing `float16` or `float32` models:

  - From a binary GGUF file (e.g. `FROM ./model.gguf`)
  - From a library model (e.g. `FROM llama3:8b-instruct-fp16`)
- Fixed issue where inference subprocesses wouldn't be cleaned up on shutdown
- Fixed a series of out-of-memory errors when loading models on multi-GPU systems
- Ctrl+J characters will now properly add newlines in `ollama run`
- Fixed issues when running `ollama show` for vision models
- `OPTIONS` requests to the Ollama API will no longer result in errors (see the sketch after this list)
- Fixed issue where partially downloaded files wouldn't be cleaned up
- Added a new `done_reason` field in responses describing why generation stopped (see the sketch after this list)
- Ollama will now more accurately estimate how much memory is available on multi-GPU systems, especially when running different models one after another
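A minimal sketch of the new quantization flow, assuming a local float16 GGUF file; the path `./model-f16.gguf` and the model name `mymodel` are placeholders:

```shell
# Write a minimal Modelfile pointing at a local float16 GGUF (placeholder path).
cat > Modelfile <<'EOF'
FROM ./model-f16.gguf
EOF

# Import and quantize to q4_0 in one step; -q is the short form of --quantize.
ollama create -f Modelfile --quantize q4_0 mymodel

# Run the newly quantized model.
ollama run mymodel
```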
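A minimal sketch of the `OPTIONS` fix, simulating a browser CORS preflight against a local server on the default port; `/api/generate` and the `Origin` value are illustrative choices, not taken from these notes:

```shell
# Preflight request of the kind browsers send before a cross-origin POST;
# this previously resulted in an error from the API.
curl -i -X OPTIONS http://localhost:11434/api/generate \
  -H "Origin: http://localhost:3000" \
  -H "Access-Control-Request-Method: POST"
```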
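A minimal sketch of reading the new `done_reason` field, assuming the standard `/api/generate` endpoint; the model name, prompt, and the example value `"stop"` are assumptions, since these notes don't list the possible values:

```shell
# Non-streaming request so the final JSON object arrives in one piece.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
# The final object includes fields such as:
#   "done": true,
#   "done_reason": "stop"   # why generation stopped (example value, assumed)
```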
New Contributors
- @fmaclen made their first contribution in #3884
- @Renset made their first contribution in #3881
- @glumia made their first contribution in #3043
- @boessu made their first contribution in #4236
- @gaardhus made their first contribution in #2307
- @svilupp made their first contribution in #2192
- @WolfTheDeveloper made their first contribution in #4300
Full Changelog: v0.1.34...v0.1.35