v0.1.35
New models
- Llama 3 ChatQA: A model from NVIDIA based on Llama 3 that excels at conversational question answering (QA) and retrieval-augmented generation (RAG).
What's Changed
- Quantization: `ollama create` can now quantize models when importing them using the `--quantize` or `-q` flag (see the sketch after this list):

  ```
  ollama create -f Modelfile --quantize q4_0 mymodel
  ```

  Note: `--quantize` works when importing `float16` or `float32` models:

  - From a binary GGUF file (e.g. `FROM ./model.gguf`)
  - From a library model (e.g. `FROM llama3:8b-instruct-fp16`)
- Fixed issue where inference subprocesses wouldn't be cleaned up on shutdown
- Fixed a series of out-of-memory errors when loading models on multi-GPU systems
- Ctrl+J characters will now properly add newlines in `ollama run`
- Fixed issues when running `ollama show` for vision models
- `OPTIONS` requests to the Ollama API will no longer result in errors (see the sketch after this list)
- Fixed issue where partially downloaded files wouldn't be cleaned up
- Added a new `done_reason` field in responses describing why generation stopped (see the sketch after this list)
- Ollama will now more accurately estimate how much memory is available on multi-GPU systems, especially when running different models one after another
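A minimal sketch of the new quantization flow, assuming a local float16 GGUF file; the path `./model-f16.gguf` and the model name `mymodel` are placeholders:

```shell
# Write a minimal Modelfile pointing at a local float16 GGUF (placeholder path).
cat > Modelfile <<'EOF'
FROM ./model-f16.gguf
EOF

# Import and quantize to q4_0 in one step; -q is the short form of --quantize.
ollama create -f Modelfile --quantize q4_0 mymodel

# Run the newly quantized model.
ollama run mymodel
```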
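A minimal sketch of the `OPTIONS` fix, simulating a browser CORS preflight against a local server on the default port; `/api/generate` and the `Origin` value are illustrative choices, not taken from these notes:

```shell
# Preflight request of the kind browsers send before a cross-origin POST;
# this previously resulted in an error from the API.
curl -i -X OPTIONS http://localhost:11434/api/generate \
  -H "Origin: http://localhost:3000" \
  -H "Access-Control-Request-Method: POST"
```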
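A minimal sketch of reading the new `done_reason` field, assuming the standard `/api/generate` endpoint; the model name, prompt, and the example value `"stop"` are assumptions, since these notes don't list the possible values:

```shell
# Non-streaming request so the final JSON object arrives in one piece.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
# The final object includes fields such as:
#   "done": true,
#   "done_reason": "stop"   # why generation stopped (example value, assumed)
```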
New Contributors
- @fmaclen made their first contribution in #3884
- @Renset made their first contribution in #3881
- @glumia made their first contribution in #3043
- @boessu made their first contribution in #4236
- @gaardhus made their first contribution in #2307
- @svilupp made their first contribution in #2192
- @WolfTheDeveloper made their first contribution in #4300
Full Changelog: v0.1.34...v0.1.35