Possibly a good idea to replace llama.cpp with candle for running quantized models? #276
Yesterday, support for running quantized models with CUDA was merged into candle.
I haven't tested it yet, and it currently looks unstable since it's at a very early stage. From the pull request, it appears to support only Q4_0 quantization and still has some bugs, but in the near future it may become as good as llama.cpp.
This would have some benefits. A native Rust solution means no more `build.rs`, cmake, bindgen, or `unsafe` calls. Possibly the entire `llm-chain-llama-sys` crate could be removed. The `llama.cpp` submodule could also be removed, so people wouldn't have to modify the Rust code every time llama.cpp is updated just to get new features. In addition, candle supports `.safetensors` files alongside `.gguf` and the legacy `.ggml` formats (see the sketch below).
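For illustration, here is a rough sketch of what loading a quantized `.gguf` model in pure Rust could look like, modeled on candle's quantized-llama example. The model path is hypothetical, and the exact signatures may differ across candle versions since the crate is evolving quickly:

```rust
// Sketch: load a GGUF-quantized LLaMA model with candle, no C++ toolchain involved.
// Based on candle's quantized-llama example; exact APIs may differ by version.
use candle_core::quantized::gguf_file;
use candle_core::Device;
use candle_transformers::models::quantized_llama::ModelWeights;

fn main() -> anyhow::Result<()> {
    let path = "llama-2-7b.Q4_0.gguf"; // hypothetical local model file
    let mut file = std::fs::File::open(path)?;
    // Parse the GGUF header and metadata, then build the model from its tensors.
    let content = gguf_file::Content::read(&mut file)?;
    let device = Device::new_cuda(0)?; // or Device::Cpu without a GPU
    let _model = ModelWeights::from_gguf(content, &mut file, &device)?;
    println!("loaded {path}");
    Ok(())
}
```

Compare that with the current setup, where the same step goes through `llm-chain-llama-sys`, a `build.rs` script compiling the llama.cpp submodule, and `unsafe` FFI calls into the generated bindings.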