
release-v1.1.0

@yhcvb released this 11 Oct 08:53
  • Added support for grouped quantization (group sizes of 32/64/128 for w4a16, and 128/256/512 for w8a8).
  • Added the GDQ algorithm to improve 4-bit quantization accuracy.
  • Added a hybrid quantization algorithm that combines grouped and non-grouped quantization according to a specified ratio.
  • Added support for the Llama3, Gemma2, and MiniCPM3 models.
  • Added support for GGUF model conversion (currently q4_0 and fp16 only).
  • Added support for LoRA models.
  • Added storage and loading of the prompt cache.
  • Added PC-side emulation accuracy testing and an inference interface for rkllm-toolkit.
  • Fixed the catastrophic forgetting issue that occurred when the token count exceeded max_context.
  • Optimized prefill speed.
  • Optimized generation speed.
  • Optimized model initialization time.
  • Added support for four input interfaces: prompt, embedding, token, and multimodal.
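To illustrate the grouped quantization feature above, here is a minimal, self-contained sketch of symmetric 4-bit per-group weight quantization (the "w4" side of w4a16) with a configurable group size. This is an assumption-laden illustration of the general technique, not RKLLM's actual implementation; the function names and the symmetric rounding scheme are hypothetical.

```python
# Illustrative sketch of grouped 4-bit weight quantization (NOT RKLLM's code):
# each group of `group_size` weights shares one scale, so smaller groups
# track the local weight range more closely at the cost of more scales.
def quantize_grouped_w4(weights, group_size=32):
    """Quantize a flat list of floats to 4-bit ints, one scale per group."""
    assert len(weights) % group_size == 0, "pad weights to a multiple of group_size"
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # 4-bit signed range is [-8, 7]; map the group's max |w| onto 7.
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid div-by-zero for all-zero groups
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize_grouped(q, scales, group_size=32):
    """Reconstruct approximate float weights from ints and per-group scales."""
    return [q[i] * scales[i // group_size] for i in range(len(q))]
```

With this scheme the reconstruction error per weight is bounded by half the group's scale, which is why the smaller group sizes listed for w4a16 (32/64) generally yield better accuracy than coarser grouping.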