
[Feat]: quantized KV cache and flash attention #79

@mseri

Description

Flash attention and quantized KV caches are both supported by llama.cpp.

These features allow for much larger contexts with a drastically reduced memory footprint, which could be quite convenient given the limited resources on a phone.

A quantized KV cache at q8 means roughly half the memory for the context with barely any effect on quality (q4 uses about a quarter of the memory, but in my tests the degradation is noticeable).
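For a rough sense of the savings, here is a back-of-the-envelope estimate (a sketch only; the model dimensions below are hypothetical and the exact cache layout depends on the model). llama.cpp's q8_0 and q4_0 block formats cost about 1.06 and 0.56 bytes per element versus 2 bytes for f16, which is where the "half" and "quarter" come from:

```cpp
#include <cstdio>

// Approximate KV cache size: per token and per layer the cache stores one K and
// one V vector, hence the factor 2. Bytes per element: f16 = 2.0,
// q8_0 ~ 1.0625 (34 bytes per 32-element block), q4_0 ~ 0.5625 (18 bytes per block).
static double kv_cache_gib(int n_ctx, int n_layer, int n_embd_kv, double bytes_per_elem) {
    return 2.0 * n_ctx * n_layer * (double) n_embd_kv * bytes_per_elem
           / (1024.0 * 1024.0 * 1024.0);
}

int main() {
    // Hypothetical 8B-class model: 32 layers, 1024-dim K/V after GQA, 32k context.
    const int n_ctx = 32768, n_layer = 32, n_embd_kv = 1024;
    std::printf("f16 : %.2f GiB\n", kv_cache_gib(n_ctx, n_layer, n_embd_kv, 2.0));    // baseline
    std::printf("q8_0: %.2f GiB\n", kv_cache_gib(n_ctx, n_layer, n_embd_kv, 1.0625)); // ~1/2
    std::printf("q4_0: %.2f GiB\n", kv_cache_gib(n_ctx, n_layer, n_embd_kv, 0.5625)); // ~1/4
    return 0;
}
```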

The feature could be implemented by adding two optional parameters: a checkbox for flash attention (required for KV quantization) and a dropdown to select the quantization type for both the K and V caches: f16 (the current default), q8, or q4. A sketch of how this maps onto llama.cpp is below.
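On the llama.cpp side these correspond to the `-fa`/`--flash-attn` and `--cache-type-k`/`--cache-type-v` CLI options; via the C API they map to fields on `llama_context_params`. A minimal sketch, assuming a recent llama.cpp (field names have changed between versions, so treat them as indicative):

```cpp
#include "llama.h"

// Sketch: build context params with flash attention and a quantized KV cache.
// Field names follow a recent llama.cpp and may differ in other versions.
llama_context_params make_ctx_params(ggml_type kv_type, bool use_flash_attn) {
    llama_context_params p = llama_context_default_params();
    p.flash_attn = use_flash_attn;  // CLI: -fa / --flash-attn
    p.type_k     = kv_type;         // CLI: --cache-type-k (GGML_TYPE_F16 / Q8_0 / Q4_0)
    p.type_v     = kv_type;         // CLI: --cache-type-v (quantized V needs flash attention)
    return p;
}

// The proposed dropdown would then pick GGML_TYPE_F16 (default), GGML_TYPE_Q8_0,
// or GGML_TYPE_Q4_0 and pass the result when creating the context, e.g.:
//   auto params = make_ctx_params(GGML_TYPE_Q8_0, /*use_flash_attn=*/true);
```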

Metadata

Assignees: no one assigned
Labels: enhancement (New feature or request)
