
Conversation

@Nexesenex
Owner

No description provided.

cmp-nct and others added 5 commits January 23, 2024 05:40
@Nexesenex Nexesenex merged commit 8f7b17b into Nexesenex:_master_up Jan 26, 2024
Nexesenex pushed a commit that referenced this pull request Oct 19, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 20, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 20, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 20, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 21, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 21, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 22, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 22, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 25, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 25, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 25, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 25, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 25, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 25, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 26, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 27, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 27, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 27, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Dec 22, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Feb 25, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex added a commit that referenced this pull request Mar 18, 2025
CUDA: faster float -> iq4_nl conversion (#73)

* iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2

PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s, up from 133.2 t/s.

* Speed up float -> iq4_nl conversion on CUDA
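
As a rough illustration of what the conversion does (a minimal sketch, not the optimized Zen4/AVX2 or CUDA kernel from this PR; the 16-entry codebook is the IQ4_NL values table from ggml, while the nearest-value search and scale choice are simplified):

```cpp
#include <cstdint>
#include <cmath>

// The 16-entry non-linear codebook used by IQ4_NL (values from ggml-common.h).
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};

// Index of the codebook entry closest to v (plain linear scan for clarity).
static int nearest_iq4nl(float v) {
    int best = 0;
    float best_err = std::fabs(v - kvalues_iq4nl[0]);
    for (int k = 1; k < 16; ++k) {
        const float err = std::fabs(v - kvalues_iq4nl[k]);
        if (err < best_err) { best_err = err; best = k; }
    }
    return best;
}

// Quantize one block of 32 floats to a scale `d` plus 16 bytes of packed 4-bit
// codebook indices (low nibble = element j, high nibble = element j+16).
// Simplified reference: the actual quantizer also refines the scale.
void quantize_block_iq4_nl_ref(const float * x, float & d, uint8_t qs[16]) {
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < 32; ++i) {
        const float ax = std::fabs(x[i]);
        if (ax > amax) { amax = ax; max = x[i]; }
    }
    d = max / kvalues_iq4nl[0];            // map the largest-magnitude value near -127
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    for (int j = 0; j < 16; ++j) {
        qs[j] = (uint8_t)(nearest_iq4nl(x[j] * id) | (nearest_iq4nl(x[j + 16] * id) << 4));
    }
}
```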

---------

iq4_nl: faster quantization (#76)

Enable IQ4_NL for V-cache in token generation

Add IQ4_NL + IQ4_NL to FA

This is a better alternative to Q4_0 + Q4_0 for the VRAM-poor.
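
For context, selecting the quantized KV-cache types programmatically looks roughly like this. A sketch against the llama.cpp C API of that era; it assumes this fork accepts GGML_TYPE_IQ4_NL for the cache types, which is exactly what this change enables:

```cpp
#include "llama.h"

// Sketch: request an IQ4_NL-quantized K and V cache with flash attention.
// Assumes the build accepts GGML_TYPE_IQ4_NL for type_k/type_v.
llama_context * make_ctx_iq4nl_kv(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.type_k     = GGML_TYPE_IQ4_NL;  // K-cache quantization type
    cparams.type_v     = GGML_TYPE_IQ4_NL;  // V-cache quantization type
    cparams.flash_attn = true;              // quantized V-cache requires flash attention
    return llama_new_context_with_model(model, cparams);
}
```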

IQ4_NL KVQ for KCPP/Croco

missing template instances for KVQ IQ4_NL
Update fattn.cu for KVQ IQ4_NL
Update fattn-vec-f16.cuh for KVQ IQ4_NL
Update fattn-vec-f32.cuh for KVQ IQ4_NL
CML and Makefile for IQ4_NL

KV_IQ4_NL uncommenting VEC16 cases
KV_IQ4_NL uncommenting VEC32 cases
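
The "missing template instances" and "uncommenting VEC16/VEC32 cases" items come down to the fact that the CUDA FA vector kernels are compiled per (head size, K-type, V-type) combination, so each new cache type needs explicit instantiations. Illustrative sketch only, with hypothetical names rather than the fork's actual macros:

```cpp
// Self-contained illustration with hypothetical names: the CUDA FA vector kernels
// are stamped out per (head size, K cache type, V cache type), so every new cache
// type needs explicit instances in the .cu files.
enum kv_type { KV_F16, KV_Q8_0, KV_Q4_0, KV_IQ4_NL };

template <int D, kv_type TK, kv_type TV>
void flash_attn_vec_case() {
    // dequantize K/V tiles according to TK/TV and run the D-wide kernel (omitted)
}

// Without an explicit instance like this one, the IQ4_NL + IQ4_NL combination is
// simply absent at link time.
template void flash_attn_vec_case<128, KV_IQ4_NL, KV_IQ4_NL>();
```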

Adding Q6_0 (#77)

* Adding q6_0 - basics + AVX2/Zen4 working (assumed block layout sketched after this list)

* Adding q6_0: CUDA dequantize works, but not mmvq

* Adding q6_0: CUDA mmvq works

* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache

* Add q6_0 to CPU flash attention

Disappointing result: for LLaMA-3.2-1B, q6_0 K- and V-cache give about
the same PPL as q8_0 K-cache plus q4_0 V-cache, while needing exactly
the same RAM (q6_0's 6.5 bpw is the average of q8_0's 8.5 bpw and q4_0's 4.5 bpw).
I.e., what was the point?

* q6_0: slightly better kv-cache result

Better than q8_0+q4_0, but not as good as q8_0+iq4_nl

* q6_0: works on ARM_NEON

* q6_0: dequantize works on Metal, but not vector dot product

* q6_0: it now works on Metal

Outperforms q5_0 by a significant margin. E.g.
| model                          |       size |     params | backend    | ngl | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     44.02 ± 0.08 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     40.13 ± 0.12 |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    500.55 ± 0.32 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    448.02 ± 0.27 |

* q6_0: can now be used for kv-cache on Metal
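
A minimal sketch of what a Q6_0 block plausibly looks like and how it dequantizes, assuming a layout analogous to Q5_0 (per-block fp16 scale, 4 low bits in `qs`, 2 high bits in `qh`); the exact bit packing in the fork may differ:

```cpp
#include <cstdint>

#define QK6_0 32
typedef uint16_t ggml_half;             // fp16 storage, as in ggml
float ggml_fp16_to_fp32(ggml_half h);   // provided by ggml; prototype shown for the sketch

// Assumed Q6_0 block layout, analogous to Q5_0:
// 2 (scale) + 8 (high bits) + 16 (low nibbles) = 26 bytes per 32 weights = 6.5 bpw.
struct block_q6_0 {
    ggml_half d;                        // per-block scale
    uint8_t   qh[QK6_0/4];              // upper 2 bits of each 6-bit value
    uint8_t   qs[QK6_0/2];              // lower 4 bits, two values per byte
};

// Reference dequantization under one *hypothetical* packing: element j keeps its
// low nibble in qs (low nibble = j, high nibble = j+16, ggml's usual split) and
// its two high bits packed four-per-byte in qh. The 6-bit values are centered at 32.
void dequantize_block_q6_0_ref(const block_q6_0 * b, float * y) {
    const float d = ggml_fp16_to_fp32(b->d);
    for (int j = 0; j < QK6_0/2; ++j) {
        const int lo0 =  b->qs[j] & 0x0F;
        const int lo1 =  b->qs[j] >> 4;
        const int hi0 = (b->qh[j/2] >> ((j % 2) * 2)) & 3;      // hypothetical placement
        const int hi1 = (b->qh[j/2] >> ((j % 2) * 2 + 4)) & 3;  // hypothetical placement
        y[j]           = d * ((lo0 | (hi0 << 4)) - 32);
        y[j + QK6_0/2] = d * ((lo1 | (hi1 << 4)) - 32);
    }
}
```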

---------

Enable q6_0 for flash attention

As with IQ4_NL, only for a head size of 128 for now. Without GGML_CUDA_FA_ALL_QUANTS set, only the Q6_0 + Q5_0 and Q8_0 + Q6_0 combinations are included. With this, the VRAM-poor have better options for selecting the best quantized KV-cache allowed by their VRAM, model size, and context length.
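
Conceptually, the reduced set of precompiled K/V pairs is a build-time choice. A sketch of the gating: the helper function below is hypothetical, GGML_CUDA_FA_ALL_QUANTS is the real build option, and GGML_TYPE_Q6_0 is assumed to be the type enum this fork adds alongside the existing ones:

```cpp
#include "ggml.h"   // for ggml_type; GGML_TYPE_Q6_0 assumed to exist in this fork

// Hypothetical dispatch check: without GGML_CUDA_FA_ALL_QUANTS only a curated
// subset of K/V cache type pairs is compiled for head size 128.
static bool fa_vec_pair_available(ggml_type type_K, ggml_type type_V) {
#ifdef GGML_CUDA_FA_ALL_QUANTS
    return true;  // all supported combinations are built
#else
    // The two newly added pairs; the pre-existing default combinations are omitted here.
    return (type_K == GGML_TYPE_Q6_0 && type_V == GGML_TYPE_Q5_0) ||
           (type_K == GGML_TYPE_Q8_0 && type_V == GGML_TYPE_Q6_0);
#endif
}
```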

PR by Ikawrakow on ik_llama.cpp
Nexesenex added a commit that referenced this pull request Mar 18, 2025
CUDA: faster float -> iq4_nl conversion (#73)

Nexesenex added a commit that referenced this pull request Mar 19, 2025
CUDA: faster float -> iq4_nl conversion (#73)

Nexesenex added a commit that referenced this pull request Mar 20, 2025
CUDA: faster float -> iq4_nl conversion (#73)

Nexesenex pushed a commit that referenced this pull request Oct 3, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 4, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 5, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 7, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 7, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 9, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 9, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 11, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 11, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 11, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 12, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 13, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 13, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 16, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 16, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 18, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 19, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 20, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 21, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 21, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 21, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 22, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 22, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 23, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 24, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 25, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 25, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 25, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 27, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex pushed a commit that referenced this pull request Oct 28, 2025
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>