Conversation

@Nexesenex
Owner

No description provided.

ikawrakow and others added 13 commits January 14, 2024 09:44
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* imatrix: load

* imatrix: WIP

* imatrix: Add Q2_K quantization

* imatrix: also guard against Q2_K_S quantization without importance matrix

* imatrix: guard even more against low-bit quantization misuse

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
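
For context, the guard amounts to refusing the very low-bit path when no importance matrix has been loaded. A minimal sketch of the idea, assuming a hypothetical `needs_imatrix` helper and illustrative error text (not the actual llama.cpp code):

```cpp
#include <stdexcept>

enum class QuantType { Q2_K, Q2_K_S, Q4_0, Q8_0 };

// Hypothetical helper: the very low-bit schemes lose too much
// accuracy without per-weight importance information.
static bool needs_imatrix(QuantType t) {
    return t == QuantType::Q2_K || t == QuantType::Q2_K_S;
}

void check_quantization(QuantType t, bool have_imatrix) {
    if (needs_imatrix(t) && !have_imatrix) {
        throw std::runtime_error(
            "this quantization type requires an importance matrix "
            "(generate one with the imatrix tool and pass it to quantize)");
    }
}
```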
* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad

* log a little bit more info on iOS
* Fix ffn_down quantization mix for MoE models

In #4872 I did not account for the rule that every third
tensor is quantized with more bits. For MoE models this leads to
tensors of the same layer being quantized with a different number
of bits, which the inference implementation does not support
(it assumes all experts use the same quantization).

* Fix the fix

* Review suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
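
The fix boils down to keying the "use more bits" decision on the layer index rather than on a running tensor counter, so every expert in a layer lands on the same type. A rough sketch of that idea only; the every-third-layer rule and types here are illustrative, see the actual change for the real logic:

```cpp
// Illustrative only: pick the quantization type for an ffn_down
// tensor so that every expert in a layer gets the same answer.
enum class QuantType { Q2_K, Q3_K };

QuantType ffn_down_type(int layer_index) {
    // Keyed on the layer, not on a per-tensor counter: with
    // n_expert ffn_down tensors per layer, a counter-based
    // "every third tensor" rule would split one layer's experts
    // across two different types.
    return (layer_index % 3 == 0) ? QuantType::Q3_K : QuantType::Q2_K;
}
```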
* llama : minor fix indent

* llama : check LLAMA_TRACE env for extra logging

ggml-ci
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
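
The env-var gate is a one-time getenv check cached in a static, so the hot logging path costs a single branch. A sketch of the pattern, not the exact llama.cpp code:

```cpp
#include <cstdio>
#include <cstdlib>

// Check the environment once and cache the result.
static bool llama_trace_enabled() {
    static const bool enabled = std::getenv("LLAMA_TRACE") != nullptr;
    return enabled;
}

#define LLAMA_LOG_TRACE(...) \
    do { if (llama_trace_enabled()) fprintf(stderr, __VA_ARGS__); } while (0)
```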
@Nexesenex Nexesenex merged commit b6b4f65 into Nexesenex:_master_up Jan 15, 2024
Nexesenex pushed a commit that referenced this pull request Oct 21, 2024
* iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2

PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s, up from 133.2 t/s.

* Speed up float -> iq4_nl conversion on CUDA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
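
For reference, IQ4_NL maps each weight to the nearest entry of a fixed 16-value nonlinear table, and the commit speeds up exactly this search with Zen4/AVX2 intrinsics plus a faster CUDA float -> iq4_nl conversion. Below is a simplified scalar sketch of the lookup; the table values match ggml's kvalues_iq4nl, but the scale choice is simplified and the real code packs indices into nibbles:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// The 16 nonlinear levels used by IQ4_NL (from ggml-quants).
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Scalar reference: quantize one block of 32 floats to 4-bit
// indices into the table. The optimized Zen4/AVX2 path performs
// the same nearest-value search with SIMD.
void quantize_iq4nl_block(const float * x, uint8_t * idx, float * d) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    // Simplified scale so values span the table; ggml's actual
    // scale search is more involved.
    *d = amax > 0.0f ? amax / 127.0f : 0.0f;
    const float id = *d > 0.0f ? 1.0f / *d : 0.0f;
    for (int i = 0; i < 32; ++i) {
        const float v = x[i] * id;
        int   best     = 0;
        float best_err = std::fabs(v - kvalues_iq4nl[0]);
        for (int j = 1; j < 16; ++j) {
            const float err = std::fabs(v - kvalues_iq4nl[j]);
            if (err < best_err) { best_err = err; best = j; }
        }
        idx[i] = (uint8_t) best;
    }
}
```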