Conversation

@Nexesenex
Owner

No description provided.

ikawrakow and others added 13 commits January 14, 2024 09:44
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* imatrix: load

* imatrix: WIP

* imatrix: Add Q2_K quantization

* imatrix: also guard against Q2_K_S quantization without importance matrix

* imatrix: guard even more against low-bit quantization misuse

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
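
For context, the guard amounts to refusing the very low-bit path when no importance matrix has been loaded. A minimal sketch of the idea, assuming a hypothetical `needs_imatrix` helper and illustrative error text (not the actual llama.cpp code):

```cpp
#include <stdexcept>

enum class QuantType { Q2_K, Q2_K_S, Q4_0, Q8_0 };

// Hypothetical helper: the very low-bit schemes lose too much
// accuracy without per-weight importance information.
static bool needs_imatrix(QuantType t) {
    return t == QuantType::Q2_K || t == QuantType::Q2_K_S;
}

void check_quantization(QuantType t, bool have_imatrix) {
    if (needs_imatrix(t) && !have_imatrix) {
        throw std::runtime_error(
            "this quantization type requires an importance matrix "
            "(generate one with the imatrix tool and pass it to quantize)");
    }
}
```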
* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad

* log a little bit more info on iOS
* Fix ffn_down quantization mix for MoE models

In #4872 I did not account for the rule that every third
tensor is quantized with more bits. For MoE models this leads to
tensors of the same layer being quantized with a different number
of bits, which the inference implementation does not support
(it assumes all experts use the same quantization).

* Fix the fix

* Review suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
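
The fix boils down to keying the "use more bits" decision on the layer index rather than on a running tensor counter, so every expert in a layer lands on the same type. A rough sketch of that idea only; the every-third-layer rule and types here are illustrative, see the actual change for the real logic:

```cpp
// Illustrative only: pick the quantization type for an ffn_down
// tensor so that every expert in a layer gets the same answer.
enum class QuantType { Q2_K, Q3_K };

QuantType ffn_down_type(int layer_index) {
    // Keyed on the layer, not on a per-tensor counter: with
    // n_expert ffn_down tensors per layer, a counter-based
    // "every third tensor" rule would split one layer's experts
    // across two different types.
    return (layer_index % 3 == 0) ? QuantType::Q3_K : QuantType::Q2_K;
}
```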
* llama : minor fix indent

* llama : check LLAMA_TRACE env for extra logging

ggml-ci
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
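
The env-var gate is a one-time getenv check cached in a static, so the hot logging path costs a single branch. A sketch of the pattern, not the exact llama.cpp code:

```cpp
#include <cstdio>
#include <cstdlib>

// Check the environment once and cache the result.
static bool llama_trace_enabled() {
    static const bool enabled = std::getenv("LLAMA_TRACE") != nullptr;
    return enabled;
}

#define LLAMA_LOG_TRACE(...) \
    do { if (llama_trace_enabled()) fprintf(stderr, __VA_ARGS__); } while (0)
```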
@Nexesenex Nexesenex merged commit b6b4f65 into Nexesenex:_master_up Jan 15, 2024
Nexesenex pushed a commit that referenced this pull request Oct 21, 2024
* iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2

PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s, up from 133.2 t/s.

* Speed up float -> iq4_nl conversion on CUDA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
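
For reference, IQ4_NL maps each weight to the nearest entry of a fixed 16-value nonlinear table, and the commit speeds up exactly this search with Zen4/AVX2 intrinsics plus a faster CUDA float -> iq4_nl conversion. Below is a simplified scalar sketch of the lookup; the table values match ggml's kvalues_iq4nl, but the scale choice is simplified and the real code packs indices into nibbles:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// The 16 nonlinear levels used by IQ4_NL (from ggml-quants).
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Scalar reference: quantize one block of 32 floats to 4-bit
// indices into the table. The optimized Zen4/AVX2 path performs
// the same nearest-value search with SIMD.
void quantize_iq4nl_block(const float * x, uint8_t * idx, float * d) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    // Simplified scale so values span the table; ggml's actual
    // scale search is more involved.
    *d = amax > 0.0f ? amax / 127.0f : 0.0f;
    const float id = *d > 0.0f ? 1.0f / *d : 0.0f;
    for (int i = 0; i < 32; ++i) {
        const float v = x[i] * id;
        int   best     = 0;
        float best_err = std::fabs(v - kvalues_iq4nl[0]);
        for (int j = 1; j < 16; ++j) {
            const float err = std::fabs(v - kvalues_iq4nl[j]);
            if (err < best_err) { best_err = err; best = j; }
        }
        idx[i] = (uint8_t) best;
    }
}
```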