Skip to content

Conversation

@Nexesenex
Copy link
Contributor

Copy link
Owner

@ikawrakow ikawrakow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, you are back!

@ikawrakow ikawrakow merged commit 2ec2229 into ikawrakow:main May 18, 2025
@Nexesenex
Copy link
Contributor Author

Nexesenex commented May 18, 2025

Hey!
Yeah, you sounded the horn with those MMQ Kernels for the IQ_K quants, I waited for them for a long time. I merged in Croco your IQ quants (included the KS ones with success last year, before the rev 14 of the GGUF format broke compatibility with them, possibly due to the template change introduced in #45 )
Meanwhile, I was amusing myself merging models, among other nerdy delights.
Congrats for all the amazing developments you made, even if it's hard for me to swing between mainline and IK_Llama to feed my Croco.
Also, Turboderp switched on QTIP based quants for Exllamav3.
Things are getting exciting!

ikawrakow pushed a commit that referenced this pull request May 19, 2025
ikawrakow added a commit that referenced this pull request May 22, 2025
* Refactor iqk: WIP

* Refactor iqk: Factor out float GEMM (AVX2/AVX512)

* Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)

* Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for 1-bit quants (ABX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4

* Refactor iqk: Factor out GEMM for repacked legacy quants

* Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV

* Refactor iqk: Factor out GEMM for repacked i-quants

* Refactor iqk: GEMM kernels are refactored on AVX2/AVX512

* Refactor iqk: factor out 1-bit quants (NEON)

* Refactor iqk: factor out k-quants (NEON)

* Refactor iqk: factor out floats (NEON)

* Also iq4_xs belongs to k-quants

* Refactor iqk: factor out iqk quants (NEON)

* Refactor iqk: factor out legacy quants (NEON)

* Refactor iqk: factor out repacked legacy quants (NEON)

* Refactor iqk: factor out repacked k-quants (NEON)

* Refactor iqk: factor out repacked iqk quants (NEON)

* Refactor iqk: GEMM kernels are refactored on NEON

* Refactor iqk: FA compiles

If it works is a different story.
Current compile time: 107.3 sesonds on the Ryzen-7950X

* Refactor iqk: FA refactored (Zen4)

Compile time for the FA files is now ~21 seconds on my
Ryzen-7950X, so still slightly too long for my taste
but much better than the 142 seconds we had before.

* Adding forgotten file

* Most helpers don't need to be templates

Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS.

Compilation time drops to 14 second on the Ryzen-5975WX

* Fix bf16

* Refactor iqk: FA refactored (NEON)

* Forgotten MMQ ref and typo (#431)

* Adding forgotten iq5_k_r4

* Fix iq4_k_r4 on NEON

* Fix iq4_ks on NEON

It was broken before the refactoring (the shifts were not correctly
applied).

* Fix q8_0 on NEON

* Fix q6_0 K cache

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Nexes the Elder <124105151+Nexesenex@users.noreply.github.com>
@Nexesenex Nexesenex deleted the fix_mmq branch May 22, 2025 15:10
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request May 24, 2025
* Refactor iqk: WIP

* Refactor iqk: Factor out float GEMM (AVX2/AVX512)

* Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)

* Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for 1-bit quants (ABX2/AVX512)

* Refactor iqk: fix AVX2

* Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4

* Refactor iqk: Factor out GEMM for repacked legacy quants

* Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV

* Refactor iqk: Factor out GEMM for repacked i-quants

* Refactor iqk: GEMM kernels are refactored on AVX2/AVX512

* Refactor iqk: factor out 1-bit quants (NEON)

* Refactor iqk: factor out k-quants (NEON)

* Refactor iqk: factor out floats (NEON)

* Also iq4_xs belongs to k-quants

* Refactor iqk: factor out iqk quants (NEON)

* Refactor iqk: factor out legacy quants (NEON)

* Refactor iqk: factor out repacked legacy quants (NEON)

* Refactor iqk: factor out repacked k-quants (NEON)

* Refactor iqk: factor out repacked iqk quants (NEON)

* Refactor iqk: GEMM kernels are refactored on NEON

* Refactor iqk: FA compiles

If it works is a different story.
Current compile time: 107.3 sesonds on the Ryzen-7950X

* Refactor iqk: FA refactored (Zen4)

Compile time for the FA files is now ~21 seconds on my
Ryzen-7950X, so still slightly too long for my taste
but much better than the 142 seconds we had before.

* Adding forgotten file

* Most helpers don't need to be templates

Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS.

Compilation time drops to 14 second on the Ryzen-5975WX

* Fix bf16

* Refactor iqk: FA refactored (NEON)

* Forgotten MMQ ref and typo (ikawrakow#431)

* Adding forgotten iq5_k_r4

* Fix iq4_k_r4 on NEON

* Fix iq4_ks on NEON

It was broken before the refactoring (the shifts were not correctly
applied).

* Fix q8_0 on NEON

* Fix q6_0 K cache

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Nexes the Elder <124105151+Nexesenex@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants