
Conversation

ikawrakow commented on Jul 31, 2024

See this discussion for rationale.

Iwan Kawrakow added 29 commits July 28, 2024 19:43
- Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.
- Quantize/dequantize, CUDA dequantize. Performance is roughly on par with q5_0.
- I cannot possibly wait for a five-minute nvcc compilation each time I touch vecdotq.cuh. Also, cmake was adding --options-file X.rsp to the nvcc compile commands, which confuses clangd, so I have turned that off.
- Performance is pathetic: 140 t/s for LLaMA-3.1-8B vs 172 t/s for iq2_xs.
- Almost on par with iq2_xs (168 t/s vs 172 t/s).
- 169.2 t/s vs 167.8 t/s before.
- Quantize/dequantize, CUDA dequantize. PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
- Slightly slower than iq3_s: 132 t/s vs 138 t/s for LLaMA-3.1-8B.
- 138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
- We get PP-512 = 180 t/s and TG-128 (4 threads) = 16.35 t/s on the Ryzen-7950X for LLaMA-3.1-8B. In comparison, iq3_s has PP-512 = 96 t/s and TG-128 = 7.6 t/s with iqk_mul_mat, and PP-512 = 28 t/s and TG-128 = 6.8 t/s in mainline llama.cpp.
- We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
- It is slow: 45.4 t/s for a 7B model vs 50 t/s for iq2_xs, or 63.3 t/s for q2_K_S.
- Quite slow: 43 t/s for a 7B model.
- PP-512 goes to 473 t/s, up from 452 t/s.
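The response-file commit above refers to CMake batching nvcc arguments into an `--options-file X.rsp` file, which hides the real flags from clangd's compile_commands.json parsing. A minimal sketch of how that can be turned off, using CMake's standard `CMAKE_CUDA_USE_RESPONSE_FILE_FOR_*` variables (the exact change in this PR may differ):

```cmake
# Sketch, not the PR's exact diff: stop CMake from writing CUDA compile
# arguments into .rsp response files, so compile_commands.json contains
# the full nvcc command line and clangd can index CUDA sources.
set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_INCLUDES  0)
set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_LIBRARIES 0)
set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_OBJECTS   0)
```

Combined with `-DCMAKE_EXPORT_COMPILE_COMMANDS=ON`, this makes iterative edits to headers like vecdotq.cuh much less painful to navigate, independent of the compile-time issue itself.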
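For readers unfamiliar with the "quantize/dequantize" commits above: ggml quant types store weights in fixed-size blocks with a shared per-block scale. The new iq* types in this PR use more elaborate encodings (codebooks/grids), but the basic round trip can be sketched with a hypothetical q8_0-style block; this is an illustration, not the PR's actual format:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical q8_0-style block: one float scale shared by 32 int8 weights.
// The iq2/iq3 types in this PR use far fewer bits per weight via codebooks;
// this sketch only shows the generic quantize/dequantize round trip.
constexpr int kBlockSize = 32;

struct BlockQ8 {
    float  scale;            // d: per-block scale factor
    int8_t qs[kBlockSize];   // quantized values in [-127, 127]
};

inline BlockQ8 quantize_block(const float *x) {
    float amax = 0.f;        // absolute maximum in the block
    for (int i = 0; i < kBlockSize; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ8 b;
    b.scale = amax / 127.f;
    const float id = b.scale > 0.f ? 1.f / b.scale : 0.f;
    for (int i = 0; i < kBlockSize; ++i) {
        const float v = std::clamp(x[i] * id, -127.f, 127.f);
        b.qs[i] = static_cast<int8_t>(std::lround(v));
    }
    return b;
}

inline void dequantize_block(const BlockQ8 &b, float *y) {
    for (int i = 0; i < kBlockSize; ++i) y[i] = b.scale * b.qs[i];
}
```

With rounding to nearest, the reconstruction error per weight is bounded by half the block scale, which is why the per-block absolute maximum (and, in the k- and iq-quants, smarter scale search) matters so much for PPL.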
ikawrakow merged commit 3d1446b into main on Aug 1, 2024