
Conversation

ikawrakow commented on Jul 31, 2024

See this discussion for rationale.

Iwan Kawrakow added 29 commits July 28, 2024 19:43
- Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.
- Quantize/dequantize, CUDA dequantize. Performance is roughly on par with q5_0.
- I cannot possibly wait for a five-minute nvcc compilation each time I touch vecdotq.cuh. Also, cmake was adding --options-file X.rsp to the nvcc compile commands, which confuses clangd, so I have turned that off.
- Performance is pathetic: 140 t/s for LLaMA-3.1-8B vs 172 t/s for iq2_xs.
- Almost on par with iq2_xs (168 t/s vs 172 t/s).
- 169.2 t/s vs 167.8 t/s before.
- Quantize/dequantize, CUDA dequantize. PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
- Slightly slower than iq3_s: 132 t/s vs 138 t/s for LLaMA-3.1-8B.
- 138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
- We get PP-512 = 180 t/s and TG-128 (4 threads) = 16.35 t/s on the Ryzen-7950X for LLaMA-3.1-8B. In comparison, iq3_s has PP-512 = 96 t/s and TG-128 = 7.6 t/s with iqk_mul_mat, and PP-512 = 28 t/s and TG-128 = 6.8 t/s in mainline llama.cpp.
- We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
- It is slow: 45.4 t/s for a 7B model vs 50 t/s for iq2_xs, or 63.3 t/s for q2_K_S.
- Quite slow: 43 t/s for a 7B model.
- PP-512 goes to 473 t/s, up from 452 t/s.
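The response-file commit above refers to CMake batching nvcc arguments into an `--options-file X.rsp` file, which hides the real flags from clangd's compile_commands.json parsing. A minimal sketch of how that can be turned off, using CMake's standard `CMAKE_CUDA_USE_RESPONSE_FILE_FOR_*` variables (the exact change in this PR may differ):

```cmake
# Sketch, not the PR's exact diff: stop CMake from writing CUDA compile
# arguments into .rsp response files, so compile_commands.json contains
# the full nvcc command line and clangd can index CUDA sources.
set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_INCLUDES  0)
set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_LIBRARIES 0)
set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_OBJECTS   0)
```

Combined with `-DCMAKE_EXPORT_COMPILE_COMMANDS=ON`, this makes iterative edits to headers like vecdotq.cuh much less painful to navigate, independent of the compile-time issue itself.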
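For readers unfamiliar with the "quantize/dequantize" commits above: ggml quant types store weights in fixed-size blocks with a shared per-block scale. The new iq* types in this PR use more elaborate encodings (codebooks/grids), but the basic round trip can be sketched with a hypothetical q8_0-style block; this is an illustration, not the PR's actual format:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical q8_0-style block: one float scale shared by 32 int8 weights.
// The iq2/iq3 types in this PR use far fewer bits per weight via codebooks;
// this sketch only shows the generic quantize/dequantize round trip.
constexpr int kBlockSize = 32;

struct BlockQ8 {
    float  scale;            // d: per-block scale factor
    int8_t qs[kBlockSize];   // quantized values in [-127, 127]
};

inline BlockQ8 quantize_block(const float *x) {
    float amax = 0.f;        // absolute maximum in the block
    for (int i = 0; i < kBlockSize; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ8 b;
    b.scale = amax / 127.f;
    const float id = b.scale > 0.f ? 1.f / b.scale : 0.f;
    for (int i = 0; i < kBlockSize; ++i) {
        const float v = std::clamp(x[i] * id, -127.f, 127.f);
        b.qs[i] = static_cast<int8_t>(std::lround(v));
    }
    return b;
}

inline void dequantize_block(const BlockQ8 &b, float *y) {
    for (int i = 0; i < kBlockSize; ++i) y[i] = b.scale * b.qs[i];
}
```

With rounding to nearest, the reconstruction error per weight is bounded by half the block scale, which is why the per-block absolute maximum (and, in the k- and iq-quants, smarter scale search) matters so much for PPL.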
ikawrakow merged commit 3d1446b into main on Aug 1, 2024