Add 3-bit K-quants #10


Merged: 4 commits into main on May 6, 2025
Conversation

@andrewkchan (Owner) commented on May 5, 2025

Adds support for llama.cpp 3-bit K-quants (Q3_K). This is a 3.4375-bit-per-weight quantization scheme which, like Q2_K, uses per-block scales together with a per-super-block scale. Quantized with Q3_K, DeepSeek-V3 shrinks from 650 GB to 269 GB. Using MHA, I get 3.2 tok/s on my test machine (AWS r6a.12xlarge, 16 threads), versus 4.01 tok/s (at higher perplexity) with Q2_K.
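
For reference, a minimal C++ sketch of the super-block layout this follows (field names taken from llama.cpp's `block_q3_K`; a plain `uint16_t` stands in for `ggml_half`). Each weight's 3-bit value is split across `qs` (low 2 bits) and `hmask` (high bit), and dequantization applies a packed 6-bit sub-block scale plus the fp16 super-block scale `d`. The struct size is where the 3.4375 bits/weight figure comes from:

```cpp
#include <cstdint>
#include <cstdio>

// llama.cpp-style Q3_K super-block: 256 weights per block.
constexpr int QK_K = 256;

struct block_q3_K {
    uint8_t hmask[QK_K / 8]; // high bit of each 3-bit quant (32 bytes)
    uint8_t qs[QK_K / 4];    // low 2 bits of each quant    (64 bytes)
    uint8_t scales[12];      // 16 sub-block scales, packed 6 bits each
    uint16_t d;              // fp16 super-block scale (ggml_half)
};
static_assert(sizeof(block_q3_K) == 110, "110 bytes per 256 weights");

int main() {
    // 110 bytes * 8 bits / 256 weights = 3.4375 bits per weight.
    printf("%g bits/weight\n", sizeof(block_q3_K) * 8.0 / QK_K);
}
```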

Example preparation + generation:

$ git clone git@hf.co:deepseek-ai/DeepSeek-V3-Base
$ python convert.py --quant q3_k v3-base-q3_k DeepSeek-V3-Base/
$ make && OMP_NUM_THREADS=16 sudo ./build/main v3-base-q3_k -i "What is a large language model?" -m c -t 0.35 -n 128 -L
loading data from file: v3-base-q3_k/shard_000.dseek
read metadata {"act_type":"silu","arch":"DeepseekV3ForCausalLM","bos_token_id":"0","dim":"7168","eos_token_id":"1","first_k_dense_replace":"3","hidden_dim":"18432","kv_lora_rank":"512","max_seq_len":"163840","moe_intermediate_size":"2048","n_active_routed":"8","n_group":"8","n_heads":"128","n_layers":"61","n_routed_experts":"256","n_shared_experts":"1","norm_eps":"1e-06","norm_topk_prob":"True","norm_type":"rmsnorm","q_lora_rank":"1536","qk_nope_head_dim":"128","qk_rope_head_dim":"64","quant":"q3_k","rope_theta":"10000","routed_scaling_factor":"2.5","scoring_func":"sigmoid","topk_group":"4","topk_method":"group_limited_greedy","use_mla":"0","v_head_dim":"128","vocab_size":"129280"}
loading data from file: v3-base-q3_k/shard_001.dseek
loading data from file: v3-base-q3_k/shard_002.dseek
loading data from file: v3-base-q3_k/shard_003.dseek
loading data from file: v3-base-q3_k/shard_004.dseek
loading data from file: v3-base-q3_k/shard_005.dseek
loading data from file: v3-base-q3_k/shard_006.dseek
loading data from file: v3-base-q3_k/shard_007.dseek
loading model with quant: Q3_K
Model active bytes with full context window: 4.32437e+10
Model active bytes with no context: 1.86878e+10
Running warmup...
Warmup complete
[<s>:0][What:3085][ is:344][ a:260][ large:3226][ language:4063][ model:2645][?:33]
Encoding stats: (8 tokens, throughput: 1.7977e+308tok/s, latency: 0s/tok, total: 0s)

 A large language model is a type of artificial intelligence that can understand and generate human language. It can learn from large amounts of text and data and can produce new text based on what it has learned. Large language models are very powerful and can be used for many different purposes, such as writing, translating, summarizing, answering questions, and more. Some examples of large language models are GPT-4, BERT, and GPT-3. These models are very advanced and can create text that sounds like human writing. They can also learn from new data and improve their skills over time. Large language models are very useful and can help us with

Generation stats:
  136 tokens
  throughput: 3.2263tok/s
  latency: 0.30996s/tok
  hydrate: 2.316s
  bandwidth: 61.598GB/s
  total: 42.154s
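
As a rough sanity check on the reported bandwidth (back-of-the-envelope arithmetic, not code from this repo): dividing the "active bytes with no context" figure above by the measured per-token latency lands close to the reported 61.6 GB/s, with the small gap expected since the KV cache grows during generation:

```cpp
#include <cstdio>

int main() {
    // Figures copied from the log above.
    double active_bytes = 1.86878e10; // bytes touched per token (no context)
    double latency_s = 0.30996;       // seconds per generated token
    printf("~%.1f GB/s\n", active_bytes / latency_s / 1e9); // ~60.3 GB/s
}
```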

@andrewkchan andrewkchan marked this pull request as ready for review May 6, 2025 12:58
@andrewkchan andrewkchan merged commit 036271f into main May 6, 2025