Add 3-bit K-quants #10


Merged: 4 commits into main on May 6, 2025
Conversation

@andrewkchan (Owner) commented on May 5, 2025

Adds support for llama.cpp 3-bit K-quants (Q3_K). This is a 3.4375-bit-per-weight quantization scheme which, like Q2_K, uses per-block scales together with a per-super-block scale. Quantized with Q3_K, DeepSeek-V3 shrinks from 650 GB to 269 GB. Using MHA, I get 3.2 tok/s on my test machine (AWS r6a.12xlarge, 16 threads), versus 4.01 tok/s (at higher perplexity) with Q2_K.
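
For reference, a minimal C++ sketch of the super-block layout this follows (field names taken from llama.cpp's `block_q3_K`; a plain `uint16_t` stands in for `ggml_half`). Each weight's 3-bit value is split across `qs` (low 2 bits) and `hmask` (high bit), and dequantization applies a packed 6-bit sub-block scale plus the fp16 super-block scale `d`. The struct size is where the 3.4375 bits/weight figure comes from:

```cpp
#include <cstdint>
#include <cstdio>

// llama.cpp-style Q3_K super-block: 256 weights per block.
constexpr int QK_K = 256;

struct block_q3_K {
    uint8_t hmask[QK_K / 8]; // high bit of each 3-bit quant (32 bytes)
    uint8_t qs[QK_K / 4];    // low 2 bits of each quant    (64 bytes)
    uint8_t scales[12];      // 16 sub-block scales, packed 6 bits each
    uint16_t d;              // fp16 super-block scale (ggml_half)
};
static_assert(sizeof(block_q3_K) == 110, "110 bytes per 256 weights");

int main() {
    // 110 bytes * 8 bits / 256 weights = 3.4375 bits per weight.
    printf("%g bits/weight\n", sizeof(block_q3_K) * 8.0 / QK_K);
}
```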

Example preparation + generation:

$ git clone git@hf.co:deepseek-ai/DeepSeek-V3-Base
$ python convert.py --quant q3_k v3-base-q3_k DeepSeek-V3-Base/
$ make && OMP_NUM_THREADS=16 sudo ./build/main v3-base-q3_k -i "What is a large language model?" -m c -t 0.35 -n 128 -L
loading data from file: v3-base-q3_k/shard_000.dseek
read metadata {"act_type":"silu","arch":"DeepseekV3ForCausalLM","bos_token_id":"0","dim":"7168","eos_token_id":"1","first_k_dense_replace":"3","hidden_dim":"18432","kv_lora_rank":"512","max_seq_len":"163840","moe_intermediate_size":"2048","n_active_routed":"8","n_group":"8","n_heads":"128","n_layers":"61","n_routed_experts":"256","n_shared_experts":"1","norm_eps":"1e-06","norm_topk_prob":"True","norm_type":"rmsnorm","q_lora_rank":"1536","qk_nope_head_dim":"128","qk_rope_head_dim":"64","quant":"q3_k","rope_theta":"10000","routed_scaling_factor":"2.5","scoring_func":"sigmoid","topk_group":"4","topk_method":"group_limited_greedy","use_mla":"0","v_head_dim":"128","vocab_size":"129280"}
loading data from file: v3-base-q3_k/shard_001.dseek
loading data from file: v3-base-q3_k/shard_002.dseek
loading data from file: v3-base-q3_k/shard_003.dseek
loading data from file: v3-base-q3_k/shard_004.dseek
loading data from file: v3-base-q3_k/shard_005.dseek
loading data from file: v3-base-q3_k/shard_006.dseek
loading data from file: v3-base-q3_k/shard_007.dseek
loading model with quant: Q3_K
Model active bytes with full context window: 4.32437e+10
Model active bytes with no context: 1.86878e+10
Running warmup...
Warmup complete
[<s>:0][What:3085][ is:344][ a:260][ large:3226][ language:4063][ model:2645][?:33]
Encoding stats: (8 tokens, throughput: 1.7977e+308tok/s, latency: 0s/tok, total: 0s)

 A large language model is a type of artificial intelligence that can understand and generate human language. It can learn from large amounts of text and data and can produce new text based on what it has learned. Large language models are very powerful and can be used for many different purposes, such as writing, translating, summarizing, answering questions, and more. Some examples of large language models are GPT-4, BERT, and GPT-3. These models are very advanced and can create text that sounds like human writing. They can also learn from new data and improve their skills over time. Large language models are very useful and can help us with

Generation stats:
  136 tokens
  throughput: 3.2263tok/s
  latency: 0.30996s/tok
  hydrate: 2.316s
  bandwidth: 61.598GB/s
  total: 42.154s
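
As a rough sanity check on the reported bandwidth (back-of-the-envelope arithmetic, not code from this repo): dividing the "active bytes with no context" figure above by the measured per-token latency lands close to the reported 61.6 GB/s, with the small gap expected since the KV cache grows during generation:

```cpp
#include <cstdio>

int main() {
    // Figures copied from the log above.
    double active_bytes = 1.86878e10; // bytes touched per token (no context)
    double latency_s = 0.30996;       // seconds per generated token
    printf("~%.1f GB/s\n", active_bytes / latency_s / 1e9); // ~60.3 GB/s
}
```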

@andrewkchan andrewkchan marked this pull request as ready for review May 6, 2025 12:58
@andrewkchan andrewkchan merged commit 036271f into main May 6, 2025