Adding IQ2_K, IQ3_K and IQ5_K #7
     Merged
            
            
          Conversation
  
    
    
  
  
    
Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.
Quantize/dequantize, CUDA dequantize
Performance is roughly on par with q5_0.
I cannot possibly wait for a 5-minute nvcc compilation each time I touch vecdotq.cuh. Also, CMake was adding --options-file X.rsp to the nvcc compile commands, which confuses clangd, so I have turned that off.
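A minimal sketch of how the response-file behavior could be disabled, assuming CMake's standard `CMAKE_<LANG>_USE_RESPONSE_FILE_FOR_*` cache variables; the actual change in this PR may differ:

```cmake
# Hypothetical config fragment: stop CMake from passing --options-file X.rsp
# to nvcc, so clangd sees the full compile command in compile_commands.json.
set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_INCLUDES  0)
set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_LIBRARIES 0)
set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_OBJECTS   0)
```

With response files off, the expanded flags appear inline in the compilation database, which is what clangd parses.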
Performance is pathetic: 140 t/s for LLaMA-3.1-8B vs 172 t/s for iq2_xs.
Almost on par with iq2_xs (168 t/s vs 172 t/s).
169.2 t/s vs 167.8 t/s before.
Quantize/dequantize, CUDA dequantize. PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
Slightly slower than iq3_s - 132 t/s vs 138 t/s for LLaMA-3.1-8B.
138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
We get PP-512 = 180 t/s and TG-128 (4 threads) = 16.35 t/s on the Ryzen-7950X for LLaMA-3.1-8B. In comparison, iq3_s has PP-512 = 96 t/s and TG-128 = 7.6 t/s with iqk_mul_mat, and PP-512 = 28 t/s and TG-128 = 6.8 t/s in mainline llama.cpp.
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
It is slow: 45.4 t/s for a 7B model vs 50 t/s for iq2_xs, or 63.3 t/s for q2_K_S.
Quite slow: 43 t/s for a 7B model.
PP-512 goes to 473 t/s up from 452 t/s.
      
        
      
      
  
  
    
Nexesenex pushed a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Oct 26, 2025:
…3-ik/check_up_gate_fmoe Revert "Revert "Revert "Check if ffn_up and ffn_gate are of the same type before using fmoe"""
  
  
    
  
    
See this discussion for rationale.