Optimize minimax m2 modelling forward pass #2176
Open
@Qubitium
With these optimizations (made with gemini-pro/glm-4.6-q3/gpt) I was able to proceed with quantizing MiniMax-M2 to int4 g32 (at least 5 layers done and still proceeding), using 1024 c4/en + 512 gsm/arc/humaneval/alpaca samples, ~496K tokens total, on 8x 3090 GPUs. This was not on the latest GPTQModel main, but on a fork from before the data-parallel changes plus a few cherry-picks (branch: https://github.com/avtc/GPTQModel/tree/feature/v4-minimax-m2-chery). Without these optimizations, the maximum number of samples that could forward-pass the attention module on the same branch was 32.
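For reference, the run was roughly equivalent to the following sketch. This is only a minimal illustration assuming the README-style GPTQModel API (`QuantizeConfig`, `GPTQModel.load`, `model.quantize`, `model.save`); the calibration below is simplified to the c4 portion and the paths/sample counts are illustrative, not the exact command I used on the fork branch:

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Local folder that also contains the optimized modelling .py from this PR.
model_dir = "ModelCloud/MiniMax-M2-BF16"

# int4 with group_size=32, matching the int4g32 target mentioned above.
quant_config = QuantizeConfig(bits=4, group_size=32)

# Illustrative calibration set: 1024 c4/en rows (the real run also mixed in
# gsm/arc/humaneval/alpaca samples for ~496K tokens total).
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

model = GPTQModel.load(model_dir, quant_config)
model.quantize(calibration, batch_size=1)
model.save(model_dir + "-int4-g32")  # illustrative output path
```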
(The latest main branch consumes more VRAM, so after the forward pass there is no room left for quantization: there are warnings that the Hessian inverse will run on CPU, followed by CUDA OOM. I will check a bit later whether excluding device cuda:0 from forward/quantization helps.)
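One simple way I plan to try for excluding cuda:0 (just an assumption on my side, not yet verified on this model) is to hide it from the process before torch initializes:

```python
import os

# Hide GPU 0 so forward/quantization only see GPUs 1-7.
# Must be set before torch/GPTQModel are imported, otherwise CUDA is already initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4,5,6,7"

import torch  # noqa: E402

print(torch.cuda.device_count())  # expect 7
```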
The modelling .py file should be placed into the ModelCloud/MiniMax-M2-BF16 model folder prior to quantization.
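For completeness, a tiny sketch of that placement step (the source filename here is a placeholder for the modelling file in this PR):

```python
import shutil

# Placeholder filename: copy the optimized modelling file from this PR into the model folder.
shutil.copy("modeling_minimax_m2_optimized.py", "ModelCloud/MiniMax-M2-BF16/")
```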
I have compared the weight loss for the quantized experts, and it is identical or very close to the original modelling .py when the number of tokens routed to an expert is the same. Small deviations of 1-2 tokens between experts do happen; I don't know whether that is caused by the optimizations or expected. To compare, I used 32 samples for both the original and the optimized version and checked the auto-generated logs.
For example, original:

{ "process": "gptq", "layer": 0, "module": "block_sparse_moe.experts.4.w1", "loss": "0.0008356524", "samples": "1053", "damp": "0.10000", "time": "2.121", "fwd_time": "6.724", "(v)ram": "8400.74MB, 2784.61MB", "dynamic": null }
{ "process": "gptq", "layer": 0, "module": "block_sparse_moe.experts.2.w1", "loss": "0.3386714458", "samples": "3", "damp": "0.10000", "time": "2.192", "fwd_time": "6.724", "(v)ram": "8400.74MB, 2793.61MB", "dynamic": null }

optimized:

{ "process": "gptq", "layer": 0, "module": "block_sparse_moe.experts.4.w1", "loss": "0.0008337825", "samples": "1055", "damp": "0.10000", "time": "2.048", "fwd_time": "6.663", "(v)ram": "8389.34MB, 2810.38MB", "dynamic": null }
{ "process": "gptq", "layer": 0, "module": "block_sparse_moe.experts.2.w1", "loss": "0.3386579355", "samples": "3", "damp": "0.10000", "time": "2.140", "fwd_time": "6.663", "(v)ram": "8398.34MB, 2947.71MB", "dynamic": null }

As I am not a Python/LLM/torch/CUDA dev, I cannot validate all of these changes as correct, but one of the optimizations led to ~10x higher losses (I reverted it), so I think comparing weight losses is a valid sanity check.
Please review and feel free to use/adjust/complete.