
Conversation

@avtc (Contributor) commented Nov 5, 2025

@Qubitium
With these optimizations (made by gemini-pro/glm-4.6-q3/gpt) I was able to proceed with quantizing MiniMax-M2 to int4g32 (at least 5 layers done and counting) using 1024 c4/en samples plus 512 gsm/arc/humaneval/alpaca samples, ~496K tokens in total, on 8x3090 GPUs. This was not on the latest GPTQModel main but on a fork from before data-parallel plus a few cherry-picks (branch: https://github.com/avtc/GPTQModel/tree/feature/v4-minimax-m2-chery).
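For reference, a minimal sketch of the kind of run described above, assuming GPTQModel's standard QuantizeConfig/load/quantize API and a plain list-of-strings calibration set; the paths, dataset mix, and batch size here are assumptions, not the exact script used for this run:

```python
# Hedged sketch only -- paths, dataset mix and API details are assumptions,
# not the exact script used for the run described above.
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_path = "ModelCloud/MiniMax-M2-BF16"   # local folder containing the custom modelling .py
quant_path = "MiniMax-M2-int4g32"

# int4 with group size 32, matching the int4g32 target mentioned above
quant_config = QuantizeConfig(bits=4, group_size=32)

# toy calibration set; the real run used ~1024 c4/en + 512 gsm/arc/humaneval/alpaca samples
calibration = [
    row["text"]
    for row in load_dataset("allenai/c4", "en", split="train", streaming=True).take(1024)
]

model = GPTQModel.load(model_path, quant_config, trust_remote_code=True)
model.quantize(calibration, batch_size=1)
model.save(quant_path)
```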

Without these optimizations, the maximum number of samples that could complete a forward pass through the attention module was 32 on the same branch.
(The latest main branch consumes more VRAM, so after the forward pass there is no room left for quantization: there are errors that the Hessian inverse will run on the CPU, followed by CUDA OOM. I will check a bit later whether excluding device cuda:0 from forward/quantization helps.)
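One simple way to try that (an assumption on my side, not something already tested here) is to hide GPU 0 from the process before torch initializes:

```python
# Hedged example: hide physical GPU 0 so forward/quantization only see GPUs 1-7.
# Must run before torch/CUDA is initialized.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4,5,6,7"

import torch
print(torch.cuda.device_count())  # 7 -- cuda:0 now maps to physical GPU 1
```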

The modelling .py file should be placed into the ModelCloud/MiniMax-M2-BF16 model folder prior to quantization.

I have compared the weight loss for the quantized experts, and it is identical or very close to the original modelling .py when the number of tokens per expert is the same. There are small deviations of 1-2 in the token counts between experts; I don't know whether that is caused by the optimizations or is expected. To compare, I used 32 samples with both the original and the optimized version and checked the auto-generated logs.
For example, original:

{
    "process": "gptq",
    "layer": 0,
    "module": "block_sparse_moe.experts.4.w1",
    "loss": "0.0008356524",
    "samples": "1053",
    "damp": "0.10000",
    "time": "2.121",
    "fwd_time": "6.724",
    "(v)ram": "8400.74MB, 2784.61MB",
    "dynamic": null
}
{
    "process": "gptq",
    "layer": 0,
    "module": "block_sparse_moe.experts.2.w1",
    "loss": "0.3386714458",
    "samples": "3",
    "damp": "0.10000",
    "time": "2.192",
    "fwd_time": "6.724",
    "(v)ram": "8400.74MB, 2793.61MB",
    "dynamic": null
}

optimized:

{
    "process": "gptq",
    "layer": 0,
    "module": "block_sparse_moe.experts.4.w1",
    "loss": "0.0008337825",
    "samples": "1055",
    "damp": "0.10000",
    "time": "2.048",
    "fwd_time": "6.663",
    "(v)ram": "8389.34MB, 2810.38MB",
    "dynamic": null
}
{
    "process": "gptq",
    "layer": 0,
    "module": "block_sparse_moe.experts.2.w1",
    "loss": "0.3386579355",
    "samples": "3",
    "damp": "0.10000",
    "time": "2.140",
    "fwd_time": "6.663",
    "(v)ram": "8398.34MB, 2947.71MB",
    "dynamic": null
}

As I am not a Python/LLM/torch/CUDA dev, I cannot validate that all of these changes are correct, but one of the optimizations led to 10x higher losses (I reverted it), so I think comparing weight losses is a valid sanity check.
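For anyone who wants to reproduce the comparison, here is a small sketch that diffs per-module loss between two such logs. The log file names are hypothetical, and it assumes each log is a sequence of JSON objects like the ones quoted above:

```python
# Hedged sketch: compare per-module gptq loss between two quantization logs.
# File names and the exact log layout are assumptions.
import json
import re

def load_entries(path):
    """Parse concatenated JSON objects like the snippets quoted above."""
    text = open(path).read()
    return [json.loads(m) for m in re.findall(r"\{.*?\}", text, flags=re.DOTALL)]

def by_module(entries):
    return {(e["layer"], e["module"]): e for e in entries if e.get("process") == "gptq"}

orig = by_module(load_entries("log_original.json"))
opt = by_module(load_entries("log_optimized.json"))

for key in sorted(orig.keys() & opt.keys()):
    o, p = orig[key], opt[key]
    delta = abs(float(o["loss"]) - float(p["loss"]))
    note = "" if o["samples"] == p["samples"] else f"  (samples {o['samples']} vs {p['samples']})"
    print(f"layer {key[0]} {key[1]}: |delta loss| = {delta:.3e}{note}")
```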

Please review and feel free to use/adjust/complete.

@Qubitium (Collaborator) commented Nov 6, 2025

@avtc I will take a look. It does a lot of in-place tensor mutations, which reduces the allocation of temporary tensors, which is great!

Make sure that, with or without the patch, the "error_loss" from GPTQ or AWQ quantization is exactly the same for the first 2 layers. This is just a quick way to verify that the output of the module layer forward is the same before/after the PR. Please get the AI to generate a unit test and put it into the tests folder. The unit test should just init a random MiniMax layer with random values for all the modules (or you can have the unit test load the first layer of the real BF16 model), then give it random input and make sure the output tensor values are the same before/after the PR. This ensures the changes in this PR are deterministic.
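A possible shape for that test, as a hedged sketch only: the module paths modeling_minimax_m2_orig / modeling_minimax_m2_opt, the MiniMaxM2Config / MiniMaxM2DecoderLayer names, and the config kwargs are placeholders for whatever the before/after modelling files actually expose.

```python
# tests/test_minimax_m2_determinism.py -- hedged sketch, module/class names are placeholders.
import torch

from modeling_minimax_m2_orig import MiniMaxM2Config as ConfigA, MiniMaxM2DecoderLayer as LayerA
from modeling_minimax_m2_opt import MiniMaxM2Config as ConfigB, MiniMaxM2DecoderLayer as LayerB


def build_layer(layer_cls, config_cls, seed=0):
    torch.manual_seed(seed)  # identical random init for both variants
    config = config_cls(hidden_size=256, intermediate_size=512,
                        num_attention_heads=8, num_experts=8, num_experts_per_tok=2)
    return layer_cls(config, layer_idx=0).eval()


def test_layer_forward_is_unchanged():
    layer_a = build_layer(LayerA, ConfigA)
    layer_b = build_layer(LayerB, ConfigB)
    layer_b.load_state_dict(layer_a.state_dict())  # same weights for both variants

    torch.manual_seed(1)
    hidden = torch.randn(2, 16, 256)  # (batch, seq_len, hidden_size)
    with torch.no_grad():
        # depending on the real layer signature, attention_mask / position ids
        # / position embeddings may also need to be passed here
        out_a = layer_a(hidden)[0]
        out_b = layer_b(hidden)[0]

    torch.testing.assert_close(out_a, out_b, rtol=0, atol=0)  # must be bit-identical
```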

@Qubitium (Collaborator) commented Nov 6, 2025

@avtc huggingface/transformers#42028

The official MiniMax M2 PR has been created, so we should use the official code instead.
