
Conversation

@avtc (Contributor) commented Nov 5, 2025

@Qubitium
With these optimizations (made by gemini-pro/glm-4.6-q3/gpt) I was able to proceed with quantizing MiniMax-M2 to int4g32 (at least 5 layers done and counting) using 1024 c4/en samples plus 512 gsm/arc/humaneval/alpaca samples, ~496K tokens in total, on 8x3090 GPUs. This was not on the latest GPTQModel main but on a fork from before data-parallel plus a few cherry-picks (branch: https://github.com/avtc/GPTQModel/tree/feature/v4-minimax-m2-chery).
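For reference, a minimal sketch of the kind of run described above, assuming GPTQModel's standard QuantizeConfig/load/quantize API and a plain list-of-strings calibration set; the paths, dataset mix, and batch size here are assumptions, not the exact script used for this run:

```python
# Hedged sketch only -- paths, dataset mix and API details are assumptions,
# not the exact script used for the run described above.
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_path = "ModelCloud/MiniMax-M2-BF16"   # local folder containing the custom modelling .py
quant_path = "MiniMax-M2-int4g32"

# int4 with group size 32, matching the int4g32 target mentioned above
quant_config = QuantizeConfig(bits=4, group_size=32)

# toy calibration set; the real run used ~1024 c4/en + 512 gsm/arc/humaneval/alpaca samples
calibration = [
    row["text"]
    for row in load_dataset("allenai/c4", "en", split="train", streaming=True).take(1024)
]

model = GPTQModel.load(model_path, quant_config, trust_remote_code=True)
model.quantize(calibration, batch_size=1)
model.save(quant_path)
```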

Without these optimizations, the maximum number of samples that could complete a forward pass through the attention module was 32 on the same branch.
(The latest main branch consumes more VRAM, so after the forward pass there is no room left for quantization: there are errors that the Hessian inverse will run on the CPU, followed by CUDA OOM. I will check a bit later whether excluding device cuda:0 from forward/quantization helps.)
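One simple way to try that (an assumption on my side, not something already tested here) is to hide GPU 0 from the process before torch initializes:

```python
# Hedged example: hide physical GPU 0 so forward/quantization only see GPUs 1-7.
# Must run before torch/CUDA is initialized.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4,5,6,7"

import torch
print(torch.cuda.device_count())  # 7 -- cuda:0 now maps to physical GPU 1
```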

The modelling .py file should be placed into the ModelCloud/MiniMax-M2-BF16 model folder prior to quantization.

I have compared the weight loss for the quantized experts, and it is identical or very close to the original modelling .py when the number of tokens per expert is the same. There are small deviations of 1-2 in the token counts between experts; I don't know whether that is caused by the optimizations or is expected. To compare, I used 32 samples with both the original and the optimized version and checked the auto-generated logs.
For example, original:

{
    "process": "gptq",
    "layer": 0,
    "module": "block_sparse_moe.experts.4.w1",
    "loss": "0.0008356524",
    "samples": "1053",
    "damp": "0.10000",
    "time": "2.121",
    "fwd_time": "6.724",
    "(v)ram": "8400.74MB, 2784.61MB",
    "dynamic": null
}
{
    "process": "gptq",
    "layer": 0,
    "module": "block_sparse_moe.experts.2.w1",
    "loss": "0.3386714458",
    "samples": "3",
    "damp": "0.10000",
    "time": "2.192",
    "fwd_time": "6.724",
    "(v)ram": "8400.74MB, 2793.61MB",
    "dynamic": null
}

optimized:

{
    "process": "gptq",
    "layer": 0,
    "module": "block_sparse_moe.experts.4.w1",
    "loss": "0.0008337825",
    "samples": "1055",
    "damp": "0.10000",
    "time": "2.048",
    "fwd_time": "6.663",
    "(v)ram": "8389.34MB, 2810.38MB",
    "dynamic": null
}
{
    "process": "gptq",
    "layer": 0,
    "module": "block_sparse_moe.experts.2.w1",
    "loss": "0.3386579355",
    "samples": "3",
    "damp": "0.10000",
    "time": "2.140",
    "fwd_time": "6.663",
    "(v)ram": "8398.34MB, 2947.71MB",
    "dynamic": null
}

As I am not a Python/LLM/torch/CUDA dev, I cannot validate that all of these changes are correct, but one of the optimizations led to 10x higher losses (I reverted it), so I think comparing weight losses is a valid sanity check.
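For anyone who wants to reproduce the comparison, here is a small sketch that diffs per-module loss between two such logs. The log file names are hypothetical, and it assumes each log is a sequence of JSON objects like the ones quoted above:

```python
# Hedged sketch: compare per-module gptq loss between two quantization logs.
# File names and the exact log layout are assumptions.
import json
import re

def load_entries(path):
    """Parse concatenated JSON objects like the snippets quoted above."""
    text = open(path).read()
    return [json.loads(m) for m in re.findall(r"\{.*?\}", text, flags=re.DOTALL)]

def by_module(entries):
    return {(e["layer"], e["module"]): e for e in entries if e.get("process") == "gptq"}

orig = by_module(load_entries("log_original.json"))
opt = by_module(load_entries("log_optimized.json"))

for key in sorted(orig.keys() & opt.keys()):
    o, p = orig[key], opt[key]
    delta = abs(float(o["loss"]) - float(p["loss"]))
    note = "" if o["samples"] == p["samples"] else f"  (samples {o['samples']} vs {p['samples']})"
    print(f"layer {key[0]} {key[1]}: |delta loss| = {delta:.3e}{note}")
```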

Please review and feel free to use/adjust/complete.

@Qubitium (Collaborator) commented Nov 6, 2025

@avtc I will take a look. It does a lot of in-place tensor mutations, which reduces the allocation of temporary tensors, which is great!

Make sure that, with or without the patch, the "error_loss" from GPTQ or AWQ quantization is exactly the same for the first 2 layers. This is just a quick way to verify that the output of the module layer forward is the same before/after the PR. Please get the AI to generate a unit test and put it into the tests folder. The unit test should just init a random MiniMax layer with random values for all the modules (or you can have the unit test load the first layer of the real BF16 model), then give it random input and make sure the output tensor values are the same before/after the PR. This ensures the changes in this PR are deterministic.
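A possible shape for that test, as a hedged sketch only: the module paths modeling_minimax_m2_orig / modeling_minimax_m2_opt, the MiniMaxM2Config / MiniMaxM2DecoderLayer names, and the config kwargs are placeholders for whatever the before/after modelling files actually expose.

```python
# tests/test_minimax_m2_determinism.py -- hedged sketch, module/class names are placeholders.
import torch

from modeling_minimax_m2_orig import MiniMaxM2Config as ConfigA, MiniMaxM2DecoderLayer as LayerA
from modeling_minimax_m2_opt import MiniMaxM2Config as ConfigB, MiniMaxM2DecoderLayer as LayerB


def build_layer(layer_cls, config_cls, seed=0):
    torch.manual_seed(seed)  # identical random init for both variants
    config = config_cls(hidden_size=256, intermediate_size=512,
                        num_attention_heads=8, num_experts=8, num_experts_per_tok=2)
    return layer_cls(config, layer_idx=0).eval()


def test_layer_forward_is_unchanged():
    layer_a = build_layer(LayerA, ConfigA)
    layer_b = build_layer(LayerB, ConfigB)
    layer_b.load_state_dict(layer_a.state_dict())  # same weights for both variants

    torch.manual_seed(1)
    hidden = torch.randn(2, 16, 256)  # (batch, seq_len, hidden_size)
    with torch.no_grad():
        # depending on the real layer signature, attention_mask / position ids
        # / position embeddings may also need to be passed here
        out_a = layer_a(hidden)[0]
        out_b = layer_b(hidden)[0]

    torch.testing.assert_close(out_a, out_b, rtol=0, atol=0)  # must be bit-identical
```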

@Qubitium (Collaborator) commented Nov 6, 2025

@avtc huggingface/transformers#42028

The official MiniMax M2 PR has been created, so we should use the official code instead.
