Mixtral scaling: Reduce perplexity from 4.294 to 4.269 #301
base: main
Conversation
CC @younesbelkada. Not sure if this would break anything in the transformers integration. WDYT?
Thanks @casper-hansen!
Unless I am missing something, I think this should be all good for the transformers integration for now, meaning this PR is fully backward compatible with the transformers integration since it does not seem to touch any core module that we use (GEMM_Linear). If we want to upstream this into transformers, we need to think about an API for users to replace the SparseMoE layers with ScaledMixtralSparseMoeBlock.
As a sanity check, I would run basic inference with transformers after merging this PR just to be sure, but looking at the PR it does not seem to do anything that would break the transformers integration.
Right, but we do introduce a scale in the MoE layer that needs to be loaded in order to run inference. Could it be possible to add this during the loading of the linear layers? FYI, for MPT models we need the same thing, just for a different layer.
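For illustration, a minimal sketch of how the extra scale could be picked up while the checkpoint is loaded (the helper name, the `scales` attribute, and the state-dict key layout are hypothetical, not AutoAWQ's actual loading code):

```python
import torch

def load_moe_scales(model, state_dict):
    """Hypothetical helper: after the quantized linear layers are loaded,
    copy each per-expert scale tensor into its ScaledMixtralSparseMoeBlock."""
    for name, module in model.named_modules():
        if not hasattr(module, "scales"):
            continue
        key = f"{name}.scales"
        if key in state_dict:
            # Assumed shape: (num_experts, hidden_dim), one scale vector per expert.
            module.scales.data.copy_(state_dict[key].to(module.scales.dtype))
```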
The ppl improvement is really small. Did you try other scores to see if this is worth it?
This is how it should have been implemented from the start, as this is a more correct implementation of AWQ. Yes, I have tried other combinations, but they do not lead to better perplexity (at least the ones I tried).
This PR reduces the perplexity of Mixtral by introducing scaling of individual experts instead of one scale for all the experts in the MoE block.
To use this in vLLM, GGUF, or other integrations, their code needs to be updated to load the new ScaledMixtralSparseMoeBlock. The only modification needed in the inference code is right before the experts get the `current_hidden_states`:
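A minimal sketch of what that per-expert scaling could look like (the `scales` buffer name and shape, and the choice to divide rather than multiply, are assumptions and not taken from this PR; the surrounding routing logic is paraphrased from the transformers MixtralSparseMoeBlock):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledMixtralSparseMoeBlock(nn.Module):
    """Sketch only: the standard Mixtral MoE block with one scale per expert
    instead of one shared scale for the whole block."""

    def __init__(self, gate, experts, num_experts_per_tok, scales):
        super().__init__()
        self.gate = gate                        # router: Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(experts)   # expert MLPs
        self.top_k = num_experts_per_tok
        # Assumed shape: (num_experts, hidden_dim), one scale vector per expert.
        self.register_buffer("scales", scales)

    def forward(self, hidden_states):
        batch, seq_len, hidden_dim = hidden_states.shape
        hidden_states = hidden_states.view(-1, hidden_dim)

        # Standard Mixtral top-k routing.
        router_logits = self.gate(hidden_states)
        routing_weights = F.softmax(router_logits, dim=-1, dtype=torch.float)
        routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
        routing_weights = routing_weights.to(hidden_states.dtype)

        final_hidden_states = torch.zeros_like(hidden_states)
        expert_mask = F.one_hot(selected_experts, num_classes=len(self.experts)).permute(2, 1, 0)

        for expert_idx in range(len(self.experts)):
            idx, top_x = torch.where(expert_mask[expert_idx])
            if top_x.numel() == 0:
                continue
            current_state = hidden_states[top_x]
            # The only change vs. the stock block: apply this expert's own scale
            # right before the expert runs (division assumed here, mirroring how
            # AWQ folds activation scales into the weights).
            current_hidden_states = current_state / self.scales[expert_idx]
            current_hidden_states = self.experts[expert_idx](current_hidden_states)
            current_hidden_states = current_hidden_states * routing_weights[top_x, idx, None]
            final_hidden_states.index_add_(0, top_x, current_hidden_states)

        return final_hidden_states.view(batch, seq_len, hidden_dim), router_logits
```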
Idea by @Sakits, engineering & implementation by @casper-hansen