[moe] brings batch/sequence-wise load balance loss #2061
base: main
Conversation
…d seq-wise aux loss for load balance
job_config, parallel_dims=parallel_dims, ft_manager=self.ft_manager
)

self.loss_fn = functools.partial(
We can add a condition here to decide whether or not to wrap the loss for MoE. For now, all models in torchtitan only return a single output, so it's OK as is.
If we subsume this MoE loss wrapper into build_loss_fn, we can avoid adding the logic here.
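For illustration, a minimal sketch of what folding the wrapper into a builder could look like. build_moe_aware_loss_fn and the moe_enabled flag are hypothetical names, not existing torchtitan API, and this sketch simply adds the weighted aux loss rather than cancelling its numeric value as the PR does:

```python
import functools

import torch
import torch.nn.functional as F


def cross_entropy_loss(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(pred.flatten(0, 1).float(), labels.flatten(0, 1))


def moe_loss(pred, labels, loss_fn, load_balance_loss_weight: float) -> torch.Tensor:
    # If the model returned (logits, aux_loss), fold the aux loss into the task loss.
    if isinstance(pred, tuple):
        pred, load_balance_loss = pred
        return loss_fn(pred, labels) + load_balance_loss_weight * load_balance_loss
    return loss_fn(pred, labels)


def build_moe_aware_loss_fn(job_config):
    # Hypothetical builder: wrap only when the config says the model has MoE layers,
    # so dense models keep the plain loss and train.py needs no special casing.
    if getattr(job_config.model, "moe_enabled", False):
        return functools.partial(
            moe_loss,
            loss_fn=cross_entropy_loss,
            load_balance_loss_weight=job_config.model.extra_losses.load_balance_loss_weight,
        )
    return cross_entropy_loss
```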
wwwjn
left a comment
Thank you! @shuhuayu is working on a more formal review, and I have some housekeeping comments.
@dataclass
class ExtraLosses:
This section is specifically for the MoE load-balancing loss for now. Do you foresee any other loss-related params being used in this section? If not, let's make the name more descriptive and specific.
Follow-up here: should we merge these configs into the Model dataclass?
load_balance_loss_weight: float = 0
"""Weight of load balance loss"""

load_balance_coeff: float | None = 1e-3
Probably rename this to loss_free_load_balance_coeff? And IIUC, because it's loss-free, we need to set it to None if we use loss-based load balancing; otherwise it will register an optimizer hook here:
torchtitan/torchtitan/components/optimizer.py, line 411 in 58fa181:
if _should_register_moe_balancing_hook(model_parts):
I think both loss-free and loss-based load balancing are used simultaneously in DeepSeek-V3.
Yes, DeepSeek-V3 (and GLM-4.5, as far as I know) uses both.
load_balance_coeff is the name used in the current repo, and yes, maybe we should name them properly.
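To make the naming discussion concrete, a purely hypothetical config sketch (none of these exact field names exist in torchtitan today) that keeps the loss-based weight and the loss-free bias coefficient side by side, since both mechanisms can be active at once:

```python
from dataclasses import dataclass


@dataclass
class MoELoadBalance:
    """Hypothetical config section dedicated to MoE load balancing."""

    # Loss-based balancing: batch- or sequence-wise auxiliary loss.
    load_balance_loss_type: str = "sequence"
    load_balance_loss_weight: float = 0.0  # 0 disables the auxiliary loss

    # Loss-free balancing: bias update applied via an optimizer hook.
    # None disables the hook, so the two mechanisms can be toggled independently.
    load_balance_bias_coeff: float | None = 1e-3
```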
)

@staticmethod
@torch.compile(fullgraph=True)
n00b q: do we always want to compile this loss? Is it for speed purposes? Should we give users an option to control whether to compile it, like the existing "if job_config.compile.enable and "loss" in job_config.compile.components" check in loss.py?
Yep, for speedup. I don't know whether, with compile + fullgraph enabled in the config, it gets compiled twice or not (I would expect it is handled automatically).
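A sketch of how the compilation could be gated on the existing compile config instead of an unconditional decorator, mirroring the check loss.py already uses; sequence_wise_balance_loss here stands in for the (undecorated) aux-loss function and is not an existing symbol:

```python
import torch


def build_balance_loss(job_config, sequence_wise_balance_loss):
    # Compile the aux-loss function only if the user opted in, following the same
    # pattern as build_cross_entropy_loss in torchtitan/components/loss.py.
    if job_config.compile.enable and "loss" in job_config.compile.components:
        return torch.compile(sequence_wise_balance_loss, fullgraph=True)
    return sequence_wise_balance_loss
```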
def moe_loss(
    pred: tuple[torch.Tensor, torch.Tensor] | torch.Tensor,
    labels: torch.Tensor,
    loss_fn: LossFunction,
I think we could keep a consistent API with the other loss functions: take job_config as input and plug the loss in like the other loss functions in TrainSpec:
build_loss_fn=build_cross_entropy_loss,
so that we could avoid the change in train.py. WDYT?
I agree. I think we can use a new build_loss_fn for models that may have MoE. Or we can update build_cross_entropy_loss by checking whether MoE is enabled from the config here:
torchtitan/torchtitan/components/loss.py, line 29 in ad9f188:
if job_config.compile.enable and "loss" in job_config.compile.components:
You mean something like build_multiple_loss? Or do we do build_ce_and_moe_loss and build_mse_and_moe_loss?
shuhuayu
left a comment
Thanks a lot for the PR @rakkit! I made some comments here.
indices: torch.Tensor,  # Shape: (B*S, K) - Selected Expert Indices
B: int,  # Batch size
S: int,  # Sequence length (T in the paper)
top_k: int,  # K_r
The K_r here is the same as K elsewhere in this function, right? Maybe we can use the consistent notation top_k in all comments and tell people this is K_r in the DeepSeek paper. Similarly, we can use N to denote the number of routed experts and tell people this is N_r in the DeepSeek paper.
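For reference, the sequence-wise balance loss from the DeepSeek-V3 paper that these comments point at, in the paper's notation (N_r routed experts, K_r activated per token, T tokens in the sequence, s'_{i,t} the normalized routing score):

$$
\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_r} f_i \, P_i,
\qquad
f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}\bigl(\text{token } t \text{ selects expert } i\bigr),
\qquad
P_i = \frac{1}{T} \sum_{t=1}^{T} s'_{i,t}
$$

So top_k in the code corresponds to K_r, and the number of routed experts N corresponds to N_r.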
# 1. Reshape inputs to handle each sequence separately: (B, S, N)
# This ensures we calculate P_i and f_i per sequence (Eq 20 & 18).
scores_per_seq = scores.view(B, S, N)
indices_per_seq = indices.view(B, S, top_k)
This is not used afterwards:
indices_per_seq = indices.view(B, S, top_k)
# f_i = (N / (K * T)) * count_i

# Flatten the top-k dimension to count hits per sequence: (B, S*K)
flat_indices_per_seq = indices_per_seq.view(B, -1)
Suggested change:
- flat_indices_per_seq = indices_per_seq.view(B, -1)
+ batch_indices_per_seq = indices.flatten(1)
selection_counts = torch.zeros((B, N), device=scores.device, dtype=scores.dtype)
src = torch.ones_like(flat_indices_per_seq, dtype=scores.dtype)
selection_counts.scatter_add_(1, flat_indices_per_seq, src)
It seems to me we do not need to create a new src tensor here. We may consider using torch.bincount to save memory.
Suggested change:
- selection_counts = torch.zeros((B, N), device=scores.device, dtype=scores.dtype)
- src = torch.ones_like(flat_indices_per_seq, dtype=scores.dtype)
- selection_counts.scatter_add_(1, flat_indices_per_seq, src)
+ offset = (torch.arange(B, device=batch_indices_per_seq.device).unsqueeze(1) * N)
+ flat_indices = (batch_indices_per_seq + offset).reshape(-1)
+ selection_counts = torch.bincount(flat_indices, minlength=B * N).reshape(B, N)
+ selection_counts = selection_counts.to(dtype=scores.dtype)
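A small self-contained check of the two counting variants above (shapes and names follow the code under review; it only illustrates that the bincount path yields the same per-sequence expert counts as scatter_add_):

```python
import torch

B, S, top_k, N = 2, 8, 2, 4  # batch, seq len, experts per token, routed experts
indices = torch.randint(0, N, (B * S, top_k))
scores = torch.rand(B * S, N)

# Variant 1: scatter_add_ with an all-ones source tensor.
flat_indices_per_seq = indices.view(B, S, top_k).view(B, -1)
counts_scatter = torch.zeros((B, N), dtype=scores.dtype)
src = torch.ones_like(flat_indices_per_seq, dtype=scores.dtype)
counts_scatter.scatter_add_(1, flat_indices_per_seq, src)

# Variant 2: offset the per-sequence indices into one flat range and bincount.
batch_indices_per_seq = indices.view(B, S * top_k)
offset = torch.arange(B).unsqueeze(1) * N
flat = (batch_indices_per_seq + offset).reshape(-1)
counts_bincount = torch.bincount(flat, minlength=B * N).reshape(B, N).to(scores.dtype)

assert torch.equal(counts_scatter, counts_bincount)
```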
super().__init__()

num_experts = moe_args.num_experts
self.topk = moe_args.top_k
nit:
Suggested change:
- self.topk = moe_args.top_k
+ self.top_k = moe_args.top_k
load_balance_loss_weight: float = 0
"""Weight of load balance loss"""

load_balance_coeff: float | None = 1e-3
Suggested change:
- load_balance_coeff: float | None = 1e-3
+ load_balance_bias_coeff: float | None = 1e-3
losses_config = job_config.model.extra_losses
self.moe_args.load_balance_loss_type = losses_config.load_balance_loss_type
self.moe_args.load_balance_loss_weight = losses_config.load_balance_loss_weight
self.moe_args.load_balance_coeff = losses_config.load_balance_coeff
Suggested change:
- self.moe_args.load_balance_coeff = losses_config.load_balance_coeff
+ self.moe_args.load_balance_bias_coeff = losses_config.load_balance_bias_coeff
if isinstance(pred, tuple):
    pred, load_balance_loss = pred
loss = loss_fn(pred, labels)
# USE STE to make the magnitude of loss remain the same
Maybe we can be more explicit here.
Suggested change:
- # USE STE to make the magnitude of loss remain the same
+ # Add auxiliary loss to the computation graph for gradients in the backward pass,
+ # but cancel out its numeric value so the forward pass only logs language model task loss.
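For concreteness, the cancellation described in the suggested comment is usually written with detach, so the auxiliary term contributes gradients but adds zero to the reported value. A minimal standalone sketch (not the PR's exact code):

```python
import torch


def combine_losses(task_loss: torch.Tensor, load_balance_loss: torch.Tensor) -> torch.Tensor:
    # Gradients flow through load_balance_loss, but its value cancels in the
    # forward pass, so the logged number is exactly the language-model task loss.
    return task_loss + load_balance_loss - load_balance_loss.detach()
```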
)
out = out.reshape(bs, slen, dim)
return out
This is a draft PR for:
For now, it only applies to the DeepSeek model, but I can add it to all other MoE models at the end.
(Also, we don't log the aux loss, but I can add it via an optimizer hook if you want.)
The main concern is that the aux loss does not work well with PP. From what I have tested, it works well only with 1F1B, and it is broken for ZBV or interleaved 1F1B.
To test it:

CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh --model.extra_losses.load_balance_loss_weight=0.001