pipe engine _aggregate_total_loss: more efficient loss concatenation (#4327)

* _aggregate_total_loss: more efficient loss concatenation

Optimize the _aggregate_total_loss function to remove the dependency on
copying from device to host and back to device. This reduces the runtime
on the host.

* Fix the if/else block in which the optimization should take place

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
nelyahu and tjruwase authored Oct 23, 2023
1 parent 0f2338f commit a02de22
Showing 1 changed file with 1 addition and 1 deletion: deepspeed/runtime/pipe/engine.py
@@ -549,7 +549,7 @@ def _aggregate_total_loss(self):
                 agg_loss /= self.dp_world_size
 
             assert self.global_rank in self.grid.pp_group
-            losses = torch.Tensor([self.dp_group_loss, agg_loss]).to(self.device)
+            losses = torch.stack([self.dp_group_loss, agg_loss])
             if self.is_pipe_parallel:
                 dist.broadcast(tensor=losses, src=self.global_rank, group=self.mpu.get_pipe_parallel_group())
             else:
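
For context, here is a minimal sketch (not part of the commit) of the behavior the change exploits. When the scalar losses already live on the accelerator, the legacy torch.Tensor constructor reads their values back to the host, and the trailing .to() then copies the result to the device again; torch.stack builds the combined tensor directly on the device. The loss values and the cuda/cpu fallback below are illustrative assumptions, not taken from the commit.

    # Illustrative sketch, not from the commit: contrast the old and new ways
    # of packing two scalar losses into a single tensor for broadcast.
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Two scalar losses already resident on the device, as in the pipe engine.
    dp_group_loss = torch.tensor(0.25, device=device)
    agg_loss = torch.tensor(0.20, device=device)

    # Old path: torch.Tensor(list) materializes the values on the host
    # (forcing a device-to-host sync to read each scalar), then .to(device)
    # copies the new tensor back to the device.
    losses_old = torch.Tensor([dp_group_loss, agg_loss]).to(device)

    # New path: torch.stack combines the scalars into a 1-D tensor directly
    # on their device; no host round-trip is involved.
    losses_new = torch.stack([dp_group_loss, agg_loss])

    assert torch.allclose(losses_old, losses_new)

A side effect worth noting: torch.stack preserves the inputs' dtype (a half-precision loss stays half precision), whereas torch.Tensor always produced a float32 tensor.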
