loss averaging across multiple devices #254
Conversation
@jit
def reduce(x):  # lambda expression is not supported in MindSpore
    return reduce_sum(x) / device_num  # average loss across all cards
It is better to put reduce outside the function.
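A minimal sketch of that suggestion, with illustrative names (not the PR's actual code): define the reduction helper once at module scope instead of rebuilding it inside the calling function:

from mindspore import ops
from mindspore.communication import get_group_size

reduce_sum = ops.AllReduce()   # AllReduce defaults to summing across devices
device_num = get_group_size()  # number of cards in the communication group

def loss_reduce(x):
    return reduce_sum(x) / device_num  # average loss across all cards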
Agree. BTW, is it necessary to use the jit decorator here, as we are already in graph mode and the reduce computation should be lightweight and fast? If not, we can simply use self._loss_reduce = lambda x: reduce_sum(x) / device_num.
Talked to Jun about it. He said that: 1. callbacks are always executed in native mode, and 2. ops.AllReduce() may take a noticeable amount of time in native mode due to some overhead computations in the backend, so it is generally recommended to wrap it with jit.
That said, I agree that reducing single-number tensors may be very quick, and jit could be overkill here. Maybe we can benchmark later and see if it is really necessary.
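To make the trade-off concrete, here is a hedged sketch of the two variants discussed above (using MindSpore 2.x names, where jit replaced ms_function; the communication group is assumed to be initialized):

import mindspore as ms
from mindspore import ops
from mindspore.communication import get_group_size

device_num = get_group_size()
reduce_sum = ops.AllReduce()  # sums a tensor across all devices

# Variant 1: plain lambda, executed eagerly (PyNative) inside the callback.
loss_reduce_eager = lambda x: reduce_sum(x) / device_num

# Variant 2: jit-compiled, avoiding per-call PyNative overhead of AllReduce.
@ms.jit
def loss_reduce_jit(x):
    return reduce_sum(x) / device_num  # average loss across all cards

Benchmarking both on a single scalar loss tensor would settle whether jit pays off here.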
else:
    self._loss_reduce = lambda x: x

@jit
def _reduce(self, x):
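Putting the two branches together, an illustrative callback fragment (the class name and surrounding structure are assumed, not taken from the repository):

import mindspore as ms
from mindspore import ops

class LossCallback:
    def __init__(self, distributed: bool, device_num: int = 1):
        if distributed:
            self._all_reduce = ops.AllReduce()
            self._device_num = device_num
            self._loss_reduce = self._reduce  # average across all cards
        else:
            self._loss_reduce = lambda x: x   # single card: identity

    @ms.jit
    def _reduce(self, x):
        return self._all_reduce(x) / self._device_num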
@zhtmike @SamitHuang Please check.
Not sure whether running with ms_function in a callback is a stable choice. For MS 1.9, pynative with ms_function is not as stable as in MS 2.0. If using ms_function is risky, I don't think it is worth adding jit/ms_function, considering the negligible acceleration on this one-step division computation.
Motivation
Average the loss across devices to report more meaningful training stats.