
support save&load of fsdp_optim_state #16

Open

hanwen-sun wants to merge 48 commits into main

Conversation

@hanwen-sun (Contributor) commented Sep 3, 2024

What this PR does:

  1. Support flatten (including padding before sharding) and unflatten of the full_optim_state_dict save and load, covered by unit tests.
  2. Support save and load of the shard_optim_state_dict (see the usage sketch below).

TODO:

  1. Test the memory usage when checkpointing a 70B model.
  2. shard_param_on_dim_0(?)
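
For context, here is a minimal usage sketch of the feature. The method names (full_optim_state_dict, load_optim_state_dict) and the rank0_only flag are assumptions modeled on this PR's discussion and on PyTorch FSDP's API, not a confirmed torchacc interface:

```python
# Illustrative sketch only: method names and flags are assumptions
# modeled on PyTorch FSDP and may not match torchacc's exact API.
import torch

def save_optim_state(model, optim, path, rank):
    # Gather the flattened, sharded optimizer state into a full
    # (unflattened) state dict; with rank0_only=True, only rank 0
    # materializes the result.
    full_osd = model.full_optim_state_dict(optim, rank0_only=True)
    if rank == 0:
        torch.save(full_osd, path)

def load_optim_state(model, optim, path, rank):
    # Only rank 0 needs the real state dict; other ranks may pass
    # None, since rank 0's state is broadcast to them (see the
    # review discussion below).
    full_osd = torch.load(path) if rank == 0 else None
    model.load_optim_state_dict(full_osd)
```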

Quoted code context:

    We judge automatically whether the optim_state_dict is sharded.

    Args:
        optim_state_dict (Dict[str, Any]): The optimizer states to be loaded.

Reviewer (Contributor):

Please add some comments explaining what optim_state_dict the other ranks pass when rank0_only is True.

hanwen-sun (Author):

done

Reviewer (Contributor):

There is still no explanation of what optim_state_dict the other ranks pass after the optimizer state is loaded on rank 0. We could also let users pass None instead of an empty dictionary, since users may not know what the keys are.

hanwen-sun (Author):

I added an explanation: when rank0_only is specified, we broadcast rank 0's optim_state_dict to the other ranks, so it does not matter what the other ranks pass in; in particular, they can pass None (see the sketch below).
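
A conceptual sketch of that broadcast pattern, using torch.distributed.broadcast_object_list; torchacc's actual mechanism (e.g. XLA collectives) may differ:

```python
# Conceptual sketch only; torchacc's real broadcast path may differ.
# Assumes an initialized torch.distributed process group.
import torch.distributed as dist

def _broadcast_optim_state(optim_state_dict, rank):
    # Non-zero ranks may pass None; after the broadcast every rank
    # holds a copy of rank 0's optimizer state dict.
    obj_list = [optim_state_dict if rank == 0 else None]
    dist.broadcast_object_list(obj_list, src=0)
    return obj_list[0]
```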

Quoted code context:

        self.model which is sharded.
        """
        # for sharded optim_state, return it directly
        if 'shard_metadata' in optim_state_dict:

hanwen-sun (Author):

I now compare the world_size in the stored shard_metadata against the current shard_metadata and raise a NotImplementedError if they differ (is that suitable?). A sketch of the check is below.
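
A sketch of that check; the layout of shard_metadata (in particular a world_size field) is an assumption:

```python
# Hypothetical sketch; the 'world_size' key inside shard_metadata
# is an assumption about the stored layout.
import torch.distributed as dist

def _check_shard_world_size(optim_state_dict):
    saved_ws = optim_state_dict['shard_metadata'].get('world_size')
    current_ws = dist.get_world_size()
    if saved_ws != current_ws:
        # Resharding optimizer state across a different world size is
        # not supported yet, so fail loudly instead of loading
        # inconsistent shards.
        raise NotImplementedError(
            f"optim state was saved with world_size={saved_ws}, "
            f"but the current world_size is {current_ws}")
```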

@yitongh (Contributor) commented Sep 14, 2024

Overall looks good to me. Passing off to @anw90 for final review.

@yitongh requested a review from anw90 on September 14, 2024 at 09:57.

@hanwen-sun changed the title from "support fsdp_optim_state" to "support save&load of fsdp_optim_state" on Sep 20, 2024.

Quoted code context:

    return model, optim


    def _train_step(

Reviewer (Contributor):

It's better to rename _train_step to _train, because _train_step suggests it performs only a single training step.

hanwen-sun (Author):

done

Quoted code context:

    labels = torch.zeros(batch_size, dtype=torch.int64).to(device)
    output = model(data)
    loss = torch.nn.functional.nll_loss(output, labels)
    loss.backward()

Reviewer (Contributor):

Do we need to call loss.backward() here?

hanwen-sun (Author):

No need, but in this test case it makes no difference whether we run only the forward pass or both forward and backward. A sketch of the resulting _train helper follows.
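
For context, a hypothetical sketch of the renamed _train helper; tensor shapes, the step count, and the optimizer step are illustrative assumptions, not the actual test code:

```python
# Hypothetical sketch; shapes and structure may differ from the real
# tests/distributed/test_fsdp_optim_state.py.
import torch

def _train(model, optim, device, batch_size=8, steps=2):
    for _ in range(steps):
        data = torch.randn(batch_size, 128).to(device)
        labels = torch.zeros(batch_size, dtype=torch.int64).to(device)
        output = model(data)  # assumed to return log-probabilities
        loss = torch.nn.functional.nll_loss(output, labels)
        loss.backward()
        # optim.step() populates optimizer state (e.g. Adam moments),
        # which the save/load tests then exercise.
        optim.step()
        optim.zero_grad()
```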
