Commit 21b56df

saumishr authored and facebook-github-bot committed
Rank local checkpointing in DCP internal without collectives (#989)
Summary:

### Context

DCP metadata collectives become prohibitively expensive as job scale grows. This PR introduces rank-local checkpointing (XLFormers-style checkpointing), which saves and loads the checkpoint without any collective. The trade-off, for now, is that dedupe and re-sharding are not supported. Support for both will be introduced soon.

Differential Revision: D72390326
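To make the idea in the summary concrete, here is a minimal sketch of collective-free saving and loading. It is not DCP's implementation: the function names (`save_rank_local`, `load_rank_local`), the `rank_{rank}.json` naming scheme, and the use of JSON are all assumptions chosen for illustration. The point is only that each rank touches its own file, so no metadata has to be gathered or broadcast.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def rank_file(checkpoint_dir: Path, rank: int) -> Path:
    # Hypothetical naming scheme: one file per rank.
    return checkpoint_dir / f"rank_{rank}.json"


def save_rank_local(state: Dict[str, Any], checkpoint_dir: Path, rank: int) -> None:
    # Each rank writes only its own state; nothing is exchanged between ranks.
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    rank_file(checkpoint_dir, rank).write_text(json.dumps(state))


def load_rank_local(checkpoint_dir: Path, rank: int) -> Dict[str, Any]:
    # Each rank reads back exactly the file it wrote. In this sketch the rank
    # layout at load time must match the layout at save time (no re-sharding),
    # and identical data held by several ranks is written once per rank (no dedupe).
    return json.loads(rank_file(checkpoint_dir, rank).read_text())


# Simulate two ranks in a single process.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "step_0"
    for rank in range(2):
        save_rank_local({"rank": rank, "weights": [rank, rank + 1]}, ckpt, rank)
    restored = load_rank_local(ckpt, 1)
```

The sketch also shows where the stated trade-off comes from: with no shared metadata, there is no global view from which to deduplicate or re-shard.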
1 parent c1dfeb7 commit 21b56df

1 file changed: +6 -2 lines changed

tests/framework/callbacks/test_dcp_saver.py

Lines changed: 6 additions & 2 deletions
@@ -991,13 +991,17 @@ class DummyStorageWriter(FileSystemWriter):
     def __init__(self, path: str) -> None:
         super().__init__(path)

-    def set_up_storage_writer(self, is_coordinator: bool) -> None:
+    def set_up_storage_writer(
+        self, is_coordinator: bool, *args: Any, **kwargs: Any
+    ) -> None:
         pass


 class DummyStorageReader(FileSystemReader):
     def __init__(self, path: str) -> None:
         super().__init__(path)

-    def set_up_storage_writer(self, is_coordinator: bool) -> None:
+    def set_up_storage_reader(
+        self, is_coordinator: bool, *args: Any, **kwargs: Any
+    ) -> None:
         pass
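The diff widens both dummy overrides to accept `*args` and `**kwargs`. A plausible reason is that the upstream hook may be called with more arguments than `is_coordinator` alone. The sketch below shows the general effect using stand-in classes. `BaseWriter`, `StrictWriter`, `TolerantWriter`, `call_hook`, and the `use_collectives` keyword are all hypothetical names for illustration, not part of DCP.

```python
from typing import Any


class BaseWriter:
    """Hypothetical stand-in for a storage-writer base class."""

    def set_up_storage_writer(
        self, is_coordinator: bool, *args: Any, **kwargs: Any
    ) -> None:
        raise NotImplementedError


class StrictWriter(BaseWriter):
    # Old-style override: accepts exactly one argument besides self.
    def set_up_storage_writer(self, is_coordinator: bool) -> None:
        pass


class TolerantWriter(BaseWriter):
    # New-style override, shaped like the one in the diff: extra positional
    # and keyword arguments are accepted and ignored.
    def set_up_storage_writer(
        self, is_coordinator: bool, *args: Any, **kwargs: Any
    ) -> None:
        pass


def call_hook(writer: BaseWriter) -> bool:
    """Call the hook with a hypothetical extra keyword; report whether it succeeded."""
    try:
        writer.set_up_storage_writer(True, use_collectives=False)
    except TypeError:
        # Raised when the override does not accept the extra keyword.
        return False
    return True


strict_ok = call_hook(StrictWriter())
tolerant_ok = call_hook(TolerantWriter())
```

With the narrow signature the call raises `TypeError`, while the widened signature keeps the test doubles working if the caller starts passing extra arguments.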
