
[REQUEST] An option to save only the model state_dict when calling save_checkpoint(), and how to manually save & load the model state_dict when using ZeRO-3 #2304

Open
@BlinkDL

Description

In my training code, I save and load only the model state_dict (no optimizer states). I find this is good enough with a few steps of warmup, and it saves a lot of space when training a large model, since I have to save very frequently (about once per hour).

However, one can't directly save the model state_dict when using ZeRO-3.

May I know how I can manually request all GPUs to save their partitioned state_dict, and how to manually load these state_dicts back onto each GPU?
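For context, a minimal sketch of one route that may cover the saving half of this request, not the per-GPU partitioned save asked about above. It assumes a DeepSpeed version that provides the `stage3_gather_16bit_weights_on_model_save` config key and the engine method `save_16bit_model`; the helper name `save_weights_only` and the `engine` argument are hypothetical, and loading the weights back under ZeRO-3 is not covered here.

```python
import json

# Assumed DeepSpeed config fragment: ZeRO stage 3 with consolidated
# 16-bit weights gathered at save time. Other required keys (batch size,
# optimizer, fp16/bf16, ...) are omitted from this sketch.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

# Serialized form, e.g. for writing to a JSON config file.
ds_config_json = json.dumps(ds_config, indent=2)


def save_weights_only(engine, save_dir, filename="pytorch_model.bin"):
    """Hypothetical helper: save model weights without optimizer states.

    `engine` is assumed to be the object returned by deepspeed.initialize().
    Every rank should call this, because gathering the partitioned
    parameters is a collective operation; the file itself is written by
    rank 0 only.
    """
    return engine.save_16bit_model(save_dir, filename)
```

The resulting file is a single consolidated state_dict rather than one shard per GPU, so it trades some save-time communication for a simpler file layout.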
