[REQUEST] An option to save only the model state_dict when calling save_checkpoint(), and how to manually save & load the model state_dict when using ZeRO-3 #2304
Open
Description
In my training code, I save & load only the model state_dict (no optimizer states). I find this is good enough given a few steps of warmup, and it saves a lot of space when training a large model (I have to save very frequently, about once per hour).
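For reference, the model-only workflow described above looks roughly like this in plain PyTorch without ZeRO. This is a minimal sketch, not the actual training code: the helper names `save_model_only` and `load_model_only` and the checkpoint path are placeholders made up for illustration.

```python
import torch
import torch.nn as nn


def save_model_only(model: nn.Module, path: str) -> None:
    """Write only the model's parameters and buffers to `path`.

    Optimizer state is deliberately left out to keep the file small.
    (Hypothetical helper name, used only for this sketch.)
    """
    torch.save(model.state_dict(), path)


def load_model_only(model: nn.Module, path: str) -> None:
    """Load parameters and buffers saved by `save_model_only` into `model`.

    (Hypothetical helper name, used only for this sketch.)
    """
    # map_location="cpu" lets the file load regardless of the saving device;
    # load_state_dict then copies the values into the model's own tensors.
    state_dict = torch.load(path, map_location="cpu")
    model.load_state_dict(state_dict)
```

This works because every rank holds the full parameters. The same calls are not enough under ZeRO-3, which is what the question below is about.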
However, one can't directly save the model state_dict when using ZeRO-3.
May I know how I can manually request all GPUs to save their partitioned state_dict, and how to manually load each of these state_dicts back onto its GPU?