[BUG] No universal_checkpoint_info in the Accelerate+DeepSpeed Checkpoint #5430
Open
opened on Apr 17, 2024
I trained a model using Accelerate+DeepSpeed ZeRO-2 and got a ZeRO-2 checkpoint. The checkpoint structure is listed below, and the checkpoint itself is shared via a Google Drive link.
checkpoint-3/
├── config.json
├── generation_config.json
├── global_step3
│   ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
│   └── mp_rank_00_model_states.pt
├── latest
├── model.safetensors
├── rng_state_0.pth
├── rng_state_1.pth
├── rng_state_2.pth
├── rng_state_3.pth
├── scheduler.pt
├── trainer_state.json
├── training_args.bin
└── zero_to_fp32.py
I tried to convert this ZeRO-2 checkpoint to the universal format using ds_to_universal.py, but the conversion failed with the following error:
args = Namespace(input_folder='experiment_ckpts/tinyllama_expanded_frez_embed-2024-04-16-010251/checkpoint-3', output_folder='experiment_ckpts/tinyllama_expanded_frez_embed-2024-04-16-010251/checkpoint-3_universal', num_extract_workers=10, num_merge_workers=10, keep_temp_folder=False, strict=True)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in experiment_ckpts/tinyllama_expanded_frez_embed-2024-04-16-010251/checkpoint-3 to Universal checkpoint in experiment_ckpts/tinyllama_expanded_frez_embed-2024-04-16-010251/checkpoint-3_universal
Traceback (most recent call last):
  File "dist_env_tools/ds_to_universal.py", line 363, in <module>
    main(args)
  File "dist_env_tools/ds_to_universal.py", line 320, in main
    _check_for_required_state(ds_checkpoint)
  File "dist_env_tools/ds_to_universal.py", line 311, in _check_for_required_state
    assert universal_checkpoint_info is not None, f'Required {UNIVERSAL_CHECKPOINT_INFO} state is missing in checkpoint. Verify that client creates this state.'
AssertionError: Required universal_checkpoint_info state is missing in checkpoint. Verify that client creates this state.
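To check by hand whether the state is really absent, a minimal sketch like the one below could be used. It assumes (this is not confirmed by the traceback) that the converter looks for a top-level key named universal_checkpoint_info in mp_rank_00_model_states.pt. The helper names has_universal_checkpoint_info and inspect_model_states are hypothetical and not part of DeepSpeed.

```python
from typing import Any, Mapping

# Key name copied from the assertion message in the traceback.
UNIVERSAL_CHECKPOINT_INFO = "universal_checkpoint_info"


def has_universal_checkpoint_info(state: Mapping[str, Any]) -> bool:
    """Return True if the loaded state holds a non-None entry under the key."""
    return state.get(UNIVERSAL_CHECKPOINT_INFO) is not None


def inspect_model_states(path: str) -> bool:
    """Load a model-states file on CPU and report whether the key is present.

    Assumption: the key, if it exists, sits at the top level of this file.
    """
    import torch  # imported lazily so the pure check above works without PyTorch

    # weights_only=False because the file may hold arbitrary pickled client
    # state; only do this for checkpoints you trust.
    state = torch.load(path, map_location="cpu", weights_only=False)
    print("top-level keys:", list(state.keys()))
    return has_universal_checkpoint_info(state)
```

For the checkpoint listed above, this would be called as inspect_model_states("checkpoint-3/global_step3/mp_rank_00_model_states.pt"). A False result would be consistent with the assertion error, but it would not by itself explain why the state was never written.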
It seems the checkpoint structure is a bit different from the Universal Checkpoint examples in Megatron-DeepSpeed.
May I ask how I can find the universal_checkpoint_info in my checkpoint?