mayank31398 pushed a commit to mayank31398/BigCode-Megatron-LM that referenced this issue on Jun 21, 2023:

* add direct meg-ds to hf format script (NVIDIA#110)
* add direct meg-ds to hf format script (part2) (NVIDIA#111)
* add direct meg-ds to hf format script
* split into 2 functions
* update the usage doc
* make scripts executable
* add shebang

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas@stason.org>
When launching very long training runs, building the index mappings can take more than 1 minute. The consequence is that the other ranks time out here: https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/training.py#L962

However, the timeout passed to `torch.distributed.init_process_group` is 10 minutes. Why isn't this value used in `torch.distributed.broadcast`?
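For context, here is a minimal, self-contained sketch of the failure mode, not the Megatron-LM code itself: two local processes on the `gloo` backend, where rank 0 simulates the slow index-mapping build and the other rank waits in a collective. The 10-minute figure is the `timeout` argument to `torch.distributed.init_process_group`, which is the timeout collectives on that group are meant to honor (with the NCCL backend this additionally depends on `NCCL_BLOCKING_WAIT`/`NCCL_ASYNC_ERROR_HANDLING` being set).

```python
import os
import time
from datetime import timedelta

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(
        "gloo",
        rank=rank,
        world_size=world_size,
        timeout=timedelta(minutes=10),  # must exceed the index-build time
    )
    if rank == 0:
        time.sleep(90)  # stand-in for the >1 minute index-mapping build
    # The other ranks block here; with a timeout shorter than the build,
    # they abort instead of waiting for rank 0.
    dist.barrier()
    payload = torch.zeros(1)
    if rank == 0:
        payload.fill_(42.0)
    dist.broadcast(payload, src=0)  # governed by the same group timeout
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```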
The workaround for now is to first create the index mappings on a single worker, as a preliminary run.
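A sketch of the idea behind that workaround, independent of Megatron-LM (the cache layout and file naming below are illustrative, not the real ones): the expensive shuffle index is only computed when its cache file is missing, so a preliminary single-process run leaves the file behind, and later multi-rank runs just memory-map it without any rank sitting in a collective.

```python
import os

import numpy as np


def get_shuffle_index(num_samples: int, seed: int, cache_dir: str) -> np.ndarray:
    path = os.path.join(cache_dir, f"shuffle_{num_samples}_{seed}.npy")
    if not os.path.exists(path):  # only true during the preliminary run
        rng = np.random.RandomState(seed)
        index = np.arange(num_samples, dtype=np.int64)
        rng.shuffle(index)  # the slow part for very long training runs
        os.makedirs(cache_dir, exist_ok=True)
        np.save(path, index)
    # Later runs (all ranks) take this fast path and never block each other.
    return np.load(path, mmap_mode="r")


if __name__ == "__main__":
    idx = get_shuffle_index(1_000_000, seed=1234, cache_dir="./index_cache")
    print(idx[:5])
```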