Downloading model in distributed mode #1521

Closed
BramVanroy opened this issue Oct 15, 2019 · 3 comments

BramVanroy (Collaborator) commented Oct 15, 2019:

🐛 Bug

When running in distributed mode with n processes, a new model is downloaded n times, which is presumably not intended. I found this related issue, but that only fixed the race condition; the downloads still happen in parallel. Is there a way to download the model only once, perhaps by passing a local_rank parameter and only downloading when local_rank == 0?

Especially for large models this is not ideal: (i) the temporary copies take up a lot of disk space (multiplied by the number of processes), and (ii) downloading is much slower because it happens multiple times in parallel, splitting the available bandwidth.

15-Oct 03:08:45 - [INFO]: https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin not found in cache or force_download set to True, downloading to /tmp/tmp0amm9x2s
15-Oct 03:08:45 - [INFO]: https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin not found in cache or force_download set to True, downloading to /tmp/tmp7wpg48uj
15-Oct 03:08:45 - [INFO]: https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin not found in cache or force_download set to True, downloading to /tmp/tmp89svv055
15-Oct 03:08:45 - [INFO]: https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin not found in cache or force_download set to True, downloading to /tmp/tmp7yk94f8s
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6552025106/6552025106 [03:57<00:00, 27631147.05B/s]
15-Oct 03:12:42 - [INFO]: copying /tmp/tmp89svv055 to cache at /home/bram/.cache/torch/transformers/c146cc96724f27295a0c3ada1fbb3632074adf87e9aef8269e44c9208787f8c8.b986347cbab65fa276683efbb9c2f7ee22552277bcf6e1f1166557ed0852fdf0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6552025106/6552025106 [03:57<00:00, 27614197.65B/s]
15-Oct 03:12:43 - [INFO]: copying /tmp/tmp7wpg48uj to cache at /home/bram/.cache/torch/transformers/c146cc96724f27295a0c3ada1fbb3632074adf87e9aef8269e44c9208787f8c8.b986347cbab65fa276683efbb9c2f7ee22552277bcf6e1f1166557ed0852fdf0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6552025106/6552025106 [03:57<00:00, 27605553.23B/s]
15-Oct 03:12:43 - [INFO]: copying /tmp/tmp0amm9x2s to cache at /home/bram/.cache/torch/transformers/c146cc96724f27295a0c3ada1fbb3632074adf87e9aef8269e44c9208787f8c8.b986347cbab65fa276683efbb9c2f7ee22552277bcf6e1f1166557ed0852fdf0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6552025106/6552025106 [03:57<00:00, 27599668.53B/s]
15-Oct 03:12:43 - [INFO]: copying /tmp/tmp7yk94f8s to cache at /home/bram/.cache/torch/transformers/c146cc96724f27295a0c3ada1fbb3632074adf87e9aef8269e44c9208787f8c8.b986347cbab65fa276683efbb9c2f7ee22552277bcf6e1f1166557ed0852fdf0

An alternative would be to 'touch' the file in the .cache directory before downloading, and not initiate a new download when that file already exists (taking aborted downloads into account).
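Roughly what that could look like, purely as a sketch: the marker-file convention, the helper name, and the polling loop are assumptions for illustration, not anything currently in the library.

```python
import os
import time

def download_once(cache_dir, filename, download_fn, poll_interval=5.0):
    """Sketch of the 'touch first' idea: the first process to create the
    marker file performs the download; the others wait for the finished file.
    `download_fn` is a hypothetical callable that writes the file to `target`."""
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, filename)
    marker = target + ".downloading"

    if os.path.exists(target):
        return target

    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one process succeeds.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
    except FileExistsError:
        # Another process is already downloading; poll until the file appears.
        while not os.path.exists(target):
            time.sleep(poll_interval)
        return target

    try:
        download_fn(target)
    finally:
        # Remove the marker even if the download aborts, so a retry is possible.
        os.remove(marker)
    return target
```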

thomwolf (Member) commented:

This should be fixed in most of the examples through the use of torch.distributed.barrier.
E.g. here: https://github.com/huggingface/transformers/blob/master/examples/run_glue.py#L473

Don't hesitate to submit a PR if some examples don't make use of this technique yet.
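For anyone landing here, the pattern in the examples is roughly the following. This is a minimal sketch rather than the exact code from run_glue.py; the model name, the argument parsing, and the NCCL backend are assumptions that follow the usual torch.distributed.launch setup (local_rank == -1 meaning non-distributed).

```python
import argparse
import torch
from transformers import AutoModel, AutoTokenizer

def load_pretrained(args, model_name="bert-base-uncased"):
    if args.local_rank not in (-1, 0):
        # All ranks except 0 wait here until rank 0 has downloaded
        # the files and populated the local cache.
        torch.distributed.barrier()

    # Rank 0 (or a single-process run) gets here first and triggers the download;
    # the other ranks arrive later and find the files already in the cache.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    if args.local_rank == 0:
        # Rank 0 hits the barrier last, releasing the waiting ranks.
        torch.distributed.barrier()

    return tokenizer, model

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args()
    if args.local_rank != -1:
        torch.distributed.init_process_group(backend="nccl")
    tokenizer, model = load_pretrained(args)
```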

BramVanroy (Collaborator, Author) commented:

Thanks for the quick reply! To make sure I understand this correctly: barrier blocks until all processes are synchronized (i.e. have reached that point). So before the model loading starts, every process except the first blocks, and only the first process continues and downloads the model and vocab. After the download finishes, the first process also calls barrier(), which satisfies the requirement that all processes have reached it and lifts the block. The other processes then continue, find that the model has already been downloaded, and load it from the cache.

thomwolf (Member) commented:

Yes
