Downloading model in distributed mode #1521

Closed
BramVanroy opened this issue Oct 15, 2019 · 3 comments

BramVanroy (Collaborator) commented Oct 15, 2019:

🐛 Bug

When running in distributed mode with n processes, a new model is downloaded n times, which is presumably not intended. I found this related issue, but that only fixed the race condition; the downloads still happen in parallel. Is there a way to download the model only once, perhaps by passing a local_rank parameter and only downloading when local_rank == 0?

Especially for large models this is not ideal: (i) the temporary copies take up a lot of disk space (multiplied by the number of processes), and (ii) downloading is much slower because it happens multiple times in parallel, splitting the available bandwidth.

15-Oct 03:08:45 - [INFO]: https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin not found in cache or force_download set to True, downloading to /tmp/tmp0amm9x2s
15-Oct 03:08:45 - [INFO]: https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin not found in cache or force_download set to True, downloading to /tmp/tmp7wpg48uj
15-Oct 03:08:45 - [INFO]: https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin not found in cache or force_download set to True, downloading to /tmp/tmp89svv055
15-Oct 03:08:45 - [INFO]: https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin not found in cache or force_download set to True, downloading to /tmp/tmp7yk94f8s
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6552025106/6552025106 [03:57<00:00, 27631147.05B/s]
15-Oct 03:12:42 - [INFO]: copying /tmp/tmp89svv055 to cache at /home/bram/.cache/torch/transformers/c146cc96724f27295a0c3ada1fbb3632074adf87e9aef8269e44c9208787f8c8.b986347cbab65fa276683efbb9c2f7ee22552277bcf6e1f1166557ed0852fdf0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6552025106/6552025106 [03:57<00:00, 27614197.65B/s]
15-Oct 03:12:43 - [INFO]: copying /tmp/tmp7wpg48uj to cache at /home/bram/.cache/torch/transformers/c146cc96724f27295a0c3ada1fbb3632074adf87e9aef8269e44c9208787f8c8.b986347cbab65fa276683efbb9c2f7ee22552277bcf6e1f1166557ed0852fdf0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6552025106/6552025106 [03:57<00:00, 27605553.23B/s]
15-Oct 03:12:43 - [INFO]: copying /tmp/tmp0amm9x2s to cache at /home/bram/.cache/torch/transformers/c146cc96724f27295a0c3ada1fbb3632074adf87e9aef8269e44c9208787f8c8.b986347cbab65fa276683efbb9c2f7ee22552277bcf6e1f1166557ed0852fdf0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6552025106/6552025106 [03:57<00:00, 27599668.53B/s]
15-Oct 03:12:43 - [INFO]: copying /tmp/tmp7yk94f8s to cache at /home/bram/.cache/torch/transformers/c146cc96724f27295a0c3ada1fbb3632074adf87e9aef8269e44c9208787f8c8.b986347cbab65fa276683efbb9c2f7ee22552277bcf6e1f1166557ed0852fdf0

An alternative would be to 'touch' the file in the .cache directory before downloading, and not initiate a new download when that file already exists (taking aborted downloads into account).
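Roughly what that could look like, purely as a sketch: the marker-file convention, the helper name, and the polling loop are assumptions for illustration, not anything currently in the library.

```python
import os
import time

def download_once(cache_dir, filename, download_fn, poll_interval=5.0):
    """Sketch of the 'touch first' idea: the first process to create the
    marker file performs the download; the others wait for the finished file.
    `download_fn` is a hypothetical callable that writes the file to `target`."""
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, filename)
    marker = target + ".downloading"

    if os.path.exists(target):
        return target

    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one process succeeds.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
    except FileExistsError:
        # Another process is already downloading; poll until the file appears.
        while not os.path.exists(target):
            time.sleep(poll_interval)
        return target

    try:
        download_fn(target)
    finally:
        # Remove the marker even if the download aborts, so a retry is possible.
        os.remove(marker)
    return target
```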

thomwolf (Member) commented:

This should be fixed in most of the examples through the use of torch.distributed.barrier.
E.g. here: https://github.com/huggingface/transformers/blob/master/examples/run_glue.py#L473

Don't hesitate to submit a PR if some examples don't make use of this technique yet.
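For anyone landing here, the pattern in the examples is roughly the following. This is a minimal sketch rather than the exact code from run_glue.py; the model name, the argument parsing, and the NCCL backend are assumptions that follow the usual torch.distributed.launch setup (local_rank == -1 meaning non-distributed).

```python
import argparse
import torch
from transformers import AutoModel, AutoTokenizer

def load_pretrained(args, model_name="bert-base-uncased"):
    if args.local_rank not in (-1, 0):
        # All ranks except 0 wait here until rank 0 has downloaded
        # the files and populated the local cache.
        torch.distributed.barrier()

    # Rank 0 (or a single-process run) gets here first and triggers the download;
    # the other ranks arrive later and find the files already in the cache.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    if args.local_rank == 0:
        # Rank 0 hits the barrier last, releasing the waiting ranks.
        torch.distributed.barrier()

    return tokenizer, model

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args()
    if args.local_rank != -1:
        torch.distributed.init_process_group(backend="nccl")
    tokenizer, model = load_pretrained(args)
```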

BramVanroy (Collaborator, Author) commented:

Thanks for the quick reply! To make sure I understand this correctly: barrier blocks until all processes are synchronized (i.e. have reached that point). So before the model loading starts, every process except the first blocks, and only the first process continues and downloads the model and vocab. After the download finishes, the first process also calls barrier(), which satisfies the requirement that all processes have reached it and lifts the block. The other processes then continue, find that the model has already been downloaded, and load it from the cache.

thomwolf (Member) commented:

Yes
