When running in distributed mode with n processes, the model will be downloaded n times. I don't think that's what you want. I found this related issue, but that only fixed the race condition; the downloads still happen in parallel. Is there a way to only download the model once? Perhaps by passing a local_rank parameter and only downloading when local_rank==0?
Especially for large models this is not ideal, because (i) the temporary copies take up a lot of disk space (multiplied by the number of processes) and (ii) downloading is extra slow, since the parallel downloads have to share the available bandwidth.
15-Oct 03:08:45 - [INFO]: https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin not found in cache or force_download set to True, downloading to /tmp/tmp0amm9x2s
15-Oct 03:08:45 - [INFO]: https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin not found in cache or force_download set to True, downloading to /tmp/tmp7wpg48uj
15-Oct 03:08:45 - [INFO]: https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin not found in cache or force_download set to True, downloading to /tmp/tmp89svv055
15-Oct 03:08:45 - [INFO]: https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin not found in cache or force_download set to True, downloading to /tmp/tmp7yk94f8s
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6552025106/6552025106 [03:57<00:00, 27631147.05B/s]
15-Oct 03:12:42 - [INFO]: copying /tmp/tmp89svv055 to cache at /home/bram/.cache/torch/transformers/c146cc96724f27295a0c3ada1fbb3632074adf87e9aef8269e44c9208787f8c8.b986347cbab65fa276683efbb9c2f7ee22552277bcf6e1f1166557ed0852fdf0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6552025106/6552025106 [03:57<00:00, 27614197.65B/s]
15-Oct 03:12:43 - [INFO]: copying /tmp/tmp7wpg48uj to cache at /home/bram/.cache/torch/transformers/c146cc96724f27295a0c3ada1fbb3632074adf87e9aef8269e44c9208787f8c8.b986347cbab65fa276683efbb9c2f7ee22552277bcf6e1f1166557ed0852fdf0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6552025106/6552025106 [03:57<00:00, 27605553.23B/s]
15-Oct 03:12:43 - [INFO]: copying /tmp/tmp0amm9x2s to cache at /home/bram/.cache/torch/transformers/c146cc96724f27295a0c3ada1fbb3632074adf87e9aef8269e44c9208787f8c8.b986347cbab65fa276683efbb9c2f7ee22552277bcf6e1f1166557ed0852fdf0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6552025106/6552025106 [03:57<00:00, 27599668.53B/s]
15-Oct 03:12:43 - [INFO]: copying /tmp/tmp7yk94f8s to cache at /home/bram/.cache/torch/transformers/c146cc96724f27295a0c3ada1fbb3632074adf87e9aef8269e44c9208787f8c8.b986347cbab65fa276683efbb9c2f7ee22552277bcf6e1f1166557ed0852fdf0
An alternative would be to 'touch' the file in the .cache before downloading and, when that entry already exists, not initiate a new download (taking aborted downloads into account, of course). See the sketch below for what I mean.
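For illustration, a minimal sketch of that "reserve the cache entry first" idea, assuming the third-party filelock package and a hypothetical download_fn callback (this is not how the library currently manages its cache):

```python
import os
from filelock import FileLock  # third-party package, assumed available


def fetch_once(url: str, cache_path: str, download_fn):
    """Download url to cache_path at most once across competing processes.

    The first process to acquire the lock performs the download; the others
    block on the lock and then find the file already in place. If the holder
    crashes mid-download, the OS releases the lock so another process can retry.
    """
    with FileLock(cache_path + ".lock"):
        if not os.path.exists(cache_path):
            download_fn(url, cache_path)  # hypothetical download helper
    return cache_path
```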
Thanks for the quick reply! So, to make sure I understand this correctly: barrier blocks until all processes are synchronized (i.e. have reached that point). So before we load the model, every process except the first one blocks, and only the first process continues (and downloads the model and vocab). After successfully downloading the required files, the first process also reaches barrier(), which satisfies the condition that all processes have called it and lifts the block. The other processes then continue as well, but find that the model has already been downloaded and load it from the cache. Roughly like the sketch below?
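In code, that pattern would look roughly like this sketch (the AutoModel/AutoTokenizer classes and the local_rank convention of -1 for non-distributed runs are my assumptions, not the exact snippet from the examples):

```python
import torch.distributed as dist
from transformers import AutoModel, AutoTokenizer


def load_model(model_name: str, local_rank: int):
    # Assumes the process group has already been initialized for distributed runs.
    if local_rank not in (-1, 0):
        # Every process except rank 0 waits here until rank 0 has filled the cache.
        dist.barrier()

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    if local_rank == 0:
        # Rank 0 is done downloading; release the other processes,
        # which will now read from the local cache instead of the network.
        dist.barrier()

    return tokenizer, model
```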