Shared memory issues with parallelization #21
Hi @kdexd,
I am running into all kinds of shared memory errors after this commit 9c1ee36. I guess this parallelization is not stable; it sometimes runs and sometimes breaks, even after trying the possible solutions discussed in:
pytorch/pytorch#8976
pytorch/pytorch#973
Is there a leak somewhere? Might be best to have a look.
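The workaround most commonly cited in pytorch/pytorch#973 is to switch PyTorch's tensor sharing strategy away from the default 'file_descriptor' mode, which keeps one open file descriptor per tensor shared with DataLoader workers. A minimal sketch, not code from this repository:

```python
import torch.multiprocessing

# Switch to the 'file_system' strategy so workers stop consuming one
# file descriptor per shared tensor. Must run once in the main
# process, before any DataLoader is created.
torch.multiprocessing.set_sharing_strategy('file_system')
print(torch.multiprocessing.get_sharing_strategy())  # -> 'file_system'
```

Note that 'file_system' has its own caveat: if a process dies without cleaning up, it can leave stale files behind in /dev/shm.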
Thanks for reporting this! I will try to reproduce this on my end and see where it breaks.
My hunch is that if we use parallelization in PyTorch's DataLoader and still do multiprocess tokenization inside each worker, we get these errors. Basically, tokenization then runs in (cpu_workers * cpu_workers) processes (?) and eats up shared memory (?). I have removed the multiprocess tokenization and am running some experiments; will let you know how it goes. Your suggestions are also appreciated.
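A minimal sketch of the suspected pattern; the dataset classes and the `tokenize` helper below are hypothetical stand-ins, not code from this repository:

```python
import multiprocessing as mp
from torch.utils.data import Dataset

def tokenize(sentence):
    # Stand-in tokenizer; the real reader would use its own.
    return sentence.split()

# Suspected problem: a pool spawned inside __getitem__ runs in every
# DataLoader worker, so num_workers workers times a pool of size P
# means num_workers * P tokenization processes competing for shared
# memory.
class NestedPoolDataset(Dataset):
    def __init__(self, sentences):
        self.sentences = sentences

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        with mp.Pool(processes=4) as pool:
            return pool.map(tokenize, [self.sentences[idx]])[0]

# One way to avoid it: run the multiprocess tokenization once, up
# front, in the parent process, before the DataLoader forks workers.
class PreTokenizedDataset(Dataset):
    def __init__(self, sentences):
        with mp.Pool(processes=4) as pool:
            self.tokens = pool.map(tokenize, sentences)

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        return self.tokens[idx]
```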
Both of these happen at different times: all the tokenization happens in the reader even before training starts, while the DataLoader makes batches only during training. I am unable to reproduce this so far. Could you check whether it works with 1 worker?
Yeah, I did try with 1 worker and had the same errors. (Can't use 0 because this requires at least one worker :D) I have removed the multiprocess tokenization in my code and it works fine now. Just to let you know, it doesn't happen in the starting iterations or epochs; I guess it was after 3-5 epochs.
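Since the failure reportedly appears only after 3-5 epochs, a steadily growing /dev/shm footprint would point to a leak rather than a one-off spike. A small hypothetical helper for checking that (the function name and the idea of calling it once per epoch are assumptions):

```python
import shutil

def log_shm_usage(epoch):
    # shutil.disk_usage reports (total, used, free) for the filesystem
    # backing the given path; /dev/shm is where POSIX shared memory
    # lives on Linux.
    usage = shutil.disk_usage("/dev/shm")
    print(f"epoch {epoch}: /dev/shm used "
          f"{usage.used / 2**20:.1f} of {usage.total / 2**20:.1f} MiB")
```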
I think I'm hitting this too. In my setup I'm doing independent runs in parallel threads (not processes, since I'm using LevelDB, which does not support multiprocess access), and I see these errors even though I'm using the workaround suggested here: pytorch/pytorch#973 (comment)
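If the sharing-strategy workaround alone does not help, another mitigation commonly suggested in those PyTorch threads is to raise the per-process open-file limit, since the default 'file_descriptor' strategy costs one descriptor per shared tensor. A sketch, assuming a Unix system:

```python
import resource

# Raise this process's soft open-file limit up to the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print(f"open-file limit raised from {soft} to {hard}")
```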