only prepare data on rank 0 #409
Conversation
I got an OOM issue after checking out this PR. This data-preparation approach seems to consume much more CPU memory than the previous implementation.
@puyuanOT It's very unlikely that this PR leads to more CPU memory consumption: if you look at the code change, the only thing this PR does is restrict data preparation to a single rank on each node. If running the same script with and without this PR really leads to different results (OOM with this PR), please share your error log and script so that we can try to reproduce and debug. Thanks.
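The "single rank per node" pattern the comment describes can be sketched as follows. This is a minimal illustration, not the PR's actual code: `build_dataset`, `cache_path`, and the injected `barrier` callable are hypothetical stand-ins (in a real `torch.distributed` job the barrier would be `torch.distributed.barrier()` and the data would typically be saved with `torch.save`).

```python
import os
import pickle


def prepare_data_once(local_rank, build_dataset, cache_path, barrier):
    """Sketch: only local rank 0 preprocesses and caches the dataset.

    All other ranks skip the (expensive) build step, wait at the barrier
    until rank 0 has finished writing, and then load the cached file.
    """
    if local_rank == 0 and not os.path.exists(cache_path):
        data = build_dataset()  # expensive tokenization / preprocessing
        with open(cache_path, "wb") as f:
            pickle.dump(data, f)
    barrier()  # ensure rank 0 finished writing before anyone reads
    with open(cache_path, "rb") as f:
        return pickle.load(f)
```

Note that every rank still deserializes its own full copy of the cached data at the end, which is why peak CPU memory can remain high even when only one rank does the preparation.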
@puyuanOT Actually, there might be one thing related to a potential memory consumption increase. Could you try changing the `return torch.load(train_fname), torch.load(eval_fname)` at
@conglongli, thank you for your response! I'm fairly certain that this PR results in significantly higher memory consumption, as I manually tested the script with and without it. To give more detail: without the PR it consumed approximately 120 GB across my four A10 GPUs, but with the PR the peak consumption exceeded 190 GB. I will give the fix you suggested a try.
I figured this out. In the previous implementation:
The code re-preprocesses the data every time the script is called due to the instability of `hash()`. Empirically, when setting
To clarify, the memory consumption numbers mentioned above are peak values.
@conglongli Not sure if there is a way to avoid both the re-processing and the high memory consumption.
@puyuanOT Thanks for the investigation. We are currently working on switching to memmap-based data management (#450), which hopefully will greatly reduce CPU memory consumption. However, it will be a complex transition and thus will take some time, so you might want to roll back to the old implementation for now.
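For context on why memmap-based management reduces CPU memory: instead of deserializing the whole dataset into each process's heap, the data lives in a flat file on disk and the OS pages it in lazily as samples are accessed. A minimal sketch with `numpy.memmap` (the function names here are illustrative, not the #450 implementation):

```python
import numpy as np


def save_as_memmap(arr, path):
    # Write the preprocessed token array to disk once (e.g. by rank 0).
    mm = np.memmap(path, dtype=np.int32, mode="w+", shape=arr.shape)
    mm[:] = arr
    mm.flush()  # make sure the data hits disk before other ranks read it


def load_memmap(path, shape):
    # mode="r" maps the file read-only; pages are loaded on demand,
    # so resident memory stays far below the full dataset size, and
    # all ranks on a node share the same page cache.
    return np.memmap(path, dtype=np.int32, mode="r", shape=shape)
```

Because every rank maps the same read-only file rather than holding a private in-memory copy, peak CPU memory no longer scales with the number of ranks per node.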
* only prepare data on rank 0
* fix hash
Also change the filename hashing from `hash()` to `hashlib.sha256()`, because `hash()` returns different results between sessions: https://stackoverflow.com/questions/27522626/hash-function-in-python-3-3-returns-different-results-between-sessions
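The effect of that switch can be illustrated with a small sketch (the helper name and filename scheme below are hypothetical, not the PR's actual code). The key point is that Python salts `str.__hash__` per interpreter session (`PYTHONHASHSEED`), so a cache file named with `hash()` is never found again on the next run and the data gets re-preprocessed every time, whereas `hashlib.sha256()` is deterministic across processes:

```python
import hashlib


def cache_filename(*config_parts):
    # Derive a deterministic cache filename from configuration strings.
    # hashlib.sha256 produces the same digest in every Python process,
    # unlike the builtin hash() of a str, which is randomized per session.
    key = "|".join(config_parts)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return "traindata_" + digest + ".pt"
```

With a stable filename, a run can check `os.path.exists(...)` on the cache and skip preprocessing entirely instead of rebuilding it on every invocation.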
Tested that the PR works fine with step 1 `training_scripts/multi_node/run_66b.sh`.