only prepare data on rank 0 #409

Merged: 2 commits merged into master from data_rank_0 on Apr 24, 2023
Conversation

@conglongli (Contributor) commented Apr 24, 2023

Also changed the filename hashing from hash() to hashlib.sha256(), because of https://stackoverflow.com/questions/27522626/hash-function-in-python-3-3-returns-different-results-between-sessions

Tested with step 1 training_scripts/multi_node/run_66b.sh to confirm that the PR works fine.
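
A minimal sketch of the hashing change described above (the cache-key string below is a made-up example, not the repo's actual key):

    import hashlib

    fname = "Dahoas/rm-static_split2,4,4_seed1234"              # hypothetical cache key
    unstable = str(hash(fname))                                 # salted per interpreter session since Python 3.3
    stable = hashlib.sha256(fname.encode("utf-8")).hexdigest()  # identical across sessions and machines
    print(f"traindata_{stable}.pt")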

conglongli merged commit 76c16f1 into master on Apr 24, 2023
conglongli deleted the data_rank_0 branch on April 24, 2023 at 08:19
@puyuanOT commented:

I got an OOM issue after checking out this PR. It looks like this data preparation approach is consuming much more CPU memory than the previous implementation.

@conglongli (Contributor, Author) commented Apr 25, 2023

> I got an OOM issue after checking out this PR. It looks like this data preparation approach is consuming much more CPU memory than the previous implementation.

@puyuanOT It's very unlikely that this PR leads to more CPU memory consumption: if you look at the code change, the only thing this PR does is reduce the data preparation to a single rank on each node. If running the same script with and without this PR really leads to different results (OOM with this PR), please share your error log and script so that we can try to reproduce and debug. Thanks.
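
For readers following along, a minimal sketch of the rank-0-per-node preparation pattern described above (function and variable names are placeholders, not the actual repo code), assuming torch.distributed has already been initialized by the launcher:

    import os
    import torch
    import torch.distributed as dist

    def prepare_and_load(train_fname, eval_fname, build_cache):
        # Placeholder sketch: rank 0 on each node builds the cached .pt files,
        # every other rank waits at the barrier, then all ranks load the same cache.
        local_rank = int(os.getenv("LOCAL_RANK", "0"))
        if local_rank == 0 and not (os.path.isfile(train_fname)
                                    and os.path.isfile(eval_fname)):
            build_cache(train_fname, eval_fname)  # tokenize/split, then torch.save()
        if dist.is_initialized():
            dist.barrier()  # wait until rank 0 has written the cache
        return torch.load(train_fname), torch.load(eval_fname)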

@conglongli (Contributor, Author) commented:

@puyuanOT Actually, there is one thing that might be related to a memory consumption increase. Could you try changing the "return torch.load(train_fname), torch.load(eval_fname)" line to "train_dataset, eval_dataset = torch.load(train_fname), torch.load(eval_fname)" followed by "return train_dataset, eval_dataset"? For users with only 1 GPU, this difference might lead to more memory consumption.
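
Spelled out, the suggested change looks roughly like this (a sketch of the suggestion, not the repo's actual function):

    import torch

    def load_cached_datasets(train_fname, eval_fname):
        # Bind the two loads to names before returning, instead of returning
        # the torch.load() calls directly inside the return expression.
        train_dataset, eval_dataset = torch.load(train_fname), torch.load(eval_fname)
        return train_dataset, eval_dataset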

@puyuanOT commented:

@conglongli, thank you for your response! I'm fairly certain that this PR has resulted in significantly higher memory consumption, as I manually tested the script with and without the PR. To provide more details: without the PR, it consumed approximately 120 GB with my four A10 GPUs; with the PR, peak consumption exceeded 190 GB.

I will give the fix you suggested a try.

@puyuanOT commented May 2, 2023

I figured this out.

In the previous implementation:

    fname = str(hash(fname))  # hash the file name to avoid too long file name
    train_fname = f"{output_path}/traindata_{fname}.pt"
    eval_fname = f"{output_path}/evaldata_{fname}.pt"

    cache_found = os.path.isfile(train_fname) and os.path.isfile(eval_fname)
    buf_create_cache = torch.ByteTensor([not cache_found]).cuda()
    torch.distributed.all_reduce(buf_create_cache)

    if buf_create_cache.item() == 0:
        return torch.load(train_fname), torch.load(eval_fname)

The code re-preprocesses the data every time the script is called because str(hash(fname)) is not stable: Python's hash() is salted per interpreter session, so the cache filename changes on every run, the cached files are never found, and the line torch.load(train_fname), torch.load(eval_fname) is never executed. Interestingly, re-preprocessing the data consumes significantly less memory than loading the precomputed torch tensors, so preparing the data on every run actually uses less memory than loading the pre-computed tensors.
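
A quick, self-contained way to see the instability (not part of the repo code): the same string hashes to a different value in each fresh interpreter, so a filename derived from hash() never matches the one written by a previous run.

    import subprocess, sys

    cmd = [sys.executable, "-c", "print(hash('traindata'))"]
    # With hash randomization (the default since Python 3.3), two separate
    # interpreter sessions typically print different values for the same string.
    print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
    print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())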

Empirically, when setting --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP and max_length=1024, the script consumes less than 200 GB of memory with 4 GPUs (processes) under the old implementation, but more than 350 GB under the new implementation.

@puyuanOT commented May 2, 2023

To clarify, the memory consumption figures mentioned above are peak values.

@puyuanOT commented May 2, 2023

@conglongli I'm not sure if there is a way to avoid both the re-processing and the high memory consumption.

@conglongli (Contributor, Author) commented:

@puyuanOT Thanks for the investigation. We are currently working on switching to memmap-based data management, which will hopefully reduce CPU memory consumption substantially (#450). However, it's a complex transition and will take some time, so you might want to roll back to the old implementation for now.
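
The details of #450 aren't shown in this thread, but the general memmap idea is roughly the following (a hypothetical illustration with made-up filenames, not the actual change): token data is written once to a flat binary file, and each process maps it read-only, so ranks share the OS page cache instead of each holding a full in-memory copy of the dataset.

    import numpy as np

    tokens = np.arange(10_000, dtype=np.int32)       # stand-in for tokenized data
    tokens.tofile("train_tokens.bin")                # hypothetical on-disk cache

    mapped = np.memmap("train_tokens.bin", dtype=np.int32, mode="r")
    print(mapped[:8])                                # pages are read lazily from disk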

leocnj pushed a commit to leocnj/DeepSpeedExamples that referenced this pull request May 27, 2023
* only prepare data on rank 0

* fix hash
Labels: deespeed chat DeepSpeed Chat