Avoid poisoning process with CUDA calls as soon as importing #6810

Merged
merged 6 commits into microsoft:master on Dec 12, 2024

Conversation

@HollowMan6 (Contributor) commented Nov 29, 2024

Call `torch.cuda.device_count() > 0` before `torch.cuda.is_available()` to give priority to the NVML-based availability check, so that simply executing `import deepspeed` does not poison the process with CUDA calls.

https://github.com/pytorch/pytorch/blob/v2.5.1/torch/cuda/__init__.py#L120-L124
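For reference, a rough paraphrase of the logic at that link (torch 2.5.x), reconstructed rather than copied verbatim, so treat it as a sketch:

```python
import torch

# Paraphrase of torch.cuda.is_available() in torch 2.5.x (see the linked source); not verbatim.
def is_available_sketch() -> bool:
    if not torch.cuda._is_compiled():          # torch was built without CUDA support
        return False
    if torch.cuda._nvml_based_avail():         # PYTORCH_NVML_BASED_CUDA_CHECK=1 is set
        return torch.cuda.device_count() > 0   # NVML-first probe; avoids cuInit when NVML works
    # Default path: cudaGetDeviceCount() via the CUDA Runtime, which runs cuInit and
    # makes later fork()ed children unable to use CUDA.
    return torch._C._cuda_getDeviceCount() > 0
```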

There are two reasons to make this change:

First, if we accidentally import deepspeed: the CUDA runtime initializes on the first CUDA API call and caches the device list, so changing `CUDA_VISIBLE_DEVICES` within the same process after initialization has no effect on the visible devices. The specific case:
OpenRLHF/OpenRLHF#524 (comment)

A demo for reproduction before the fix is applied:

```python
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = ""         # hide all GPUs before importing deepspeed
import deepspeed                                # before the fix: importing triggers a CUDA call, caching the (empty) device list
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # re-exposing GPUs now has no effect in this process
torch.cuda.set_device('cuda:0')                 # fails: the CUDA runtime was already initialized with no visible devices
```
Second, per the PyTorch CUDA notes (https://pytorch.org/docs/stable/notes/cuda.html):

When assessing the availability of CUDA in a given environment (is_available()), PyTorch’s default behavior is to call the CUDA Runtime API method cudaGetDeviceCount. Because this call in turn initializes the CUDA Driver API (via cuInit) if it is not already initialized, subsequent forks of a process that has run is_available() will fail with a CUDA initialization error.
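The same notes document an opt-in environment variable, PYTORCH_NVML_BASED_CUDA_CHECK, that switches is_available() to the NVML-based assessment. A minimal illustration (assuming a recent PyTorch; the flag must be set before the first availability check):

```python
import os

# Opt into the NVML-based availability check described in the PyTorch CUDA notes.
# This must be set before torch.cuda.is_available() is first called in this process.
os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"] = "1"

import torch

# With the flag set, is_available() queries NVML instead of the CUDA Runtime,
# so it does not run cuInit and later forks of this process can still use CUDA.
print(torch.cuda.is_available())
```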

@loadams loadams requested a review from tjruwase December 4, 2024 19:58
@tjruwase (Contributor) commented Dec 6, 2024

@HollowMan6, thanks for diagnosing the problem and sharing a PR. However, I don't understand why this is the correct solution.

Switch from torch.cuda.is_available() to torch.cuda.device_count() > 0, to give priority to nvml based availability

  1. The link you shared shows that is_available() already gives priority to NVML-based checking, assuming the correct environment setup. So, I am confused about your above comment.
  2. device_count() succeeds only if NVML works; otherwise it falls back to the CUDA Runtime path that causes fork poisoning (a sketch of the NVML-only probe follows below).
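For illustration, here is roughly what that NVML-only probe looks like when written against the nvidia-ml-py (pynvml) bindings. PyTorch uses its own raw NVML calls internally, so this is only a sketch of the idea, not PyTorch's code:

```python
import pynvml  # the nvidia-ml-py package; assumed installed for this sketch

def nvml_device_count() -> int:
    """Count GPUs via NVML only; raises pynvml.NVMLError if NVML is unusable."""
    pynvml.nvmlInit()
    try:
        return pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()

# When NVML discovery fails, torch.cuda.device_count() falls back to the CUDA Runtime
# path (cudaGetDeviceCount), which is the call that poisons later forks.
```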

Can you please explain what I am missing? Thanks!

@tjruwase (Contributor) commented Dec 6, 2024

@HollowMan6, I want to share my thoughts on this problem.

Building on your great analysis here and there, I did some further digging to better appreciate the painful state of CUDA availability discovery in PyTorch. However, I think this is a problem that should be fixed in PyTorch rather than DeepSpeed, because DeepSpeed builds on PyTorch and should preserve its semantics (for good or bad) as much as practical. So, my three takeaways are as follows:

1. In this case, my understanding is that is_available() is the recommended way to do CUDA discovery, despite its limitations.

[screenshot of the PyTorch documentation]

2. There is a recommendation for how users can enable NVML-based checking for is_available(). We can amplify this.

[screenshot of the PyTorch documentation]

3. However, the docs also note that the NVML-based check is not foolproof.

[screenshot of the PyTorch documentation]

Hope to get your thoughts. Thanks!

@HollowMan6 (Contributor, Author) commented:

Hi @tjruwase! Thank you for reviewing the PR! Some comments from me:

  1. The link you shared shows that is_available() already gives priority to NVML-based checking, assuming the correct environment setup. So, I am confused about your above comment.

Yes, that's correct, but I don't think it's a good idea to force users to set the flag in the environment. As https://pytorch.org/docs/stable/notes/cuda.html notes, the NVML-based CUDA availability assessment provides a weaker guarantee than the default CUDA Runtime API approach (which requires CUDA initialization to succeed); in some circumstances the NVML-based check may succeed while later CUDA initialization fails (as you noted).

From my understanding of the comments in the code, the intent here is only to determine whether we are running on a GPU or an x86 CPU with torch, not to guarantee that CUDA initialization will succeed. But to address your concern, if you do want to ensure that CUDA initialization also works, I can change the check to:

```python
if torch.cuda.device_count() > 0 and torch.cuda.is_available():  #ignore-cuda
```

That way we first ensure that devices are available, and only then check CUDA initialization; if no devices are available, we don't make any CUDA call at all.
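A minimal standalone sketch of that ordering (this is not DeepSpeed's actual accelerator-detection code, and the helper name is made up for illustration):

```python
import torch

def cuda_device_present() -> bool:  # hypothetical helper, for illustration only
    # device_count() prefers an NVML-based probe when NVML is usable, so on a
    # machine with no visible GPUs this can return 0 without touching the CUDA runtime.
    if torch.cuda.device_count() == 0:
        return False
    # Only when devices are reported do we confirm that CUDA can actually initialize.
    return torch.cuda.is_available()
```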

Will update the PR for this.

However, I think this is a problem that should be fixed in PyTorch rather than DeepSpeed. This is because DeepSpeed builds on PyTorch and should preserve semantics (for good or bad) as much as practical.

This is unfortunately a CUDA issue, and I don't think PyTorch can do much on their side either: pytorch/pytorch#141678 (comment)

For DeepSpeed, what makes it worse is that we make a CUDA call as soon as the package is imported, which puts developers in a very tough situation when something like OpenRLHF/OpenRLHF#524 (comment) happens again. So, personally, I do hope this particular case can be fixed on the DeepSpeed side (don't make any CUDA calls when no CUDA device is available).

  2. device_count() succeeds only if NVML works; otherwise it falls back to the CUDA Runtime path that causes fork poisoning.

Yes, that's true and unfortunate, but the issue is mitigated when NVML works, so it would still be great to do something on the DeepSpeed side.

@HollowMan6 HollowMan6 marked this pull request as draft December 6, 2024 16:18
@HollowMan6 HollowMan6 marked this pull request as ready for review December 6, 2024 17:11
@tjruwase (Contributor) commented Dec 9, 2024

From my understanding of the comments in the code, the intent here is only to determine whether we are running on a GPU or an x86 CPU with torch, not to guarantee that CUDA initialization will succeed

Actually, you are correct: the intention of this logic is device identification rather than initialization. I missed/forgot this nuance when I first reviewed. However, it seems unclear whether CUDA device identification can be performed without fork poisoning in general, since device_count() and NVML-based checking are not guaranteed to always succeed. Nevertheless, your PR is a usability improvement. Thanks for the great contribution.

@HollowMan6 (Contributor, Author) commented Dec 9, 2024

Will fix the formatting issue now.

This failure doesn't seem to be related to this PR (permission denied error): https://github.com/microsoft/DeepSpeed/actions/runs/12240278508/job/34142717128

@HollowMan6 (Contributor, Author) commented:
Fixed, this is not intelligent, as those are just comments XD
[screenshot of the failing check output]

@loadams loadams enabled auto-merge December 10, 2024 23:59
@loadams loadams added this pull request to the merge queue Dec 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 12, 2024
@loadams loadams merged commit 9182947 into microsoft:master Dec 12, 2024
11 checks passed
@HollowMan6 HollowMan6 deleted the pr branch December 12, 2024 20:49