cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable #1342

Open

Description

Hi, I'm running a CuPy test with Dask and hit a CUDA error that reproduces only with dask-cuda, not with CuPy alone. I'm hoping to get some help here. The error:

2024-05-30 07:53:26,923 - distributed.worker - WARNING - Compute Failed
Key:       ('uniform-fb64484a444179abec146c7c4ac41c23', 103, 0)
Function:  _apply_random_func
args:      (<class 'cupy.random._bit_generator.XORWOW'>, 'uniform', SeedSequence(
    entropy=1,
    spawn_key=(103,),
), (65536, 256), [0.0, 1.0], {})
kwargs:    {}
Exception: "CUDARuntimeError('cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable')"

The script I'm running, along with the dependency versions:

import dask
from dask import array as da
import dask_cuda
from dask_cuda import LocalCUDACluster
from distributed import Client, wait
import cupy

print(dask.__version__)
print(dask_cuda.__version__)
print(cupy.__version__)

# 2024.1.1
# 24.04.00
# 13.1.0

def make_regression(n_samples: int, n_features: int) -> tuple[da.Array, da.Array]:
    rng = da.random.default_rng(1)
    X = rng.uniform(size=(n_samples, n_features), chunks=(256**2, -1))
    y = X.sum(axis=1)
    return X, y


def main(client: Client) -> None:
    X, y = make_regression(n_samples=2**25, n_features=256)
    X, y = client.persist([X, y])
    wait([X, y])

if __name__ == "__main__":
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            with dask.config.set({"array.backend": "cupy"}):
                main(client)

If I run cupy alone, it can generate random numbers just fine:

>>> import cupy
>>> rng = cupy.random.default_rng(1)
>>> rng.uniform(size=256**2)
array([0.85849577, 0.01410829, 0.28965574, ..., 0.6419934 , 0.35287664,
       0.16616132])

Lastly, I'm running the test on an EGX cluster with 4 T4s:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
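
One thing that may be worth ruling out (an assumption, not something confirmed in this report): `cudaErrorDevicesUnavailable` is commonly seen when a GPU is in a non-default compute mode such as `Exclusive_Process` and another process already holds a context on it. Since dask-cuda runs one worker process per GPU alongside the client process, that could explain why CuPy alone works. A minimal sketch for checking the compute mode of each GPU is below; the helper names `parse_compute_modes`, `non_default_gpus`, and `query_compute_modes` are hypothetical, and only the Python standard library plus the `nvidia-smi` CLI are used.

```python
import shutil
import subprocess


def parse_compute_modes(csv_text: str) -> dict[int, str]:
    """Parse lines like '0, Default' into {gpu_index: compute_mode}."""
    modes: dict[int, str] = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def non_default_gpus(modes: dict[int, str]) -> list[int]:
    """Return the indices of GPUs whose compute mode is not 'Default'."""
    return [index for index, mode in sorted(modes.items()) if mode != "Default"]


def query_compute_modes() -> dict[int, str]:
    """Ask nvidia-smi for each GPU's compute mode; return {} if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return {}
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return {}
    return parse_compute_modes(result.stdout)


if __name__ == "__main__":
    modes = query_compute_modes()
    if not modes:
        print("Could not query nvidia-smi on this machine.")
    for index in non_default_gpus(modes):
        # A non-default mode may prevent a second process from using the GPU.
        print(f"GPU {index} is in compute mode {modes[index]!r}")
```

If every GPU reports `Default`, the compute mode is probably not the cause and the problem lies elsewhere.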