
[QST] Dask-cudf/Xgboost out of memory error #14016


Description

@rohanpaul14855

I'm trying to train an XGBoost model on a machine with 8x A100 GPUs (80 GB memory each), but I'm getting an out-of-memory error:
MemoryError('std::bad_alloc: out_of_memory: CUDA error at: .../include/rmm/mr/device/cuda_memory_resource.hpp'). The error is slightly different if I use the rmm_pool_size parameter, but it is still a memory error: MemoryError('std::bad_alloc: out_of_memory: RMM failure at:../include/rmm/mr/device/pool_memory_resource.hpp:196: Maximum pool size exceeded').

I'm using a LocalCUDACluster to distribute the workload across the 8 GPUs. I can tell by looking at the Dask dashboard that the data is mostly loading onto a single GPU while all of the other GPUs sit empty and idle.
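For context, my setup looks roughly like the sketch below (simplified for this issue; the pool size, file path, label column, and XGBoost parameters are illustrative placeholders rather than my exact values):

```python
# Rough sketch of the setup described above -- the pool size, paths, label
# column, and training parameters are placeholders, not my exact values.
import dask_cudf
import xgboost as xgb
from dask_cuda import LocalCUDACluster
from distributed import Client

# LocalCUDACluster starts one worker per visible GPU; rmm_pool_size
# pre-allocates an RMM memory pool on each GPU.
cluster = LocalCUDACluster(rmm_pool_size="70GB")
client = Client(cluster)

ddf = dask_cudf.read_parquet("data.parquet", blocksize=int(2e4))
X = ddf.drop(columns=["label"])
y = ddf["label"]

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"tree_method": "gpu_hist", "objective": "binary:logistic"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]
```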

I read the data using dask_cudf.read_parquet(file_name, blocksize=int(2e4)), which gives a dataframe of shape (20459297, 213), though I would like to try much larger datasets. Training completes successfully with a smaller dataframe of shape (16304159, 213) and fewer workers, but it still mostly uses a single GPU.
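If it helps, continuing from the sketch above (client and ddf are the names from that sketch), this is roughly how I can check the partition placement programmatically in addition to the dashboard:

```python
# Continuing from the sketch above: persist the dataframe and count how many
# partitions each worker (GPU) ends up holding.
from distributed import futures_of, wait

ddf = ddf.persist()
wait(ddf)

# who_has maps each task key to the worker address(es) holding its data
who_has = client.who_has(futures_of(ddf))
per_worker = {}
for key, workers in who_has.items():
    for w in workers:
        per_worker[w] = per_worker.get(w, 0) + 1
print(per_worker)  # roughly equal counts per worker would indicate balanced data
```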

Edit: Here's a screenshot of the dashboard when the model is training successfully; note that this is with only 2 GPUs and the smaller dataframe mentioned above.

[dashboard screenshot]

The GPUs are running CUDA 12.0 with driver 525.60.13.
Here are the versions of some relevant packages:

xgboost                   1.7.4           rapidsai_py310h1395376_6    rapidsai
rapids                    23.08.00        cuda12_py310_230809_g2a5b6f0_0    rapidsai
python                    3.10.12         hd12c33a_0_cpython    conda-forge
rapids-xgboost            23.08.00        cuda12_py310_230809_g2a5b6f0_0    rapidsai
cuda-version              12.0                 hffde075_2    conda-forge
cudf                      23.08.00        cuda12_py310_230809_g8150d38e08_0    rapidsai
dask                      2023.7.1           pyhd8ed1ab_0    conda-forge
dask-core                 2023.7.1           pyhd8ed1ab_0    conda-forge
dask-cuda                 23.08.00        py310_230809_gefbd6ca_0    rapidsai
dask-cudf                 23.08.00        cuda12_py310_230809_g8150d38e08_0    rapidsai

Any help with the memory issue would be much appreciated.

Labels: 0 - Backlog, Python, dask, question
