Description
I'm trying to train an XGBoost model on a machine with 8x A100 GPUs (80GB memory each), but I'm getting an out-of-memory error:

```
MemoryError('std::bad_alloc: out_of_memory: CUDA error at: .../include/rmm/mr/device/cuda_memory_resource.hpp')
```

The error is slightly different if I use the `rmm_pool_size` parameter, but it is still a memory error:

```
MemoryError('std::bad_alloc: out_of_memory: RMM failure at:../include/rmm/mr/device/pool_memory_resource.hpp:196: Maximum pool size exceeded')
```
I'm using a `LocalCUDACluster` to distribute the workload amongst the 8 GPUs. I can tell from the Dask dashboard that the data is mostly loading onto a single GPU while all of the other GPUs sit empty and idle.
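For reference, here is roughly how I create the cluster and client (a minimal sketch; the `rmm_pool_size` value shown is illustrative, not my exact setting):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per GPU; rmm_pool_size pre-allocates an RMM pool on each
# device (illustrative value -- this is the setting that changes the error
# to "Maximum pool size exceeded").
cluster = LocalCUDACluster(n_workers=8, rmm_pool_size="70GB")
client = Client(cluster)
```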
I read the data using `dask_cudf.read_parquet(file_name, blocksize=int(2e4))`, and the resulting dataframe has shape (20459297, 213), though I would like to try much larger datasets. Training completes successfully with a smaller dataframe of shape (16304159, 213) and fewer workers, but it still mostly uses a single GPU.
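Continuing from the client above, the load-and-train path looks roughly like this (a sketch: the file path, label column, and training parameters are placeholders, not my exact code):

```python
import dask_cudf
import xgboost as xgb

file_name = "data/train.parquet"  # placeholder path

# Load the parquet data across the cluster's GPUs.
ddf = dask_cudf.read_parquet(file_name, blocksize=int(2e4))

# "label" is a placeholder for the actual target column.
y = ddf["label"]
X = ddf.drop(columns=["label"])

# Build the distributed DMatrix and train on the GPU.
dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"tree_method": "gpu_hist"},  # GPU histogram algorithm in xgboost 1.7.x
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]
```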
Edit: here's a screenshot of the dashboard while the model trains successfully. Note this is with only 2 GPUs and the smaller dataframe mentioned above:

[dashboard screenshot]
The GPUs are running CUDA 12.0 with driver 525.60.13. Here are the versions of some relevant packages:
```
xgboost         1.7.4     rapidsai_py310h1395376_6           rapidsai
rapids          23.08.00  cuda12_py310_230809_g2a5b6f0_0     rapidsai
python          3.10.12   hd12c33a_0_cpython                 conda-forge
rapids-xgboost  23.08.00  cuda12_py310_230809_g2a5b6f0_0     rapidsai
cuda-version    12.0      hffde075_2                         conda-forge
cudf            23.08.00  cuda12_py310_230809_g8150d38e08_0  rapidsai
dask            2023.7.1  pyhd8ed1ab_0                       conda-forge
dask-core       2023.7.1  pyhd8ed1ab_0                       conda-forge
dask-cuda       23.08.00  py310_230809_gefbd6ca_0            rapidsai
dask-cudf       23.08.00  cuda12_py310_230809_g8150d38e08_0  rapidsai
```
Any help with the memory issue would be much appreciated.