Description
I'm trying to train an XGBoost model on a machine with 8x A100 GPUs (80GB memory each), but I'm getting an out-of-memory error:

```
MemoryError('std::bad_alloc: out_of_memory: CUDA error at: .../include/rmm/mr/device/cuda_memory_resource.hpp')
```

The error is slightly different if I use the `rmm_pool_size` parameter, but it is still a memory error:

```
MemoryError('std::bad_alloc: out_of_memory: RMM failure at:../include/rmm/mr/device/pool_memory_resource.hpp:196: Maximum pool size exceeded')
```
I'm using a `LocalCUDACluster` to distribute the workload amongst the 8 GPUs. I can tell from the Dask dashboard that the data is mostly loading onto a single GPU while all of the other GPUs sit empty and idle.
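For reference, here is roughly how I create the cluster and client (a minimal sketch; the `rmm_pool_size` value shown is illustrative, not my exact setting):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per GPU; rmm_pool_size pre-allocates an RMM pool on each
# device (illustrative value -- this is the setting that changes the error
# to "Maximum pool size exceeded").
cluster = LocalCUDACluster(n_workers=8, rmm_pool_size="70GB")
client = Client(cluster)
```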
I read the data using `dask_cudf.read_parquet(file_name, blocksize=int(2e4))`, and the resulting dataframe has shape (20459297, 213), though I would like to try much larger datasets. Training completes successfully with a smaller dataframe of shape (16304159, 213) and fewer workers, but it still mostly uses a single GPU.
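Continuing from the client above, the load-and-train path looks roughly like this (a sketch: the file path, label column, and training parameters are placeholders, not my exact code):

```python
import dask_cudf
import xgboost as xgb

file_name = "data/train.parquet"  # placeholder path

# Load the parquet data across the cluster's GPUs.
ddf = dask_cudf.read_parquet(file_name, blocksize=int(2e4))

# "label" is a placeholder for the actual target column.
y = ddf["label"]
X = ddf.drop(columns=["label"])

# Build the distributed DMatrix and train on the GPU.
dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"tree_method": "gpu_hist"},  # GPU histogram algorithm in xgboost 1.7.x
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]
```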
Edit: here's a screenshot of the dashboard while the model trains successfully. Note this is with only 2 GPUs and the smaller dataframe mentioned above:

[dashboard screenshot]
The GPUs are running CUDA 12.0 with driver 525.60.13. Here are the versions of some relevant packages:
```
xgboost         1.7.4     rapidsai_py310h1395376_6           rapidsai
rapids          23.08.00  cuda12_py310_230809_g2a5b6f0_0     rapidsai
python          3.10.12   hd12c33a_0_cpython                 conda-forge
rapids-xgboost  23.08.00  cuda12_py310_230809_g2a5b6f0_0     rapidsai
cuda-version    12.0      hffde075_2                         conda-forge
cudf            23.08.00  cuda12_py310_230809_g8150d38e08_0  rapidsai
dask            2023.7.1  pyhd8ed1ab_0                       conda-forge
dask-core       2023.7.1  pyhd8ed1ab_0                       conda-forge
dask-cuda       23.08.00  py310_230809_gefbd6ca_0            rapidsai
dask-cudf       23.08.00  cuda12_py310_230809_g8150d38e08_0  rapidsai
```
Any help with the memory issue would be much appreciated.