[BUG] dask_cudf - aggregate - to_csv memory error #13220
Description
Describe the bug
I am loading a large dataframe (~60M x 300) from CSV via dask_cudf, then doing a groupby and sum, and saving the result back to CSV. I get an OOM error. I am using an A100-80GB GPU along with 200 GB of RAM.
All columns hold numerical values, except the groupby column, which is left as the index. The error should therefore be reproducible with a random dataframe.
I noted a similar issue, @10426; however, the error message there is different, so I was unsure whether it is the same problem.
Additionally, I repeatedly get a high CPU garbage collection message. I assume that is because of the size of the dataframe and the many reads/writes; correct me if that is not the case.
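For a sense of scale, here is a rough back-of-the-envelope estimate of the in-memory size of the dataframe described above. It is only a sketch: the report does not state the column dtypes, so the 8 bytes per value (float64/int64) is an assumption, and `estimate_gb` is a hypothetical helper, not part of any library.

```python
# Rough size estimate for a ~60M x 300 numeric dataframe.
# Assumption (not stated in the report): every value takes 8 bytes
# (float64 or int64). With 4-byte dtypes the figure would be halved.

ROWS = 60_000_000
COLS = 300
GPU_MEMORY_GB = 80    # A100-80GB, from the report
HOST_MEMORY_GB = 200  # system RAM, from the report


def estimate_gb(rows, cols, bytes_per_value=8):
    """Return the approximate size in GB (10**9 bytes) of a dense numeric table."""
    return rows * cols * bytes_per_value / 1e9


size_8 = estimate_gb(ROWS, COLS, 8)
size_4 = estimate_gb(ROWS, COLS, 4)

print(f"8-byte values: {size_8:.1f} GB")  # 144.0 GB
print(f"4-byte values: {size_4:.1f} GB")  # 72.0 GB
print(f"Fits in GPU memory (8-byte)?  {size_8 <= GPU_MEMORY_GB}")
print(f"Fits in host memory (8-byte)? {size_8 <= HOST_MEMORY_GB}")
```

Under the 8-byte assumption the raw data alone is larger than the GPU's memory, so the workload would depend on partitioning and spilling rather than holding everything on the device at once.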
Steps/Code to reproduce bug
import numpy as np
import pandas as pd
import cudf
import cupy
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from dask.utils import parse_bytes
import dask_cudf
cluster = LocalCUDACluster(
    jit_unspill=True,
    rmm_pool_size=parse_bytes("64 GB"),
    n_workers=1,
    device_memory_limit=parse_bytes("160 GB"),
    local_directory='local_temp',
    threads_per_worker=32,
)
client = Client(cluster)
df = dask_cudf.read_csv('../02_all_study/02_tad_80_cluster_ref.tsv',sep = '\t')
df2 = df.drop('Contig',axis=1)
res = df2.groupby('ref90_cluster').sum()
res.to_csv('04_cluster_groups_csv')
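Since the original TSV is not attached, a synthetic stand-in with the same layout could be used to try the steps above. The following is a minimal sketch using only the standard library. The column names `Contig` and `ref90_cluster` come from the code above; the value column names (`sample_0`, ...), the cluster label format, the default sizes, and the helper name `write_synthetic_tsv` are all hypothetical and chosen only for illustration.

```python
import csv
import random


def write_synthetic_tsv(path, n_rows=1000, n_value_cols=5, n_clusters=20, seed=0):
    """Write a tab-separated file shaped like the input used in the report.

    Columns: 'Contig' (dropped before the groupby), 'ref90_cluster'
    (the groupby key), then n_value_cols random numeric columns.
    Returns the header row that was written.
    """
    rng = random.Random(seed)  # seeded so the file is reproducible
    header = ["Contig", "ref90_cluster"]
    header += [f"sample_{i}" for i in range(n_value_cols)]  # hypothetical names

    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(header)
        for row in range(n_rows):
            contig = f"contig_{row}"
            cluster = f"cluster_{rng.randrange(n_clusters)}"
            values = [round(rng.random(), 6) for _ in range(n_value_cols)]
            writer.writerow([contig, cluster] + values)
    return header


if __name__ == "__main__":
    # Small defaults for a quick check; scale n_rows and n_value_cols up
    # toward ~60M x 300 to approach the size described in the report.
    write_synthetic_tsv("synthetic_cluster_ref.tsv")
```

The generated file can then be passed to `dask_cudf.read_csv(..., sep='\t')` in place of the original path. Whether random data triggers the same error is untested here.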
Output: I think the error message repeats after each nanny restart, but I have included the entire error message for thoroughness (attached as a file because of its size):
dask_to_csv_error.txt
Expected behavior
A clear and concise description of what you expected to happen.
Environment overview (please complete the following information)
- Environment location: RHEL server
- Method of cuDF install: conda (mamba)
Environment details
***git***
Not inside a git repository

***OS Information***
NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.9"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.9 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.9"
Red Hat Enterprise Linux Server release 7.9 (Maipo)
Red Hat Enterprise Linux Server release 7.9 (Maipo)
Linux atl1-1-01-006-7-0.pace.gatech.edu 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 9 16:09:48 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

***GPU Information***
Tue Apr 25 18:48:17 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:25:00.0 Off |                    0 |
| N/A   33C    P0    61W / 300W |  72218MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     26580      C   ...s/rapids-23.04/bin/python    10315MiB |
|    0   N/A  N/A     27149      C   ...s/rapids-23.04/bin/python    61901MiB |
+-----------------------------------------------------------------------------+

***CPU***
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7513 32-Core Processor
Stepping: 1
CPU MHz: 2600.000
CPU max MHz: 2600.0000
CPU min MHz: 1500.0000
BogoMIPS: 5200.16
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
NUMA node4 CPU(s): 32-39
NUMA node5 CPU(s): 40-47
NUMA node6 CPU(s): 48-55
NUMA node7 CPU(s): 56-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 invpcid_single hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq overflow_recov succor smca

***CMake***
/bin/cmake
cmake version 2.8.12.2

***g++***
/usr/lib64/ccache/g++
g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

***nvcc***
/usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/compilers/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

***Python***
/storage/home/hcoda1/6/rridley3/data/dir/anaconda3/envs/rapids-23.04/bin/python
Python 3.10.10

***Environment Variables***
PATH : /usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/compilers/extras/qd/bin:/usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/comm_libs/mpi/bin:/usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/compilers/bin:/usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/cuda/bin:/usr/local/pace-apps/spack/packages/linux-rhel7-x86_64/gcc-4.8.5/cuda-11.7.0-7sdye3id7ahz34mzhyzzqbxowjxgxkhu/bin:/storage/home/hcoda1/6/rridley3/.cargo/bin:/storage/home/hcoda1/6/rridley3/data/dir/anaconda3/envs/rapids-23.04/bin:/storage/home/hcoda1/6/rridley3/data/dir/apps:/storage/home/hcoda1/6/rridley3/.aspera/connect/bin:/opt/pace-common/bin:/opt/slurm/current/bin:/opt/pace-system/bin:/usr/lpp/mmfs/bin:/usr/lib64/ccache:/sbin:/bin:/usr/sbin:/usr/bin:/opt/iozone/bin:/storage/home/hcoda1/6/rridley3/edirect
LD_LIBRARY_PATH :
/usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/comm_libs/nvshmem/lib:/usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/comm_libs/nccl/lib:/usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/comm_libs/mpi/lib:/usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/math_libs/lib64:/usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/compilers/lib:/usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/compilers/extras/qd/lib:/usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/cuda/extras/CUPTI/lib64:/usr/local/pace-apps/manual/packages/nvhpc/Linux_x86_64/22.11/cuda/lib64:/usr/local/pace-apps/spack/packages/linux-rhel7-x86_64/gcc-4.8.5/cuda-11.7.0-7sdye3id7ahz34mzhyzzqbxowjxgxkhu/lib64:/opt/slurm/current/lib::
NUMBAPRO_NVVM :
NUMBAPRO_LIBDEVICE :
CONDA_PREFIX : /storage/home/hcoda1/6/rridley3/data/dir/anaconda3/envs/rapids-23.04
PYTHON_PATH :

conda not found

***pip packages***
/storage/home/hcoda1/6/rridley3/data/dir/anaconda3/envs/rapids-23.04/bin/pip
Package Version
----------------------------- -----------
aiofiles 22.1.0
aiohttp 3.8.4
aiosignal 1.3.1
aiosqlite 0.18.0
anyio 3.6.2
aplus 0.11.0
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
asciitree 0.3.3
astropy 5.2.2
asttokens 2.2.1
async-timeout 4.0.2
attrs 22.2.0
Babel 2.12.1
backcall 0.2.0
backports.functools-lru-cache 1.6.4
beautifulsoup4 4.12.2
blake3 0.2.1
bleach 6.0.0
bokeh 2.4.3
bqplot 0.12.39
branca 0.6.0
brotlipy 0.7.0
cached-property 1.5.2
cachetools 5.3.0
certifi 2022.12.7
cffi 1.15.1
charset-normalizer 2.1.1
click 8.1.3
click-plugins 1.1.1
cligj 0.7.2
cloudpickle 2.2.1
colorama 0.4.6
colorcet 3.0.1
comm 0.1.3
confluent-kafka 1.7.0
contourpy 1.0.7
cryptography 40.0.2
cubinlinker 0.2.2
cucim 23.4.1
cuda-python 11.8.1
cudf 23.4.0
cudf-kafka 23.4.0
cugraph 23.4.0
cuml 23.4.0
cupy 11.6.0
cusignal 23.4.0
cuspatial 23.4.0
custreamz 23.4.0
cuxfilter 23.4.0
cycler 0.11.0
cytoolz 0.12.0
dask 2023.3.2
dask-cuda 23.4.0
dask-cudf 23.4.0
dask-labextension 6.1.0
datashader 0.14.4
datashape 0.5.4
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
distributed 2023.3.2.1
entrypoints 0.4
executing 1.2.0
fastapi 0.95.1
fastavro 1.7.3
fasteners 0.18
fastjsonschema 2.16.3
fastrlock 0.8
filelock 3.12.0
Fiona 1.9.1
flit_core 3.8.0
folium 0.14.0
fonttools 4.39.3
fqdn 1.5.1
frozendict 2.3.7
frozenlist 1.3.3
fsspec 2023.4.0
future 0.18.3
GDAL 3.6.2
geopandas 0.12.2
graphviz 0.20.1
h5py 3.8.0
holoviews 1.15.4
idna 3.4
imagecodecs 2023.1.23
imageio 2.27.0
importlib-metadata 6.5.0
importlib-resources 5.12.0
ipycytoscape 1.3.3
ipydatawidgets 4.3.2
ipykernel 6.22.0
ipyleaflet 0.17.2
ipympl 0.9.3
ipython 8.12.0
ipython-genutils 0.2.0
ipyvolume 0.6.1
ipyvue 1.8.0
ipyvuetify 1.8.4
ipywebrtc 0.6.0
ipywidgets 8.0.6
isoduration 20.11.0
jedi 0.18.2
Jinja2 3.1.2
joblib 1.2.0
json5 0.9.5
jsonpointer 2.3
jsonschema 4.17.3
jupyter_client 8.2.0
jupyter_core 5.3.0
jupyter-events 0.6.3
jupyter_server 2.5.0
jupyter_server_fileid 0.9.0
jupyter-server-proxy 3.2.2
jupyter_server_terminals 0.4.4
jupyter_server_ydoc 0.8.0
jupyter-ydoc 0.2.3
jupyterlab 3.6.3
jupyterlab-pygments 0.2.2
jupyterlab_server 2.22.1
jupyterlab-widgets 3.0.7
kiwisolver 1.4.4
lazy_loader 0.2
llvmlite 0.39.1
locket 1.0.0
lz4 4.3.2
mapclassify 2.5.0
Markdown 3.4.3
markdown-it-py 2.2.0
MarkupSafe 2.1.2
matplotlib 3.7.1
matplotlib-inline 0.1.6
mdurl 0.1.0
mistune 2.0.5
msgpack 1.0.5
multidict 6.0.4
multipledispatch 0.6.0
munch 2.5.0
munkres 1.1.4
nbclassic 0.5.5
nbclient 0.7.3
nbconvert 7.3.1
nbformat 5.8.0
nest-asyncio 1.5.6
networkx 3.1
notebook 6.5.4
notebook_shim 0.2.3
numba 0.56.4
numcodecs 0.11.0
numexpr 2.8.4
numpy 1.23.5
nvtx 0.2.5
packaging 23.1
pandas 1.5.3
pandocfilters 1.5.0
panel 0.14.1
param 1.13.0
parso 0.8.3
partd 1.4.0
patsy 0.5.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.4.0
pip 23.1
pkgutil_resolve_name 1.3.10
platformdirs 3.2.0
pooch 1.7.0
progressbar2 4.2.0
prometheus-client 0.16.0
prompt-toolkit 3.0.38
protobuf 4.21.12
psutil 5.9.5
ptxcompiler 0.7.0
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 10.0.1
pycparser 2.21
pyct 0.4.6
pydantic 1.10.7
pydeck 0.5.0
pyee 8.1.0
pyerfa 2.0.0.3
Pygments 2.15.1
pylibcugraph 23.4.0
pylibraft 23.4.0
pynvml 11.4.1
pyOpenSSL 23.1.1
pyparsing 3.0.9
pyppeteer 1.0.2
pyproj 3.4.0
pyrsistent 0.19.3
PySocks 1.7.1
python-dateutil 2.8.2
python-json-logger 2.0.7
python-utils 3.5.2
pythreejs 2.4.2
pytz 2023.3
pyviz-comms 2.2.1
PyWavelets 1.4.1
PyYAML 6.0
pyzmq 25.0.2
raft-dask 23.4.0
requests 2.28.2
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.3.4
rmm 23.4.0
Rtree 1.0.1
scikit-image 0.20.0
scikit-learn 1.2.2
scipy 1.10.1
seaborn 0.12.2
Send2Trash 1.8.0
setuptools 67.6.1
shapely 2.0.1
simpervisor 0.4
six 1.16.0
sniffio 1.3.0
sortedcontainers 2.4.0
soupsieve 2.3.2.post1
spectate 1.0.1
stack-data 0.6.2
starlette 0.26.1
statsmodels 0.13.5
streamz 0.6.4
tables 3.7.0
tabulate 0.9.0
tblib 1.7.0
terminado 0.17.1
threadpoolctl 3.1.0
tifffile 2023.4.12
tiledb 0.21.2
tinycss2 1.2.1
tomli 2.0.1
toolz 0.12.0
tornado 6.3
tqdm 4.65.0
traitlets 5.9.0
traittypes 0.2.1
treelite 3.2.0
treelite-runtime 3.2.0
typing_extensions 4.5.0
ucx-py 0.31.0
unicodedata2 15.0.0
uri-template 1.2.0
urllib3 1.26.15
vaex-astro 0.9.3
vaex-core 4.16.1
vaex-hdf5 0.14.1
vaex-jupyter 0.8.1
vaex-ml 0.18.1
vaex-server 0.8.1
vaex-viz 0.5.4
wcwidth 0.2.6
webcolors 1.13
webencodings 0.5.1
websocket-client 1.5.1
websockets 10.4
wheel 0.40.0
widgetsnbextension 4.0.7
xarray 2023.4.1
xgboost 1.7.5
xyzservices 2023.2.0
y-py 0.5.9
yarl 1.8.2
ypy-websocket 0.8.2
zarr 2.14.2
zict 3.0.0
zipp 3.15.0
Additional context
Add any other context about the problem here.