[BUG] get_dummies fails in dask-cudf due to dask categorical type checking #7111
opened on Jan 11, 2021
Calling dd.get_dummies fails with dask-cudf because Dask relies on the pd.api.types.is_categorical_dtype check from pandas, and cudf categorical columns do not return True for this check. Instead, we can use cudf.utils.dtype.is_categorical_dtype.
This issue is for tracking purposes. We'll (probably) want to abstract this check away from pandas in upstream Dask.
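As a rough illustration of what a backend-agnostic check could look like, here is a minimal duck-typed sketch. The helper name is_categorical_like is hypothetical and is not part of Dask, pandas, or cudf. It assumes that categorical dtypes in both libraries expose categories and ordered attributes, which should be verified against the installed versions. Stand-in objects are used so the snippet runs without a GPU.

```python
from types import SimpleNamespace


def is_categorical_like(obj):
    """Hypothetical helper (not a Dask/pandas/cudf API).

    Returns True if `obj` (a series-like object or a dtype) looks
    categorical, based only on attributes of its dtype rather than
    on a pandas-specific isinstance check.
    """
    # Accept either a series-like object with `.dtype` or a dtype itself.
    dtype = getattr(obj, "dtype", obj)
    # Assumption: categorical dtypes expose `categories` and `ordered`.
    return hasattr(dtype, "categories") and hasattr(dtype, "ordered")


# Stand-ins for a categorical column and a plain integer column.
fake_cat_series = SimpleNamespace(
    dtype=SimpleNamespace(categories=["a", "b"], ordered=False)
)
fake_int_series = SimpleNamespace(dtype=SimpleNamespace(kind="i"))

print(is_categorical_like(fake_cat_series))
print(is_categorical_like(fake_int_series))
```

A dispatch-based approach, where each backend registers its own check, would be more explicit than duck typing. The sketch above is only meant to show that the check does not have to depend on pandas.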
import pandas as pd
import dask.dataframe as dd
import cudf
df = pd.DataFrame(
ddf = dd.from_pandas(df, 2)
ddf = ddf.categorize(columns=["B"])
dd.get_dummies(ddf, columns=['B']) # works
import pandas as pd
import dask.dataframe as dd
import cudf
df = pd.DataFrame(
ddf = dd.from_pandas(df, 2)
ddf = ddf.map_partitions(cudf.from_pandas)
ddf = ddf.categorize(columns=["B"])
dd.get_dummies(ddf, columns=['B']) # fails with NotImplementedError
NotImplementedError Traceback (most recent call last)
<ipython-input-9-ec7ee286026f> in <module>
10 ddf = ddf.map_partitions(cudf.from_pandas)
11 ddf = ddf.categorize(columns=["B"])
---> 12 dd.get_dummies(ddf, columns=['B'])
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/dataframe/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype, **kwargs)
142 else:
143 if not all(is_categorical_dtype(data[c]) for c in columns):
--> 144 raise NotImplementedError(not_cat_msg)
146 if not all(has_known_categories(data[c]) for c in columns):
NotImplementedError: `get_dummies` with non-categorical dtypes is not supported. Please use `df.categorize()` beforehand to convert to categorical dtype.
conda list | grep "rapids\|dask"
cudf 0.18.0a210110 cuda_10.2_py37_g04aa30cf11_157 rapidsai-nightly
cuml 0.18.0a210110 cuda10.2_py37_gc021decd9_70 rapidsai-nightly
dask 2020.12.0 pyhd8ed1ab_0 conda-forge
dask-core 2020.12.0 pyhd8ed1ab_0 conda-forge
dask-cuda 0.18.0a201211 py37_39 rapidsai-nightly
dask-cudf 0.18.0a210110 py37_g04aa30cf11_157 rapidsai-nightly
faiss-proc 1.0.0 cuda rapidsai-nightly
libcudf 0.18.0a210110 cuda10.2_g04aa30cf11_157 rapidsai-nightly
libcuml 0.18.0a210110 cuda10.2_gc021decd9_70 rapidsai-nightly
libcumlprims 0.18.0a201203 cuda10.2_gff080f3_0 rapidsai-nightly
librmm 0.18.0a210110 cuda10.2_g94b083a_20 rapidsai-nightly
rmm 0.18.0a210110 cuda_10.2_py37_g94b083a_20 rapidsai-nightly
ucx 1.8.1+g6b29558 cuda10.2_0 rapidsai-nightly
ucx-proc 1.0.0 gpu rapidsai-nightly
ucx-py 0.18.0a210110 py37_g6b29558_10 rapidsai-nightly