[BUG] get_dummies fails in dask-cudf due to dask categorical type checking #7111

Closed
@beckernick

Description

Calling dd.get_dummies fails with dask-cudf because Dask relies on the pd.api.types.is_categorical_dtype check from pandas. cuDF categorical columns do not return True for this check. Instead, we can use cudf.utils.dtypes.is_categorical_dtype.

This issue is for tracking purposes. We'll (probably) want to abstract this from pandas in upstream Dask.
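
One way to read the abstraction suggested above is a type-based dispatch: each backend registers its own categorical check, and the calling code never imports pandas directly. The sketch below is a minimal, pure-Python illustration of that idea. `CategoricalCheckDispatch`, `CpuSeries`, and `GpuSeries` are hypothetical stand-ins, not Dask, pandas, or cuDF APIs, and this is not necessarily how upstream Dask would implement it.

```python
# Hypothetical sketch: none of these names are real Dask, pandas, or cuDF APIs.


class CategoricalCheckDispatch:
    """Pick a backend-specific categorical check based on the column's type."""

    def __init__(self):
        self._lookup = {}

    def register(self, column_type):
        """Decorator that registers `func` as the check for `column_type`."""
        def decorator(func):
            self._lookup[column_type] = func
            return func
        return decorator

    def __call__(self, column):
        # Walk the MRO so subclasses reuse the check registered for a parent.
        for cls in type(column).__mro__:
            if cls in self._lookup:
                return self._lookup[cls](column)
        raise TypeError(
            f"No categorical check registered for {type(column).__name__}"
        )


is_categorical = CategoricalCheckDispatch()


class CpuSeries:
    """Stand-in for a pandas-backed column (assumption for illustration)."""

    def __init__(self, dtype_name):
        self.dtype_name = dtype_name


class GpuSeries:
    """Stand-in for a cuDF-backed column with a different dtype representation."""

    def __init__(self, dtype_kind):
        self.dtype_kind = dtype_kind


@is_categorical.register(CpuSeries)
def _cpu_is_categorical(column):
    return column.dtype_name == "category"


@is_categorical.register(GpuSeries)
def _gpu_is_categorical(column):
    return column.dtype_kind == "cat"


# A get_dummies-style guard can now stay backend-agnostic.
columns = [CpuSeries("category"), GpuSeries("cat")]
all_categorical = all(is_categorical(c) for c in columns)
```

With this shape, a GPU library would register its own check once, and a guard like the one in get_dummies would not need to know which backend produced the column.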

import pandas as pd
import dask.dataframe as dd
import cudf

df = pd.DataFrame(
    {"A":["a","b","b"],
     "B":[1,2,3]
    })
ddf = dd.from_pandas(df, 2)
ddf = ddf.categorize(columns=["B"])
dd.get_dummies(ddf, columns=['B']) # works

import pandas as pd
import dask.dataframe as dd
import cudf

df = pd.DataFrame(
    {"A":["a","b","b"],
     "B":[1,2,3]
    })
ddf = dd.from_pandas(df, 2)
ddf = ddf.map_partitions(cudf.from_pandas)
ddf = ddf.categorize(columns=["B"])
dd.get_dummies(ddf, columns=['B']) # fails with NotImplementedError
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-9-ec7ee286026f> in <module>
     10 ddf = ddf.map_partitions(cudf.from_pandas)
     11 ddf = ddf.categorize(columns=["B"])
---> 12 dd.get_dummies(ddf, columns=['B'])

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/dataframe/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype, **kwargs)
    142         else:
    143             if not all(is_categorical_dtype(data[c]) for c in columns):
--> 144                 raise NotImplementedError(not_cat_msg)
    145 
    146         if not all(has_known_categories(data[c]) for c in columns):

NotImplementedError: `get_dummies` with non-categorical dtypes is not supported. Please use `df.categorize()` beforehand to convert to categorical dtype.
conda list | grep "rapids\|dask"
cudf                      0.18.0a210110   cuda_10.2_py37_g04aa30cf11_157    rapidsai-nightly
cuml                      0.18.0a210110   cuda10.2_py37_gc021decd9_70    rapidsai-nightly
dask                      2020.12.0          pyhd8ed1ab_0    conda-forge
dask-core                 2020.12.0          pyhd8ed1ab_0    conda-forge
dask-cuda                 0.18.0a201211           py37_39    rapidsai-nightly
dask-cudf                 0.18.0a210110   py37_g04aa30cf11_157    rapidsai-nightly
faiss-proc                1.0.0                      cuda    rapidsai-nightly
libcudf                   0.18.0a210110   cuda10.2_g04aa30cf11_157    rapidsai-nightly
libcuml                   0.18.0a210110   cuda10.2_gc021decd9_70    rapidsai-nightly
libcumlprims              0.18.0a201203   cuda10.2_gff080f3_0    rapidsai-nightly
librmm                    0.18.0a210110   cuda10.2_g94b083a_20    rapidsai-nightly
rmm                       0.18.0a210110   cuda_10.2_py37_g94b083a_20    rapidsai-nightly
ucx                       1.8.1+g6b29558       cuda10.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.18.0a210110   py37_g6b29558_10    rapidsai-nightly

Labels: Python, bug, dask
