[BUG] Multiple DataFrame.loc operations give a confusing error message upon compute on Dask-cuDF #11434

Open
@alextxu

Describe the bug
After creating a Dask-cuDF DataFrame, if I apply multiple .loc operations to it using boolean Dask-cuDF Series, computing the result raises a RuntimeError with the message "cuDF failure at: ../src/stream_compaction/apply_boolean_mask.cu:73: Column size mismatch". The message does not say which .loc call or which sizes are involved. A similar snippet works as expected on cuDF.

Steps/Code to reproduce bug

import dask_cudf
import cudf
ddf1 = dask_cudf.from_cudf(cudf.DataFrame({'a':[1,2,3], 'b':[4,5,6]}), npartitions=2)
f1 = dask_cudf.from_cudf(cudf.Series([False, True, True]), npartitions=2)
f2 = dask_cudf.from_cudf(cudf.Series([True, False]), npartitions=2)
ddf2 = ddf1.loc[f1]
ddf3 = ddf2.loc[f2]
print(ddf2.compute())
print(ddf3.compute())

The above code produces the following output:

   a  b
1  2  5
2  3  6
Traceback (most recent call last):
  File "temp.py", line 9, in <module>
    print(ddf3.compute())
  File "/opt/conda/lib/python3.8/site-packages/dask/base.py", line 292, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/base.py", line 575, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 554, in get_sync
    return get_async(
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 497, in get_async
    for key, res_info, failed in queue_get(queue).result():
  File "/opt/conda/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 539, in submit
    fut.set_result(fn(*args, **kwargs))
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 235, in batch_execute_tasks
    return [execute_task(*a) for a in it]
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 235, in <listcomp>
    return [execute_task(*a) for a in it]
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 226, in execute_task
    result = pack_exception(e, dumps)
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 221, in execute_task
    result = _execute_task(task, data)
  File "/opt/conda/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/opt/conda/lib/python3.8/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/opt/conda/lib/python3.8/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/opt/conda/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/opt/conda/lib/python3.8/site-packages/dask/utils.py", line 39, in apply
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/core.py", line 6330, in apply_and_enforce
    df = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/methods.py", line 37, in loc
    return df.loc[iindexer]
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 127, in __getitem__
    return self._getitem_tuple_arg(arg)
  File "/opt/conda/lib/python3.8/site-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 267, in _getitem_tuple_arg
    df = columns_df._apply_boolean_mask(tmp_arg[0])
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/indexed_frame.py", line 1696, in _apply_boolean_mask
    libcudf.stream_compaction.apply_boolean_mask(
  File "cudf/_lib/stream_compaction.pyx", line 101, in cudf._lib.stream_compaction.apply_boolean_mask
RuntimeError: cuDF failure at: ../src/stream_compaction/apply_boolean_mask.cu:73: Column size mismatch

Expected behavior
Expected output (verified with cudf instead of dask-cudf):

   a  b
1  2  5
2  3  6
   a  b
1  2  5
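
For reference, the semantics I expect can be sketched in plain Python. This is only a toy model and not the cuDF or Dask implementation: the helper `apply_boolean_mask` below is a hypothetical stand-in that applies a mask positionally, and its length check is my assumption about what "Column size mismatch" means (the mask and the frame it is applied to have different lengths).

```python
# Toy model of chained boolean masking (illustrative only, not cuDF internals).
# Each row is (index, a, b), mirroring the frame in the reproducer.

def apply_boolean_mask(rows, mask):
    """Keep rows whose positional mask entry is True; lengths must match."""
    if len(rows) != len(mask):
        # Assumed analogue of the libcudf check behind the reported message.
        raise RuntimeError(
            f"Column size mismatch: {len(rows)} rows vs {len(mask)} mask values"
        )
    return [row for row, keep in zip(rows, mask) if keep]

rows = [(0, 1, 4), (1, 2, 5), (2, 3, 6)]

# First .loc: three rows, three mask values.
step1 = apply_boolean_mask(rows, [False, True, True])

# Second .loc: two remaining rows, two mask values.
step2 = apply_boolean_mask(step1, [True, False])

print(step1)
print(step2)

# A mask of the wrong length fails, and here the message names both sizes.
try:
    apply_boolean_mask(step1, [True])
except RuntimeError as err:
    mismatch_message = str(err)
    print(mismatch_message)
```

In this model the two-element mask lines up with the two rows left after the first filter, which matches the cuDF output above. Under Dask-cuDF the mask and the filtered frame are presumably split into partitions separately, so a per-partition length mismatch seems plausible, but I have not confirmed that this is the cause.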

Environment overview

  • Environment location: Docker
  • Method of cuDF install: Docker
    • docker run -it --rm --gpus all --ipc=host --network=host -v .

Environment details
cuDF version 22.4.0a0+306.g0cb75a4913

Labels: 0 - Backlog (in queue waiting for assignment), bug (something isn't working), dask (Dask issue)
