[BUG] Performance difference between cudf and dask_cudf when reading jsonl files #10867

Open
@miguelusque

Describe the bug
Hi, I have noticed a difference in performance when reading a jsonl file with cudf and dask_cudf.

In both cases, I will be using only 1 GPU.

I have the following files (see details below):

  • jsonl_cudf.py
  • jsonl_dask_cudf.py

Please find below the execution times when I run them on a DGX-1 V100 (16 GB):

(rapids) root@6ccf9a94ad0e:/rapids/notebooks/host# python jsonl_cudf.py 
4.183666706085205
(rapids) root@6ccf9a94ad0e:/rapids/notebooks/host# python jsonl_dask_cudf.py
6.8754589557647705

The contents of the scripts are as follows:
jsonl_cudf.py

import cudf
import time

start = time.time()
df = cudf.read_json("x00_002GB.jsonl", lines=True)
end = time.time()
print(end - start)

and
jsonl_dask_cudf.py

import dask_cudf
import time

start = time.time()
df = dask_cudf.read_json("x00_002GB.jsonl", lines=True)
end = time.time()
print(end - start)
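
For anyone trying to reproduce these numbers, note that a single `time.time()` measurement also includes one-time costs such as CUDA context creation on the first GPU call. Below is a minimal sketch of a repeated-timing helper; `time_call` and its `repeats` parameter are hypothetical names introduced here for illustration, not part of cudf or dask_cudf. Since `dask_cudf.read_json` returns a lazy collection, the commented usage assumes `.compute()` is called so that both measurements cover a full load.

```python
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Hypothetical usage (assumes a GPU with cudf and dask_cudf installed):
#   import cudf, dask_cudf
#   time_call(lambda: cudf.read_json("x00_002GB.jsonl", lines=True))
#   time_call(lambda: dask_cudf.read_json("x00_002GB.jsonl", lines=True).compute())
```

Comparing the later runs rather than the first one should make it easier to separate warm-up cost from the actual read time.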

Steps/Code to reproduce bug
Hi @shwina, as discussed in the Slack channel, I will send you an email with a link to the dataset used. Thanks!

Expected behavior
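On a single GPU, reading the same file with dask_cudf should take roughly as long as with cudf, rather than about 64% longer (6.88 s vs. 4.18 s).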

Environment overview (please complete the following information)
DGX-A100, CUDA 11.5, RAPIDS 22.04

Labels: 0 - Backlog, Performance, bug, dask