Describe the bug
Hi, I have noticed a difference in performance when reading a jsonl file with cudf and dask_cudf.
In both cases, I will be using only 1 GPU.
I have the following files (see details below):
- jsonl_cudf.py
- jsonl_dask_cudf.py
Please find below the execution times, in seconds, when I run them on a DGX-1 V100 (16 GB):
(rapids) root@6ccf9a94ad0e:/rapids/notebooks/host# python jsonl_cudf.py
4.183666706085205
(rapids) root@6ccf9a94ad0e:/rapids/notebooks/host# python jsonl_dask_cudf.py
6.8754589557647705
The contents of the scripts are as follows:
jsonl_cudf.py
import cudf
import time
start = time.time()
df = cudf.read_json("x00_002GB.jsonl", lines=True)
end = time.time()
print(end - start)
and
jsonl_dask_cudf.py
import dask_cudf
import time
start = time.time()
df = dask_cudf.read_json("x00_002GB.jsonl", lines=True)
end = time.time()
print(end - start)
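One caveat when comparing the two numbers: Dask collections are generally lazy, so the time measured around `dask_cudf.read_json` alone may not cover the same work as the eager `cudf.read_json` call. Below is a minimal sketch of a timing helper that can optionally materialize the result before stopping the clock. The names `time_call` and `materialize` are hypothetical and only for illustration; they are not part of cudf or dask_cudf.

```python
import time


def time_call(fn, *args, materialize=False, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds).

    If materialize is True and the result has a .compute() method
    (as Dask collections do), .compute() is called before the clock
    stops, so the measurement includes the actual data loading.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if materialize and hasattr(result, "compute"):
        # Assumption: .compute() triggers the deferred work.
        result = result.compute()
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, `time_call(cudf.read_json, "x00_002GB.jsonl", lines=True)` would time the eager read, while `time_call(dask_cudf.read_json, "x00_002GB.jsonl", lines=True, materialize=True)` would also include the `.compute()` step in the measurement.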
Steps/Code to reproduce bug
Hi @shwina , as discussed in the Slack channel, I will send you an email with the link to the dataset used. Thanks!
Expected behavior
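I would not expect such a large difference in performance: on a single GPU, dask_cudf takes about 1.6x as long as cudf (6.88 s vs. 4.18 s) to read the same file.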
Environment overview (please complete the following information)
DGX-A100, CUDA 11.5, RAPIDS 22.04