read_deltalake vs read_parquet performance #58

Open
@j-bennet

Description

I ran a quick test in a notebook on a Coiled cluster, reading Delta Lake data from s3 with dd.read_parquet vs ddt.read_deltalake.

Cluster: https://cloud.coiled.io/clusters/245026/information?account=dask-engineering.

Data is located in s3://coiled-datasets/delta/.

Results:

| dataset | computation | timing (read_parquet) | timing (read_deltalake) |
| --- | --- | --- | --- |
| ds20f_100M | `ddf["int1"].sum().compute()` | CPU times: user 43.5 ms, sys: 10.7 ms, total: 54.2 ms, Wall time: 8.04 s | CPU times: user 159 ms, sys: 38.4 ms, total: 198 ms, Wall time: 55.3 s |
| ds20f_100M | `ddf.describe().compute()` | CPU times: user 256 ms, sys: 28.7 ms, total: 284 ms, Wall time: 20.7 s | CPU times: user 380 ms, sys: 60.7 ms, total: 441 ms, Wall time: 1min 10s |
| ds25f_250M | `ddf["int1"].sum().compute()` | CPU times: user 67.1 ms, sys: 15.6 ms, total: 82.7 ms, Wall time: 16.7 s | CPU times: user 666 ms, sys: 176 ms, total: 842 ms, Wall time: 3min 59s |
| ds25f_250M | `ddf.describe().compute()` | CPU times: user 605 ms, sys: 70.3 ms, total: 675 ms, Wall time: 1min 10s | CPU times: user 1.02 s, sys: 181 ms, total: 1.2 s, Wall time: 4min 2s |
| ds50f_500M | `ddf["int1"].sum().compute()` | CPU times: user 204 ms, sys: 49.2 ms, total: 253 ms, Wall time: 1min 2s | CPU times: user 2.93 s, sys: 626 ms, total: 3.56 s, Wall time: 16min 46s |
| ds50f_500M | `ddf.describe().compute()` | CPU times: user 3.59 s, sys: 383 ms, total: 3.97 s, Wall time: 5min 53s | killed before finished |

This doesn't look good: `read_deltalake` is consistently several times slower than `read_parquet` on wall time, and the gap widens with dataset size (roughly 7x on ds20f_100M up to ~16x on ds50f_500M). This needs looking into.
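For reference, a minimal sketch of how the comparison could be reproduced. The dataset path, dataset name, and the `int1` column come from the table above; the `timed` context manager is my own stand-in for the notebook's `%time` magic, and the dask calls are shown commented out since they require a running cluster with s3 access plus `dask` and `dask-deltatable` installed:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall time of the enclosed block, similar to %time."""
    start = time.perf_counter()
    yield
    print(f"{label}: Wall time: {time.perf_counter() - start:.2f} s")

# Hypothetical usage against the issue's datasets:
#
#   import dask.dataframe as dd
#   import dask_deltatable as ddt
#
#   path = "s3://coiled-datasets/delta/ds20f_100M"
#
#   with timed("read_parquet sum"):
#       dd.read_parquet(path)["int1"].sum().compute()
#
#   with timed("read_deltalake sum"):
#       ddt.read_deltalake(path)["int1"].sum().compute()
```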
