P2P shuffle is slow with string dtypes #7880

mrocklin · 2023-06-02T20:43:38Z

import coiled
import dask.dataframe as dd
from dask.distributed import wait

cluster = coiled.Cluster(
    n_workers=30,
    worker_cpu=4,
    region="us-east-2",  # start workers close to data to minimize costs
    arm=True,
)

client = cluster.get_client()

# this takes 1m21s
df = dd.read_parquet("s3://coiled-datasets/uber-lyft-tlc/")
df = df.set_index("request_datetime", shuffle="tasks").persist()
_ = wait(df)

# this takes 2m12s
df = dd.read_parquet("s3://coiled-datasets/uber-lyft-tlc/")
df = df.set_index("request_datetime", shuffle="p2p").persist()
_ = wait(df)

GIL contention is very high during the p2p shuffle (and also during the tasks shuffle), and CPU usage is above 100%. This suggests, maybe, that creating and deleting lots of Python objects is slowing us down considerably.

cc @hendrikmakait @jrbourbeau

hendrikmakait · 2023-06-05T17:18:13Z

I'm wondering how best to address this. In your example, the string columns are of type string[python], so converting back to that type feels like the right thing to do to me, even if it's overly expensive. Maybe the right way would be to convert string[python] to string[pyarrow] during a P2P shuffle if dataframe.convert-string is set, and to raise a warning if we encounter a string[python] column and it's not set?

mrocklin · 2023-06-05T20:43:30Z

Sorry, I had the config option set. Dtypes coming in were string[pyarrow].

mrocklin · 2023-06-05T20:44:12Z

dask.config.set({"dataframe.convert-string": True})  # use PyArrow strings by default
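Since the confusion above was about whether the option was active, a quick way to check is to read the value back. This is only a sketch using dask.config.set and dask.config.get; the default=None fallback is an assumption to keep it working on versions where the key has no default.

```python
import dask

# Scoped override: the option is only active inside the with-block.
with dask.config.set({"dataframe.convert-string": True}):
    inside = dask.config.get("dataframe.convert-string")

# Outside the block the previous value (possibly unset) is restored.
outside = dask.config.get("dataframe.convert-string", default=None)

print("inside block:", inside)
print("outside block:", outside)
```

Calling dask.config.set without the with-block, as in the comment above, changes the value for the rest of the session instead.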

fjetter · 2023-11-16T17:48:24Z

Brief update in case somebody stumbles over this.

With current main, p2p performs mostly the same as tasks. However, when disabling disk entirely (dask.config.set({"distributed.p2p.disk": False})), we're at 10s, so disk appears to slow us down much more than one would naively assume (#8323 is trying to address this).

Method              Duration
tasks               1min 8s
p2p                 1min 15s
p2p (without disk)  10s
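To put the numbers in the table side by side, the durations can be converted to seconds and compared. This is plain arithmetic on the three reported values; the helper to_seconds and the dictionary keys are illustrative names, not part of dask.

```python
import re


def to_seconds(text: str) -> int:
    """Parse durations written like '1min 8s' or '10s' into whole seconds."""
    match = re.fullmatch(r"(?:(\d+)\s*min)?\s*(?:(\d+)\s*s)?", text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, secs = (int(group) if group else 0 for group in match.groups())
    return 60 * minutes + secs


# Durations as reported in the table above.
reported = {"tasks": "1min 8s", "p2p": "1min 15s", "p2p_no_disk": "10s"}
seconds = {method: to_seconds(value) for method, value in reported.items()}

# p2p with disk relative to tasks, and relative to p2p without disk.
p2p_vs_tasks = seconds["p2p"] / seconds["tasks"]
p2p_vs_no_disk = seconds["p2p"] / seconds["p2p_no_disk"]

print(seconds)
print(f"p2p / tasks: {p2p_vs_tasks:.2f}x")
print(f"p2p / p2p without disk: {p2p_vs_no_disk:.1f}x")
```

So p2p with disk is about 10% slower than tasks in this measurement, while dropping disk makes p2p 7.5 times faster than p2p with disk.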

hendrikmakait self-assigned this Jun 5, 2023

hendrikmakait added performance shuffle labels Jun 5, 2023

hendrikmakait mentioned this issue Jun 8, 2023

Improved conversion between pyarrow and pandas in P2P shuffling #7896

Merged

2 tasks

hendrikmakait mentioned this issue Aug 16, 2023

[Tracking] Advancements for P2P #8043

Open

15 tasks
