P2P shuffle questions #5939

@mrocklin

Description

Howdy folks, I took a look through the recent P2P shuffle implementation. I had a few questions and thoughts. I suspect that most of this is for @gjoseph92 and @fjetter.

Just making sure I understand things

So we've taken the structure in the POC and ...

  • added classes and things
  • added the ability for multiple simultaneous shuffles (nice thought with the shuffle ID; see the sketch after this list)
  • stripped out performance considerations (buffering, the interplay between the thread pool and the event loop)
  • stripped out disk
  • generally cleaned things up

Is this correct? Am I missing other large changes?
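
For my own understanding, the bookkeeping for simultaneous shuffles presumably looks something like the sketch below. To be clear, the names here (`ShuffleState`, `ShuffleRegistry`) are made up for illustration, not the actual classes in the PR:

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ShuffleState:
    """Per-shuffle bookkeeping (hypothetical; for illustration only)."""

    id: str
    workers: list  # participating worker addresses
    shards: dict = field(default_factory=lambda: defaultdict(list))


class ShuffleRegistry:
    """Worker-side registry keyed by shuffle ID, so several shuffles
    can be in flight at once without mixing data."""

    def __init__(self):
        self.shuffles = {}  # shuffle ID -> ShuffleState

    def get_or_create(self, id, workers):
        if id not in self.shuffles:
            self.shuffles[id] = ShuffleState(id=id, workers=workers)
        return self.shuffles[id]

    def receive(self, id, partition, shard):
        # Incoming shards land under their shuffle ID and output partition
        self.shuffles[id].shards[partition].append(shard)
```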

Some random thoughts

These are just thoughts for future improvements as we evolve. I'm looking for "yup, that's maybe a good idea" or "nope, that's probably dumb because of x, y, z"

  • At some point we might consider engaging the scheduler, rather than getting the list of workers in the first task that creates the shuffle. My guess is that we'll need the scheduler anyway when we start thinking about resilience (first sketch after this list)
  • Things like `DataFrame.groupby` should probably be pulled into the synchronous part of tasks, right? My guess is that we moved this onto the event loop because we were emphasizing simplicity over performance in this round. Is that correct, or was there some other issue going on? (Second sketch below)
  • Any objection to adding "p2p" to Dask as an option for the shuffle= keyword? (Snippet below)
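
On the scheduler point, here is a rough sketch of what I'm imagining, assuming a hypothetical scheduler extension (the class name and the `shuffle_get_workers` handler are invented here): workers would ask the scheduler who's participating, rather than baking the worker list into the graph.

```python
class ShuffleSchedulerExtension:
    """Hypothetical scheduler-side extension owning shuffle membership."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.participants = {}  # shuffle ID -> list of worker addresses
        scheduler.handlers["shuffle_get_workers"] = self.get_workers

    def get_workers(self, comm=None, id=None):
        # Pin the participant set the first time a shuffle ID shows up.
        # Centralizing this is also where resilience could live: on a
        # worker failure the scheduler sees the whole picture and could
        # hand out a revised set
        if id not in self.participants:
            self.participants[id] = list(self.scheduler.workers)
        return self.participants[id]
```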
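To make the groupby point concrete: the split is plain pandas work, so it could run inside the task in the worker's thread pool, with only the already-split shards crossing over to the event loop. A minimal sketch, with a hypothetical `split_by_worker` helper and a naive hash-based partition assignment:

```python
import pandas as pd


def split_by_worker(df, column, npartitions, workers):
    """Split one input partition into per-worker shards.

    CPU-bound pandas work like this belongs in the threaded part of the
    task, keeping the event loop free to shovel bytes between workers.
    """
    out = {}
    groups = df.groupby(df[column].map(lambda x: hash(x) % npartitions))
    for partition, shard in groups:
        worker = workers[partition % len(workers)]
        out.setdefault(worker, []).append((partition, shard))
    return out


df = pd.DataFrame({"id": [1, 2, 3, 4], "x": range(4)})
shards = split_by_worker(df, "id", npartitions=2, workers=["a:1", "b:2"])
```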
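And for the "p2p" option, I'm picturing it slotting in next to the existing values of the `shuffle=` keyword, something like:

```python
import dask.datasets

df = dask.datasets.timeseries()  # toy data with an "id" column

# "p2p" alongside the existing "disk" and "tasks" values; the exact
# spelling is of course up for discussion
shuffled = df.shuffle("id", shuffle="p2p")
```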

Future efforts

I see two major efforts:

  1. Disk and performance
  2. Resilience / worker failure

Assuming that we had magical solutions to both of those problems, what else is missing?
