Howdy folks, I took a look through the recent P2P shuffle implementation. I had a few questions and thoughts. I suspect that most of this is for @gjoseph92 and @fjetter.
Just making sure I understand things
So we've taken the structure in the POC and ...
- added classes and things
- added the ability for multiple simultaneous shuffles (nice thought with the shuffle ID; see the sketch below)
- stripped out performance considerations (buffering, the interplay between the thread pool and the event loop)
- stripped out disk
- Generally cleaned things up
Is this correct? Am I missing other large changes?
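
To check my reading of the shuffle ID point above, here's a minimal sketch of per-shuffle state keyed by ID; the class and method names here are hypothetical, not the actual extension code.

```python
from collections import defaultdict


class ShuffleState:
    """State for one logical shuffle on one worker (names hypothetical)."""

    def __init__(self, shuffle_id: str, npartitions: int):
        self.shuffle_id = shuffle_id
        self.npartitions = npartitions
        # output partition number -> list of received shards
        self.shards = defaultdict(list)

    def receive(self, partition: int, shard) -> None:
        self.shards[partition].append(shard)


class ShuffleWorkerExtension:
    """One per worker; several shuffles can be in flight at once because
    all state is looked up by shuffle ID."""

    def __init__(self):
        self.shuffles: dict[str, ShuffleState] = {}

    def get_or_create(self, shuffle_id: str, npartitions: int) -> ShuffleState:
        if shuffle_id not in self.shuffles:
            self.shuffles[shuffle_id] = ShuffleState(shuffle_id, npartitions)
        return self.shuffles[shuffle_id]
```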
Some random thoughts
These are just thoughts for future improvements as we evolve. I'm looking for "yup, that's maybe a good idea" or "nope, that's probably dumb because of x, y, z"
- At some point we might consider engaging the scheduler, rather than getting the list of workers in the first create task. My guess is that we'll need the scheduler anyway when we start thinking about resilience
- things like DataFrame.groupby should probably be pulled into the sync part of tasks, right? My guess is that we moved this onto the event loop because we were emphasizing simplicity over performance in this round. Is that correct or was there some other issue going on?
- Any objection to adding "p2p" to Dask as an option there?
Future efforts
I see two major efforts:
- Disk and performance
- Resilience / worker failure
Assuming that we had magical solutions to both of those problems, what else is missing?