
P2P Be robust to connection timeouts during shuffle_receive #8011

Closed
Tracked by #8043
fjetter opened this issue Jul 18, 2023 · 9 comments · Fixed by #8124
Labels: enhancement (Improve existing functionality or make things work better), good first issue (Clearly described and easy to accomplish. Good for beginners to the project.), networking, stability (Issue or feature related to cluster stability, e.g. deadlock)

Comments

@fjetter
Member

fjetter commented Jul 18, 2023

When sending shards, we currently rely on distributed.comm.timeouts.connect being large enough to establish a connection when the CommPool is cold.

However, if the remote is struggling with a blocked event loop, this timeout can be too short, particularly as long as #7698 is not fixed.
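As a stopgap while that isn't fixed, the timeout can be raised through the Dask config; a minimal sketch, with an illustrative value:

import dask

# Raise the connection timeout discussed above; "60s" is just an example value.
dask.config.set({"distributed.comm.timeouts.connect": "60s"})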

We could consider retrying the send on CommErrors. If the remote is dead, the shuffle extension will fail the tasks anyway, so I believe we can be arbitrarily generous with retries during this operation.

return await self.rpc(address).shuffle_receive(
data=to_serialize(shards),
shuffle_id=self.id,
run_id=self.run_id,
)
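A minimal sketch of what such a retry could look like, as a hypothetical wrapper around the call above (the name, attempt count, and backoff are illustrative, not the actual implementation):

import asyncio

async def _shuffle_receive_with_retry(self, address, shards, attempts=5, delay=0.5):
    # Illustrative only: retry the send on OSError (comm failures), backing
    # off exponentially between attempts. to_serialize / self.rpc are used as
    # in the snippet above.
    for attempt in range(attempts):
        try:
            return await self.rpc(address).shuffle_receive(
                data=to_serialize(shards),
                shuffle_id=self.id,
                run_id=self.run_id,
            )
        except OSError:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(delay * 2**attempt)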

@fjetter added the enhancement, networking, good first issue, and stability labels on Jul 18, 2023
@hendrikmakait
Member

When adding retries, we should make sure the retried code also checks whether the shuffle run has been closed.

@hendrikmakait
Member

After merging #7698, P2P shuffle tests have become very flaky. I've tracked this down to a change in the order in which the scheduler is informed about the worker leaving and listeners are being stopped. This now causes the shuffle tasks to fail with a CommClosedError before the scheduler is able to restart the shuffle. As a result, the scheduler will think the tasks genuinely experienced a problem and will not restart them.

Consequently, implementing a retry mechanism has become increasingly important. @charlesbluca, let's coordinate on this!

@fjetter
Member Author

fjetter commented Jul 31, 2023

I could see us also work around #7698 by warming up our connection pools before the shuffle commences. Something like

import asyncio


class P2PWorkerPlugin:

    async def warmup_comms(self):
        # Establish a pooled connection to every participating worker up front
        # so the first shard transfer does not pay the connection cost.
        await asyncio.gather(
            *(self.rpc(w).ping() for w in self.workers)
        )

Maybe the scheduler initiates this during shuffle init. As long as we have fewer than 512 workers (that's the comm pool's connection limit), this will likely be a good thing and avoid having any transfer tasks fail due to a connection error. The problem still exists, of course, but this way most users will likely not be impacted.

@fjetter
Member Author

fjetter commented Jul 31, 2023

@hendrikmakait can you elaborate on the shutdown ordering that leads to this? My intuition says that stopping the listener right away, as done in #7698, is the correct approach, and I wonder whether we have to change other actions to make this more consistent, or whether my intuition is wrong.

@fjetter
Member Author

fjetter commented Jul 31, 2023

When adding retries, we should make sure the retried code also checks whether the shuffle run has been closed.

Thought about this briefly again. I'm not entirely sure this is necessary. I'd expect the remote to either ignore the pushed data or even raise a proper exception if the shuffle was already closed. Is it actually necessary to add more logic on the sender side?

Just to be clear, I think we should only retry on OSError.

@hendrikmakait
Member

Thought about this briefly again. I'm not entirely sure this is necessary. I'd expect the remote to either ignore the pushed data or even raise a proper exception if the shuffle was already closed. Is it actually necessary to add more logic on the sender side?

I've thought of this as an early stopping mechanism for the retries. If the shuffle has been closed (e.g., because a remote worker dropped), there's no point in retrying; we'd just keep going until we max out whatever retry limit we've set. That worker won't respond anymore.
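Concretely, in a retry loop like the one sketched earlier, this early stop could be a check before each attempt (self.closed is a hypothetical flag marking a closed shuffle run):

for attempt in range(attempts):
    # Hypothetical early stop: if the shuffle run was closed (e.g. because a
    # remote worker dropped), further retries cannot succeed.
    if self.closed:
        raise RuntimeError(f"shuffle {self.id} was closed, not retrying")
    ...  # attempt shuffle_receive and handle OSError as in the earlier sketch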

@hendrikmakait
Member

hendrikmakait commented Jul 31, 2023

As a result, the scheduler will think the tasks genuinely experienced a problem and will not restart them.

The problem is that once the listeners are stopped, RPC calls to the worker start throwing errors. Remotes making those RPC calls (notably P2P's send buffers and barrier) receive the errors and raise them, which causes P2P tasks to transition to erred. For all the scheduler knows, those are legitimate errors: it has not removed the worker yet, so the shuffle has not been restarted, which would have invalidated all those erred tasks coming in from workers. If P2P tasks fail due to what looks like a legitimate error (or exceed the suspicious count), we will not restart the P2P shuffle.

For the scheduler to remove the worker, its batched stream to the worker needs to be closed. The worker won't close the stream before it has

  • canceled all asynchronous instructions
  • torn down all preloads
  • closed all extensions
  • torn down all plugins
  • torn down the scheduler RPC
  • stopped all services

so there's now plenty of time for task-erred messages to arrive at the scheduler.

Previously, the worker would have stopped its listeners after it had closed the batched stream to the scheduler. So in almost all scenarios, the scheduler would have removed the worker and caused the shuffle to restart before any task-erred messages started coming in. The notable exception is a networking issue where it takes a while for the close-stream message to arrive at the scheduler.

@hendrikmakait
Member

TL;DR: I think it's generally a good idea to close listeners early and not let them potentially interfere with shutdown. The problem is just that it adds a race condition we haven't covered in P2P restarts so far.

The most holistic way of solving this is to handle the errors this causes on the receiving end in P2P. The easiest approach should be retrying with some generously configured limits to ensure that there can never be a deadlock. Checking whether we can stop retrying because the shuffle has been closed (e.g., due to a remote worker leaving) seems like a worthwhile optimization but should not be strictly necessary; cancelled tasks would just block workers longer than necessary.

@hendrikmakait
Member

After merging #7698, P2P shuffle tests have become very flaky. I've tracked this down to a change in the order in which the scheduler is informed about the worker leaving and listeners are being stopped. This now causes the shuffle tasks to fail with a CommClosedError before the scheduler is able to restart the shuffle. As a result, the scheduler will think the tasks genuinely experienced a problem and will not restart them.

This problem has been solved more holistically (see #8088).
