-
-
Notifications
You must be signed in to change notification settings - Fork 734
Closed
Labels
Description
There is a possibility for a worker to suicide by fetching too much data at once
Old version:
distributed/distributed/worker.py
Lines 2754 to 2757 in 84cbb09
while self.data_needed and ( | |
len(self.in_flight_workers) < self.total_out_connections | |
or self.comm_nbytes < self.comm_threshold_bytes | |
): |
New version:
distributed/distributed/worker_state_machine.py
Lines 1425 to 1429 in 3647cfe
if ( | |
len(self.in_flight_workers) >= self.total_out_connections | |
and self.comm_nbytes >= self.comm_threshold_bytes | |
): | |
return {}, [] |
Simple example
Let's assume that keys have on average ~500MB
Local memory available is ~10GB
total_out_connections: 50 (default)
That would cause us to fetch up to 25GB at once which would kill the worker immediately.
This problem is exacerbated if the keys data sizes are underestimated (e.g. bias/bug in dask.sizeof)
cc @crusaderky