Worker can memory overflow by fetching too much data at once #6208

@fjetter

Description

There is a possibility for a worker to kill itself by fetching too much data at once.

Old version:

while self.data_needed and (
    len(self.in_flight_workers) < self.total_out_connections
    or self.comm_nbytes < self.comm_threshold_bytes
):

New version:

if (
    len(self.in_flight_workers) >= self.total_out_connections
    and self.comm_nbytes >= self.comm_threshold_bytes
):
    return {}, []
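
In both forms the throttle only stops once both limits are exceeded: as long as fewer than total_out_connections transfers are in flight, more data keeps being scheduled regardless of comm_nbytes. A minimal, self-contained sketch of that condition (illustrative threshold and key sizes, not the actual worker code):

# Illustrative stand-in for the gating condition quoted above; the names
# mirror the worker attributes, but this is not the distributed codebase.
total_out_connections = 50          # default
comm_threshold_bytes = 10 * 2**20   # illustrative threshold (~10 MiB)
key_nbytes = 500 * 2**20            # ~500 MB per key, as in the example below

def bytes_scheduled_before_gate_closes(pending_keys: int) -> int:
    """Simulate scheduling one key per new connection until the gate closes."""
    in_flight_workers = set()
    comm_nbytes = 0
    while pending_keys and (
        len(in_flight_workers) < total_out_connections
        or comm_nbytes < comm_threshold_bytes
    ):
        in_flight_workers.add(len(in_flight_workers))  # open a new connection
        comm_nbytes += key_nbytes                      # one ~500 MB key on it
        pending_keys -= 1
    return comm_nbytes

print(bytes_scheduled_before_gate_closes(1000) / 2**30)  # ~24.4 GiB in flight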

Simple example

Let's assume that keys are ~500 MB each on average
Local memory available is ~10 GB
total_out_connections: 50 (the default)

That would cause us to fetch up to 50 × ~500 MB ≈ 25 GB at once, which would kill the worker immediately.

This problem is exacerbated if the keys' data sizes are underestimated (e.g. a bias or bug in dask.sizeof).
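
To illustrate the underestimation point with made-up numbers: the byte accounting relies on the estimated sizes, so when dask.sizeof under-reports, the throttle sees far less than what is actually materialised in memory.

# Hypothetical numbers showing how a sizeof underestimate skews the accounting.
n_keys_in_flight = 50             # one key per outgoing connection
estimated_nbytes = 100 * 2**20    # ~100 MB reported per key (underestimate)
actual_nbytes = 500 * 2**20       # ~500 MB actually held once deserialised

accounted = n_keys_in_flight * estimated_nbytes
materialised = n_keys_in_flight * actual_nbytes
print(f"accounted for:    {accounted / 2**30:.1f} GiB")     # ~4.9 GiB
print(f"actually fetched: {materialised / 2**30:.1f} GiB")  # ~24.4 GiB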

cc @crusaderky
