`TaskGroup.nbytes_in_memory` miscounted for replicated keys

I think there is a logic error with bookkeping for `TaskGroup.nbytes_in_memory`. There's a discrepancy between how we increment it and decrement it when multiple workers hold the same key.

In `transition_memory_released`, we decrement it by `nbytes` once for every worker that holds that task: https://github.com/dask/distributed/blob/6340b5b34a582ff1ffe3eaf9dd7f3aac33d0a983/distributed/scheduler.py#L2646-L2649
Whereas in `_propagate_forgotten`, we decrement it once by `nbytes` if there are any workers holding the task, regardless of how many. This doesn't match with `transition_memory_released`:
https://github.com/dask/distributed/blob/646b12be37ef82e2afa2baf2827716746fa93173/distributed/scheduler.py#L7339-L7341

On the creation side, in `TaskState.set_nbytes`, we only increment it by the _diff_ between the last known value and the current value. If the key is being copied to multiple workers, this difference is usually 0: https://github.com/dask/distributed/blob/6340b5b34a582ff1ffe3eaf9dd7f3aac33d0a983/distributed/scheduler.py#L1556-L1562

In short, I think `TaskGroup.nbytes_in_memory` is incremented once per key, but decremented once per _copy_ of the key.

If `nbytes` can be different for different workers, then to do this bookkeeping correctly, I think we'd also need to track `TaskState.total_nbytes` (size of all copies of the key), then decrement by that once in `transition_memory_released` and `_propagate_forgotten`.

Discovered in https://github.com/dask/distributed/pull/4925#issuecomment-863600812. I think #4925 made this more apparent, since it encourages more data replication.

cc @crusaderky since you know more about replicated keys.

	for ws in ts._who_has:
	del ws._has_what[ts]
	ws._nbytes -= ts_nbytes
	ts._group._nbytes_in_memory -= ts_nbytes

	def set_nbytes(self, nbytes: Py_ssize_t):
	diff: Py_ssize_t = nbytes
	old_nbytes: Py_ssize_t = self._nbytes
	if old_nbytes >= 0:
	diff -= old_nbytes
	self._group._nbytes_total += diff
	self._group._nbytes_in_memory += diff

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

`TaskGroup.nbytes_in_memory` miscounted for replicated keys #4927

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

	ts_nbytes: Py_ssize_t = ts.get_nbytes()
	if ts._who_has:
	ts._group._nbytes_in_memory -= ts_nbytes

Uh oh!

TaskGroup.nbytes_in_memory miscounted for replicated keys #4927

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`TaskGroup.nbytes_in_memory` miscounted for replicated keys #4927