
[ML] Investigate alternative methods for sharing job memory usage information #34084

Closed
@davidkyle

Description


When there are multiple ML nodes in the cluster, the job allocation decision is based on the number of open jobs on each node and how much memory they use. Job memory usage is stored in the job configuration and is updated periodically while the job runs, whenever a model size stats doc is emitted by autodetect. This can lead to frequent job config updates (cluster state updates), particularly for historical look-back jobs.

  1. Consider moving the job's established memory usage out of the config, since it is a result of the job running, not part of its setup.
  2. Consider alternative methods to gather the open jobs' memory usage and make that information trivially available to the code making the allocation decision (a sketch of one option follows this list).
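As a sketch of option 2, and assuming nothing about the eventual design: each node could keep the latest model memory for its open jobs in a plain in-memory map, updated whenever autodetect emits model size stats, instead of writing the value back into the job config via a cluster state update. The allocation code would then read this map (or a node-level summary of it) directly. `JobMemoryTracker` and its methods are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a node-local tracker of job memory usage that avoids
// persisting every model size stats update into the job configuration.
public class JobMemoryTracker {

    private final Map<String, Long> memoryByJobId = new ConcurrentHashMap<>();

    /** Called when a new model size stats document is seen for an open job. */
    public void updateMemory(String jobId, long modelBytes) {
        memoryByJobId.put(jobId, modelBytes);
    }

    /** Called when a job closes, so stale entries do not skew allocation. */
    public void removeJob(String jobId) {
        memoryByJobId.remove(jobId);
    }

    /** Latest known memory for a job, or a caller-supplied default if unknown. */
    public long memoryOrDefault(String jobId, long defaultBytes) {
        return memoryByJobId.getOrDefault(jobId, defaultBytes);
    }
}
```

Whatever the mechanism, the trade-off is the same: frequent updates stay local and cheap, at the cost of the allocating node needing a way to collect or query the per-node numbers when it makes the decision.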

This is pertinent to the job config migration project #32905, where the job's memory usage is not available in the cluster state during the allocation decision. A temporary workaround was implemented in #33994, basing the decision on the job count rather than memory usage.
