
[ML] Add processor autoscaling decider #89645


Conversation

dimitris-athanasiou
Contributor

This adds a processor decider to the ML autoscaling decider service.
This first implementation is deliberately simple and naive. It computes
the required node processor capacity as the maximum trained model
deployment `threads_per_allocation`, and the required tier capacity as
the sum of all processors required by trained model deployments.

@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Aug 26, 2022
@elasticsearchmachine
Collaborator

Hi @dimitris-athanasiou, I've created a changelog YAML for you.

Contributor

@droberts195 droberts195 left a comment


I had a first look at this but stopped looking at the details after I realised there might be a problem with float vs rounded CPU quota. I think we need to make sure that value is round-tripped consistently, from Cloud to Elasticsearch via `node.processors` and then from Elasticsearch back to Cloud via autoscaling responses, to avoid the risk of needless scale-ups.
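A minimal sketch of the concern, assuming the quota arrives as a fractional value and is rounded up before being reported back:

```java
// Sketch of the float vs rounded CPU quota concern. If Cloud allocates a
// fractional quota (e.g. 2.5 via node.processors) and Elasticsearch reports
// a rounded-up value back in its autoscaling response, Cloud could see a
// requirement larger than what it allocated and scale up needlessly.
public class CpuRoundTripSketch {
    // Rounding up loses the fractional part of the quota.
    static int reportRounded(double nodeProcessors) {
        return (int) Math.ceil(nodeProcessors);
    }

    public static void main(String[] args) {
        double cloudQuota = 2.5;                   // what Cloud allocated
        int reported = reportRounded(cloudQuota);  // what gets reported back: 3
        System.out.println(reported > cloudQuota); // true -> apparent need to scale up
    }
}
```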

@dimitris-athanasiou
Contributor Author

@elasticmachine update branch

This adds a processor decider to the ML autoscaling decider service.
This first implementation is simple and naive. It simply computes
the required processor capacity to be the max trained model deployment
`threads_per_allocation` for the node, and the sum of all processors
required by trained model deployments for the tier.
@dimitris-athanasiou dimitris-athanasiou force-pushed the processor-autoscaling-decider branch from 0461e7d to 8dc4ac3 Compare September 7, 2022 11:48
@dimitris-athanasiou
Contributor Author

@droberts195 I have updated the PR to make use of the newly added Processors class. We now store the ML node attribute as a double, so we can return the exact value when reporting current capacity.

Contributor

@droberts195 droberts195 left a comment


LGTM, although I think there is one thing that could be improved in the future. Please add a comment about that.


final MlProcessorAutoscalingCapacity requiredCapacity = computeRequiredCapacity(trainedModelAssignmentMetadata).build();

if (requiredCapacity.tierProcessors().roundUp() == currentCapacity.tierProcessors().roundUp()) {
Contributor


This line means that if we have 2.5 processors and need 3 then we won't scale up.

Even if it's not fixed in this PR, I think it's worth adding a comment to call that out. Maybe we can fix it in a follow-up for 8.6.
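The rounding behaviour behind this observation can be illustrated with a small sketch (the `roundUp` helper here stands in for the method on the `Processors` class):

```java
// Sketch of why the rounded-up equality check does not trigger a scale up
// from 2.5 to 3 processors: both values round up to the same integer.
public class RoundUpComparisonSketch {
    static int roundUp(double processors) {
        return (int) Math.ceil(processors);
    }

    public static void main(String[] args) {
        double current = 2.5;  // current tier processors
        double required = 3.0; // required tier processors
        // Both round up to 3, so the equality check treats the capacities as
        // equal and no scale up is requested.
        System.out.println(roundUp(current) == roundUp(required)); // true
    }
}
```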

Contributor Author


If we get here we are definitely not in a scenario where we want to scale up: we know all our model deployments are satisfied. The reason I added that check is to provide a nicer message explaining why we neither scale down nor scale up when we are exactly at the capacity we need.

Contributor


We know all our model deployments are satisfied.

But the question is whether 2.5 processors should "satisfy" a model that wanted 3. In TrainedModelAssignmentRebalancer we call getNodeAllocatedProcessors(discoveryNode).roundUp(), so it will happily allocate a model that wants 3 processors to a node that has 2.5. I think it's a follow-up for the future to consider whether we should scale up in that scenario. Maybe the code should be changed in TrainedModelAssignmentRebalancer rather than here, though.

Anyway, I'm happy to leave it as-is for the first version.

Contributor Author


Right, I see the point now. OK, I'll merge as-is and revisit later.

@dimitris-athanasiou dimitris-athanasiou merged commit 407dc18 into elastic:main Sep 7, 2022
@dimitris-athanasiou dimitris-athanasiou deleted the processor-autoscaling-decider branch September 7, 2022 15:03
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Sep 7, 2022
In elastic#89645 I switched the ml node attribute for allocated processors
to write a string representation of a double value. However, that
means the attribute value cannot be parsed in a mixed-version cluster
when the node that tries to read it is on a version before 8.5.0.

This commit addresses the problem by introducing a new attribute
that holds the string representation of the double value, while
keeping the older attribute written as an integer.

It also consolidates the logic for parsing the attribute value from
a given node into a single place, as the logic now involves BWC and
the duplication was adding complexity.
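A rough sketch of the BWC fallback described in the commit message (the attribute names and helper are hypothetical, not necessarily what the actual commit uses):

```java
import java.util.Map;

// Hypothetical sketch of the BWC parsing logic: prefer a newer attribute
// holding the double's string representation, and fall back to the older
// integer-valued attribute for pre-8.5.0 nodes. Attribute names are
// illustrative only.
public class ProcessorAttributeSketch {
    static double allocatedProcessors(Map<String, String> nodeAttributes) {
        String asDouble = nodeAttributes.get("ml.allocated_processors_double");
        if (asDouble != null) {
            return Double.parseDouble(asDouble);
        }
        String asInt = nodeAttributes.get("ml.allocated_processors");
        return asInt == null ? 0.0 : Integer.parseInt(asInt);
    }
}
```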
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 8, 2022
* main: (175 commits)
  Fix integration test on Windows (elastic#89894)
  Avoiding the use of dynamic map keys in the cluster_formation results of the stable master health indicator (elastic#89842)
  Mute org.elasticsearch.tracing.apm.ApmIT.testCapturesTracesForHttpTraffic (elastic#89891)
  Fix typos in audit event types (elastic#89886)
  Synthetic _source: support histogram field (elastic#89833)
  [TSDB] Rename rollup public API to downsample  (elastic#89809)
  Format script values access (elastic#89780)
  [DOCS] Simplifies composite aggregation recommendation (elastic#89878)
  [DOCS] Update CCS compatibility matrix for 8.3 (elastic#88906)
  Fix memory leak when double invoking RestChannel.sendResponse (elastic#89873)
  [ML] Add processor autoscaling decider (elastic#89645)
  Update disk-usage.asciidoc (elastic#89709) (elastic#89874)
  Add allocation deciders in createComponents (elastic#89836)
  Mute flaky H3LatLonGeometryTest.testIndexPoints (elastic#89870)
  Fix typo in get-snapshot-status-api doc (elastic#89865)
  Picking master eligible node at random in the master stability health indicator (elastic#89841)
  Do not reuse the client after a disruption elastic#89815 (elastic#89866)
  [ML] Distribute trained model allocations across availability zones (elastic#89822)
  Increment clientCalledCount before onResponse (elastic#89858)
  AwaitsFix for elastic#89867
  ...
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 8, 2022
# Conflicts:
#	x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/downsample/RollupShardIndexer.java
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 8, 2022
droberts195 pushed a commit that referenced this pull request Sep 8, 2022
…9896)

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Sep 23, 2022
When the processor autoscaling decider was added in elastic#89645, an unwanted
change of behaviour sneaked in. In particular, if we cannot determine
the required memory capacity, we previously returned `new AutoscalingDeciderResult(null)`,
whereas we now return an autoscaling result with no memory capacity
and whatever the processor decider produced.

Previously, if we returned a result with null capacity, the cluster
would remain as-is. Now, it is possible to cause unwanted scaling.

This commit fixes this by checking whether the memory decider result
was undetermined and returning an empty result if so.

Also, warning logs have been added to surface scenarios that should
not happen, such as when the memory tracker is not called by the
master node or has no memory estimate for anomaly detection or
analytics jobs.
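The fix described above could be sketched like this (types and names are illustrative, not the actual decider API):

```java
import java.util.Optional;

// Illustrative sketch of the fix: if the required memory capacity cannot be
// determined, return an empty result so the cluster remains as-is, instead
// of combining an absent memory capacity with the processor decider's
// result. The Capacity record and decide method are made up for this example.
public class UndeterminedMemorySketch {
    record Capacity(long memoryBytes, double processors) {}

    static Optional<Capacity> decide(Optional<Long> requiredMemory, double requiredProcessors) {
        if (requiredMemory.isEmpty()) {
            // Undetermined memory: do not request any scaling.
            return Optional.empty();
        }
        return Optional.of(new Capacity(requiredMemory.get(), requiredProcessors));
    }
}
```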
dimitris-athanasiou added a commit that referenced this pull request Sep 26, 2022
…0259)

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Sep 26, 2022
…astic#90259)

elasticsearchmachine pushed a commit that referenced this pull request Sep 26, 2022
…0259) (#90335)

Labels
>enhancement :ml Machine learning Team:ML Meta label for the ML team v8.5.0