[ML] Add processor autoscaling decider #89645
Conversation
Pinging @elastic/ml-core (Team:ML)
Hi @dimitris-athanasiou, I've created a changelog YAML for you.
...ugin/ml/src/main/java/org/elasticsearch/xpack/ml/autoscaling/MlMemoryAutoscalingDecider.java
...ugin/ml/src/main/java/org/elasticsearch/xpack/ml/autoscaling/MlMemoryAutoscalingDecider.java
I had a first look at this but stopped looking at the details after I realised there might be a problem with float vs rounded CPU quota. I think we need to make sure that's round-tripped consistently from Cloud to Elasticsearch via `node.processors`, and then from Elasticsearch back to Cloud via autoscaling responses, to avoid the risk of needless scale-ups.
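A rough sketch of the concern being raised here, with plain Java doubles standing in for the actual Cloud/Elasticsearch plumbing (the names and values are illustrative assumptions, not the real code):

```java
// Purely illustrative: how inconsistent rounding of a fractional CPU quota
// between Cloud and Elasticsearch could trigger a needless scale up.
public class CpuQuotaRoundTrip {
    public static void main(String[] args) {
        double cloudCpuQuota = 2.5; // hypothetical fractional quota passed via node.processors
        long reportedByEs = (long) Math.ceil(cloudCpuQuota); // 3, if the value is rounded up before reporting back

        // The orchestrator granted 2.5 but is told 3 are needed, so it scales up
        // even though no deployment actually requires more CPU.
        System.out.println("needless scale up? " + (reportedByEs > cloudCpuQuota)); // true
    }
}
```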
...ugin/ml/src/main/java/org/elasticsearch/xpack/ml/autoscaling/MlMemoryAutoscalingDecider.java
.../ml/src/main/java/org/elasticsearch/xpack/ml/autoscaling/MlProcessorAutoscalingCapacity.java
...gin/ml/src/main/java/org/elasticsearch/xpack/ml/autoscaling/MlAutoscalingDeciderService.java
...n/ml/src/main/java/org/elasticsearch/xpack/ml/autoscaling/MlProcessorAutoscalingDecider.java
@elasticmachine update branch
This adds a processor decider to the ML autoscaling decider service. This first implementation is simple and naive. It simply computes the required processor capacity to be the max trained model deployment `threads_per_allocation` for the node, and the sum of all processors required by trained model deployments for the tier.
0461e7d to 8dc4ac3
@droberts195 I have updated the PR to make use of the newly added …
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, although I think there is one thing that could be improved in the future. Please add a comment about that.
final MlProcessorAutoscalingCapacity requiredCapacity = computeRequiredCapacity(trainedModelAssignmentMetadata).build();
if (requiredCapacity.tierProcessors().roundUp() == currentCapacity.tierProcessors().roundUp()) {
This line means that if we have 2.5 processors and need 3, then we won't scale up.
Even if it's not fixed in this PR, I think it's worth adding a comment to call that out. Maybe we can fix it in a follow-up for 8.6.
If we get here, we are definitely not in a scenario where we want to scale up: we know all our model deployments are satisfied. The reason I added that check is to provide a nicer message explaining why we neither downscale nor upscale when we're right where we need to be.
> We know all our model deployments are satisfied.

But the question is whether 2.5 processors should "satisfy" a model that wanted 3. In `TrainedModelAssignmentRebalancer` we call `getNodeAllocatedProcessors(discoveryNode).roundUp()`, so it will happily allocate a model that wants 3 processors to a node that has 2.5. I think it's a follow-up for the future to consider whether we should scale up in that scenario. Maybe the code should be changed in `TrainedModelAssignmentRebalancer` rather than here, though.
Anyway, I'm happy to leave it as-is for the first version.
Right, I see the point now. OK, I'll merge as-is and revisit later.
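A minimal sketch of the rounding behaviour discussed in this thread, using plain doubles rather than the real `Processors` and autoscaling types (all names here are illustrative):

```java
// A minimal sketch of the rounding question above (plain doubles, illustrative names).
public class RoundUpComparison {
    public static void main(String[] args) {
        double currentTierProcessors = 2.5;  // what the tier currently has
        double requiredTierProcessors = 3.0; // what the deployments ask for

        // Comparing rounded-up values hides the 0.5 processor shortfall: ceil(3.0) == ceil(2.5).
        boolean satisfiedWhenRounded = Math.ceil(requiredTierProcessors) == Math.ceil(currentTierProcessors);

        // Comparing the raw values would instead report a shortfall.
        boolean satisfiedExactly = currentTierProcessors >= requiredTierProcessors;

        System.out.println(satisfiedWhenRounded + " vs " + satisfiedExactly); // true vs false
    }
}
```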
* main: (175 commits)
  Fix integration test on Windows (elastic#89894)
  Avoiding the use of dynamic map keys in the cluster_formation results of the stable master health indicator (elastic#89842)
  Mute org.elasticsearch.tracing.apm.ApmIT.testCapturesTracesForHttpTraffic (elastic#89891)
  Fix typos in audit event types (elastic#89886)
  Synthetic _source: support histogram field (elastic#89833)
  [TSDB] Rename rollup public API to downsample (elastic#89809)
  Format script values access (elastic#89780)
  [DOCS] Simplifies composite aggregation recommendation (elastic#89878)
  [DOCS] Update CCS compatibility matrix for 8.3 (elastic#88906)
  Fix memory leak when double invoking RestChannel.sendResponse (elastic#89873)
  [ML] Add processor autoscaling decider (elastic#89645)
  Update disk-usage.asciidoc (elastic#89709) (elastic#89874)
  Add allocation deciders in createComponents (elastic#89836)
  Mute flaky H3LatLonGeometryTest.testIndexPoints (elastic#89870)
  Fix typo in get-snapshot-status-api doc (elastic#89865)
  Picking master eligible node at random in the master stability health indicator (elastic#89841)
  Do not reuse the client after a disruption elastic#89815 (elastic#89866)
  [ML] Distribute trained model allocations across availability zones (elastic#89822)
  Increment clientCalledCount before onResponse (elastic#89858)
  AwaitsFix for elastic#89867
  ...
…9896) In #89645 I switched the ml node attribute for allocated processors to write a string representation of a double value. However, that means the attribute value cannot be parsed in a mixed-version cluster when the node that tries to read it is on a version before 8.5.0. This commit addresses this problem by introducing a new attribute that has a string representation of the double value, and keeps writing the older version of the attribute as an integer. It also refactors the logic of parsing the attribute value for a given node into a single place, as the logic now involves BWC and the duplication was adding complexity.
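A hedged sketch of the dual-attribute approach this commit message describes; the attribute names, helper methods, and rounding choice below are illustrative stand-ins rather than the exact Elasticsearch identifiers:

```java
import java.util.Map;

// Illustrative BWC sketch: write both the old integer-valued attribute and a new
// double-valued attribute, and read them back in a single place.
public class MlProcessorsAttributeSketch {
    static Map<String, String> nodeAttributes(double allocatedProcessors) {
        return Map.of(
            // old attribute: integer string that pre-8.5.0 nodes can still parse
            // (the rounding choice here is illustrative only)
            "ml.allocated_processors", Integer.toString((int) allocatedProcessors),
            // new attribute: string representation of the double, read by 8.5.0+ nodes
            "ml.allocated_processors_double", Double.toString(allocatedProcessors)
        );
    }

    // Single place that handles BWC when reading the attribute back from a node.
    static double readAllocatedProcessors(Map<String, String> nodeAttrs) {
        String value = nodeAttrs.getOrDefault("ml.allocated_processors_double",
            nodeAttrs.get("ml.allocated_processors"));
        return value == null ? 0.0 : Double.parseDouble(value);
    }

    public static void main(String[] args) {
        Map<String, String> attrs = nodeAttributes(2.5);
        System.out.println(attrs + " -> " + readAllocatedProcessors(attrs));
    }
}
```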
…0259) (#90335) When a processor autoscaling decider was added in #89645, an unwanted change of behaviour sneaked in. In particular, if we cannot determine the required memory capacity, we previously returned `new AutoscalingDeciderResult(null)`, whereas we now return an autoscaling result with no memory capacity and whatever the result of the processor decider is. Previously, when we returned a result with null capacity, the cluster would remain as-is; now it is possible to cause unwanted scaling. This commit fixes this by checking whether the memory decider result was undetermined and returning an empty result if so. Also, warning logs have been added to surface scenarios that should not happen, such as the memory tracker not being called by the master node or having no memory estimate for anomaly detection or analytics jobs.
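A simplified, self-contained illustration of the guard this commit describes; the `Capacity` record and class below are stand-ins, not the real Elasticsearch autoscaling types:

```java
import java.util.Optional;

// Illustrative guard: an undetermined memory result short-circuits to an "empty"
// decision instead of being combined with the processor decider's capacity.
public class UndeterminedMemoryGuard {
    record Capacity(Optional<Long> memoryBytes, Optional<Double> processors) {
        static final Capacity EMPTY = new Capacity(Optional.empty(), Optional.empty());
        boolean isUndetermined() { return memoryBytes.isEmpty() && processors.isEmpty(); }
    }

    static Capacity combine(Capacity memoryResult, Capacity processorResult) {
        if (memoryResult.isUndetermined()) {
            // Don't pair an unknown memory requirement with the processor decider's
            // capacity; return an empty result so the cluster stays as-is.
            return Capacity.EMPTY;
        }
        return new Capacity(memoryResult.memoryBytes(), processorResult.processors());
    }

    public static void main(String[] args) {
        Capacity memory = Capacity.EMPTY; // memory capacity could not be determined
        Capacity processors = new Capacity(Optional.empty(), Optional.of(8.0));
        System.out.println(combine(memory, processors).isUndetermined()); // true
    }
}
```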
This adds a processor decider to the ML autoscaling decider service. This first implementation is simple and naive. It simply computes the required processor capacity to be the max trained model deployment `threads_per_allocation` for the node, and the sum of all processors required by trained model deployments for the tier.
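A minimal sketch of the naive computation described above; the `Deployment` record is a stand-in for the real trained model assignment classes, and treating tier processors as allocations times `threads_per_allocation` summed over deployments is an interpretation of the description, not a quote of the actual code:

```java
import java.util.List;

// Illustrative computation of the two required processor capacities.
public class ProcessorCapacitySketch {
    record Deployment(int numberOfAllocations, int threadsPerAllocation) {}

    // Node capacity: the largest threads_per_allocation of any single deployment.
    static int requiredNodeProcessors(List<Deployment> deployments) {
        return deployments.stream().mapToInt(Deployment::threadsPerAllocation).max().orElse(0);
    }

    // Tier capacity: the total processors all deployments need.
    static int requiredTierProcessors(List<Deployment> deployments) {
        return deployments.stream()
            .mapToInt(d -> d.numberOfAllocations() * d.threadsPerAllocation())
            .sum();
    }

    public static void main(String[] args) {
        List<Deployment> deployments = List.of(new Deployment(2, 4), new Deployment(1, 2));
        System.out.println(requiredNodeProcessors(deployments) + " / " + requiredTierProcessors(deployments)); // 4 / 10
    }
}
```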