
[ML] Add processor autoscaling decider #89645


Conversation

dimitris-athanasiou
Contributor

This adds a processor decider to the ML autoscaling decider service.
This first implementation is deliberately simple and naive. It computes
the required node processor capacity as the maximum trained model
deployment `threads_per_allocation`, and the required tier capacity as
the sum of all processors required by trained model deployments.

@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Aug 26, 2022
@elasticsearchmachine
Collaborator

Hi @dimitris-athanasiou, I've created a changelog YAML for you.

Contributor

@droberts195 droberts195 left a comment


I had a first look at this but stopped looking at the details after I realised there might be a problem with float vs rounded CPU quota. I think we need to make sure that value is round-tripped consistently, from Cloud to Elasticsearch via `node.processors` and then from Elasticsearch back to Cloud via autoscaling responses, to avoid the risk of needless scale-ups.
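A minimal sketch of the concern, assuming the quota arrives as a fractional value and is rounded up before being reported back:

```java
// Sketch of the float vs rounded CPU quota concern. If Cloud allocates a
// fractional quota (e.g. 2.5 via node.processors) and Elasticsearch reports
// a rounded-up value back in its autoscaling response, Cloud could see a
// requirement larger than what it allocated and scale up needlessly.
public class CpuRoundTripSketch {
    // Rounding up loses the fractional part of the quota.
    static int reportRounded(double nodeProcessors) {
        return (int) Math.ceil(nodeProcessors);
    }

    public static void main(String[] args) {
        double cloudQuota = 2.5;                   // what Cloud allocated
        int reported = reportRounded(cloudQuota);  // what gets reported back: 3
        System.out.println(reported > cloudQuota); // true -> apparent need to scale up
    }
}
```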

@dimitris-athanasiou
Contributor Author

@elasticmachine update branch

This adds a processor decider to the ML autoscaling decider service.
This first implementation is simple and naive. It simply computes
the required processor capacity to be the max trained model deployment
`threads_per_allocation` for the node, and the sum of all processors
required by trained model deployments for the tier.
@dimitris-athanasiou dimitris-athanasiou force-pushed the processor-autoscaling-decider branch from 0461e7d to 8dc4ac3 Compare September 7, 2022 11:48
@dimitris-athanasiou
Contributor Author

@droberts195 I have updated the PR to make use of the newly added Processors class. We now store the ML node attribute as a double, so we can return the exact value when reporting current capacity.

Contributor

@droberts195 droberts195 left a comment


LGTM, although I think there is one thing that could be improved in the future. Please add a comment about that.


final MlProcessorAutoscalingCapacity requiredCapacity = computeRequiredCapacity(trainedModelAssignmentMetadata).build();

if (requiredCapacity.tierProcessors().roundUp() == currentCapacity.tierProcessors().roundUp()) {
Contributor


This line means that if we have 2.5 processors and need 3 then we won't scale up.

Even if it's not fixed in this PR, I think it's worth adding a comment to call that out. Maybe we can fix it in a follow-up for 8.6.
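The rounding behaviour behind this observation can be illustrated with a small sketch (the `roundUp` helper here stands in for the method on the `Processors` class):

```java
// Sketch of why the rounded-up equality check does not trigger a scale up
// from 2.5 to 3 processors: both values round up to the same integer.
public class RoundUpComparisonSketch {
    static int roundUp(double processors) {
        return (int) Math.ceil(processors);
    }

    public static void main(String[] args) {
        double current = 2.5;  // current tier processors
        double required = 3.0; // required tier processors
        // Both round up to 3, so the equality check treats the capacities as
        // equal and no scale up is requested.
        System.out.println(roundUp(current) == roundUp(required)); // true
    }
}
```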

Contributor Author


If we get here we are definitely not in a scenario where we want to scale up: we know all our model deployments are satisfied. The reason I added that check is to provide a nicer message explaining why we neither scale down nor scale up when we are exactly at the capacity we need.

Contributor


We know all our model deployments are satisfied.

But the question is whether 2.5 processors should "satisfy" a model that wanted 3. In TrainedModelAssignmentRebalancer we call getNodeAllocatedProcessors(discoveryNode).roundUp(), so it will happily allocate a model that wants 3 processors to a node that has 2.5. I think it's a follow-up for the future to consider whether we should scale up in that scenario. Maybe the code should be changed in TrainedModelAssignmentRebalancer rather than here, though.

Anyway, I'm happy to leave it as-is for the first version.

Contributor Author


Right, I see the point now. OK, I'll merge as-is and revisit later.

@dimitris-athanasiou dimitris-athanasiou merged commit 407dc18 into elastic:main Sep 7, 2022
@dimitris-athanasiou dimitris-athanasiou deleted the processor-autoscaling-decider branch September 7, 2022 15:03
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Sep 7, 2022
In elastic#89645 I switched the ml node attribute for allocated processors
to write a string representation of a double value. However, that
means the attribute value cannot be parsed in a mixed-version cluster
when the node that tries to read it is on a version before 8.5.0.

This commit addresses the problem by introducing a new attribute
that holds the string representation of the double value, while
keeping the older attribute written as an integer.

It also consolidates the logic for parsing the attribute value from
a given node into a single place, as the logic now involves BWC and
the duplication was adding complexity.
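A rough sketch of the BWC fallback described in the commit message (the attribute names and helper are hypothetical, not necessarily what the actual commit uses):

```java
import java.util.Map;

// Hypothetical sketch of the BWC parsing logic: prefer a newer attribute
// holding the double's string representation, and fall back to the older
// integer-valued attribute for pre-8.5.0 nodes. Attribute names are
// illustrative only.
public class ProcessorAttributeSketch {
    static double allocatedProcessors(Map<String, String> nodeAttributes) {
        String asDouble = nodeAttributes.get("ml.allocated_processors_double");
        if (asDouble != null) {
            return Double.parseDouble(asDouble);
        }
        String asInt = nodeAttributes.get("ml.allocated_processors");
        return asInt == null ? 0.0 : Integer.parseInt(asInt);
    }
}
```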
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 8, 2022
* main: (175 commits)
  Fix integration test on Windows (elastic#89894)
  Avoiding the use of dynamic map keys in the cluster_formation results of the stable master health indicator (elastic#89842)
  Mute org.elasticsearch.tracing.apm.ApmIT.testCapturesTracesForHttpTraffic (elastic#89891)
  Fix typos in audit event types (elastic#89886)
  Synthetic _source: support histogram field (elastic#89833)
  [TSDB] Rename rollup public API to downsample  (elastic#89809)
  Format script values access (elastic#89780)
  [DOCS] Simplifies composite aggregation recommendation (elastic#89878)
  [DOCS] Update CCS compatibility matrix for 8.3 (elastic#88906)
  Fix memory leak when double invoking RestChannel.sendResponse (elastic#89873)
  [ML] Add processor autoscaling decider (elastic#89645)
  Update disk-usage.asciidoc (elastic#89709) (elastic#89874)
  Add allocation deciders in createComponents (elastic#89836)
  Mute flaky H3LatLonGeometryTest.testIndexPoints (elastic#89870)
  Fix typo in get-snapshot-status-api doc (elastic#89865)
  Picking master eligible node at random in the master stability health indicator (elastic#89841)
  Do not reuse the client after a disruption elastic#89815 (elastic#89866)
  [ML] Distribute trained model allocations across availability zones (elastic#89822)
  Increment clientCalledCount before onResponse (elastic#89858)
  AwaitsFix for elastic#89867
  ...
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 8, 2022
# Conflicts:
#	x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/downsample/RollupShardIndexer.java
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 8, 2022
droberts195 pushed a commit that referenced this pull request Sep 8, 2022
…9896)

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Sep 23, 2022
When the processor autoscaling decider was added in elastic#89645, an unwanted
change of behaviour sneaked in. In particular, if we cannot determine
the required memory capacity, we previously returned `new AutoscalingDeciderResult(null)`,
whereas we now return an autoscaling result with no memory capacity
and whatever the processor decider produced.

Previously, if we returned a result with null capacity, the cluster
would remain as-is. Now, it is possible to cause unwanted scaling.

This commit fixes this by checking whether the memory decider result
was undetermined and returning an empty result if so.

Also, warning logs have been added to surface scenarios that should
not happen, such as when the memory tracker is not called by the
master node or has no memory estimate for anomaly detection or
analytics jobs.
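The fix described above could be sketched like this (types and names are illustrative, not the actual decider API):

```java
import java.util.Optional;

// Illustrative sketch of the fix: if the required memory capacity cannot be
// determined, return an empty result so the cluster remains as-is, instead
// of combining an absent memory capacity with the processor decider's
// result. The Capacity record and decide method are made up for this example.
public class UndeterminedMemorySketch {
    record Capacity(long memoryBytes, double processors) {}

    static Optional<Capacity> decide(Optional<Long> requiredMemory, double requiredProcessors) {
        if (requiredMemory.isEmpty()) {
            // Undetermined memory: do not request any scaling.
            return Optional.empty();
        }
        return Optional.of(new Capacity(requiredMemory.get(), requiredProcessors));
    }
}
```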
dimitris-athanasiou added a commit that referenced this pull request Sep 26, 2022
…0259)

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Sep 26, 2022
…astic#90259)

elasticsearchmachine pushed a commit that referenced this pull request Sep 26, 2022
…0259) (#90335)

Labels
>enhancement :ml Machine learning Team:ML Meta label for the ML team v8.5.0