[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations #74691


Conversation

benwtrent
Member

@benwtrent benwtrent commented Jun 29, 2021

This commit addresses two problems:

  • Our memory estimations are not very exact. Consequently, it is possible to request too much or too little by a handful of KBs. While this is not a large issue in ESS, it may be for custom tier sizes.
  • When scaling down, it was possible that part of the scale down was actually a scale up! This was due to floating point rounding errors and poor estimations. Even though our estimations are now better, it is best to NOT request higher resources in a scale down, no matter what.

One of the ways we improve the calculation is in the JVM size calculation. Instead of the knot point being 2gb, it has been changed to 1.2gb. This accounts for the "window of uncertainty" in JVM sizes.
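
For illustration, the node-size-to-JVM-size mapping behaves roughly like the sketch below. This is not the real dynamicallyCalculateJvmSizeFromNodeSize: the 40% factor below the knot point comes from the diff in this PR, while the 25% factor and the 2GB ceiling above it are inferred from the test values discussed later in this thread, so treat the exact constants as assumptions.

```java
// Sketch of the node-size -> JVM-size mapping with the new 1280MB knot point.
// Constants above the knot point (25% factor, 2GB ceiling) are assumptions
// inferred from the test tuples below, not the real implementation.
public final class JvmSizeSketch {

    private static final long MB = 1024L * 1024L;
    private static final long GB = 1024L * MB;
    private static final long KNOT_POINT = 1280 * MB; // was 2GB before this change

    static long jvmSizeFromNodeSize(long nodeSizeBytes) {
        if (nodeSizeBytes < KNOT_POINT) {
            return (long) (nodeSizeBytes * 0.40);
        }
        return Math.min((long) (nodeSizeBytes * 0.25), 2 * GB);
    }

    public static void main(String[] args) {
        // Node sizes echo the (node size, JVM size) tuples from the test below.
        for (long gb : new long[] { 2, 4, 8, 16, 32, 64 }) {
            System.out.printf("%2dGB node -> %4dMB JVM%n", gb, jvmSizeFromNodeSize(gb * GB) / MB);
        }
    }
}
```

Because the knot point now sits well below 2gb, a 2gb container that reports slightly less than 2gb still resolves to the same branch as an exact 2gb node.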

closes: #74709

@benwtrent benwtrent added :ml Machine learning >bug labels Jun 29, 2021
@benwtrent benwtrent force-pushed the feature/ml-fix-autoscaling-node-size-calculation branch from 87e8c94 to 6e35a2b Compare June 29, 2021 15:07
@benwtrent benwtrent marked this pull request as ready for review June 29, 2021 15:50
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jun 29, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent benwtrent changed the title [ML] improve memory calculation estimation for scaling decisions [ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations Jun 29, 2021
Contributor

@droberts195 droberts195 left a comment

I left a few comments on specific lines.

A more general comment is that I am not sure how much effect all the changes of ceil to round and the switching between integer and floating point arithmetic at different steps of the calculations really have. I would expect those changes to only make a difference of a few bytes here and there. The biggest change by far seems to be the 2GB -> 1280MB change to the cutoff for choosing which of two possible values to use for our ill-defined inverse function. If this is correct then it would be good to change the PR description to reflect this.

@@ -139,8 +129,8 @@ private static long allowedBytesForMl(long machineMemory, Long jvmSize, int maxM
// 2GB node -> 66%
// 16GB node -> 87%
// 64GB node -> 90%
-long memoryPercent = Math.min(90, (int)Math.ceil(((machineMemory - jvmSize - OS_OVERHEAD) / (double)machineMemory) * 100.0D));
-return (long)(machineMemory * (memoryPercent / 100.0));
+double memoryPercent = Math.min(0.90, (machineMemory - jvmSize - OS_OVERHEAD) / (double)machineMemory);
Contributor

Suggested change
double memoryPercent = Math.min(0.90, (machineMemory - jvmSize - OS_OVERHEAD) / (double)machineMemory);
double memoryProportion = Math.min(0.90, (machineMemory - jvmSize - OS_OVERHEAD) / (double)machineMemory);
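
To make the size of the rounding effect concrete, here is a small standalone comparison of the two calculations in the hunk above for a 2GB node. The 200MB OS_OVERHEAD and 512MB JVM size are assumptions chosen to reproduce the "2GB node -> 66%" comment, and the new return statement is assumed since the diff only shows the proportion line.

```java
// Standalone comparison of the old integer-percent calculation and the new
// double-proportion calculation for a 2GB node. OS_OVERHEAD = 200MB and
// jvmSize = 512MB are assumptions that reproduce the "2GB node -> 66%" comment.
public final class AllowedBytesComparison {

    private static final long MB = 1024L * 1024L;
    private static final long OS_OVERHEAD = 200 * MB; // assumed value

    static long oldAllowedBytes(long machineMemory, long jvmSize) {
        long memoryPercent = Math.min(90, (int) Math.ceil(((machineMemory - jvmSize - OS_OVERHEAD) / (double) machineMemory) * 100.0D));
        return (long) (machineMemory * (memoryPercent / 100.0));
    }

    static long newAllowedBytes(long machineMemory, long jvmSize) {
        double memoryProportion = Math.min(0.90, (machineMemory - jvmSize - OS_OVERHEAD) / (double) machineMemory);
        return (long) (machineMemory * memoryProportion); // return statement assumed
    }

    public static void main(String[] args) {
        long machineMemory = 2048 * MB;
        long jvmSize = 512 * MB;
        long oldBytes = oldAllowedBytes(machineMemory, jvmSize);
        long newBytes = newAllowedBytes(machineMemory, jvmSize);
        // Ceiling to a whole percent overstates the allowance by roughly 16MB on a
        // 2GB node, which is more than a few bytes for custom tier sizes.
        System.out.printf("old: %d bytes, new: %d bytes, difference: %d bytes%n",
            oldBytes, newBytes, oldBytes - newBytes);
    }
}
```

On small nodes, then, the integer-percent rounding can matter by megabytes rather than bytes.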

public static long dynamicallyCalculateJvmSizeFromNodeSize(long nodeSize) {
-    if (nodeSize < ByteSizeValue.ofGb(2).getBytes()) {
+    if (nodeSize < ByteSizeValue.ofMb(1280).getBytes()) {
         return (long) (nodeSize * 0.40);
Contributor

It would be good to add a comment here that the reason this is different to the value used in the JVM heap size formula is that that formula does not have a well-defined inverse, so we pick a value here that makes the inverse pick the correct value for some node sizes we know Cloud uses, in particular 2GB.
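
As a rough illustration of why the cutoff sits below 2GB (reusing the assumed 40%/25% factors from the earlier sketch, not the real implementation): a container provisioned as 2GB that reports slightly less than 2GB only lands on the intended branch if the cutoff is below the reported size.

```java
// Rough illustration of the discontinuity: with the old 2GB cutoff, a "2GB"
// container reporting slightly less falls onto the 40% branch; with the 1280MB
// cutoff it stays on the branch that matches what Cloud actually allocates.
// The 40% / 25% factors and the 2GB ceiling are assumptions, as in the earlier sketch.
public final class KnotPointDemo {

    private static final long MB = 1024L * 1024L;

    static long jvmSize(long nodeSizeBytes, long cutoffBytes) {
        return nodeSizeBytes < cutoffBytes
            ? (long) (nodeSizeBytes * 0.40)
            : Math.min((long) (nodeSizeBytes * 0.25), 2048 * MB);
    }

    public static void main(String[] args) {
        long reported = 2047 * MB; // a "2GB" container reporting 1MB less
        System.out.println("2GB cutoff:    " + jvmSize(reported, 2048 * MB) / MB + "MB"); // ~818MB, far from the true ~512MB
        System.out.println("1280MB cutoff: " + jvmSize(reported, 1280 * MB) / MB + "MB"); // ~511MB, close to the true ~512MB
    }
}
```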

Comment on lines 58 to 64
Tuple.tuple(1073741824L, 432013312L), // 1GB and true JVM size
Tuple.tuple(2147483648L, 536870912L), // 2GB ...
Tuple.tuple(4294967296L, 1073741824L), // 4GB ...
Tuple.tuple(8589934592L, 2147483648L), // 8GB ...
Tuple.tuple(17179869184L, 2147483648L), // 16GB ...
Tuple.tuple(34359738368L, 2147483648L), // 32GB ...
Tuple.tuple(68719476736L, 2147483648L) // 64GB ...
Contributor

Does this test still pass if you add in some values that are close to the round multiples, for example 0.9GB, 1.1GB, 1.9GB and 2.1GB? If it doesn't then we need to check with the Cloud team how exact their node sizes are across different Cloud providers.

Also, 58GB is a node size that gets used with one of the IaaS providers I believe.

Member Author

@droberts195 I will add those conditions!

For these, I went through every node size in GCP and looked at the native size and JVM size. I can work on doing that with some other providers.
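
A rough sketch of the kind of perturbation check being discussed, in plain Java rather than the project's test framework. The ±10MB wiggle matches what is described further down in this thread, and the expected values and mapping constants are assumptions taken from the tuples above.

```java
// Sketch of a perturbation check: nudge each known node size by +/-10MB and
// verify the derived JVM size stays near the expected value. The mapping and
// the 3MB tolerance are assumptions, not the project's real test.
public final class NodeSizePerturbationCheck {

    private static final long MB = 1024L * 1024L;
    private static final long GB = 1024L * MB;

    static long jvmSizeFromNodeSize(long nodeSizeBytes) {
        // Same assumed mapping as the earlier sketch: 40% below 1280MB, else 25% capped at 2GB.
        if (nodeSizeBytes < 1280 * MB) {
            return (long) (nodeSizeBytes * 0.40);
        }
        return Math.min((long) (nodeSizeBytes * 0.25), 2 * GB);
    }

    public static void main(String[] args) {
        long[][] cases = { // { node size, expected JVM size }, taken from the tuples above
            { 2 * GB, 512 * MB },
            { 4 * GB, 1 * GB },
            { 8 * GB, 2 * GB },
            { 16 * GB, 2 * GB },
            { 64 * GB, 2 * GB },
        };
        long wiggle = 10 * MB;
        for (long[] c : cases) {
            for (long delta : new long[] { -wiggle, 0, wiggle }) {
                long jvm = jvmSizeFromNodeSize(c[0] + delta);
                // Perturbing the node size shifts the percentage-based result by at most a few MB.
                if (Math.abs(jvm - c[1]) > 3 * MB) {
                    throw new AssertionError("node " + (c[0] + delta) + " -> jvm " + jvm + ", expected ~" + c[1]);
                }
            }
        }
        System.out.println("All perturbed node sizes map to the expected JVM size (within 3MB).");
    }
}
```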

@@ -586,6 +596,19 @@ public AutoscalingDeciderResult scale(Settings configuration, AutoscalingDecider
.build()));
}

static AutoscalingCapacity ensureScaleDown(AutoscalingCapacity scaleDownResult, AutoscalingCapacity currentCapacity) {
Contributor

The fact that this method is needed suggests there are bugs in the other calculations. It's good to have it as a safety net, but I think it would be good if it could log a warning if it changes the answer by more than the few bytes that could be attributed to rounding errors.

Member Author

> but I think it would be good if it could log a warning if it changes the answer by more than the few bytes that could be attributed to rounding errors.

100%, I will add a log line.

I do think we need this method, as this has been an issue in a couple of known scenarios and it can be exceptionally frustrating to troubleshoot.
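
A minimal sketch of what such a safety net plus warning could look like. The Capacity record here is a hypothetical stand-in for the real AutoscalingCapacity type, and System.err stands in for the decider's logger, so this shows the shape of the idea rather than the actual implementation.

```java
// Minimal sketch of an ensureScaleDown safety net that never returns more than the
// current capacity and warns when the correction exceeds plausible rounding error.
// "Capacity" is a hypothetical stand-in for AutoscalingCapacity; the real code differs.
public final class EnsureScaleDownSketch {

    // Hypothetical, simplified capacity: total tier memory and per-node memory, in bytes.
    record Capacity(long tierBytes, long nodeBytes) { }

    private static final long ROUNDING_SLOP_BYTES = 1024; // anything bigger is suspicious

    static Capacity ensureScaleDown(Capacity scaleDownResult, Capacity currentCapacity) {
        // Never ask for more than we currently have when the decision is a scale down.
        Capacity corrected = new Capacity(
            Math.min(scaleDownResult.tierBytes(), currentCapacity.tierBytes()),
            Math.min(scaleDownResult.nodeBytes(), currentCapacity.nodeBytes())
        );
        long tierDelta = scaleDownResult.tierBytes() - corrected.tierBytes();
        long nodeDelta = scaleDownResult.nodeBytes() - corrected.nodeBytes();
        if (tierDelta > ROUNDING_SLOP_BYTES || nodeDelta > ROUNDING_SLOP_BYTES) {
            // More than rounding error was shaved off: the upstream calculation asked for
            // extra resources during a scale down, which points at a bug worth investigating.
            System.err.printf("scale down requested more than current capacity; corrected by tier=%d bytes, node=%d bytes%n",
                tierDelta, nodeDelta);
        }
        return corrected;
    }

    public static void main(String[] args) {
        Capacity current = new Capacity(8L << 30, 4L << 30);                  // 8GB tier, 4GB node
        Capacity requested = new Capacity((8L << 30) + (50 << 20), 4L << 30); // asks for 50MB too much
        System.out.println(ensureScaleDown(requested, current));
    }
}
```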

@mark-vieira
Contributor

@elasticmachine retest this please

Contributor

@droberts195 droberts195 left a comment

I still have one question, which is about how sensitive we are to a 2GB container reporting exactly 2GB available within the container.

If it turned out that a container provisioned as a 2GB container actually only has, say, 1.999GB memory within the container then we have a problem with the discontinuity in our JVM heap size function occurring at the 2GB mark.

If our code no longer goes haywire when the machine memory is close to, but not exactly, 1GB or 2GB, then I am happy to merge this now.

But if there's still sensitivity just above or below 1GB or 2GB then we should make extra tweaks to move the sensitivity into an area that Cloud isn't near at the moment.

Comment on lines +59 to +60
Tuple.tuple(1073741824L, 432013312L), // 1GB and true JVM size
Tuple.tuple(2147483648L, 536870912L), // 2GB ...
Contributor

Does this test fail for any of 0.9GB, 1.1GB, 1.9GB and 2.1GB? If it does then we may have a problem that will hit us in production.

Member Author

@droberts195 please see lines: https://github.com/elastic/elasticsearch/pull/74691/files#diff-633121fb7250eeda7eb60908a4d89e0d0007a60c2e01fe452c6e962933208822R72

I add and subtract 10MB from every node value and make sure things still pass.

Additionally, we never allow things to be below 1GB. When autoscaling, that is the smallest size we will ever ask for.

Member Author

@droberts195 I will additionally add a clause where the trueJvm isn't provided for either the native bytes size or the inverse check. This way we use the dynamic jvm calculation.

Everything still checks out fine.

Contributor

> please see lines: https://github.com/elastic/elasticsearch/pull/74691/files#diff-633121fb7250eeda7eb60908a4d89e0d0007a60c2e01fe452c6e962933208822R72

Sorry, I didn't look closely enough.

> Additionally, we never allow things to be below 1GB. When autoscaling, that is the smallest size we will ever ask for.

The situation I am thinking of is where Docker reports a container as 1GB but the programs running inside it think the machine they're running on has 0.9999GB. But it sounds like this is covered.

This PR LGTM to merge then 🚀

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent
Member Author

run elasticsearch-ci/part-2

@benwtrent benwtrent merged commit f9991e6 into elastic:master Jun 30, 2021
@benwtrent benwtrent deleted the feature/ml-fix-autoscaling-node-size-calculation branch June 30, 2021 17:49
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Jun 30, 2021
… and improve scaling size estimations (elastic#74691)

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Jun 30, 2021
… and improve scaling size estimations (elastic#74691)

benwtrent added a commit that referenced this pull request Jul 1, 2021
…g down and improve scaling size estimations (#74691) (#74780)

benwtrent added a commit that referenced this pull request Jul 1, 2021
…ng down and improve scaling size estimations (#74691) (#74781)

benwtrent added a commit that referenced this pull request Jul 1, 2021
…ng down and improve scaling size estimations (#74691) (#74782)

benwtrent added a commit that referenced this pull request Jul 1, 2021
…handles that (#74822)

context commit may be null. This should only really happen early in a cluster's life cycle or if a node was just recently brought online, mainly because the current node sizes have not yet been discovered and cached.

This change should really have been part of #74691
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team v7.15.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

ML is causing a scale up when its actually requesting for a scale down