
OWLS-86407 - Fixed the logic to start clustered managed server pods when admin server pod not running #2093


Merged: 8 commits merged into develop on Dec 11, 2020

Conversation

@ankedia (Member) commented Dec 9, 2020

Fixed the logic to start clustered managed server pods when the admin server pod is not running. The previous logic assumed that the admin server pod was scheduled and ready before starting managed server pods. Added a unit test for the new behavior.

Integration test results at - https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/3305

@@ -249,6 +248,19 @@ public void whenConcurrencyLimitIs1_secondClusteredServerDoesNotStartIfFirstIsNo
assertThat(getStartedManagedServers(), hasSize(1));
}

Member:

Please get in the habit of writing your unit tests before you make production code changes. It not only guides what you then implement, it also leads to writing better tests.

Start with the name. It should explain what is going on, and this doesn't, as it is simply not the case that we always want to start a managed server pod when there is no admin server. The case, based on the JIRA, is that if we scale up a cluster, we will start a managed server even if the admin server is not running, and the test should reflect that. So maybe something like 'whileAdminServerStopped_canStartManagedServer' would be better.

In the body of the test, the first call is to createDomainPresenceInfoWithServers(). It is really hard to tell what that is doing here. In the current test class, it was called exactly once, passing the name of the admin server to indicate that it was already running. Remembering Robert Martin's admonition on names, I would probably create a new method that does what this call does but is named createDomainPresenceInfoWithNoAdminServer. Perhaps we then need another call named createDomainPresenceInfoWithAdminServer to make it clearer. Since there are no invocations that actually pass in more servers, that should suffice.
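A minimal sketch of what the renamed test might look like, using the names suggested in this thread (whileAdminServerStopped_canStartManagedServer, createDomainPresenceInfoWithNoAdminServer) together with the getStartedManagedServers()/hasSize assertion already shown above; the cluster name and the scaleUpCluster helper are assumptions for illustration, not code from the repository:

// Hypothetical test sketch; only the names discussed in this review are taken from the thread.
@Test
public void whileAdminServerStopped_canStartManagedServer() {
  createDomainPresenceInfoWithNoAdminServer();          // admin server pod is not running
  scaleUpCluster("cluster1", 1);                        // hypothetical helper: request one more clustered server

  assertThat(getStartedManagedServers(), hasSize(1));   // the managed server starts without the admin server
}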

Member Author:

Changed the unit test method name and added new method createDomainPresenceInfoWithNoAdminServer as suggested.

assertThat(info.getNumScheduledServers("cluster1"), equalTo(2L));
assertThat(info.getNumScheduledServers("cluster2"), equalTo(2L));
assertThat(info.getNumScheduledServers(null), equalTo(1L));
assertThat(info.getNumScheduledManagedServers("cluster1", "admin-server"), equalTo(2L));
Member:

This is really unclear. The old call seemed to indicate that we were getting the number of managed servers in the specified cluster. Why do we also pass in "admin-server"? The interface is problematic.

Member Author:

The previous call returned both the admin server (if scheduled) and the scheduled managed servers. I was passing in the admin server name so that it could be filtered out of the list and only scheduled managed servers would be returned. My latest change adds a new method, so the original unit test stays the same.
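A rough sketch of the filtering described here; getNumScheduledManagedServers and its two-argument signature come from the test excerpt above, while the stream pipeline and the other helper names are assumptions about DomainPresenceInfo internals, not the actual implementation:

// Illustrative only: counts scheduled pods in the cluster, excluding the admin server.
long getNumScheduledManagedServers(String clusterName, String adminServerName) {
  return getServerNamesInCluster(clusterName).stream()        // hypothetical helper returning the cluster's server names
      .filter(this::isServerPodScheduled)                      // hypothetical helper: pod has been assigned a node
      .filter(serverName -> !serverName.equals(adminServerName))
      .count();
}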

Member:

So the problem is that DomainPresenceInfo doesn't know which server is the admin server. I think we should address that, but not right now, perhaps, as I don't see an easy and obvious solution.

}

private boolean canStartConcurrently(long numReady) {
- return ((this.maxConcurrency > 0) && (numStarted.get() < (this.maxConcurrency + numReady - 1)))
+ return ((this.maxConcurrency > 0) && (numStarted.get() < (this.maxConcurrency + numReady)))
Member:

This would be a good time to do some refactoring. The expression is not so easy to read and understand.

I would start by extracting maxConcurrency == 0 into a method named 'ignoreConcurrencyLimits' and list that first in the expression, not last. Then you don't really need this.maxConcurrency > 0 as that is implied (unless there is some support for negative values).

The remaining part is still too complex and not obvious. So extract numStarted.get() into its own method, getNumServersStarted, as there is no reason for all of the callers to know that there is an Atomic backing the value. Then can you explain what the rest of the logic is?
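A sketch of the extraction being suggested here, assuming the maxConcurrency and numStarted fields visible in the diff above; the exact signatures are illustrative, not taken from the final code:

private boolean ignoreConcurrencyLimits() {
  return maxConcurrency == 0;
}

private int getNumServersStarted() {
  return numStarted.get();   // callers no longer need to know an AtomicInteger backs this value
}

private boolean canStartConcurrently(long numReady) {
  return ignoreConcurrencyLimits()
      || getNumServersStarted() < maxConcurrency + numReady;
}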

Member Author:

Made the refactoring changes. The remaining logic checks whether the number of started servers is less than (maxConcurrency + ready managed servers for cluster "clusterName"). So if maxConcurrency is 1 and one server is started, it will wait for that server to reach the ready state before starting a new one. Please see #1855 (comment) for the pseudo code of this logic.

Member:

I looked at the comment. The pseudo code doesn't explain the logic, either.

It appears to me that the intent of numStarted < maxConcurrency + numReady is to look at the number of pods currently starting up (I think K8S calls that waiting) and compare that to how many may be started at once. If so, the actual comparison should be:

(numStarted - numReady) < maxConcurrency or

numWaiting < maxConcurrency.

If so, I would extract a numPodsWaiting method:

// returns the number of pods currently starting up
private int numWaiting(int numReady) {
  return getNumStarted() - numReady;
}

and compare that to the maxConcurrency.
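With that extraction, the whole condition could then read as below (a sketch; the parameter is widened to long to match canStartConcurrently, and the method that eventually lands in the diff further down is named numPodsNotReady rather than numWaiting):

private boolean canStartConcurrently(long numReady) {
  return ignoreConcurrencyLimits() || numWaiting(numReady) < maxConcurrency;
}

// number of pods that have been started but are not yet ready
private long numWaiting(long numReady) {
  return getNumServersStarted() - numReady;
}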

Member Author:

Yes, that's correct. I think k8s refers to pods that are currently starting up as "not ready" or "inactive". We can go with "waiting" if that clarifies the intention of the code. I'll go ahead and push the above change.

Member:

"not ready" or "inactive" work, too, if that is the terminology. Let's just be consistent and use whatever the standard is.

@@ -168,21 +171,23 @@ public NextAction apply(Packet packet) {

if (startDetailsQueue.isEmpty()) {
return doNext(new ManagedServerUpAfterStep(getNext()), packet);
- } else if (hasServerAvailableToStart(packet.getSpi(DomainPresenceInfo.class))) {
+ } else if (hasServerAvailableToStart(packet)) {
Member:

Good change. There was no need to expose the implementation as it had been.

@anpanigr (Member) left a comment:

I verified the following integration test use cases in the domain-on-pv model:

Start a domain with two managed servers in a dynamic cluster
Stop the admin server
Stop managed server2
Start managed server2

I see managed server2 coming up in the absence of the admin server.

- return ((this.maxConcurrency > 0) && (numStarted.get() < (this.maxConcurrency + numReady)))
- || (this.maxConcurrency == 0);
+ return (ignoreConcurrencyLimits() ||
+ (this.maxConcurrency > 0) && (getNumServersStarted() < (this.maxConcurrency + numReady)));
Member:

You should no longer need (this.maxConcurrency > 0) here, as you cannot reach this condition when it is 0.

Can you explain the logic in the rest of the expression? Why can you start a new server when the number of servers started is less than the concurrency limit added to the number ready?

Member Author:

OK, I'll make the change. As per OWLS-83431, we want to start one clustered server at a time and wait for it to be scheduled before starting a new server in the same cluster. So the number of started servers in a cluster should be less than or equal to the number of scheduled servers in that cluster. Initially, when no servers are scheduled, it will start one server and wait for it to be scheduled before starting a second server. The scheduling happens very fast (a few milliseconds). Please see #1855 (comment) for more details.
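To make the arithmetic concrete (an illustration based on the explanation above, not text from the PR): with maxConcurrency = 1, the first clustered server starts while numStarted = 0 and numReady = 0, since 0 - 0 < 1; while that pod is still starting up, numStarted = 1 and numReady = 0, so 1 - 0 < 1 is false and no further server is started; once the pod becomes ready, numReady = 1, the difference drops back to 0, and the next server in the cluster may start.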

}

private boolean canStartConcurrently(long numReady) {
- return ((this.maxConcurrency > 0) && (numStarted.get() < (this.maxConcurrency + numReady - 1)))
- || (this.maxConcurrency == 0);
+ return (ignoreConcurrencyLimits() || numPodsNotReady(numReady) < this.maxConcurrency);
Member:

Shouldn't the parentheses be around the expression on the right, not the entire condition? Does IntelliJ not highlight it for you?

Member:

Huh. Apparently not.

@russgold (Member) left a comment:

LGTM

@rjeberhard rjeberhard merged commit 9005e90 into develop Dec 11, 2020
@ankedia ankedia deleted the owls_86407 branch December 14, 2020 13:41
@ankedia ankedia restored the owls_86407 branch February 5, 2021 22:18
@ankedia ankedia deleted the owls_86407 branch March 3, 2021 21:50