
OWLS-86407 - Fixed the logic to start clustered managed server pods when admin server pod not running #2093


Merged: 8 commits merged into develop on Dec 11, 2020

Conversation

@ankedia (Member) commented Dec 9, 2020

Fixed the logic to start clustered managed server pods when the admin server pod is not running. The previous logic assumed that the admin server pod was scheduled and ready before starting managed server pods. Added a unit test for the new behavior.

Integration test results at - https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/3305

@@ -249,6 +248,19 @@ public void whenConcurrencyLimitIs1_secondClusteredServerDoesNotStartIfFirstIsNo
assertThat(getStartedManagedServers(), hasSize(1));
}

Member:

Please get in the habit of writing your unit tests before you make production code changes. It not only guides what you then implement, it also leads to writing better tests.

Start with the name. It should explain what is going on, and this doesn't, as it is simply not the case that we always want to start a managed server pod when there is no admin server. The case, based on the JIRA, is that if we scale up a cluster, we will start a managed server even if the admin server is not running, and the test should reflect that. So maybe something like 'whileAdminServerStopped_canStartManagedServer' would be better.

In the body of the test, the first call is to createDomainPresenceInfoWithServers(). It is really hard to tell what that is doing here. In the current test class, it was called exactly once, passing the name of the admin server to indicate that it was already running. Remembering Robert Martin's admonition on names, I would probably create a new method that does what this call does but is named createDomainPresenceInfoWithNoAdminServer. Perhaps we then need another call named createDomainPresenceInfoWithAdminServer to make it clearer. Since there are no invocations that actually pass in more servers, that should suffice.
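A minimal sketch of what the renamed test might look like, using the names suggested in this thread (whileAdminServerStopped_canStartManagedServer, createDomainPresenceInfoWithNoAdminServer) together with the getStartedManagedServers()/hasSize assertion already shown above; the cluster name and the scaleUpCluster helper are assumptions for illustration, not code from the repository:

// Hypothetical test sketch; only the names discussed in this review are taken from the thread.
@Test
public void whileAdminServerStopped_canStartManagedServer() {
  createDomainPresenceInfoWithNoAdminServer();          // admin server pod is not running
  scaleUpCluster("cluster1", 1);                        // hypothetical helper: request one more clustered server

  assertThat(getStartedManagedServers(), hasSize(1));   // the managed server starts without the admin server
}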

Member Author:

Changed the unit test method name and added new method createDomainPresenceInfoWithNoAdminServer as suggested.

assertThat(info.getNumScheduledServers("cluster1"), equalTo(2L));
assertThat(info.getNumScheduledServers("cluster2"), equalTo(2L));
assertThat(info.getNumScheduledServers(null), equalTo(1L));
assertThat(info.getNumScheduledManagedServers("cluster1", "admin-server"), equalTo(2L));
Member:

This is really unclear. The old call seemed to indicate that we were getting the number of managed servers in the specified cluster. Why do we also pass in "admin-server"? The interface is problematic.

Member Author:

The previous call returned both the admin server (if scheduled) and the scheduled managed servers. I was passing in the admin server name so that it could be filtered out of the list and only scheduled managed servers would be returned. My latest change adds a new method, so the original unit test stays the same.
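A rough sketch of the filtering described here; getNumScheduledManagedServers and its two-argument signature come from the test excerpt above, while the stream pipeline and the other helper names are assumptions about DomainPresenceInfo internals, not the actual implementation:

// Illustrative only: counts scheduled pods in the cluster, excluding the admin server.
long getNumScheduledManagedServers(String clusterName, String adminServerName) {
  return getServerNamesInCluster(clusterName).stream()        // hypothetical helper returning the cluster's server names
      .filter(this::isServerPodScheduled)                      // hypothetical helper: pod has been assigned a node
      .filter(serverName -> !serverName.equals(adminServerName))
      .count();
}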

Member:

So the problem is that DomainPresenceInfo doesn't know which server is the admin server. I think we should address that, but not right now, perhaps, as I don't see an easy and obvious solution.

}

private boolean canStartConcurrently(long numReady) {
- return ((this.maxConcurrency > 0) && (numStarted.get() < (this.maxConcurrency + numReady - 1)))
+ return ((this.maxConcurrency > 0) && (numStarted.get() < (this.maxConcurrency + numReady)))
Member:

This would be a good time to do some refactoring. The expression is not so easy to read and understand.

I would start by extracting maxConcurrency == 0 into a method named 'ignoreConcurrencyLimits' and list that first in the expression, not last. Then you don't really need this.maxConcurrency > 0 as that is implied (unless there is some support for negative values).

The remaining part is still too complex and not obvious. So extract numStarted.get() into its own method, getNumServersStarted, as there is no reason for all of the callers to know that there is an Atomic backing the value. Then can you explain what the rest of the logic is?
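A sketch of the extraction being suggested here, assuming the maxConcurrency and numStarted fields visible in the diff above; the exact signatures are illustrative, not taken from the final code:

private boolean ignoreConcurrencyLimits() {
  return maxConcurrency == 0;
}

private int getNumServersStarted() {
  return numStarted.get();   // callers no longer need to know an AtomicInteger backs this value
}

private boolean canStartConcurrently(long numReady) {
  return ignoreConcurrencyLimits()
      || getNumServersStarted() < maxConcurrency + numReady;
}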

Member Author:

Made the refactoring changes. The remaining logic checks whether the number of started servers is less than (maxConcurrency + ready managed servers for cluster "clusterName"). So if maxConcurrency is 1 and one server is started, it will wait for that server to reach the ready state before starting a new one. Please see #1855 (comment) for the pseudo code of this logic.

Member:

I looked at the comment. The pseudo code doesn't explain the logic, either.

It appears to me that the intent of numStarted < maxConcurrency + numReady is to look at the number of pods currently starting up (I think K8S calls that waiting) and compare that to how many may be started at once. If so, the actual comparison should be:

(numStarted - numReady) < maxConcurrency or

numWaiting < maxConcurrency.

If so, I would extract a numPodsWaiting method:

// returns the number of pods currently starting up
private int numWaiting(int numReady) {
  return getNumStarted() - numReady;
}

and compare that to the maxConcurrency.
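With that extraction, the whole condition could then read as below (a sketch; the parameter is widened to long to match canStartConcurrently, and the method that eventually lands in the diff further down is named numPodsNotReady rather than numWaiting):

private boolean canStartConcurrently(long numReady) {
  return ignoreConcurrencyLimits() || numWaiting(numReady) < maxConcurrency;
}

// number of pods that have been started but are not yet ready
private long numWaiting(long numReady) {
  return getNumServersStarted() - numReady;
}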

Member Author:

Yes, that's correct. I think k8s refers to pods that are currently starting up as "not ready" or "inactive". We can go with "waiting" if that clarifies the intention of the code. I'll go ahead and push the above change.

Member:

"not ready" or "inactive" work, too, if that is the terminology. Let's just be consistent and use whatever the standard is.

@@ -168,21 +171,23 @@ public NextAction apply(Packet packet) {

if (startDetailsQueue.isEmpty()) {
return doNext(new ManagedServerUpAfterStep(getNext()), packet);
- } else if (hasServerAvailableToStart(packet.getSpi(DomainPresenceInfo.class))) {
+ } else if (hasServerAvailableToStart(packet)) {
Member:

Good change. There was no need to expose the implementation as it had been.

@anpanigr (Member) left a comment:

I verified the following integration test use cases in the domain-on-pv model:

Start a domain with two managed servers in a dynamic cluster
Stop the admin server
Stop managed server2
Start managed server2

I see managed server2 coming up in the absence of the admin server.

- return ((this.maxConcurrency > 0) && (numStarted.get() < (this.maxConcurrency + numReady)))
- || (this.maxConcurrency == 0);
+ return (ignoreConcurrencyLimits() ||
+ (this.maxConcurrency > 0) && (getNumServersStarted() < (this.maxConcurrency + numReady)));
Member:

You should no longer need (this.maxConcurrency > 0) here, as you cannot reach this condition when it is 0.

Can you explain the logic in the rest of the expression? Why can you start a new server when the number of servers started is less than the concurrency limit added to the number ready?

Member Author:

OK, I'll make the change. As per OWLS-83431, we want to start one clustered server at a time and wait for it to be scheduled before starting a new server in the same cluster. So the number of started servers in a cluster should be less than or equal to the number of scheduled servers in that cluster. Initially, when no servers are scheduled, it will start one server and wait for it to be scheduled before starting a second server. The scheduling happens very fast (a few milliseconds). Please see #1855 (comment) for more details.
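To make the arithmetic concrete (an illustration based on the explanation above, not text from the PR): with maxConcurrency = 1, the first clustered server starts while numStarted = 0 and numReady = 0, since 0 - 0 < 1; while that pod is still starting up, numStarted = 1 and numReady = 0, so 1 - 0 < 1 is false and no further server is started; once the pod becomes ready, numReady = 1, the difference drops back to 0, and the next server in the cluster may start.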

}

private boolean canStartConcurrently(long numReady) {
- return ((this.maxConcurrency > 0) && (numStarted.get() < (this.maxConcurrency + numReady - 1)))
- || (this.maxConcurrency == 0);
+ return (ignoreConcurrencyLimits() || numPodsNotReady(numReady) < this.maxConcurrency);
Member:

Shouldn't the parentheses be around the expression on the right, not the entire condition? Does IntelliJ not highlight it for you?

Member:

Huh. Apparently not.

@russgold (Member) left a comment:

LGTM

@rjeberhard rjeberhard merged commit 9005e90 into develop Dec 11, 2020
@ankedia ankedia deleted the owls_86407 branch December 14, 2020 13:41
@ankedia ankedia restored the owls_86407 branch February 5, 2021 22:18
@ankedia ankedia deleted the owls_86407 branch March 3, 2021 21:50