OWLS-86407 - Fixed the logic to start clustered managed server pods when admin server pod not running #2093
Conversation
@@ -249,6 +248,19 @@ public void whenConcurrencyLimitIs1_secondClusteredServerDoesNotStartIfFirstIsNo
    assertThat(getStartedManagedServers(), hasSize(1));
  }
Please get in the habit of writing your unit tests before you make production code changes. It not only guides what you then implement, it also leads to writing better tests.
Start with the name. It should explain what is going on, and this doesn't, as it is simply not the case that we always want to start a managed server pod when there is no admin server. The case, based on the JIRA, is that if we scale up a cluster, we will start a managed server even if the admin server is not running, and the test should reflect that. So maybe something like 'whileAdminServerStopped_canStartManagedServer' would be better.
In the body of the test, the first call is to createDomainPresenceInfoWithServers(). It is really hard to tell what that is doing here. In the current test class, it was called exactly once, passing the name of the admin server to indicate that it was already running. Remembering Robert Martin's admonition on names, I would probably create a new method that does what this call does but is named createDomainPresenceInfoWithNoAdminServer. Perhaps we then need another method named createDomainPresenceInfoWithAdminServer to make it clearer. Since there are no invocations that actually pass in more servers, that should suffice.
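For concreteness, a renamed test along those lines might look something like the sketch below; the helper and driver names (createDomainPresenceInfoWithNoAdminServer, invokeStepWithConfiguredCluster) are illustrative assumptions, not the code actually added in the PR.

    @Test
    public void whileAdminServerStopped_canStartManagedServer() {
      // Assumed helper: registers only managed-server pods, so no admin server is present.
      createDomainPresenceInfoWithNoAdminServer();

      // Assumed driver: runs the managed-servers-up step for a cluster scaled to one server.
      invokeStepWithConfiguredCluster("cluster1", 1);

      // The managed server should start even though the admin server pod is not running.
      assertThat(getStartedManagedServers(), hasSize(1));
    }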
Changed the unit test method name and added a new method createDomainPresenceInfoWithNoAdminServer as suggested.
assertThat(info.getNumScheduledServers("cluster1"), equalTo(2L)); | ||
assertThat(info.getNumScheduledServers("cluster2"), equalTo(2L)); | ||
assertThat(info.getNumScheduledServers(null), equalTo(1L)); | ||
assertThat(info.getNumScheduledManagedServers("cluster1", "admin-server"), equalTo(2L)); |
This is really unclear. The old call seemed to indicate that we were getting the number of managed servers in the specified cluster. Why do we also pass in "admin-server"? The interface is problematic.
The previous call returned both the admin server (if scheduled) and the managed servers that are scheduled. I was passing in the admin server name so that the admin server could be filtered out of the list and only scheduled managed servers are returned. My latest change adds a new method so the original unit test stays the same.
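A minimal sketch of what such a filtering method could look like inside DomainPresenceInfo, assuming helper accessors (getScheduledServerPodsInCluster, getServerName) that are not the PR's actual code:

    // Counts scheduled managed servers in the given cluster, excluding the admin server,
    // since DomainPresenceInfo itself does not know which server is the admin server.
    long getNumScheduledManagedServers(String clusterName, String adminServerName) {
      return getScheduledServerPodsInCluster(clusterName)   // assumed: stream of pods already assigned to a node
          .filter(pod -> !adminServerName.equals(getServerName(pod)))
          .count();
    }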
So the problem is that DomainPresenceInfo doesn't know which server is the admin server. I think we should address that, but not right now, perhaps, as I don't see an easy and obvious solution.
  }

  private boolean canStartConcurrently(long numReady) {
-   return ((this.maxConcurrency > 0) && (numStarted.get() < (this.maxConcurrency + numReady - 1)))
+   return ((this.maxConcurrency > 0) && (numStarted.get() < (this.maxConcurrency + numReady)))
This would be a good time to do some refactoring. The expression is not so easy to read and understand.
I would start by extracting maxConcurrency == 0 into a method named ignoreConcurrencyLimits and list that first in the expression, not last. Then you don't really need this.maxConcurrency > 0, as that is implied (unless there is some support for negative values).
The remaining part is still too complex and not obvious. So extract numStarted.get() into its own method, getNumServersStarted, as there is no reason for all of the callers to know that there is an Atomic backing the value. Then can you explain what the rest of the logic is?
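In code, the two extractions being suggested would look roughly like this (a sketch assuming the existing maxConcurrency and numStarted fields; the names follow the reviewer's suggestion rather than the merged code):

    // 0 is treated as "no limit on concurrent server starts".
    private boolean ignoreConcurrencyLimits() {
      return maxConcurrency == 0;
    }

    // Hides the AtomicInteger backing the count from callers.
    private long getNumServersStarted() {
      return numStarted.get();
    }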
Made the refactoring changes. The remaining logic checks whether the number of started servers is less than (maxConcurrency + the number of ready managed servers for the cluster "clusterName"). So if maxConcurrency is 1 and one server is started, it'll wait for that server to reach the ready state before starting a new server. Please see #1855 (comment) for the pseudo code of this logic.
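To make that concrete with a quick worked example (illustrative numbers, not from the thread): with maxConcurrency = 1, the first server starts because 0 < 1 + 0; while it is starting, numStarted = 1 and numReady = 0, so 1 < 1 + 0 is false and the second server waits; once the first pod becomes ready, numReady = 1, so 1 < 1 + 1 is true and the next server may start.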
I looked at the comment. The pseudo code doesn't explain the logic, either.
It appears to me that the intent of numStarted < maxConcurrency + numReady is to look at the number of pods currently starting up (I think K8s calls that "waiting") and compare that to how many may be started at once. If so, the actual comparison should be (numStarted - numReady) < maxConcurrency, or numWaiting < maxConcurrency.
If so, I would extract a numWaiting method:

    // returns the number of pods currently starting up
    private int numWaiting(int numReady) {
      return getNumStarted() - numReady;
    }

and compare that to the maxConcurrency.
Yes, that's correct. I think k8s calls pods that are currently starting up "not ready" or "inactive". We can go with "waiting" if that clarifies the intention of the code. I'll go ahead and push the above change.
"not ready" or "inactive" work, too, if that is the terminology. Let's just be consistent and use whatever the standard is.
@@ -168,21 +171,23 @@ public NextAction apply(Packet packet) {

      if (startDetailsQueue.isEmpty()) {
        return doNext(new ManagedServerUpAfterStep(getNext()), packet);
-     } else if (hasServerAvailableToStart(packet.getSpi(DomainPresenceInfo.class))) {
+     } else if (hasServerAvailableToStart(packet)) {
Good change. There was no need to expose the implementation as it had been.
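The shape this change implies is roughly the following sketch; the overload holding the original check is an assumption, not the PR's exact code:

    // Callers now pass the whole Packet; the method looks up what it needs internally.
    private boolean hasServerAvailableToStart(Packet packet) {
      DomainPresenceInfo info = packet.getSpi(DomainPresenceInfo.class);
      return hasServerAvailableToStart(info);  // assumed overload containing the original logic
    }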
I verified the following integration test use cases in the domain-on-pv model:
- Start a domain with two managed servers in a dynamic cluster
- Stop the admin server
- Stop managed server2
- Start managed server2
I see managed server2 coming up in the absence of the admin server.
-   return ((this.maxConcurrency > 0) && (numStarted.get() < (this.maxConcurrency + numReady)))
-       || (this.maxConcurrency == 0);
+   return (ignoreConcurrencyLimits() ||
+       (this.maxConcurrency > 0) && (getNumServersStarted() < (this.maxConcurrency + numReady)));
You should no longer need (this.maxConcurrency > 0) here, as you cannot reach this condition when it is 0.
Can you explain the logic in the rest of the expression? Why can you start a new server when the number of servers started is less than the concurrency limit added to the number ready?
Ok, I'll make the change. As per OWLS-83431, we want to start one clustered server at a time and wait for it to be scheduled before starting a new server in the same cluster. So the number of started servers in a cluster should be less than or equal to the number of scheduled servers in that cluster. Initially, when there are no servers scheduled, it'll start one server and wait for it to be scheduled before starting the second server. The scheduling happens very fast (a few milliseconds). Please see #1855 (comment) for more details.
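A minimal sketch of that scheduling gate, assuming hypothetical per-cluster helpers (getNumStartedServers, getNumScheduledServers) rather than the PR's actual condition:

    // Allow the next clustered server to start only after every previously started
    // server in the cluster has been scheduled onto a node.
    private boolean canStartAnotherClusteredServer(String clusterName) {
      return getNumStartedServers(clusterName) <= getNumScheduledServers(clusterName);
    }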
  }

  private boolean canStartConcurrently(long numReady) {
-   return ((this.maxConcurrency > 0) && (numStarted.get() < (this.maxConcurrency + numReady - 1)))
-       || (this.maxConcurrency == 0);
+   return (ignoreConcurrencyLimits() || numPodsNotReady(numReady) < this.maxConcurrency);
Shouldn't the parentheses be around the expression on the right, not the entire condition? Does IntelliJ not highlight it for you?
Huh. Apparently not.
LGTM
Fixed the logic to start clustered managed server pods when the admin server pod is not running. The previous logic assumed that the admin server pod was scheduled and ready when starting managed server pods. Added a unit test for the new behavior.
Integration test results: https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/3305
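Putting the thread together, the final shape of the concurrency check reads roughly as below; this is a reconstruction from the diffs above (and assumes the ignoreConcurrencyLimits and getNumServersStarted helpers sketched earlier), not a copy of the merged code:

    // Pods that have been started but are not yet ready, i.e. are still starting up.
    private long numPodsNotReady(long numReady) {
      return getNumServersStarted() - numReady;
    }

    // A new clustered server may start while the number of pods still starting up
    // stays below the configured concurrency limit (0 means no limit).
    private boolean canStartConcurrently(long numReady) {
      return ignoreConcurrencyLimits() || numPodsNotReady(numReady) < this.maxConcurrency;
    }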