-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Exclude specific transport actions from request size limit check #17951
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exclude specific transport actions from request size limit check #17951
Conversation
@martijnvg: Can you check please? |
Maybe add a test that making a bunch of requests that are excluded doesn't trigger it the same way you test that aren't does trigger it? Maybe also be nice to have a case where you push it to the limit and trigger the failure and then fire in a bunch more of the requests that don't count and make sure they don't trip it. |
I agree with @nik9000, this really needs tests. |
@danielmitterdorfer this change does look good, but like @nik9000 and @jasontedor say it would be good if we can add tests. Maybe what would be a simple tests is if we set the limit size to 0, form a cluster and start a transport client, then send requests via the transport client (which will fail with breaker exceptions, but that is expected) |
I can add a test that excluded requests don't trigger the circuit breaker but I think the second test that @nik9000 suggested (push to the limit, fire more requests that shouldn't trigger the breaker) would require explicit control over when request scheduling (i.e. a deterministic order when requests finish) otherwise I expect the test to be quite unstable as the limit is not on individual request size but on the sum of all in flight requests. I like the approach suggested by @martijnvg as it will uncover other actions that should also be excluded from the limit check. Thanks for the idea. |
Is it enough to wedge the request in a running state? If so you can use TestTaskPlugin to get in integration test for it. And I think an integration test for this kind of thing is important. But the whole 0 size breaker thing should work too. Because this has tripped in tests I wonder if we're too close to the edge? It looks like the limit is 100% of the heap though. Am I reading that right? |
@@ -113,7 +113,7 @@ protected TransportReplicationAction(Settings settings, String actionName, Trans | |||
transportService.registerRequestHandler(actionName, request, ThreadPool.Names.SAME, new OperationTransportHandler()); | |||
transportService.registerRequestHandler(transportPrimaryAction, request, executor, new PrimaryOperationTransportHandler()); | |||
// we must never reject on because of thread pool capacity on replicas | |||
transportService.registerRequestHandler(transportReplicaAction, replicaRequest, executor, true, | |||
transportService.registerRequestHandler(transportReplicaAction, replicaRequest, executor, true, true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielmitterdorfer One thing that we do with TransportReplicationAction
is that we ensure that if a replication action is successful on a primary, it will not be subjected to queue length limits on replicas (you don't want request on replicas to be subjected to EsRejectedExecutionException
). That's what the first true
parameter corresponding to forceExecution
here is doing. But it seems like we are subjecting the replication requests to the size limits and I think that that might be wrong, or at least inconsistent with how we manage the queues for replication requests. Can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jasontedor This change just ensured that I don't change the previous behavior (i.e. leave limiting enabled). With your explanation it makes sense to exclude this action from request size limiting. I'll change that.
In addition to requiring tests, I left a question about the handling of replication requests. |
I think the test with a zero breaker limit is even better because it uncovers the actions that we need to exclude from request size limiting to keep a cluster running that does not receive client requests.
Yes, the default limit is |
Got it. So the tests we think are failing spuriously come from this?
Makes sense. I like it! |
In the past,
|
I pushed another commit which adds an integration test now as suggested by @martijnvg. Can you have a look @jasontedor / @nik9000? I also excluded all transport actions from request size limiting which were uncovered by the test and decided to overload the constructor of some transport base classes instead of changing all call sites as it seemed to be the lesser of the two evils to me. |
.build(); | ||
assertAcked(client().admin().cluster().prepareUpdateSettings().setTransientSettings(settings)); | ||
|
||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use expectThrows
here.
I don't like that there isn't a test for each of the actions that are being whitelisted from the size checks, especially |
final RequestHandlerRegistry reg = transportServiceAdapter.getRequestHandler(action); | ||
if (reg == null) { | ||
throw new ActionNotFoundTransportException(action); | ||
} | ||
long reservedBytes = 0; | ||
if (reg.isCheckSizeLimit()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I reading this right? We are not counting bytes from requests that are not subject to the limits in the sum of bytes of all outstanding requests? Is that we want? Or do we want to include the bytes but just not subject the request to the threshold check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. Where would you include bytes that come from requests that are excluded from the limit check and what would you do with them then? The circuit breaker API provides a method addWithoutBreaking(long)
but this still counts all added bytes in the same internal data structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without digging into it it sounds like addWithoutBreaking
is what we want for those requests? They still take up space so why not count them? I get that this could push us past the limit but that seems, if not a good thing, at least appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is especially important in the case of TransportReplicationAction
where the request sizes can get big. So we definitely want to count those bytes, but still exclude these requests from the checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nik9000 Yes, you're right. That was just a misunderstanding. It makes total sense to use addWithoutBreaking
here as certain actions can also lead to a pretty large request size (like the TransportReplicationAction
that Jason has mentioned).
@jasontedor: I've added a proof of concept test for TRA ( |
emptyMap(), emptySet(), Version.CURRENT); | ||
|
||
transportA.connectToNode(nodeB); | ||
clusterService = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right. Corrected.
@jasontedor Thanks for the review. I'll add then other tests in a similar fashion. |
After discussion with @bleskes and @jasontedor we reduce the list of excluded requests to the bare minimum:
All other requests will trip the in-flight request circuit breaker. The two mentioned requests are excluded because they are (a) very lightweight (in terms of memory usage) and (b) if pings fail, we probably move a lot of data around. As the intention of in-flight request size limiting on transport level is to prevent too much GC activity we have decided to be more conservative and include all but the ping requests. |
@jasontedor I have pushed another commit that just excludes pings from the limit check and added tests for that. As TRA is now treated as any other action, I've also removed |
import java.util.concurrent.atomic.AtomicBoolean; | ||
import java.util.concurrent.atomic.AtomicInteger; | ||
|
||
class Action extends TransportReplicationAction<Action.Request, Action.Request, Action.Response> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this refactoring is needed anymore. Can you revert it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@jasontedor Both nitpicks are gone now. :) |
@@ -77,6 +80,10 @@ public boolean isForceExecution() { | |||
return forceExecution; | |||
} | |||
|
|||
public boolean isCheckSizeLimit() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change the name of this (and the corresponding field)? Maybe canTripCircuitBreaker
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, what I like about the current name is that it captures the intention (skip the in-flight request limit check) but is not implementation specific (i.e. trip the circuit breaker). However, you seem to dislike it. What about a more specific name: checkInflightRequestLimit
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name isCheckSizeLimit
is not clear what it's for, but it's really used only to check whether or not the request being handled can cause the circuit breaker to trip so it should be implementation specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the old name is not ideal but I think checkInflightRequestLimit
is suitable here. I still think this name should not need to change in case we'd implement the check differently (not that we want to do this, it's just a thought experiment to make clear that canTripCircuitBreaker
is implementation specific. Even in that case I'd go so far as to call it canTripInFlightRequestCircuitBreaker
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even in that case I'd go so far as to call it
canTripInFlightRequestCircuitBreaker
I'm fine with this, but do prefer my less-wordy suggestion. 😇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's call it a tie. ;) I've changed it.
@danielmitterdorfer Two more comments, and then we are going to be good. |
@jasontedor I addressed your comments now. |
assertThat(pingProbeB.completedPings(), greaterThanOrEqualTo(minExpectedPings)); | ||
} | ||
|
||
private static class PingProbe extends MockTransportService.Tracer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, just what I had in mind. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I was just not aware of the tracing stuff. I also was not too keen on waiting in the test. But now I know about it. :)
LGTM. |
Thanks for your review @jasontedor |
We add support to explicitly exclude specific transport actions from the request size limit check. We also exclude the following request types currently: *MasterPingRequest * PingRequest
eb3c4e6
to
ddbfda2
Compare
We don't want transport clients to disconnect due to circuit breakers preventing the liveness request from executing. Relates to elastic#17951
We don't want transport clients to disconnect due to circuit breakers preventing the liveness request from executing. Relates to #17951
With this commit we exclude certain HTTP requests that are needed to inspect the cluster from HTTP request limiting to ensure these commands are processed even in critical memory conditions. Relates elastic#17951, relates elastic#18145
We add support to explicitly exclude specific transport actions from the request size limit check which are needed to keep cluster-internal operations working:
We exclude the following request types: