Skip to content

Exclude specific transport actions from request size limit check #17951

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

danielmitterdorfer
Copy link
Member

@danielmitterdorfer danielmitterdorfer commented Apr 25, 2016

We add support to explicitly exclude specific transport actions from the request size limit check which are needed to keep cluster-internal operations working:

We exclude the following request types:

  • MasterPingRequest
  • PingRequest

@danielmitterdorfer
Copy link
Member Author

@martijnvg: Can you check please?

@nik9000
Copy link
Member

nik9000 commented Apr 25, 2016

Maybe add a test that making a bunch of requests that are excluded doesn't trigger it the same way you test that aren't does trigger it? Maybe also be nice to have a case where you push it to the limit and trigger the failure and then fire in a bunch more of the requests that don't count and make sure they don't trip it.

@jasontedor
Copy link
Member

I agree with @nik9000, this really needs tests.

@martijnvg
Copy link
Member

martijnvg commented Apr 25, 2016

@danielmitterdorfer this change does look good, but like @nik9000 and @jasontedor say it would be good if we can add tests. Maybe what would be a simple tests is if we set the limit size to 0, form a cluster and start a transport client, then send requests via the transport client (which will fail with breaker exceptions, but that is expected)

@danielmitterdorfer
Copy link
Member Author

I can add a test that excluded requests don't trigger the circuit breaker but I think the second test that @nik9000 suggested (push to the limit, fire more requests that shouldn't trigger the breaker) would require explicit control over when request scheduling (i.e. a deterministic order when requests finish) otherwise I expect the test to be quite unstable as the limit is not on individual request size but on the sum of all in flight requests.

I like the approach suggested by @martijnvg as it will uncover other actions that should also be excluded from the limit check. Thanks for the idea.

@nik9000
Copy link
Member

nik9000 commented Apr 25, 2016

explicit control over when request scheduling

Is it enough to wedge the request in a running state? If so you can use TestTaskPlugin to get in integration test for it. And I think an integration test for this kind of thing is important. But the whole 0 size breaker thing should work too.

Because this has tripped in tests I wonder if we're too close to the edge? It looks like the limit is 100% of the heap though. Am I reading that right?

@@ -113,7 +113,7 @@ protected TransportReplicationAction(Settings settings, String actionName, Trans
transportService.registerRequestHandler(actionName, request, ThreadPool.Names.SAME, new OperationTransportHandler());
transportService.registerRequestHandler(transportPrimaryAction, request, executor, new PrimaryOperationTransportHandler());
// we must never reject on because of thread pool capacity on replicas
transportService.registerRequestHandler(transportReplicaAction, replicaRequest, executor, true,
transportService.registerRequestHandler(transportReplicaAction, replicaRequest, executor, true, true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielmitterdorfer One thing that we do with TransportReplicationAction is that we ensure that if a replication action is successful on a primary, it will not be subjected to queue length limits on replicas (you don't want request on replicas to be subjected to EsRejectedExecutionException). That's what the first true parameter corresponding to forceExecution here is doing. But it seems like we are subjecting the replication requests to the size limits and I think that that might be wrong, or at least inconsistent with how we manage the queues for replication requests. Can you clarify?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasontedor This change just ensured that I don't change the previous behavior (i.e. leave limiting enabled). With your explanation it makes sense to exclude this action from request size limiting. I'll change that.

@jasontedor
Copy link
Member

In addition to requiring tests, I left a question about the handling of replication requests.

@danielmitterdorfer
Copy link
Member Author

I think the test with a zero breaker limit is even better because it uncovers the actions that we need to exclude from request size limiting to keep a cluster running that does not receive client requests.

Because this has tripped in tests I wonder if we're too close to the edge? It looks like the limit is 100% of the heap though. Am I reading that right?

Yes, the default limit is min(100%, parent_breaker_limit) of the heap (as discussed in the original PR). In certain tests we explicitly set a low limit though (in NettyHttpRequestSizeLimitIT its currently 2kb). So we're hitting "just" this low limit not 100% of the heap.

@nik9000
Copy link
Member

nik9000 commented Apr 25, 2016

So we're hitting "just" this low limit not 100% of the heap.

Got it. So the tests we think are failing spuriously come from this?

keep a cluster running that does not receive client requests

Makes sense. I like it!

@danielmitterdorfer
Copy link
Member Author

So the tests we think are failing spuriously come from this?

In the past, NettyHttpRequestSizeLimitIT either failed for several reasons but never because we hit 100% of the heap. IIRC the reasons were:

  • We did not send enough requests so that there were enough of them in flight in certain scenarios (fixed directly in the test)
  • Internal transport requests hit a limit when they shouldn't be included in the size limiting in the first place (to be fixed with this PR)

@danielmitterdorfer
Copy link
Member Author

I pushed another commit which adds an integration test now as suggested by @martijnvg. Can you have a look @jasontedor / @nik9000?

I also excluded all transport actions from request size limiting which were uncovered by the test and decided to overload the constructor of some transport base classes instead of changing all call sites as it seemed to be the lesser of the two evils to me.

.build();
assertAcked(client().admin().cluster().prepareUpdateSettings().setTransientSettings(settings));

try {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use expectThrows here.

@jasontedor
Copy link
Member

I don't like that there isn't a test for each of the actions that are being whitelisted from the size checks, especially TransportReplicationAction; this behavior is too important to not be testing.

final RequestHandlerRegistry reg = transportServiceAdapter.getRequestHandler(action);
if (reg == null) {
throw new ActionNotFoundTransportException(action);
}
long reservedBytes = 0;
if (reg.isCheckSizeLimit()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I reading this right? We are not counting bytes from requests that are not subject to the limits in the sum of bytes of all outstanding requests? Is that we want? Or do we want to include the bytes but just not subject the request to the threshold check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. Where would you include bytes that come from requests that are excluded from the limit check and what would you do with them then? The circuit breaker API provides a method addWithoutBreaking(long) but this still counts all added bytes in the same internal data structure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without digging into it it sounds like addWithoutBreaking is what we want for those requests? They still take up space so why not count them? I get that this could push us past the limit but that seems, if not a good thing, at least appropriate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is especially important in the case of TransportReplicationAction where the request sizes can get big. So we definitely want to count those bytes, but still exclude these requests from the checks.

Copy link
Member Author

@danielmitterdorfer danielmitterdorfer Apr 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nik9000 Yes, you're right. That was just a misunderstanding. It makes total sense to use addWithoutBreaking here as certain actions can also lead to a pretty large request size (like the TransportReplicationAction that Jason has mentioned).

@danielmitterdorfer
Copy link
Member Author

@jasontedor: I've added a proof of concept test for TRA (TransportReplicationLimitTests). I am not really familiar with this area but I tried to isolate exactly this request and ensure that the request gets through without triggering the circuit breaker but others doesn't (as the limit is zero). I'd first iterate on this test and then add tests for the other actions.

emptyMap(), emptySet(), Version.CURRENT);

transportA.connectToNode(nodeB);
clusterService = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right. Corrected.

@danielmitterdorfer
Copy link
Member Author

@jasontedor Thanks for the review. I'll add then other tests in a similar fashion.

@danielmitterdorfer
Copy link
Member Author

danielmitterdorfer commented May 9, 2016

After discussion with @bleskes and @jasontedor we reduce the list of excluded requests to the bare minimum:

  • MasterPingRequest
  • PingRequest

All other requests will trip the in-flight request circuit breaker. The two mentioned requests are excluded because they are (a) very lightweight (in terms of memory usage) and (b) if pings fail, we probably move a lot of data around. As the intention of in-flight request size limiting on transport level is to prevent too much GC activity we have decided to be more conservative and include all but the ping requests.

@danielmitterdorfer
Copy link
Member Author

@jasontedor I have pushed another commit that just excludes pings from the limit check and added tests for that. As TRA is now treated as any other action, I've also removed TransportReplicationLimitTests again.

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class Action extends TransportReplicationAction<Action.Request, Action.Request, Action.Response> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this refactoring is needed anymore. Can you revert it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jasontedor jasontedor self-assigned this May 11, 2016
@danielmitterdorfer
Copy link
Member Author

@jasontedor Both nitpicks are gone now. :)

@@ -77,6 +80,10 @@ public boolean isForceExecution() {
return forceExecution;
}

public boolean isCheckSizeLimit() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change the name of this (and the corresponding field)? Maybe canTripCircuitBreaker?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, what I like about the current name is that it captures the intention (skip the in-flight request limit check) but is not implementation specific (i.e. trip the circuit breaker). However, you seem to dislike it. What about a more specific name: checkInflightRequestLimit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name isCheckSizeLimit is not clear what it's for, but it's really used only to check whether or not the request being handled can cause the circuit breaker to trip so it should be implementation specific.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the old name is not ideal but I think checkInflightRequestLimit is suitable here. I still think this name should not need to change in case we'd implement the check differently (not that we want to do this, it's just a thought experiment to make clear that canTripCircuitBreaker is implementation specific. Even in that case I'd go so far as to call it canTripInFlightRequestCircuitBreaker).

Copy link
Member

@jasontedor jasontedor May 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in that case I'd go so far as to call it canTripInFlightRequestCircuitBreaker

I'm fine with this, but do prefer my less-wordy suggestion. 😇

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's call it a tie. ;) I've changed it.

@jasontedor
Copy link
Member

@danielmitterdorfer Two more comments, and then we are going to be good.

@danielmitterdorfer
Copy link
Member Author

@jasontedor I addressed your comments now.

assertThat(pingProbeB.completedPings(), greaterThanOrEqualTo(minExpectedPings));
}

private static class PingProbe extends MockTransportService.Tracer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, just what I had in mind. Thank you!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I was just not aware of the tracing stuff. I also was not too keen on waiting in the test. But now I know about it. :)

@jasontedor
Copy link
Member

LGTM.

@danielmitterdorfer
Copy link
Member Author

Thanks for your review @jasontedor

We add support to explicitly exclude specific transport actions
from the request size limit check.

We also exclude the following request types currently:

*MasterPingRequest
* PingRequest
@danielmitterdorfer danielmitterdorfer merged commit ddbfda2 into elastic:master May 13, 2016
s1monw added a commit to s1monw/elasticsearch that referenced this pull request May 28, 2016
We don't want transport clients to disconnect due to circuit breakers
preventing the liveness request from executing.

Relates to elastic#17951
s1monw added a commit that referenced this pull request May 31, 2016
We don't want transport clients to disconnect due to circuit breakers
preventing the liveness request from executing.

Relates to #17951
danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this pull request Jun 13, 2016
With this commit we exclude certain HTTP requests that are needed to inspect the cluster
from HTTP request limiting to ensure these commands are processed even in critical
memory conditions.

Relates elastic#17951, relates elastic#18145
danielmitterdorfer added a commit that referenced this pull request Jun 15, 2016
With this commit we exclude certain HTTP requests that are needed to inspect the cluster
from HTTP request limiting to ensure these commands are processed even in critical
memory conditions.

Relates #17951, relates #18145, closes #18833
danielmitterdorfer added a commit that referenced this pull request Jun 15, 2016
With this commit we exclude certain HTTP requests that are needed to inspect the cluster
from HTTP request limiting to ensure these commands are processed even in critical
memory conditions.

Relates #17951, relates #18145, closes #18833
danielmitterdorfer added a commit that referenced this pull request Jun 15, 2016
With this commit we exclude certain HTTP requests that are needed to inspect the cluster
from HTTP request limiting to ensure these commands are processed even in critical
memory conditions.

Relates #17951, relates #18145, closes #18833
@danielmitterdorfer danielmitterdorfer deleted the breaker-excludes branch June 15, 2016 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants