docs: clarify behavior of hedge_on_per_try_timeout #12983

Merged
htuch merged 11 commits into envoyproxy:master from the patch-1 branch on Nov 30, 2020

Conversation

ikonst
Contributor

@ikonst ikonst commented Sep 4, 2020

No description provided.

@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy[\w/]*/(v1alpha\d?|v1|v2alpha\d?|v2))|(api/envoy/type/(matcher/)?\w+.proto).
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #12983 was opened by ikonst.

see: more, trace.

@ikonst
Contributor Author

ikonst commented Sep 4, 2020

@snowp I had a hard time understanding the original phrasing so I took a stab at rewriting it, but I'm not really sure I'm describing how it actually works.

@snowp
Contributor

snowp commented Sep 4, 2020

Thanks for improving the docs!

@mpuncel You wanna give this a look before I do?

@ikonst Can you fix DCO? There are steps to do so in CONTRIBUTING.md

@ikonst ikonst force-pushed the patch-1 branch 2 times, most recently from c27e3bc to 1582210 Compare September 4, 2020 15:13
@stale

stale bot commented Sep 11, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 11, 2020
@ikonst
Contributor Author

ikonst commented Sep 13, 2020

bump @mpuncel

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Sep 13, 2020
Contributor

@snowp snowp left a comment

Thanks for improving the docs!

I think you'll also want to update the V3 docs, V2 is on its way out. The V4alpha docs should then be automatically updated once you run the proto_format.sh script.

api/envoy/api/v2/route/route_components.proto
// response headers would otherwise be retried according the specified
// :ref:`RetryPolicy <envoy_api_msg_route.RetryPolicy>`.
// Indicates that a hedged request should be sent when the per-try timeout is hit.
// This will only occur if the :ref:`RetryPolicy <envoy_api_msg_route.RetryPolicy>` also indicates that
Contributor

@mpuncel mpuncel Sep 14, 2020

I don't think this part is right: the retry policy doesn't matter when the per-try timeout is hit, only whether hedge_on_per_try_timeout is set.

Contributor

@mpuncel mpuncel Sep 14, 2020

reference comment:

RetryStatus RetryStateImpl::shouldHedgeRetryPerTryTimeout(DoRetryCallback callback) {
  // A hedged retry on per try timeout is always retried if there are retries
  // left. NOTE: this is a bit different than non-hedged per try timeouts which
  // are only retried if the applicable retry policy specifies either
  // RETRY_ON_5XX or RETRY_ON_GATEWAY_ERROR. This is because these types of
  // retries are associated with a stream reset which is analogous to a gateway
  // error. When hedging on per try timeout is enabled, however, there is no
  // stream reset.
  return shouldRetry(true, callback);
}

(the comment wording there is also confusing!)

Contributor Author

Interesting. I was actually working off my experience using this feature: I only started seeing (hedged) retries once I set x-envoy-retry-on.

Contributor Author

@ikonst ikonst Sep 15, 2020

Looking at this code, it seems that retries_remaining_ would only get initialized to a non-zero value if retry_on_ is set:

if (retry_on_ != 0 && request_headers.EnvoyMaxRetries()) {
  uint64_t temp;
  if (absl::SimpleAtoi(request_headers.getEnvoyMaxRetriesValue(), &temp)) {
    // The max retries header takes precedence if set.
    retries_remaining_ = temp;
  }
}

Thus shouldRetry would end up returning RetryStatus::NoRetryLimitExceeded for the hedged requests.

I can imagine this wasn't intended, since it surprised me too, but on the other hand it's congruent with how non-hedged retrying behaves.
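
The reading above can be illustrated with a small stand-alone model. This is only a sketch, not Envoy's implementation: `RetryModel`, `applyMaxRetriesHeader` and the zero starting value of `retries_remaining` are assumptions made for illustration, and only the guard condition is taken from the snippet quoted above.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified stand-in for the retry state discussed above.
enum class RetryStatus { Yes, NoRetryLimitExceeded };

struct RetryModel {
  uint32_t retry_on = 0;           // bitmask of retry conditions; 0 means none configured
  uint64_t retries_remaining = 0;  // assumed to start at zero, per the reading above

  // Mirrors the quoted guard: the max-retries value is only honored when
  // at least one retry condition is set.
  void applyMaxRetriesHeader(std::optional<uint64_t> max_retries) {
    if (retry_on != 0 && max_retries.has_value()) {
      retries_remaining = *max_retries;
    }
  }

  // Models the hedged per-try-timeout path: always willing to retry, but
  // still bounded by the remaining retry budget.
  RetryStatus shouldHedgeRetryPerTryTimeout() {
    if (retries_remaining == 0) {
      return RetryStatus::NoRetryLimitExceeded;
    }
    --retries_remaining;
    return RetryStatus::Yes;
  }
};
```

Under these assumptions, a `RetryModel` whose `retry_on` is left at zero refuses the hedged retry even after `applyMaxRetriesHeader(3)`, while setting any bit in `retry_on` first lets the same call succeed and leaves two retries in the budget.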

// This will only occur if the :ref:`RetryPolicy <envoy_api_msg_route.RetryPolicy>` also indicates that
// timed out requests should be retried (e.g. retry_on set to 'gateway-error' etc). (Other retry policies
// would also apply, but would only have effect if the response came back before the request was hedged against;
// otherwise such responses would simply be discarded as a retry is already in flight.)
Contributor

I think this might be clearer stated as

"Any response received after the timeout and subsequent hedge attempt will never be retried, no matter the RetryPolicy"

Contributor Author

@ikonst ikonst Sep 15, 2020

Assuming timeout 150ms, per-try timeout of 50ms, 3 retries and retry-on: 5xx policy, and hedging enabled:

0ms: Request 1 sent.
50ms: Request 1 times out, (hedged) request 2 sent.
75ms: Request 2 (hedged) returns 500.
150ms: Request 1 times out.

Would there be a 3rd request?

(for simplicity assuming no exponential backoff in those timings)

Contributor

There would be a 3rd request. Request 2 is considered a new attempt, so it will be retried if it times out or returns a 500. If request 1 comes back with a 500 after request 2 has already been sent, that response will be dropped and not retried.
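
The rule described in this reply can be sketched as a tiny bookkeeping model. It is a hedged illustration only: `HedgingModel`, `Outcome` and the method names are hypothetical, hedging on per-try timeout is assumed to be enabled, every error passed in is assumed to match the retry policy, and each attempt's per-try timer is assumed to fire at most once.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the scenario above; none of these names are Envoy's.
enum class Outcome { NewAttemptSent, Dropped, RetryLimitExceeded };

class HedgingModel {
public:
  explicit HedgingModel(uint32_t max_retries)
      : retries_remaining_(max_retries), superseded_(1, false) {}

  // Per-try timeout with hedging enabled: the timed-out attempt stays in
  // flight and a new attempt is sent alongside it.
  Outcome onPerTryTimeout(std::size_t attempt) { return startNewAttempt(attempt); }

  // Retriable error response (for example a 500 under retry-on: 5xx).
  Outcome onRetriableError(std::size_t attempt) {
    if (superseded_[attempt]) {
      // A newer attempt was already sent on behalf of this one.
      return Outcome::Dropped;
    }
    return startNewAttempt(attempt);
  }

  std::size_t attemptsSent() const { return superseded_.size(); }
  uint32_t retriesRemaining() const { return retries_remaining_; }

private:
  Outcome startNewAttempt(std::size_t attempt) {
    if (retries_remaining_ == 0) {
      return Outcome::RetryLimitExceeded;
    }
    --retries_remaining_;
    superseded_[attempt] = true;   // this attempt has now been retried or hedged against
    superseded_.push_back(false);  // the new attempt starts out un-superseded
    return Outcome::NewAttemptSent;
  }

  uint32_t retries_remaining_;
  std::vector<bool> superseded_;  // one flag per attempt, indexed from 0
};
```

Replaying the timeline with `HedgingModel m(3)`: `m.onPerTryTimeout(0)` at 50ms sends request 2, `m.onRetriableError(1)` at 75ms sends request 3, and a late 500 for request 1 via `m.onRetriableError(0)` is dropped.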

Contributor Author

ahh, I read (Any response received after the timeout) AND (subsequent hedge attempt) -> will never be retried 🤦

// * After per-try timeout, an error response would be discarded, as a retry in the form of a hedged request is already in progress.
//
// Note: For this to have effect, the :ref:`RetryPolicy <envoy_api_msg_route.RetryPolicy>` must be one that retries on timeout
// (e.g. `gateway-error`).
Contributor

@mpuncel mpuncel Sep 23, 2020

I don't think this is true: you don't need gateway-error for it to be retried. A per-try timeout is always retried when hedging is enabled.

Contributor Author

@ikonst ikonst Sep 24, 2020

@mpuncel

I tested it and found it to be true. (In fact, I wasted quite some time trying to understand why hedging didn't work for me before I tried adding a simple x-envoy-retry-on to my calls.)

I actually wrote this above in reply to a similar question you raised:

Looking at this code, it seems that retries_remaining_ would only get initialized to a non-zero value if retry_on_ is set:

if (retry_on_ != 0 && request_headers.EnvoyMaxRetries()) {
  uint64_t temp;
  if (absl::SimpleAtoi(request_headers.getEnvoyMaxRetriesValue(), &temp)) {
    // The max retries header takes precedence if set.
    retries_remaining_ = temp;
  }
}

Thus shouldRetry would end up returning RetryStatus::NoRetryLimitExceeded for the hedged requests.

I can imagine this wasn't intended, since it surprised me too, but on the other hand it's congruent with how non-hedged retrying behaves.

Contributor

Interesting! I do think that isn't intentional. In that case I'm not really sure how to word the comment; maybe you could say "you must have a RetryPolicy that retries at least one error code and specify the max number of retries". Judging from that code snippet, though, you don't need gateway-error specifically.

Ilya Konstantinov and others added 6 commits September 24, 2020 22:48
Signed-off-by: Ilya Konstantinov <ilya.konstantinov@gmail.com>
Signed-off-by: Ilya Konstantinov <ilya.konstantinov@gmail.com>
Signed-off-by: Ilya Konstantinov <ilya.konstantinov@gmail.com>
Signed-off-by: Ilya Konstantinov <ilya.konstantinov@gmail.com>
Signed-off-by: Ilya Konstantinov <ilya.konstantinov@gmail.com>
Signed-off-by: Ilya Konstantinov <ilya.konstantinov@gmail.com>
@snowp
Contributor

snowp commented Sep 29, 2020

@ikonst I think this change looks good now. Could you a) merge master b) fix the formatting issue and c) apply the same change to the V3 docs (then run tools/proto_format.sh fix to propagate to v4alpha)?

Also friendly reminder that force pushing breaks the reviewing flow for many, so avoid it if you can.

Thanks!

@ikonst
Contributor Author

ikonst commented Sep 29, 2020

I'm not a fan of rebases either since they break the already-reviewed / new-commits separation. (Perhaps I did this one because I forgot some sign-offs mid-way?)

@snowp snowp added the waiting label Oct 8, 2020
Signed-off-by: Ilya Konstantinov <ilya.konstantinov@gmail.com>
Signed-off-by: Ilya Konstantinov <ilya.konstantinov@gmail.com>
Signed-off-by: Ilya Konstantinov <ilya.konstantinov@gmail.com>
@ikonst
Contributor Author

ikonst commented Nov 26, 2020

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Check envoy-presubmit didn't fail.

🐱

Caused by: a #12983 (comment) was created by @ikonst.

see: more, trace.

@ikonst
Contributor Author

ikonst commented Nov 26, 2020

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Check envoy-presubmit didn't fail.

🐱

Caused by: a #12983 (comment) was created by @ikonst.

see: more, trace.

Signed-off-by: Ilya Konstantinov <ilya.konstantinov@gmail.com>
@ikonst
Contributor Author

ikonst commented Nov 26, 2020

@snowp ^

Contributor

@snowp snowp left a comment

LGTM, thanks!

@envoyproxy/api-shepherds for API review and V2 sign off

@htuch
Member

htuch commented Nov 30, 2020

/lgtm v2-freeze
/lgtm api

@htuch htuch merged commit 035509e into envoyproxy:master Nov 30, 2020
@ikonst ikonst deleted the patch-1 branch November 30, 2020 17:00
mpuncel added a commit to mpuncel/envoy that referenced this pull request Dec 2, 2020
* master: (70 commits)
  upstream: avoid reset after end_stream in TCP HTTP upstream (envoyproxy#14106)
  bazelci: add fuzz coverage (envoyproxy#14179)
  dependencies: allowlist CVE-2020-8277 to prevent false positives. (envoyproxy#14228)
  cleanup: replace ad-hoc [0, 1] value types with UnitFloat (envoyproxy#14081)
  Update docs for skywalking tracer (envoyproxy#14210)
  Fix some errors in the switch statement when decode dubbo response (envoyproxy#14207)
  Windows: enable tests and envoy-static.exe pdb file (envoyproxy#13688)
  http: add Kill Request HTTP filter (envoyproxy#14170)
  dependencies: fix release_dates error behavior. (envoyproxy#14216)
  thrift filter: support skip decoding data after metadata in the thrift message (envoyproxy#13592)
  update cares (envoyproxy#14213)
  docs: clarify behavior of hedge_on_per_try_timeout (envoyproxy#12983)
  repokitteh: add support for randomized auto-assign. (envoyproxy#14185)
  [grpc] validate grpc config for illegal characters (envoyproxy#14129)
  server: Return nullopt when process_context is nullptr (envoyproxy#14181)
  [Windows] Fix thrift proxy tests (envoyproxy#13220)
  kafka: add missing unit tests (envoyproxy#14195)
  doc: mention gperftools explicitly in PPROF.md (envoyproxy#14199)
  Removed `--use-fake-symbol-table` option. (envoyproxy#14178)
  filter contract: clarification around local replies (envoyproxy#14193)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
4 participants