RFC: Consistent timeout handling in Collector pipelines #11948

Open

jmacd wants to merge 9 commits into open-telemetry:main from jmacd:jmacd/consistent_timeout

Contributor

jmacd commented Dec 18, 2024

Description

Calls for deadline-awareness across common Collector pipeline components, including batch processors and the queue, retry, and timeout senders.

Link to tracking issue

Part of #11183

Testing

n/a

Documentation

This is a new RFC. As these changes are accepted and implemented, user-facing documentation will be added.

jmacd added 3 commits

December 16, 2024 16:38


          New file draft

e70217d


          Draft proposals

c8f7d6f


          Add detail.

b225283

codecov bot commented Dec 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.62%. Comparing base (4593ba7) to head (46f1e79).
Report is 48 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #11948      +/-   ##
==========================================
+ Coverage   91.59%   91.62%   +0.03%     
==========================================
  Files         449      447       -2     
  Lines       23761    23731      -30     
==========================================
- Hits        21763    21743      -20     
+ Misses       1623     1613      -10     
  Partials      375      375


jmacd added 6 commits

December 18, 2024 13:59


          Revisions.

dc8a6d2


          Lint

065cff5


          Shorten

d806842


          Chlog

0725f32


          Typo

ed69aaf


          Typos

46f1e79

Contributor Author

jmacd commented Dec 19, 2024

I will present this RFC at the next two Collector SIG meetings, on 1/7/2025 (APAC/PT) and 1/15/2025 (NA).

jmacd marked this pull request as ready for review

December 19, 2024 19:07

jmacd requested a review from a team as a code owner

December 19, 2024 19:07

jmacd requested a review from evan-bradley

December 19, 2024 19:07

bogdandrutu reviewed


docs/rfcs/consistent-timeout-handling.md

Comment on lines +32 to +36

+              - A request arrives with an already-expired deadline. Should the component
+                immediately return a deadline-exceeded error status?
+              - A request arrives with a viable deadline, but the request does not
+                succeed in time. Should the component immediately return a deadline-exceeded
+                status, or should it wait for its response?

Member

bogdandrutu Jan 2, 2025

It would also be good to extend this into a requirement and have a "virtual" component after receivers and probably before exporters (not sure it is useful for every processor as well) that checks the deadline and returns immediately if it has expired.

Contributor Author

jmacd Jan 8, 2025

I agree, an implementation inside the pipeline apparatus will extend consistent timeout behavior to all components.

docs/rfcs/consistent-timeout-handling.md

Comment on lines +37 to +39

+              - A request arrives and has to acquire a resource (e.g., space in a queue) that
+                is not immediately available. Should the component fail "fast" or stall the
+                request, hoping the resource will become available before the deadline?

Member

bogdandrutu Jan 2, 2025

As mentioned before, there are use-cases for both, so we should make sure we support both models.

docs/rfcs/consistent-timeout-handling.md

Comment on lines +40 to +46

+              - A request arrives and the component calls for an additive timeout that is
+                greater than the request's deadline.  For example:
+                - a batch processor is configured with `1s` timeout, and an arriving
+                  request has a `0.5s` timeout
+                - timeout sender has `5s` configured timeout, arriving request has `2s` timeout
+                - retry sender has a maximum elapsed timeout of `1m`, arriving request has `5s`
+                  timeout.

Member

bogdandrutu Jan 2, 2025

I'm not sure I understand what the "expected" behavior is here, especially the derived requirement related to the timeout sender.

Also, keep in mind that users may use a persistent queue, where the timeout probably behaves differently.

Contributor Author

jmacd Jan 8, 2025

I'm also not sure what is expected, and I wonder if users have considered / wanted the ability to ignore a timeout. The use-case that I've seen is an OTLP endpoint accepting gRPC requests from a gateway collector. The gateway collector uses a 5s default timeout, and the service wants to permit its own backend 15s. The receiver cannot abide a 5s timeout; is there any way to simply disregard it?

For the batcher case, it's different: the batcher could bypass itself if the timeout is less than the interval.

For the persistent queue case, I believe we should interpret timeout as a request deadline, not as some kind of time-to-live parameter. That means, the timeout applies to the enqueue operation, and I expect timeout to happen if the disk gets slow.

docs/rfcs/consistent-timeout-handling.md

+              - Batch processor/sender propagates maximum deadline in batch
+              - If enabled, queue sender blocks until queue space is available
+              - Timeout sender configured not to lower an already-configured timeout.

Member

bogdandrutu Jan 2, 2025

This is the canonical Go behavior; why not do this?

Contributor Author

jmacd Jan 8, 2025

With this behavior, there's no way for the user to request a longer timeout. There's no way for the exporter to allow the user-configured timeout to pass w/o imposing a timeout of its own. I would like to see behavior like "honor the user's timeout up to 1 minute, use 10 seconds for requests w/o a timeout".


docs/rfcs/consistent-timeout-handling.md

Comment on lines +211 to +217

+                  select {
+                  case <-ctx.Done():
+                      // the context is canceled, maybe deadline-exceeded.
+                      return ctx.Err()
+                  default:
+                      // OK to continue
+                  }

Member

bogdandrutu Jan 2, 2025

I think it is nicer to check Err. See

// If Done is not yet closed, Err returns nil.
// If Done is closed, Err returns a non-nil error explaining why:
// Canceled if the context was canceled
// or DeadlineExceeded if the context's deadline passed.
// After Err returns a non-nil error, successive calls to Err return the same error.
Err() error

Contributor Author

jmacd Jan 8, 2025

Isn't that line 214?

Member

bogdandrutu Jan 10, 2025

Just do if err = ctx.Err(); err != nil { return err }

docs/rfcs/consistent-timeout-handling.md

Comment on lines +279 to +296

+              ### Queue sender
+              A new field will be introduced, with default matching the
+              original behavior of this component.
+              ```golang
+                // FailFast indicates that the queue should immediately
+                // reject requests when the queue is full, without considering
+                // the request deadline. Default: true.
+                FailFast bool `mapstructure:"fail_fast"`
+              ```
+              In case the new FailFast flag is false, there are two cases:
+              1. The request has a deadline. In this case, wait until the deadline
+                 to enqueue the request.
+              2. The request has no deadline. In this case, let the request block
+                 indefinitely until it can be enqueued.

Member

bogdandrutu Jan 2, 2025

Nice

docs/rfcs/consistent-timeout-handling.md


		## Specific proposals

		### Timeout sender

Member

bogdandrutu Jan 2, 2025

I still struggle with this and with what you are trying to achieve. Have you seen a real problem with this?

docs/rfcs/consistent-timeout-handling.md

+              to use deadlines always and/or use separate pipelines for requests
+              with and without deadlines.
+              ### Receiver helper

Member

bogdandrutu Jan 2, 2025

There are some behaviors that we can check automatically for all receivers, such as whether the deadline has expired, by adding virtual components to the processing graph while initializing the service.

Some components that require configuration do need a helper/user code change.

docs/rfcs/consistent-timeout-handling.md

Comment on lines +327 to +328

		- `min_timeout` (duration): Limits the allowable timeout for new requests to a minimum value. >=0 means deadline checking.
		- `timeout` (duration): Limits the allowable timeout for new requests to a maximum value. Must be >= 0.

Member

bogdandrutu Jan 2, 2025

Why not have a processor for this?
