Implement timeout mechanism for several metrics components #2402
Just for the record: https://pypi.org/project/timeout-decorator/ |
Also #385 |
Based on the linked issues, the previous decision was not to use the signal-based approach for implementing a generic timeout mechanism, because it won't work outside of the main thread and it can interfere with applications that already use the same signal handler. For some background,
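For context, here is a minimal sketch of what such a signal-based timeout looks like (similar in spirit to the timeout-decorator package linked above; the helper name is made up for illustration). It only works from the main thread, and it replaces whatever SIGALRM handler the application had installed, which is exactly the limitation described here:

```python
import signal
from functools import wraps

def with_timeout(seconds: int):
    """Illustrative signal-based timeout decorator (Unix, main thread only)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            def _on_alarm(signum, frame):
                raise TimeoutError(f"{func.__name__} timed out after {seconds}s")

            # signal.signal() raises ValueError if called outside the main thread,
            # and this temporarily overrides any existing SIGALRM handler.
            previous = signal.signal(signal.SIGALRM, _on_alarm)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # cancel the pending alarm
                signal.signal(signal.SIGALRM, previous)
        return wrapper
    return decorator
```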
Based on that, I think it's reasonable to just pass down a timeout duration where possible and document that implementers should respect the timeout. This doesn't actually implement "SHOULD complete or abort within some timeout", but it is not terribly opinionated. It would be great if the mechanism we add for metrics could also be used in the tracing SDK, where timeouts were never implemented previously. The problem is that adding an optional timeout parameter to the SpanExporter, SpanProcessor, etc. interface signatures will break existing implementations of those interfaces when we go to call them. We could potentially call defensively with the timeout param and fall back to not passing it, e.g. here:

```python
try:
    self._active_span_processor.shutdown(timeout=timeout)
except TypeError:
    self._active_span_processor.shutdown()
```
|
I think potentially we should try to add |
Looks like I lied about JS. They accept some timeout parameters as config options and apply them: https://github.com/open-telemetry/opentelemetry-js/blob/cfda625e83a164c9e19745392879c32aedcfe76f/packages/opentelemetry-sdk-trace-base/src/BasicTracerProvider.ts#L153 The obvious downside with this approach is you can't change the timeout at different call sites. I think it still satisfies the spec though. And we could add it without the try/except stuff. |
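For comparison, a rough Python analogue of that JS config-option approach (the class and parameter names are made up for illustration, and it assumes the processor's shutdown accepts a timeout keyword): the timeout is fixed once at construction time and applied internally, so individual call sites cannot override it.

```python
class ConfiguredTimeoutTracerProvider:
    """Illustrative only: timeout supplied as a config option, like the JS SDK."""

    def __init__(self, span_processor, shutdown_timeout_millis: float = 30_000):
        self._span_processor = span_processor
        self._shutdown_timeout_millis = shutdown_timeout_millis

    def shutdown(self) -> None:
        # The provider applies the configured timeout itself; callers just call
        # shutdown() and cannot pick a different timeout per call site.
        self._span_processor.shutdown(timeout_millis=self._shutdown_timeout_millis)
```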
Ok... new issue: the OTLP exporter accepts a timeout parameter of its own. The really weird thing is, this is considered a timeout for just the RPCs to the OTLP endpoint, and we use an exponential backoff that will delay things way past the default of 10s, all the way to 63 seconds plus whatever delay we had from the RPCs. And on top of that, the PeriodicExportingMetricReader and BatchSpanProcessor have their own timeout parameters (see opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/trace/export/__init__.py, line 143 at 29e4bab).
I'm not even sure what the correct behavior is at this point |
So my understanding is that there are several |
I put up a draft PR to add some of the metrics timeouts: #2653
For tracing, the only implemented timeout is force flush, which is indeed implemented at several levels:
The BatchSpanProcessor is completely ignoring the |
So a couple of things to do here then:
|
Not just OTLP but other exporters (Jaeger, Zipkin) have their own timeouts that can be configured. The 63-second delay in OTLP is an outlier because we only do that when we encounter transient network issues. The number used to be much higher (15 min) but we bumped it down to 63 seconds after some deliberation (the spec doesn't really say anything as of today). My understanding is that |
I think that's true if the exporter performs that stuff asynchronously after returning from the export() call. Correct me if I'm wrong, but I think our OTLP exporter blocks in this case, which also blocks the BSP worker thread. The spec (and our implementations) will call export() one at a time. So the worst-case result is that the OTLP exporter will keep retrying with exponential backoff
for a total of ~213 seconds, all the while blocking the BSP's worker thread, meaning it also won't respond to force flush events (on shutdown). I suppose this is a separate issue from this one, but the situation is currently very broken. |
@lzchen that's right, thanks for outlining it. I guess my proposals are:

1. Pass the timeout duration down through the call stack, e.g.
   Pros:
   Cons:
2. Handle the timeout at the highest level, e.g.
   Pros:
   Cons:

My PR is implementing the first approach |
+1 for first approach. |
OTLP exporter as of today
In an ideal scenario the BSP has to cancel the export call after 30s, irrespective of the exporter's handling mechanism. But it doesn't do that with the 30s export millis timeout passed via env/arg. Question for the first approach: I may be misunderstanding something, but there are two different timeouts to be considered. 1. |
I guess what I'm proposing is that all of these blocking calls respect the timeout they receive as the maximum for the entire blocking call, e.g. This is basically how Go contexts work out of the box. When you create a context with a timeout, it calculates the deadline as now + timeout. Our |
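To make that concrete, here is a minimal sketch of the deadline-style propagation being proposed (the function and parameter names are illustrative, not the actual SDK signatures): the top-level call computes a deadline once, and each nested blocking call receives only the time that remains.

```python
import time

def force_flush_all(processors, timeout_millis: float = 30_000) -> bool:
    """Illustrative only: propagate one overall deadline through nested calls."""
    deadline_ns = time.time_ns() + int(timeout_millis * 1_000_000)
    for processor in processors:
        remaining_millis = (deadline_ns - time.time_ns()) / 1_000_000
        if remaining_millis <= 0:
            return False  # the overall deadline has already been exceeded
        # Each processor is expected (but not forced) to respect the timeout.
        if not processor.force_flush(timeout_millis=remaining_millis):
            return False
    return True
```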
The other problem with approach 2 is that if the thread running the blocking call doesn't complete within the timeout, it will just be leaked and there is no real way to cancel it. We could use multiprocessing instead and actually terminate the background process, but then the exporters/processors and call arguments must be picklable. With approach 1, an exporter ignoring the timeout will block indefinitely, but a well-behaved exporter/processor will actually be able to degrade more gracefully. The exception to this is if we ever support asyncio, where awaitables can easily be cancelled/timed out from the parent call and the child may respond to cancellation gracefully. We could do both option 1 and 2 for asyncio. |
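For illustration, this is roughly what approach 2 would look like with threads, and why the worker gets leaked on timeout (the helper is hypothetical, not anything in the SDK):

```python
import threading

def call_with_timeout(func, timeout_sec: float, *args, **kwargs):
    """Illustrative only: run a blocking call in a helper thread and stop waiting
    after timeout_sec. The helper thread cannot be cancelled, so on timeout it
    keeps running in the background, i.e. it is leaked."""
    result = {}

    def target():
        result["value"] = func(*args, **kwargs)

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_sec)
    if worker.is_alive():
        # We give up waiting, but `worker` is still blocked inside func().
        raise TimeoutError(f"call did not complete within {timeout_sec}s")
    return result.get("value")
```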
Makes sense, +1 for the first approach. |
All right +1 for the first approach as well 👍 |
Cool, the PR is ready for review for everything except async callbacks. I will do that in a separate PR, I think. Async callbacks also need to be written with a forward-compatible signature. Maybe a config dataclass would make more sense for async callbacks?

```python
def callback(config: CallbackConfig) -> Iterable[Observation]:
    ...
```

I feel like users are more likely to forget to add |
|
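As an illustration of that idea (CallbackConfig and its fields are hypothetical here, just showing the shape of the proposal): the SDK passes a single config object whose fields, such as a timeout, can grow over time without breaking existing callback signatures.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class CallbackConfig:
    # Hypothetical field: how long the callback is allowed to run.
    timeout_millis: float = 10_000

def cpu_usage_callback(config: CallbackConfig) -> Iterable["Observation"]:
    # A well-behaved callback would try to finish within config.timeout_millis.
    return []
```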
Yes, trying to prevent them from making mistakes and messing up the signature. I'm implementing this in #2664 |
collect doesn't return anything, so returning True until we implement a timeout/error handling mechanism seemed reasonable to me. What do you think?
Originally posted by @lonewolf3739 in #2401 (comment)
Also, the timeout mechanism is required in several parts of the metrics spec: