
Implement timeout mechanism for several metrics components #2402

Closed
ocelotl opened this issue Jan 24, 2022 · 21 comments · Fixed by #2653
Labels: 1.10.0rc1 (release candidate 1 for metrics GA), metrics, sdk (Affects the SDK package.)

Comments

@ocelotl
Contributor

ocelotl commented Jan 24, 2022

collect doesn't return anything, so returning True until we implement timeout/error handling mechanism seemed reasonable to me. What do you think?

Originally posted by @lonewolf3739 in #2401 (comment)

Also, the timeout mechanism is required in several parts of the metrics spec:

  1. MeterProvider.shutdown
  2. MeterProvider.force_flush
  3. Asynchronous callbacks
  4. MetricReader.collect
  5. MetricReader.shutdown
  6. PushMetricExporter.force_flush
  7. PushMetricExporter.shutdown
@ocelotl
Contributor Author

ocelotl commented Mar 24, 2022

Just for the record: https://pypi.org/project/timeout-decorator/

@ocelotl
Contributor Author

ocelotl commented Mar 24, 2022

Also #385

@ocelotl added the sdk (Affects the SDK package.) label on Mar 24, 2022
@ocelotl changed the title from "Implement timeout mechanism for collect" to "Implement timeout mechanism for several metrics components" on Mar 24, 2022
@ocelotl added the 1.10.0rc1 (release candidate 1 for metrics GA) label on Apr 12, 2022
@aabmass
Member

aabmass commented Apr 26, 2022

Based on the linked issues, the previous decision was not to use the signal-based approach for implementing a generic timeout mechanism, because it won't work outside of the main thread and it can interfere with applications that already use the same signal handler.

For some background,

  • The Java methods return a future-like object which lets the user join() with a certain timeout
  • JS doesn't appear to do anything special, just returns a Promise or accepts a callback. There is no way to cancel
  • Go simply passes the ctx and tells the implementor to respect the cancellation/timeout
  • .NET accepts timeout for shutdown and force flush which the implementor should respect

Based on that, I think it's reasonable to just pass down a timeout duration where possible and document that implementors should respect the timeout. This doesn't actually implement "SHOULD complete or abort within some timeout" but is not terribly opinionated.

It would be great if the mechanism we add for metrics can also be used in the tracing SDK, where timeouts were never implemented previously. The problem is that adding an optional timeout parameter to the SpanExporter, SpanProcessor etc. interface signatures will break existing implementations of those interfaces when we go to call them. We could potentially call defensively with the timeout param and fall back to not passing the timeout, e.g. here

try:
  self._active_span_processor.shutdown(timeout=timeout)
except TypeError:
  self._active_span_processor.shutdown()

@aabmass
Member

aabmass commented Apr 26, 2022

I think potentially we should try to add *args, **kwargs or config objects to any user-implementable interfaces going forward to avoid this kind of issue.
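
A rough sketch of what that forward-compatible shape could look like (SpanExporterLike and MyExporter are made-up names here, not the real SDK classes):

from abc import ABC, abstractmethod


class SpanExporterLike(ABC):  # hypothetical stand-in for an SDK interface
    @abstractmethod
    def shutdown(self, **kwargs) -> None:
        """Accepting **kwargs lets callers start passing e.g. timeout_millis
        later without breaking existing implementations."""


class MyExporter(SpanExporterLike):
    def shutdown(self, **kwargs) -> None:
        # Read the new kwarg if present, fall back to a default otherwise.
        timeout_millis = kwargs.get("timeout_millis", 30_000)
        # ... flush and close resources within timeout_millis ...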

@aabmass
Member

aabmass commented Apr 28, 2022

Looks like I lied about JS. They accept some timeout parameters as config options and apply them: https://github.com/open-telemetry/opentelemetry-js/blob/cfda625e83a164c9e19745392879c32aedcfe76f/packages/opentelemetry-sdk-trace-base/src/BasicTracerProvider.ts#L153

The obvious downside with this approach is you can't change the timeout at different call sites. I think it still satisfies the spec though. And we could add it without the try/except stuff.

@aabmass
Member

aabmass commented Apr 29, 2022

Ok.. new issue is that the OTLP exporter accepts a timeout parameter on its own. The really weird thing is that this is treated as a timeout just for individual RPCs to the OTLP endpoint, and we use an exponential backoff that will delay things way past the default of 10s, all the way to 63 seconds plus whatever delay we had from the RPCs.

And on top of that, PeriodicExportingMetricReader and BatchSpanProcessor have an export_timeout_millis of 30s (way shorter than the max backoff), but neither of those classes passes the timeout to the exporter.

I'm not even sure what the correct behavior is at this point.

@lzchen
Contributor

lzchen commented May 2, 2022

So my understanding is that there are several timeouts implemented per component, each of which could contribute to (or be) the actual timeout users experience in the metric pipeline, and we are not sure which one(s), if any, to use?

@aabmass
Member

aabmass commented May 2, 2022

I put up a draft PR to add some of the metrics timeouts: #2653

So my understanding is that there are several timeouts implemented per component

For tracing the only implemented timeout is force flush, which is indeed implemented at several levels:

  • SynchronousMultiSpanProcessor passes a timeout to each sub-processor and checks the deadline for the whole thing:

    deadline_ns = _time_ns() + timeout_millis * 1000000
    for sp in self._span_processors:
        current_time_ns = _time_ns()
        if current_time_ns >= deadline_ns:
            return False
        if not sp.force_flush((deadline_ns - current_time_ns) // 1000000):
            return False

  • ConcurrentMultiSpanProcessor does similar, but runs the sub-processors in a thread pool executor:

    for sp in self._span_processors:  # type: SpanProcessor
        future = self._executor.submit(sp.force_flush, timeout_millis)
        futures.append(future)

    timeout_sec = timeout_millis / 1e3
    done_futures, not_done_futures = concurrent.futures.wait(
        futures, timeout_sec
    )

    Note this can be redundant when used with SpanProcessors that already respect the timeout, like BatchSpanProcessor

  • BatchSpanProcessor sets the timeout while waiting for the background thread, which should respect the timeout:

    ret = flush_request.event.wait(timeout_millis / 1e3)


The BatchSpanProcessor completely ignores the export_timeout_millis constructor param, except that it uses it as a default for the force flush timeout.

@lzchen
Contributor

lzchen commented May 2, 2022

So a couple of things to do here then:

  1. Implement metrics similarly to tracing in terms of supporting export_timeout for export in exporters and for flush in metric readers (do we even support flush for this).
  2. Implement shutdown_timeout for shutdown in metric readers and exporters.
  3. Implement shutdown_timeout for shutdown in tracing exporters.
  4. Implement export_timeout for export in trace exporters.

@srikanthccv
Member

Ok.. new issue is that the OTLP exporter accepts a timeout parameter on its own. The really weird thing is that this is treated as a timeout just for individual RPCs to the OTLP endpoint, and we use an exponential backoff that will delay things way past the default of 10s, all the way to 63 seconds plus whatever delay we had from the RPCs.

And on top of that, PeriodicExportingMetricReader and BatchSpanProcessor have an export_timeout_millis of 30s (way shorter than the max backoff), but neither of those classes passes the timeout to the exporter.

I'm not even sure what the correct behavior is at this point.

Not just OTLP; other exporters (Jaeger, Zipkin) have their own timeout that can be configured. The 63-second delay in OTLP is an outlier because we only do that when we encounter transient network issues. The number used to be much higher (15 min), but we bumped it down to 63 seconds after some deliberation (the spec doesn't really say anything as of today). My understanding is that export_timeout_millis is for the export call from the processor, but the exporter itself could do different things. For example, an exporter X divides the batch into 4 different chunks and exports them. So here the export timeout on the exporter is for each such call, but the processor has a maximum timeout of 30s for the export call regardless of what the exporter does internally.

@aabmass
Member

aabmass commented May 3, 2022

For example, an exporter X divides the batch into 4 different chunks and exports them. So here the export timeout on the exporter is for each such call, but the processor has a maximum timeout of 30s for the export call regardless of what the exporter does internally.

I think that's true if the exporter performs that stuff asynchronously after returning from the export() call. Correct me if I'm wrong, but I think our OTLP exporter blocks in this case, which also blocks the BSP worker thread. The spec (and our implementations) will call export() once at a time. So the worst-case result is that the OTLP exporter will

  • try to export, with a 30 second timeout, fail and wait 2 seconds
  • try to export, with a 30 second timeout, fail and wait 4 seconds
  • try to export, with a 30 second timeout, fail and wait 8 seconds
  • try to export, with a 30 second timeout, fail and wait 16 seconds
  • try to export, with a 30 second timeout, fail and wait 32 seconds

for a total of ~213 seconds. All the while blocking the BSP's worker thread, meaning it also won't respond to force flush events (on shutdown). I suppose this is a separate issue from this one, but the situation is currently very broken.

@aabmass
Member

aabmass commented May 3, 2022

@lzchen that's right, thanks for outlining it. I guess my proposals are:

1. pass timeout duration down through the call stack

e.g. MeterProvider.shutdown(timeout_millis) -> PeriodicExportingMetricReader.collect(timeout_millis) -> PeriodicExportingMetricReader -> Exporter.export(timeout_millis). This allows the timeout to be implemented only at the lowest level.

Pros:

  • explicit
  • we make no opinion on how to implement the timeout
    • most RPC/HTTP frameworks that exporters use already accept and handle timeouts.
  • more compatible with different asynchronous frameworks going forward

Cons:

  • have to pass timeout arg through the call chain
  • can be ignored by misbehaving exporters
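
A rough sketch of approach 1 (all names here are illustrative, not the real SDK signatures): each layer just forwards the timeout, and only the exporter at the bottom actually enforces it.

class ExporterSketch:
    def export(self, metrics, timeout_millis: float = 10_000) -> bool:
        # A real exporter would hand timeout_millis to its RPC/HTTP client,
        # which already knows how to abort a call after a deadline.
        print(f"exporting {len(metrics)} data points within {timeout_millis} ms")
        return True


class ReaderSketch:
    def __init__(self, exporter: ExporterSketch):
        self._exporter = exporter

    def shutdown(self, timeout_millis: float = 30_000) -> bool:
        # Forward the caller's timeout down to the exporter.
        return self._exporter.export([], timeout_millis=timeout_millis)


class ProviderSketch:
    def __init__(self, reader: ReaderSketch):
        self._reader = reader

    def shutdown(self, timeout_millis: float = 30_000) -> bool:
        return self._reader.shutdown(timeout_millis=timeout_millis)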

2. handle timeout at the highest level

e.g. MeterProvider.shutdown() would start a background thread (or submit to a thread pool) and wait up to timeout_millis for the thread to finish.

Pros:

  • simple
  • one size fits all
  • timeout is configurable

Cons:

  • paying for an extra thread/thread pool which could probably be avoided.
    • the SDK is already creating quite a few threads so this cost is growing.
  • not very compatible with other asynchronous frameworks like asyncio
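
A rough sketch of approach 2, using a hypothetical helper (shutdown_with_timeout is not an SDK function): run the blocking call in a worker thread and stop waiting after the timeout. If the call never finishes, the thread keeps running, which is one of the cons above.

import concurrent.futures


def shutdown_with_timeout(shutdown_fn, timeout_millis: float = 30_000) -> bool:
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(shutdown_fn)
    try:
        future.result(timeout=timeout_millis / 1e3)
        return True
    except concurrent.futures.TimeoutError:
        # Gave up waiting; the worker thread is effectively leaked and
        # cannot be cancelled.
        return False
    finally:
        executor.shutdown(wait=False)  # don't block on a stuck worker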

My PR is implementing the first approach

@lzchen
Contributor

lzchen commented May 3, 2022

+1 for first approach.

@srikanthccv
Member

For example, an exporter X divides the batch into 4 different chunks and exports them. So here the export timeout on the exporter is for each such call, but the processor has a maximum timeout of 30s for the export call regardless of what the exporter does internally.

I think that's true if the exporter performs that stuff asynchronously after returning from the export() call. Correct me if I'm wrong, but I think our OTLP exporter blocks in this case, which also blocks the BSP worker thread. The spec (and our implementations) will call export() once at a time. So the worst-case result is that the OTLP exporter will

  • try to export, with a 30 second timeout, fail and wait 2 seconds
  • try to export, with a 30 second timeout, fail and wait 4 seconds
  • try to export, with a 30 second timeout, fail and wait 8 seconds
  • try to export, with a 30 second timeout, fail and wait 16 seconds
  • try to export, with a 30 second timeout, fail and wait 32 seconds

for a total of ~213 seconds. All the while blocking the BSP's worker thread, meaning it also won't respond to force flush events (on shutdown). I suppose this is a separate issue from this one, but the situation is currently very broken.

The OTLP exporter as of today (note: the per-attempt timeout is 10 seconds, not 30):

  • try to export, with a 10 second timeout, fail and wait 2 seconds
  • try to export, with a 10 second timeout, fail and wait 4 seconds
  • try to export, with a 10 second timeout, fail and wait 8 seconds
  • try to export, with a 10 second timeout, fail and wait 16 seconds
  • try to export, with a 10 second timeout, fail and wait 32 seconds
  • try to export, with a 10 second timeout, fail and wait 64 seconds and finally give up

In an ideal scenario the BSP should cancel the export call after 30s irrespective of the exporter's handling mechanism. But it doesn't do that with the 30s export millis timeout passed via env/arg.

Question about the first approach - I may be misunderstanding something, but there are two different timeouts to consider: 1. OTEL_METRIC_EXPORT_TIMEOUT, which I think is for the entire export call (30s). 2. OTEL_EXPORTER_OTLP_TIMEOUT (10s), which is for the RPC call (an exporter may make multiple RPC calls). So passing timeout_millis down the call stack to the exporter is not really the expected behaviour?

@aabmass
Member

aabmass commented May 4, 2022

So passing timeout_millis down the call stack to the exporter is not really the expected behaviour?

I guess what I'm proposing is that all of these blocking calls respect the timeout they receive as the maximum for the entire blocking call, e.g. Exporter.export(..., timeout) should never block longer than the timeout passed in. If the exporter is doing queueing or backoffs, it can implement that as it pleases, respecting the overall deadline for the export. That would work for the scenario where OTEL_METRIC_EXPORT_TIMEOUT < OTEL_EXPORTER_OTLP_TIMEOUT, or maybe you're sending cumulatives every 5 seconds and you'd rather drop a slow request and send the next collection interval's data instead of doing exponential backoff.

This is basically how Go contexts work out of the box. When you create a context with a timeout, it calculates the deadline as now + timeout. If you make a child context with a different timeout (e.g. for a single request in a set of backoffs), it will respect the sooner deadline between parent and child contexts. That is the most intuitive behavior to me.

Our SynchronousMultiSpanProcessor.force_flush() already behaves this way for example.
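
A sketch of that deadline-style behavior (export_with_backoff and send_once are hypothetical stand-ins, not the actual OTLP exporter code): the overall timeout becomes a deadline, and each attempt plus its backoff sleep is capped by whatever budget remains.

import time


def export_with_backoff(send_once, timeout_millis: float, rpc_timeout_millis: float = 10_000) -> bool:
    # send_once(timeout_millis) -> bool stands in for a single RPC attempt.
    deadline = time.monotonic() + timeout_millis / 1e3
    backoff_s = 2.0
    while True:
        remaining_s = deadline - time.monotonic()
        if remaining_s <= 0:
            return False  # overall deadline exhausted, give up
        # Each attempt gets the sooner of its own timeout and the remaining budget.
        if send_once(min(rpc_timeout_millis, remaining_s * 1e3)):
            return True
        # Back off, but never sleep past the overall deadline.
        time.sleep(min(backoff_s, max(deadline - time.monotonic(), 0.0)))
        backoff_s *= 2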

@aabmass
Member

aabmass commented May 4, 2022

The other problem with approach 2 is that if the thread running the blocking call doesn't complete within the timeout, it will just be leaked and there is no real way to cancel it. We could use multiprocessing instead and actually terminate the background process, but then the exporters/processors and call arguments must be picklable. With approach 1, an exporter ignoring the timeout will block indefinitely, but a well-behaved exporter/processor will actually be able to degrade more gracefully.

The exception to this is if we ever support asyncio, where awaitables can easily be cancelled/timed out from the parent call and the child may respond to cancellation gracefully. We could do both options 1 and 2 for asyncio.
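
For reference, a minimal sketch of how that could look with asyncio (purely hypothetical; the SDK does not expose async shutdown today):

import asyncio


async def async_shutdown_with_timeout(shutdown_coro, timeout_millis: float = 30_000) -> bool:
    try:
        # wait_for cancels the awaitable if the timeout elapses, and the child
        # can catch CancelledError to clean up gracefully.
        await asyncio.wait_for(shutdown_coro, timeout=timeout_millis / 1e3)
        return True
    except asyncio.TimeoutError:
        return False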

@srikanthccv
Member

Makes sense, +1 for the first approach.

@ocelotl
Contributor Author

ocelotl commented May 4, 2022

All right +1 for the first approach as well 👍

@aabmass
Member

aabmass commented May 5, 2022

Cool, PR is ready for review for everything except async callbacks; I will do that in a separate PR, I think. Async callbacks also need to be written with a forward-compatible signature. Maybe a config dataclass would make more sense for async callbacks?

def callback(config: CallbackConfig) -> Iterable[Observation]:
  ...
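
For illustration, CallbackConfig could be a small frozen dataclass along these lines (a hypothetical shape; the actual fields would be decided in the PR):

from dataclasses import dataclass


@dataclass(frozen=True)
class CallbackConfig:
    # Hypothetical field: how long the callback is allowed to run.
    timeout_millis: float = 10_000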

I feel like users are more likely to forget to add **kwargs for callbacks, and this way is easier for them to grok. We really don't want to break async instrument callbacks for users since this is a common use case.

@lzchen
Contributor

lzchen commented May 6, 2022

I feel like users are more likely to forget to add **kwargs for callbacks, and this way is easier for them to grok. We really don't want to break async instrument callbacks for users since this is a common use case.

  1. What does grok mean?
  2. Do you mean that if we recommended users to use **kwargs they would misuse it? Are we trying to make sure mistakes won't be made by forcing them to pass in a dataclass?

@aabmass
Member

aabmass commented May 6, 2022

2. Do you mean that if we recommended users to use **kwargs they would misuse it? Are we trying to make sure mistakes won't be made by forcing them to pass in a dataclass?

Yes, trying to prevent them from making mistakes and messing up the signature. I'm implementing this in #2664
