[Bug]: Shutdown doesn't flush when used with global subscriber #1961

Open
stepantubanov opened this issue Jul 24, 2024 · 5 comments

@stepantubanov commented Jul 24, 2024

What happened?

As mentioned in #1625, Tracer now holds a strong reference to TracerProvider.

When opentelemetry is used as a layer with the global tracing subscriber, it is now impossible to shut down properly (global shutdown only decrements a reference count, but doesn't run Drop).

let layer = tracing_opentelemetry::layer()
  .with_tracer(tracer_provider.tracer("app"));

opentelemetry::global::set_tracer_provider(tracer_provider);
tracing::subscriber::set_global_default(Registry::default().with(layer));

// shutdown call does not actually shut down the global tracer provider
opentelemetry::global::shutdown_tracer_provider();

As a result, some spans are missing, flattened, etc.

EDIT: A possible workaround is to flush manually (this requires keeping a handle to the provider, e.g. a clone made before it is passed to set_tracer_provider):

tracer_provider.force_flush();

API Version

0.24.0

SDK Version

0.24.1

What Exporter(s) are you seeing the problem on?

OTLP

Relevant log output

No response

@stepantubanov added the bug and triage:todo labels Jul 24, 2024
@stepantubanov changed the title from "[Bug]: No way to shutdown properly when used with global subscriber" to "[Bug]: Shutdown doesn't flush when used with global subscriber" Jul 24, 2024
@lalitb (Member) commented Jul 24, 2024

When opentelemetry is used as a layer with the global tracing subscriber, it is now impossible to shut down properly (global shutdown only decrements a reference count, but doesn't run Drop).

That's expected behavior. global::shutdown_tracer_provider() should only decrement the reference it is holding. Per the spec, there shouldn't be any flush/shutdown mechanism exposed at the API/global level. The application needs to keep hold of the sdk::TracerProvider handle and use it to invoke shutdown/flush. During graceful shutdown of the application, shutdown is invoked implicitly through Drop, which internally flushes before shutting down. cc @TommyCpp
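
A minimal sketch of that pattern (not code from this issue; the empty builder stands in for however the application actually configures its exporters and span processors):

use opentelemetry_sdk::trace::TracerProvider;

fn main() {
    // Placeholder provider; a real application would configure span
    // processors/exporters (e.g. a batch OTLP exporter) on the builder.
    let provider = TracerProvider::builder().build();

    // Hand one handle to the global API, keep another for shutdown.
    opentelemetry::global::set_tracer_provider(provider.clone());

    // ... application work; spans are recorded through the global provider ...

    // Flush/shutdown through the retained handle instead of relying on
    // global::shutdown_tracer_provider(), which only drops the global reference.
    // force_flush() can be used mid-run; shutdown() exports the remaining spans
    // and stops the span processors.
    provider.shutdown().expect("tracer provider shutdown failed");
}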

@stepantubanov (Author) commented Jul 24, 2024

That's expected behavior. global::shutdown_tracer_provider() should only decrement the reference it is holding. Per the spec, there shouldn't be any flush/shutdown mechanism exposed at the API/global level.

It's kind of misleading for global::shutdown_tracer_provider() to not be a shutdown mechanism for the global tracer provider. Currently the doc reads:

Shut down the current tracer provider. This will invoke the shutdown method on all span processors. span processors should export remaining spans before return
https://docs.rs/opentelemetry/0.24.0/opentelemetry/global/fn.shutdown_tracer_provider.html

If the current behavior (no shutdown if there are other references) is intentional and should remain, then I think it'd be helpful to change the documentation: explicitly state that it won't do anything if there are more references to the tracer provider, and suggest calling flush/shutdown on the tracer provider directly.

During graceful shutdown of the application, shutdown is invoked implicitly through Drop, which internally flushes before shutting down.

In the previous opentelemetry version this worked for the given example (because Tracer held only a weak reference to TracerProvider), but with the new version it no longer does. It was a bit surprising to debug this issue and find out the call to shutdown_tracer_provider didn't do anything.

@lalitb (Member) commented Jul 24, 2024

If the current behavior (no shutdown if there are other references) is intentional and should remain, then I think it'd be helpful to change the documentation: explicitly state that it won't do anything if there are more references to the tracer provider, and suggest calling flush/shutdown on the tracer provider directly.

  • Agree, the doc is misleading. We can use this issue to track the improvement. Also, I think global::shutdown_tracer_provider() could be given a better name, say global::release_tracer_provider().

@mzabaluev commented Oct 15, 2024

Here's a test against opentelemetry_sdk:

#![cfg(feature = "rt-tokio")]

use futures_util::future::BoxFuture;
use opentelemetry::global as otel_global;
use opentelemetry::trace::{TracerProvider as _, Tracer as _};
use opentelemetry_sdk::{
    export::trace::{ExportResult, SpanData, SpanExporter},
    runtime,
    trace::TracerProvider,
};
use tokio::runtime::Runtime;

use std::sync::{Arc, Mutex};

#[derive(Clone, Debug, Default)]
struct TestExporter(Arc<Mutex<Vec<SpanData>>>);

impl SpanExporter for TestExporter {
    fn export(&mut self, mut batch: Vec<SpanData>) -> BoxFuture<'static, ExportResult> {
        let spans = self.0.clone();
        Box::pin(async move {
            if let Ok(mut inner) = spans.lock() {
                inner.append(&mut batch);
            }
            Ok(())
        })
    }
}

fn test_tracer(runtime: &Runtime) -> (TracerProvider, TestExporter) {
    let _guard = runtime.enter();

    let exporter = TestExporter::default();
    let provider = TracerProvider::builder()
        .with_batch_exporter(exporter.clone(), runtime::Tokio)
        .build();

    (provider, exporter)
}

#[test]
fn shutdown_global() {
    let rt = Runtime::new().unwrap();
    let (provider, exporter) = test_tracer(&rt);

    otel_global::set_tracer_provider(provider);

    let tracer = otel_global::tracer("test");
    for _ in 0..1000 {
        tracer.start("test_span");
    }
    // drop(tracer);

    // Should flush all batched telemetry spans
    otel_global::shutdown_tracer_provider();

    let spans = exporter.0.lock().unwrap();
    assert_eq!(spans.len(), 1000);
}

Dropping the tracer before the call to shutdown_tracer_provider makes the test succeed.
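
For completeness, a variant of the test above (same imports, helpers, and feature gate as the earlier snippet; the test name is mine) that passes once the tracer is dropped first:

#[test]
fn shutdown_global_after_dropping_tracer() {
    let rt = Runtime::new().unwrap();
    let (provider, exporter) = test_tracer(&rt);

    otel_global::set_tracer_provider(provider);

    let tracer = otel_global::tracer("test");
    for _ in 0..1000 {
        tracer.start("test_span");
    }

    // Dropping the tracer releases its strong reference to the provider, so
    // replacing the global provider drops the last reference and the batch
    // processor is flushed and shut down.
    drop(tracer);
    otel_global::shutdown_tracer_provider();

    let spans = exporter.0.lock().unwrap();
    assert_eq!(spans.len(), 1000);
}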

As currently implemented, the behavior of shutdown_tracer_provider is fragile against the existence of remaining references to the provider, which may still be in scope, owned by other globally set data as in tokio-rs/tracing-opentelemetry#159, or simply leaked. This is in contrast to TracerProvider::shutdown, which works as expected.

@lalitb self-assigned this Oct 15, 2024
@mzabaluev commented Oct 16, 2024

This is in contrast to TracerProvider::shutdown, which works as expected.

Here's a test to exercise this. However, it does produce an error printout when the tracer instance is kept alive past the shutdown.

#[test]
fn shutdown_in_scope() {
    let rt = Runtime::new().unwrap();
    let (provider, exporter) = test_tracer(&rt);

    let tracer = provider.tracer("test");
    for _ in 0..1000 {
        tracer.start("test_span");
    }
    // drop(tracer);

    // Should flush all batched telemetry spans
    provider.shutdown().unwrap();

    let spans = exporter.0.lock().unwrap();
    assert_eq!(spans.len(), 1000);
}

There is also a case of a lockup in the shutdown call unless it's done on a separate worker thread. I was able to reproduce it and traced it to this line:

futures_executor::block_on(res_receiver)

So it looks like we're not safe with the shutdown method either, but that's a subject for another bug report.
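
For reference, a minimal sketch of the separate-thread workaround mentioned above (assuming the application still holds a provider handle; not code from this issue):

// Run shutdown on a dedicated OS thread so the internal
// futures_executor::block_on call cannot block a tokio runtime worker.
let provider_for_shutdown = provider.clone();
std::thread::spawn(move || {
    if let Err(err) = provider_for_shutdown.shutdown() {
        eprintln!("tracer provider shutdown failed: {err}");
    }
})
.join()
.expect("shutdown thread panicked");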
