
Opentelemetry: Race condition between connection shutdown and export can result in duplicate spans #44894

Open
@arn-ivu

Description

Describe the bug

When there are ~10 seconds between two traces/spans being exported, it can happen that the exporter closes the connection after the next spans have been exported but before the otel-collector responds with OK.
This is what it looks like at the TCP level:

//traces are exported. everything ok
1.   12:55:22 app->otel POST /v1/traces (protobuf-content a)
2.   12:55:22 otel->app OK
//a trace is exported directly followed by connection termination
3.   12:55:32 app->otel POST /v1/traces (protobuf-content b)
4.   12:55:32 otel->app ACK 
5.   12:55:32 app->otel FIN
6.   12:55:32 otel->app OK
7.   12:55:32 otel->app FIN, ACK
//connection re-opened
8.   12:55:32 app->otel SYN
9.   12:55:32 otel->app SYN, ACK
10. 12:55:32 app->otel ACK
//traces are exported again
11. 12:55:32 app->otel POST /v1/traces (protobuf-content b)
12. 12:55:32 otel->app OK

Here is the actual TCP dump containing the example: /example/tcp_traffic_http.pcap

The server's OK in step 6 is probably dropped (in line with the HTTP standard). This results in a timeout followed by a retry of the export.
As the otel-collector apparently does not deduplicate the spans, this results in duplicate spans in the trace.
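
In other words, delivery ends up being at-least-once: the spans did reach the collector, but because the OK never reached the app, the export looks failed and is retried. A tiny standalone illustration of that consequence (no real networking, all names invented for this sketch):

import java.util.ArrayList;
import java.util.List;

// Sketch only: models why a lost response turns into duplicate delivery
// when the receiving side does not deduplicate.
class DuplicateOnRetrySketch {

    // Stands in for the otel-collector: it keeps every payload it receives.
    static final List<String> received = new ArrayList<>();

    // Stands in for one export attempt. The payload always reaches the
    // "collector"; whether the "app" sees the OK depends on responseLost.
    static boolean export(String payload, boolean responseLost) {
        received.add(payload);   // collector side: spans are ingested
        return !responseLost;    // app side: a dropped OK looks like a failure
    }

    public static void main(String[] args) {
        String spans = "protobuf-content b";

        // First attempt: the connection is closed before the OK arrives,
        // so the export is treated as failed...
        boolean ok = export(spans, true);

        // ...and retried, delivering the same spans a second time.
        if (!ok) {
            export(spans, false);
        }

        // Prints 2: the collector stored the payload twice.
        System.out.println("collector received " + received.size() + " copies");
    }
}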

This is how the trace should look: trace_normal.json
This is how the trace looks when the issue occurs: trace_with_duplicates.json

From what I can see in the code, I assume this is a race condition: the connection shutdown is triggered, but the isShutdown flag is not yet set when the next spans are exported. I imagine something like this is happening:

thread 1:
    // we didn't export anything for 10s. Let's close the connection
    VertxHttpSpanExport.shutdown()
thread 2:
    //receives new spans to export
    VertxHttpSpanExport.export()
        HttpExporter.export()
            if (isShutdown.get()) -> false
            //proceed exporting
thread 1:
        HttpExporter.shutdown()
            if (!isShutdown.compareAndSet(false, true)) -> false
            //proceed shutting down
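
To make the suspected interleaving concrete, here is a minimal standalone sketch. The class and method names are placeholders, not the actual Quarkus/OpenTelemetry exporter code; it only shows that the isShutdown check in export() and the compareAndSet in shutdown() are two independent steps, so the connection can be closed while a request that passed the check is still waiting for its response:

import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: placeholder names, not the real exporter classes.
class RacyExporterSketch {

    private final AtomicBoolean isShutdown = new AtomicBoolean(false);

    // Thread 2: the check passes before the flag is set, so the export proceeds.
    void export(Runnable sendRequestAndAwaitOk) {
        if (isShutdown.get()) {
            return;
        }
        sendRequestAndAwaitOk.run();
    }

    // Thread 1: the CAS succeeds concurrently, so the connection is closed
    // even though an export may still be waiting for its response.
    void shutdown(Runnable closeConnection) {
        if (!isShutdown.compareAndSet(false, true)) {
            return;
        }
        closeConnection.run();
    }

    public static void main(String[] args) throws InterruptedException {
        RacyExporterSketch exporter = new RacyExporterSketch();
        Thread exporting = new Thread(
                () -> exporter.export(() -> System.out.println("POST /v1/traces")));
        Thread shuttingDown = new Thread(
                () -> exporter.shutdown(() -> System.out.println("FIN")));
        exporting.start();
        shuttingDown.start();
        exporting.join();
        shuttingDown.join();
        // Depending on thread scheduling, both runnables can execute: the export
        // passes the isShutdown check while shutdown closes the connection,
        // matching steps 3-7 of the capture above.
    }
}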

Expected behavior

The exporter should not export the same traces twice to the otel-collector.

Actual behavior

The exporter sometimes (probably due to the race condition described above) closes the connection to the otel-collector right after spans were exported, but before the otel-collector could respond with OK. It then reopens the connection and sends the spans a second time.
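
One possible direction, sketched here with invented names and without any claim about the real exporter internals or the right fix: let shutdown wait for exports that already passed the isShutdown check before actually closing the connection, for example by counting in-flight requests.

import java.util.concurrent.Phaser;

// Sketch only: placeholder names, not a patch against the real exporter.
class DrainingExporterSketch {

    // One permanently registered party represents the exporter itself; each
    // in-flight export registers an extra party and deregisters once its
    // response (or failure) has been observed.
    private final Phaser inFlight = new Phaser(1);
    private volatile boolean shutdown;

    void export(Runnable sendRequestAndAwaitOk) {
        inFlight.register();
        try {
            if (shutdown) {
                return;                      // refuse new work after shutdown
            }
            sendRequestAndAwaitOk.run();     // send the request and wait for the OK
        } finally {
            inFlight.arriveAndDeregister();
        }
    }

    void shutdown(Runnable closeConnection) {
        shutdown = true;
        inFlight.arriveAndAwaitAdvance();    // wait for in-flight exports to finish
        closeConnection.run();               // only now terminate the connection
    }
}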

How to Reproduce?

Reproducer: https://github.com/arn-ivu/reproducer-duplicate-spans

Requirements:

  • docker
  • optional: a tool to capture TCP traffic, e.g. Wireshark

Steps to reproduce:

  • Start the reproducer: ./mvnw compile quarkus:dev
  • Since this is about a race condition, you may have to wait some time until the issue occurs (on my machine it often took ~10 minutes)

Find duplicate spans in Grafana:

  • In the dev-ui: http://localhost:8080/q/dev-ui/extensions
    • Click on Grafana-ui
  • Navigate to "Explore" and choose "Tempo" as the source
  • Paste this query into the TraceQL input: {resource.service.name="reproducer-duplicate-spans"} | count() > 2

Monitor the TCP traffic with Wireshark

Output of uname -a or ver

Linux e5808115ef08 5.15.167.4-microsoft-standard-WSL2 #1 SMP Tue Nov 5 00:21:55 UTC 2024 x86_64 Linux

Output of java -version

openjdk version "22.0.1" 2024-04-16
OpenJDK Runtime Environment Temurin-22.0.1+8 (build 22.0.1+8)
OpenJDK 64-Bit Server VM Temurin-22.0.1+8 (build 22.0.1+8, mixed mode, sharing)

Quarkus version or git rev

3.17.2

Build tool (ie. output of mvnw --version or gradlew --version)

Apache Maven 3.9.9

Additional information

While I have analyzed the problem mostly for quarkus.otel.exporter.otlp.protocol=http/protobuf,
I observed the same problem with quarkus.otel.exporter.otlp.protocol=grpc.
See https://github.com/arn-ivu/reproducer-duplicate-spans/tree/main/examples for the TCP traffic of both cases.
