
Opentelemetry: Race condition between connection shutdown and export can result in duplicate spans #44894

Open
@arn-ivu

Description


Describe the bug

When there are ~10 seconds between two traces/spans being exported, it can happen that the exporter closes the connection after the next spans have been exported but before the otel-collector responds with OK.
This is what it looks like on TCP-level:

// traces are exported, everything OK
1.   12:55:22 app->otel POST /v1/traces (protobuf-content a)
2.   12:55:22 otel->app OK
// a trace is exported, directly followed by connection termination
3.   12:55:32 app->otel POST /v1/traces (protobuf-content b)
4.   12:55:32 otel->app ACK
5.   12:55:32 app->otel FIN
6.   12:55:32 otel->app OK
7.   12:55:32 otel->app FIN, ACK
// connection re-opened
8.   12:55:32 app->otel SYN
9.   12:55:32 otel->app SYN, ACK
10.  12:55:32 app->otel ACK
// traces are exported again
11.  12:55:32 app->otel POST /v1/traces (protobuf-content b)
12.  12:55:32 otel->app OK

Here is the actual tcp dump containing the example: /example/tcp_traffic_http.pcap

The server's OK in step 6 is probably dropped by the client (conforming to the HTTP standard, since the client has already closed the connection). This results in a timeout followed by a retry of the export.
As the otel-collector apparently doesn't deduplicate the spans, this results in duplicate spans in the trace.

This is what the trace should look like: trace_normal.json
This is what the trace looks like when the issue occurs: trace_with_duplicates.json

From what I can see in the code, I assume this is a race condition: the connection shutdown is triggered but does not set the isShutdown flag before the next spans are exported. I imagine something like this is happening:

thread 1:
    // we didn't export anything for 10 s; let's close the connection
    VertxHttpSpanExport.shutdown()
thread 2:
    // receives new spans to export
    VertxHttpSpanExport.export()
        HttpExporter.export()
            if (isShutdown.get()) -> false
            // proceed exporting
thread 1:
    HttpExporter.shutdown()
        if (!isShutdown.compareAndSet(false, true))
        // proceed shutting down
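The suspected check-then-act race above can be replayed deterministically with a minimal sketch. This is illustrative only: the class and method names are mine, not the actual Quarkus/OTel sources; only the isShutdown AtomicBoolean pattern is taken from the description above.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of the suspected check-then-act race.
// Names are hypothetical; only the flag handling mirrors the issue description.
class ExporterSketch {
    private final AtomicBoolean isShutdown = new AtomicBoolean(false);

    // "Thread 2": export proceeds because the shutdown flag is not yet set.
    boolean export() {
        if (isShutdown.get()) {
            return false; // export rejected, exporter already shut down
        }
        // ...serialize spans and write them to the connection...
        return true; // export is now in flight on a connection that may be closed underneath it
    }

    // "Thread 1": shutdown wins the CAS and closes the connection,
    // even though an export is already in flight.
    boolean shutdown() {
        if (!isShutdown.compareAndSet(false, true)) {
            return false; // already shut down
        }
        // ...close the underlying HTTP connection...
        return true;
    }

    public static void main(String[] args) {
        ExporterSketch exporter = new ExporterSketch();
        // Deterministic replay of the interleaving described above:
        boolean exportStarted = exporter.export();  // thread 2 passes the flag check
        boolean shutdownRan = exporter.shutdown();  // thread 1 still closes the connection
        // Both calls succeed, so the in-flight export races the connection close.
        System.out.println(exportStarted && shutdownRan); // prints "true"
    }
}
```

Nothing in this flag alone prevents shutdown() from closing the connection while export() is between the flag check and the server's OK, which is exactly the window visible in steps 3-7 of the TCP dump.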

Expected behavior

The exporter should not export the same traces twice to the otel-collector.

Actual behavior

The exporter sometimes (probably due to a race condition) closes the connection to the otel-collector right after spans were exported, but before the otel-collector could respond with OK. It then reopens the connection and sends the spans a second time.

How to Reproduce?

Reproducer: https://github.com/arn-ivu/reproducer-duplicate-spans

Requirements:

  • docker
  • optional: a tool to capture tcp-traffic, e.g. wireshark

Steps to reproduce:

  • Start the reproducer: ./mvnw compile quarkus:dev
  • Since this is about a race condition, you may have to wait some time until the issue occurs (on my machine it often took ~10 minutes)

Find duplicate Spans in grafana:

  • In the dev-ui: http://localhost:8080/q/dev-ui/extensions
    • Click on Grafana-ui
  • Navigate to "Explore", choose "Tempo" as the source
  • Paste this query to the traceQL-input: {resource.service.name="reproducer-duplicate-spans"} | count() > 2

Monitor the tcp-traffic with wireshark

Output of uname -a or ver

Linux e5808115ef08 5.15.167.4-microsoft-standard-WSL2 #1 SMP Tue Nov 5 00:21:55 UTC 2024 x86_64 Linux

Output of java -version

openjdk version "22.0.1" 2024-04-16
OpenJDK Runtime Environment Temurin-22.0.1+8 (build 22.0.1+8)
OpenJDK 64-Bit Server VM Temurin-22.0.1+8 (build 22.0.1+8, mixed mode, sharing)

Quarkus version or git rev

3.17.2

Build tool (ie. output of mvnw --version or gradlew --version)

Apache Maven 3.9.9

Additional information

While I have analyzed the problem mostly for quarkus.otel.exporter.otlp.protocol=http/protobuf, I did observe the same problem with quarkus.otel.exporter.otlp.protocol=grpc.
See https://github.com/arn-ivu/reproducer-duplicate-spans/tree/main/examples for tcp-traffic of both cases.
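For reference, the exporter protocol is selected in application.properties; both values named above showed the duplicate-span behavior:

```properties
# HTTP exporter (analyzed in detail above)
quarkus.otel.exporter.otlp.protocol=http/protobuf
# gRPC exporter shows the same duplication:
#quarkus.otel.exporter.otlp.protocol=grpc
```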
