Opentelemetry: Race condition between connection shutdown and export can result in duplicate spans #44894
Description
Describe the bug
When there are ~10seconds between two traces/spans beeing exported, it can happen that the exporter closes the connection after the next spans have been exported but before the otel-collector responds with OK.
This is what it looks like on TCP-level:
//traces are exported. everything ok
1. 12:55:22 app->otel POST /v1/traces (protobuf-content a)
2. 12:55:22 otel->app OK
//a trace is exported directly followed by connection termination
3. 12:55:32 app->otel POST /v1/traces (protobuf-content b)
4. 12:55:32 otel->app ACK
5. 12:55:32 app->otel FIN
6. 12:55:32 otel->app OK
7. 12:55:32 otel->app FIN, ACK
//connection re-opened
8. 12:55:32 app->otel SYN
9. 12:55:32 otel->app SYN, ACK
10. 12:55:32 app->otel ACK
//traces are exported again
11. 12:55:32 app->otel POST /v1/traces (brotobuf-content b)
12. 12:55:32 otel->app OK
Here is the actual tcp dump containing the example: /example/tcp_traffic_http.pcap
the server's OK in step 6 is probably dropped (conforming to http-standard). This results in a timeout followed by a retry of the export.
As the otel-collector appearently doesn't deduplicate the spans this results in duplicate spans in the trace.
This is how the trace should look like: trace_normal.json
This is how the trace looks like when the issue occurs: trace_with_duplicates.json
From what I can see in the code, I assume this is a race-condition, where the connection-shutdown is triggered but does not set the isShutdown-flag before the next spans are exported. I imagine something like this is happening:
thread 1:
// we didn't export anything for 10s. Let's close the connection
VertxHttpSpanExport.shutdown()
thread 2:
//receives new spans to export
VertxHttpSpanExport.export()
HttpExporter.export()
if (isShutdown.get()) -> false
//proceed exporting
thread 1:
HttpExporter.shutdown()
if (!isShutdown.compareAndSet(false, true))
//proceed shutting down
Expected behavior
The exporter should not export the same traces twice to the otel-collector.
Actual behavior
The exporter sometimes (probably: race condition) closes the connection to the otel-collector right after spans where exported, but before the otel-collector could respond with OK. It will then reopen the connection and send the spans a second time.
How to Reproduce?
Reproducer: https://github.com/arn-ivu/reproducer-duplicate-spans
Requirements:
- docker
- optional: a tool to capture tcp-traffic, e.g. wireshark
Steps to reproduce:
- Start the reproducer:
./mvnw compile quarkus:dev
- Since this is about a race-condition you may have to wait for some time until the issue occurs (on my machine it was often ~10 minutes)
Find duplicate Spans in grafana:
- In the dev-ui: http://localhost:8080/q/dev-ui/extensions
- Click on Grafana-ui
- Naviagate to "Explore", Choose "Tempo" as source
- Paste this query to the traceQL-input:
{resource.service.name="reproducer-duplicate-spans"} | count() > 2
Monitor the tcp-traffic with wireshark
- In the dev-ui http://localhost:8080/q/dev-ui/extensions
- check the Opentelemetry Port OTEL_PORT - Click on "Adapter for loopback traffic capture"
- In the filter-input enter tcp.port==OTEL_PORT
Output of uname -a
or ver
Linux e5808115ef08 5.15.167.4-microsoft-standard-WSL2 #1 SMP Tue Nov 5 00:21:55 UTC 2024 x86_64 Linux
Output of java -version
openjdk version "22.0.1" 2024-04-16 OpenJDK Runtime Environment Temurin-22.0.1+8 (build 22.0.1+8) OpenJDK 64-Bit Server VM Temurin-22.0.1+8 (build 22.0.1+8, mixed mode, sharing)
Quarkus version or git rev
3.17.2
Build tool (ie. output of mvnw --version
or gradlew --version
)
Apache Maven 3.9.9
Additional information
While I have analyzed the problem mostly for quarkus.otel.exporter.otlp.protocol=http/protobuf
I did observe the same problem with quarkus.otel.exporter.otlp.protocol=grpc
See https://github.com/arn-ivu/reproducer-duplicate-spans/tree/main/examples for tcp-traffic of both cases
Metadata
Assignees
Labels
Type
Projects
Status
Todo