-
Notifications
You must be signed in to change notification settings - Fork 483
fix(ddtracer/tracer): retry sendStats on EOF by marking POST idempotent #3991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Encode the stats payload once, use a bytes.Reader for the body, and set `req.GetBody` + `ContentLength`. Also add an *empty* `Idempotency-Key` header (zero-length slice) so `net/http.Transport` treats the request as idempotent and can automatically retry on network errors when `GetBody` is available. This fixes intermittent `Post "...": EOF` failures caused by a race where the agent (~5s idle timeout) closes an idle connection while the tracer (~90s) is sending- `Write` succeeds, then `Read` returns `io.EOF`. Tests: * Add `TestSendStatsNetworkErrorRetry` with an `eofConn` that returns EOF *after* a successful `Write` to force the Transport retry path. * Assert two dials (retry on a new conn), `/stats` is retried, and `Content-Length` is set. The server sees three handler hits (`/info` & two `/stats` due to retry) while the client records two requests. Result: `sendStats` transparently retries on transient EOFs; no behavior change on the success path.
@lattwood Thanks for the contribution. Allow us to ping the agent team and extra eyes from our guild. This looks good in a quick review, but we want to double check. |
In the meantime, can you check the test? Thanks!
|
will do!
i am going to have to start drinking if this is because of a macos/linux
idiosyncrasy 🙃
…On Thu, Sep 25, 2025 at 4:42 AM Dario Castañé ***@***.***> wrote:
*darccio* left a comment (DataDog/dd-trace-go#3991)
<#3991 (comment)>
In the meantime, can you check the test? Thanks!
--- FAIL: TestSendStatsNetworkErrorRetry (0.01s)
transport_test.go:615:
Error Trace: /home/runner/work/dd-trace-go/dd-trace-go/ddtrace/tracer/transport_test.go:615
Error: Not equal:
expected: 2
actual : 3
Test: TestSendStatsNetworkErrorRetry
—
Reply to this email directly, view it on GitHub
<#3991 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABPJNKD3T2CCDZI3IEFWLKD3UOMHHAVCNFSM6AAAAACHNFIZASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTGMZSGYYTSMRQGM>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
return err | ||
} | ||
|
||
// by providing GetBody and a zero length slice for the Idempotency-Key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it could work well here, but wouldn't this mean all these requests are considered idempotent so even on a network error where the agent received and processed the request this would cause a retry of the same request? I took a look at the net/http docs for clarity here but they aren't clarifying this exact case for me
They would. I took a quick skim through the stats protobuf and it doesn't
*look* like that would be an issue.
To fix this without a blind retry, we'd have to significantly increase the
idle/read timeout from 3 seconds in the agent, which i'm guessing was done
for a reason (the 90s here appears to be a go default).
hoping to get at the test failure tomorrow.
…On Thu, Sep 25, 2025 at 12:11 PM Andrew Glaude ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In ddtrace/tracer/transport.go
<#3991 (comment)>:
> if err != nil {
return err
}
+
+ // by providing GetBody and a zero length slice for the Idempotency-Key
This seems like it could work well here, but wouldn't this mean all these
requests are considered idempotent so even on a network error where the
agent received and processed the request this would cause a retry of the
same request? I took a look at the net/http docs for clarity here but they
aren't clarifying this exact case for me
—
Reply to this email directly, view it on GitHub
<#3991 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABPJNKFMHH5GJE2MYTWE4T33UQA2DAVCNFSM6AAAAACHNFIZASVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZTENRYGI3TGNBZGE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
What does this PR do?
Addresses the edge-case of
datadog-agent
closing a persistent connection whiledd-trace-go
is trying to send it a request, resulting in error logs.Encode the stats payload once, use a bytes.Reader for the body, and set
req.GetBody
+ContentLength
. Also add an emptyIdempotency-Key
header (zero-length slice) sonet/http.Transport
treats the request as idempotent and can automatically retry on network errors whenGetBody
is available.This fixes intermittent
Post "...": EOF
failures caused by a race where the agent (~5s idle timeout) closes an idle connection while the tracer (~90s) is sending-Write
succeeds, thenRead
returnsio.EOF
.Tests:
TestSendStatsNetworkErrorRetry
with aneofConn
that returns EOF after a successfulWrite
to force the Transport retry path./stats
is retried, andContent-Length
is set. The server sees three handler hits (/info
& two/stats
due to retry) while the client records two requests.Result:
sendStats
transparently retries on transient EOFs; no behavior change on the success path.Motivation
I was getting a ton of
Post "...": EOF
errors in my logs.Reviewer's Checklist
./scripts/lint.sh
locally.