Skip to content

Conversation

lattwood
Copy link

What does this PR do?

Addresses the edge-case of datadog-agent closing a persistent connection while dd-trace-go is trying to send it a request, resulting in error logs.

Encode the stats payload once, use a bytes.Reader for the body, and set req.GetBody + ContentLength. Also add an empty Idempotency-Key header (zero-length slice) so net/http.Transport treats the request as idempotent and can automatically retry on network errors when GetBody is available.

This fixes intermittent Post "...": EOF failures caused by a race where the agent (~5s idle timeout) closes an idle connection while the tracer (~90s) is sending- Write succeeds, then Read returns io.EOF.

Tests:

  • Add TestSendStatsNetworkErrorRetry with an eofConn that returns EOF after a successful Write to force the Transport retry path.
  • Assert two dials (retry on a new conn), /stats is retried, and Content-Length is set. The server sees three handler hits (/info & two /stats due to retry) while the client records two requests.

Result: sendStats transparently retries on transient EOFs; no behavior change on the success path.

Motivation

I was getting a ton of Post "...": EOF errors in my logs.

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • New code is free of linting errors. You can check this by running ./scripts/lint.sh locally.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.

Encode the stats payload once, use a bytes.Reader for the body, and set
`req.GetBody` + `ContentLength`. Also add an *empty* `Idempotency-Key`
header (zero-length slice) so `net/http.Transport` treats the request as
idempotent and can automatically retry on network errors when `GetBody`
is available.

This fixes intermittent `Post "...": EOF` failures caused by a race where
the agent (~5s idle timeout) closes an idle connection while the tracer
(~90s) is sending- `Write` succeeds, then `Read` returns `io.EOF`.

Tests:

* Add `TestSendStatsNetworkErrorRetry` with an `eofConn` that returns EOF
  *after* a successful `Write` to force the Transport retry path.
* Assert two dials (retry on a new conn), `/stats` is retried, and
  `Content-Length` is set. The server sees three handler hits (`/info` &
  two `/stats` due to retry) while the client records two requests.

Result: `sendStats` transparently retries on transient EOFs; no behavior
change on the success path.
@lattwood lattwood requested a review from a team as a code owner September 24, 2025 19:54
@darccio darccio requested review from a team September 25, 2025 07:26
@darccio
Copy link
Member

darccio commented Sep 25, 2025

@lattwood Thanks for the contribution. Allow us to ping the agent team and extra eyes from our guild. This looks good in a quick review, but we want to double check.

@darccio
Copy link
Member

darccio commented Sep 25, 2025

In the meantime, can you check the test? Thanks!

--- FAIL: TestSendStatsNetworkErrorRetry (0.01s)
    transport_test.go:615: 
        	Error Trace:	/home/runner/work/dd-trace-go/dd-trace-go/ddtrace/tracer/transport_test.go:615
        	Error:      	Not equal: 
        	            	expected: 2
        	            	actual  : 3
        	Test:       	TestSendStatsNetworkErrorRetry

@lattwood
Copy link
Author

lattwood commented Sep 25, 2025 via email

return err
}

// by providing GetBody and a zero length slice for the Idempotency-Key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it could work well here, but wouldn't this mean all these requests are considered idempotent so even on a network error where the agent received and processed the request this would cause a retry of the same request? I took a look at the net/http docs for clarity here but they aren't clarifying this exact case for me

@lattwood
Copy link
Author

lattwood commented Sep 25, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants