Go's net/http documentation for Client.Do notes: "The request Body, if non-nil, will be closed by the underlying Transport, even on errors. The Body may be closed asynchronously after Do returns."
Because the Body (here, the pooled buffer) can still be closed asynchronously after Do returns, it is unsafe to put it back into the pool at that point: it might be closed after it has already been picked up again by another request, corrupting that request's data. There is a GitHub issue about this in the Go repository, golang/go#51907, where you can see that some high-profile projects such as Kubernetes were affected as well.
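One way to respect this constraint, sketched here under assumptions, is to hand the buffer back to the pool only from the body's own Close method, since that is the point at which the Transport signals it is done with the body. The names bufPool, pooledBody and newPooledBody are hypothetical and are not taken from the exporter's code; this is an illustration of the idea, not the fix applied in the collector.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// bufPool holds reusable buffers (hypothetical; not the exporter's actual pool).
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// pooledBody is a request body backed by a pooled buffer. The buffer goes
// back to the pool only when Close is called, not when Do returns.
type pooledBody struct {
	buf  *bytes.Buffer
	rd   *bytes.Reader
	once sync.Once
}

// newPooledBody copies payload into a buffer taken from the pool.
func newPooledBody(payload []byte) *pooledBody {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.Write(payload)
	return &pooledBody{buf: buf, rd: bytes.NewReader(buf.Bytes())}
}

// Read serves the payload from the pooled buffer.
func (b *pooledBody) Read(p []byte) (int, error) { return b.rd.Read(p) }

// Close returns the buffer to the pool exactly once, even if the caller
// and the Transport both call Close.
func (b *pooledBody) Close() error {
	b.once.Do(func() { bufPool.Put(b.buf) })
	return nil
}

func main() {
	body := newPooledBody([]byte(`{"event":"hello"}`))
	data, err := io.ReadAll(body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("read %d bytes\n", len(data))
	// In real use the Transport calls Close once it is done with the body.
	body.Close()
}
```

If such a type were passed as the body of an http.Request, ContentLength would have to be set explicitly, because http.NewRequest only infers it for *bytes.Buffer, *bytes.Reader and *strings.Reader bodies.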
Component(s)
exporter/splunkhec
What happened?
Description
At high throughput, the Splunk HEC exporter returns errors. I collected four different ones:
Post "https://<redacted>/services/collector/raw?index=main&sourcetype=test&source=eventhub://pbernier-premium1.servicebus.windows.net/zscaler_eh_cef-2%3B&host=<redacted>": net/http: HTTP/1.x transport connection broken: http: ContentLength=2916 with Body length 0
flate: closed writer
"HTTP/1.1 400 Unparsable gzip header in request data\r\nContent-Length: 261\r\nConnection: keep-alive\r\nContent-Type: text/html; charset=UTF-8\r\nDate: Wed, 24 Jul 2024 20:39:34 GMT\r\nServer: Splunkd\r\nX-Content-Type-Options: nosniff\r\nX-Frame-Options: SAMEORIGIN\r\n\r\n<!doctype html><html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-8\"><title>400 Unparsable gzip header in request data</title></head><body><h1>Unparsable gzip header in request data</h1><p>HTTP Request was malformed.</p></body></html>\r\n"
Permanent error: "HTTP/1.1 400 Bad Request\r\nContent-Length: 27\r\nConnection: keep-alive\r\nContent-Type: application/json; charset=UTF-8\r\nDate: Thu, 25 Jul 2024 17:49:23 GMT\r\nServer: Splunkd\r\nVary: Authorization\r\nX-Content-Type-Options: nosniff\r\nX-Frame-Options: SAMEORIGIN\r\n\r\n{\"text\":\"No data\",\"code\":5}"
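For context on the second error: writing to a gzip writer after it has been closed is one way to get an error of this kind, which would be consistent with a writer being reused while, or after, another code path closed it. The sketch below is a standalone illustration using only compress/gzip; the function name writeAfterClose is hypothetical and the snippet does not use the exporter's code. On recent Go versions the returned error is expected to read "flate: closed writer", but that exact text is an assumption about the standard library's behavior, so the snippet simply prints whatever error comes back.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// writeAfterClose closes a gzip writer and then writes to it again,
// mimicking a writer that is reused after something else closed it.
func writeAfterClose() error {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte("first event")); err != nil {
		return fmt.Errorf("unexpected error on first write: %w", err)
	}
	if err := zw.Close(); err != nil {
		return fmt.Errorf("unexpected error on close: %w", err)
	}
	// The writer is closed and was not Reset, so this write is rejected.
	_, err := zw.Write([]byte("second event"))
	return err
}

func main() {
	fmt.Println("write after close:", writeAfterClose())
}
```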
These errors, and especially the second one, make me wonder whether there is an issue with the cancellableGzipWriter that would occasionally leave it in an inconsistent state. The only way I can see this happening is if a buffer were being used concurrently. I looked at the code myself and couldn't find anything obvious; the usage of sync.Pool does make sense.
Steps to Reproduce
I have 300 collectors sending a total of 309 MB/s, or 39,000 events/s. Events are sent one by one (a single event per HEC HTTP request) to a single Splunk Cloud stack. About 1 request/s fails (roughly 1 out of 39,000).
Expected Result
No errors; Splunk should not receive incomplete payloads.
Actual Result
Errors shared above
Collector version
v0.104.0
Environment information
Environment
Alpine Linux
Go 1.22
OpenTelemetry Collector configuration
Log output
No response
Additional context
No response