Invalid gzip payload sent by splunk HEC exporter #34255

PaulBernier · 2024-07-25T20:19:09Z

Component(s)

exporter/splunkhec

What happened?

Description

At high throughput, the Splunk HEC exporter occasionally returns errors. I collected four different ones:

Post "https://<redacted>/services/collector/raw?index=main&sourcetype=test&source=eventhub://pbernier-premium1.servicebus.windows.net/zscaler_eh_cef-2%3B&host=<redacted>": net/http: HTTP/1.x transport connection broken: http: ContentLength=2916 with Body length 0
flate: closed writer
"HTTP/1.1 400 Unparsable gzip header in request data\r\nContent-Length: 261\r\nConnection: keep-alive\r\nContent-Type: text/html; charset=UTF-8\r\nDate: Wed, 24 Jul 2024 20:39:34 GMT\r\nServer: Splunkd\r\nX-Content-Type-Options: nosniff\r\nX-Frame-Options: SAMEORIGIN\r\n\r\n<!doctype html><html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-8\"><title>400 Unparsable gzip header in request data</title></head><body><h1>Unparsable gzip header in request data</h1><p>HTTP Request was malformed.</p></body></html>\r\n"
Permanent error: "HTTP/1.1 400 Bad Request\r\nContent-Length: 27\r\nConnection: keep-alive\r\nContent-Type: application/json; charset=UTF-8\r\nDate: Thu, 25 Jul 2024 17:49:23 GMT\r\nServer: Splunkd\r\nVary: Authorization\r\nX-Content-Type-Options: nosniff\r\nX-Frame-Options: SAMEORIGIN\r\n\r\n{\"text\":\"No data\",\"code\":5}"

These errors, especially the second one, make me wonder whether there is an issue with the cancellableGzipWriter that occasionally leaves it in an inconsistent state. The only way I can see this happening is if a buffer were being used concurrently. I've looked at the code myself and couldn't find anything obvious; the usage of sync.Pool does make sense.

Steps to Reproduce

I have 300 collectors sending a total of 309 MB/s at 39,000 events/s. Events are sent one by one (a single event per HEC HTTP request) to a single Splunk Cloud stack. About 1 request/s fails (so roughly 1 out of 39,000).

Expected Result

No errors; Splunk should not receive incomplete payloads.

Actual Result

Errors shared above

Collector version

v0.104.0

Environment information

Environment

alpine linux
Go 1.22

OpenTelemetry Collector configuration

exporters:
  splunk_hec/1:
    token: "{{.hecToken}}"
    endpoint: "{{.hecEndpoint}}/services/collector/raw?index={{.splunkIndex}}&sourcetype={{.splunkSourceType}}&source=eventhub://{{.eventHubFullyQualifiedNamespace}}/{{.eventHubName}}"
    sourcetype: "{{.splunkSourceType}}"
    index: "{{.splunkIndex}}"
    export_raw: true
    max_content_length_logs: {{.exporterMaxContentLengthLogs}}
    retry_on_failure:
      enabled: false
    sending_queue:
      enabled: false
    hec_metadata_to_otel_attrs:
      source: source
      host: host
    tls:
      min_version: 1.2

Log output

No response

Additional context

No response


github-actions · 2024-07-25T20:19:27Z

Pinging code owners:

exporter/splunkhec: @atoulme @dmitryax

See Adding Labels via Comments if you do not have permissions to add labels yourself.

PaulBernier · 2024-07-25T23:44:40Z

I found the root cause. From https://pkg.go.dev/net/http#Client.Do:

The request Body, if non-nil, will be closed by the underlying Transport, even on errors. The Body may be closed asynchronously after Do returns.

Because the Body (here, the buffer) can still be closed asynchronously after Do returns, it is unsafe to return it to the pool at that point: it might be closed after it has already been picked up again by another request, corrupting the data. There is a GitHub issue about this in the Go repository, golang/go#51907, where you can see that some high-profile projects such as Kubernetes were affected as well.
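To make the hazard concrete, here is a minimal sketch of one way to avoid it: tie the buffer's return to the pool to the body's Close call instead of to the return of Do. This is not the exporter's actual code or the fix that was merged. The names bufPool, pooledBody, newPooledBody, and newRequest are hypothetical, the URL is a placeholder, and the sketch assumes the transport does not read the body again after closing it.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"sync"
)

// bufPool is a hypothetical pool of reusable payload buffers.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// pooledBody is a hypothetical request body that owns a pooled buffer
// and hands it back to the pool only when Close is called.
type pooledBody struct {
	reader *bytes.Reader
	buf    *bytes.Buffer
	once   sync.Once
}

// newPooledBody copies payload into a pooled buffer and wraps it.
func newPooledBody(payload []byte) *pooledBody {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.Write(payload)
	return &pooledBody{reader: bytes.NewReader(buf.Bytes()), buf: buf}
}

// Read serves bytes from the pooled buffer.
func (b *pooledBody) Read(p []byte) (int, error) { return b.reader.Read(p) }

// Close releases the buffer exactly once, however many times it is called.
// Assumption: nothing reads from the body after Close.
func (b *pooledBody) Close() error {
	b.once.Do(func() { bufPool.Put(b.buf) })
	return nil
}

// newRequest builds a POST request whose body releases its buffer on Close.
func newRequest(url string, payload []byte) (*http.Request, error) {
	body := newPooledBody(payload)
	req, err := http.NewRequest(http.MethodPost, url, body)
	if err != nil {
		body.Close() // nobody else will close it, so release the buffer here
		return nil, err
	}
	// NewRequest cannot infer the length of a custom body type.
	req.ContentLength = int64(len(payload))
	return req, nil
}

func main() {
	// Placeholder URL; no network call is made in this demo.
	req, err := newRequest("http://localhost:8088/services/collector/raw", []byte("event"))
	if err != nil {
		panic(err)
	}
	data, _ := io.ReadAll(req.Body)
	req.Body.Close() // in real use, the Transport makes this call
	fmt.Printf("read %d bytes, ContentLength=%d\n", len(data), req.ContentLength)
}
```

With this shape the caller never puts the buffer back itself after Do returns; ownership passes to the body, and the buffer becomes reusable only once the body has been closed. A simpler alternative is to skip pooling for request bodies entirely and accept the extra allocations.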

crobert-1 · 2024-08-02T16:10:22Z

Thanks for filing @PaulBernier, and for including all of the information!

I'm going to close this as a duplicate of #34357, based on the code owner's response in that issue.

PaulBernier added the bug and needs triage labels Jul 25, 2024

github-actions bot added the exporter/splunkhec label Jul 25, 2024

atoulme mentioned this issue Jul 26, 2024

[exporter/splunkhec] asynchronous HTTP request consumption #34259

Closed

github-actions bot mentioned this issue Jul 30, 2024

Weekly Report: 2024-07-23 - 2024-07-30 #34301

Closed

PaulBernier mentioned this issue Jul 31, 2024

splunkhecexporter panic #34357

Closed

crobert-1 closed this as not planned Aug 2, 2024
