wrong status code set in tracing spans #10538

Closed
rafaeljesus opened this issue Jul 4, 2024 · 8 comments · Fixed by #10574
Labels: bug (Something isn't working), waiting-for-author

Comments

@rafaeljesus

Describe the bug

I have a logs.exporters.otlphttp that is not properly propagating the status code returned by the logging backend. The backend (in this case logmixer-ingest) returns a 400 status code, but the collector's server span (in this case logingest) is set to 500 instead. This is causing my SLOs to burn incorrectly.

Steps to reproduce

What did you expect to see?

I expect the server span's status code attribute to match what the logging service returns. In this case the logging service returns 400, but the server span records 500 instead.

What did you see instead?

Here we see that the HTTP client correctly sets 400 on the span:

Screenshot 2024-07-04 at 14 53 16

In the span above, we see that the permanent error message contains the 400:

Screenshot 2024-07-04 at 14 48 34

But in the server span we see that it is 500 instead:

Screenshot 2024-07-04 at 14 50 10

What version did you use?

v0.95.0

What config did you use?

This is the ConfigMap managed by https://github.com/open-telemetry/opentelemetry-operator:

apiVersion: v1
data:
  collector.yaml: |
    receivers:
      otelarrow:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            include_metadata: true
            keepalive:
              server_parameters:
                max_connection_age: 180s
                max_connection_age_grace: 240s
            max_recv_msg_size_mib: 20
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
            include_metadata: true
    exporters:
      logging:
        verbosity: detailed
      otlphttp:
        auth:
          authenticator: headers_setter
        logs_endpoint: http://logmixer-ingest:9201/v1/otlp
        retry_on_failure:
          enabled: false
        sending_queue:
          enabled: false
        timeout: 5s
    extensions:
      headers_setter:
        headers:
          - from_context: lightstep-org-id
            key: lightstep-org-id
          - from_context: lightstep-org-name
            key: lightstep-org-name
          - from_context: lightstep-project-id
            key: lightstep-project-id
          - from_context: lightstep-project-name
            key: lightstep-project-name
          - from_context: log-database-id
            key: log-database-id
      health_check:
        check_collector_pipeline:
          enabled: false
          exporter_failure_threshold: 5
          interval: 5m
        endpoint: 0.0.0.0:13133
        path: /
      lightsteptelemetry:
        lightstep:
          lightstep:
            access_token: ${LS_TOKEN}
            collector:
              address: kube-otel-stack-internal-collector:4317
              plaintext: true
          metrics:
            opentelemetry:
              access_token: ${LS_TOKEN}
              enable_otlp: true
              grpc:
                endpoint:
                  address: kube-otel-stack-internal-collector:4317
                  plaintext: true
              tags_from_request_headers:
                log_database_id: log-database-id
                org_name: lightstep-org-name
                organization_id: lightstep-org-id
                project_id: lightstep-project-id
                project_name: lightstep-project-name
      pprof: null
    service:
      extensions:
        - lightsteptelemetry
        - health_check
        - pprof
        - headers_setter
      telemetry:
        metrics:
          level: detailed
        resource:
          service.name: null
      pipelines:
        logs:
          exporters:
            - otlphttp
          processors: []
          receivers:
            - otelarrow
            - otlp

Environment

GKE

Additional context

rafaeljesus added the bug label on Jul 4, 2024
@mx-psi (Member) commented Jul 5, 2024

The configuration you shared does not have any component that modifies your telemetry, so the Collector is not at fault here. The bug is likely in the SDK or client that produces the OTLP data. If your SDK/client supports some kind of stdout exporter, you can verify this by enabling it and checking what data comes out of your SDK/client. Can you check this so we can confirm the issue is not in the Collector?

@rafaeljesus (Author)

Our suspicion (cc @jaronoff97) is that this has to do with the otlp receiver not properly converting between gRPC and HTTP status codes. You can see in the picture below that the exporter/otlphttp/logs span is actually receiving the 400, but it is not being propagated back up through the stack.

Screenshot 2024-07-05 at 13 48 15

The exporter/otlphttp/logs error comes from:

return plogotlp.NewExportResponse(), errors.GetStatusFromError(err)
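
For reference, a minimal sketch of the HTTP-to-gRPC direction that this call implies, not the collector's actual GetStatusFromError/NewStatusFromMsgAndHTTPCode implementation; only the 400 -> codes.InvalidArgument case is confirmed later in this thread, the other mappings are assumptions:

package sketch

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// statusFromHTTP is a stand-in for the exporter-side conversion: the HTTP
// status returned by the logging backend (here logmixer-ingest) is mapped to
// a gRPC status so the error can be handed back through the collector pipeline.
func statusFromHTTP(msg string, httpCode int) *status.Status {
	switch httpCode {
	case http.StatusBadRequest: // 400, the code the backend actually returned
		return status.New(codes.InvalidArgument, msg)
	case http.StatusTooManyRequests: // assumed mapping
		return status.New(codes.ResourceExhausted, msg)
	default: // everything else treated as an internal error (assumed)
		return status.New(codes.Internal, msg)
	}
}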

@rafaeljesus (Author)

It looks like this was closed by mistake.

rafaeljesus reopened this on Jul 5, 2024
@TylerHelmuth (Member) commented Jul 6, 2024

@rafaeljesus Please try upgrading to v0.99.0 or higher. There was a fix to properly propagate error codes.

@rafaeljesus (Author) commented Jul 9, 2024

After inspection, we are already on v0.99.0.

It looks like the receiver always sets 500 regardless of what Export returns as an error:

writeError(resp, enc, err, http.StatusInternalServerError)

@rafaeljesus (Author) commented Jul 9, 2024

And my guess is that the permanent 400 error is falling through here:

statusCode = errors.GetHTTPStatusCodeFromStatus(s)

which does not convert InvalidArgument back to BadRequest:

func GetHTTPStatusCodeFromStatus(s *status.Status) int {

Does that make sense?
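
To illustrate that guess, a hedged sketch of a GetHTTPStatusCodeFromStatus-style mapping, not the actual collector source; the exact set of handled codes is an assumption, only the missing InvalidArgument -> BadRequest conversion and the 500 fallback are taken from this thread:

package sketch

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// getHTTPStatusCodeFromStatus mimics the shape of the helper named above,
// but the body is illustrative: only a couple of gRPC codes are mapped back
// to HTTP, so codes.InvalidArgument (the backend's 400) is not handled and
// falls through to the 500 fallback the receiver passes to writeError.
func getHTTPStatusCodeFromStatus(s *status.Status) int {
	switch s.Code() {
	case codes.ResourceExhausted:
		return http.StatusTooManyRequests // assumed mapping
	case codes.Unavailable:
		return http.StatusServiceUnavailable // assumed mapping
	default:
		// No case for codes.InvalidArgument: the original 400 is lost and
		// the client sees http.StatusInternalServerError instead.
		return http.StatusInternalServerError
	}
}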

@TylerHelmuth (Member)

@rafaeljesus thank you for the detailed issue. I think I see why this is happening.

The OTLP exporters use a method, NewStatusFromMsgAndHTTPCode, that converts received HTTP status codes to gRPC Statuses for internal collector use. That method does an http.StatusBadRequest -> codes.InvalidArgument conversion.

Then, in the otlpreceiver's HTTP handling, we convert back from the gRPC Status to an HTTP error code. As you've pointed out, that reverse conversion does not map codes.InvalidArgument back to http.StatusBadRequest.

This is definitely a bug; I'll get a fix open.
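
A rough sketch of what such a fix looks like end to end; this is illustrative only and may differ from the actual change in #10574. The httpToGRPC/grpcToHTTP names are placeholders for the collector helpers discussed above:

package main

import (
	"fmt"
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// httpToGRPC models the exporter-side conversion described above
// (http.StatusBadRequest -> codes.InvalidArgument); other codes are omitted.
func httpToGRPC(msg string, httpCode int) *status.Status {
	if httpCode == http.StatusBadRequest {
		return status.New(codes.InvalidArgument, msg)
	}
	return status.New(codes.Internal, msg)
}

// grpcToHTTP models the receiver-side conversion with the missing case added,
// so a 400 from the backend survives the round trip instead of becoming a 500.
func grpcToHTTP(s *status.Status) int {
	switch s.Code() {
	case codes.InvalidArgument:
		return http.StatusBadRequest // the previously missing mapping
	case codes.ResourceExhausted:
		return http.StatusTooManyRequests
	case codes.Unavailable:
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	s := httpToGRPC("backend rejected the payload", http.StatusBadRequest)
	fmt.Println(grpcToHTTP(s)) // prints 400, not 500
}

With the extra case, the backend's 400 is reported back to the client as 400 and can be recorded accordingly on the server span.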

@TylerHelmuth (Member)

Once #9041 lands, we'll also have a non-lossy way to transmit HTTP status codes through the consumer pipeline.
