opentelemetry sink batching results in malformed data (status code 400) when max_events is more than 1 #22054

@navodveduth

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

When the opentelemetry sink in Vector sends metrics derived from logs to an OpenTelemetry Collector, requests repeatedly fail with 400 Bad Request. As the title states, this happens when batch.max_events is greater than 1. The errors appear in the Vector agent logs, but the Collector logs show no related errors and no indication that it received a malformed payload. As a result, the Collector does not process the metrics.

Configuration

apiVersion: observability.kaasops.io/v1alpha1
kind: ClusterVectorPipeline
metadata:
  name: log-level-metrics-pipeline
spec:
  sources:
    kubernetes_logs:
      type: kubernetes_logs
      pod_annotation_fields:
        container_image: container_image
        container_name: container_name
        pod_name: pod_name
        pod_namespace: pod_namespace
      fingerprint_lines: 1
      ignore_older_secs: 600

  transforms:
    log_level_tagger:
      type: remap
      inputs:
        - kubernetes_logs
      source: |
        if exists(.message) {
          log_message = string!(.message)
          log_level = "INFO"

          if contains(upcase(log_message), "ERROR") {
            log_level = "ERROR"
          } else if contains(upcase(log_message), "WARN") {
            log_level = "WARN"
          } else if contains(upcase(log_message), "DEBUG") {
            log_level = "DEBUG"
          }

          .log_level = log_level

          .attributes = {
            "log_level": log_level
          }

          if exists(.pod_name) {
            .attributes.pod_name = string!(.pod_name)
          } else {
            .attributes.pod_name = "unknown_pod"
          }

          if exists(.pod_namespace) {
            .attributes.pod_namespace = string!(.pod_namespace)
          } else {
            .attributes.pod_namespace = "unknown_namespace"
          }

          .timestamp = now()
        } else {
          .log_level = "UNKNOWN"
          .attributes = {
            "log_level": "UNKNOWN",
            "pod_name": "unknown_pod",
            "pod_namespace": "unknown_namespace"
          }
        }

    log_to_metric:
      type: log_to_metric
      inputs:
        - log_level_tagger
      metrics:
        - type: counter
          name: log_level_count
          field: log_level
          tags:
            log_level: "{{attributes.log_level}}"
            pod_name: "{{attributes.pod_name}}"
            pod_namespace: "{{attributes.pod_namespace}}"

  sinks:
    otel_collector_sink:
      type: opentelemetry
      inputs:
        - log_to_metric
      protocol:
        type: http
        uri: "http://otel-collector.otel:4318/v1/logs"
        method: post
        encoding:
          codec: json
          framing:
            method: newline_delimited
      batch:
        max_events: 100
        max_bytes: 1048576
        timeout_secs: 10
      retry:
        initial_interval_secs: 1
        max_interval_secs: 30
        max_retries: 5
      healthcheck:
        enabled: true
        interval_secs: 60
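
One possible explanation, not confirmed in this report, is that newline_delimited framing turns a batch of several JSON events into a request body that is no longer a single JSON document. That would be consistent with the 400 appearing only when max_events is greater than 1, and with the dropped batch of count=2 in the debug output. The Python sketch below only illustrates that general property of newline-delimited JSON. The event dictionaries and both helper functions are hypothetical stand-ins, not Vector's actual encoder or output.

```python
import json


def frame_newline_delimited(events):
    """Serialize each event as JSON and join the results with newlines."""
    return "\n".join(json.dumps(event) for event in events)


def is_single_json_document(body):
    """Return True if the whole body parses as exactly one JSON value."""
    try:
        json.loads(body)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical stand-ins for encoded metric events (assumption, not real sink output).
events = [
    {"name": "log_level_count", "tags": {"log_level": "ERROR"}},
    {"name": "log_level_count", "tags": {"log_level": "INFO"}},
]

one_event_body = frame_newline_delimited(events[:1])
two_event_body = frame_newline_delimited(events)

print(is_single_json_document(one_event_body))
print(is_single_json_document(two_event_body))
```

A receiver that expects one JSON document per request would accept the first body and reject the second, which is one way a batch size above 1 could surface as a 400.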

Version

0.43.0

Debug Output

2024-12-18T14:40:25.520217Z ERROR sink{component_kind="sink" component_id=log-level-metrics-pipeline-otel_collector_sink component_type=opentelemetry}:request{request_id=456}: vector::sinks::util::retries: Not retriable; dropping the request. reason="Http status: 400 Bad Request" internal_log_rate_limit=true
2024-12-18T14:40:25.520229Z ERROR sink{component_kind="sink" component_id=log-level-metrics-pipeline-otel_collector_sink component_type=opentelemetry}:request{request_id=456}: vector_common::internal_event::service: Internal log [Service call failed. No retries or retries exhausted.] has been suppressed 4 times.
2024-12-18T14:40:25.520231Z ERROR sink{component_kind="sink" component_id=log-level-metrics-pipeline-otel_collector_sink component_type=opentelemetry}:request{request_id=456}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=None request_id=456 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2024-12-18T14:40:25.520266Z ERROR sink{component_kind="sink" component_id=log-level-metrics-pipeline-otel_collector_sink component_type=opentelemetry}:request{request_id=456}: vector_common::internal_event::component_events_dropped: Internal log [Events dropped] has been suppressed 4 times.
2024-12-18T14:40:25.520268Z ERROR sink{component_kind="sink" component_id=log-level-metrics-pipeline-otel_collector_sink component_type=opentelemetry}:request{request_id=456}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=2 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true
2024-12-18T14:40:26.554810Z ERROR sink{component_kind="sink" component_id=log-level-metrics-pipeline-otel_collector_sink component_type=opentelemetry}:request{request_id=457}: vector::sinks::util::retries: Internal log [Not retriable; dropping the request.] is being suppressed to avoid flooding.
2024-12-18T14:40:26.554830Z ERROR sink{component_kind="sink" component_id=log-level-metrics-pipeline-otel_collector_sink component_type=opentelemetry}:request{request_id=457}: vector_common::internal_event::service: Internal log [Service call failed. No retries or retries exhausted.] is being suppressed to avoid flooding.
2024-12-18T14:40:26.554840Z ERROR sink{component_kind="sink" component_id=log-level-metrics-pipeline-otel_collector_sink component_type=opentelemetry}:request{request_id=457}: vector_common::internal_event::component_events_dropped: Internal log [Events dropped] is being suppressed to avoid flooding.
2024-12-18T14:40:43.994791Z ERROR sink{component_kind="sink" component_id=log-level-metrics-pipeline-otel_collector_sink component_type=opentelemetry}:request{request_id=459}: vector::sinks::util::retries: Internal log [Not retriable; dropping the request.] has been suppressed 2 times.
2024-12-18T14:40:43.994821Z ERROR sink{component_kind="sink" component_id=log-level-metrics-pipeline-otel_collector_sink component_type=opentelemetry}:request{request_id=459}: vector::sinks::util::retries: Not retriable; dropping the request. reason="Http status: 400 Bad Request" internal_log_rate_limit=true
2024-12-18T14:40:43.994858Z ERROR sink{component_kind="sink" component_id=log-level-metrics-pipeline-otel_collector_sink component_type=opentelemetry}:request{request_id=459}: vector_common::internal_event::service: Internal log [Service call failed. No retries or retries exhausted.] has been suppressed 2 times.

Example Data

No response

Additional Context

Both Vector and the OpenTelemetry Collector are running in a cluster. Even with debug logging enabled on the Collector, its logs do not show that it received the payload or encountered any issue. However, when the same payload is sent to the Collector with a curl request, it is logged and processed correctly.
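
For anyone trying to reproduce the manual check, the sketch below builds an equivalent single-document POST in Python instead of curl. The report does not include the payload that was actually sent, so the body here is a hypothetical minimal OTLP/JSON logs document, and `build_request` and `send` are illustrative helper names. Only the URI is taken from the sink configuration above.

```python
import json
import urllib.request

# URI taken from the sink configuration in this issue.
COLLECTOR_URI = "http://otel-collector.otel:4318/v1/logs"

# Hypothetical minimal OTLP/JSON logs payload (assumption: the real payload is not in the report).
payload = {
    "resourceLogs": [
        {
            "scopeLogs": [
                {"logRecords": [{"body": {"stringValue": "test log line"}}]}
            ]
        }
    ]
}


def build_request(uri, document):
    """Build a POST request carrying one JSON document as its body."""
    return urllib.request.Request(
        uri,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5):
    """Send the request and return the HTTP status code (needs network access)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


request = build_request(COLLECTOR_URI, payload)
```

Calling `send(request)` from a pod that can reach the Collector and comparing the result with what Vector gets for a batched request would help narrow down whether the body format is the difference.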

OpenTelemetry collector config:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch: {}
  memory_limiter:
    limit_mib: 1000
    spike_limit_mib: 512
    check_interval: 5s
extensions:
  zpages: {}
exporters:
  logging:
    loglevel: debug
    sampling_initial: 5
    sampling_thereafter: 200
  prometheus:
    endpoint: 0.0.0.0:8889
    metric_expiration: 1m
service:
  extensions: [zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [logging]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [logging, prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [logging, file]

References

No response

Metadata

Labels

  • sink: opentelemetry (Anything `opentelemetry` sink related)
  • type: bug (A code related bug.)
