
Grafana Agent (Flow Mode) stops sending logs to Loki #6868

Open

Description

What's wrong?

Periodically, grafana-agent pods stop sending logs to Loki and must be restarted before they resume sending logs.

Steps to reproduce

Occurs sporadically, usually on pods that log heavily.

System information

EKS 1.28

Software version

Grafana Agent 0.39.1, Helm chart 0.31.0

Configuration

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: grafana-agent
spec:
  releaseName: grafana-agent
  chart:
    spec:
      chart: grafana-agent
      sourceRef:
        kind: HelmRepository
        name: artifactory-helm-repo
        namespace: flux-system
      version: "0.31.0"
  interval: 1h0m0s
  values:
    apiVersion: v1
    ## Global properties for image pulling override the values defined under `image.registry` and `configReloader.image.registry`.
    ## If you want to override only one image registry, use the specific fields but if you want to override them all, use `global.image.registry`
    global:
      image:
        registry: jfrog
      pullSecrets:
        - regcred

      # -- Security context to apply to the Grafana Agent pod.
      podSecurityContext: {}

    crds:
      # -- Whether to install CRDs for monitoring.
      create: true

    # Various agent settings.
    configReloader:
      # -- Enables automatically reloading when the agent config changes.
      enabled: true
      image:
        # -- Tag of image to use for config reloading.
        tag: v0.8.0
    agent:
      # -- Mode to run Grafana Agent in. Can be "flow" or "static".
      mode: 'flow'
      configMap:
        # -- Create a new ConfigMap for the config file.
        create: false

      clustering:
        # -- Deploy agents in a cluster to allow for load distribution. Only
        # applies when agent.mode=flow.
        enabled: false

      # -- Enables sending Grafana Labs anonymous usage stats to help improve Grafana
      # Agent.
      enableReporting: false
    image:
      tag: v0.39.0
    controller:
      # -- Type of controller to use for deploying Grafana Agent in the cluster.
      # Must be one of 'daemonset', 'deployment', or 'statefulset'.
      type: 'daemonset'

      # -- Number of pods to deploy. Ignored when controller.type is 'daemonset'.
      #replicas: 4

      # -- Annotations to add to controller.
      extraAnnotations: {}

      autoscaling:
        # -- Creates a HorizontalPodAutoscaler for controller type deployment.
        enabled: false
        # -- The lower limit for the number of replicas to which the autoscaler can scale down.
        minReplicas: 1
        # -- The upper limit for the number of replicas to which the autoscaler can scale up.
        maxReplicas: 5
        # -- Average CPU utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetCPUUtilizationPercentage` to 0 will disable CPU scaling.
        targetCPUUtilizationPercentage: 0
        # -- Average Memory utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetMemoryUtilizationPercentage` to 0 will disable Memory scaling.
        targetMemoryUtilizationPercentage: 80
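
Since agent.configMap.create is false in these values, the actual Flow pipeline lives in a separately managed ConfigMap and is not shown here. Based on the component IDs that appear in the agent logs below (discovery.relabel.pod_logs, discovery.relabel.filtered_pod_logs, loki.source.kubernetes.pod_logs, loki.write.grafana_cloud_loki), the River configuration presumably looks roughly like the following sketch; the discovery source, relabel rules, and endpoint URL are placeholders, not the cluster's actual configuration.

// Hypothetical reconstruction of the Flow pipeline; only the component names
// are taken from the log lines below, everything else is a placeholder.
discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  // Placeholder rule; the real config also sets cluster/env/system/etc. labels.
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
}

discovery.relabel "filtered_pod_logs" {
  targets = discovery.relabel.pod_logs.output
  // Placeholder; presumably drops unwanted targets before tailing.
}

loki.source.kubernetes "pod_logs" {
  targets    = discovery.relabel.filtered_pod_logs.output
  forward_to = [loki.write.grafana_cloud_loki.receiver]
}

loki.write "grafana_cloud_loki" {
  endpoint {
    // Placeholder push URL.
    url = "https://<loki-host>/loki/api/v1/push"
  }
}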

Logs

From Grafana Agent....

"client rate limiter Wait returned an error: context canceled"

2024-04-11 19:13:13.757	ts=2024-04-11T23:13:13.75746748Z level=info msg="tailer exited" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs

2024-04-11 19:13:13.757	ts=2024-04-11T23:13:13.757432913Z level=warn msg="tailer stopped; will retry" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs err="client rate limiter Wait returned an error: context canceled"

2024-04-11 19:13:13.734	ts=2024-04-11T23:13:13.734468227Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.filtered_pod_logs duration=5.612367ms

2024-04-11 19:13:13.728	ts=2024-04-11T23:13:13.728808151Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.pod_logs duration=15.128878ms

2024-04-11 19:13:13.565	ts=2024-04-11T23:13:13.565594699Z level=warn msg="could not determine if container terminated; will retry tailing" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs err="pods \"fc-core-6bb8cc4995-wsp2d\" not found"

2024-04-11 19:13:13.364	ts=2024-04-11T23:13:13.364645639Z level=warn msg="tailer stopped; will retry" target=apps-mmf2/fc-core-6bb8cc4995-dfw7x:fc-core component=loki.source.kubernetes.pod_logs err="pods \"fc-core-6bb8cc4995-dfw7x\" not found"

2024-04-11 19:13:13.277	ts=2024-04-11T23:13:13.277872904Z level=info msg="opened log stream" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:13.246Z

2024-04-11 19:13:13.245	ts=2024-04-11T23:13:13.245083471Z level=info msg="opened log stream" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:13.215Z

2024-04-11 19:13:13.243	ts=2024-04-11T23:13:13.243761946Z level=warn msg="tailer stopped; will retry" target=apps-mmf2/fc-core-6bb8cc4995-dfw7x:fc-core component=loki.source.kubernetes.pod_logs err="pods \"fc-core-6bb8cc4995-dfw7x\" not found"

2024-04-11 19:13:13.214	ts=2024-04-11T23:13:13.214541615Z level=info msg="opened log stream" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:13.187Z

2024-04-11 19:13:13.186	ts=2024-04-11T23:13:13.18667988Z level=info msg="opened log stream" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:12.313Z

2024-04-11 19:13:11.100	ts=2024-04-11T23:13:11.100140922Z level=error msg="final error sending batch" component=loki.write.grafana_cloud_loki component=client host=logs.xtops.ue1.eexchange.com status=400 tenant="" error="server returned HTTP status 400 Bad Request (400): entry for stream '{cluster=\"ufdc-eks01-1-28\", container=\"fc-core-adm\", env=\"eks-uat\", instance=\"apps-mmf2/fc-core-adm-68647f484b-wbxb9:fc-core-adm\", job=\"apps-mmf2/fc-core-adm-68647f484b-wbxb9\", namespace=\"apps-mmf2\", pod=\"fc-core-adm-68647f484b-wbxb9\", system=\"fc\"}' has timestamp too old: 2024-04-04T14:33:35Z, oldest acceptable timestamp is: 2024-04-04T23:13:11Z"

2024-04-11 19:13:08.767	ts=2024-04-11T23:13:08.767149881Z level=info msg="finished node evaluation" controller_id="" node_id=loki.source.kubernetes.pod_logs duration=32.254832ms

2024-04-11 19:13:08.734	ts=2024-04-11T23:13:08.734838869Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.filtered_pod_logs duration=6.112792ms

2024-04-11 19:13:08.728	ts=2024-04-11T23:13:08.728672495Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.pod_logs duration=15.306976ms

2024-04-11 19:13:06.588	ts=2024-04-11T23:13:06.588374162Z level=info msg="opened log stream" target=apps-etf2/fc-etf-core-5d8645898-xrzzv:fc-etf-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:06.562Z

2024-04-11 19:13:06.563	ts=2024-04-11T23:13:06.562995551Z level=warn msg="tailer stopped; will retry" target=apps-etf2/fc-etf-core-5d8645898-xrzzv:fc-etf-core component=loki.source.kubernetes.pod_logs err="http2: response body closed"

2024-04-11 19:13:06.563	ts=2024-04-11T23:13:06.562911538Z level=info msg="have not seen a log line in 3x average time between lines, closing and re-opening tailer" target=apps-etf2/fc-etf-core-5d8645898-xrzzv:fc-etf-core component=loki.source.kubernetes.pod_logs rolling_average=2s time_since_last=6.476935385s
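
The 400 at 23:13:11 is Loki rejecting an entry older than its acceptance window: the rejected timestamp (2024-04-04T14:33:35Z) falls before the "oldest acceptable timestamp" (2024-04-04T23:13:11Z), which is exactly 7 days before the write time and therefore consistent with Loki's reject_old_samples_max_age at its 168h default. A minimal sketch of the relevant Loki setting, assuming defaults on the Loki side (not taken from this environment's config):

# Hypothetical sketch of the Loki limit involved; the values shown are the
# Loki defaults, not this environment's actual configuration.
limits_config:
  reject_old_samples: true          # reject entries older than the max age below
  reject_old_samples_max_age: 168h  # 7 days, matching the window seen in the 400 above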

From a pod....

2024-04-11 19:13:13.230	unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.230	unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.211	unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.211	unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.211	unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.207	unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.205	unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.201	failed to try resolving symlinks in path "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": lstat /var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log: no such file or directory
2024-04-11 19:13:13.179	failed to watch file "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": no such file or directory
2024-04-11 19:13:13.178	failed to watch file "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": no such file or directory
2024-04-11 19:13:13.177	failed to watch file "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": no such file or directory
2024-04-11 19:13:13.177	failed to watch file "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": no such file or directory

Metadata

Labels

    bug (Something isn't working)
    needs-attention (An issue or PR has been sitting around and needs attention.)