Opened on Apr 18, 2024
What's wrong?
Periodically, grafana-agent pods stop sending logs to Loki and have to be restarted before they resume sending logs.
Steps to reproduce
Occurs sporadically, usually on pods with a high log volume.
System information
EKS 1.28
Software version
Grafana Agent 0.39.1, Helm chart 0.31.0
Configuration
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: grafana-agent
spec:
  releaseName: grafana-agent
  chart:
    spec:
      chart: grafana-agent
      sourceRef:
        kind: HelmRepository
        name: artifactory-helm-repo
        namespace: flux-system
      version: "0.31.0"
  interval: 1h0m0s
  values:
    apiVersion: v1
    ## Global properties for image pulling override the values defined under `image.registry` and `configReloader.image.registry`.
    ## If you want to override only one image registry, use the specific fields but if you want to override them all, use `global.image.registry`
    global:
      image:
        registry: jfrog
        pullSecrets:
          - regcred
      # -- Security context to apply to the Grafana Agent pod.
      podSecurityContext: {}
    crds:
      # -- Whether to install CRDs for monitoring.
      create: true
    # Various agent settings.
    configReloader:
      # -- Enables automatically reloading when the agent config changes.
      enabled: true
      image:
        # -- Tag of image to use for config reloading.
        tag: v0.8.0
    agent:
      # -- Mode to run Grafana Agent in. Can be "flow" or "static".
      mode: 'flow'
      configMap:
        # -- Create a new ConfigMap for the config file.
        create: false
      clustering:
        # -- Deploy agents in a cluster to allow for load distribution. Only
        # applies when agent.mode=flow.
        enabled: false
      # -- Enables sending Grafana Labs anonymous usage stats to help improve Grafana
      # Agent.
      enableReporting: false
    image:
      tag: v0.39.0
    controller:
      # -- Type of controller to use for deploying Grafana Agent in the cluster.
      # Must be one of 'daemonset', 'deployment', or 'statefulset'.
      type: 'daemonset'
      # -- Number of pods to deploy. Ignored when controller.type is 'daemonset'.
      #replicas: 4
      # -- Annotations to add to controller.
      extraAnnotations: {}
      autoscaling:
        # -- Creates a HorizontalPodAutoscaler for controller type deployment.
        enabled: false
        # -- The lower limit for the number of replicas to which the autoscaler can scale down.
        minReplicas: 1
        # -- The upper limit for the number of replicas to which the autoscaler can scale up.
        maxReplicas: 5
        # -- Average CPU utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetCPUUtilizationPercentage` to 0 will disable CPU scaling.
        targetCPUUtilizationPercentage: 0
        # -- Average Memory utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetMemoryUtilizationPercentage` to 0 will disable Memory scaling.
        targetMemoryUtilizationPercentage: 80
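Note that agent.configMap.create is false, so the Flow configuration the agent actually runs comes from an existing ConfigMap and is not included above. Based on the component IDs that appear in the agent logs below (discovery.relabel.pod_logs, discovery.relabel.filtered_pod_logs, loki.source.kubernetes.pod_logs, loki.write.grafana_cloud_loki), the pipeline is presumably along the lines of the following sketch; the discovery setup, relabel rules, and the push URL path are assumptions for illustration, not the real config:

// Minimal sketch of the Flow pipeline implied by the log output; rules and URL
// path are illustrative placeholders.
discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  // Example rule only; the actual rules are not shown in this issue.
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
}

discovery.relabel "filtered_pod_logs" {
  targets = discovery.relabel.pod_logs.output
  // ... further relabel rules (not shown in this issue) ...
}

// Tails pod logs through the Kubernetes API rather than from files on disk.
loki.source.kubernetes "pod_logs" {
  targets    = discovery.relabel.filtered_pod_logs.output
  forward_to = [loki.write.grafana_cloud_loki.receiver]
}

loki.write "grafana_cloud_loki" {
  endpoint {
    // Host taken from the log lines below; the /loki/api/v1/push path is assumed.
    url = "https://logs.xtops.ue1.eexchange.com/loki/api/v1/push"
  }
}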
Logs
From Grafana Agent:
Wait returned an error: context canceled"
2024-04-11 19:13:13.757 ts=2024-04-11T23:13:13.75746748Z level=info msg="tailer exited" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs
2024-04-11 19:13:13.757 ts=2024-04-11T23:13:13.757432913Z level=warn msg="tailer stopped; will retry" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs err="client rate limiter Wait returned an error: context canceled"
2024-04-11 19:13:13.734 ts=2024-04-11T23:13:13.734468227Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.filtered_pod_logs duration=5.612367ms
2024-04-11 19:13:13.728 ts=2024-04-11T23:13:13.728808151Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.pod_logs duration=15.128878ms
2024-04-11 19:13:13.565 ts=2024-04-11T23:13:13.565594699Z level=warn msg="could not determine if container terminated; will retry tailing" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs err="pods \"fc-core-6bb8cc4995-wsp2d\" not found"
2024-04-11 19:13:13.364 ts=2024-04-11T23:13:13.364645639Z level=warn msg="tailer stopped; will retry" target=apps-mmf2/fc-core-6bb8cc4995-dfw7x:fc-core component=loki.source.kubernetes.pod_logs err="pods \"fc-core-6bb8cc4995-dfw7x\" not found"
2024-04-11 19:13:13.277 ts=2024-04-11T23:13:13.277872904Z level=info msg="opened log stream" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:13.246Z
2024-04-11 19:13:13.245 ts=2024-04-11T23:13:13.245083471Z level=info msg="opened log stream" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:13.215Z
2024-04-11 19:13:13.243 ts=2024-04-11T23:13:13.243761946Z level=warn msg="tailer stopped; will retry" target=apps-mmf2/fc-core-6bb8cc4995-dfw7x:fc-core component=loki.source.kubernetes.pod_logs err="pods \"fc-core-6bb8cc4995-dfw7x\" not found"
2024-04-11 19:13:13.214 ts=2024-04-11T23:13:13.214541615Z level=info msg="opened log stream" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:13.187Z
2024-04-11 19:13:13.186 ts=2024-04-11T23:13:13.18667988Z level=info msg="opened log stream" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:12.313Z
2024-04-11 19:13:11.100 ts=2024-04-11T23:13:11.100140922Z level=error msg="final error sending batch" component=loki.write.grafana_cloud_loki component=client host=logs.xtops.ue1.eexchange.com status=400 tenant="" error="server returned HTTP status 400 Bad Request (400): entry for stream '{cluster=\"ufdc-eks01-1-28\", container=\"fc-core-adm\", env=\"eks-uat\", instance=\"apps-mmf2/fc-core-adm-68647f484b-wbxb9:fc-core-adm\", job=\"apps-mmf2/fc-core-adm-68647f484b-wbxb9\", namespace=\"apps-mmf2\", pod=\"fc-core-adm-68647f484b-wbxb9\", system=\"fc\"}' has timestamp too old: 2024-04-04T14:33:35Z, oldest acceptable timestamp is: 2024-04-04T23:13:11Z"
2024-04-11 19:13:08.767 ts=2024-04-11T23:13:08.767149881Z level=info msg="finished node evaluation" controller_id="" node_id=loki.source.kubernetes.pod_logs duration=32.254832ms
2024-04-11 19:13:08.734 ts=2024-04-11T23:13:08.734838869Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.filtered_pod_logs duration=6.112792ms
2024-04-11 19:13:08.728 ts=2024-04-11T23:13:08.728672495Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.pod_logs duration=15.306976ms
2024-04-11 19:13:06.588 ts=2024-04-11T23:13:06.588374162Z level=info msg="opened log stream" target=apps-etf2/fc-etf-core-5d8645898-xrzzv:fc-etf-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:06.562Z
2024-04-11 19:13:06.563 ts=2024-04-11T23:13:06.562995551Z level=warn msg="tailer stopped; will retry" target=apps-etf2/fc-etf-core-5d8645898-xrzzv:fc-etf-core component=loki.source.kubernetes.pod_logs err="http2: response body closed"
2024-04-11 19:13:06.563 ts=2024-04-11T23:13:06.562911538Z level=info msg="have not seen a log line in 3x average time between lines, closing and re-opening tailer" target=apps-etf2/fc-etf-core-5d8645898-xrzzv:fc-etf-core component=loki.source.kubernetes.pod_logs rolling_average=2s time_since_last=6.476935385s
From a pod:
2024-04-11 19:13:13.230 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.230 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.211 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.211 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.211 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.207 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.205 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.201 failed to try resolving symlinks in path "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": lstat /var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log: no such file or directory
2024-04-11 19:13:13.179 failed to watch file "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": no such file or directory
2024-04-11 19:13:13.178 failed to watch file "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": no such file or directory
2024-04-11 19:13:13.177 failed to watch file "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": no such file or directory
2024-04-11 19:13:13.177 failed to watch file "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": no such file or directory