promtail terminates when loki returns a 500 error from back-pressure #89

Closed

Description

It appears that promtail terminates (and its pod restarts) when it receives a 500 error from the Loki server:
"Error sending batch: Error doing write: 500 - 500 Internal Server Error"
According to a discussion on Slack, this occurs when the remote end is overloaded. Should the server return a more specific status instead, such as a 503 "slow down" error?

Perhaps back-pressure from the remote end should be expected and handled by promtail, by retrying the request with a capped exponential backoff with jitter.
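To make the retry suggestion concrete, here is a minimal sketch of capped exponential backoff with full jitter. It is not promtail's actual client code: `sendWithRetry`, `backoffCeiling`, `retryable`, and the `send` callback are hypothetical names, and treating 429 and all 5xx responses as retryable is my assumption.

```go
package main

import (
	"errors"
	"math/rand"
	"net/http"
	"time"
)

var (
	errRejected = errors.New("batch rejected with a non-retryable status")
	errGaveUp   = errors.New("batch not accepted after all retries")
)

// backoffCeiling returns min(max, base * 2^attempt) for a 0-based attempt.
// Assumes base > 0 and max > 0.
func backoffCeiling(attempt int, base, max time.Duration) time.Duration {
	ceiling := base
	for i := 0; i < attempt && ceiling < max; i++ {
		ceiling *= 2
	}
	if ceiling > max {
		ceiling = max
	}
	return ceiling
}

// retryable reports whether a status code looks like back-pressure or a
// transient server fault (assumption: 429 and any 5xx).
func retryable(status int) bool {
	return status == http.StatusTooManyRequests || (status >= 500 && status <= 599)
}

// sendWithRetry calls send (a hypothetical function that pushes one batch and
// returns the HTTP status) until it succeeds, hits a non-retryable status, or
// has retried maxRetries times. Transport errors are retried as well.
func sendWithRetry(send func() (int, error), maxRetries int, base, max time.Duration) error {
	for attempt := 0; ; attempt++ {
		status, err := send()
		if err == nil {
			if status >= 200 && status < 300 {
				return nil
			}
			if !retryable(status) {
				return errRejected
			}
		}
		if attempt >= maxRetries {
			return errGaveUp
		}
		// Full jitter: sleep a uniformly random duration in [0, ceiling].
		ceiling := backoffCeiling(attempt, base, max)
		time.Sleep(time.Duration(rand.Int63n(int64(ceiling) + 1)))
	}
}
```

With something like this, a 500 from an overloaded server would delay the batch instead of terminating the process. The jitter spreads retries from many promtail instances so they do not all hit the server at the same moment.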

Additionally, promtail could expose a metric indicating its consumer lag, that is, the delta between the current head of the log file and what it has successfully processed and sent to the remote server. That metric could be used in AlertManager to warn when there is a danger of losing logs (for example, in Kubernetes, nodes automatically rotate and delete log files as they grow).
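As a rough illustration of the lag metric, the sketch below measures lag in bytes as the file's current size minus the last offset the server acknowledged. Everything here is hypothetical (`lagTracker`, `Ack`, `Lag`, `lagBytes`), it uses only the standard library, and the rule "a file smaller than the acknowledged offset has been rotated or truncated" is an assumption. In practice the value would presumably be exported as a per-file gauge for Prometheus to scrape.

```go
package main

import (
	"os"
	"sync"
)

// lagBytes returns how many bytes of a file have not been acknowledged yet.
// Assumption: if the file is now smaller than the acknowledged offset, it was
// rotated or truncated, so its entire current content counts as unsent.
func lagBytes(size, acked int64) int64 {
	if size < acked {
		return size
	}
	return size - acked
}

// lagTracker remembers, per log file, the last byte offset that the remote
// server confirmed it received.
type lagTracker struct {
	mu    sync.Mutex
	acked map[string]int64
}

func newLagTracker() *lagTracker {
	return &lagTracker{acked: make(map[string]int64)}
}

// Ack records that everything up to offset in path was sent successfully.
func (t *lagTracker) Ack(path string, offset int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.acked[path] = offset
}

// Lag stats the file and returns its current consumer lag in bytes.
func (t *lagTracker) Lag(path string) (int64, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	t.mu.Lock()
	acked := t.acked[path]
	t.mu.Unlock()
	return lagBytes(info.Size(), acked), nil
}
```

An alert could then fire when the lag for any file keeps growing over several minutes, which would indicate that the file may be rotated away before promtail catches up.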
