Description
It appears promtail terminates (and the pod restarts) when it receives a 500 error from the Loki server:
"Error sending batch: Error doing write: 500 - 500 Internal Server Error"
From a discussion on Slack, this occurs when the remote end is overloaded. Possibly this should be a more specific error, such as a 503 "slow down"?
Perhaps back-pressure from the remote end should be expected and handled by promtail, by retrying the request with a capped exponential backoff with jitter.
Additionally, promtail could expose a metric indicating its consumer lag, that is, the delta between the current head of the log file and what it has successfully processed and sent to the remote server. That could be used with Alertmanager to warn when there is a danger of losing logs (for example, in Kubernetes, nodes automatically rotate and delete log files as they grow).