Describe the bug
The cluster goes down when it should not.
To Reproduce
Using Loki 2.1.0.
The initial setup is 2 monolithic Loki 2.1.0 instances running with `replication_factor: 2` (see the config sketch at the end of this section).
I add 2 nodes to the cluster; they all show `ACTIVE` when looking at `/ring`.
I remove the first 2 nodes. They first show as `LEAVING`, then they go `Unhealthy`.
They never leave this state (I could not find a relevant config option for this).
At this point the cluster is down. Reads and writes fail with something like:
level=warn ts=2021-02-19T14:42:46.766880514Z caller=logging.go:71 traceID=44198a5667db211f msg="POST /loki/api/v1/push (500) 147.959µs Response: \"at least 3 live replicas required, could only find 2\\n\"
Forgetting a single `Unhealthy` node using the `/ring` page buttons is enough to recover.
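
For reference, a minimal monolithic configuration along these lines reproduces the setup. This is a sketch under assumptions, not the exact config in use: it assumes memberlist as the ring KV store, boltdb-shipper with filesystem storage, and hypothetical host names `loki-node-1`/`loki-node-2`.

```yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist        # assumed; consul/etcd would be configured here instead
      replication_factor: 2      # the setting mentioned above

memberlist:
  join_members:                  # hypothetical host names for the initial nodes
    - loki-node-1
    - loki-node-2

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    shared_store: filesystem
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks
```

In this sketch the 2 new nodes would join the same memberlist cluster (added to `join_members`) before the original pair is removed.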
Expected behavior
2 `ACTIVE` nodes are sufficient for the cluster to be healthy, so the cluster should not be down when this condition is met.
`Unhealthy` nodes should leave the ring after some configurable timeout.
Environment:
- Infrastructure: ECS
- Deployment tool: Terraform