
"too many failed ingesters" using memberlist #3360

Closed
github-vincent-miszczak/loki
#2
@github-vincent-miszczak

Description

Describe the bug
The cluster goes down when it should not.

To Reproduce
Using Loki 2.1.0.
The initial setup is 2 monolithic Loki 2.1.0 instances running with replication_factor: 2.
I add 2 nodes to the cluster; all 4 nodes show as ACTIVE on /ring.
I remove the first 2 nodes. They first show as LEAVING, then become Unhealthy.
They never leave this state (I could not find a relevant config option).
At this point the cluster is down. Reads and writes fail with something like:
level=warn ts=2021-02-19T14:42:46.766880514Z caller=logging.go:71 traceID=44198a5667db211f msg="POST /loki/api/v1/push (500) 147.959µs Response: \"at least 3 live replicas required, could only find 2\\n\"

Forgetting a single Unhealthy node using the buttons on the /ring page is enough to recover.
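
The numbers in the error message, and the fact that forgetting one node is enough, look consistent with a quorum computed over every ring entry rather than over the ACTIVE ones only. Below is a minimal sketch of that arithmetic. It assumes the ring requires floor(n/2) + 1 live replicas, where n is the larger of replication_factor and the number of instances in the replica set, and that Unhealthy entries still count toward n. Both are my assumptions based on the observed behavior, not something confirmed from the Loki source; `min_live_replicas` and `check_quorum` are hypothetical names used only for illustration.

```python
def min_live_replicas(replication_factor: int, replica_set_size: int) -> int:
    """Assumed quorum rule: floor(n / 2) + 1, where n is the larger of the
    configured replication factor and the replica set size."""
    n = max(replication_factor, replica_set_size)
    return n // 2 + 1


def check_quorum(replication_factor: int, states: list[str]) -> str:
    """Return "ok" or an error string shaped like the one in the log above.

    Assumption: every ring entry counts toward the quorum size, but only
    ACTIVE entries count as live.
    """
    live = sum(1 for state in states if state == "ACTIVE")
    required = min_live_replicas(replication_factor, len(states))
    if live < required:
        return f"at least {required} live replicas required, could only find {live}"
    return "ok"


# Ring after removing the first 2 nodes: 2 ACTIVE + 2 Unhealthy.
before_forget = ["ACTIVE", "ACTIVE", "Unhealthy", "Unhealthy"]
# Ring after forgetting a single Unhealthy node.
after_forget = ["ACTIVE", "ACTIVE", "Unhealthy"]

print(check_quorum(2, before_forget))
print(check_quorum(2, after_forget))
```

Under these assumptions, 4 ring entries require 3 live replicas (only 2 are available), while 3 entries require 2, which would explain why forgetting one Unhealthy node recovers the cluster.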

Expected behavior
2 ACTIVE nodes are sufficient for the cluster to be healthy, so the cluster should not go down when this condition is met.
Unhealthy nodes should leave the ring after some configurable timeout.

Environment:

  • Infrastructure: ECS
  • Deployment tool: Terraform

Screenshots, Promtail config, or terminal output
image
image
