Opened on May 10, 2024
Background
The ingester runs as a BasicService and moves to the services.Running state after the starting() function completes.
As part of its starting() function, the ingester starts a ring.Lifecycler. Once started, the lifecycler auto-joins the ring and moves the ingester's ring state to ring.ACTIVE as soon as it can.
The Problem
- Once an ingester's ring state is ring.ACTIVE, it becomes available for read requests.
- When the ingester service is not in the services.Running state, the ingester will reject read requests.
Because of the above, starting the Lifecycler essentially starts a countdown for the ingester service to reach the services.Running state. If the ingester's starting() function is still being executed when the ring state becomes ring.ACTIVE, the ingester will start receiving read requests, but will reject them all with the error "ingester is unavailable (current state: Starting)".
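The mismatch between the two states can be illustrated with a small, self-contained toy model. This is only a sketch: the `Ingester` type, its `Routable` and `Read` methods, and the state constants are hypothetical stand-ins, not the real ingester or ring types.

```go
package main

import "fmt"

// ServiceState and RingState are hypothetical stand-ins for the
// service and ring states named in this issue.
type ServiceState string
type RingState string

const (
	Starting ServiceState = "Starting"
	Running  ServiceState = "Running"

	Pending RingState = "PENDING"
	Active  RingState = "ACTIVE"
)

// Ingester is a toy model that tracks the two states independently,
// which is what makes the problematic window possible.
type Ingester struct {
	Service ServiceState
	Ring    RingState
}

// Routable reports whether readers would pick this instance from the ring.
func (i Ingester) Routable() bool { return i.Ring == Active }

// Read rejects the request unless the service has reached Running.
func (i Ingester) Read() error {
	if i.Service != Running {
		return fmt.Errorf("ingester is unavailable (current state: %s)", i.Service)
	}
	return nil
}

func main() {
	// The lifecycler has already switched the ring state to ACTIVE,
	// but starting() has not returned yet.
	ing := Ingester{Service: Starting, Ring: Active}
	if ing.Routable() {
		fmt.Println(ing.Read())
	}
}
```

In this model the instance is routable and rejecting at the same time, which is the window described above.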
This isn't much of an issue if a single ingester enters this state, since reads are able to complete using other zones to achieve quorum. However, when ingesters are scaled up horizontally, instances are added to all zones at the same time. If instances in multiple zones are rejecting reads while in the services.Starting state, quorum can't be achieved, and we suffer a read outage.
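The quorum argument can be made concrete with a short sketch. It assumes three zones with one replica each and a simple majority rule; neither the zone count nor the exact quorum formula is stated in this issue, and `quorumReached` is a hypothetical helper.

```go
package main

import "fmt"

// quorumReached reports whether a read can still succeed.
// zoneHealthy[i] is false when the replica in zone i rejects the read.
// The majority rule (n/2 + 1) is an assumption made for illustration.
func quorumReached(zoneHealthy []bool) bool {
	healthy := 0
	for _, ok := range zoneHealthy {
		if ok {
			healthy++
		}
	}
	return healthy >= len(zoneHealthy)/2+1
}

func main() {
	// One zone has an instance stuck in Starting: the other two zones suffice.
	fmt.Println(quorumReached([]bool{true, true, false})) // true

	// A scale-up added instances to every zone and two are still Starting.
	fmt.Println(quorumReached([]bool{true, false, false})) // false
}
```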
Solution
Ideally, moving the ring state to ring.ACTIVE should be the last thing done in the ingester's starting() function (or the first thing done in its running() function). No other code should run between those two events.
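The desired ordering could look roughly like the sketch below. It is a toy model rather than a proposed implementation: `instance`, its `starting` method, the string ring states, and `errInitFailed` are all hypothetical, and the real change would depend on what the lifecycler allows.

```go
package main

import (
	"errors"
	"fmt"
)

// errInitFailed is a hypothetical error used to simulate a failed init step.
var errInitFailed = errors.New("init step failed")

// instance is a toy model of an ingester with an explicit ring state.
type instance struct {
	ringState string
}

// starting runs every initialization step first and switches the ring
// state to ACTIVE only as its final action, so nothing else runs between
// becoming ACTIVE and returning from starting().
func (i *instance) starting(initSteps ...func() error) error {
	for _, step := range initSteps {
		if err := step(); err != nil {
			return err // the instance never becomes ACTIVE on failure
		}
	}
	i.ringState = "ACTIVE"
	return nil
}

func main() {
	inst := &instance{ringState: "PENDING"}
	slowInit := func() error {
		// While initialization is in progress the instance is not yet ACTIVE,
		// so it receives no reads that it would have to reject.
		fmt.Println("during init, ring state:", inst.ringState)
		return nil
	}
	err := inst.starting(slowInit)
	fmt.Println("after starting:", inst.ringState, "error:", err)

	failed := &instance{ringState: "PENDING"}
	err = failed.starting(func() error { return errInitFailed })
	fmt.Println("after failed starting:", failed.ringState, "error:", err)
}
```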
Unfortunately, the existing ring.Lifecycler used by the ingester doesn't offer much control over when the switch to ring.ACTIVE occurs, since it auto-joins the ring.