Enhance Query availability when single store-gateway is slow #4871

Open
@alanprot

Description

Cortex only retries fetching a block from a store gateway when the request returns an error; see:

if err != nil {
	if isRetryableError(err) {
		level.Warn(spanLog).Log("err", errors.Wrapf(err, "failed to fetch series from %s due to retryable error", c.RemoteAddress()))
		return nil
	}
	return errors.Wrapf(err, "failed to fetch series from %s", c.RemoteAddress())
}

for attempt := 1; attempt <= maxFetchSeriesAttempts; attempt++ {

This means that if a single store gateway is just slow and does not return an error, the query will eventually time out.
This scenario can happen for multiple reasons, such as a network partition between the store gateway and the storage, or a slow disk.
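
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not Cortex code: `fetchFn`, `fetchWithRetry`, `errRetryable` and the 50ms deadline are hypothetical names and values chosen for illustration. Like the loop quoted above, the retry loop only moves on to the next replica when the current one returns a retryable error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errRetryable stands in for any error isRetryableError would accept (hypothetical).
var errRetryable = errors.New("retryable")

// fetchFn is a hypothetical stand-in for one store-gateway call.
type fetchFn func(ctx context.Context) (string, error)

// fetchWithRetry tries replicas in order and only advances on a retryable error.
func fetchWithRetry(ctx context.Context, replicas []fetchFn) (string, error) {
	lastErr := errors.New("no replicas")
	for _, fetch := range replicas {
		res, err := fetch(ctx)
		if err == nil {
			return res, nil
		}
		if !errors.Is(err, errRetryable) {
			return "", err // non-retryable, including a context timeout
		}
		lastErr = err
	}
	return "", lastErr
}

// slow models a store-gateway that hangs without failing: it blocks until
// the query context expires.
func slow(ctx context.Context) (string, error) {
	<-ctx.Done()
	return "", ctx.Err()
}

// failing models a store-gateway that fails fast with a retryable error.
func failing(ctx context.Context) (string, error) { return "", errRetryable }

// healthy models a store-gateway that answers immediately.
func healthy(ctx context.Context) (string, error) { return "series", nil }

// runScenario runs the retry loop under a short, assumed query deadline.
func runScenario(replicas ...fetchFn) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	return fetchWithRetry(ctx, replicas)
}

func main() {
	// A fast failure is retried on the next replica and succeeds.
	fmt.Println(runScenario(failing, healthy))
	// A slow replica consumes the whole deadline; the healthy one is never tried.
	fmt.Println(runScenario(slow, healthy))
}
```

In the second scenario the query fails with the context deadline error even though a healthy replica was available, which is the availability gap described here.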

In those cases we could:

  • Try to fetch from at least 2 store-gateways in parallel, or
  • Have some mechanism for a store-gateway to advertise that it cannot handle requests (set itself to unhealthy?)
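
The first option could look roughly like the sketch below. It is only an illustration under assumptions, not a proposed patch: `fetchFn` and `fetchFirst` are hypothetical names, and the real querier would need to merge this with its block and replica selection logic. The idea is to send the same request to several replicas at once, take the first successful response, and cancel the rest.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchFn is a hypothetical stand-in for one store-gateway call.
type fetchFn func(ctx context.Context) (string, error)

// fetchFirst queries all given replicas in parallel and returns the first
// successful response. It fails only if every replica fails or ctx expires.
func fetchFirst(ctx context.Context, replicas ...fetchFn) (string, error) {
	if len(replicas) == 0 {
		return "", errors.New("no replicas")
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels the losing requests once a winner returns

	type result struct {
		val string
		err error
	}
	// Buffered so the losing goroutines can send and exit without a reader.
	results := make(chan result, len(replicas))
	for _, fetch := range replicas {
		go func(f fetchFn) {
			v, err := f(ctx)
			results <- result{v, err}
		}(fetch)
	}

	var lastErr error
	for range replicas {
		select {
		case r := <-results:
			if r.err == nil {
				return r.val, nil
			}
			lastErr = r.err
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return "", lastErr
}

// slow models a store-gateway that hangs until the context is done.
func slow(ctx context.Context) (string, error) {
	<-ctx.Done()
	return "", ctx.Err()
}

// healthy models a store-gateway that answers immediately.
func healthy(ctx context.Context) (string, error) { return "series", nil }

// runHedged queries one slow and one healthy replica under an assumed 50ms deadline.
func runHedged() (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	return fetchFirst(ctx, slow, healthy)
}

func main() {
	fmt.Println(runHedged())
}
```

The trade-off is load: every block is fetched from two store-gateways instead of one. A variant that sends the second request only after the first has been outstanding for some delay would reduce that cost, at the price of added latency in the slow case.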
