-
Notifications
You must be signed in to change notification settings - Fork 630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add metric to track number of times a request fails to forward #2420
Conversation
backoff := backoff.New(ctx, backoffConfig) | ||
for backoff.Ongoing() { | ||
if !attemptLoop() { | ||
w.forwardedRequestsFailedStats.Add(1) | ||
backoff.Wait() | ||
} else { | ||
w.forwardedRequestsFailedStats.Set(0) | ||
backoff.Reset() | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if we could distinguish between failed retry attempts and failures, I guess (assuming we want to setup alerting). backoff.Err()
indicates whether we have exhausted all the attempts.
I'm not very sure about resetting the metric with w.forwardedRequestsFailedStats.Set(0)
, as this is a shared object; how does it affect concurrently occurring failures?
// Number of queries that failed to forward to the querier. | ||
forwardedRequestsFailedStats *expvar.Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we should use usagestats for this. Would a prometheus.Counter
be more suitable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree. We use usage stats for the "phone home" functionality and wouldn't collect operational metrics to that detail there.
I don't have the bandwidth to pursue this at the moment. I'm going to close this PR in favor of this issue #2523. When there is more time, we can come back to that issue for a more thorough improvement. |
In #2391, a bug was introduced that stopped the scheduler from sending requests to the querier. This was reverted in #2398.
This PR adds metrics to track the number of times a frontend worker fails to forward a request to a querier. We can then create alerting based off those metrics to catch failures earlier.