Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add metric to track number of times a request fails to forward #2420

Closed
wants to merge 3 commits into from

Conversation

bryanhuhta
Copy link
Contributor

In #2391, a bug was introduced that stopped the scheduler from sending requests to the querier. This was reverted in #2398.

This PR adds metrics to track the number of times a frontend worker fails to forward a request to a querier. We can then create alerting based off those metrics to catch failures earlier.

@bryanhuhta bryanhuhta requested a review from a team as a code owner September 18, 2023 17:15
@bryanhuhta bryanhuhta self-assigned this Sep 18, 2023
Comment on lines 315 to 324
backoff := backoff.New(ctx, backoffConfig)
for backoff.Ongoing() {
if !attemptLoop() {
w.forwardedRequestsFailedStats.Add(1)
backoff.Wait()
} else {
w.forwardedRequestsFailedStats.Set(0)
backoff.Reset()
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we could distinguish between failed retry attempts and failures, I guess (assuming we want to setup alerting). backoff.Err() indicates whether we have exhausted all the attempts.

I'm not very sure about resetting the metric with w.forwardedRequestsFailedStats.Set(0), as this is a shared object; how does it affect concurrently occurring failures?

Comment on lines +240 to +241
// Number of queries that failed to forward to the querier.
forwardedRequestsFailedStats *expvar.Int
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should use usagestats for this. Would a prometheus.Counter be more suitable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree. We use usage stats for the "phone home" functionality and wouldn't collect operational metrics to that detail there.

@bryanhuhta
Copy link
Contributor Author

I don't have the bandwidth to pursue this at the moment. I'm going to close this PR in favor of this issue #2523. When there is more time, we can come back to that issue for a more thorough improvement.

@bryanhuhta bryanhuhta closed this Oct 11, 2023
@bryanhuhta bryanhuhta deleted the fe-scheduler-metrics branch October 12, 2023 03:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants