Add querier.ingester-query-max-attempts to retry on partial data. #6714

Merged
10 commits merged on May 22, 2025

@justinjung04 (Contributor) commented Apr 22, 2025

What this PR does:
In #6526, the new configs query_partial_data and rules_partial_data were added. They allow tenants to receive a 2xx response with a warning message when data accuracy is relatively high in a zone-aware setting.

This PR adds retry logic to the querier path that fetches data from ingesters: a request is retried when the response contains partial data. The new configuration, querier.ingester-query-max-attempts, sets the maximum number of attempts for an ingester query. The default is 1, which means no retries.

Which issue(s) this PR fixes:
n/a

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Justin Jung <jungjust@amazon.com>
@justinjung04 changed the title from "Make partial data responses to be retryable" to "Add querier.ingester-query-max-attempts to retry on partial data." on Apr 22, 2025
@justinjung04 marked this pull request as ready for review on April 22, 2025 06:03
@danielblando (Contributor) left a comment

LGTM

@SungJin1212 (Member)

LGTM!

@yeya24 (Contributor) left a comment

Maybe we should document this clearly and provide some suggestions on the recommended setup for this retry value.

To me, this flag overlaps somewhat with partial data, and I am not sure how much the retry helps for most use cases.
We know it is unlikely to miss series, so we return partial data with 4XX. The retry may succeed, but given that we only wait a short period of time and ingester queries usually complete within milliseconds, I am unsure whether it is worth retrying more, especially if ingesters are under high load or in the middle of a deployment.

@justinjung04 (Contributor, Author)

Do you think the following description for the config is sufficient?

The maximum number of times we attempt fetching data from ingesters for retryable errors (ex. partial data returned).

What I noticed with partial data is that the status code is 200 (so the result is actually returned to the customer), but a warning is returned along with the results. So, when there is a transient network issue between a querier and ingesters, the customer could get a partial data response instead of the query path retrying to get the full response. The current retry logic in the query frontend only retries 5xx responses, so a partial data response does not get retried at all.
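To make the gap concrete, here is a small sketch of a status-code-based retry check. It is an illustration only: the `response` type and `shouldRetry5xx` function are hypothetical and are not the query frontend's actual code. It assumes the retry decision looks only at the HTTP status code.

```go
package main

import "fmt"

// response is a hypothetical, simplified view of a query response.
type response struct {
	status   int
	warnings []string
}

// shouldRetry5xx retries only on 5xx status codes, mirroring the
// frontend behavior described in the comment above.
func shouldRetry5xx(r response) bool {
	return r.status >= 500 && r.status <= 599
}

func main() {
	// A partial data response: 200 with a warning attached.
	partial := response{status: 200, warnings: []string{"partial data"}}
	// A hard failure that the 5xx-only check does pick up.
	failed := response{status: 503}

	fmt.Println(shouldRetry5xx(partial)) // false: the warning is ignored
	fmt.Println(shouldRetry5xx(failed))  // true
}
```

Because the check never inspects the warnings, a 200 response carrying partial data passes straight through, which is why a retry closer to the ingester query is needed to cover this case.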

@yeya24 (Contributor) left a comment

Thanks. We can merge after fixing the changelog conflict

@yeya24 merged commit 0b8c593 into cortexproject:master on May 22, 2025
17 checks passed

4 participants