
ruler: don't retry on non-retriable error #7216

Merged
merged 2 commits on Feb 8, 2024

1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -60,6 +60,7 @@
* [BUGFIX] Fix panic during tsdb Commit #6766
* [BUGFIX] tsdb/head: wlog exemplars after samples #6766
* [BUGFIX] Ruler: fix issue where "failed to remotely evaluate query expression, will retry" messages are logged without context such as the trace ID and do not appear in trace events. #6789
* [BUGFIX] Ruler: do not retry requests to remote querier when server's response exceeds its configured max payload size. #7216
* [BUGFIX] Querier: fix issue where spans in query request traces were not nested correctly. #6893
* [BUGFIX] Fix issue where all incoming HTTP requests have duplicate trace spans. #6920
* [BUGFIX] Querier: do not retry requests to store-gateway when a query gets canceled. #6934
23 changes: 19 additions & 4 deletions pkg/ruler/remotequerier.go
@@ -30,6 +30,7 @@ import (
"github.com/prometheus/prometheus/promql"
"golang.org/x/exp/slices"
"google.golang.org/grpc"
"google.golang.org/grpc/codes"

"github.com/grafana/mimir/pkg/querier/api"
"github.com/grafana/mimir/pkg/util/spanlogger"
@@ -309,11 +310,25 @@ func (q *RemoteQuerier) sendRequest(ctx context.Context, req *httpgrpc.HTTPRequest
}
return resp, nil
}
// 4xx errors shouldn't be retried because it is expected that
// running the same query gives rise to the same 4xx error.
if code := grpcutil.ErrorToStatusCode(err); code/100 == 4 {
return nil, err

// Bail out if the error is known to be not retriable.
switch code := grpcutil.ErrorToStatusCode(err); code {
case codes.ResourceExhausted:
Contributor:

As discussed via Slack, the initial idea was to make all errors with codes.ResourceExhausted status code fail. Some of these errors come from gRPC itself (e.g., https://github.com/grpc/grpc-go/blob/84b85babc00b4b8460e53b6ee110bfb49e9311cf/rpc_util.go#L610-L615). On the other hand, there are some errors with status code codes.ResourceExhausted returned by distributors and ingesters that should be retried. These cases are:

  1. ingester too busy: the returned error has codes.ResourceExhausted status code, but also contains a mimirpb.ErrorDetails with cause mimirpb.TOO_BUSY and should always be retried
  2. ingestion rate limit hit: the returned error has codes.ResourceExhausted status code, but also contains a mimirpb.ErrorDetails with cause mimirpb.INGESTION_RATE_LIMITED and should be retried if distributor's -distributor.service-overload-status-code-on-rate-limit-enabled is set to true
  3. request rate limit hit: the returned error has codes.ResourceExhausted status code, but also contains a mimirpb.ErrorDetails with cause mimirpb.REQUEST_RATE_LIMITED and should be retried if distributor's -distributor.service-overload-status-code-on-rate-limit-enabled is set to true

So I recommend checking whether the error contains a detail of type mimirpb.ErrorDetails with one of the causes mentioned above, and retrying when needed.

An example:

stat, ok := grpcutil.ErrorToStatus(err)
if ok {
	switch code := stat.Code(); code {
	case codes.ResourceExhausted:
		details := stat.Details()
		if len(details) != 1 {
			return nil, err
		}
		errDetails, ok := details[0].(*mimirpb.ErrorDetails)
		if !ok {
			return nil, err
		}
		if errDetails.GetCause() == mimirpb.INGESTION_RATE_LIMITED ||
			errDetails.GetCause() == mimirpb.REQUEST_RATE_LIMITED ||
			errDetails.GetCause() == mimirpb.TOO_BUSY {
			continue
		}
		return nil, err
	default:
		// In case the error was a wrapped HTTPResponse, its code represents HTTP status;
		// 4xx errors shouldn't be retried because it is expected that
		// running the same query gives rise to the same 4xx error.
		if code/100 == 4 {
			return nil, err
		}
	}
}

IMPORTANT: this code is just an example, and it doesn't take into account the -distributor.service-overload-status-code-on-rate-limit-enabled flag.

Contributor:

After another chat with @narqo I realized that mimirpb.INGESTION_RATE_LIMITED and mimirpb.REQUEST_RATE_LIMITED can only be thrown on the write path, while the present PR concerns the read path. Therefore, the -distributor.service-overload-status-code-on-rate-limit-enabled CLI flag is not relevant here, and the implementation can be simplified.

stat, ok := grpcutil.ErrorToStatus(err)
if ok {
	switch code := stat.Code(); code {
	case codes.ResourceExhausted:
		details := stat.Details()
		if len(details) != 1 {
			return nil, err
		}
		errDetails, ok := details[0].(*mimirpb.ErrorDetails)
		if !ok {
			return nil, err
		}
		if errDetails.GetCause() == mimirpb.TOO_BUSY {
			continue
		}
		return nil, err
	default:
		// In case the error was a wrapped HTTPResponse, its code represents HTTP status;
		// 4xx errors shouldn't be retried because it is expected that
		// running the same query gives rise to the same 4xx error.
		if code/100 == 4 {
			return nil, err
		}
	}
}

// In case the server is configured with "grpc-max-send-msg-size-bytes",
// and the response exceeds this limit, there is no point retrying the request.
// This is a special case, refer to grafana/mimir#7216.
if strings.Contains(err.Error(), "message larger than max") {
return nil, err
}
default:
// In case the error was a wrapped HTTPResponse, its code represents HTTP status;
// 4xx errors shouldn't be retried because it is expected that
// running the same query gives rise to the same 4xx error.
if code/100 == 4 {
return nil, err
}
}

if !retry.Ongoing() {
return nil, err
}
112 changes: 29 additions & 83 deletions pkg/ruler/remotequerier_test.go
@@ -185,26 +185,22 @@ func TestRemoteQuerier_Query(t *testing.T) {
})
}

func TestRemoteQuerier_SendRequest(t *testing.T) {
func TestRemoteQuerier_QueryRetryOnFailure(t *testing.T) {
const errMsg = "this is an error"
var (
successfulResponse = &httpgrpc.HTTPResponse{
Code: http.StatusOK,
Headers: []*httpgrpc.Header{
{Key: "Content-Type", Values: []string{"application/json"}},
},
Body: []byte(`{
"status": "success","data": {"resultType":"vector","result":[]}
}`),
}
erroneousResponse = &httpgrpc.HTTPResponse{
Code: http.StatusBadRequest,
Headers: []*httpgrpc.Header{
{Key: "Content-Type", Values: []string{"application/json"}},
},
Body: []byte("this is an error"),
}
)
successfulResponse := &httpgrpc.HTTPResponse{
Code: http.StatusOK,
Headers: []*httpgrpc.Header{
{Key: "Content-Type", Values: []string{"application/json"}},
},
Body: []byte(`{"status": "success","data": {"resultType":"vector","result":[]}}`),
}
erroneousResponse := &httpgrpc.HTTPResponse{
Code: http.StatusBadRequest,
Headers: []*httpgrpc.Header{
{Key: "Content-Type", Values: []string{"application/json"}},
},
Body: []byte("this is an error"),
}

tests := map[string]struct {
response *httpgrpc.HTTPResponse
@@ -237,6 +233,16 @@ func TestRemoteQuerier_SendRequest(t *testing.T) {
expectedError: status.Error(codes.DeadlineExceeded, context.DeadlineExceeded.Error()),
expectedRetries: true,
},
"gRPC ResourceExhausted error is retried": {
err: status.Error(codes.ResourceExhausted, errMsg),
expectedError: status.Error(codes.ResourceExhausted, errMsg),
expectedRetries: true,
},
"errors about execeeding gRPC server's limit are not retried": {
err: status.Error(codes.ResourceExhausted, "trying to send message larger than max"),
expectedError: status.Error(codes.ResourceExhausted, "trying to send message larger than max"),
expectedRetries: false,
},
"errors with code 4xx are not retried": {
err: httpgrpc.Errorf(http.StatusBadRequest, errMsg),
expectedError: httpgrpc.Errorf(http.StatusBadRequest, errMsg),
@@ -257,10 +263,7 @@ func TestRemoteQuerier_SendRequest(t *testing.T) {
}
for testName, testCase := range tests {
t.Run(testName, func(t *testing.T) {
var (
inReq *httpgrpc.HTTPRequest
count atomic.Int64
)
var count atomic.Int64

ctx, cancel := context.WithCancel(context.Background())
mockClientFn := func(ctx context.Context, req *httpgrpc.HTTPRequest, _ ...grpc.CallOption) (*httpgrpc.HTTPResponse, error) {
@@ -275,7 +278,7 @@ func TestRemoteQuerier_SendRequest(t *testing.T) {
}
q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Minute, formatJSON, "/prometheus", log.NewNopLogger())
require.Equal(t, int64(0), count.Load())
_, err := q.sendRequest(ctx, inReq, log.NewNopLogger())
_, err := q.Query(ctx, "qs", time.Now())
if testCase.err == nil {
if testCase.expectedError == nil {
require.NoError(t, err)
@@ -717,63 +720,6 @@ func TestRemoteQuerier_QueryReqTimeout(t *testing.T) {
require.Error(t, err)
}

func TestRemoteQuerier_BackoffRetry(t *testing.T) {
Contributor Author:

Note: these tests duplicate what is already covered in TestRemoteQuerier_QueryRetryOnFailure. I don't think testing the actual number of retries is worth this amount of code. The backoff is an internal implementation detail (that's IMHO, and I don't feel strongly about it).

tcs := map[string]struct {
failedRequests int
expectedError string
requestDeadline time.Duration
}{
"succeed on failed requests <= max retries": {
failedRequests: maxRequestRetries,
},
"fail on failed requests > max retries": {
failedRequests: maxRequestRetries + 1,
expectedError: "failed request: 4",
},
"return last known error on context cancellation": {
failedRequests: 1,
requestDeadline: 50 * time.Millisecond, // force context cancellation while waiting for retry
expectedError: "context deadline exceeded while retrying request, last err was: failed request: 1",
},
}
for tn, tc := range tcs {
t.Run(tn, func(t *testing.T) {
retries := 0
mockClientFn := func(ctx context.Context, req *httpgrpc.HTTPRequest, _ ...grpc.CallOption) (*httpgrpc.HTTPResponse, error) {
retries++
if retries <= tc.failedRequests {
return nil, fmt.Errorf("failed request: %d", retries)
}
return &httpgrpc.HTTPResponse{
Code: http.StatusOK,
Headers: []*httpgrpc.Header{
{Key: "Content-Type", Values: []string{"application/json"}},
},
Body: []byte(`{
"status": "success","data": {"resultType":"vector","result":[{"metric":{"foo":"bar"},"value":[1,"773054.5916666666"]}]}
}`),
}, nil
}
q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Minute, formatJSON, "/prometheus", log.NewNopLogger())

ctx := context.Background()
if tc.requestDeadline > 0 {
var cancelFn context.CancelFunc
ctx, cancelFn = context.WithTimeout(ctx, tc.requestDeadline)
defer cancelFn()
}

resp, err := q.Query(ctx, "qs", time.Now())
if tc.expectedError != "" {
require.EqualError(t, err, tc.expectedError)
} else {
require.NotNil(t, resp)
require.Len(t, resp, 1)
}
})
}
}

func TestRemoteQuerier_StatusErrorResponses(t *testing.T) {
var (
errorResp = &httpgrpc.HTTPResponse{
@@ -785,8 +731,8 @@ func TestRemoteQuerier_StatusErrorResponses(t *testing.T) {
"status": "error","errorType": "execution"
}`),
}
error4xx = status.Error(http.StatusUnprocessableEntity, "this is a 4xx error")
error5xx = status.Error(http.StatusInternalServerError, "this is a 5xx error")
error4xx = httpgrpc.Errorf(http.StatusUnprocessableEntity, "this is a 4xx error")
error5xx = httpgrpc.Errorf(http.StatusInternalServerError, "this is a 5xx error")
Comment on lines +734 to +735
@narqo (Contributor Author), Jan 25, 2024:

Note: these are outside the scope of this change. It was just confusing, while I was reading through the code, that we use non-gRPC codes in gRPC statuses, even though their docs explicitly ask not to do that (I'm now aware that httpgrpc assigns different semantics to the codes while keeping the same implementation) ;)
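
For readers unfamiliar with that distinction, here is a minimal standalone sketch (not part of the PR; the dskit import paths and the main-package wrapper are assumptions for illustration). It shows why the two constructors behave differently: status.Error takes a canonical gRPC codes.Code, whereas httpgrpc.Errorf stores an HTTP status in the gRPC status code field, which is what lets sendRequest classify wrapped HTTP responses by code/100.

package main

import (
	"fmt"
	"net/http"

	"github.com/grafana/dskit/grpcutil" // assumed import paths, pulled in via dskit
	"github.com/grafana/dskit/httpgrpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func main() {
	// A plain gRPC status: the code is one of the predefined codes.Code values.
	grpcErr := status.Error(codes.ResourceExhausted, "trying to send message larger than max")

	// httpgrpc reuses the gRPC status code field to carry an HTTP status instead,
	// so the extracted "code" is 422 rather than a canonical gRPC code.
	httpErr := httpgrpc.Errorf(http.StatusUnprocessableEntity, "this is a 4xx error")

	fmt.Println(grpcutil.ErrorToStatusCode(grpcErr)) // ResourceExhausted
	fmt.Println(grpcutil.ErrorToStatusCode(httpErr)) // Code(422)

	// This is the property the ruler's retry logic relies on for 4xx classification.
	fmt.Println(int(grpcutil.ErrorToStatusCode(httpErr)) / 100) // 4 -> non-retriable 4xx class
}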

)
testCases := map[string]struct {
resp *httpgrpc.HTTPResponse