-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Access] Improve api error handling #3988
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3988 +/- ##
==========================================
- Coverage 59.47% 53.28% -6.20%
==========================================
Files 228 825 +597
Lines 21499 77632 +56133
==========================================
+ Hits 12787 41366 +28579
- Misses 7752 32941 +25189
- Partials 960 3325 +2365
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 617 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
err = rpc.ConvertStorageError(err) | ||
return nil, flow.BlockStatusUnknown, err | ||
// node should always have the latest block | ||
return nil, flow.BlockStatusUnknown, status.Errorf(codes.Internal, "could not get latest block: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This indicates a pretty serious inconsistency in our local database - should we irrecoverable.Throw
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm reluctant to cause the node to crash as a result of an API request. This is generally something to avoid due to the DoS potential. In this case, it's unclear if any API response is valid if the blocks index in inconsistent, so I think we should at least halt the APIs.
The rpc engine currently isn't a component, so there's some additional work to do before we could throw. I'll open an issue to convert it and add some halting logic.
In practice, there are many other components on ANs that also check and throw given this condition, so the node won't silently fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice, there are many other components on ANs that also check and throw given this condition, so the node won't silently fail.
This is a fair point, we could add a comment along the lines to document the reasoning:
// In the RPC engine, if we encounter an error from the protocol state indicating state corruption,
// we should halt processing requests, but do throw an exception which might cause a crash:
// - It is unsafe to process requests if we have an internally bad state. TODO issue
// - We would like to avoid throwing an exception as a result of an Access API request by policy
// because this can cause DOS potential
// - Since the protocol state is widely shared, we assume that in practice another component will
// observe the protocol state error and throw an exception.
engine/common/rpc/errors.go
Outdated
// ConvertError converts a generic error into a grpc status error | ||
func ConvertError(err error, msg string, defaultCode codes.Code) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this kind of centralized error handling has the tendency to be misused when it addresses anything other than errors which are universally benign.
For example, a context.Canceled
error can pretty much always be considered a benign network failure. However, ErrNotFound
could be either benign or critical, depending on the context.
- If someone requests a block height which we have not seen yet, then
ErrNotFound
is benign and completely expected. We should return a status code and do nothing else. - If someone requests the latest finalized block, then
ErrNotFound
indicates a very serious problem with our internal state.
By moving the interpretation of errors from individual handlers into a central handling function, it becomes more likely that context-dependent errors like ErrNotFound
will be interpreted as benign in cases where they should be raising alarm bells.
I noticed that in several places you used status.Errorf
rather than this function, when a ErrNotFound
indicated something seriously wrong. That's good, but I think the structure of the centralized error handling function makes it easy for future changes to skip this step and simply call ConvertError
to handle any error without additional inspection.
I think the most correct approach would be to remove any error types which can't universally be considered benign in the context of the RPC Backend from the switch statement (ie. ErrNotFound
) - see suggestion below. Minimally, we should clearly document the responsibility of the caller to handle context-dependent error types.
Suggestion
- Restrict the use of
ConvertError
to error types which can universally be considered benign. - Add a note to the godoc about usage requirements
// ConvertError converts a generic error into a grpc status error | |
func ConvertError(err error, msg string, defaultCode codes.Code) error { | |
// ConvertError converts a generic networking error into a grpc status error. The input | |
// must either be a status.Error already, or be a supported error type which can be | |
// universally interpreted as benign (see switch statement). The caller must first check | |
// for any other unsupported errors and handle them accordingly. | |
// Error returns: | |
// - status.Error for any status.Error or supported networking error inputs | |
// - generic error in case an unsupported error type is passed in | |
func ConvertError(err error, msg string, defaultCode codes.Code) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really good point about storage errors, and I'll make that change. Not all storage errors should return NotFound
, and codifying that will create bugs in the future.
The intention with this method is to simplify converting common error conditions into their corresponding grpc status codes immediately before returning them back to the client. Ultimately, this helps with monitoring and alerting for nodes since Internal
and Unknown
errors should only occur in genuine failure cases. It's not intended to make any assertions about the error condition. If it's not a direct mapping, the defaultCode
should be returned (Internal
in all cases in this PR)
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
bors merge |
Closes: #3987