Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Access] Improve api error handling #3988

Merged
merged 9 commits into from
Mar 10, 2023
Merged

Conversation

peterargue
Copy link
Contributor

@peterargue peterargue commented Mar 3, 2023

Closes: #3987

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2023

FVM Benchstat comparison

This branch with compared with the base branch onflow:master commit 03af504

The command (for i in {1..7}; do go test ./fvm ./engine/execution/computation --bench . --tags relic -shuffle=on --benchmem --run ^$; done) was used.

Collapsed results for better readability

@codecov-commenter
Copy link

codecov-commenter commented Mar 3, 2023

Codecov Report

Merging #3988 (891ee55) into master (031ed1a) will decrease coverage by 6.20%.
The diff coverage is 33.09%.

@@            Coverage Diff             @@
##           master    #3988      +/-   ##
==========================================
- Coverage   59.47%   53.28%   -6.20%     
==========================================
  Files         228      825     +597     
  Lines       21499    77632   +56133     
==========================================
+ Hits        12787    41366   +28579     
- Misses       7752    32941   +25189     
- Partials      960     3325    +2365     
Flag Coverage Δ
unittests 53.28% <33.09%> (-6.20%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
engine/access/rpc/backend/backend.go 75.47% <0.00%> (ø)
engine/access/rpc/backend/backend_block_headers.go 24.61% <0.00%> (-6.76%) ⬇️
engine/access/rpc/backend/backend_network.go 42.02% <0.00%> (-2.59%) ⬇️
engine/access/rpc/backend/errors.go 0.00% <0.00%> (ø)
engine/collection/rpc/engine.go 11.68% <0.00%> (-0.48%) ⬇️
engine/execution/rpc/engine.go 51.08% <0.00%> (-0.42%) ⬇️
engine/access/rpc/backend/backend_block_details.go 21.42% <4.16%> (-5.36%) ⬇️
engine/access/rpc/backend/backend_accounts.go 54.83% <20.00%> (+3.83%) ⬆️
engine/access/rpc/backend/backend_transactions.go 47.46% <24.24%> (-2.45%) ⬇️
engine/access/rpc/backend/backend_events.go 68.45% <70.58%> (+1.78%) ⬆️
... and 2 more

... and 617 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@peterargue peterargue requested a review from koko1123 March 7, 2023 20:03
engine/access/rpc/backend/backend_accounts.go Show resolved Hide resolved
engine/access/rpc/backend/backend_accounts.go Outdated Show resolved Hide resolved
err = rpc.ConvertStorageError(err)
return nil, flow.BlockStatusUnknown, err
// node should always have the latest block
return nil, flow.BlockStatusUnknown, status.Errorf(codes.Internal, "could not get latest block: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indicates a pretty serious inconsistency in our local database - should we irrecoverable.Throw instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reluctant to cause the node to crash as a result of an API request. This is generally something to avoid due to the DoS potential. In this case, it's unclear if any API response is valid if the blocks index in inconsistent, so I think we should at least halt the APIs.

The rpc engine currently isn't a component, so there's some additional work to do before we could throw. I'll open an issue to convert it and add some halting logic.

In practice, there are many other components on ANs that also check and throw given this condition, so the node won't silently fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, there are many other components on ANs that also check and throw given this condition, so the node won't silently fail.

This is a fair point, we could add a comment along the lines to document the reasoning:

// In the RPC engine, if we encounter an error from the protocol state indicating state corruption,
// we should halt processing requests, but do throw an exception which might cause a crash:
//  - It is unsafe to process requests if we have an internally bad state. TODO issue
//  - We would like to avoid throwing an exception as a result of an Access API request by policy 
//    because this can cause DOS potential
//  - Since the protocol state is widely shared, we assume that in practice another component will
//    observe the protocol state error and throw an exception.

engine/access/rpc/backend/backend_block_details.go Outdated Show resolved Hide resolved
engine/access/rpc/backend/backend_block_details.go Outdated Show resolved Hide resolved
engine/access/rpc/backend/backend_transactions.go Outdated Show resolved Hide resolved
engine/access/rpc/backend/backend_transactions.go Outdated Show resolved Hide resolved
Comment on lines 14 to 15
// ConvertError converts a generic error into a grpc status error
func ConvertError(err error, msg string, defaultCode codes.Code) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this kind of centralized error handling has the tendency to be misused when it addresses anything other than errors which are universally benign.

For example, a context.Canceled error can pretty much always be considered a benign network failure. However, ErrNotFound could be either benign or critical, depending on the context.

  • If someone requests a block height which we have not seen yet, then ErrNotFound is benign and completely expected. We should return a status code and do nothing else.
  • If someone requests the latest finalized block, then ErrNotFound indicates a very serious problem with our internal state.

By moving the interpretation of errors from individual handlers into a central handling function, it becomes more likely that context-dependent errors like ErrNotFound will be interpreted as benign in cases where they should be raising alarm bells.

I noticed that in several places you used status.Errorf rather than this function, when a ErrNotFound indicated something seriously wrong. That's good, but I think the structure of the centralized error handling function makes it easy for future changes to skip this step and simply call ConvertError to handle any error without additional inspection.

I think the most correct approach would be to remove any error types which can't universally be considered benign in the context of the RPC Backend from the switch statement (ie. ErrNotFound) - see suggestion below. Minimally, we should clearly document the responsibility of the caller to handle context-dependent error types.

Suggestion

  • Restrict the use of ConvertError to error types which can universally be considered benign.
  • Add a note to the godoc about usage requirements
Suggested change
// ConvertError converts a generic error into a grpc status error
func ConvertError(err error, msg string, defaultCode codes.Code) error {
// ConvertError converts a generic networking error into a grpc status error. The input
// must either be a status.Error already, or be a supported error type which can be
// universally interpreted as benign (see switch statement). The caller must first check
// for any other unsupported errors and handle them accordingly.
// Error returns:
// - status.Error for any status.Error or supported networking error inputs
// - generic error in case an unsupported error type is passed in
func ConvertError(err error, msg string, defaultCode codes.Code) error {

Copy link
Contributor Author

@peterargue peterargue Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really good point about storage errors, and I'll make that change. Not all storage errors should return NotFound, and codifying that will create bugs in the future.

The intention with this method is to simplify converting common error conditions into their corresponding grpc status codes immediately before returning them back to the client. Ultimately, this helps with monitoring and alerting for nodes since Internal and Unknown errors should only occur in genuine failure cases. It's not intended to make any assertions about the error condition. If it's not a direct mapping, the defaultCode should be returned (Internal in all cases in this PR)

peterargue and others added 3 commits March 8, 2023 11:59
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
@peterargue
Copy link
Contributor Author

bors merge

@bors bors bot merged commit ba99bea into master Mar 10, 2023
@bors bors bot deleted the petera/access-api-error-cleanup branch March 10, 2023 22:21
peterargue added a commit that referenced this pull request May 19, 2023
3988: [Access] Improve api error handling r=peterargue a=peterargue

Closes: #3987

Co-authored-by: Peter Argue <89119817+peterargue@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cleanup unhandled Access API errors
4 participants