fix: snapshot recover from exporter error #13935
Conversation
Example output which is very helpful for debugging snapshots:
It could be argued that a program should crash when a store doesn't have a nodedb, since that could indicate a deeper problem. If that is desired, let me know. The idea here is to make clear exactly which misconfigured store is crashing the daemon, rather than panicking in a location that is called many times under different conditions. In the case of Dig Chain, snapshots without the icahost store restored fine, and validator nodes are signing with the proper apphash values. So it is likely an unused store that was mistakenly added, or mistakenly left unconfigured.
store/rootmulti/store.go
Outdated
@@ -757,8 +757,17 @@ func (rs *Store) Snapshot(height uint64, protoWriter protoio.Writer) error {
// and the following messages contain a SnapshotNode (i.e. an ExportNode). Store changes
// are demarcated by new SnapshotStore items.
for _, store := range stores {
	fmt.Printf("Snapshot.Export Store Key %s\n", store.name)
we should use the logger and place this in debug mode.
Will fix tomorrow. Didn't see the logger earlier, but I see it now.
Changed all fmt.Printf to rs.logger. Found out that the original logger was a NopLogger, which did NOT respect the cli --log_level. The code path does not make it obvious how to attach to the global app logger, nor does it make sense to turn on consensus debugging when trying to debug rootmulti. Thus we allow a runtime setting of the environment variable LOG_ROOTMULTI to turn on logging for this package. Example usage:

LOG_ROOTMULTI=1 daemond start
store/store.go
Outdated
@@ -10,7 +12,15 @@ import (
)

func NewCommitMultiStore(db dbm.DB) types.CommitMultiStore {
	return rootmulti.NewStore(db, log.NewNopLogger())
	// Developers may enable rootmulti logging by setting env LOG_ROOTMULTI=1
	if len(os.Getenv("LOG_ROOTMULTI")) > 0 {
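The hunk above is truncated at the env check. For reference, a minimal sketch of the complete function as described; the concrete debug logger used here (tendermint's NewTMLogger over NewSyncWriter) is an assumption, not necessarily what the PR wires in:

```go
func NewCommitMultiStore(db dbm.DB) types.CommitMultiStore {
	// Developers may enable rootmulti logging by setting env LOG_ROOTMULTI=1.
	if len(os.Getenv("LOG_ROOTMULTI")) > 0 {
		// Assumed logger choice: log to stderr; the PR may configure this differently.
		return rootmulti.NewStore(db, log.NewTMLogger(log.NewSyncWriter(os.Stderr)))
	}
	return rootmulti.NewStore(db, log.NewNopLogger())
}
```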
I think this may be cleaner in app.toml. But not sure, let's wait to see what others think.
Which of these makes the most sense?
- passing the app logger to NewCommitMultiStore
- passing the desired logging level to NewCommitMultiStore
- configuring the logger in the baseapp and supplying it to NewCommitMultiStore
Also not sure where such a setting would go in app.toml (other than root?) since store spans quite a number of functions/reactors.
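A sketch of the third option, purely illustrative: the caller supplies its already-configured logger rather than the store package reading the environment. This hypothetical variant is not in the PR:

```go
// Hypothetical helper: accept the app's logger instead of hardcoding
// a NopLogger inside the store package.
func NewCommitMultiStoreWithLogger(db dbm.DB, logger log.Logger) types.CommitMultiStore {
	return rootmulti.NewStore(db, logger)
}
```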
Removed this change to eliminate complexity within this PR
Do not merge, found a bug (noted above)
store/rootmulti/store.go
Outdated
if err != nil {
	rs.logger.Error("Snapshot Failed", "store", store.name, "err", err)
	return err
}
defer exporter.Close()
This code has a bug: defers don't run until the function returns, but this code assumes they'll be closed per iteration of the loop. I've mailed PR #14027.
An interesting note is that exporter.Close() is called at the end of this loop. How about we remove the defer and call exporter.Close() manually, only on the error conditions (of which there are only a few)?
I see your rewrite in #14027; that will work also, of course.
So in my PR I just put that code into a closure then kept the defer.
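For illustration, a minimal sketch of that pattern inside Snapshot (names follow the surrounding diff; the exact code lives in #14027): wrapping the loop body in a closure bounds each deferred Close to its own iteration.

```go
for _, store := range stores {
	err := func() error {
		exporter, err := store.Export(int64(height))
		if err != nil {
			return err
		}
		// Runs when this anonymous function returns, i.e. once per store.
		defer exporter.Close()

		// ... stream the store's exported nodes to protoWriter ...
		return nil
	}()
	if err != nil {
		return err
	}
}
```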
Force-pushed from ce26122 to fd0e4a7
Here's an update on this PR. Previously the effort was focused on preventing panic/nil/crashes during a snapshot. After reviewing chains with the current crashing problem, we have determined that the cause is chains introducing a new store without a keeper and its corresponding setup call. This new store will continue to increment its version by 1 on each block, as enforced by cosmos/iavl. The crashing problem then appears when a store within rootmulti must be selected at the current block height for all stores: the new, improperly introduced store does not have a version matching the current height, and thus any effort to snapshot will fail. For example, if your rootmulti has stores at height 1234:

Then the chain adds a module such as authz. This results in a commit with stores in rootmulti like:

A proper snapshot now needs to consider the actual referenced version of each store during the commit. This PR now loads the commit, determines the actual version of each store, and places that information into the snapshot. The snapshot protobuf format had to be modified because the restore function can no longer assume that all heights match. Since IAVL checksums include the store version, nodes will AppHash-error on the first modification to that store, because the new checksum will include the mismatched version. For example, all other nodes on the network 100 blocks after the upgrade will have:

But the restored node, which assumes all restored stores have the same height, will have:

To avoid this, we utilize the newly added Version to ensure that authz is restored to version 100 to match all other validator nodes. This avoids the situation where a team deploying an upgrade in an imperfect fashion immediately penalizes all validators by breaking snapshot creation and statesync restore. The fact that a team is not perfect should not manifest as a panic, especially since the chain otherwise works. Please note, we encountered issues using the protobuf generation functions and would appreciate it if another member could run the protobuf generation and see whether the results match or additional corrections are needed.
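To make the version arithmetic above concrete, a small self-contained illustration (the heights 1234 and the 100-block gap come from the example above; everything else is hypothetical):

```go
package main

import "fmt"

func main() {
	// Stores present since genesis have version == chain height.
	chainHeight := int64(1334)
	authzAddedAtHeight := int64(1234) // the upgrade height in the example

	// authz has committed one version per block since the upgrade.
	authzVersion := chainHeight - authzAddedAtHeight
	fmt.Println("validators have authz at version", authzVersion) // 100

	// A restore that pins every store to the snapshot height gets authz wrong.
	naiveRestore := chainHeight
	fmt.Println("naive restore pins authz at version", naiveRestore) // 1334

	// IAVL hashes include the version, so the first write to authz after such
	// a restore yields a different root hash and an AppHash mismatch.
}
```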
Force-pushed from fd0e4a7 to 626a01d
The below documents why help is needed to properly regenerate the snapshot protobuf pb.go files. This error is current as of the cosmos/cosmos-sdk main branch at commit 25449b5.

A manual run inside the docker image does not produce an output file, and no errors are reported.
store/rootmulti/store.go
Outdated
for key, store := range rs.stores {
	rs.logger.Debug("prune store", "key", key)
Let's add the store as a field to the logger too
Saw the suggestion to log "name", store.Name, but store.name is a private lower-case variable.
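For reference, a sketch of attaching the store as a logger field without touching the unexported name, assuming the map key is a types.StoreKey (whose Name() is exported):

```go
for key, store := range rs.stores {
	// key.Name() is exported, unlike store.name.
	logger := rs.logger.With("store", key.Name())
	logger.Debug("prune store")
	_ = store // pruning logic elided in this sketch
}
```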
store/rootmulti/store.go
Outdated
if item.Store.Version == 0 {
	fmt.Printf("!!!!!!!!!!!!!!!!! Restore Error, Snapshot Missing Version, Use Another Peer !!!!!!!!!!!!!!!!")
	rs.logger.Error("Restore Error, Snapshot Missing Version, Use Another Peer", "store", item.Store.Name, "version", item.Store.Version)
	return snapshottypes.SnapshotItem{}, sdkerrors.Wrap(sdkerrors.ErrLogic, "Snapshot Without Store Version Not Supported, Use Another Peer")
nit: Please do not capitalize or punctuate errors.
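For example, following that convention, the wrapped error above would become:

```go
return snapshottypes.SnapshotItem{}, sdkerrors.Wrap(sdkerrors.ErrLogic,
	"snapshot without store version not supported; use another peer")
```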
Suggestion noted. This code is no longer relevant to the scope of this PR
Thank you for the review @alexanderbez. Noted above: the protobuf-to-Go code generation appears broken. Is the code generation working for your team? It would be best to have matching .proto and .go code in the final commit.
@julienrbrt any idea why this is happening? I can't reproduce it.
I cannot reproduce it either. Works for me in GitHub Codespaces as well. @chillyvee can you retry in Codespaces too, and what is your OS?
Works in GitHub Codespaces but not Ubuntu 20.04 x64. This is the OS I'm using for the x64 dev environment.

I will delete the docker image I have and pull it again to see what happens: does not work.

In Codespaces it works, and the Codespaces output is much more verbose.

Looks like something is preventing the go build/mod process from completing properly. Same docker image. Must just be something very strange with that particular release of Ubuntu+Docker? Will set up an alternate dev environment or use Codespaces if that's all it takes to fix. Might not be worth the time to debug.
I believe it has something to do with the Ubuntu+Docker release, as mentioned by @chillyvee. I tried debugging in several different ways, but nothing seems to work locally; it works fine in Codespaces with the same docker image.
FYI, got my node synced up with the latest Juno testnet, so working on testing now.
Force-pushed from 52fe0c9 to 8027524
Previous branch renamed to v0466_snapshot_recover_from_exporter_error_master_preclean. Removed the enhanced snapshot format related code. Kept the debugging code, which can assist validators with real-time monitoring.
Force-pushed from 92e3955 to aae329a
Force-pushed from aae329a to 267c64e
Thanks for updating these changes @chillyvee. Just one more change then good to go. Please also update the base branch to the latest.
Thank you for the feedback. One more item: @tac0turtle @odeke-em, earlier someone asked to log store.Name, but store.name is a private variable. Okay to delete the comment from this PR? (code reference here). If there is a clean way to access the private .name, please recommend it.
LGTM now, thank you @chillyvee!
Co-authored-by: Aleksandr Bezobchuk <alexanderbez@users.noreply.github.com>
Given it ended up only adding logs and improving UX, we should get this in v0.47 final imho.
@Mergifyio backport release/v0.47.x
✅ Backports have been created
Co-authored-by: Marko <marbar3778@yahoo.com>
Co-authored-by: Emmanuel T Odeke <emmanuel@orijtech.com>
Co-authored-by: Marko <marko@baricevic.me>
Co-authored-by: Aleksandr Bezobchuk <alexanderbez@users.noreply.github.com>
(cherry picked from commit 183dde8)
# Conflicts:
#	store/rootmulti/store.go
exporter, err := store.Export(int64(height))
if exporter == nil {
	rs.logger.Error("snapshot failed; exporter is nil", "store", store.name)
Looking at store.Export, this will always trigger when there is an error, and the error check below is now not so useful. Was it intended?
cosmos-sdk/store/iavl/store.go
Lines 275 to 288 in 5d559dd
func (st *Store) Export(version int64) (*iavl.Exporter, error) {
	istore, err := st.GetImmutable(version)
	if err != nil {
		return nil, errorsmod.Wrapf(err, "iavl export failed for version %v", version)
	}
	tree, ok := istore.tree.(*immutableTree)
	if !ok || tree == nil {
		return nil, fmt.Errorf("iavl export failed: unable to fetch tree for version %v", version)
	}
	export, err := tree.Export()
	if err != nil {
		return nil, err
	}
	return export, nil
}
we modified the exporter function after the fact and probably never cleaned this up
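Given that Export above returns a nil exporter only alongside a non-nil error, one possible cleanup (a sketch, assuming that invariant holds) is to fold the nil check into the error check:

```go
exporter, err := store.Export(int64(height))
if err != nil {
	// Also covers the nil-exporter case, since Export never returns (nil, nil).
	rs.logger.Error("snapshot failed; export error", "store", store.name, "err", err)
	return err
}
defer exporter.Close()
```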
Description
During snapshots, it is difficult to tell which store was not able to export properly.
In the case of Dig chain, one store was not fully configured (no nodedb), so the returned exporter should be nil.
Cosmos-sdk can then decide whether to panic or skip the possibly unused store.
Added a print when each store begins and completes an export.
This related update in cosmos/iavl properly returns a nil exporter on error:
https://github.com/cosmos/iavl/pull/622/files#diff-9d9ca52939a19f26358e2b5b35ca938679b381be35f4c54379df94c4fa4d1233R44
Logging for rootmulti was previously hardcoded to NewNopLogger, which did NOT log.
Now developers/users can set an environment variable to turn on logging for this package:
LOG_ROOTMULTI=1 daemond start
Author Checklist
- added a relevant changelog entry to CHANGELOG.md
Reviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...
- confirmed `!` in the type prefix if API or client breaking change