fix: Dumping upgrade info interrupted by cosmovisor #9384
Solution: dump the upgrade info before emitting the `UPGRADE NEEDED` log, which causes cosmovisor to kill the chain process.
Codecov Report
@@ Coverage Diff @@
## master #9384 +/- ##
==========================================
- Coverage 60.59% 60.43% -0.16%
==========================================
Files 589 590 +1
Lines 37227 37243 +16
==========================================
- Hits 22556 22508 -48
- Misses 12727 12754 +27
- Partials 1944 1981 +37
@yihuang what issue number does this PR resolve?
I didn't open an issue for it; I described the bug directly in the PR description. We found the issue in our testing.
@yihuang How can I reproduce this bug?
Thanks @yihuang 🙏
@cyberbono3 I don't think we need to actually reproduce this. The patch is pretty straightforward and looks correct. Cosmovisor will currently kill the node once that message hits stdout, and that could interrupt writing the file. So changing the order will fix that. What we really need to do for 0.42 nodes is get #8590 finished ASAP.
@Mergifyio backport release/v0.42.x
Solution:

- dump the upgrade info before emitting the `UPGRADE NEEDED` log, which causes cosmovisor to kill the chain process

## Description

The problematic procedure:

1. The chain process outputs the `UPGRADE NEEDED` log.
2. Cosmovisor sees the message and kills the chain binary.
3. The chain process dumps the upgrade info and panics.

Steps 2 and 3 run concurrently, so the dumping process can be interrupted by cosmovisor's terminate signal. The proposed solution is to dump the upgrade info before emitting the log.

There are two problematic situations:

1. The upgrade info file is created, but its content is not written or flushed before the process is killed. When the chain process restarts, it panics because of a JSON parsing error.
2. The upgrade info file is not created at all. When the chain process restarts, the [store upgrades](https://github.com/crypto-org-chain/chain-main/blob/master/app/app.go#L436) are not activated, which causes an app hash mismatch error later on.

---

Before we can merge this PR, please make sure that all the following items have been checked off. If any of the checklist items are not applicable, please leave them but write a little note why.

- [x] Targeted PR against correct branch (see [CONTRIBUTING.md](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#pr-targeting))
- [ ] Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
- [x] Code follows the [module structure standards](https://github.com/cosmos/cosmos-sdk/blob/master/docs/building-modules/structure.md).
- [ ] Wrote unit and integration [tests](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#testing)
- [ ] Updated relevant documentation (`docs/`) or specification (`x/<module>/spec/`)
- [ ] Added relevant `godoc` [comments](https://blog.golang.org/godoc-documenting-go-code).
- [ ] Added a relevant changelog entry to the `Unreleased` section in `CHANGELOG.md`
- [x] Re-reviewed `Files changed` in the Github PR explorer
- [x] Review `Codecov Report` in the comment section below once CI passes

(cherry picked from commit 0540ed2)
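The first failure mode in the description (a file that exists but was cut off mid-write) can be reproduced in isolation. The sketch below assumes a simplified, hypothetical `plan` shape and a `parseUpgradeInfo` helper that are not taken from the SDK; it only shows that Go's `encoding/json` rejects truncated or empty content, which is what makes a restarting node fail on a partially written file.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plan is a hypothetical, simplified shape of the upgrade info file.
type plan struct {
	Name   string `json:"name"`
	Height int64  `json:"height"`
}

// parseUpgradeInfo decodes the file content; truncation surfaces as an error.
func parseUpgradeInfo(bz []byte) (plan, error) {
	var p plan
	if err := json.Unmarshal(bz, &p); err != nil {
		return plan{}, fmt.Errorf("parse upgrade info: %w", err)
	}
	return p, nil
}

func main() {
	full := []byte(`{"name":"v2","height":100}`)
	// Complete content, content cut off mid-write, and an empty file.
	for _, content := range [][]byte{full, full[:10], {}} {
		p, err := parseUpgradeInfo(content)
		fmt.Printf("%d bytes: plan=%+v err=%v\n", len(content), p, err)
	}
}
```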
@Mergifyio backport release/v0.43.x
Hello all. I'm not sure whether this is the same issue or whether I should open a new one.
With #8590 we don't have the race condition any more, because the "new" cosmovisor doesn't scan the log and only monitors the upgrade info file.