txnsync: avoid pseudonode execution goroutine hang #2582
Merged
Summary
While running a longevity test on the transaction sync feature branch, the following error showed up on some of the participating nodes:
{"file":"actions.go","function":"github.com/algorand/go-algorand/agreement.pseudonodeAction.do","level":"error","line":386,"msg":"pseudonode.MakeVotes call failed(attest) pseudonode input channel is full","time":"2021-07-19T13:28:35.078410Z"}
Digging further, it seems that both pseudonodeVotesTask.execute and pseudonodeProposalsTask.execute could lead to this situation (it is the same issue in both). There are two distinct issues in the current implementation that have been addressed in this PR:
1. verifyVote might silently fail. In that case, the caller (i.e. the execute method) would still expect to find the result in the passed-in output channel. When this does not happen, the method could be blocked indefinitely.
2. When the execute method tries to send the output result back to the output channel, it might fail to do so due to an issue on the "other" side. When that happens, there was previously no error indication. The changes in this PR add a timeout for these messages to ensure the method won't be "stuck" forever.
Test Plan
Use existing unit tests. Run a longevity test.