txnsync: avoid pseudonode execution goroutine hang #2582

tsachiherman · 2021-07-19T16:31:29Z

Summary

While running a longevity tests on the transaction sync feature branch, the following issue showed up on some of the participating nodes :

{"file":"actions.go","function":"github.com/algorand/go-algorand/agreement.pseudonodeAction.do","level":"error","line":386,"msg":"pseudonode.MakeVotes call failed(attest) pseudonode input channel is full","time":"2021-07-19T13:28:35.078410Z"}

Digging further, it seems that both pseudonodeVotesTask.execute as well as pseudonodeProposalsTask.execute could lead to this situation ( same issue ). There are two distinct issues in the current implementation that have been addressed in this PR:

The call to verifyVote might silently fail. In that case, the caller ( i.e. the execute method ) would still expect to find the result in the passed-in output channel. When this does not happen, the method could be blocked indefinitely.
When the execute method tries to send the output result back to the output channel, it might fail doing so due to an issue on the "other" side. When that happens, we have had no error indication. The changes in the codebase would now allow a timeout for these messages to ensure it won't be "stuck" forever.

Test Plan

Use existing unit tests. Run a longevity test.

ghost · 2021-07-19T16:34:10Z

Is there a reason why we are not trying to merge this to master but only to the txnsync branch?

tsachiherman · 2021-07-19T16:43:28Z

Is there a reason why we are not trying to merge this to master but only to the txnsync branch?

Yes - I want to verify the fix on this branch ( where I was able to reproduce it ), and then port it to master.

It will take few days of runtime to reproduce it, so I figured that I'll start as soon as I can on this branch.

codecov-commenter · 2021-07-19T17:34:29Z

Codecov Report

Merging #2582 (79bb444) into feature/txnsync (dd84666) will increase coverage by 0.02%.
The diff coverage is 41.02%.

@@                 Coverage Diff                 @@
##           feature/txnsync    #2582      +/-   ##
===================================================
+ Coverage            42.93%   42.96%   +0.02%     
===================================================
  Files                  369      370       +1     
  Lines                83744    83808      +64     
===================================================
+ Hits                 35958    36007      +49     
- Misses               41978    41999      +21     
+ Partials              5808     5802       -6

Impacted Files	Coverage Δ
agreement/cryptoVerifier.go	`73.04% <16.66%> (-2.69%)`	⬇️
agreement/asyncVoteVerifier.go	`79.66% <33.33%> (-2.16%)`	⬇️
agreement/pseudonode.go	`65.43% <43.93%> (-6.60%)`	⬇️
ledger/blockqueue.go	`81.03% <0.00%> (-2.88%)`	⬇️
network/requestTracker.go	`70.25% <0.00%> (-0.87%)`	⬇️
network/wsNetwork.go	`61.09% <0.00%> (ø)`
libgoal/lockedFileLinux.go
util/db/fullfsync_darwin.go	`100.00% <0.00%> (ø)`
libgoal/lockedFileUnix.go	`0.00% <0.00%> (ø)`
util/db/dbutil.go	`39.54% <0.00%> (+0.56%)`	⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dd84666...79bb444. Read the comment docs.

cce · 2022-04-19T18:41:06Z

follow-up: this was split off and released in #2741

Fix pseudonode goroutine getting stuck

c3a2ca5

tsachiherman requested a review from a user July 19, 2021 16:31

tsachiherman self-assigned this Jul 19, 2021

tsachiherman added the Team Carbon-11 label Jul 19, 2021

tsachiherman requested a review from cce July 19, 2021 16:33

Add error handling to voteFillWorker

79bb444

ghost approved these changes Jul 19, 2021

View reviewed changes

tsachiherman merged commit 4836ea4 into algorand:feature/txnsync Jul 19, 2021

tsachiherman deleted the tsachi/aggreement_full_backlog branch July 19, 2021 18:58

cce mentioned this pull request Aug 11, 2021

transaction sync #2718

Merged

tsachiherman restored the tsachi/aggreement_full_backlog branch August 16, 2021 15:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

txnsync: avoid pseudonode execution goroutine hang #2582

txnsync: avoid pseudonode execution goroutine hang #2582

Uh oh!

tsachiherman commented Jul 19, 2021

Uh oh!

ghost commented Jul 19, 2021

Uh oh!

tsachiherman commented Jul 19, 2021

Uh oh!

codecov-commenter commented Jul 19, 2021 •

edited

Loading

Uh oh!

cce commented Apr 19, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

txnsync: avoid pseudonode execution goroutine hang #2582

txnsync: avoid pseudonode execution goroutine hang #2582

Uh oh!

Conversation

tsachiherman commented Jul 19, 2021

Summary

Test Plan

Uh oh!

ghost commented Jul 19, 2021

Uh oh!

tsachiherman commented Jul 19, 2021

Uh oh!

codecov-commenter commented Jul 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

cce commented Apr 19, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov-commenter commented Jul 19, 2021 •

edited

Loading