Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Approval-distribution: Fix out-of-view messages caused by a race condition in view updates #5089

Merged
merged 6 commits into from
Mar 14, 2022

Conversation

vstakhov
Copy link
Contributor

This issue occurs when we have received a peer's view update before our own view update. In this case, we ignore all unknown heads in the remote view and attach assignment/approval messages to a list of pending messages. However, after our view update the remote views are not updated and it leads to a situation when we report honest peers for out-of-view messages. With this patch, we assume that if we know some head and a peer has already sent us some information about this particular head (in assignment/approval MessagesPending) it also means that a peer has actually this head in it's view even if we skipped it due to the race condition between local/remote view updates.

@vstakhov vstakhov added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. labels Mar 11, 2022
@rphmeier
Copy link
Contributor

This seems to make sense, although the intention of unify_with_peer is to ensure known_by is accurate for all peers. I notice that importing of pending messages is done before unify_with_peer. This is presumably done so they can be circulated afterwards. However, the import_and_circulate calls are fairly toothless because we haven't added the peers to the known_by map for those relay-parents yet.

What I'm getting at is that it seems an alternate fix could be made by simply reversing the order of these two blocks:

for (peer_id, view) in self.peer_views.iter() {
let intersection = view.iter().filter(|h| new_hashes.contains(h));
let view_intersection = View::new(intersection.cloned(), view.finalized_number);
Self::unify_with_peer(
ctx,
&self.gossip_peers,
metrics,
&mut self.blocks,
peer_id.clone(),
view_intersection,
)
.await;
}

and

if !to_import.is_empty() {

Please add a unit test.

@vstakhov
Copy link
Contributor Author

Yes, just a reversing the order of these code blocks resolves the issue as well. Should we use this approach?

@ordian
Copy link
Member

ordian commented Mar 14, 2022

Yes, just a reversing the order of these code blocks resolves the issue as well. Should we use this approach?

Yes please. That approach is cleaner.

@vstakhov
Copy link
Contributor Author

This is done: tested on unit test and in the problematic zombienet scenario.

@ordian ordian merged commit e4e40d2 into master Mar 14, 2022
@ordian ordian deleted the vstakhov-approval-view-race branch March 14, 2022 16:07
@rphmeier
Copy link
Contributor

Thanks!

sandreim pushed a commit that referenced this pull request Mar 17, 2022
…ition in view updates (#5089)

* Try to fix out-of-view messages in approval distribution

Suggested by: @ordian

* Cargo fmt

* Add a unit test for the proposed fix

* Spelling fix

* Use a simplier approach to fix the race condition as suggested by @rphmeier

* Cargo fmt run
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants