Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/2725 #2746

Merged
merged 31 commits into from
Jul 6, 2021
Merged

Fix/2725 #2746

merged 31 commits into from
Jul 6, 2021

Conversation

jcnelson
Copy link
Member

This PR fixes a lot of issues underpinning #2725. Namely, it fixes #2701, #2728, #2730, and #2738 as well.

The gist of the changes is that we make it so that when the block inventory state-machine determines that the node is in the initial block download, and one of its bootstrap nodes NACK's its request for a PoxInv message or a BlocksInv message at a given reward cycle, it will only refresh its view of the last INV_REWARD_CYCLES reward cycles from its bootstrap peers instead of all prior reward cycles. This situation arises more often than not -- for example, whenever the anchor block is "late" relative to the PoX sync watchdog -- which leads to a significant slow-down in synchronization as the node gets closer to the testnet chain tip (which has thousands of reward cycles). In addition, the changes here make it so that if this happens, the download state machine will begin scanning the local chain state from the sortition height at the start of the highest reward cycle synchronized with the bootstrap peer (instead of the very first sortition). This significantly speeds up block downloads, because like the inventory state machine, the download state machine only processes one reward cycle at a time (and making it start right where it's bound to find new blocks makes synchronization go faster towards the chain tip).

The PR also fixes a few odds and ends, like making it so /v2/info doesn't require any I/O to complete and making it so the inventory state machine doesn't have to load the PoX bitvector on each pass (which is surprisingly slow).

I'm in the process of testing this on a local testnet follower; will report back when it's ready.

jcnelson added 19 commits June 28, 2021 14:26
…is queried while it is in the process of being invalidated
…on error from the GetBlocksInv handler, reply a NACK instead of erroring out
…ase when choosing where to start scanning for new block downloads. Also, use the PeerNetwork::with_http() interface to access the inner HTTP peer when downloading blocks and microblocks.
…t before the reward cycle it reported that the local peer diverged. Also, record this sortition height for the block downloader's consumption.
…dicates that a network error can be resolved by simply re-trying the poll loop.
…nce can be passed to the HTTP handler code. Also, do a better job at propagating hints from the inventory state-machine to the downloader as part of #2730.
…ress #2738 by way of using cached data in the PeerNetwork to construct a /v2/info response without doing any I/O
…ge-podge of fields from it, so it can be passed into the HTTP request handler
…ock of the reward cycle in which the canonical Stacks chain state starts
…s, so /v2/info acts on the *current* chain view
@gregorycoppola
Copy link
Contributor

@jcnelson Are there any tests that we should add as part of the release process in order to test this change?

@jcnelson
Copy link
Member Author

Yes -- per the blockchain engineering meeting, it's imperative that we spin up a release candidate from genesis on both mainnet and testnet and verify that it reaches the chain tip.

@jcnelson
Copy link
Member Author

@wileyj Want to take this branch for a spin?

@wileyj
Copy link
Collaborator

wileyj commented Jun 30, 2021

@jcnelson ready to second spin-up test with the commits above?

@jcnelson
Copy link
Member Author

jcnelson commented Jul 1, 2021

Not yet -- still getting stuck (investigating)

jcnelson added 4 commits July 1, 2021 20:40
…r that's diverged from us, start at either the reward cycle that contains the highest processed Stacks block, or INV_REWARD_CYCLES fewer reward cycles than the diverged reward cycle -- whichever is lower
… of either the highest processed Stacks block, or the inventory sortition height hint from the inv state machine -- whichever is lower
@jcnelson
Copy link
Member Author

jcnelson commented Jul 2, 2021

Okay, this time it managed to reach the testnet chain tip. Going to spin up from testnet genesis again to verify that it can do so without getting stuck even once.

@wileyj
Copy link
Collaborator

wileyj commented Jul 2, 2021

Okay, this time it managed to reach the testnet chain tip. Going to spin up from testnet genesis again to verify that it can do so without getting stuck even once.

👍 spinning up our node from genesis now

@gregorycoppola
Copy link
Contributor

Okay, this time it managed to reach the testnet chain tip. Going to spin up from testnet genesis again to verify that it can do so without getting stuck even once.

Nice work, @jcnelson . Thanks :).

@jcnelson
Copy link
Member Author

jcnelson commented Jul 2, 2021

Okay, my node reached the chain tip on testnet and is in steady state.

@jcnelson
Copy link
Member Author

jcnelson commented Jul 6, 2021

Both my testnet and mainnet nodes sync'ed with this patch, without issue.

Copy link
Contributor

@kantai kantai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, but the allowed warning should be re-enabled.

@jcnelson
Copy link
Member Author

jcnelson commented Jul 6, 2021

This looks good to me, but the allowed warning should be re-enabled.

As in 2a1e8f5?

Copy link
Contributor

@gregorycoppola gregorycoppola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Jude

I was talking a look at this PR to review.

Even though I lack a lot of context, I was planning to just review based on the testing.

I was surprised to see that no tests were affected, or added.

Is it possible to add any unit tests that would exercise some of the new behavior? Perhaps this could be in a future PR.

Otherwise, can we document in the PR description how this was tested and why we think that is sufficient?

@jcnelson
Copy link
Member Author

jcnelson commented Jul 6, 2021

In terms of lines changed, the bulk of the delta in this PR comes from refactoring. Namely, we now wrap access to the HTTP network state with the new PeerNetwork::with_http() method, and we avoid reloading PoX and burnchain state in the RPC handler and the inventory state machine if we don't need to. The existing test battery already covers both areas of the codebase.

The truly new functionality -- which is simply concerned with making sure the block downloader will start downloading blocks at the right reward cycle -- is really only testable against a live system. You'd need a live Bitcoin node and a Stacks node that not only has all the blocks, but also will occasionally disconnect from bootstrapping nodes and render them in a state where they have no choice but to process reward cycles for which they don't have the anchor block. You'd also need to instrument the Stacks node so that it can be stopped and restarted while it is bootstrapping (clearing its inventory state), in order to verify that the node does not spend an unreasonable amount of time bootstrapping. Implementing such an integration test that faithfully simulates this condition would not only take more work than this PR, but also would be redundant -- this environment is exactly what the live testnet is supposed to provide. The release procedure explicitly calls for booting up a new node from genesis on both testnet and mainnet in order to give PRs like this their due testing.

Copy link
Contributor

@gregorycoppola gregorycoppola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcnelson If you think this has had the full test-suite run on it then LGTM.

@jcnelson
Copy link
Member Author

jcnelson commented Jul 6, 2021

The tests are failing is because bitcoin.org is down right now. However, it previously passed with the current patches with the exception of only size_overflow_unconfirmed_stream_microblocks_integration_test, which passed locally for me.

@blockstack-devops
Copy link
Contributor

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@stacks-network stacks-network locked as resolved and limited conversation to collaborators Nov 21, 2024
@wileyj wileyj deleted the fix/2725 branch March 11, 2025 21:36
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants