Make `SyncingStrategy` abstract and allow developers to customize it #5333

nazar-pc · 2024-08-13T08:00:35Z

Is there an existing issue?

I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

This is not a support question.

Motivation

Right now SyncingStrategy is created in the constructor of SyncingEngine and the whole thing assumes there are just 3 syncing strategies that transition like this:

polkadot-sdk/substrate/client/network/sync/src/strategy.rs

Lines 520 to 522 in 6619277

    
           /// Proceed with the next strategy if the active one finished. 
        
           pub fn proceed_to_next(&mut self) -> Result<(), ClientError> { 
        
           	// The strategies are switched as `WarpSync` -> `StateStrategy` -> `ChainSync`.

This is a problem for chains that both don't have/use Warp sync and/or have custom sync strategies (like Subspace/Autonomys does, we have two custom sync strategies already).

Assumption of existence of certain strategies makes it impossible to implement custom ones without forking Substrate and injecting custom logic in strategic places, which is hard to maintain and fragile long-term. It also frequently results in unexpected side-effects like #4607, #4407 and more.

Specific examples of strategies we have already implemented with hacks in Subspace/Autonomys:

Sync from DSN (Distributed Storage Network): we download and decode most of the blocks from archival history, import them sequentially before switching back to regular chain sync strategy
Snap sync: we download the first block of the last segment of archived history from DSN, download and verify corresponding state from one of the nodes, import block with corresponding state before switching back to regular chain sync (similar to Warp sync, except we do not intend to ever sync the gap afterwards)

Request

Make it possible to define custom sync strategies and transition sequence without forking

Solution

Probably the first step would be to extract a trait out of existing SyncingStrategy, then probably extract different strategies out of SyncingStrategy and make them composable, finally expose a way to configure custom syncing strategy instead of what is used by default.

Not sure what to do with build_network, it contains significant amount of boilerplate and while could be copy-pasted and modified in chain-specific codebase, it feels like there should be a better solution.

Open to suggestions on how to approach this so we can make progress and reduce the diff with upstream in our fork.

Examples of hacks that we had to introduce downstream to implement custom sync protocols:

Temporarily pausing Substrate's sync so it doesn't attempt to catch up while we do progress towards the same objective with our custom sync protocol, avoiding unnecessary network usage, errors and eventually banning due to Node can send multiple same block requests while syncing from other nodes #531 / ChainSync requesting the same block multiple times #1915: autonomys@4670a2c
API for sync strategy restart: autonomys@9897ffa
API for force-clearing block gap: autonomys@d537512
Expose state sync internals in order to be able to make the same requests ourselves: autonomys@de6a7e3

These are all fragile workarounds that we'd really like to get rid of sooner rather than later.

cc @shamil-gadelshin

Are you willing to help with this request?

Yes!

The text was updated successfully, but these errors were encountered:

nazar-pc · 2024-08-15T14:56:03Z

@dmitry-markin @skunert any opinion here?

dmitry-markin · 2024-08-16T07:53:12Z

Thanks for bringing this up! Actually, I was also thinking about composing syncing strategies instead of hardcoding transitions in proceed_to_next when I was working on this. This is something we need anyway when we introduce more syncing strategies after extracting gap sync and adding Sync 2.0. So, I would really appreciate your help here.

One thing to keep in mind is that some strategies should be able to run in parallel. This is the case for gap sync we have now (it was planned we extract it from ChainSync as a parallel sync startegy, and some prep refactoring was already done). This is also the case for planned polkadot-sdk Sync 2.0 strategy, which should run in parallel to ChainSync (Sync 1.0) until we completely phase out Sync 1.0.

Another thing that should be dealt with along the way is that currently the strategies use custom actions to communicate with protocols. This should be replaced with a possibility for a strategy to communicate with "its" protocols using opaque requests/responses and subscribing to opaque notifications.

Your proposed solution seems sound to me.

As for build_network issue, I don't have any thoughts right now, but we can plan this refactoring as well.

Overall, this is definitely the thing that should be implemented.

nazar-pc · 2024-08-16T11:21:59Z

Okay, so we are thinking in about the same direction then, which is good.

The first issue I see for composition is that everything is tied together through layers instead of being somewhat independent. Specifically different strategies already have their own implementations, but the actions they produce are proxied through SyncingStragegy into SyncingEngine and then results back through a bunch of purpose-built methods.

What I think would solve that is to have each strategy their own async fn run() -> FinishResult and make them own things like NetworkServiceHandle or ImportQueueService that they need for their operation. Then SyncingStragegy could become that first aggregator that simply drives a list of provided strategies to their completion and one one of them completes it can make a decision to exit or to start another strategy, etc.

Essentially individual strategies should become more or less opaque to the caller, caller should understand their internal operation or forward their requests/responses back and forth.

This is essentially how our custom strategies in Subspace protocol work, they own handles to their dependencies and interact with networking or other things as they see fit without Substrate having any idea about it. We'd be able to wrap those in structs and implement strategy trait in the future.

Does it make sense?

dmitry-markin · 2024-08-16T12:08:10Z

Let me think if we can make strategies completely opaque. May be we will need to manage strategies using the same protocol in some smart way. For example, if different strategies request the same blocks via block request protocol, the peer will be banned.

We also tried to issue max 1 in-flight request at a time to every peer at some point, but I haven't found this being enforced now. Probably we lifted this constraint, and don't need to take it into account anymore.

Do you have any thoughts what else (except async fn run()) should be in the strategy trait?

nazar-pc · 2024-08-16T12:37:35Z

Methods like is_major_syncing, num_peers, probably status, report_metrics, etc.

Not yet sure if there needs to be some more direct communication between lower and higher level strategies, but at least on the top level I don't think too many are necessary.

Well, I guess it might be more like spawning a worker for strategy and having a handle to interact with it, kind of like SyncingEngine vs SyncingService already split the API between them.

conr2d · 2024-08-17T09:25:55Z

I'm also interested in this issue regarding the potential implementation of a PoW chain using Substrate. Since there's no finalized state in a PoW context, the existing fast sync mechanism may not work effectively. However, if an abstraction of SyncingStrategy is introduced, it could enable syncing a PoW chain to a specific block number, similar to how Eth1 handles synchronization with a pivot point.

nazar-pc · 2024-08-20T20:56:38Z

I have tried different approaches and found something that is somewhat decent for the first step.

After #5410 I have another step before the actual abstraction.

New trait is introduced and initially implemented for both SyncingStrategy and ChainSync (other sync implementations are not really complete sync strategies and can't be used on their own, but ChainSync can be!).

There is still plenty of things to untangle, but I need to get reviews and submit things in sequence, so will appreciate reasonably quick turnaround with PRs if possible.

nazar-pc · 2024-08-22T18:36:44Z

I'd like to make more progress here, but can't open PRs until #5364, #5410 and #5431 are in. Let me know if there is anything I can do to move things forward faster.

@dmitry-markin

This PR untangles syncing metrics and makes them reactive, the way metrics are supposed to be in general. Syncing metrics were bundled in a way that caused coupling across multiple layers: justifications metrics were defined and managed by `ChainSync`, but only updated periodically on tick in `SyncingEngine`, while actual values were queried from `ExtraRequests`. This convoluted architecture was hard to follow when I was looking into #5333. Now metrics that correspond to each component are owned by that component and updated as changes are made instead of on tick every 1100ms. This does add some annoying boilerplate that is a bit harder to maintain, but it separates metrics more nicely and if someone queries them more frequently will give arbitrary resolution. Since metrics updates are just atomic operations I do not expect any performance impact of these changes. Will add prdoc if changes look good otherwise. P.S. I noticed that importing requests (and corresponding metrics) were not cleared ever since corresponding code was introduced in dc41558#r145518721 and I left it as is to not change the behavior, but it might be something worth fixing. cc @dmitry-markin --------- Co-authored-by: Dmitry Markin <dmitry@markin.tech>

As I was looking at the coupling between `SyncingEngine`, `SyncingStrategy` and individual strategies I noticed a few things that were unused, redundant or awkward. The awkward change comes from paritytech/substrate#13700 where `num_connected_peers` property was added to `SyncStatus` struct just so it can be rendered in the informer. While convenient, the property didn't really belong there and was annoyingly set to `0` in some strategies and to `num_peers` in others. I have replaced that with a property on `SyncingService` that already stored necessary information internally. Also `ExtendedPeerInfo` didn't have a working `Clone` implementation due to lack of perfect derive in Rust and while I ended up not using it in the refactoring, I included fixed implementation for it in this PR anyway. While these changes are not strictly necessary for #5333, they do reduce coupling of syncing engine with syncing strategy, which I thought is a good thing. Reviewing individual commits will be the easiest as usual. --------- Co-authored-by: Dmitry Markin <dmitry@markin.tech>

nazar-pc · 2024-08-29T06:18:17Z

Another PR with initial SyncingStrategy trait introduction: #5469

nazar-pc · 2024-09-06T04:24:31Z

Have another set of changes that allows to use custom syncing strategies on top of #5469 that I can't open as a PR before #5469 is in.

With that syncing_strategy is an argument of build_network and any strategy can be used instead of the default one, all of the protocols are added by the strategy constructor, so they are under full control of the developer now. Also StrategyKey is an opaque type instead of an enum with a finite set of options, so custom strategies can define their own keys if they need to without modifying Substrate.

The next step will be to make SyncingEngine unaware of different sync types by consolidating different request and response handling methods. For example to remove SyncingStrategy::on_warp_proof_response() since it is too specific and have something generic instead (especially since we already have StrategyKey to distinguish what purpose each response has), but I'm not there yet and will wait for other changes to go in first.

nazar-pc · 2024-09-06T16:22:31Z

I think I finally found a good pattern to remove block/state/warp requests/responses handling from SyncingStrategy public interface. Existing SyncingAction::CancelRequest will simply be accompanied by SyncingAction::StartRequest and responses are all generic and up to strategy to decode/interpret. I quite like how things are looking so far.

This feature is helpful for us with custom sync protocol that is similar to Warp sync except we do not ever sync the gap and don't want it to exist in the first place (see #5333 and its references for motivation). Otherwise we had to resort to this: autonomys@d537512 --------- Co-authored-by: Davide Galassi <davxy@datawok.net>

This is a step towards #5333 The commits with code moving (but no other changes) and actual changes are separated for easier review. Essentially this results in `SyncingStrategy` trait replacing struct (which is renamed to `PolkadotSyncingStrategy`, open for better name suggestions). Technically it is already possible to replace `PolkadotSyncingStrategy<B, Client>` with `Box<dyn SyncingStrategy<B>` in syncing engine, but I decided to postpone such change until we actually have an ability to customize it. It should also be possible to swap `PolkadotSyncingStrategy` with just `ChainSync` that only supports regular full sync from genesis (it also implements `SyncingStrategy` trait). While extracted trait still has a lot of non-generic stuff in it like exposed knowledge of warp sync and `StrategyKey` with hardcoded set of options, I believe this is a step in the right direction and will address those in upcoming PRs. With #5431 that landed earlier warp sync configuration is more straightforward, but there are still numerous things interleaved that will take some time to abstract away nicely and expose in network config for developers. For now this is an internal change even though data structures are technically public and require major version bump.

This is a step towards paritytech#5333 The commits with code moving (but no other changes) and actual changes are separated for easier review. Essentially this results in `SyncingStrategy` trait replacing struct (which is renamed to `PolkadotSyncingStrategy`, open for better name suggestions). Technically it is already possible to replace `PolkadotSyncingStrategy<B, Client>` with `Box<dyn SyncingStrategy<B>` in syncing engine, but I decided to postpone such change until we actually have an ability to customize it. It should also be possible to swap `PolkadotSyncingStrategy` with just `ChainSync` that only supports regular full sync from genesis (it also implements `SyncingStrategy` trait). While extracted trait still has a lot of non-generic stuff in it like exposed knowledge of warp sync and `StrategyKey` with hardcoded set of options, I believe this is a step in the right direction and will address those in upcoming PRs. With paritytech#5431 that landed earlier warp sync configuration is more straightforward, but there are still numerous things interleaved that will take some time to abstract away nicely and expose in network config for developers. For now this is an internal change even though data structures are technically public and require major version bump.

# Description Follow-up to #5469 and mostly covering #5333. The primary change here is that syncing strategy is no longer created inside of syncing engine, instead syncing strategy is an argument of syncing engine, more specifically it is an argument to `build_network` that most downstream users will use. This also extracts addition of request-response protocols outside of network construction, making sure they are physically not present when they don't need to be (imagine syncing strategy that uses none of Substrate's protocols in its implementation for example). This technically allows to completely replace syncing strategy with whatever strategy chain might need. There will be at least one follow-up PR that will simplify `SyncingStrategy` trait and other public interfaces to remove mentions of block/state/warp sync requests, replacing them with generic APIs, such that strategies where warp sync is not applicable don't have to provide dummy method implementations, etc. ## Integration Downstream projects will have to write a bit of boilerplate calling `build_polkadot_syncing_strategy` function to create previously default syncing strategy. ## Review Notes Please review PR through individual commits rather than the final diff, it will be easier that way. The changes are mostly just moving code around one step at a time. # Checklist * [x] My PR includes a detailed description as outlined in the "Description" and its two subsections above. * [x] My PR follows the [labeling requirements]( https://github.com/paritytech/polkadot-sdk/blob/master/docs/contributor/CONTRIBUTING.md#Process ) of this project (at minimum one label for `T` required) * External contributors: ask maintainers to put the right label on your PR. * [x] I have made corresponding changes to the documentation (if applicable)

This feature is helpful for us with custom sync protocol that is similar to Warp sync except we do not ever sync the gap and don't want it to exist in the first place (see paritytech#5333 and its references for motivation). Otherwise we had to resort to this: d537512 --------- Co-authored-by: Davide Galassi <davxy@datawok.net> (cherry picked from commit 030cb4a)

This is a step towards paritytech#5333 The commits with code moving (but no other changes) and actual changes are separated for easier review. Essentially this results in `SyncingStrategy` trait replacing struct (which is renamed to `PolkadotSyncingStrategy`, open for better name suggestions). Technically it is already possible to replace `PolkadotSyncingStrategy<B, Client>` with `Box<dyn SyncingStrategy<B>` in syncing engine, but I decided to postpone such change until we actually have an ability to customize it. It should also be possible to swap `PolkadotSyncingStrategy` with just `ChainSync` that only supports regular full sync from genesis (it also implements `SyncingStrategy` trait). While extracted trait still has a lot of non-generic stuff in it like exposed knowledge of warp sync and `StrategyKey` with hardcoded set of options, I believe this is a step in the right direction and will address those in upcoming PRs. With paritytech#5431 that landed earlier warp sync configuration is more straightforward, but there are still numerous things interleaved that will take some time to abstract away nicely and expose in network config for developers. For now this is an internal change even though data structures are technically public and require major version bump. (cherry picked from commit 1f1f20a)

# Description Follow-up to paritytech#5469 and mostly covering paritytech#5333. The primary change here is that syncing strategy is no longer created inside of syncing engine, instead syncing strategy is an argument of syncing engine, more specifically it is an argument to `build_network` that most downstream users will use. This also extracts addition of request-response protocols outside of network construction, making sure they are physically not present when they don't need to be (imagine syncing strategy that uses none of Substrate's protocols in its implementation for example). This technically allows to completely replace syncing strategy with whatever strategy chain might need. There will be at least one follow-up PR that will simplify `SyncingStrategy` trait and other public interfaces to remove mentions of block/state/warp sync requests, replacing them with generic APIs, such that strategies where warp sync is not applicable don't have to provide dummy method implementations, etc. ## Integration Downstream projects will have to write a bit of boilerplate calling `build_polkadot_syncing_strategy` function to create previously default syncing strategy. ## Review Notes Please review PR through individual commits rather than the final diff, it will be easier that way. The changes are mostly just moving code around one step at a time. # Checklist * [x] My PR includes a detailed description as outlined in the "Description" and its two subsections above. * [x] My PR follows the [labeling requirements]( https://github.com/paritytech/polkadot-sdk/blob/master/docs/contributor/CONTRIBUTING.md#Process ) of this project (at minimum one label for `T` required) * External contributors: ask maintainers to put the right label on your PR. * [x] I have made corresponding changes to the documentation (if applicable) (cherry picked from commit 43cd6fd)

# Description This is a continuation of #5666 that finally fixes #5333. This should allow developers to create custom syncing strategies or even the whole syncing engine if they so desire. It also moved syncing engine creation and addition of corresponding protocol outside `build_network_advanced` method, which is something Bastian expressed as desired in #5 (comment) Here I replaced strategy-specific types and methods in `SyncingStrategy` trait with generic ones. Specifically `SyncingAction` is now used by all strategies instead of strategy-specific types with conversions. `StrategyKey` was an enum with a fixed set of options and now replaced with an opaque type that strategies create privately and send to upper layers as an opaque type. Requests and responses are now handled in a generic way regardless of the strategy, which reduced and simplified strategy API. `PolkadotSyncingStrategy` now lives in its dedicated module (had to edit .gitignore for this) like other strategies. `build_network_advanced` takes generic `SyncingService` as an argument alongside with a few other low-level types (that can probably be extracted in the future as well) without any notion of specifics of the way syncing is actually done. All the protocol and tasks are created outside and not a part of the network anymore. It still adds a bunch of protocols like for light client and some others that should eventually be restructured making `build_network_advanced` just building generic network and not application-specific protocols handling. ## Integration Just like #5666 introduced `build_polkadot_syncing_strategy`, this PR introduces `build_default_block_downloader`, but for convenience and to avoid typical boilerplate a simpler high-level function `build_default_syncing_engine` is added that will take care of creating typical block downloader, syncing strategy and syncing engine, which is what most users will be using going forward. `build_network` towards the end of the PR was renamed to `build_network_advanced` and `build_network`'s API was reverted to pre-#5666, so most users will not see much of a difference during upgrade unless they opt-in to use new API. ## Review Notes For `StrategyKey` I was thinking about using something like private type and then storing `TypeId` inside instead of a static string in it, let me know if that would preferred. The biggest change happened to requests that different strategies make and how their responses are handled. The most annoying thing here is that block response decoding, in contrast to all other responses, is dependent on request. This meant request had to be sent throughout the system. While originally `Response` was `Vec<u8>`, I didn't want to re-encode/decode request and response just to fit into that API, so I ended up with `Box<dyn Any + Send>`. This allows responses to be truly generic and each strategy will know how to downcast it back to the concrete type when handling the response. Import queue refactoring was needed to move `SyncingEngine` construction out of `build_network` that awkwardly implemented for `SyncingService`, but due to `&mut self` wasn't usable on `Arc<SyncingService>` for no good reason. `Arc<SyncingService>` itself is of course useless, but refactoring to replace it with just `SyncingService` was unfortunately rejected in #5454 As usual I recommend to review this PR as a series of commits instead of as the final diff, it'll make more sense that way. # Checklist * [x] My PR includes a detailed description as outlined in the "Description" and its two subsections above. * [x] My PR follows the [labeling requirements]( https://github.com/paritytech/polkadot-sdk/blob/master/docs/contributor/CONTRIBUTING.md#Process ) of this project (at minimum one label for `T` required) * External contributors: ask maintainers to put the right label on your PR. * [x] I have made corresponding changes to the documentation (if applicable)

nazar-pc added the I5-enhancement An additional feature request. label Aug 13, 2024

nazar-pc mentioned this issue Aug 13, 2024

Allow to disable gap creation during block import #5343

Merged

nazar-pc mentioned this issue Aug 19, 2024

Reactive syncing metrics #5410

Merged

nazar-pc mentioned this issue Aug 23, 2024

Sync status refactoring #5450

Merged

nazar-pc mentioned this issue Aug 26, 2024

Syncing strategy refactoring #5469

Merged

nazar-pc mentioned this issue Sep 10, 2024

Syncing strategy refactoring (part 2) #5666

Merged

3 tasks

nazar-pc mentioned this issue Sep 17, 2024

Syncing strategy refactoring (part 3) #5737

Merged

3 tasks

dmitry-markin closed this as completed in #5737 Nov 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make `SyncingStrategy` abstract and allow developers to customize it #5333

Make `SyncingStrategy` abstract and allow developers to customize it #5333

nazar-pc commented Aug 13, 2024 •

edited

Loading

nazar-pc commented Aug 15, 2024

dmitry-markin commented Aug 16, 2024 •

edited

Loading

nazar-pc commented Aug 16, 2024

dmitry-markin commented Aug 16, 2024

nazar-pc commented Aug 16, 2024

conr2d commented Aug 17, 2024

nazar-pc commented Aug 20, 2024

nazar-pc commented Aug 22, 2024

nazar-pc commented Aug 29, 2024

nazar-pc commented Sep 6, 2024

nazar-pc commented Sep 6, 2024

Make SyncingStrategy abstract and allow developers to customize it #5333

Make SyncingStrategy abstract and allow developers to customize it #5333

Comments

nazar-pc commented Aug 13, 2024 • edited Loading

Is there an existing issue?

Experiencing problems? Have you tried our Stack Exchange first?

Motivation

Request

Solution

Are you willing to help with this request?

nazar-pc commented Aug 15, 2024

dmitry-markin commented Aug 16, 2024 • edited Loading

nazar-pc commented Aug 16, 2024

dmitry-markin commented Aug 16, 2024

nazar-pc commented Aug 16, 2024

conr2d commented Aug 17, 2024

nazar-pc commented Aug 20, 2024

nazar-pc commented Aug 22, 2024

nazar-pc commented Aug 29, 2024

nazar-pc commented Sep 6, 2024

nazar-pc commented Sep 6, 2024

Make `SyncingStrategy` abstract and allow developers to customize it #5333

Make `SyncingStrategy` abstract and allow developers to customize it #5333

nazar-pc commented Aug 13, 2024 •

edited

Loading

dmitry-markin commented Aug 16, 2024 •

edited

Loading