Fix core sharing and make use of scheduling_lookahead #4724
Conversation
Co-authored-by: alindima <alin@parity.io>
Ok, I don't get how this code fixes anything. The only place that is now fully taking advantage of the claim queue is prospective-parachains, but that should not help much if we only ever provide collations for claim queue depth 0.
What am I missing?
polkadot/node/network/collator-protocol/src/collator_side/mod.rs
polkadot/node/network/collator-protocol/src/validator_side/mod.rs
Ok. It likely works because `max_ancestry_len` is long enough that your collation will still be valid at the next occurrence in the claim queue. This is why the change in prospective-parachains has an effect.
Correct. Still, we don't seem to need an `allowed_ancestry_len` larger than 2. I tested with this value and 4 parachains sharing a core with 1.8-second candidates, and it works in zombienet.

@eskimor at this moment, we don't have any collator that builds collations for more than one relay chain block in advance (lookahead and slot-based don't do this), so adding any change here that takes the full claim queue into account would have no effect. Collators search for an availability core where the para id is scheduled at this block or at the very next block. That's why the fixes in this PR (besides the one in prospective-parachains) don't need the claim queue.

The only place where the claim queue makes a difference is indeed in prospective-parachains. There, we need to not drop candidates for a para which are not scheduled for the next core but may be scheduled for the one after that.
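The depth-0-versus-full-claim-queue distinction discussed above can be sketched as follows. This is a minimal illustration with hypothetical type aliases and helper names, not the actual subsystem types:

```rust
use std::collections::{BTreeMap, VecDeque};

// Hypothetical stand-ins for the runtime's CoreIndex / ParaId types.
type CoreIndex = u32;
type ParaId = u32;

/// Depth-0-only view: the para must be at the front of some core's claim queue.
/// This is roughly what collators check today.
fn scheduled_now(cq: &BTreeMap<CoreIndex, VecDeque<ParaId>>, para: ParaId) -> bool {
    cq.values().any(|q| q.front() == Some(&para))
}

/// Full-claim-queue view: the para may occupy any future slot on some core.
/// Prospective-parachains needs this so it doesn't drop candidates
/// scheduled one occurrence later.
fn scheduled_at_any_depth(cq: &BTreeMap<CoreIndex, VecDeque<ParaId>>, para: ParaId) -> bool {
    cq.values().any(|q| q.contains(&para))
}

fn main() {
    // Core 0 shared by paras 2000 and 2001: claim queue [2000, 2001, 2000].
    let mut cq = BTreeMap::new();
    cq.insert(0, VecDeque::from(vec![2000, 2001, 2000]));
    assert!(scheduled_now(&cq, 2000));
    // 2001 is not at depth 0...
    assert!(!scheduled_now(&cq, 2001));
    // ...but it is still scheduled on the core, just one block later.
    assert!(scheduled_at_any_depth(&cq, 2001));
}
```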
Discussed offline with @eskimor. So far, the PR fixed core sharing but did not implement the changes required for collators to be able to build collations ahead of time.
`collator-protocol` and `prospective-parachains`
We need to ensure fairness between paras. This can be a follow-up though.
```diff
 	};
 
-	if assigned_para_id != candidate_para_id {
+	if !assigned_paras.contains(&candidate_para_id) {
```
Not introduced here, but it would be good to clear up. This function is called `core_index_from_statement`, so I would not expect it to do any validation. Do we have a better place to do this check?
I thought about this and we could move this bit of validation to the caller, but I think that hurts readability. `core_index_from_statement` is already doing several validation steps.
polkadot/node/network/collator-protocol/src/validator_side/mod.rs
```diff
@@ -293,7 +293,7 @@ impl Collations {
 	} else {
 		1
 	};
-	self.seconded_count < seconded_limit
+	self.seconded_count >= seconded_limit
```
Note: This limit is based on `max_candidate_depth`, but it is not per para (despite the docs suggesting so). Maybe we still want a global (for all paras) limit, but it certainly is not enough: a global limit only protects the validator, not parachains from each other.
For the latter we need to ensure fairness in collation fetching:
- Paras should get fetches proportional to how often they occur in the claim queue. E.g. for a claim queue [A, B, A], A can have twice as many fetches as B.
- If for two paras the weighted fetch count (actual fetches / claim queue occurrences) is equal, prefer the para that is higher up in the claim queue, as it is more urgent.
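The two fairness rules above could be sketched roughly like this. This is an illustrative model with made-up names and plain integer para ids, not the actual collator-protocol code:

```rust
use std::collections::{HashMap, HashSet};

/// Pick the next para to fetch a collation for:
/// lowest weighted fetch count (fetches / claim queue occurrences) wins,
/// ties broken by the earliest (most urgent) claim queue position.
fn next_fetch(claim_queue: &[u32], fetched: &HashMap<u32, u32>) -> Option<u32> {
    let mut best: Option<(f64, usize, u32)> = None;
    let mut seen = HashSet::new();
    for (pos, &para) in claim_queue.iter().enumerate() {
        if !seen.insert(para) {
            continue; // evaluate each para once, at its most urgent position
        }
        let occurrences = claim_queue.iter().filter(|&&p| p == para).count() as f64;
        let weighted = f64::from(*fetched.get(&para).unwrap_or(&0)) / occurrences;
        if best.map_or(true, |(bw, bp, _)| weighted < bw || (weighted == bw && pos < bp)) {
            best = Some((weighted, pos, para));
        }
    }
    best.map(|(_, _, para)| para)
}

fn main() {
    // Claim queue [A, B, A] with A = 1000, B = 2000.
    let cq = [1000u32, 2000, 1000];
    let mut fetched = HashMap::new();
    // Both start at weighted count 0; A sits higher in the queue, so A goes first.
    assert_eq!(next_fetch(&cq, &fetched), Some(1000));
    fetched.insert(1000, 1);
    // A is now at 1/2 = 0.5, B at 0/1 = 0: fetch B next.
    assert_eq!(next_fetch(&cq, &fetched), Some(2000));
    fetched.insert(2000, 1);
    // A at 0.5, B at 1.0: back to A — two A fetches per B, as intended.
    assert_eq!(next_fetch(&cq, &fetched), Some(1000));
}
```

Over a full rotation this yields the A, B, A pattern, i.e. fetches proportional to claim queue occurrences.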
Yes, I decided to leave the fairness bit as a follow-up.
I have some thoughts on this, I'll add them to #1797
polkadot/node/network/collator-protocol/src/validator_side/mod.rs
```diff
@@ -31,7 +31,7 @@ const LOG_TARGET: &'static str = "parachain::subsystem-util-vstaging";
 
 /// A snapshot of the runtime claim queue at an arbitrary relay chain block.
 #[derive(Default)]
-pub struct ClaimQueueSnapshot(BTreeMap<CoreIndex, VecDeque<ParaId>>);
+pub struct ClaimQueueSnapshot(pub BTreeMap<CoreIndex, VecDeque<ParaId>>);
```
Why do we need this `pub`? I mean, as long as we are not keeping any invariants via the wrapper, I don't see a problem, but is it needed?
There aren't any invariants kept, and you suggested using the `VecDeque` directly: #4724 (comment)
Instead of adding more methods that look similar to each other, I decided there's no reason why we can't expose the inner type.
Nit: alternatively, you could use `derive_more::Deref` for read-only access.
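For reference, `derive_more::Deref` is an external crate; the derive would expand to roughly the manual, std-only impl below, which grants read-only access to the map while keeping the field private. This is a sketch of the alternative being suggested, not the PR's code:

```rust
use std::collections::{BTreeMap, VecDeque};
use std::ops::Deref;

// Hypothetical stand-ins for the runtime's CoreIndex / ParaId types.
type CoreIndex = u32;
type ParaId = u32;

/// The wrapper stays opaque; `Deref` grants read-only access to the inner map,
/// roughly what `#[derive(derive_more::Deref)]` would generate.
#[derive(Default)]
pub struct ClaimQueueSnapshot(BTreeMap<CoreIndex, VecDeque<ParaId>>);

impl Deref for ClaimQueueSnapshot {
    type Target = BTreeMap<CoreIndex, VecDeque<ParaId>>;
    fn deref(&self) -> &Self::Target {
        &self.0
    }
}

fn main() {
    let mut inner = BTreeMap::new();
    inner.insert(0, VecDeque::from(vec![2000, 2001]));
    let snapshot = ClaimQueueSnapshot(inner);
    // Callers get the full read-only BTreeMap API through deref coercion...
    assert_eq!(snapshot.get(&0).map(|q| q.len()), Some(2));
    // ...but cannot mutate the map through the wrapper (no DerefMut impl).
}
```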
It was too high for CI machines.
```
# assign core 0 to be shared by all paras.
validator-0: js-script ./assign-core.js with "0,2000,14400,2001,14400,2002,14400,2003,14400" return is 0 within 600 seconds

collator-2000: reports block height is at least 6 within 200 seconds
```
With these high timeouts and low block heights, aren't we at a high risk of the test passing accidentally?
Each parachain should get one block every 4 relay chain blocks, so one every 24 seconds (which amounts to about 144 seconds for 6 blocks). An extra minute is needed for the session change (we're registering the parachains manually, because zombienet would otherwise also assign cores). Just to be sure, I ran this test against the code on master, and it fails.
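The arithmetic above can be checked with a tiny sketch, assuming 6-second relay chain blocks (as implied by the 24-second figure in the comment):

```rust
/// Expected time for one parachain to reach a given block height when
/// `n_paras` share a single core: each para gets one block every
/// `n_paras` relay chain blocks. Illustrative calculation only.
fn expected_secs(n_paras: u64, relay_block_secs: u64, target_height: u64) -> u64 {
    n_paras * relay_block_secs * target_height
}

fn main() {
    // 4 paras on one core, 6-second relay blocks, target height 6:
    // one para block every 24s, so 144s — inside the 200s test timeout.
    assert_eq!(expected_secs(4, 6, 6), 144);
}
```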
Alright, makes sense, then the timeouts aren't that high. Thank you!
```diff
@@ -1029,7 +1030,7 @@ fn core_index_from_statement(
 	?group_rotation_info,
 	?statement,
 	?validator_to_group,
-	n_cores = ?cores.len(),
+	n_cores,
```
Not for this PR, but once we get rid of the legacy stuff, I think we should make e.g. this iterator accessible. That should enable us to get rid of this parameter as well (and of the fetching of the cores to begin with).
Thank you! Great work!
Implements most of paritytech#1797.

Core sharing (two or more parachains scheduled on the same core with the same `PartsOf57600` value) was not working correctly. The expected behaviour is to have a Backed and an Included event in each block for the paras sharing the core, with the paras taking turns. E.g. for two paras we expect: Backed(a); Included(a)+Backed(b); Included(b)+Backed(a); etc. Instead, each block contains just one event and there are a lot of gaps (blocks without events) during the session.

Core sharing should also work when collators are building collations ahead of time.

TODOs:
- [x] Add a zombienet test verifying that the behaviour mentioned above works.
- [x] prdoc

---------

Co-authored-by: alindima <alin@parity.io>
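The expected alternation of events for paras sharing a core can be sketched as a small round-robin simulation (illustrative only, not runtime code):

```rust
/// Simulate the expected per-block Backed/Included events when `paras`
/// take turns on one shared core, one candidate per relay chain block.
fn core_sharing_events(paras: &[&str], blocks: usize) -> Vec<String> {
    let mut events = Vec::new();
    for block in 0..blocks {
        let cur = paras[block % paras.len()];
        if block == 0 {
            // First block: nothing is pending availability yet.
            events.push(format!("Backed({cur})"));
        } else {
            // Each later block includes the previous para's candidate
            // and backs the next para's candidate.
            let prev = paras[(block - 1) % paras.len()];
            events.push(format!("Included({prev})+Backed({cur})"));
        }
    }
    events
}

fn main() {
    let ev = core_sharing_events(&["a", "b"], 3);
    // Matches the pattern described above.
    assert_eq!(ev, vec!["Backed(a)", "Included(a)+Backed(b)", "Included(b)+Backed(a)"]);
}
```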