
Collation fetching fairness #4880

Open
wants to merge 89 commits into base: master
Conversation

tdimitrov
Contributor

@tdimitrov tdimitrov commented Jun 26, 2024

Related to #1797

When fetching collations in the collator protocol validator side, we need to ensure that each parachain gets a fair share of core time based on its assignments in the claim queue. This means that the number of collations fetched per parachain should ideally be equal to (and definitely not bigger than) the number of claims for that parachain in the claim queue.

The current implementation doesn't guarantee such fairness. For each relay parent there is a waiting_queue (PerRelayParent -> Collations -> waiting_queue) which holds any unfetched collations advertised to the validator. The collations are fetched on a first-in, first-out basis, which means that if two parachains share a core and one of them is more aggressive, it might starve the other. How? At each relay parent up to max_candidate_depth candidates are accepted (enforced in fn is_seconded_limit_reached), so if one of the parachains is quick enough to fill the queue with its advertisements, the validator will never fetch anything from the rest of the parachains even though they are scheduled. This doesn't mean that the aggressive parachain will occupy all the core time (that is guaranteed by the runtime), but it will prevent the rest of the parachains sharing the same core from having collations backed.

The solution I am proposing extends the checks in is_seconded_limit_reached: fetches and advertisements are limited based on the state of the claim queue. At each relay parent the claim queue for the core assigned to the validator is fetched. For each parachain a fetch limit is calculated (equal to the number of its entries in the claim queue). No more advertisements are accepted for a parachain which has exceeded its claims in the claim queue. This solves the problem of aggressive parachains advertising too many collations.

The second part is the collation fetching logic. The validator keeps track of which collations it has fetched so far. When a new collation needs to be fetched, instead of popping the first entry from the waiting_queue, the validator examines the claim queue and looks for the earliest claim which hasn't got a corresponding fetch. This way the validator always tries to prioritise the most urgent entries.
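The selection rule described above can be sketched as follows. This is a minimal illustration, not the actual PR code: the type alias and function name are hypothetical, and the real implementation works over the advertisement queues rather than plain counters.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the real `ParaId` type.
type ParaId = u32;

/// Walk the claim queue front to back and return the para of the earliest
/// claim that doesn't yet have a corresponding fetch. Returns `None` when
/// every claim is already covered by a fetched collation.
fn next_para_to_fetch(
    claim_queue: &[ParaId],
    fetched_per_para: &HashMap<ParaId, usize>,
) -> Option<ParaId> {
    let mut claims_seen: HashMap<ParaId, usize> = HashMap::new();
    for para in claim_queue {
        let so_far = claims_seen.entry(*para).or_insert(0);
        *so_far += 1;
        // A claim is unfilled if fewer collations were fetched for the para
        // than claims seen up to this position in the queue.
        if fetched_per_para.get(para).copied().unwrap_or(0) < *so_far {
            return Some(*para);
        }
    }
    None
}

fn main() {
    // Claim queue for the core: A, B, A. One collation for A already fetched.
    let claim_queue = vec![1000, 2000, 1000];
    let mut fetched = HashMap::new();
    fetched.insert(1000u32, 1usize);
    // The earliest unfilled claim is B's, so B is fetched next even if A has
    // more advertisements waiting.
    assert_eq!(next_para_to_fetch(&claim_queue, &fetched), Some(2000));
}
```

Note how a parachain that floods the waiting queue gains nothing here: the claim queue position, not advertisement order, decides what is fetched next.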

@tdimitrov tdimitrov added the T8-polkadot This PR/Issue is related to/affects the Polkadot network. label Jun 26, 2024
@tdimitrov tdimitrov force-pushed the tsv-collator-proto-fairness branch from c7f24aa to 0f28aa8 Compare June 28, 2024 08:19
@@ -266,9 +264,6 @@ impl PeerData {
let candidates =
state.advertisements.entry(on_relay_parent).or_default();

if candidates.len() > max_candidate_depth {
Contributor Author

This error leads to reporting the peer with COST_UNEXPECTED_MESSAGE. I think we should relax it to just ignoring the advertisement.

Pros:

  • with the new logic submitting more elements than scheduled is not such a major offence
  • old collators won't get punished for not respecting the claim queue

Cons:

  • we don't punish spammy collators

Contributor

with the new logic submitting more elements than scheduled is not such a major offence

how come?

Contributor Author

I didn't pick the best words here, but my point was: since we guarantee a fair share of core time for each parachain, we can turn a blind eye to collators pushing too many advertisements, as they don't cause any harm.

Furthermore, at the moment we haven't got a proper collator implementation using the claim queue to decide what advertisements to send (right?), so when this change goes live a lot of collators will suffer.

Member

We should certainly not keep inserting blindly. You are right that it is less of an issue since we are now guaranteeing fairness, so we don't have to be super strict here. In particular, given how ineffective reputation changes are in practice, it might make sense to set the limit to something we are 100% sure no honest node will ever reach, but then make it an immediate disconnect (max punishment).

Is the concern about old collators valid? Why would they be spammy?

Contributor

We should still have some limit here. Otherwise we record unlimited advertisements.

Contributor Author

I agree that without proper spam protection a malicious collator could bombard the validator with advertisements. I've restored the limit, but instead of max_candidate_depth I've used the length of the current assignments, which is equal to the length of the claim queue for the assigned core.
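The restored limit can be illustrated with a small sketch. The names here are hypothetical (the real check lives in the peer-data handling shown in the diff above); the point is only that per-peer advertisements are capped at the claim queue length rather than max_candidate_depth.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the real relay-parent hash type.
type RelayParent = u64;

/// Tracks how many collations a single peer has advertised per relay parent.
struct PeerAdvertisements {
    per_relay_parent: HashMap<RelayParent, usize>,
}

impl PeerAdvertisements {
    fn new() -> Self {
        Self { per_relay_parent: HashMap::new() }
    }

    /// Record an advertisement. Returns `false` (reject, and potentially
    /// punish the peer) once the claim queue length for the assigned core
    /// has been reached.
    fn try_insert(&mut self, rp: RelayParent, claim_queue_len: usize) -> bool {
        let count = self.per_relay_parent.entry(rp).or_insert(0);
        if *count >= claim_queue_len {
            return false;
        }
        *count += 1;
        true
    }
}

fn main() {
    let mut ads = PeerAdvertisements::new();
    // Claim queue of length 2: the third advertisement at the same relay
    // parent is rejected.
    assert!(ads.try_insert(1, 2));
    assert!(ads.try_insert(1, 2));
    assert!(!ads.try_insert(1, 2));
}
```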

Contributor

The current version looks sane to me. Claim queue length should be a decent limit, and we retain the punishment so that spamming isn't made too easy.

Comment on lines 503 to 509
if let Some(CoreState::Occupied(core_state)) = cores.get(core_now.0 as usize) {
core_state.next_up_on_available.as_ref().map(|scheduled_core| {
vec![scheduled_core.para_id].into_iter().collect()
})
} else {
None
}
Contributor Author

I'm not sure if we still need to handle a missing claim queue or should treat it as an error (the runtime API is released everywhere). Since it's a small addition, I decided to keep it here.

Contributor

@Overkillus Overkillus left a comment

A few questions, as well as some nits here and there.

Overall looking good!

I see that you reverted to default lookahead value. Is the mystery of the failing zombienet test resolved?

prdoc/pr_4880.prdoc (outdated; resolved)

// Checks if another collation can be accepted. The number of collations that can be seconded
// per parachain is limited by the entries in claim queue for the `ParaId` in question.
if seconded_and_pending_at_ancestors >= claims_for_para {
Contributor

We return a SecondedLimitReached error. There is still a possibility that the pending collation will not be seconded for some reason. What happens then to the one extra slot that could be used for seconding?

Contributor Author

Whether there is a pending collation is determined by this function. If the fetching or the validation fails, pending will be 0 and we will be able to accept another collation for the same relay parent. But as soon as there is a pending item, we are optimistic and reject any new advertisements.
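The optimistic accounting described here boils down to a single comparison. A minimal sketch, with hypothetical names (the real check aggregates seconded and pending counts across the view, as in the fold quoted later in this thread):

```rust
/// While a fetch or validation is in flight (`pending > 0`) we count it as
/// if it will succeed, so new advertisements for the para are rejected.
/// If it fails, `pending` drops back to 0 and the slot reopens.
fn can_accept_collation(seconded: usize, pending: usize, claims_for_para: usize) -> bool {
    seconded + pending < claims_for_para
}

fn main() {
    // One claim for the para, one fetch in flight: reject optimistically.
    assert!(!can_accept_collation(0, 1, 1));
    // The fetch failed, pending is back to 0: accept again.
    assert!(can_accept_collation(0, 0, 1));
    // The claim was filled by a seconded collation: reject for good.
    assert!(!can_accept_collation(1, 0, 1));
}
```

The trade-off is exactly the one discussed above: a briefly wasted slot when a pending collation fails, in exchange for never over-committing beyond the claim queue.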

/// We are currently fetching a collation for the specified `ParaId`.
Fetching(ParaId),
/// We are waiting that a collation is being validated for the specified `ParaId`.
WaitingOnValidation(ParaId),
}
Contributor

Originally, before async backing, once we reached Seconded status we knew we were done here. Right now we can only be satisfied with collations once we reach the seconding limit, but I see that we do not have a corresponding status here.

When we reach the seconding limit and no more collations are expected, what will the status be here?

Or is it nonsensical to even refer to a CollationStatus when we reach the limit? (I don't think so)

Contributor Author

We could do that. The lifecycle right now is Waiting -> Fetching -> WaitingOnValidation and back to Waiting. We can add an additional Satisfied state (or something similar) which indicates that we can't accept any more collations at this relay parent.

Then, on each seconded collation, we can call seconded_and_pending_for_para_in_view and, if we have reached the claim queue limit, set the state to Satisfied, knowing that nothing more can be accepted at this relay parent.

The benefit is that once the relay parent is 'complete' (all claim queue entries are filled) we won't need to run any checks in order to reject an incoming collation.

The drawback is that on each seconded collation we'll have to run the same function. So unless a collator is spamming us the benefit is minimal. And since we are limiting the number of advertisements per collator, we can't get spammed too much.

We can explore this further, but it had better be a follow-up task.
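A sketch of the state machine being discussed, with the proposed Satisfied variant added. The variant names mirror the snippet quoted earlier in this thread; the Satisfied variant and the transition helper are hypothetical, illustrating the follow-up idea rather than the merged code.

```rust
// Hypothetical stand-in for the real `ParaId` type.
type ParaId = u32;

#[derive(Debug, PartialEq)]
enum CollationStatus {
    /// We are waiting for a collation to be advertised.
    Waiting,
    /// We are currently fetching a collation for the specified `ParaId`.
    Fetching(ParaId),
    /// We are waiting for a collation to be validated for the specified `ParaId`.
    WaitingOnValidation(ParaId),
    /// Proposed: all claim queue entries at this relay parent are filled,
    /// so nothing more can be accepted here without running any checks.
    Satisfied,
}

impl CollationStatus {
    /// Transition taken after a collation is seconded: park in `Satisfied`
    /// once the claim queue limit is reached, otherwise return to `Waiting`.
    fn after_seconded(seconded_and_pending: usize, claims: usize) -> Self {
        if seconded_and_pending >= claims {
            CollationStatus::Satisfied
        } else {
            CollationStatus::Waiting
        }
    }
}

fn main() {
    assert_eq!(CollationStatus::after_seconded(2, 2), CollationStatus::Satisfied);
    assert_eq!(CollationStatus::after_seconded(1, 2), CollationStatus::Waiting);
}
```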

tdimitrov and others added 2 commits October 15, 2024 08:47
Co-authored-by: Maciej <maciej.zyszkiewicz@parity.io>
@tdimitrov
Contributor Author

bot fmt

@command-bot

command-bot bot commented Oct 15, 2024

@tdimitrov https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7573708 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh". Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 2-32f4f5d2-4847-4304-a72c-c83d5228c964 to cancel this command or bot cancel to cancel all commands in this pull request.

@command-bot

command-bot bot commented Oct 15, 2024

@tdimitrov Command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh" has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7573708 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7573708/artifacts/download.

@tdimitrov
Contributor Author

I see that you reverted to default lookahead value. Is the mystery of the failing zombienet test resolved?

Yes, turned out there is an issue after all: #4880 (comment)

?para_id,
"seconded_and_pending_for_para_in_view"
);
ancestors.iter().take(claim_queue_len).fold(0, |res, anc| {
Contributor Author

Please refer to this discussion for details on take(claim_queue_len).

Contributor

@alindima alindima left a comment

I still need to do a more in-depth re-review, but I'm leaving some comments for now.

We should also add the new test to the YAML file so that it's run in CI.

validator-0: js-script ./0016-verify-included-events.js return is 1 within 120 seconds

# could be flaky feel free to remove if it's causing problems
validator-2: count of log lines matching regex "Rejected v2 advertisement.*error=SecondedLimitReached" is greater than 1 within 10 seconds
Contributor

why does the collator attempt to advertise more collations than scheduled?

});

if (block_count == 12) {
unsubscribe();
Contributor

haven't written js in a while but isn't this a recursive call?

Contributor Author

I took it from the documentation: https://polkadot.js.org/docs/api/examples/promise/listen-to-blocks

It works, but I'm not at all proficient with JS, so if you feel it's bad and have a suggestion for how to fix it, I'd gladly apply it :)

});

console.log(`Result: 2000: ${blocks_per_para[2000]}, 2001: ${blocks_per_para[2001]}`);
return (blocks_per_para[2000] == 6 && blocks_per_para[2001] == 1);
Contributor

this is a 6:1 ratio. why?

Contributor Author

Because I was testing some stuff and carelessly committed a messed-up version of the test 👎

I've fixed the whole test. I'm still worried it will be flaky, but let's see.

@tdimitrov tdimitrov requested a review from a team as a code owner October 18, 2024 08:10
5 participants