Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase backing and approval timeouts for async backing #1366

Closed
Sophia-Gold opened this issue Sep 2, 2023 · 5 comments
Closed

Increase backing and approval timeouts for async backing #1366

Sophia-Gold opened this issue Sep 2, 2023 · 5 comments
Assignees
Labels
I10-unconfirmed Issue might be valid, but it's not yet known. T8-polkadot This PR/Issue is related to/affects the Polkadot network.

Comments

@Sophia-Gold
Copy link
Contributor

Since async backing increases execution time for collators from 500ms to 2s, we should also increase the execution time for backers and approval checkers beyond 2s to avoid larger parachain blocks not being backed due to small resource differences, nondeterminism, or cache misses.

The question is by how much. It's unclear whether we need to retain the 4x difference as 8s is quite long. This warrants testing.

@Sophia-Gold Sophia-Gold added T9-parachains_protocol T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Sep 2, 2023
@github-actions github-actions bot added the I10-unconfirmed Issue might be valid, but it's not yet known. label Sep 2, 2023
@burdges
Copy link

burdges commented Sep 2, 2023

We'd ideally pseudo-remove the gap between backer timeouts and approval checker timesout once we charge backers fr time overruns: #742

In practice there is inherently some gap where the backer earns something for oversized blocks, but in expectation they risk only a minor fee for the time overrun. We discussed the fee curve shape at the team retreat. I'd suggested time overrun fees should be curve fit to the costs of no shows, but you maybe had something more nuanced, or just different. I've forgotten now, maybe in paritytech/polkadot#7239

At present I've no thoughts on doing something different short-term, but yeah maybe a hacky setting makes sense if we want to deploy something faster.

@rphmeier
Copy link
Contributor

rphmeier commented Sep 2, 2023

In my opinion we should start from the amount of work backers should do and work backwards to what collators should use as a timeout.

There's no reason to maintain the 4x difference in any case, as the 0.5s timeout on collators pre-async-backing is primarily determined by the relay chain block time and p2p network latency, not execution time.

@burdges
Copy link

burdges commented Sep 3, 2023

I doubt "timeout" is even well defined for collators. An optimized collator would often run quite different code from validiator, but optimized collators become a parachain logic specific problem.

We're worried about the gap between backers vs approval checkers here. Alfonso proposed a 6x gap in paritytech/polkadot#1656 We never used his 6x but we ended up with at least the 4x gap @Sophia-Gold mentions. In fact I thought we're currently running some larger gap, like 6x or 8x.

We'll pseduo-remove the gap in #742 like I say, but tight now the question is: Can we deploy async backing without #742 but with a smaller gap?

I've long forgotten how the developers selected such a large gap, so we'll just need to ask whoever put the gap into the code (maybe you) where it came from, and try to reason from that point.

There's no reason to maintain the 4x difference in any case, as the 0.5s timeout on collators pre-async-backing is primarily determined by the relay chain block time and p2p network latency, not execution time.

Is this really the only reason we have the 4x gap? If we only have a 1x gap then we should often trigger validity disputes due to timeouts, under the current scheme. If we've high variance in validator hardware then even a 2x gap could be insufficient, even assuming 100% honesty, although maybe single core performance in not such high variance anymore. Did we not just have testnet failures which required a larger gap?

@rphmeier
Copy link
Contributor

rphmeier commented Sep 3, 2023

We should really separate these concerns and be clear about what we're discussing

  1. how much different does the backing timeout vs. the approval/disputes timeout need to be in execution? currently, backing has a 2s timeout and disputes/approval have a 12s timeout. This is in line with Alfonso's recommendation.
  2. how much time collators allocate to block building in order to stay within the 2s timeout on backers.

This issue was mostly made in the context of (2). We previously had limitations on block building in Cumulus by default which were 0.5s of authoring. However, this timeout was not set with the goal of fulfilling (2) and was majority constrained by other factors. With async backing (2) becomes a direct concern. That's all. (1) is orthogonal - except if we want to make sure that 2s of authoring on collators always falls within the backing limit, in which case the timeouts may need to be adjusted upwards. This is probably the wrong order of operations, IMO.

Is this really the only reason we have the 4x gap?

It's the only reason we have 4x gap between the 0.5s allocated to collators block building and the 2s backing timeout, yes. Building, backing, and propagating within a 6s window required a very short block-building timeout mostly due to network and relay chain block propagation overhead.

This is not related to the 6x gap between the 2s allocated to backing and the 12s allocated to approval, as mentioned.

I suspect that honest collators building a block out of an on-disk DB for 2s would likely always stay within the 2s execution limit for backers executing out of memory. Collators building out of an in-memory DB or a RAMdisk would need to be more conservative as a larger proportion of their authoring time is spent in RAM or the CPU.

@eskimor
Copy link
Member

eskimor commented Oct 8, 2023

So, I guess we can close this then. We might end up wanting to increase backing timeout slightly to something like 2.5s at some point, but for now we should be good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I10-unconfirmed Issue might be valid, but it's not yet known. T8-polkadot This PR/Issue is related to/affects the Polkadot network.
Projects
No open projects
Status: Done
Development

No branches or pull requests

5 participants