
Fix wrong columns getting processed on a CGC change #7792


Open

jimmygchen wants to merge 2 commits into unstable from fix-extra-columns-processed

Conversation

@jimmygchen (Member) commented Jul 25, 2025

Issue Addressed

This PR fixes a bug where wrong columns could get processed immediately after a CGC increase.

Scenario:

  • The node's CGC increased because additional validators attached to it (let's say from 10 to 11).
  • The new CGC is advertised and the new subnets are subscribed immediately; however, the change won't take effect in the data availability check until the next epoch (see this). The data availability checker still requires only 10 columns for the current epoch.
  • During this time, data columns for the additional custody column (let's say column 11) may arrive via gossip, since we're already subscribed to the topic. Such a column may incorrectly be used to satisfy the existing data availability requirement (10 columns), so the additional column gets persisted instead of a required one, resulting in database inconsistency (see the sketch below).
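
A minimal sketch of the gating idea, assuming hypothetical names (`should_process_column` and `current_epoch_sampling_columns` are illustrative, not Lighthouse's actual API): a column arriving on a newly subscribed subnet should only be imported if its index is in the sampling set the availability checker uses for the current epoch, which still reflects the old CGC:

```rust
// Hypothetical sketch; not the actual Lighthouse code.
fn should_process_column(column_index: u64, current_epoch_sampling_columns: &[u64]) -> bool {
    // Only columns the availability checker requires for the CURRENT epoch
    // (still the old CGC) may be imported; anything else is verified and
    // propagated but not processed.
    current_epoch_sampling_columns.contains(&column_index)
}

fn main() {
    // Old CGC = 10: the current epoch still requires columns 1..=10 only.
    let current_epoch_sampling_columns: Vec<u64> = (1..=10).collect();
    // Column 11 arrives on a newly subscribed subnet (advertised CGC = 11):
    // it must not satisfy the current epoch's 10-column requirement.
    assert!(!should_process_column(11, &current_epoch_sampling_columns));
    assert!(should_process_column(3, &current_epoch_sampling_columns));
}
```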

Proposed fix

  • In the handle_gossip function, we compute should_process for each data column that arrives and pass it to the beacon processor: the column gets verified and propagated, but NOT processed / imported. I've moved this logic to DataAvailabilityChecker for consistency with RPC and better separation of concerns.
  • This requires us to know the sampling columns for the current epoch, which differ from globals.sampling_columns (already updated to the new CGC).
  • To avoid having to maintain separate column lists for the current and next epoch, we compute the full ordered set of custody groups on startup; this lets us quickly retrieve the custody columns for either epoch without recomputing or maintaining multiple lists (see the sketch after the next paragraph).

The trade-off here is that we have to initialise the CustodyContext with the results from get_custody_groups after the network is initialised, which isn't ideal, but it allows us to simplify and clean things up a lot by keeping custody-related logic in CustodyContext. With this approach we could potentially remove network_globals.sampling_columns too; I haven't done that in this PR, as I wanted to see how this approach looks and get some feedback first.
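
A hypothetical sketch of that single-list idea (`CustodyContextSketch` and its methods are illustrative names, not the PR's actual types): the custody set for any CGC is just a prefix of one precomputed ordering:

```rust
use std::collections::HashSet;

struct CustodyContextSketch {
    // The full, node-specific, deterministic ordering of custody groups,
    // computed once on startup from the node id.
    all_custody_groups_ordered: Vec<u64>,
}

impl CustodyContextSketch {
    // Custody groups for a given custody group count: the first `cgc` entries.
    fn custody_groups_for_cgc(&self, cgc: usize) -> &[u64] {
        let end = cgc.min(self.all_custody_groups_ordered.len());
        &self.all_custody_groups_ordered[..end]
    }

    // Sampling columns derived from the same list (assuming, for simplicity,
    // one column per custody group), so the current epoch (old CGC) and the
    // next epoch (new CGC) need no separately maintained lists.
    fn sampling_columns_for_cgc(&self, cgc: usize) -> HashSet<u64> {
        self.custody_groups_for_cgc(cgc).iter().copied().collect()
    }
}
```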

The other options considered:

  • Load node_id before BeaconChain is constructed, and have the beacon chain also compute the custody columns. I think this leaks networking types into the BeaconChain, so IMO it isn't great for separation of concerns, and it also introduces a circular dependency that will bite us later.
  • Have the network handle all the custody-column-related logic and only pass the relevant data to the beacon processor. This approach splits custody logic between the network and the beacon chain; I believe the low cohesion sacrifices readability and maintainability.

@jimmygchen jimmygchen requested a review from jxs as a code owner July 25, 2025 05:08
@jimmygchen jimmygchen added bug Something isn't working ready-for-review The code is ready for review das Data Availability Sampling do-not-merge labels Jul 25, 2025
@jimmygchen (Member, Author) commented:

I've labeled it as do-not-merge as this still requires test fixes and clean-up on the docs, but the idea is there and I'd like to get some early feedback.

@jimmygchen jimmygchen requested a review from pawanjay176 July 25, 2025 05:41

mergify bot commented Jul 25, 2025

Some required checks have failed. Could you please take a look @jimmygchen? 🙏

@mergify mergify bot added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Jul 25, 2025
@jimmygchen jimmygchen force-pushed the fix-extra-columns-processed branch from ab0e887 to fb08ab2 Compare July 25, 2025 07:40
…consistency with RPC and better separation of concerns
        .map_err(|e| format!("Failed to compute custody groups: {:?}", e))?;
    chain
        .data_availability_checker
        .custody_context()
@jimmygchen (Member, Author) commented on this diff:
TODO: we can probably remove the clone in custody_context()

    all_custody_groups_ordered: Vec<CustodyIndex>,
    spec: &ChainSpec,
) -> Result<(), String> {
    let mut ordered_custody_groups = vec![];
A reviewer (Member) commented on this diff:
I don't understand this.
We basically want to store an ordering of data columns for a given NodeId.
Why are we storing columns for every custody index here? Can't we just compute the ordering for the node id and take a slice [..cgc] for the required cgc value?
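
For illustration, the slicing described here matches the hypothetical CustodyContextSketch from the sketch above: the current and next epoch differ only in how long a prefix of the same precomputed ordering they take (values below are made up):

```rust
let ctx = CustodyContextSketch {
    all_custody_groups_ordered: (0..128).collect(),
};
let current = ctx.custody_groups_for_cgc(10); // old CGC
let next = ctx.custody_groups_for_cgc(11); // new CGC: one extra group
assert_eq!(&next[..10], current);
```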

@pawanjay176 (Member) left a review:
I like the direction of this; it's less invasive than I originally thought.
