WIP: Backfill data columns #15580

Open
wants to merge 2 commits into develop from backfill-data-columns

Conversation

kasey
Collaborator

@kasey kasey commented Aug 12, 2025

What type of PR is this?

Feature

What does this PR do? Why is it needed?

Adds data column support to backfill.

Other notes for review

  • WIP - needs some additional cleanup and testing.
  • In the case where custody_group_count is the minimal CUSTODY_REQUIREMENT, we actually download the full SAMPLES_PER_SLOT sampling set for simplicity (otherwise the DA check code needs to get more complicated). This is an easy thing to improve in a follow-up if we decide to; see the sketch below.
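
For context, a minimal sketch of the sampling-size rule this note refers to (the spec link is quoted later in this thread); the constants and names here are illustrative, not the PR's code:

package main

import "fmt"

// Illustrative values; the real ones come from the beacon chain config.
const (
	samplesPerSlot     = uint64(8) // SAMPLES_PER_SLOT
	custodyRequirement = uint64(4) // CUSTODY_REQUIREMENT
)

// samplingSize mirrors the spec rule sampling_size = max(SAMPLES_PER_SLOT, custody_group_count):
// even a node advertising the minimal custody requirement downloads SAMPLES_PER_SLOT groups.
func samplingSize(custodyGroupCount uint64) uint64 {
	if custodyGroupCount < samplesPerSlot {
		return samplesPerSlot
	}
	return custodyGroupCount
}

func main() {
	fmt.Println(samplingSize(custodyRequirement)) // prints 8, not 4
}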

Acknowledgements

@kasey kasey changed the title Backfill data columns WIP: Backfill data columns Aug 12, 2025
@kasey kasey changed the base branch from develop to fusaka-devnet-3 August 14, 2025 19:17
@kasey kasey force-pushed the backfill-data-columns branch 3 times, most recently from 8634cee to e3f9617 on August 14, 2025 23:06
}
dasInfo, _, err := peerdas.Info(nodeID, p2pSvc.CustodyGroupCountFromPeer(peer))
if err != nil {
log.WithField("peerID", peer).WithField("nodeID", nodeID).WithError(err).Debug("Failed to derive custody groups from peer.")
Contributor

Why both logging and returning an error?
The caller already logs the error.
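
A sketch of the alternative, assuming the surrounding function returns the error up the stack (the single-error return shown here is an assumption about its signature) and that "fmt" is imported; the error is wrapped with context instead of being logged twice:

dasInfo, _, err := peerdas.Info(nodeID, p2pSvc.CustodyGroupCountFromPeer(peer))
if err != nil {
	// Attach the peer/node context here and let the caller log once.
	return fmt.Errorf("derive custody groups for peer %s (node %s): %w", peer, nodeID, err)
}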

return 1 / float64(f.freq())
}

func (f colFreq) freq() int {
Contributor

freq is unclear. Why not simply custodianCount or another name indicating the number of peers effectively custodying this column?

Collaborator Author

@kasey kasey Aug 15, 2025

freq is an abbreviation for "frequency". Will think about the naming some more.

}

func (f colFreq) rarity() float64 {
if f.freq() == 0 {
Contributor

The issue here is that a column without any custodian and a column with exactly one custodian are considered the same.

Contributor

We don't need rarity. We can use popularity, which is simply freq.

return len(f.custodians)
}

func columnScarcity(cf []colFreq) []float64 {
Contributor

Why not use columnPopularity instead of columnRarity, so we can avoid the division-by-zero hack?
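
A rough sketch of that suggestion, reusing the colFreq fields visible in this diff (col as an in-range column index, as columnScarcity already assumes):

// columnPopularity returns, per column index, the number of peers that
// effectively custody that column. A value of 0 cleanly marks a column with
// no custodian, so rarity's 1/freq() and its zero guard become unnecessary.
func columnPopularity(cf []colFreq) []int {
	pop := make([]int, len(cf))
	for _, f := range cf {
		pop[f.col] = f.freq()
	}
	return pop
}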

freq []colFreq
scarcity []float64
rg rand.Rand
covScoreRank []*peerColumnCoverage
Contributor

Why uint64 instead of bool?

ra[f.col] = f.rarity()
}
return ra
}
Contributor

I think it's a waste of resources here to add peerColumnCoverage to freqByColumn[c].custodians instead of just the peerID or the nodeID, because each item in custodians will have exactly one non-zero item in its custodied list, which is information already present in the col attribute of freqByColumn.
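
A sketch of what that could look like, with custodians holding bare peer IDs; the colFreq shape is inferred from the snippets in this thread, and the import path is the standard libp2p one:

import "github.com/libp2p/go-libp2p/core/peer"

// colFreq tracks which peers custody a given column. Storing bare peer IDs is
// enough: the column is already identified by col, so carrying a full
// peerColumnCoverage (whose custodied list would repeat that one column) is redundant.
type colFreq struct {
	col        uint64
	custodians []peer.ID
}

func (f colFreq) freq() int {
	return len(f.custodians)
}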


// Note that in the case where custody_group_count is the minimum CUSTODY_REQUIREMENT, we will
// still download the extra columns dictated by SAMPLES_PER_SLOT. This is a hack to avoid complexity in the DA check.
// We may want to revisit this to reduce bandwidth and storage for nodes with 0 validators attached.
Contributor

This is not a hack, this is a feature.
https://github.com/ethereum/consensus-specs/blob/dev/specs/fulu/das-core.md#custody-sampling

At each slot, a node advertising custody_group_count downloads a minimum of sampling_size = max(SAMPLES_PER_SLOT, custody_group_count) custody groups, selected by groups = get_custody_groups(node_id, sampling_size), to which correspond the columns compute_columns_for_custody_group(group) for group in groups. The custody groups to custody, selected by get_custody_groups(node_id, custody_group_count), are then in particular a subset of those to sample. Sampling is considered successful if the node manages to retrieve all selected columns.

This also applies for initial sync and backfill.

Collaborator Author

This does not imply that the behavior in the specs is a hack; it is talking about the implementation in the backfill service. Because backfill is only concerned with finalized data, it does not need to perform oversampling as we would want to do when range syncing or in gossip. The hack is that backfill simply does the same thing as these other syncing modes even though it doesn't need to, and it would be more resource-efficient to avoid doing so. However, that would require more complexity in the DA check code, so it is simpler to just act as if we were in one of these other sync modes.

@@ -42,7 +42,7 @@ func (a *Assigner) freshPeers() ([]peer.ID, error) {
if flags.Get().MinimumSyncPeers < required {
required = flags.Get().MinimumSyncPeers
}
- _, peers := a.ps.BestFinalized(params.BeaconConfig().MaxPeersToSync, a.fc.FinalizedCheckpoint().Epoch)
+ _, peers := a.ps.BestFinalized(-1, a.fc.FinalizedCheckpoint().Epoch)
Contributor

Not clear what -1 means.

Collaborator Author

I meant to comment that, thanks!
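
For illustration, a hypothetical named constant that would make the intent explicit at the call site (whether BestFinalized really treats any negative limit as "no cap" should be confirmed against its implementation):

// noSyncPeerLimit is a hypothetical name for the sentinel passed to
// BestFinalized, assuming a negative max means "do not cap the peer list".
const noSyncPeerLimit = -1

_, peers := a.ps.BestFinalized(noSyncPeerLimit, a.fc.FinalizedCheckpoint().Epoch)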

if cfg == nil {
cfg = &workerCfg{}
}
cfg.c = c
Contributor

To ease readability, it would be useful to use more descriptive names.
c, v, cm, bfs, cfs, nbv, ndcv, etc. are quite hard to understand.

}

func (w *p2pWorker) run(ctx context.Context) {
for {
select {
case b := <-w.todo:
if err := b.waitUntilReady(ctx); err != nil {
log.WithField("batch_id", b.id()).WithError(ctx.Err()).Info("worker context canceled while waiting to retry")
Contributor

No camel case in fields. The log message should start with a capital letter.

log.WithFields(b.logFields()).WithField("backfillWorker", w.id).Debug("Backfill worker received batch")
if b.state == batchSidecarSync {
if b.state == batchSyncBlobs {
Contributor

What if we have a batch with both blobs and columns?


// IsDataAvailable implements the das.AvailabilityStore interface.
func (m *multiStore) IsDataAvailable(ctx context.Context, current primitives.Slot, blk blocks.ROBlock) error {
if blk.Block().Slot() < m.fuluStart {
Contributor

Probably it's better to switch the if and consider the Fulu case as the most common.
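
A minimal sketch of the suggested reordering, using only the fields and methods visible in this diff, so the post-Fulu data-column path reads as the common case:

// IsDataAvailable implements the das.AvailabilityStore interface, routing the
// check to the store matching the block's fork; data-column (Fulu) blocks take
// the first, most common branch.
func (m *multiStore) IsDataAvailable(ctx context.Context, current primitives.Slot, blk blocks.ROBlock) error {
	if blk.Block().Slot() >= m.fuluStart {
		return m.checkAvailabilityWithFallback(ctx, m.columnStore, current, blk)
	}
	return m.checkAvailabilityWithFallback(ctx, m.blobStore, current, blk)
}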

if blk.Block().Slot() < m.fuluStart {
return m.checkAvailabilityWithFallback(ctx, m.blobStore, current, blk)
}
return m.checkAvailabilityWithFallback(ctx, m.columnStore, current, blk)
Contributor

This checks sidecars block by block, which is inefficient regarding KZG verification, especially with a low custody count (an 8-by-8 check).
If a sidecar does not respect a rule (e.g. KZG verification), where is the peer that gave us this sidecar downscored?

@kasey kasey force-pushed the backfill-data-columns branch from e3f9617 to 2b8384f on August 16, 2025 03:40
@kasey kasey requested a review from prestonvanloon as a code owner August 16, 2025 03:40
@kasey kasey changed the base branch from fusaka-devnet-3 to peerdas-sync August 16, 2025 03:41
Base automatically changed from peerdas-sync to develop August 18, 2025 14:50
@kasey kasey force-pushed the backfill-data-columns branch from 2b8384f to 271a393 on August 20, 2025 13:48