WIP: Backfill data columns #15580
base: develop
Conversation
Force-pushed from 8634cee to e3f9617
}
dasInfo, _, err := peerdas.Info(nodeID, p2pSvc.CustodyGroupCountFromPeer(peer))
if err != nil {
	log.WithField("peerID", peer).WithField("nodeID", nodeID).WithError(err).Debug("Failed to derive custody groups from peer.")
Why both logging and returning an error?
The caller already logs the error.
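A minimal sketch of the suggested fix, dropping the local log and letting the caller report the failure (pkg/errors-style wrapping assumed, not confirmed from the PR):

```go
dasInfo, _, err := peerdas.Info(nodeID, p2pSvc.CustodyGroupCountFromPeer(peer))
if err != nil {
	// No Debug log here: the caller already logs the returned error,
	// so logging locally would duplicate it. Wrap to preserve context instead.
	return errors.Wrapf(err, "derive custody groups for peer %s, node %s", peer, nodeID)
}
```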
return 1 / float64(f.freq())
}

func (f colFreq) freq() int {
`freq` is unclear. Why not simply `custodianCount`, or another name indicating the number of peers effectively custodying this column?
`freq` is an abbreviation for "frequency". Will think about the naming some more.
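A sketch of the suggested rename, with the body taken from the diff below (`len(f.custodians)`):

```go
// custodianCount returns how many peers effectively custody this column,
// per the naming suggested in the review.
func (f colFreq) custodianCount() int {
	return len(f.custodians)
}
```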
}

func (f colFreq) rarity() float64 {
	if f.freq() == 0 {
The issue here is that a column without any custodian and a column with exactly one custodian are considered the same.
We don't need `rarity`. We can use `popularity`, which is simply `freq`.
return len(f.custodians)
}

func columnScarcity(cf []colFreq) []float64 {
Why not use `columnPopularity` instead of `columnRarity`, so we can avoid the division-by-zero hack?
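A sketch of the popularity-based alternative, assuming (as `columnScarcity` above appears to) one `colFreq` entry per column, indexed by `f.col`; the function name follows the comment, and callers would rank ascending instead of descending:

```go
// columnPopularity returns the custodian count per column. Rarer columns
// simply get smaller values, so there is no 1/freq division and no
// freq() == 0 special case.
func columnPopularity(cf []colFreq) []int {
	pop := make([]int, len(cf))
	for _, f := range cf {
		pop[f.col] = len(f.custodians)
	}
	return pop
}
```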
freq         []colFreq
scarcity     []float64
rg           rand.Rand
covScoreRank []*peerColumnCoverage
Why `uint64` instead of `bool`?
	ra[f.col] = f.rarity()
}
return ra
}
I think it's a waste of resources here to add `peerColumnCoverage` in `freqByColumn[c].custodians` instead of just the peerID or the nodeID, because here each item in `custodians` will have exactly one non-zero item in its `custodied` list, which is information already present in the `col` attribute of `freqByColumn`.
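A sketch of the slimmer shape the comment suggests, keeping only the peer identity per custodian (libp2p `peer.ID` assumed, as used elsewhere in this PR):

```go
import "github.com/libp2p/go-libp2p/core/peer"

// colFreq tracks which peers custody a given column. The column index lives
// once in col, so no per-custodian coverage list is duplicated here.
type colFreq struct {
	col        uint64
	custodians []peer.ID
}
```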
// Note that in the case where custody_group_count is the minimum CUSTODY_REQUIREMENT, we will
// still download the extra columns dictated by SAMPLES_PER_SLOT. This is a hack to avoid complexity in the DA check.
// We may want to revisit this to reduce bandwidth and storage for nodes with 0 validators attached.
This is not a hack, this is a feature.
https://github.com/ethereum/consensus-specs/blob/dev/specs/fulu/das-core.md#custody-sampling
At each slot, a node advertising `custody_group_count` downloads a minimum of `sampling_size = max(SAMPLES_PER_SLOT, custody_group_count)` custody groups, selected by `groups = get_custody_groups(node_id, sampling_size)`, to which correspond the columns `compute_columns_for_custody_group(group) for group in groups`. The custody groups to custody, selected by `get_custody_groups(node_id, custody_group_count)`, are then in particular a subset of those to sample. Sampling is considered successful if the node manages to retrieve all selected columns.
This also applies for initial sync and backfill.
This does not imply that the behavior in the specs is a hack; the comment is about the implementation in the backfill service. Because backfill is only concerned with finalized data, it does not need to perform oversampling the way we want to when range syncing or in gossip. The hack is that backfill simply does the same thing as those other syncing modes, even though it doesn't need to, and it would be more resource-efficient to avoid doing so. However, that would require more complexity in the DA check code, so it is simpler to just act as if we were in one of those other sync modes.
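For reference, the spec rule quoted above reduces to a one-line computation; a sketch (function name hypothetical, constants per consensus-specs):

```go
// samplingSize returns the number of custody groups a node must sample per
// slot under fulu das-core: max(SAMPLES_PER_SLOT, custody_group_count).
func samplingSize(samplesPerSlot, custodyGroupCount uint64) uint64 {
	if custodyGroupCount > samplesPerSlot {
		return custodyGroupCount
	}
	return samplesPerSlot
}
```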
@@ -42,7 +42,7 @@ func (a *Assigner) freshPeers() ([]peer.ID, error) {
 	if flags.Get().MinimumSyncPeers < required {
 		required = flags.Get().MinimumSyncPeers
 	}
-	_, peers := a.ps.BestFinalized(params.BeaconConfig().MaxPeersToSync, a.fc.FinalizedCheckpoint().Epoch)
+	_, peers := a.ps.BestFinalized(-1, a.fc.FinalizedCheckpoint().Epoch)
Not clear what `-1` means.
I meant to comment that, thanks!
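One way to make the sentinel self-documenting (constant name hypothetical; assumes `BestFinalized` treats a negative max as "no cap"):

```go
// allScoredPeers tells BestFinalized not to cap the number of peers returned.
const allScoredPeers = -1

_, peers := a.ps.BestFinalized(allScoredPeers, a.fc.FinalizedCheckpoint().Epoch)
```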
if cfg == nil {
	cfg = &workerCfg{}
}
cfg.c = c
To ease readability, it would be useful to use more descriptive names. `c`, `v`, `cm`, `bfs`, `cfs`, `nbv`, `ndcv`, etc. are quite hard to understand.
}

func (w *p2pWorker) run(ctx context.Context) {
	for {
		select {
		case b := <-w.todo:
			if err := b.waitUntilReady(ctx); err != nil {
				log.WithField("batch_id", b.id()).WithError(ctx.Err()).Info("worker context canceled while waiting to retry")
No camel case in log fields, and the log message should start with a capital letter.
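A sketch of the nearby log lines with the comment applied as stated (the snake_case field convention is the reviewer's preference here, not a confirmed project rule):

```go
log.WithFields(b.logFields()).WithField("backfill_worker", w.id).Debug("Backfill worker received batch")
log.WithField("batch_id", b.id()).WithError(ctx.Err()).Info("Worker context canceled while waiting to retry")
```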
 	log.WithFields(b.logFields()).WithField("backfillWorker", w.id).Debug("Backfill worker received batch")
-	if b.state == batchSidecarSync {
+	if b.state == batchSyncBlobs {
What if we have a batch with both blobs and columns?
// IsDataAvailable implements the das.AvailabilityStore interface.
func (m *multiStore) IsDataAvailable(ctx context.Context, current primitives.Slot, blk blocks.ROBlock) error {
	if blk.Block().Slot() < m.fuluStart {
It's probably better to invert the if and treat the Fulu case as the most common one.
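The suggested inversion as a sketch, using the code from the hunk below with the post-Fulu path first:

```go
if blk.Block().Slot() >= m.fuluStart {
	return m.checkAvailabilityWithFallback(ctx, m.columnStore, current, blk)
}
return m.checkAvailabilityWithFallback(ctx, m.blobStore, current, blk)
```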
if blk.Block().Slot() < m.fuluStart {
	return m.checkAvailabilityWithFallback(ctx, m.blobStore, current, blk)
}
return m.checkAvailabilityWithFallback(ctx, m.columnStore, current, blk)
This checks sidecars block by block, which is inefficient for KZG verification, especially with a low custody count (verifying 8 at a time). Also, if a sidecar violates a rule (e.g. fails KZG verification), where is the peer that gave us this sidecar downscored?
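A hypothetical sketch of the batched alternative; `verifyCellProofsBatch` is a stand-in for whatever batched KZG verifier is available, not a real Prysm API:

```go
// Collect the column sidecars of an entire backfill batch and verify their
// KZG proofs in one call: batched verification amortizes the pairing cost
// that per-block (8-at-a-time) checks pay repeatedly.
func verifyBatchColumns(perBlock [][]*ethpb.DataColumnSidecar) error {
	var all []*ethpb.DataColumnSidecar
	for _, sidecars := range perBlock {
		all = append(all, sidecars...)
	}
	return verifyCellProofsBatch(all) // hypothetical batched verifier
}
```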
Force-pushed from e3f9617 to 2b8384f
Force-pushed from 2b8384f to 271a393
What type of PR is this?
Feature
What does this PR do? Why is it needed?
Adds data column support to backfill.
Other notes for review
When the custody group count is the minimum `CUSTODY_REQUIREMENT`, we actually download `SAMPLES_PER_SLOT` columns for simplicity (otherwise the DA check code needs to get more complicated). This is an easy thing to improve in a follow-up if we decide to.
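A worked example of the trade-off, using current fulu spec values (`CUSTODY_REQUIREMENT = 4`, `SAMPLES_PER_SLOT = 8`; re-check against consensus-specs before relying on them):

```go
// With current fulu spec values (re-check against consensus-specs):
const custodyRequirement, samplesPerSlot = 4, 8

// A 0-validator node custodies 4 groups, but backfill currently fetches
// max(SAMPLES_PER_SLOT, custody_group_count) = max(8, 4) = 8 groups' worth
// of columns per slot. The follow-up would fetch just the 4 custody groups,
// roughly halving backfill bandwidth and storage for such nodes.
```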