-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Fix a collection of downloader, snapshot sync and torrent related issues #15043
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Use sqlite piece completion - Remove Downloader discovery - Use synchronous torrent completion check - Fix unsynced stats - Start fixing up logging adapter - Fix file verification treating incomplete piece as error - Simplify webseeds URLs - Don't load torrents from mdbx - Pull out parameterized snapshot sync logging - Make header-chain and snapshot download requests disjoint
This reverts commit 89515cc.
Thanks for this. I added this to catch bad cases in Erigon webseeds, I never thought it would actually trigger. Fix shortly Update: Okay fixed. I didn't fix the actual issue on the webseed hosts... I'm getting a lot of warnings that suggest that the preverified.toml and the webseed contents are currently out of sync for main/ethmainnet |
All regular CI/unit tests are passing. @canepat's selection of snapshot CIs are passing (as far as I can tell only with occasional false negatives unrelating to snapshot download). The only pending question is how to handle snapshot merges a client does locally that result in a file being removed. I don't know if it blocks using the PR, clients would potentially redownload snapshots they had merged after they restart the client. It could be fixed later. I have a change for it, and 2 potential paths forward. @AskAlexSharov could you review? Do you want the last piece on the removals included? |
@@ -526,7 +526,23 @@ func (a *Aggregator) BuildMissedAccessors(ctx context.Context, workers int) erro | |||
ii.BuildMissedAccessors(ctx, g, ps, missedFilesItems.ii[ii.name]) | |||
} | |||
|
|||
if err := g.Wait(); err != nil { | |||
err := func() error { | |||
defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically not, same as my other comment.
@@ -205,6 +206,7 @@ func ProcessFrozenBlocks(ctx context.Context, db kv.RwDB, blockReader services.F | |||
func StageLoopIteration(ctx context.Context, db kv.RwDB, txc wrap.TxContainer, sync *stagedsync.Sync, initialCycle, firstCycle bool, logger log.Logger, blockReader services.FullBlockReader, hook *Hook) (err error) { | |||
defer func() { | |||
if rec := recover(); rec != nil { | |||
debug.PrintStack() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. More general problem with hiding panics, but doesn't need to be in this PR.
go 1.23.0 | ||
|
||
toolchain go1.23.6 | ||
go 1.24 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we drop go 1.23
support when go 1.25.1
released.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to do this because lint and mod checks were failing. I couldn't find an alternative.
snap-download and sync-from-scratch went ok on this branch: |
This is to complete an item from #15043: > Don't redownload files after preverified.toml is committed if they go missing. This occurs due to merges etc. At node startup, the preverified.toml that was committed locally from a previous initial sync is scanned and the torrents in it are requested from the downloader. If in a previous run, torrents were removed due to merges, the torrents in the final snapshot will be added again (to my understanding). I can see two ways to deal with this: 1. Mark manually removed torrents with a file like name.removed, or 2. Once the preverified.toml is committed, don't request any torrents from it any longer. The second option removes a lot of the logic from the downloader, where it doesn't belong. This PR implements that. The only downside I can see here is it makes it harder to manually repair a snapshot dir, for example you could "resync" a dir by removing all the `.removed` files in 1. However you could also sync to the latest initial snapshot by removing whatever preverified.toml you have, possibly wiping any .torrent files, and let the node resync (it will reuse existing data now too since #15043). I'm working on a sequence diagram that should describe how it fits together so it's clear and easy to work with the downloader snapshot dir and to check this logic.
In #15043 I didn't include the updated protobuf interfaces. I spotted them in #16081. The vendoring is too brittle and makes it hard to generate local changes in a workspace. I switched it to using a git submodule which is much easier to work with. The interfaces generated here are direct outputs of the protobuf so there's no need to include a reference to the interfaces revision used in the Go code.
Fixes #14649, #14544, #14438, #14170, and their sub issues #13656, #14646 (incomplete list).
Still pending (thanks @mh0lt for insight):
Don't clobber files from preverified.toml that are complete but don't match in hash.Don't write preverified.toml to disk unless "completion" is reached in stage snapshot.Reinstate the load from disk now that it should be trivial.Extra items:
Check with @AskAlexSharov if I've missed something about what was intended with the VerifyData call over RPC and why it happens after the Completed call.turbo/snapshotsync/snapshots.go
that my assumptions here hold.Extra items from @AskAlexSharov:
Check torrent_cat actually works correctlyCheck torrent_create, torrent_hashes still work.Check downloader verify.failfastCheck the case where the info bytes might be missing from the log message:EROR[06-05|06:19:10.592] error adding torrents from disk err="adding torrent for accessor/v1.0-accounts.1832-1833.vi.torrent: can't add torrent without infohash"
. Might be missing extra logs or checks. Check this with torrent_create.And
--prune=minimal
(already works but I'm curious).Check that the atomic handling of torrent files meets thekill -9
requirement. Pretty sure I already did that.Notes: