Fix a collection of downloader, snapshot sync and torrent related issues #15043

Merged
merged 99 commits into main from anacrolix/master
Jun 19, 2025

anacrolix
Contributor

@anacrolix anacrolix commented May 14, 2025

Fixes #14649, #14544, #14438, #14170, and their sub issues #13656, #14646 (incomplete list).

Still pending (thanks @mh0lt for insight):

  • Don't clobber files from preverified.toml that are complete but don't match in hash.
  • Don't write preverified.toml to disk unless "completion" is reached in stage snapshot.
  • Reinstate the load from disk now that it should be trivial.

Extra items:

  • Check with @AskAlexSharov if I've missed something about what was intended with the VerifyData call over RPC and why it happens after the Completed call.
  • Check against the snapshot "visible files" code in turbo/snapshotsync/snapshots.go that my assumptions here hold.

Extra items from @AskAlexSharov:

  • Don't redownload files after preverified.toml is committed if they go missing. This occurs due to merges etc.
  • Check torrent_cat actually works correctly
  • Check torrent_create, torrent_hashes still work.
  • Check downloader verify.failfast
  • Check the case where the info bytes might be missing from the log message: EROR[06-05|06:19:10.592] error adding torrents from disk err="adding torrent for accessor/v1.0-accounts.1832-1833.vi.torrent: can't add torrent without infohash". Might be missing extra logs or checks. Check this with torrent_create.
  • Check the cases where files are removed, and the synchronization from calling code with the downloader:
// GC: the last reader is responsible for removing useless files: close it and delete
if refCnt == 0 && src.canDelete.Load() {
    if traceFileLife != "" && dt.d.filenameBase == traceFileLife {
        dt.d.logger.Warn("[agg.dbg] real remove at DomainRoTx.Close", "file", src.decompressor.FileName())
    }
    src.closeFilesAndRemove()
}

And

s.chainDB.OnFilesChange(func(frozenFileNames []string) {
    s.logger.Warn("files changed...sending notification")
    events := s.notifications.Events
    events.OnNewSnapshot()
    if downloaderCfg != nil && downloaderCfg.ChainName == "" {
        return
    }
    if !s.config.Snapshot.NoDownloader && s.downloaderClient != nil && len(frozenFileNames) > 0 {
        req := &protodownloader.AddRequest{Items: make([]*protodownloader.AddItem, 0, len(frozenFileNames))}
        for _, fName := range frozenFileNames {
            req.Items = append(req.Items, &protodownloader.AddItem{
                Path: filepath.Join("history", fName),
            })
        }
        if _, err := s.downloaderClient.Add(ctx, req); err != nil {
            s.logger.Warn("[snapshots] notify downloader", "err", err)
        }
    }
})
  • Test --prune=minimal (already works but I'm curious).
  • Check that the atomic handling of torrent files meets the kill -9 requirement. Pretty sure I already did that.

Notes:

- Use sqlite piece completion
- Remove Downloader discovery
- Use synchronous torrent completion check
- Fix unsynced stats
- Start fixing up logging adapter
- Fix file verification treating incomplete piece as error
- Simplify webseeds URLs
- Don't load torrents from mdbx
- Pull out parameterized snapshot sync logging
- Make header-chain and snapshot download requests disjoint
@anacrolix anacrolix self-assigned this May 14, 2025
@anacrolix anacrolix added this to the 3.1.0 milestone May 14, 2025
@anacrolix
Contributor Author

anacrolix commented Jun 16, 2025

go run ./cmd/erigon --datadir /erigon-data/ethmainnet31/ --authrpc.jwtsecret /erigon-data/amoy31/jwt.hex  --prune.mode=archive --chain=mainnet --bor.heimdall="https://heimdall-api-amoy.polygon.technology/" --log.console.verbosity=3 --torrent.download.rate=1g  --batchSize=128m --sync.loop.block.limit=1_000 --pprof --pprof.port=6061  --http.api=eth,erigon,ots,web3,net,debug,trace,txpool --beacon.api=beacon,builder,config,debug,events,node,validator,lighthouse --txpool.disable --persist.receipts=true

INFO[06-13|06:57:24.531] logging to file system                   log dir=/erigon-data/ethmainnet31/logs file prefix=erigon log level=info json=false
INFO[06-13|06:57:24.531] Starting pprof server                    cpu="go tool pprof -lines -http=: http://127.0.0.1:6061/debug/pprof/profile?seconds=20" heap="go tool pprof -lines -http=: http://127.0.0.1:6061/debug/pprof/heap"
INFO[06-13|06:57:24.531] Build info                               git_branch= git_tag= git_commit=
INFO[06-13|06:57:24.531]
	########b          oo                               d####b.
	##                                                      '##
	##aaaa    ##d###b. dP .d####b. .d####b. ##d###b.     aaad#'
	##        ##'  '## ## ##'  '## ##'  '## ##'  '##        '##
	##        ##       ## ##.  .## ##.  .## ##    ##        .##
	########P dP       dP '####P## '#####P' dP    dP    d#####P
	                           .##
	                       d####P

INFO[06-13|06:57:24.531] Starting Erigon on Ethereum mainnet...
INFO[06-13|06:57:24.615] Maximum peer count                       total=32
INFO[06-13|06:57:24.616] starting HTTP APIs                       port=8545 APIs=eth,erigon,ots,web3,net,debug,trace,txpool
INFO[06-13|06:57:24.618] Set global gas cap                       cap=50000000
INFO[06-13|06:57:24.682] torrent verbosity                        erigon=warn anacrolix=WRN slog=WARN
INFO[06-13|06:57:24.683] processed webseed configuration          webseedHttpProviders=[https://erigon31-v1-snapshots-mainnet.erigon.network] webseedFileProviders=[] webseedUrlsOrFiles=[v1:https://erigon31-v1-snapshots-mainnet.erigon.network]
INFO[06-13|06:57:26.777] [Downloader] Running with                ipv6-enabled=true ipv4-enabled=true download.rate=1g upload.rate=4mb
INFO[06-13|06:57:26.793] Opening Database                         label=chaindata path=/erigon-data/ethmainnet31/chaindata
INFO[06-13|06:57:26.799] [db] open                                label=chaindata sizeLimit=1TB pageSize=4KB
INFO[06-13|06:57:26.825] Initialised chain configuration          config="{ChainID: 1, Terminal Total Difficulty: 58750000000000000000000, Shapella: 2023-04-12 22:27:35 +0000 UTC, Dencun: 2024-03-13 13:55:35 +0000 UTC, Pectra: 2025-05-07 10:05:11 +0000 UTC, Fusaka: <nil>, BPO1: <nil>, BPO2: <nil>, BPO3: <nil>, BPO4: <nil>, BPO5: <nil>, Engine: ethash}" genesis=0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3
INFO[06-13|06:57:34.014] Initialising Ethereum protocol           network=1
INFO[06-13|06:57:34.014] Disk storage enabled for ethash DAGs     dir=/erigon-data/ethmainnet31/ethash-dags count=2
INFO[06-13|06:57:34.047] rpc filters: subscribing to Erigon events
INFO[06-13|06:57:34.048] Starting private RPC server              on=127.0.0.1:9090
INFO[06-13|06:57:34.048] Reading JWT secret                       path=/erigon-data/amoy31/jwt.hex
INFO[06-13|06:57:34.048] [rpc] endpoint opened                    ws=false ws.compression=true grpc=false http.url=127.0.0.1:8545
INFO[06-13|06:57:34.048] HTTP endpoint opened for Engine API      url=127.0.0.1:8551 ws=true ws.compression=true
WARN[06-13|06:57:34.052] NAT ExternalIP resolution has failed, try to pass a different --nat option err="no UPnP or NAT-PMP router discovered"
INFO[06-13|06:57:34.053] Started P2P networking                   version=68 self=enode://1909c365b5f67ce12421bb5f285c0e7da037a8777b9a8ca2415af76e73bc43b3716ad9440eeaecc18f2c6f76ec68ae7692616205b7750c60b01f26ff53fc1654@127.0.0.1:30303 name=erigon/v3.1.0/linux-amd64/go1.24.3
WARN[06-13|06:57:34.055] NAT ExternalIP resolution has failed, try to pass a different --nat option err="no UPnP or NAT-PMP router discovered"
INFO[06-13|06:57:34.055] Started P2P networking                   version=67 self=enode://1909c365b5f67ce12421bb5f285c0e7da037a8777b9a8ca2415af76e73bc43b3716ad9440eeaecc18f2c6f76ec68ae7692616205b7750c60b01f26ff53fc1654@127.0.0.1:30304 name=erigon/v3.1.0/linux-amd64/go1.24.3
INFO[06-13|06:57:34.059] [OtterSync] Starting Ottersync
INFO[06-13|06:57:34.059]
   _____ _             _   _                ____  _   _
  / ____| |           | | (_)              / __ \| | | |
 | (___ | |_ __ _ _ __| |_ _ _ __   __ _  | |  | | |_| |_ ___ _ __ ___ _   _ _ __   ___
  \___ \| __/ _ | '__| __| | '_ \ / _ | | |  | | __| __/ _ \ '__/ __| | | | '_ \ / __|
  ____) | || (_| | |  | |_| | | | | (_| | | |__| | |_| ||  __/ |  \__ \ |_| | | | | (__ _ _ _
 |_____/ \__\__,_|_|   \__|_|_| |_|\__, |  \____/ \__|\__\___|_|  |___/\__, |_| |_|\___(_|_|_)
                                    __/ |                               __/ |
                                   |___/                               |___/


                                        .:-===++**++===-:
                                   :=##%@@@@@@@@@@@@@@@@@@%#*=.
                               .=#@@@@@@%##+====--====+##@@@@@@@#=.     ...
                   .=**###*=:+#@@@@%*=:.                  .:=#%@@@@#==#@@@@@%#-
                 -#@@@@%%@@@@@@%+-.                            .=*%@@@@#*+*#@@@%=
                =@@@*:    -%%+:                                    -#@+.     =@@@-
                %@@#     +@#.                                        :%%-     %@@*
                @@@+    +%=.     -+=                        :=-       .#@-    %@@#
                *@@%:  #@-      =@@@*                      +@@@%.       =@= -*@@@:
                 #@@@##@+       #@@@@.                     %@@@@=        #@%@@@#-
                  :#@@@@:       +@@@#       :=++++==-.     *@@@@:        =@@@@-
                  =%@@%=         +#*.    =#%#+==-==+#%%=:  .+#*:         .#@@@#.
                 +@@%+.               .+%+-.          :=##-                :#@@@-
                -@@@=                -%#:     ..::.      +@*                 +@@%.
    .::-========*@@@..              -@#      +%@@@@%.     -@#               .-@@@+=======-
.:-====----:::::#@@%:--=::::..      #@:      *@@@@@%:      *@=      ..:-:-=--:@@@+::::----
                =@@@:.......        @@        :+@#=.       -@+        .......-@@@:
       .:=++####*%@@%=--::::..      @@   %#     %*    :@*  -@+      ...::---+@@@#*#*##+=-:
  ..--==::..     :%@@@-   ..:::..   @@   +@*:.-#@@+-.-#@-  -@+   ..:::..  .+@@@#.     ..:-
                  .#@@@##-:.        @@    :+#@%=.:+@@#=.   -@+        .-=#@@@@+
             -=+++=--+%@@%+=.       @@       +%*=+#%-      -@+       :=#@@@%+--++++=:
         .=**=:.      .=*@@@@@#=:.  @@         :--.        -@+  .-+#@@@@%+:       .:=*+-.
        ::.              .=*@@@@@@%#@@+=-:..         ..::=+#@%#@@@@@@%+-.             ..-.
                            ..=*#@@@@@@@@@@@@@@@%%@@@@@@@@@@@@@@%#+-.
                                  .:-==++*#######%######**+==-:



INFO[06-13|06:57:34.059] [1/6 OtterSync] Syncing header-chain
INFO[06-13|06:57:34.071] [Checkpoint Sync] Requesting beacon state uri=https://checkpointz.pietjepuk.net/eth/v2/debug/beacon/states/finalized
INFO[06-13|06:57:35.097] [snapshots] initializing downloads       torrents=0/337
INFO[06-13|06:57:37.097] [snapshots] initializing downloads       torrents=0/337
WARN[06-13|06:57:39.221] builder api enable but relay url not set. Skipping builder mode
INFO[06-13|06:57:39.221] Starting caplin
INFO[06-13|06:57:41.098] [snapshots] initializing downloads       torrents=0/337
INFO[06-13|06:57:41.347] Static peers                             len=0
INFO[06-13|06:57:41.350] [Sentinel] Sentinel started              app=caplin enr=enr:-LS4QOMVBG06U52BKFETqdtGsG9m_65e-yOLvpPVu9TV392jHgiVnwapCFw1pJMJQdpNznz9wzM_71SWUfYvsujyd28Bh2F0dG5ldHOIAAAAAAAAAACEZXRoMpCtUyzrBQAAAAAAAAAAAAAAgmlkgnY0iXNlY3AyNTZrMaEC-sukW2lZzBxeAOOdxHVfmGp6dUt3wHotipf22tJ_4XKIc3luY25ldHMAg3RjcIIPoYN1ZHCCD6A
INFO[06-13|06:57:41.351] Started Ethereum 2.0 Gossip Service      app=caplin
INFO[06-13|06:57:41.352] Beacon API started                       addr=localhost:5555
INFO[06-13|06:57:41.352] [Caplin] starting clstages loop          app=caplin
meta_unsteady:26382 wipe txn #3, meta 2
INFO[06-13|06:57:42.067] Starting downloading History             app=caplin stage=DownloadHistoricalBlocks from=11914400
INFO[06-13|06:57:42.348] [Sentinel] Update ENR on subscription    subnet=15 subscribe=true type=attestation
INFO[06-13|06:57:42.349] [Sentinel] Update ENR on subscription    subnet=11 subscribe=true type=attestation
INFO[06-13|06:57:49.098] [snapshots] initializing downloads       torrents=0/337
INFO[06-13|06:57:51.540] [snapshots] initializing downloads       torrents=337/337
INFO[06-13|06:57:51.612] [1/6 OtterSync] Header-chain synced
INFO[06-13|06:57:51.650] [1/6 OtterSync] Syncing remaining snapshots
INFO[06-13|06:57:51.799] [snapshots] initializing downloads       torrents=1554/1554
INFO[06-13|06:57:51.859] [snapshots] no metadata yet              files=43 list=idx/v2.0-logaddrs.1840-1841.ef,history/v1.0-storage.1840-1841.v,domain/v1.0-rcache.1840-1841.kvi,accessor/v1.0-receipt.1840-1841.vi,idx/v2.0-rcache.1840-1841.ef,...
INFO[06-13|06:57:51.860] [1/6 OtterSync] Syncing                  file-metadata=1511/1554 files=1510/1554 data=1.8TB time-left=inf total-time=17s download-rate=0B/s hashing-rate=0B/s alloc=8.3GB sys=8.9GB
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe3bf18]

goroutine 27130 [running]:
log/slog.(*Logger).Handler(...)
	/usr/lib/go-1.24/src/log/slog/logger.go:121
log/slog.(*Logger).Enabled(0x1?, {0x3a23090?, 0x6793a20?}, 0xc0530767c0?)
	/usr/lib/go-1.24/src/log/slog/logger.go:168 +0x18
log/slog.(*Logger).log(0x0, {0x3a23090, 0x6793a20}, 0x4, {0x30e0e25, 0x31}, {0xc19f843d00, 0x6, 0x6})
	/usr/lib/go-1.24/src/log/slog/logger.go:241 +0x6a
log/slog.(*Logger).Warn(...)
	/usr/lib/go-1.24/src/log/slog/logger.go:219
github.com/anacrolix/torrent/webseed.(*Client).checkContentLength(0xc198fb43c8, 0xc1a0c3b7a0, {0xc17a9e3b80, {0x1200000, 0x200000}, 0xc19fb21da0, 0xc195eecf70}, 0x1400000)
	/home/erigon/go/pkg/mod/github.com/anacrolix/torrent@v1.58.2-0.20250610025943-9b88d091c5df/webseed/client.go:135 +0x1da
github.com/anacrolix/torrent/webseed.(*Client).recvPartResult(0xc198fb43c8, {0x3a235d0, 0xc1a08a64b0}, {0x39ffcc0, 0xc19d7c67e0}, {0xc17a9e3b80, {0x1200000, 0x200000}, 0xc19fb21da0, 0xc195eecf70}, ...)
	/home/erigon/go/pkg/mod/github.com/anacrolix/torrent@v1.58.2-0.20250610025943-9b88d091c5df/webseed/client.go:168 +0x435
github.com/anacrolix/torrent/webseed.(*Client).readRequestPartResponses(0xc198fb43c8, {0x3a235d0, 0xc1a08a64b0}, {0x39ffcc0, 0xc19d7c67e0}, {0xc1a08aa240?, 0xc115f52fd0?, 0x601324?})
	/home/erigon/go/pkg/mod/github.com/anacrolix/torrent@v1.58.2-0.20250610025943-9b88d091c5df/webseed/client.go:213 +0x12a
github.com/anacrolix/torrent/webseed.(*Client).StartNewRequest.func2()
	/home/erigon/go/pkg/mod/github.com/anacrolix/torrent@v1.58.2-0.20250610025943-9b88d091c5df/webseed/client.go:109 +0x3e
created by github.com/anacrolix/torrent/webseed.(*Client).StartNewRequest in goroutine 26899
	/home/erigon/go/pkg/mod/github.com/anacrolix/torrent@v1.58.2-0.20250610025943-9b88d091c5df/webseed/client.go:108 +0x1ec
exit status 2

in latest version of this branch

Thanks for this. I added this to catch bad cases in Erigon webseeds, I never thought it would actually trigger. Fix shortly

Update: Okay fixed. I didn't fix the actual issue on the webseed hosts... I'm getting a lot of warnings that suggest that the preverified.toml and the webseed contents are currently out of sync for main/ethmainnet
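The trace above shows `log/slog.(*Logger).log` being called on a nil receiver (`0x0`). A minimal sketch of one way to guard against that class of bug is to fall back to `slog.Default()` when no logger was configured. This is an assumption for illustration, not the fix that was actually applied in anacrolix/torrent; the `webseedClient` type and its methods here are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"log/slog"
)

// webseedClient is a hypothetical stand-in for a client with an optional logger.
type webseedClient struct {
	Logger *slog.Logger // may be nil if the caller never set it
}

// logger returns the configured logger, or slog.Default() when none was set,
// so logging calls never dereference a nil *slog.Logger.
func (c *webseedClient) logger() *slog.Logger {
	if c.Logger != nil {
		return c.Logger
	}
	return slog.Default()
}

// checkContentLength warns when the received length differs from the expected
// one and reports whether the two matched.
func (c *webseedClient) checkContentLength(got, want int64) bool {
	if got == want {
		return true
	}
	c.logger().Warn("unexpected content length", "got", got, "want", want)
	return false
}

func main() {
	var c webseedClient // Logger deliberately left nil
	fmt.Println(c.checkContentLength(0x200000, 0x200000))
	fmt.Println(c.checkContentLength(0x1400000, 0x200000))
}
```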

@anacrolix
Contributor Author

All regular CI/unit tests are passing.

@canepat's selection of snapshot CI jobs is passing (as far as I can tell, with only occasional false negatives unrelated to snapshot download).

The only pending question is how to handle snapshot merges a client does locally that result in a file being removed. I don't know if it blocks using the PR: clients would potentially redownload snapshots they had merged after they restart. It could be fixed later. I have a change for it, and two potential paths forward.

@AskAlexSharov could you review? Do you want the last piece on the removals included?

@AskAlexSharov
Collaborator

will review now
seems #15537 broke sync in main - better to revert/fix it first. I will ask @JkLondon

@@ -526,7 +526,23 @@ func (a *Aggregator) BuildMissedAccessors(ctx context.Context, workers int) erro
ii.BuildMissedAccessors(ctx, g, ps, missedFilesItems.ii[ii.name])
}

if err := g.Wait(); err != nil {
err := func() error {
defer func() {
Collaborator

do we need it?

Contributor Author

Technically not, same as my other comment.

@@ -205,6 +206,7 @@ func ProcessFrozenBlocks(ctx context.Context, db kv.RwDB, blockReader services.F
func StageLoopIteration(ctx context.Context, db kv.RwDB, txc wrap.TxContainer, sync *stagedsync.Sync, initialCycle, firstCycle bool, logger log.Logger, blockReader services.FullBlockReader, hook *Hook) (err error) {
defer func() {
if rec := recover(); rec != nil {
debug.PrintStack()
Collaborator

do we need it?

Contributor Author

No. More general problem with hiding panics, but doesn't need to be in this PR.
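As a sketch of the general concern about hidden panics, one option is to convert a recovered panic into an error that carries the panic value and stack trace, so the caller decides how to report it. This is an illustration under assumed names (`runRecovered` is hypothetical) and is not code from this PR.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// runRecovered (hypothetical helper) calls fn and converts a panic into an
// error that includes the panic value and the stack of the panicking goroutine.
func runRecovered(fn func() error) (err error) {
	defer func() {
		if rec := recover(); rec != nil {
			err = fmt.Errorf("panic: %v\n%s", rec, debug.Stack())
		}
	}()
	return fn()
}

func main() {
	err := runRecovered(func() error {
		var m map[string]int
		m["x"] = 1 // assignment to a nil map panics
		return nil
	})
	fmt.Println(err != nil)
}
```

Returning the stack inside the error keeps the information attached to the failure instead of printing it to stderr separately from the log line that reports it.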

-go 1.23.0
-
-toolchain go1.23.6
+go 1.24
Collaborator

Usually we drop go 1.23 support when go 1.25.1 is released.

Collaborator

maybe it's fine

Contributor Author

I had to do this because lint and mod checks were failing. I couldn't find an alternative.

@mriccobene
Member

@AskAlexSharov AskAlexSharov enabled auto-merge (squash) June 19, 2025 14:20
@AskAlexSharov AskAlexSharov merged commit ebcd1a9 into main Jun 19, 2025
14 checks passed
@AskAlexSharov AskAlexSharov deleted the anacrolix/master branch June 19, 2025 15:15
anacrolix added a commit that referenced this pull request Jun 23, 2025
This is to complete an item from
#15043:

> Don't redownload files after preverified.toml is committed if they go
missing. This occurs due to merges etc.

At node startup, the preverified.toml that was committed locally from a
previous initial sync is scanned and the torrents in it are requested
from the downloader. If in a previous run, torrents were removed due to
merges, the torrents in the final snapshot will be added again (to my
understanding). I can see two ways to deal with this:

1. Mark manually removed torrents with a file like name.removed, or
2. Once the preverified.toml is committed, don't request any torrents
from it any longer.

The second option removes a lot of the logic from the downloader, where
it doesn't belong. This PR implements that. The only downside I can see
is that it makes it harder to manually repair a snapshot dir: for
example, under option 1 you could "resync" a dir by removing all the
`.removed` files. However, you could also sync to the latest initial
snapshot by removing whatever preverified.toml you have, possibly wiping
any .torrent files, and letting the node resync (it will reuse existing
data now too since #15043).

I'm working on a sequence diagram that should describe how it fits
together so it's clear and easy to work with the downloader snapshot dir
and to check this logic.
@anacrolix anacrolix mentioned this pull request Jul 14, 2025
AskAlexSharov pushed a commit that referenced this pull request Jul 27, 2025
In #15043 I didn't include the updated protobuf interfaces. I spotted
them in #16081.

The vendoring is too brittle and makes it hard to generate local changes
in a workspace. I switched it to using a git submodule which is much
easier to work with. The interfaces generated here are direct outputs of
the protobuf so there's no need to include a reference to the interfaces
revision used in the Go code.
Successfully merging this pull request may close these issues.

Downloader has synchronization issues
5 participants