Erigon v3+ Crashes goroutine torrent issue #14675

Open
@Aderks

System information

Erigon version: v3.0.2 (issue seen on v3.0.0 and v3.0.1)

OS & Version: Ubuntu 24.04 / Erigon running in Docker through docker-compose

Commit hash: cd2863801089dacef6e6fa807eb02a531a7ab810

Erigon Command (with flags/config):

ETH mainnet w/ archival Caplin docker-compose command:

    command:
      - --chain=mainnet
      - --db.pagesize=4KB
      - --db.size.limit=8TB
      - --port=31303
      - --downloader.disable.ipv6=true
      - --http.addr=0.0.0.0
      - --http.port=8545
      - --http.api=net,web3,eth,admin,debug,txpool,engine,rpc,trace
      - --ws
      - --trace.maxtraces=2000
      - --trace.compat
      - --metrics
      - --metrics.addr=0.0.0.0
      - --metrics.port=6060
      - --authrpc.port=8551
      - --authrpc.addr=0.0.0.0
      - --authrpc.jwtsecret=/opt/jwt/jwt.hex
      - --authrpc.vhosts=*
      - --http.vhosts=*
      - --http.corsdomain=*
      - --rpc.batch.concurrency=32
      - --db.read.concurrency=512
      - --rpc.returndata.limit=5000000000
      - --rpc.batch.limit=200
      - --torrent.download.slots=6
      - --torrent.download.rate=60mb
      - --rpc.gascap=5000000000
      - --prune.mode=archive
      - --caplin.blobs-no-pruning
      - --caplin.blobs-immediate-backfill
      - --caplin.states-archive
      - --caplin.blobs-archive
      - --caplin.blocks-archive
      - --caplin.discovery.addr=0.0.0.0
      - --caplin.discovery.port=51161
      - --caplin.discovery.tcpport=51162
      - --beacon.api=beacon,builder,config,debug,events,node,validator,lighthouse
      - --beacon.api.addr=0.0.0.0
      - --beacon.api.cors.allow-origins=*
      - --beacon.api.port=5059

Chain/Network: mainnet & gnosis

Expected behaviour

Erigon runs as intended and does not crash at random.

Actual behaviour

Erigon crashes at random and dumps a long list of goroutine stack traces. The nodes are fully synced to head.

With a restart policy set in docker-compose, Erigon stalls on the automatic restart that follows because the Caplin discovery TCP port (--caplin.discovery.tcpport) has not been released yet. A second restart is needed.

Excerpt from the crash log:

goroutine 12470398 gp=0xc0a6a6f6c0 m=nil [select]:
runtime.gopark(0xc0f92eaf30?, 0x2?, 0xb8?, 0xad?, 0xc0f92eaef4?)
        runtime/proc.go:424 +0xce fp=0xc0f92ead70 sp=0xc0f92ead50 pc=0x493aee
runtime.selectgo(0xc0f92eaf30, 0xc0f92eaef0, 0xc27653ffa8?, 0x0, 0xc1a4901e60?, 0x1)
        runtime/select.go:335 +0x7a5 fp=0xc0f92eae98 sp=0xc0f92ead70 pc=0x46ee65
github.com/libp2p/go-libp2p-pubsub.(*PubSub).handleSendingMessages(0xc27653ffd0?, {0x354a590, 0x5f5c560}, {0x356ce50, 0xc1a1694c80}, 0xc003770b60)
        github.com/libp2p/go-libp2p-pubsub@v0.11.0/comm.go:178 +0x128 fp=0xc0f92eafa0 sp=0xc0f92eae98 pc=0x18652c8
github.com/libp2p/go-libp2p-pubsub.(*PubSub).handleNewPeer.gowrap1()
        github.com/libp2p/go-libp2p-pubsub@v0.11.0/comm.go:130 +0x34 fp=0xc0f92eafe0 sp=0xc0f92eafa0 pc=0x1864ed4
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0f92eafe8 sp=0xc0f92eafe0 pc=0x49c1c1
created by github.com/libp2p/go-libp2p-pubsub.(*PubSub).handleNewPeer in goroutine 12489443
        github.com/libp2p/go-libp2p-pubsub@v0.11.0/comm.go:130 +0x2e5

goroutine 128769234 gp=0xc0a6cc3dc0 m=nil [select]:
runtime.gopark(0xc0076adf30?, 0x2?, 0xb8?, 0xdd?, 0xc0076adef4?)
        runtime/proc.go:424 +0xce fp=0xc0076add70 sp=0xc0076add50 pc=0x493aee
runtime.selectgo(0xc0076adf30, 0xc0076adef0, 0xc0a6cc3dc0?, 0x0, 0x4268be?, 0x1)
        runtime/select.go:335 +0x7a5 fp=0xc0076ade98 sp=0xc0076add70 pc=0x46ee65
github.com/libp2p/go-libp2p-pubsub.(*PubSub).handleSendingMessages(0x10000c171399e60?, {0x354a590, 0x5f5c560}, {0x356cee0, 0xc14f5023a0}, 0xc084bb5110)
        github.com/libp2p/go-libp2p-pubsub@v0.11.0/comm.go:178 +0x128 fp=0xc0076adfa0 sp=0xc0076ade98 pc=0x18652c8
github.com/libp2p/go-libp2p-pubsub.(*PubSub).handleNewPeer.gowrap1()
        github.com/libp2p/go-libp2p-pubsub@v0.11.0/comm.go:130 +0x34 fp=0xc0076adfe0 sp=0xc0076adfa0 pc=0x1864ed4
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0076adfe8 sp=0xc0076adfe0 pc=0x49c1c1
created by github.com/libp2p/go-libp2p-pubsub.(*PubSub).handleNewPeer in goroutine 128766762
        github.com/libp2p/go-libp2p-pubsub@v0.11.0/comm.go:130 +0x2e5

goroutine 10914619 gp=0xc0aacf5340 m=nil [sync.Cond.Wait, 466 minutes]:
runtime.gopark(0xc0a4dc7dd8?, 0xe05b62?, 0x90?, 0x7d?, 0xc0a4dc7e18?)
        runtime/proc.go:424 +0xce fp=0xc0b774cd98 sp=0xc0b774cd78 pc=0x493aee
runtime.goparkunlock(...)
        runtime/proc.go:430
sync.runtime_notifyListWait(0xc168a804f0, 0x21b22)
        runtime/sema.go:587 +0x159 fp=0xc0b774cde8 sp=0xc0b774cd98 pc=0x495619
sync.(*Cond).Wait(0xc168a80090?)
        sync/cond.go:71 +0x85 fp=0xc0b774ce28 sp=0xc0b774cde8 pc=0x4b5465
github.com/anacrolix/torrent.(*webseedPeer).requester(0xc168a80008, 0x7f)
        github.com/anacrolix/torrent@v1.52.6-0.20231201115409-7ea994b6bbd8/webseed-peer.go:337 +0x5e5 fp=0xc0b774cfc0 sp=0xc0b774ce28 pc=0x10a4f65
github.com/anacrolix/torrent.(*webseedPeer).requester.func3.gowrap3()
        github.com/anacrolix/torrent@v1.52.6-0.20231201115409-7ea994b6bbd8/webseed-peer.go:267 +0x25 fp=0xc0b774cfe0 sp=0xc0b774cfc0 pc=0x10a5785
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0b774cfe8 sp=0xc0b774cfe0 pc=0x49c1c1
created by github.com/anacrolix/torrent.(*webseedPeer).requester.func3 in goroutine 8203931
        github.com/anacrolix/torrent@v1.52.6-0.20231201115409-7ea994b6bbd8/webseed-peer.go:267 +0x1a5

goroutine 8259890 gp=0xc0ad8c2000 m=nil [sync.Cond.Wait, 531 minutes]:
runtime.gopark(0xc12e08bdd8?, 0xe05b62?, 0x90?, 0xbd?, 0xc12e08be18?)
        runtime/proc.go:424 +0xce fp=0xc1ac25ad98 sp=0xc1ac25ad78 pc=0x493aee
runtime.goparkunlock(...)
        runtime/proc.go:430
sync.runtime_notifyListWait(0xc10d63f670, 0xaa)
        runtime/sema.go:587 +0x159 fp=0xc1ac25ade8 sp=0xc1ac25ad98 pc=0x495619
sync.(*Cond).Wait(0xc10d63f210?)
        sync/cond.go:71 +0x85 fp=0xc1ac25ae28 sp=0xc1ac25ade8 pc=0x4b5465
github.com/anacrolix/torrent.(*webseedPeer).requester(0xc10d63f188, 0x11)
        github.com/anacrolix/torrent@v1.52.6-0.20231201115409-7ea994b6bbd8/webseed-peer.go:337 +0x5e5 fp=0xc1ac25afc0 sp=0xc1ac25ae28 pc=0x10a4f65
github.com/anacrolix/torrent.(*webseedPeer).requester.func3.gowrap3()
        github.com/anacrolix/torrent@v1.52.6-0.20231201115409-7ea994b6bbd8/webseed-peer.go:267 +0x25 fp=0xc1ac25afe0 sp=0xc1ac25afc0 pc=0x10a5785
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc1ac25afe8 sp=0xc1ac25afe0 pc=0x49c1c1
created by github.com/anacrolix/torrent.(*webseedPeer).requester.func3 in goroutine 405203
        github.com/anacrolix/torrent@v1.52.6-0.20231201115409-7ea994b6bbd8/webseed-peer.go:267 +0x1a5

goroutine 129516757 gp=0xc0b044ddc0 m=nil [select]:
runtime.gopark(0xc2cb10bf40?, 0x2?, 0xf0?, 0x9c?, 0xc2cb10bf2c?)
        runtime/proc.go:424 +0xce fp=0xc2cb10bdb8 sp=0xc2cb10bd98 pc=0x493aee
runtime.selectgo(0xc2cb10bf40, 0xc2cb10bf28, 0xc129872cd0?, 0x0, 0xc12e9d6d80?, 0x1)
        runtime/select.go:335 +0x7a5 fp=0xc2cb10bee0 sp=0xc2cb10bdb8 pc=0x46ee65
github.com/anacrolix/torrent/tracker/udp.(*Client).requestWriter(0xc1b4df0140, {0x354aad0, 0xc129872cd0}, 0x1, {0xc12e9d6d80, 0x5d, 0x60}, 0x9592f3ee)
        github.com/anacrolix/torrent@v1.52.6-0.20231201115409-7ea994b6bbd8/tracker/udp/client.go:177 +0x156 fp=0xc2cb10bf78 sp=0xc2cb10bee0 pc=0xeca396
github.com/anacrolix/torrent/tracker/udp.(*Client).request.func2()
        github.com/anacrolix/torrent@v1.52.6-0.20231201115409-7ea994b6bbd8/tracker/udp/client.go:205 +0x3e fp=0xc2cb10bfe0 sp=0xc2cb10bf78 pc=0xecaafe
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc2cb10bfe8 sp=0xc2cb10bfe0 pc=0x49c1c1
created by github.com/anacrolix/torrent/tracker/udp.(*Client).request in goroutine 356414
        github.com/anacrolix/torrent@v1.52.6-0.20231201115409-7ea994b6bbd8/tracker/udp/client.go:204 +0x279

Our docker-compose.yml sets restart: unless-stopped. Erigon restarts after the crash, but sometimes the restart happens so quickly that the Caplin discovery port has not been released yet, and Erigon stalls:

[INFO] [04-19|08:38:29.828] Starting caplin
[EROR] [04-19|08:38:30.342] could not start caplin                   err="failed to listen on any addresses: [listen tcp4 0.0.0.0:51162: bind: address already in use]"
[INFO] [04-19|08:38:30.342] Exiting...
[INFO] [04-19|08:38:30.342] Exiting Engine...
[INFO] [04-19|08:38:30.342] RPC server shutting down
[INFO] [04-19|08:38:30.343] RPC server shutting down
[INFO] [04-19|08:38:30.343] HTTP endpoint closed                     url=[::]:8649
[INFO] [04-19|08:38:30.343] Engine HTTP endpoint close               url=[::]:8551
[INFO] [04-19|08:38:30.343] HTTP endpoint closed                     url=[::]:8546
[INFO] [04-19|08:38:30.343] RPC server shutting down
[INFO] [04-19|08:38:30.360] [txpool] stopped
[INFO] [04-19|08:38:30.360] devp2p txn pool goroutine terminated

Restarting once more fixes it, because the port has been released by then.
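
As a possible stopgap for the restart race, a small pre-start check could wait until the discovery port can be bound before Erigon is launched. This is only a sketch under assumptions: the helper names `port_is_free` and `wait_for_port` are hypothetical, the timeout values are arbitrary, and running it from a wrapper script before the erigon binary starts is an assumption about the deployment, not something Erigon provides. The probe binds without SO_REUSEADDR, so it is more conservative than a Go listener and may wait slightly longer than strictly necessary.

```python
import socket
import time


def port_is_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use", as in the Caplin log above.
            return False
    return True


def wait_for_port(host: str, port: int,
                  timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until host:port can be bound; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if port_is_free(host, port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the configuration above, the check would be `wait_for_port("0.0.0.0", 51162)`, run inside the same network namespace as Erigon. It only works around the symptom; the crash itself still needs to be fixed.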

Steps to reproduce the behaviour

Run Erigon through docker-compose with the command above. It crashes at random; no specific trigger has been identified.
