downloader: Lock-up during sync due to circular return logic #16539

Closed
@veox

Description

EDIT: Original title: Lock-up during initial sync

EDIT: See this comment for current best guess on cause.


System information

Geth version: v1.8.4-stable-2423ae01/linux-amd64/go1.10 (installed via Ubuntu PPA package)
OS & Version: Ubuntu 16.04.4 LTS (Xenial Xerus)
Machine: KVM VPS

% uname -a
Linux <hostname> 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Expected behaviour

Continuous fast-sync.

Running via systemd with:

/usr/bin/geth --pprof --metrics --datadir /home/geth/.ethereum --cache 4096 --txpool.pricelimit 31337000 --syncmode fast --ethstats "veox-geth-lightserv-new-RESYNC:$SECRET@ethstats.net"

Actual behaviour

After a period of seemingly normal operation, with "stalling" peers dropped once in a while, non-debug log output stops, as shown in this log tail.

At this point, in console using geth attach:

> eth.syncing
{
  currentBlock: 4460991,
  highestBlock: 5402762,
  knownStates: 15834634,
  pulledStates: 15823975,
  startingBlock: 4327715
}
> admin.peers.length
25

Setting

> debug.vmodule("p2p=4,downloader=4")

results in

DEBUG[04-20|15:06:23] Recalculated downloader QoS values       rtt=5.195478857s confidence=1.000 ttl=15.586452156s

being printed repeatedly (the timestamp changes, as expected; the rtt and ttl values do not).

Forcibly disconnecting a peer using admin.removePeer("<enode>") works; a new peer is selected from the pool. In other words, p2p still works fine(-ish?).

Steps to reproduce the behaviour

Not sure; this is possibly related to networking conditions on the machine.

Happens anywhere between 5 minutes and 1 hour after starting the node.

Rambling

If I had to hazard a guess, I'd say the node corners itself into selecting peers so fast that a small traffic spike on the VPS tower makes them all look just slow enough to be dropped.

After that, either the downloader fails to realise the sync-peers are no longer there; QoS fails at hysteresis; all remaining peers are malicious; or something of the sort.

Backtrace

See gist.
