
URGENT - Archive node with latest snapshot cannot reach live #11306

Open
MrFrogoz opened this issue Sep 28, 2024 · 13 comments
Labels
A-op-reth (Related to Optimism and op-reth), C-bug (An unexpected or incorrect behavior)

Comments

MrFrogoz commented Sep 28, 2024

Describe the bug

When the node is started with "--l2.enginekind=reth" and begins syncing from a checkpoint to a target, progress is very slow at this point:

op-reth[38832]: 2024-09-28T00:14:27.546943Z  INFO Status connected_peers=22 stage=MerkleExecute checkpoint=20337319 target=20343860 stage_progress=33.54% stage_eta=2h 26m 19s
op-reth[38832]: 2024-09-28T00:14:34.634691Z  INFO Committed stage progress pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860 stage_progress=33.77% stage_eta=1h 36m 42s
op-reth[38832]: 2024-09-28T00:14:35.068968Z  INFO Preparing stage pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860
op-reth[38832]: 2024-09-28T00:14:35.069000Z  INFO Executing stage pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860 stage_eta=1h 36m 42s
op-reth[38832]: 2024-09-28T00:14:37.419851Z  INFO Committed stage progress pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860 stage_progress=33.79% stage_eta=2h 41m 48s
op-reth[38832]: 2024-09-28T00:14:37.459676Z  INFO Preparing stage pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860
op-reth[38832]: 2024-09-28T00:14:37.459701Z  INFO Executing stage pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860 stage_eta=2h 41m 48s
op-reth[38832]: 2024-09-28T00:14:40.913377Z  INFO Committed stage progress pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860 stage_progress=33.82% stage_eta=2h 3m 13s
op-reth[38832]: 2024-09-28T00:14:40.973253Z  INFO Preparing stage pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860

When the node finishes a set of blocks, it is still a couple of hours behind and starts over with a new set of blocks. It keeps doing this for days without ever catching up to live.

Steps to reproduce

wget https://mainnet-reth-archive-snapshots.base.org/$(curl https://mainnet-reth-archive-snapshots.base.org/latest)

ExecStart=/root/build/op-node \
    --network=base-mainnet \
    --l1=........ \
    --l2=http://localhost:8551 \
    --rpc.addr=0.0.0.0 \
    --rpc.port=60004 \
    --l2.jwt-secret=/root/build/jwt.hex \
    --l1.trustrpc \
    --l1.rpckind=basic \
    --l1.beacon=........ \
    --syncmode=execution-layer \
    --l2.enginekind=reth

ExecStart=/root/build/op-reth node \
    --chain=base \
    --datadir=/db/reth/base \
    --authrpc.addr=0.0.0.0 \
    --authrpc.port=8551 \
    --authrpc.jwtsecret=/root/build/jwt.hex \
    --http.port=50004 \
    --http.addr=0.0.0.0 \
    --http.corsdomain=* \
    --http \
    --ws \
    --http.api=admin,debug,eth,net,trace,txpool,web3,rpc,reth,ots \
    --rollup.sequencer-http=https://mainnet-sequencer.base.org \
    --rpc-max-connections=1000000 \
    --rpc-max-tracing-requests=1000000

Node logs

No response

Platform(s)

Linux (x86)

What version/commit are you on?

latest

What database version are you on?

latest

Which chain / network are you on?

base

What type of node are you running?

Archive (default)

What prune config do you use, if any?

No response

If you've built Reth from source, provide the full command you used

No response

Code of Conduct

  • I agree to follow the Code of Conduct
MrFrogoz added the C-bug (An unexpected or incorrect behavior) and S-needs-triage (This issue needs to be labelled) labels on Sep 28, 2024
MrFrogoz (Author) commented Sep 30, 2024

I'm also trying without the "--l2.enginekind=reth" option, and the node can only recover one block every 2-3 seconds. That is extremely slow considering how fast the Base chain is moving at the moment: for a node that is 1 day behind, it takes about 1 week to get back to live.

op-reth[56231]: 2024-09-30T02:17:49.552305Z  INFO Block added to canonical chain number=20361203 hash=0x797e1489659cb6e3deee73e87efad223f7ace461d1ab2ea899942710de6b9a71 peers=8 txs=98 gas=29.95 Mgas gas_throughput=327.24 Mgas/second full=22.7% base_fee=0.01gwei blobs=0 excess_blobs=0 elapsed=91.517882ms
op-reth[56231]: 2024-09-30T02:17:52.019591Z  INFO Block added to canonical chain number=20361204 hash=0x61aa864cf69080675ffdb279fea17061d8e3a60a908498b44d02f2c4af82f0ac peers=8 txs=137 gas=31.56 Mgas gas_throughput=366.72 Mgas/second full=23.9% base_fee=0.01gwei blobs=0 excess_blobs=0 elapsed=86.067298ms
op-reth[56231]: 2024-09-30T02:17:54.171398Z  INFO Block added to canonical chain number=20361205 hash=0xd379d74ae1b0858700cecbd4b829af58065a0b902bbced6cdf6d3b0df7c30b36 peers=8 txs=149 gas=26.66 Mgas gas_throughput=341.26 Mgas/second full=20.2% base_fee=0.01gwei blobs=0 excess_blobs=0 elapsed=78.134131ms
op-reth[56231]: 2024-09-30T02:17:56.456186Z  INFO Block added to canonical chain number=20361206 hash=0x77b76bdd0c02ceb4b2b0d8ad4d5091e3d90b72e64f7d5509db776b0f9402fa09 peers=8 txs=119 gas=25.17 Mgas gas_throughput=304.98 Mgas/second full=19.1% base_fee=0.01gwei blobs=0 excess_blobs=0 elapsed=82.518137ms

Rjected (Member) commented Oct 3, 2024

Hi @MrFrogoz, what hardware are you running this on? Specifically, what kind of disk and CPU, and how much total RAM?

MrFrogoz (Author) commented Oct 3, 2024

AMD EPYC 7J13, 16 cores, 64 GB RAM, NVMe with 50K IOPS and 680 MB/s. Usage when running: CPU 3% - RAM 30%

I run many nodes of different blockchains on the same hardware; only Base has slowness issues with block sync.

MrFrogoz (Author) commented Oct 7, 2024

Is there any news? In the meantime I noticed that when the node is under many debug trace calls, synchronization is even slower and it almost always stays behind, while node resource usage remains low. Also, after reading the other thread, I am trying the "--engine.experimental" option to see if the node can stay live.

emhane added the A-op-reth (Related to Optimism and op-reth) label and removed the S-needs-triage (This issue needs to be labelled) label on Oct 8, 2024
MrFrogoz (Author) commented Oct 8, 2024

I managed to bring the node live with that option, but then removed it because of the following problem: #11570. However, after restarting the node without that option, it took 2 hours to go live again just to recover 10 minutes of downtime. I hope you will be able to improve the performance of the binary; I don't know if op-node has something to do with this continuous slowdown. I will wait for feedback.

MrFrogoz (Author) commented Oct 11, 2024

Still, given that sync is already quite slow as written before: if I make trace block calls only against live blocks, the node is again unable to stay live and starts to accumulate continuous delay.

mattsse (Collaborator) commented Oct 12, 2024

--rpc-max-tracing-requests=1000000

This is likely the reason you're experiencing the tracing issues: tracing is CPU bound, so concurrent tracing requests are by default limited to the number of cores minus a small reserve, depending on the number of available cores.

Usage when running: CPU 3% - RAM 30%

This sounds more like a disk issue, @Rjected?
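
For illustration only, a minimal sketch of the kind of cap described above, assuming a semaphore sized from the core count with a hypothetical reserve of two cores (this is not reth's actual implementation):

use std::thread::available_parallelism;
use tokio::sync::Semaphore;

// Assumed default: number of cores minus a small reserve kept free for block
// processing (the reserve of 2 here is illustrative, not reth's real value).
fn default_tracing_permits() -> usize {
    let cores = available_parallelism().map(|n| n.get()).unwrap_or(1);
    cores.saturating_sub(2).max(1)
}

#[tokio::main]
async fn main() {
    // Overriding the default with a huge value (e.g. 1,000,000 permits) would
    // effectively remove the cap and let CPU-bound traces compete with block
    // processing.
    let limiter = Semaphore::new(default_tracing_permits());

    // Each tracing request holds a permit while doing its CPU-heavy work, so
    // at most default_tracing_permits() traces run concurrently.
    let _permit = limiter.acquire().await.unwrap();
    // ... run the trace here while holding the permit ...
}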

MrFrogoz (Author) commented Oct 12, 2024

I understand, but even if I keep a node with the RPC API closed to everyone, it is still very slow to resync to live. For disk I am using an M.2 SSD with these specs:

IOPS: 50,000
Throughput: 680 MB/s

Current utilization per second:
3,151 IOPS
15 MB/s

I have a cluster of 5 nodes and they all have the same problem; each node is a few hours behind.
I have Optimism and Ethereum nodes running reth, both mainnet and testnet, with no problems, so this is probably because Base mainnet has very high traffic that the reth node perhaps cannot handle well.

MrFrogoz (Author) commented Oct 15, 2024

Is it possible to have some feedback? Are you working on it? Are you checking whether the problem is real? From what I've seen online, other developers have the same slowdown even on different infrastructure.

Rjected (Member) commented Oct 15, 2024

@MrFrogoz sorry for the lack of an update - we still don't know what the issue could be. We've been running Base nodes on our own infrastructure without issue, although Base may just require lower disk latency (given the IOPS / bandwidth don't seem saturated). An exact disk model would help, since some disks perform better than others.

MrFrogoz (Author) commented Oct 15, 2024

Unfortunately the Oracle data center does not explicitly state which disk model they use, but they definitely use NVMe SSD units; in theory it is equivalent to the AWS gp3 disk type. The fact that the node doesn't stay live even with RPC calls disabled, and that the disk metrics show so little utilization, suggests some lack of optimization once a certain TPS level is reached.

Rjected (Member) commented Oct 15, 2024

@MrFrogoz is the storage analogous to, e.g., GCP "local SSDs" or AWS "instance storage" (for example on their r5d machines)? Or is the storage a network-attached block storage system (AWS EBS, GCP Hyperdisk, etc.)? Is there an instance type we can take a look at? reth performs poorly on network-attached block storage because those systems have much higher latency, making it much more difficult to utilize the metered IOPS. This is because our I/O access patterns during block validation are synchronous and not parallel.
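
As a rough illustration of that last point, a back-of-the-envelope calculation with assumed per-read latencies (~0.1 ms for a local NVMe drive, ~0.5 ms for a network-attached volume; these figures are assumptions, not measurements from this setup):

// With synchronous reads (queue depth 1), read throughput is bounded by
// 1 / latency, regardless of how many IOPS the volume is rated for.
fn effective_iops(queue_depth: f64, latency_ms: f64) -> f64 {
    queue_depth / (latency_ms / 1000.0)
}

fn main() {
    for (name, latency_ms) in [("local NVMe (~0.1 ms)", 0.1), ("network-attached volume (~0.5 ms)", 0.5)] {
        // Prints roughly 10000 and 2000 reads/s respectively, both far short
        // of a 50,000 IOPS rating that assumes many parallel outstanding requests.
        println!("{name}: ~{:.0} reads/s at queue depth 1", effective_iops(1.0, latency_ms));
    }
}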

MrFrogoz (Author) commented Oct 16, 2024

As written above, you can take AWS EBS gp3 as a reference; it is identical to what I'm using.
I attach the disk benchmark commands, whose results are equivalent to those of a gp3.
Obviously an io2 type has half the latency.
Commands:

Random Reads
Throughput (MB/s): sudo fio --filename=/dev/device --direct=1 --rw=randread --bs=256k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 --readonly
IOPS (avg): sudo fio --filename=/dev/device --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly
Latency (ms): sudo fio --filename=/dev/device --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=1 --numjobs=1 --time_based --group_reporting --name=readlatency-test-job --runtime=120 --eta-newline=1 --readonly

Random Read/Writes
Throughput (MB/s): sudo fio --filename=/dev/device --direct=1 --rw=randrw --bs=256k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1
IOPS (avg): sudo fio --filename=/dev/device --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1
Latency (ms): sudo fio --filename=/dev/device --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=1 --numjobs=1 --time_based --group_reporting --name=rwlatency-test-job --runtime=120 --eta-newline=1

Sequential Reads
Throughput (MB/s): sudo fio --filename=/dev/device --direct=1 --rw=read --bs=256k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 --readonly
IOPS: sudo fio --filename=/dev/device --direct=1 --rw=read --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly

Block Volume:
[screenshot: oci-blockvolume]
