
URGENT - Archive node with latest snapshot cannot reach live #11306

Open
MrFrogoz opened this issue Sep 28, 2024 · 13 comments
Labels
A-op-reth (Related to Optimism and op-reth), C-bug (An unexpected or incorrect behavior)

Comments

MrFrogoz commented Sep 28, 2024

Describe the bug

When the node is started with "--l2.enginekind=reth" and begins syncing from a checkpoint to a target, progress is very slow at this point:

op-reth[38832]: 2024-09-28T00:14:27.546943Z  INFO Status connected_peers=22 stage=MerkleExecute checkpoint=20337319 target=20343860 stage_progress=33.54% stage_eta=2h 26m 19s
op-reth[38832]: 2024-09-28T00:14:34.634691Z  INFO Committed stage progress pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860 stage_progress=33.77% stage_eta=1h 36m 42s
op-reth[38832]: 2024-09-28T00:14:35.068968Z  INFO Preparing stage pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860
op-reth[38832]: 2024-09-28T00:14:35.069000Z  INFO Executing stage pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860 stage_eta=1h 36m 42s
op-reth[38832]: 2024-09-28T00:14:37.419851Z  INFO Committed stage progress pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860 stage_progress=33.79% stage_eta=2h 41m 48s
op-reth[38832]: 2024-09-28T00:14:37.459676Z  INFO Preparing stage pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860
op-reth[38832]: 2024-09-28T00:14:37.459701Z  INFO Executing stage pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860 stage_eta=2h 41m 48s
op-reth[38832]: 2024-09-28T00:14:40.913377Z  INFO Committed stage progress pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860 stage_progress=33.82% stage_eta=2h 3m 13s
op-reth[38832]: 2024-09-28T00:14:40.973253Z  INFO Preparing stage pipeline_stages=8/12 stage=MerkleExecute checkpoint=20337319 target=20343860

When the node finishes a set of blocks, it is still a couple of hours behind and starts over with a new set of blocks. It keeps doing this for days without ever catching up to live.

Steps to reproduce

wget https://mainnet-reth-archive-snapshots.base.org/$(curl https://mainnet-reth-archive-snapshots.base.org/latest)

ExecStart=/root/build/op-node \
    --network=base-mainnet \
    --l1=........ \
    --l2=http://localhost:8551 \
    --rpc.addr=0.0.0.0 \
    --rpc.port=60004 \
    --l2.jwt-secret=/root/build/jwt.hex \
    --l1.trustrpc \
    --l1.rpckind=basic \
    --l1.beacon=........ \
    --syncmode=execution-layer \
    --l2.enginekind=reth

ExecStart=/root/build/op-reth node \
    --chain=base \
    --datadir=/db/reth/base \
    --authrpc.addr=0.0.0.0 \
    --authrpc.port=8551 \
    --authrpc.jwtsecret=/root/build/jwt.hex \
    --http.port=50004 \
    --http.addr=0.0.0.0 \
    --http.corsdomain=* \
    --http \
    --ws \
    --http.api=admin,debug,eth,net,trace,txpool,web3,rpc,reth,ots \
    --rollup.sequencer-http=https://mainnet-sequencer.base.org \
    --rpc-max-connections=1000000 \
    --rpc-max-tracing-requests=1000000

Node logs

No response

Platform(s)

Linux (x86)

What version/commit are you on?

latest

What database version are you on?

latest

Which chain / network are you on?

base

What type of node are you running?

Archive (default)

What prune config do you use, if any?

No response

If you've built Reth from source, provide the full command you used

No response

Code of Conduct

  • I agree to follow the Code of Conduct
MrFrogoz added the C-bug (An unexpected or incorrect behavior) and S-needs-triage (This issue needs to be labelled) labels on Sep 28, 2024
MrFrogoz (Author) commented Sep 30, 2024

I'm also trying without the "--l2.enginekind=reth" option, and the node can only recover one block every 2-3 seconds. That is extremely slow considering how fast the Base chain is moving at the moment: for a node that is 1 day behind, it takes about 1 week to get back to live.

op-reth[56231]: 2024-09-30T02:17:49.552305Z  INFO Block added to canonical chain number=20361203 hash=0x797e1489659cb6e3deee73e87efad223f7ace461d1ab2ea899942710de6b9a71 peers=8 txs=98 gas=29.95 Mgas gas_throughput=327.24 Mgas/second full=22.7% base_fee=0.01gwei blobs=0 excess_blobs=0 elapsed=91.517882ms
op-reth[56231]: 2024-09-30T02:17:52.019591Z  INFO Block added to canonical chain number=20361204 hash=0x61aa864cf69080675ffdb279fea17061d8e3a60a908498b44d02f2c4af82f0ac peers=8 txs=137 gas=31.56 Mgas gas_throughput=366.72 Mgas/second full=23.9% base_fee=0.01gwei blobs=0 excess_blobs=0 elapsed=86.067298ms
op-reth[56231]: 2024-09-30T02:17:54.171398Z  INFO Block added to canonical chain number=20361205 hash=0xd379d74ae1b0858700cecbd4b829af58065a0b902bbced6cdf6d3b0df7c30b36 peers=8 txs=149 gas=26.66 Mgas gas_throughput=341.26 Mgas/second full=20.2% base_fee=0.01gwei blobs=0 excess_blobs=0 elapsed=78.134131ms
op-reth[56231]: 2024-09-30T02:17:56.456186Z  INFO Block added to canonical chain number=20361206 hash=0x77b76bdd0c02ceb4b2b0d8ad4d5091e3d90b72e64f7d5509db776b0f9402fa09 peers=8 txs=119 gas=25.17 Mgas gas_throughput=304.98 Mgas/second full=19.1% base_fee=0.01gwei blobs=0 excess_blobs=0 elapsed=82.518137ms

Rjected (Member) commented Oct 3, 2024

Hi @MrFrogoz, what hardware are you running this on? Specifically, what kind of disk and CPU, and how much total RAM?

MrFrogoz (Author) commented Oct 3, 2024

AMD EPYC 7J13, 16 cores, 64 GB RAM, NVMe with 50K IOPS and 680 MB/s. Usage when running: CPU 3% - RAM 30%

I run many nodes of different blockchains on the same hardware; only Base has slowness issues with block sync.

MrFrogoz (Author) commented Oct 7, 2024

Is there any news? In the meantime I noticed that when the node is under many debug trace calls, synchronization is even slower and it almost always stays behind, while node resource usage remains low. Also, after reading the other thread, I am trying the "--engine.experimental" option to see if the node can stay live.

emhane added the A-op-reth (Related to Optimism and op-reth) label and removed the S-needs-triage (This issue needs to be labelled) label on Oct 8, 2024
MrFrogoz (Author) commented Oct 8, 2024

I managed to bring the node live with that option, but then removed it because of the following problem: #11570. However, after restarting the node without that option, it took 2 hours to go live again just to recover 10 minutes of downtime. I hope you will be able to improve the performance of the binary; I don't know if op-node has something to do with this continuous slowdown. I will wait for feedback.

MrFrogoz (Author) commented Oct 11, 2024

Still, given that sync is already quite slow as written before: if I make trace block calls only against live blocks, the node is again unable to stay live and starts to accumulate continuous delay.

mattsse (Collaborator) commented Oct 12, 2024

--rpc-max-tracing-requests=1000000

This is likely the reason you're experiencing the tracing issues: tracing is CPU bound, so concurrent tracing requests are by default limited to the number of cores minus a small reserve, depending on the number of available cores.

Usage when running: CPU 3% - RAM 30%

This sounds more like a disk issue, @Rjected?
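
For illustration only, a minimal sketch of the kind of cap described above, assuming a semaphore sized from the core count with a hypothetical reserve of two cores (this is not reth's actual implementation):

use std::thread::available_parallelism;
use tokio::sync::Semaphore;

// Assumed default: number of cores minus a small reserve kept free for block
// processing (the reserve of 2 here is illustrative, not reth's real value).
fn default_tracing_permits() -> usize {
    let cores = available_parallelism().map(|n| n.get()).unwrap_or(1);
    cores.saturating_sub(2).max(1)
}

#[tokio::main]
async fn main() {
    // Overriding the default with a huge value (e.g. 1,000,000 permits) would
    // effectively remove the cap and let CPU-bound traces compete with block
    // processing.
    let limiter = Semaphore::new(default_tracing_permits());

    // Each tracing request holds a permit while doing its CPU-heavy work, so
    // at most default_tracing_permits() traces run concurrently.
    let _permit = limiter.acquire().await.unwrap();
    // ... run the trace here while holding the permit ...
}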

MrFrogoz (Author) commented Oct 12, 2024

I understand, but even if I keep a node with the RPC API closed to everyone, it is still very slow to resync to live. For disk I am using an M.2 SSD with these specs:

IOPS: 50,000
Throughput: 680 MB/s

Current utilization per second:
3,151 IOPS
15 MB/s

I have a cluster of 5 nodes and they all have the same problem; each node is a few hours behind.
I have Optimism and Ethereum nodes running reth, both mainnet and testnet, with no problems, so this is probably because Base mainnet has very high traffic that the reth node perhaps cannot handle well.

MrFrogoz (Author) commented Oct 15, 2024

Is it possible to have some feedback? Are you working on it? Are you checking whether the problem is real? From what I've seen online, other developers have the same slowdown even on different infrastructure.

Rjected (Member) commented Oct 15, 2024

@MrFrogoz sorry for the lack of an update - we still don't know what the issue could be. We've been running Base nodes on our own infrastructure without issue, although Base may just require lower disk latency (given the IOPS / bandwidth don't seem saturated). An exact disk model would help, since some disks perform better than others.

MrFrogoz (Author) commented Oct 15, 2024

Unfortunately the Oracle data center does not explicitly state which disk model they use, but they definitely use NVMe SSD units; in theory it is equivalent to the AWS gp3 disk type. The fact that the node doesn't stay live even with RPC calls disabled, and that the disk metrics show so little utilization, suggests some lack of optimization once a certain TPS level is reached.

Rjected (Member) commented Oct 15, 2024

@MrFrogoz is the storage analogous to, e.g., GCP "local SSDs" or AWS "instance storage" (for example on their r5d machines)? Or is the storage a network-attached block storage system (AWS EBS, GCP Hyperdisk, etc.)? Is there an instance type we can take a look at? reth performs poorly on network-attached block storage because those systems have much higher latency, making it much more difficult to utilize the metered IOPS. This is because our I/O access patterns during block validation are synchronous and not parallel.
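
As a rough illustration of that last point, a back-of-the-envelope calculation with assumed per-read latencies (~0.1 ms for a local NVMe drive, ~0.5 ms for a network-attached volume; these figures are assumptions, not measurements from this setup):

// With synchronous reads (queue depth 1), read throughput is bounded by
// 1 / latency, regardless of how many IOPS the volume is rated for.
fn effective_iops(queue_depth: f64, latency_ms: f64) -> f64 {
    queue_depth / (latency_ms / 1000.0)
}

fn main() {
    for (name, latency_ms) in [("local NVMe (~0.1 ms)", 0.1), ("network-attached volume (~0.5 ms)", 0.5)] {
        // Prints roughly 10000 and 2000 reads/s respectively, both far short
        // of a 50,000 IOPS rating that assumes many parallel outstanding requests.
        println!("{name}: ~{:.0} reads/s at queue depth 1", effective_iops(1.0, latency_ms));
    }
}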

MrFrogoz (Author) commented Oct 16, 2024

As written above, you can take AWS EBS gp3 as a reference; it is identical to what I'm using.
I attach the disk benchmark commands, whose results are equivalent to those of a gp3.
Obviously an io2 type has half the latency.
Commands:

Random Reads
Throughput (MB/s): sudo fio --filename=/dev/device --direct=1 --rw=randread --bs=256k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 --readonly
IOPS (avg): sudo fio --filename=/dev/device --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly
Latency (ms): sudo fio --filename=/dev/device --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=1 --numjobs=1 --time_based --group_reporting --name=readlatency-test-job --runtime=120 --eta-newline=1 --readonly

Random Read/Writes
Throughput (MB/s): sudo fio --filename=/dev/device --direct=1 --rw=randrw --bs=256k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1
IOPS (avg): sudo fio --filename=/dev/device --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1
Latency (ms): sudo fio --filename=/dev/device --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=1 --numjobs=1 --time_based --group_reporting --name=rwlatency-test-job --runtime=120 --eta-newline=1

Sequential Reads
Throughput (MB/s): sudo fio --filename=/dev/device --direct=1 --rw=read --bs=256k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 --readonly
IOPS: sudo fio --filename=/dev/device --direct=1 --rw=read --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly

Block Volume:
[screenshot: oci-blockvolume]
