
core, trie: intermediate mempool between trie and database #15857

Merged: 33 commits merged into ethereum:master on Feb 5, 2018

Conversation

@karalabe (Member) commented Jan 11, 2018

This PR is backwards incompatible! A chain that starts using this PR will not be able to go back.

Currently the primary bottleneck on go-ethereum is disk IO. The database, and especially the state trie within it, has become gigantic (70M nodes on mainnet), which means block processing spends a very long time reading data and a very long time writing it. To make things worse, data constantly accumulates on disk, making both reads and writes ever slower.

This PR tries to address the lowest-hanging fruit of this problem, namely reducing the amount of useless (or only temporarily useful) data written to disk in the first place. The important observation behind this PR is that only a small part of every state update (i.e. new block) is genuinely useful data (e.g. a new contract, a new storage entry); most of a state update consists of internal trie nodes with a limited lifetime. This means that most of the data hitting disk becomes stale within a few blocks, so if we can avoid writing it in the first place, data usage and disk IO are significantly reduced.

The PR achieves this goal by inserting an in-memory trie.Database between our low-level disk database and our higher-level trie abstraction. Whenever we commit a trie to "disk", it actually ends up cached in memory. For all intents and purposes, the tries and go-ethereum believe they've committed the data to disk, but in reality the writes get short-circuited halfway down.

Keeping the recent nodes (trie diffs) in memory permits us to do state pruning based on reference-counting garbage collection. Every trie node added to the memory database maintains a set of child references and a parent count. Each processed block adds a new set of nodes and references, whereas sufficiently old blocks dereference their old state roots. This turns the trie.Database into a rolling, garbage-collected trie forest containing only the most recent 128 tries.
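
To illustrate the mechanism, here is a minimal, self-contained sketch of the reference-counting idea (hypothetical names and simplified types, not the actual trie.Database API):

package main

import "fmt"

// nodeDB is a toy stand-in for the in-memory trie database: committed nodes
// land here instead of on disk, together with parent counts and child sets.
type nodeDB struct {
	nodes    map[string][]byte              // raw blob of each cached trie node, keyed by hash
	parents  map[string]int                 // number of live nodes referencing a given one
	children map[string]map[string]struct{} // set of children referenced by a given node
}

func newNodeDB() *nodeDB {
	return &nodeDB{
		nodes:    make(map[string][]byte),
		parents:  make(map[string]int),
		children: make(map[string]map[string]struct{}),
	}
}

// insert caches a freshly committed node in memory only and bumps the
// reference count of every child it points to.
func (db *nodeDB) insert(hash string, blob []byte, kids []string) {
	db.nodes[hash] = blob
	db.children[hash] = make(map[string]struct{})
	for _, kid := range kids {
		db.children[hash][kid] = struct{}{}
		db.parents[kid]++
	}
}

// dereference drops one reference from a node and garbage-collects the whole
// subtrie once nothing points at it any more.
func (db *nodeDB) dereference(hash string) {
	if db.parents[hash] > 0 {
		db.parents[hash]--
	}
	if db.parents[hash] > 0 {
		return // still referenced by a more recent trie
	}
	for kid := range db.children[hash] {
		db.dereference(kid)
	}
	delete(db.nodes, hash)
	delete(db.parents, hash)
	delete(db.children, hash)
}

func main() {
	db := newNodeDB()
	db.insert("leaf", []byte{0x01}, nil)
	db.insert("root-1", []byte{0x02}, []string{"leaf"})
	db.insert("root-2", []byte{0x03}, []string{"leaf"}) // the next block reuses the leaf
	db.dereference("root-1")                            // an old state root drops out of the window
	fmt.Println(len(db.nodes))                          // 2: root-2 and the still-referenced leaf
}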

Of course, this in-memory database needs to be flushed eventually, as it otherwise grows ad infinitum with useful data. The PR contains two thresholds that trigger flushing the tries (a minimal sketch of the trigger check follows the list):

  • If the memory consumption exceeds the user's allowance (25% of --cache), the tries are flushed to disk. This ensures that we keep and discard as much junk as possible in memory only, while a hard limit prevents the node from crashing.
  • If the total block processing time exceeds 5 minutes, the tries are flushed to disk. This ensures that if the node terminates suddenly (crash, power failure), we lose at most 5 minutes' worth of processing time.
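
A minimal sketch of this trigger check, assuming the caller tracks the cached trie size and the accumulated processing time (illustrative names only, not the actual blockchain code):

import "time"

// flushIfNeeded is a hypothetical per-block helper combining the two limits
// described above: flush the cached tries once they outgrow their memory
// allowance, or once too much unflushed processing time has accumulated.
func flushIfNeeded(cachedBytes, memLimit uint64, procTime, timeLimit time.Duration, flush func()) {
	if cachedBytes > memLimit || procTime > timeLimit { // e.g. 25% of --cache, 5 minutes
		flush()
	}
}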

The rest of this PR consists of the massive set of changes needed to integrate this concept into go-ethereum. One extra modification worth mentioning is in the downloader:

  • Fast sync no longer randomizes pivot points from the last 256 blocks (since we only maintain the most recent 128 blocks); instead it picks head - 64.
  • Fast sync no longer locks in the pivot point; instead it rolls the pivot forward if it becomes older than head - 128 blocks. This is needed because old state is now discarded (unlike before). See the sketch after this list.
  • The PR also fixes various downloader bugs that made state sync hang until now.
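
For the first two bullets, the pivot behaviour can be sketched as follows (a hypothetical helper, not the downloader's actual code):

// selectPivot picks a fast-sync pivot at head-64 and rolls an existing pivot
// forward once it falls more than 128 blocks behind, because state older than
// that window is pruned and can no longer be served.
func selectPivot(head, currentPivot uint64) uint64 {
	const (
		pivotOffset = 64  // pivot is picked at head - 64
		stateWindow = 128 // only the most recent 128 tries are retained
	)
	if head < pivotOffset {
		return 0 // chain too short for a pivot, sync from genesis
	}
	if currentPivot == 0 || head-currentPivot > stateWindow {
		return head - pivotOffset // initial pick, or roll a stale pivot forward
	}
	return currentPivot
}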

Some random benchmarks from this PR (they were run during implementation; the final results are a tad better):

Disk usage is significantly reduced (more so near the chain genesis, less near the chain head), resulting in significantly less disk IO and shorter processing time. Homestead (blocks 1..1919999) full import on a Core2 CPU, 4GB RAM laptop, 1GB cache:

| Geth | Disk | Elapsed | Database | Disk IO¹ |
|------|------|---------|----------|----------|
| v1.8.0 | HDD 5400RPM | 7:55:07 | 4.232 GB | 43.712 GB |
| v1.7.3 | HDD 5400RPM | 33:35:11 | 26.535 GB | 716.300 GB |

Disk growth is significantly reduced on mainnet after a successful fast sync, and after about a week of running this PR versus master, block processing is 20% faster (initially they are the same, but the disk growth of the non-pruned node eventually slows everything down).

[chart: mainnet disk growth and block processing time, this PR vs. master]

Besides modifying the downloader to cater for state unavailability due to trie pruning, the PR also fixes the fast sync memory consumption issue, resulting in a much more pleasant fast sync experience:

[chart: fast sync memory consumption before and after this PR]


Note, if you want to run an archive node without garbage collection, please run with --gcmode=archive.


Checklist to test before merging:

  • Importing blocks from chain dumps
  • Fast sync from the network
  • Archive sync from the network
  • Light sync from the network
  • Fast sync from an updated node
  • Light sync from an updated node
  • Forced import aborts and restarts (i.e. resuming from uncommitted state)
  • Small reorgs and side forks/uncles (i.e. no memory leak)
  • Large reorgs beyond the last commit point (relevant mostly on Ropsten)
  • Single transaction tracing (ensure we can use the in-memory tries)
  • Chain tracing (ensure the chain tracer + pruner still works with this PR)

trie/mempool.go Outdated
@@ -0,0 +1,181 @@
// Copyright 2017 The go-ethereum Authors
Contributor:

2018!

Member Author:

!

trie/mempool.go Outdated
parents map[common.Hash]int // Number of live nodes referencing a given one
children map[common.Hash]map[common.Hash]struct{} // Set of children referenced by given nodes

gctime time.Duration // Time spent on garbage collection since last commit
Contributor:

This is reference counting, not garbage collection.

Member Author:

Actually, it is garbage collection, it measures the data/time saved/spent on dereferencing stuff in memory and cleaning up things we don't need to save to disk in the end. So imho it is exactly gc :)

trie/hasher.go Outdated
if herr != nil {
h.mu.Lock() // rarely if ever locked, no congestion
err = herr
h.mu.Unlock()
}
}
// If we're not running in threaded mode yet, spawn a goroutine for each child
if !h.threaded {
/*if !h.threaded {
Contributor:

Can this be deleted outright?

Member Author:

Yeah, left it in to investigate

trie/hasher.go Outdated
}
// Account nodes have very specific sizes, discard anything else
// TODO(karalabe): Seriously? Dafuq man?!
if size := len(val); size < 70 || size > 102 {
Contributor:

Why not pass in a function to decode values and return references to the trie constructor instead of this hack?

trie/trie.go Outdated
@@ -447,12 +448,19 @@ func (t *Trie) resolve(n node, prefix []byte) (node, error) {
func (t *Trie) resolveHash(n hashNode, prefix []byte) (node, error) {
cacheMissCounter.Inc(1)

// Try to load the node from the recent mempool
Contributor:

Why not make the mempool do the disk fallback?
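
For context, the lookup order being discussed boils down to something like this (a hedged sketch with a plain map standing in for the mempool; the real code lives in trie.resolveHash and the trie database):

import (
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethdb"
)

// resolveNode illustrates the two-step lookup: consult the in-memory node
// pool first and only fall back to the disk database on a miss.
func resolveNode(hash common.Hash, mempool map[common.Hash][]byte, disk ethdb.Database) ([]byte, error) {
	if blob, ok := mempool[hash]; ok {
		return blob, nil // recent, not-yet-flushed node
	}
	return disk.Get(hash[:]) // older node that already made it to disk
}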

@Arachnid (Contributor):

:shipit: 🔫

@karalabe force-pushed the trie-node-mempool branch 2 times, most recently from 69a919f to 8998ed2 on January 17, 2018 08:31
@ekryski commented Jan 19, 2018

@karalabe let me know if you need help testing this. We're close to launching bitblocks.io and are actively testing a bunch of the stuff you have in your list on both SSDs and HDs.

Prior to this, 1.8 was getting hung up on the last few hundred blocks during a full node sync and wasn't pulling in new chain segments, only new state entries; and with 1.7.3 a 7200rpm disk can't catch up.

Nice work! 👏

@holiman (Contributor) commented Jan 25, 2018

@ekryski If you can test this PR out, that'd be appreciated. It needs all the testing it can get, and I think it's getting close to finished.

@karalabe (Member Author):

But please note, once you start importing blocks with this PR, there's no going back to 1.7.3. The old release will consider the database corrupted due to missing data.


for !bc.triegc.Empty() {
root, number := bc.triegc.Pop()
if uint64(-number) > chosen {
Contributor:

Since we are going to lose precision here after 16M blocks (which seems a bit ugly but acceptable for the time being), I'd at least make this comparison err on the safe side in order to never delete anything newer than chosen:

if -number >= float32(chosen+1) {
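
The precision concern is easy to demonstrate: float32 has a 24-bit mantissa, so integers above 2^24 (~16.7M) can no longer all be represented exactly, which is why the comparison should err towards keeping data (a standalone illustration, not project code):

package main

import "fmt"

func main() {
	// Above 2^24 not every integer has an exact float32 representation,
	// so two adjacent block numbers can compare as equal.
	a := float32(16777216) // 2^24
	b := float32(16777217) // 2^24 + 1 rounds back down to 2^24
	fmt.Println(a == b)    // true
}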

@karalabe force-pushed the trie-node-mempool branch 3 times, most recently from 69e00e1 to c137a71 on January 30, 2018 14:49
for _, result := range results {
var size common.StorageSize

blob, _ := rlp.EncodeToBytes(result.Receipts) // TODO(karalabe): Whoah, nice hack, super optimal
Contributor:

:)
Don't forget that TODO.
Q: Wouldn't it be pretty simple to implement rlp.CountBytes(...), which recursively just counts but does not actually allocate memory?

@holiman (Contributor) commented Jan 31, 2018:

Actually, you can do this instead, in rlp.go:

// EncodedBytesize returns the size that RLP encoding of val would have,
// if EncodeToBytes was called
func EncodedBytesize(val interface{}) (int, error) {
	eb := encbufPool.Get().(*encbuf)
	defer encbufPool.Put(eb)
	eb.reset()
	if err := eb.encode(val); err != nil {
		return 0, err
	}
	return eb.size(), nil
}

And then

bloblen, _ := rlp.EncodedBytesize(result.Receipts)
....

That'll roughly halve the memory allocations there.

@holiman (Contributor) commented Jan 31, 2018

At commit c137a71, tested with a copydb:

[screenshot: copydb import benchmark at commit c137a71]

@GoodMirek commented Jan 31, 2018

While running any of the four latest commits (c83ad93 - d0991c5) on mainnet, fast sync eventually blocks before it downloads all the block headers and receipts. The DB does not get corrupted; after a geth process restart it proceeds from where it left off, and after some time it blocks again. The slower the node, the more often the issue happens. I am testing on one relatively slow node and one faster one, in terms of block storage speed.
Are you aware of this issue? Is there anything I can collect to help?

@karalabe (Member Author) commented Feb 1, 2018

@GoodMirek Thanks for the report! Yes, I've added some modifications to fast sync to cap the amount of memory it uses (currently it OOMs on mainnet) and they touch some sensitive scheduling logic, so there's a high probability I screwed something up. I will try to repro it locally; in the meantime any logs you can give us might help (potentially run with --vmodule=downloader=5).

@karalabe (Member Author) commented Feb 1, 2018

@GoodMirek Can you also provide us with hardware specs so we can better understand / try to repro? Also, when it "hangs", does it hang forever, or does it continue after a few minutes (5-10)? Pondering whether it could be database compaction running in the background on an HDD.

@karalabe (Member Author) commented Feb 1, 2018

@GoodMirek Tried to sync the PR on both a new laptop + SSD and an old laptop + HDD. Both seem to work fine, but database compaction does catch both, causing an 8 / 18 minute pause on SSD / HDD respectively. We probably need to add some user warning to display what's happening, but the code does seem to work correctly nonetheless.

@GoodMirek commented Feb 2, 2018

@karalabe You are right, I was not patient enough. After 10-20 minutes the sync proceeded. Now I face another issue: once all block headers and receipts are downloaded, I receive thousands of "Stalling state sync, dropping peer" warnings and new state entries are only rarely imported. After a geth restart, it imports new state entries quickly for about 5-30 minutes and then the issue starts happening again.
On the stable version, both nodes were able to import the state entries at a constant slow pace and then switched from fast to full sync.
I am going to retry with your commits from yesterday.

Regarding HW, these are cloud VPSes from different providers, without a reasonable HW spec besides the number of cores, RAM size and disk space. They differ in disk subsystem speed by an order of magnitude.

@karalabe (Member Author) commented Feb 2, 2018

@GoodMirek Yup, the "Stalling state sync" issue was a long-standing one we had in Geth. It was really hard to reproduce, so we couldn't properly debug it. My current work made it pop up very often, so I found the issue and also fixed it :) The latest code from yesterday should sync without issues (although it's disk-IO intensive, especially after pulling in 70M state entries).

@GoodMirek commented Feb 2, 2018

UPDATE: It turned out it just took much longer to allocate the memory; I'm not sure whether it is a cache or a leak.

@karalabe I confirm that with commit 951d398 state entries are being imported smoothly. Also, your branch no longer dies from OOM, which used to be a very frequent issue during initial sync in my setup. Surprisingly, it does not even allocate all the memory specified via --cache, be it 2048 or 4096. It allocates just 75%, as if the 25% dedicated to the intermediate mempool were not used, at least during state entry import; however, this is a pure guess, I did not use delve or anything else to analyze the actual behavior.

My plan so far has been to let both systems fully sync with commit 951d398 and observe the behavior. In the past:
Slow disk system, 8 GB RAM: Geth no luck; Parity with 20 peers frequently lagging, needed outgoing bandwidth limitation on a per-connection basis.
Fast disk system, 4 GB RAM: Geth 1.8.0 with 500 peers kept in sync, most of the peers were LES; outgoing bandwidth limitation on a per-connection basis applied only occasionally.

Then I plan to drop the DB on both and retry with the latest commit.

@fjl (Contributor) left a comment:

I don't have time to review in detail now, the changes look OK but there are still some rough edges. It's up to you when this gets merged.

CacheGCTimeoutFlag = cli.DurationFlag{
Name: "cache.gc.timeout",
Usage: "Maximum time after which to flush memory-only tries to disk",
Value: 5 * time.Minute,
Contributor:

I don't think we need CLI flags for such settings. You can always set them in the config file if you really want to tune details like this.

Member Author:

Fixed, PTAL.

for _, result := range results {
var size common.StorageSize

blob, _ := rlp.EncodeToBytes(result.Receipts) // TODO(karalabe): Whoah, nice hack, super optimal
Contributor:

You don't really need the encoded size here, any approximation of the size is enough. Maybe add a Size method to Receipt, Block, Transaction that returns the size of the struct + size of the content of slice fields.

Member Author:

The transactions already had a Size method that worked based on RLP serialization. I've added a Size method to the rest (which only approximates). PTAL.
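
As an illustration of the approximation approach (a hedged sketch over a simplified receipt struct, not the real types.Receipt fields or the method actually added in this PR):

import (
	"unsafe"

	"github.com/ethereum/go-ethereum/common"
)

// receipt is a simplified stand-in for types.Receipt, purely for illustration.
type receipt struct {
	PostState []byte
	LogData   [][]byte // flattened log payloads
}

// Size approximates memory usage as the struct size plus the contents of its
// slice fields, avoiding a full RLP encoding just to measure the length.
func (r *receipt) Size() common.StorageSize {
	size := common.StorageSize(unsafe.Sizeof(*r)) + common.StorageSize(len(r.PostState))
	for _, data := range r.LogData {
		size += common.StorageSize(len(data))
	}
	return size
}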

@holiman holiman added this to the 1.8.0 milestone Feb 5, 2018
@fjl fjl merged commit 55599ee into ethereum:master Feb 5, 2018
@mariushudea:
As long as this is going to involve something backwards incompatible, would you maybe consider changing the database defaults (larger files, less frequent compaction), so that each individual file in the DB is larger than ~2 MB and we don't have tens of thousands of files? (See the thread I posted in issues and the feedback for that.)

@rjl493456442 (Member):
I haven't looked at the code, but if so, it means we lose most of the versioned data (e.g. we can no longer access historical data via an old state root).

prestonvanloon pushed a commit to prestonvanloon/go-ethereum that referenced this pull request Apr 2, 2018
…15857)

This commit reduces database I/O by not writing every state trie to disk.
@vsdigitall commented Apr 2, 2018

I have the following hardware:

- Intel Xeon E3-1270v6 3.80GHz
- 16 GB ECC DDR4 RAM
- 250 GB PCI-E SSD
- 100 Mbit internet 

And it has already taken about 30 hours to sync with the following command:
/root/p/geth/geth --fast --cache=1024 --datadir /root/p/geth/data --ipcpath /root/p/geth/geth.ipc

Geth version: Geth/v1.8.3-stable-329ac18e/linux-amd64/go1.10.
pulledStates: 62924164, geth is still syncing...

@karalabe I can give you access to the server so you can check why it is syncing so slowly.

@duanbing commented May 21, 2018

In the state.Trie interface, CommitTo(trie.DatabaseWriter) (common.Hash, error) was changed to Commit(onleaf trie.LeafCallback) (common.Hash, error), so how can I commit the changes to LevelDB manually? @karalabe

@duanbing commented May 21, 2018

Thanks, solved it like this:

mdb, err := ethdb.NewLDBDatabase("/tmp/a.txt", 100, 100)
must(err)
db := state.NewDatabase(mdb)
root := common.Hash{}
statedb, err := state.New(root, db)

//update statedb

root, err = statedb.Commit(true)
must(err)
fmt.Println("Root Hash", root.Hex())

err = db.TrieDB().Commit(root, true)
must(err)

BTW, what a nice design of "intermediate mempool between trie and database"!

@karalabe (Member Author):

> BTW, what a nice design of "intermediate mempool between trie and database"!

You're welcome to suggest cleaner APIs that retain the garbage collection capability.

@duanbing:
Would like to, and am doing it. @karalabe

mariameda pushed a commit to NiluPlatform/go-nilu that referenced this pull request Aug 23, 2018
…15857)

This commit reduces database I/O by not writing every state trie to disk.