core, trie: intermediate mempool between trie and database #15857
Conversation
trie/mempool.go
Outdated
@@ -0,0 +1,181 @@
// Copyright 2017 The go-ethereum Authors
2018!
!
trie/mempool.go
Outdated
parents  map[common.Hash]int                      // Number of live nodes referencing a given one
children map[common.Hash]map[common.Hash]struct{} // Set of children referenced by given nodes

gctime time.Duration // Time spent on garbage collection since last commit
This is reference counting, not garbage collection.
Actually, it is garbage collection, it measures the data/time saved/spent on dereferencing stuff in memory and cleaning up things we don't need to save to disk in the end. So imho it is exactly gc :)
trie/hasher.go
Outdated
	if herr != nil {
		h.mu.Lock() // rarely if ever locked, no congestion
		err = herr
		h.mu.Unlock()
	}
}
// If we're not running in threaded mode yet, spawn a goroutine for each child
if !h.threaded {
/*if !h.threaded {
Can this be deleted outright?
Yeah, left it in to investigate
trie/hasher.go
Outdated
}
// Account nodes have very specific sizes, discard anything else
// TODO(karalabe): Seriously? Dafuq man?!
if size := len(val); size < 70 || size > 102 {
Why not pass in a function to decode values and return references to the trie constructor instead of this hack?
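To illustrate the suggestion (a hypothetical sketch with made-up names, not code from this PR or the go-ethereum API): the trie constructor could accept a callback that decodes a value and reports the child references it contains, instead of guessing account nodes by their encoded size.

// Hypothetical sketch of the reviewer's idea, not go-ethereum code: let the
// caller tell the trie how to extract child references from a value, instead
// of the "account nodes are 70..102 bytes" size heuristic.
package trie

// valueResolver decodes a leaf value and returns any node hashes it references
// (for state tries, e.g. the account's storage root).
type valueResolver func(val []byte) [][]byte

type exampleTrie struct {
	resolve valueResolver // supplied by the constructor's caller
}

// newExampleTrie wires the caller-supplied resolver into the trie, so the
// hasher can ask "what does this value reference?" without knowing its layout.
func newExampleTrie(resolve valueResolver) *exampleTrie {
	return &exampleTrie{resolve: resolve}
}

// childReferences asks the resolver; with no resolver, a value has no children.
func (t *exampleTrie) childReferences(val []byte) [][]byte {
	if t.resolve == nil {
		return nil
	}
	return t.resolve(val)
}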
trie/trie.go
Outdated
@@ -447,12 +448,19 @@ func (t *Trie) resolve(n node, prefix []byte) (node, error) {
func (t *Trie) resolveHash(n hashNode, prefix []byte) (node, error) {
	cacheMissCounter.Inc(1)

	// Try to load the node from the recent mempool
Why not make the mempool do the disk fallback?
🔫
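For illustration, a rough sketch of what "the mempool does the disk fallback" could look like; the type and method names here are made up and are not the PR's actual API. The caller asks the in-memory layer for a node, and only that layer decides whether to read through to the persistent database.

// Hypothetical read-through lookup; types and names are illustrative only.
package trie

import "errors"

type diskReader interface {
	Get(key []byte) ([]byte, error) // persistent key-value store lookup
}

type memPool struct {
	nodes map[string][]byte // recently committed, not yet flushed nodes
	disk  diskReader        // fallback for anything already flushed
}

// node returns the blob for hash, preferring the in-memory copy and silently
// falling back to disk, so callers never need to know where the data lives.
func (m *memPool) node(hash []byte) ([]byte, error) {
	if blob, ok := m.nodes[string(hash)]; ok {
		return blob, nil
	}
	if m.disk == nil {
		return nil, errors.New("node not found")
	}
	return m.disk.Get(hash)
}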
Force-pushed 69a919f to 8998ed2
@karalabe let me know if you need help testing this. We're close to launching bitblocks.io and are actively testing a bunch of the stuff you have in your list on both SSDs and HDs. Prior to this, 1.8 was getting hung up on the last few hundred blocks during a full node sync and wasn't pulling in new chain segments, just new state entries, and with 1.7.3 a 7200rpm disk can't catch up. Nice work! 👏
Force-pushed 06b2545 to ff31e8e
@ekryski If you can test this PR out, that'd be appreciated. It needs all the testing it can get, and I think it's getting close to finished.
But please note, once you start importing blocks with this PR, there's no going back to 1.7.3. The old release will consider the database corrupted due to missing data.
core/blockchain.go
Outdated
for !bc.triegc.Empty() {
	root, number := bc.triegc.Pop()
	if uint64(-number) > chosen {
Since we are going to lose precision here after 16M blocks (which seems a bit ugly but acceptable for the time being), I'd at least make this comparison err on the safe side in order to never delete anything newer than chosen:
if -number >= float32(chosen+1) {
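For context, a small standalone Go illustration (not part of the PR) of the precision concern: a float32 has a 24-bit significand, so block numbers above 2^24 ≈ 16.7M are no longer all exactly representable, and the suggested comparison errs on the side of keeping roots that might be newer than chosen.

package main

import "fmt"

func main() {
	// 2^24 is the last point where every integer is exactly representable
	// in a float32; 2^24+1 rounds back down to 2^24.
	fmt.Println(float32(16777216) == float32(16777217)) // prints true

	// The priority queue stores block numbers negated as float32 priorities.
	chosen := uint64(16777216)   // newest block whose state we may prune
	number := float32(-16777217) // a root one block NEWER than chosen

	// The original check, uint64(-number) > chosen, evaluates 16777216 > 16777216
	// and would let this too-new root fall through to pruning. The safer check keeps it:
	if -number >= float32(chosen+1) {
		fmt.Println("keep: might still be newer than chosen")
	}
}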
Force-pushed 69e00e1 to c137a71
eth/downloader/queue.go
Outdated
for _, result := range results {
	var size common.StorageSize

	blob, _ := rlp.EncodeToBytes(result.Receipts) // TODO(karalabe): Whoah, nice hack, super optimal
:)
Don't forget that TODO.
Q: Wouldn't it be pretty simple to implement rlp.CountBytes(...), which recursively just counts but does not actually allocate memory?
Actually, you can do this instead, in rlp.go:
// EncodedBytesize returns the size that RLP encoding of val would have,
// if EncodeToBytes was called
func EncodedBytesize(val interface{}) (int, error) {
	eb := encbufPool.Get().(*encbuf)
	defer encbufPool.Put(eb)
	eb.reset()
	if err := eb.encode(val); err != nil {
		return 0, err
	}
	return eb.size(), nil
}
And then
bloblen, _ := rlp.EncodedBytesize(result.Receipts)
....
That'll roughly halve the memory allocations there.
Force-pushed c137a71 to c83ad93
While running any of the four latest commits (c83ad93 - d0991c5) on mainnet, the fast sync eventually blocks before it downloads all the block headers and receipts. The DB does not get corrupt, after …
@GoodMirek Thanks for the report! Yes, I've added some modifications to fast sync to cap the amount of memory it uses (currently it OOMs on mainnet) and it touches some sensitive scheduling logic, so there's a high probability I screwed something up. I will try to repro it locally; in the meantime any logs you can give us might help (potentially run with …).
@GoodMirek Can you also provide us with HW specs so we can better understand/try to repro? Also, when it "hangs", does it hang forever, or does it continue after a while (5-10 mins)? Pondering whether it could be database compaction in the background on an HDD.
@GoodMirek Tried to sync the PR on both a new laptop + SSD and an old laptop + HDD. Both seem to work fine, but the database compaction does catch both, causing an 8 / 18 minute pause on SSD / HDD respectively. We probably need to add some user warning to display what's happening, but the code does seem to work correctly nonetheless.
@karalabe You are right, I was not patient enough. After 10-20 minutes the sync proceeded. Now I face another issue: once all block headers and block receipts are downloaded, I am receiving thousands of "Stalling state sync, dropping peer" warnings and only rarely are new state entries imported. After …

Regarding HW, these are cloud VPS from different providers, without reasonable HW specs besides the number of cores, size of RAM and disk space. They differ in disk subsystem speed by an order of magnitude.
@GoodMirek Yup, the …
UPDATE: Turned out it just took much longer to allocate the memory; not sure whether it is a cache or a leak. @karalabe I confirm that with commit 951d398 state entries are being imported smoothly. Also, your branch no longer dies on OOM, which used to be a very frequent issue during initial sync in my setup. Surprisingly, it does not even allocate all the memory specified as …

My plan so far has been to let both systems fully sync with commit 951d398 and observe the behavior. In the past: … Then, I plan to drop the DB on both and retry with the latest commit.
I don't have time to review in detail now; the changes look OK, but there are still some rough edges. It's up to you when this gets merged.
cmd/utils/flags.go
Outdated
CacheGCTimeoutFlag = cli.DurationFlag{
	Name:  "cache.gc.timeout",
	Usage: "Maximum time after which to flush memory-only tries to disk",
	Value: 5 * time.Minute,
I don't think we need CLI flags for such settings. You can always set them in the config file if you really want to tune details like this.
Fixed, PTAL.
eth/downloader/queue.go
Outdated
for _, result := range results {
	var size common.StorageSize

	blob, _ := rlp.EncodeToBytes(result.Receipts) // TODO(karalabe): Whoah, nice hack, super optimal
You don't really need the encoded size here; any approximation of the size is enough. Maybe add a Size method to Receipt, Block and Transaction that returns the size of the struct plus the size of the contents of its slice fields.
The transactions already had a Size method which worked based on RLP serialization. I've added a Size method to the rest (which only approximates). PTAL
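For illustration, an approximate Size method in the spirit described here, written against a made-up struct (the real types.Receipt has a different field set): the fixed struct size plus the lengths of the variable-size slice contents, with no RLP encoding involved.

// Illustrative only: the field set is invented and not the real go-ethereum receipt.
package main

import (
	"fmt"
	"unsafe"

	"github.com/ethereum/go-ethereum/common"
)

type exampleReceipt struct {
	PostState []byte
	LogData   [][]byte
}

// Size approximates the in-memory footprint: the fixed struct size plus the
// variable-length slice contents, avoiding an RLP encode just to measure.
func (r *exampleReceipt) Size() common.StorageSize {
	size := common.StorageSize(unsafe.Sizeof(*r)) + common.StorageSize(len(r.PostState))
	for _, data := range r.LogData {
		size += common.StorageSize(len(data))
	}
	return size
}

func main() {
	r := &exampleReceipt{PostState: make([]byte, 32), LogData: [][]byte{make([]byte, 100)}}
	fmt.Println(r.Size()) // struct header size plus 132 bytes of slice payload
}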
Force-pushed 88f5fe4 to 54b8ded
As long as this is going to involve something backwards incompatible, would you maybe consider changing the database defaults (larger files, less frequent compaction), so that each individual file in the DB is larger than ~2MB and we don't have tens of thousands of files (see the thread I posted in issues and the feedback for that)?
I haven't looked at the code yet, but if so, it means we lose most of the versioned data (e.g. we can no longer access historical data via an old state root).
I have the following hardware:
And it took about 30 hours already to sync with the following command: … Geth version: Geth/v1.8.3-stable-329ac18e/linux-amd64/go1.10. @karalabe I can give you access to the server so you can check why it is syncing so slowly.
In the interface state.Trie, the …
Thanks, solved it like this:
BTW, what a nice design, this "intermediate mempool between trie and database"!
You're welcome to suggest cleaner APIs that retain the garbage collection capability.
I would like to, and I am doing it. @karalabe
This PR is backwards incompatible! A chain started with this PR will not be able to go back.
Currently the primary bottleneck in go-ethereum is disk IO. The database, and especially the state trie within it, has become gigantic (70M nodes on mainnet), which means that block processing takes a very long time to read data and a very long time to write data. To make things worse, data constantly accumulates on disk, making both reads and writes ever slower.
This PR tries to address the lowest hanging fruit of this problem, namely to reduce the amount of useless (or only temporarily useful) data written to disk in the first place. The important observation behind this PR is that a small part of every state update (i.e. new block) is useful data (e.g. a new contract, a new storage entry), but most of a state update consists of internal trie nodes that have a limited lifetime. This means that most of the data hitting disk will become stale within a few blocks, so if we can avoid writing it in the first place, data usage and disk IO are significantly reduced.
The PR achieves this goal by inserting an in-memory trie.Database between our low level disk database and our higher level trie abstraction. Whenever we commit a trie to "disk", it actually ends up cached in memory. For all intents and purposes, tries and go-ethereum think they've committed the data to disk, but in reality they get short circuited halfway down.

Keeping the recent nodes (trie diffs) in memory permits us to do state pruning based on reference-counting garbage collection. Every trie node added to the memory database maintains a list of child references and a count of live parents. Each processed block adds a new set of nodes and references, whereas suitably old blocks dereference their old state roots. This results in the trie.Database being a rolling, garbage-collected trie forest containing only the most recent 128 tries.
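To make the reference-counting scheme concrete, here is a minimal, self-contained Go sketch; it is illustrative only and not the trie.Database implementation, which also tracks node sizes, flush limits and more.

// Minimal, illustrative sketch of reference-counted trie nodes.
package main

import "fmt"

type refDB struct {
	nodes    map[string][]byte          // cached node blobs
	parents  map[string]int             // live references pointing at a node
	children map[string]map[string]bool // outgoing references of a node
}

// reference records that parent keeps child alive, bumping child's count once per edge.
func (db *refDB) reference(child, parent string) {
	if db.children[parent] == nil {
		db.children[parent] = make(map[string]bool)
	}
	if db.children[parent][child] {
		return // edge already counted
	}
	db.children[parent][child] = true
	db.parents[child]++
}

// dereference drops one reference; nodes whose count reaches zero are evicted
// along with any children that become unreferenced as a result. This cascade
// is the "garbage collection" that the gctime metric discussed above measures.
func (db *refDB) dereference(node string) {
	if db.parents[node] > 0 {
		db.parents[node]--
	}
	if db.parents[node] > 0 {
		return // still needed by a newer trie
	}
	for child := range db.children[node] {
		db.dereference(child)
	}
	delete(db.nodes, node)
	delete(db.parents, node)
	delete(db.children, node)
}

func main() {
	db := &refDB{nodes: map[string][]byte{}, parents: map[string]int{}, children: map[string]map[string]bool{}}
	db.nodes["leaf"], db.nodes["rootA"], db.nodes["rootB"] = []byte{1}, []byte{2}, []byte{3}
	db.reference("leaf", "rootA") // two recent tries share the same leaf
	db.reference("leaf", "rootB")

	db.dereference("rootA")    // rootA falls out of the recent-trie window
	fmt.Println(len(db.nodes)) // 2: rootB and the still-referenced leaf survive
}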
Of course, this in-memory database needs to be flushed eventually, as it otherwise just grows with useful data ad infinitum. The PR contains two thresholds that trigger the flushing of the tries (a rough sketch of the policy follows the list):

- If the memory held by the cached tries exceeds a threshold (25% of --cache), the tries are flushed to disk. This ensures that we keep and discard as much junk in memory only as we can, but we have a hard limit that prevents the node from crashing.
- If the time since the last flush exceeds a threshold (5 minutes), the tries are flushed to disk. This ensures that if the node terminates suddenly (crash, power failure), we lose at most 5 minutes worth of processing time.
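The sketch below condenses that flush policy into a few lines; the names and numbers are placeholders for illustration, not the PR's actual variables.

// Illustrative flush trigger; variable names and values are made up.
package main

import (
	"fmt"
	"time"
)

func main() {
	var (
		cacheBytes   = 1024 * 1024 * 1024 // pretend overall cache allowance (1GB)
		memoryLimit  = cacheBytes / 4     // flush once the trie cache uses 25% of it
		flushTimeout = 5 * time.Minute    // or once this much time passed since the last flush
		cachedBytes  = 300 * 1024 * 1024  // pretend current in-memory trie footprint
		lastFlush    = time.Now().Add(-6 * time.Minute)
	)
	if cachedBytes > memoryLimit || time.Since(lastFlush) > flushTimeout {
		fmt.Println("flushing in-memory tries to disk")
	}
}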
The rest of this PR is the massive set of changes needed to integrate this concept into go-ethereum. An important extra modification to mention is in the downloader:

- … head - 64.
- … head - 128 blocks. This is needed since we discard old state (unlike currently).

Some random benchmarks from this PR (they've been run during implementation; the final results are a tad better):
Disk usage is significantly reduced (more so at chain genesis, less at chain head), resulting in significantly less disk IO and processing time. Homestead (blocks 1..1919999) full import on a Core2 CPU, 4GB RAM laptop, 1GB cache:
Disk growth is significantly reduced on mainnet after a successful fast sync, and after about a week of running this PR versus master, block processing is 20% faster (initially they are the same, but the disk growth of the non-pruned node eventually slows everything down).
Besides modifying the downloader to cater for state unavailability due to trie pruning, the PR also fixes the fast sync memory consumption issue, resulting in a much more pleasant fast sync experience:
Note, if you want to run an archive node without garbage collection, please run with --gcmode=archive.

Checklist to test before merging: