
core/state: trie prefetcher change: calling trie() doesn't stop the associated subfetcher #29035

Conversation

@jwasinger (Contributor) commented on Feb 20, 2024

This pulls trie prefetcher changes from #29029 into a separate PR.

It changes the trie prefetcher such that a call to trie returns a trie (after any pending tasks have been resolved in the subfetcher). After a call to trie, the subfetcher remains available for more work instead of terminating.
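
For context, here is a minimal sketch of the contract described above, using simplified, hypothetical types (this is not the actual core/state implementation): a trie request drains the tasks queued so far, answers, and leaves the fetch loop alive for further work.

// Minimal sketch of the changed subfetcher contract (hypothetical, simplified
// types; not the actual core/state implementation).
package main

import "fmt"

type subfetcher struct {
    tasks   chan string   // keys queued for preloading
    trieReq chan chan int // trie() requests; reply carries #keys resolved so far
    loaded  int
}

func newSubfetcher() *subfetcher {
    sf := &subfetcher{
        tasks:   make(chan string, 16),
        trieReq: make(chan chan int),
    }
    go sf.loop()
    return sf
}

func (sf *subfetcher) loop() {
    for {
        select {
        case <-sf.tasks:
            sf.loaded++ // stand-in for resolving one key against the trie
        case reply := <-sf.trieReq:
            // Drain everything scheduled before the trie was requested,
            // answer, and then keep looping for further work.
            for drained := false; !drained; {
                select {
                case <-sf.tasks:
                    sf.loaded++
                default:
                    drained = true
                }
            }
            reply <- sf.loaded
        }
    }
}

func (sf *subfetcher) schedule(key string) { sf.tasks <- key }

func (sf *subfetcher) trie() int {
    reply := make(chan int)
    sf.trieReq <- reply
    return <-reply
}

func main() {
    sf := newSubfetcher()
    sf.schedule("slotA")
    sf.schedule("slotB")
    fmt.Println(sf.trie()) // 2: both pending tasks resolved before returning
    sf.schedule("slotC")   // the subfetcher is still alive and accepting work
    fmt.Println(sf.trie()) // 3
}

In this toy version, trie() returns only a counter of resolved keys; the real subfetcher hands back its trie, but the scheduling behaviour is the point being illustrated.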

@jwasinger (Contributor, Author) commented:

Failed tests... the last commit borks something.

@jwasinger (Contributor, Author) commented on Feb 23, 2024

Okay. CI runs successfully for me locally on two separate machines (Mac/Linux). Unsure why it fails in GitHub Actions.

@jwasinger force-pushed the stateless-witness-prefetcher-changes branch from 5ecb17e to 13196b8 on February 28, 2024 03:25
@holiman (Contributor) left a comment:

Could you please elaborate a bit more on what this PR tries to achieve, and why that is better?
As in, why it's needed for your upcoming work, but also what the effects on the current code are. Does it improve anything? Simplify something? Degrade something?

// repeated.
// 2. Finalize of the main account trie. This happens only once per block.
func (p *triePrefetcher) prefetch(owner common.Hash, root common.Hash, addr common.Address, keys [][]byte) error {
    if p.closed {
A contributor commented on this line:

Is there ever multi-threaded access to prefetch / close? If so, it might be better to make closed an atomic.Bool. But if we don't access it that way, it's fine.

A member replied:

The prefetcher is manipulated from the statedb, which is not regarded as thread-safe. I believe we never access prefetch from multiple threads.
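
For reference, a minimal sketch of what the atomic variant could look like if concurrent access were ever needed. This is illustrative only; the type, field, and error names below are hypothetical, not the ones in core/state.

// Illustrative only: guarding the closed flag with an atomic.Bool, in case
// prefetch/close were ever called from multiple goroutines (they are not today).
// Type, field and error names are hypothetical, not the core/state ones.
package main

import (
    "errors"
    "fmt"
    "sync/atomic"
)

var errClosed = errors.New("prefetcher already closed")

type prefetcherSketch struct {
    closed atomic.Bool // safe to read and write from any goroutine
}

func (p *prefetcherSketch) close() {
    p.closed.Store(true)
}

func (p *prefetcherSketch) prefetch(keys [][]byte) error {
    if p.closed.Load() {
        return errClosed
    }
    // ... schedule keys on the subfetchers ...
    _ = keys
    return nil
}

func main() {
    p := new(prefetcherSketch)
    fmt.Println(p.prefetch([][]byte{[]byte("key")})) // <nil>
    p.close()
    fmt.Println(p.prefetch(nil)) // prefetcher already closed
}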

@rjl493456442 (Member) commented on Mar 12, 2024

The biggest difference this pull request introduces is:

Originally, when a trie is requested, the prefetching tasks are suspended and resumed once the live trie is copied and returned. This approach can cause problems for witness collection. For instance:

If slot A is accessed but not modified, the relevant trie nodes should be wrapped into the witness, so slot A is scheduled for prefetching and queued in the task list.

However, if the corresponding storage trie is retrieved before slot A has been preloaded, the relevant trie nodes won't be included in the witness.

In this pull request, the prefetching tasks scheduled before retrieving the trie are always executed first, so the issue is resolved.

Besides, the Copy function is dropped for simplification; the original complexity wasn't worthwhile anyway. statedb.Copy is mostly used in the miner, but after the change made by marius we no longer copy the statedb there, so it makes no sense to copy the prefetcher for a performance gain.
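
In caller terms, reusing the hypothetical types from the sketch in the PR description above, the new guarantee is simply that anything scheduled before the trie is requested gets resolved before the trie is handed back:

// Hypothetical caller-side view of the ordering guarantee, reusing the
// subfetcher sketch from the PR description above.
func collectWitnessExample(sf *subfetcher) {
    sf.schedule("slotA") // slot A was read but not modified; its nodes belong in the witness
    _ = sf.trie()        // blocks until slot A (and anything else queued) has been resolved
}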

@holiman (Contributor) left a comment:

Looking good, but we need to run it on the benchmarking infra for a bit

@jwasinger force-pushed the stateless-witness-prefetcher-changes branch from 453f06d to 287cead on March 14, 2024 02:53
@jwasinger (Contributor, Author) commented:

Unfortunately, the added overhead to block processing and the increased memory usage mean that this PR is a non-starter.

@jwasinger closed this on Mar 15, 2024
@holiman (Contributor) commented on Mar 15, 2024

I think it's maybe a bit premature to close this. Please add the relevant graphs/charts from the benchmark run, so we can discuss and reason about it.

@holiman reopened this on Mar 15, 2024
@holiman (Contributor) commented on Mar 15, 2024

Mar 14 04:09:55 bench05.ethdevops.io geth INFO [03-14|03:09:55.583] Starting peer-to-peer node instance=Geth/v1.14.0-unstable-3c26ffeb/linux-amd64/go1.21.8
Mar 14 04:10:12 bench06.ethdevops.io geth INFO [03-14|03:10:11.942] Starting peer-to-peer node instance=Geth/v1.14.0-unstable-287cead1/linux-amd64/go1.21.8

So apparently bench06 is running this PR.

[Screenshot: Dual Geth Grafana dashboard, 2024-03-15 10:27]

And bench06 is markedly slower, mainly in the bucket called execution. I guess that's where the time spent waiting for the subfetchers to finish is counted.

The prefetchers fetch roughly the same amount of data (bench06 fetches a bit more slots):
[Screenshot: Dual Geth Grafana dashboard, 2024-03-15 10:35]

And they both have the same "misses", that is, work performed that was never needed (e.g. fetching slots at the end of a transaction, where the next transaction set the slots back to their previous value).

[Screenshot: Dual Geth Grafana dashboard, 2024-03-15 10:36]

In the "optimistic" prefetcher, if we overload it with requests, it will just fetch what it can, and then we'll say "ok stop now". In the strict one, overloading it with requests means we'll have to wait it out, it will penalize use. So if we do this, maybe we have to be more clever about what to schedule.

I'm not really 100% sure what scenarios causes the wasted work, or the delivery misses. Might be worth investigating a bit

@@ -237,7 +194,7 @@ func newSubfetcher(db Database, state common.Hash, owner common.Hash, root commo
 owner: owner,
 root: root,
 addr: addr,
- wake: make(chan struct{}, 1),
+ wake: make(chan struct{}),
A member commented on this change:

This change seems problematic. Whereas the original code allowed scheduling the request and then flying off to continue execution, the new code (along with L217 below) will block until a previous request is done (actually, it blocks on the second request).

Any particular reason you made the async notifier synchronous?

A contributor replied:

Well spotted!

@karalabe (Member) commented:

There's a change in this PR that converted the async wake notifier into a sync one, which can potentially have a really heavy hit on performance as it will block execution during preloading. Perhaps that's the root of the slowdown.
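
Roughly, the pattern in question looks like the sketch below. It is a generic example of the async-vs-sync wake idiom, not the prefetcher's exact code: a capacity-1 wake channel with a non-blocking send lets the scheduler coalesce notifications and carry on, whereas an unbuffered channel makes the second notification block until the worker loop comes back for it.

// Generic sketch of the async-vs-sync wake pattern under discussion; this is
// not the prefetcher's actual code.
package main

import (
    "fmt"
    "time"
)

func main() {
    // Async notifier: capacity-1 buffer plus a non-blocking send. Repeated
    // wakes coalesce into a single pending signal and the caller never waits.
    wake := make(chan struct{}, 1)
    notify := func() {
        select {
        case wake <- struct{}{}:
        default: // a wake is already pending; don't block, don't duplicate
        }
    }

    go func() {
        for range wake {
            time.Sleep(10 * time.Millisecond) // pretend to process the queued tasks
        }
    }()

    start := time.Now()
    for i := 0; i < 5; i++ {
        notify() // returns immediately every time
    }
    fmt.Println("async notifies took", time.Since(start))

    // With wake := make(chan struct{}) (unbuffered) and a plain wake <- struct{}{},
    // the second notify would block until the worker loop returns to the channel,
    // stalling execution for the duration of the preload -- the effect flagged above.
}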

@karalabe (Member) commented:

Closing and reopening my own because OP didn't permit maintainer changes.

@jwasinger (Contributor, Author) commented:

:'( I don't know why this is suddenly something I have to explicitly enable. Until recently, PRs I've made in the past have been maintainer-modifiable by default....
