Performance: Split data sources for indexing #414
Right now, the biggest performance issue when it comes to the Live Indexer is that catching up is really slow. Splitting the data we read into multiple databases can also improve the performance for the static indexer, of course. The idea would be to have three distinct databases we read from:
By having those three databases instead of a single one, we can parallelize requests and reduce friction at our current bottleneck, which is the indexing of transactions and trie updates.

**Things to clarify**
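As a rough sketch of the split, the three stores could be bundled in a single struct so the mapper can address each concern independently. The type and field names below are hypothetical, and the exact three-way split (protocol state, execution data, ledger data) is an assumption drawn from the discussion in this thread:

```go
package main

import "fmt"

// store stands in for a *badger.DB handle; the real type would come
// from the Badger library.
type store struct{ name string }

// indexStores bundles the three databases discussed in this thread so
// that reads and writes against each can proceed in parallel. The
// field layout is an assumption, not the project's actual design.
type indexStores struct {
	protocol  *store // reference protocol state, as found on consensus nodes
	execution *store // transaction results, events, seals
	ledger    *store // ledger payloads / trie updates
}

func main() {
	s := indexStores{
		protocol:  &store{name: "protocol"},
		execution: &store{name: "execution"},
		ledger:    &store{name: "ledger"},
	}
	fmt.Println(s.protocol.name, s.execution.name, s.ledger.name)
}
```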
I think one important point that is missing from your summary is that the protocol state database is the reference protocol state, as found on consensus nodes, not execution nodes. Other than that, it sounds excellent 👍

Regarding the approach, I think we would need to switch to a design with two writers, and possibly 2-3 readers (though the readers could remain unified as one, just with three DB dependencies). Both writers (execution data, ledger data) would flush Badger transactions at the same interval, except that they would be offset by half an interval so they are interleaved optimally.
How would you suggest we split them and how would they be used by the mapper? I guess we would need the mapper to run both indexing tasks in separate goroutines, and that might slightly complicate the error handling part of things.
So assuming that we take a new DB as input with the non-trie-update data from the exec nodes, we would now have:
Is that right? Having multiple databases but accessing them sequentially provides no performance improvement, so once again I assume we'd want the mapper to somehow fetch this data concurrently.
**Badger v3 Upgrade**

This change was a simple switch of the Badger version used for our index database.

**Split up reader**

This change was made (in a quick and dirty way) by cloning the protocol state DB and opening both copies with two separate readers, so that the mapper can read them in parallel. It resulted in a performance increase of around 14% on my machine. One reader reads specifically the Transaction Results, Events and Seals (execution data), while the other reads everything else.

**Split up writer**

This change was made by writing to three separate Badger databases for indexing instead of one. This allows writing to the protocol index and the chain index in parallel. Unfortunately, on my machine this only resulted in a performance increase of around 5%. One writer writes specifically the Transaction Results, Events and Seals, while the other writes everything else except trie updates. Depending on the block, both operations seem to take roughly the same amount of time, but sometimes indexing consensus data takes about twice as long. Even if both always took exactly the same amount of time, we'd only save another 5% at best, at least with the dataset I have. Maybe this is more significant with real data.

Example: `logs.json`

```json
{
  "level": "info",
  "component": "mapper_transitions",
  "duration": 3.5319,
  "db": "execution",
  "time": "2021-10-01T08:37:49Z",
  "message": "Finished indexing goroutine"
}
{
  "level": "info",
  "component": "mapper_transitions",
  "duration": 5.7975,
  "db": "consensus",
  "time": "2021-10-01T08:37:49Z",
  "message": "Finished indexing goroutine"
}
{
  "level": "info",
  "component": "mapper_transitions",
  "duration": 3.4241,
  "db": "execution",
  "time": "2021-10-01T08:37:49Z",
  "message": "Finished indexing goroutine"
}
{
  "level": "info",
  "component": "mapper_transitions",
  "duration": 6.3003,
  "db": "consensus",
  "time": "2021-10-01T08:37:49Z",
  "message": "Finished indexing goroutine"
}
```
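For reference, timing output like the log excerpt above can be produced by wrapping each indexing goroutine in a small timer that emits a structured log entry. This sketch uses only the standard library; the actual logs above presumably come from the project's structured logger, and the helper names here are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// indexLog mirrors the fields seen in the log excerpt above.
type indexLog struct {
	Level     string  `json:"level"`
	Component string  `json:"component"`
	Duration  float64 `json:"duration"`
	DB        string  `json:"db"`
	Time      string  `json:"time"`
	Message   string  `json:"message"`
}

// timedIndex runs the given indexing step and returns a log entry
// recording how long it took, in seconds.
func timedIndex(db string, index func()) indexLog {
	start := time.Now()
	index()
	return indexLog{
		Level:     "info",
		Component: "mapper_transitions",
		Duration:  time.Since(start).Seconds(),
		DB:        db,
		Time:      start.UTC().Format(time.RFC3339),
		Message:   "Finished indexing goroutine",
	}
}

func main() {
	entry := timedIndex("execution", func() { time.Sleep(10 * time.Millisecond) })
	out, _ := json.Marshal(entry)
	fmt.Println(string(out))
}
```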
In order to prepare for augmenting the DPS index data, it would be good to separate chain data and DPS index data. The DPS index will only contain the Ledger payloads. Everything else, which is basically just transcoded data from Flow databases, will go into a separate database.
I recommend that we use the following nomenclature: