Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance: Split data sources for indexing #414

Open
awfm9 opened this issue Sep 11, 2021 · 4 comments
Open

Performance: Split data sources for indexing #414

awfm9 opened this issue Sep 11, 2021 · 4 comments
Assignees

Comments

@awfm9
Copy link
Collaborator

awfm9 commented Sep 11, 2021

In order to prepare for augmenting the DPS index data, it would be good to separate chain data and DPS index data. The DPS index will only contain the Ledger payloads. Everything else, which is basically just transcoded data from Flow databases, will go into a separate database.

I recommend that we use the following nomenclature:

  • DPS ledger index/database
  • DPS entity index/database
  • DPS insight index/database
@awfm9 awfm9 changed the title DPS: separate DPS index from chain index Separate DPS index database from chain database Sep 27, 2021
@Ullaakut
Copy link
Contributor

Ullaakut commented Sep 28, 2021

Right now, the biggest performance issue when it comes to the Live Indexer is that catching up is really slow. Splitting the data we read into multiple databases can also improve the performance for the static indexer, of course.

The idea would be to have three distinct databases we read from:

  • The protocol state database from consensus nodes
  • A database specifically for trie updates
  • A database specifically for the rest of the data from the execution nodes

By having those three databases instead of a single one, we can parallelize requests and reduce the friction in our current bottleneck, which is the indexing of transactions and trie updates.

Things to clarify

How to split the data into multiple databases? We most likely don't want to go through all of it and clone it, as that would take as much time as actual indexing.

@Ullaakut Ullaakut changed the title Separate DPS index database from chain database Split data sources for indexing Sep 28, 2021
@awfm9
Copy link
Collaborator Author

awfm9 commented Sep 28, 2021

I think one important point that is missing from your summary is that the protocol state database is the reference protocol state, as found on consensus nodes; not execution nodes. Other than that, it sounds excellent 👍.

Regarding the approach, I think we would need to switch to a design with two writers, and possible 2-3 readers (though the readers could remain unified as one, just with three DB dependencies). Both writers (execution data, ledger data) would flush at badger transactions at the same interval, except that they would be offset by half an interval so they are interleaved optimally.

@Ullaakut Ullaakut self-assigned this Sep 29, 2021
@Ullaakut
Copy link
Contributor

@awfm9

I think we would need to switch to a design with two writers

How would you suggest we split them and how would they be used by the mapper? I guess we would need the mapper to run both indexing tasks in separate goroutines, and that might slightly complicate the error handling part of things.

and possible 2-3 readers (though the readers could remain unified as one, just with three DB dependencies)

So assuming that we take a new DB as input with the non-trie-update data from the exec nodes, we would now have:

  • The LedgerWAL consumed by the Feeder
  • The Protocol State DB consumed by the Chain, which would have a slightly reduced scope in what it fetches from the DB
  • The other db consumed by a new kind of reader, which would implement part of what the Chain currently does

Is that right? Having multiple databases but accessing them sequentially provides no performance improvement, so once again I assume that we'd want to somehow make the mapper fetch this data concurrently.

@Ullaakut
Copy link
Contributor

Ullaakut commented Sep 30, 2021

Change Performance Impact
Badger v3 Upgrade None
Split up reader +14%
Split up writer +5%

Badger v3 Upgrade

This change was a simple switch of the Badger version used for our index database.

Split up reader

This change was made (in a quick and dirty way) by cloning the protocol state DB and opening both of them with two separate readers so that it can be read in parallel by the mapper. It resulted in around a performance increase of around 14% on my machine.

One reader reads specifically the Transaction Results, Events and Seals (execution data) while the other reads everything else.

Split up writer

This change was made by no longer writing in one but three separate badger databases for indexing. This allows to write on both the protocol index and the chain index in parallel. Unfortunately on my machine, this only resulted in a performance increase of around 5%.

One writer writes specifically the Transaction Results, Events and Seals while the other writes everything else but trie updates. Depending on the blocks, both operations seem to take roughly the same amount of time, but sometimes indexing consensus data takes about twice longer. Still, assuming they'd both take exactly the same amount of time always, we'd only save up another 5% at best. At least, with the dataset I have. Maybe this is more significant with real data.

Example:

logs.json
{
  "level": "info",
  "component": "mapper_transitions",
  "duration": 3.5319,
  "db": "execution",
  "time": "2021-10-01T08:37:49Z",
  "message": "Finished indexing goroutine"
}
{
  "level": "info",
  "component": "mapper_transitions",
  "duration": 5.7975,
  "db": "consensus",
  "time": "2021-10-01T08:37:49Z",
  "message": "Finished indexing goroutine"
}
{
  "level": "info",
  "component": "mapper_transitions",
  "duration": 3.4241,
  "db": "execution",
  "time": "2021-10-01T08:37:49Z",
  "message": "Finished indexing goroutine"
}
{
  "level": "info",
  "component": "mapper_transitions",
  "duration": 6.3003,
  "db": "consensus",
  "time": "2021-10-01T08:37:49Z",
  "message": "Finished indexing goroutine"
}

@Ullaakut Ullaakut changed the title Split data sources for indexing Performance: Split data sources for indexing Oct 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants