
AccountsDB2.0: base layer for account state outside of validator #15

linuskendall opened this issue Nov 23, 2023 · 6 comments


linuskendall commented Nov 23, 2023

Implement an accountsdb outside of the validator that matches the validator's accountsdb.

Requirements:

How to drive the account state/export it from the validator?

  • geyser/grpc
  • geyser/something else
  • shared memory?
  • message bus required (kafka/redis/etc.) or message bus optional (easier local dev)
  • shmem with a validator instance, with the data then exported from there?

The important part is that we need to ensure at-least-once delivery of every account update so that we can accurately resolve account state.
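
As a rough sketch of what that implies on the consumer side (the AccountWrite shape and field names here are assumptions, not an existing API): with at-least-once delivery the consumer has to tolerate duplicates, which it can do by only applying an update that is strictly newer than the one it already holds.

use std::collections::HashMap;

/// Hypothetical shape of a single update coming off the stream.
struct AccountWrite {
    pubkey: [u8; 32],
    slot: u64,
    write_version: u64,
    owner: [u8; 32],
    data: Vec<u8>,
}

/// Local view of account state, keyed by pubkey.
#[derive(Default)]
struct AccountStore {
    accounts: HashMap<[u8; 32], AccountWrite>,
}

impl AccountStore {
    /// Applying a write is idempotent: a redelivered or stale update is
    /// dropped unless it is strictly newer than what we already hold.
    fn apply(&mut self, write: AccountWrite) {
        let is_newer = self.accounts.get(&write.pubkey).map_or(true, |existing| {
            (write.slot, write.write_version) > (existing.slot, existing.write_version)
        });
        if is_newer {
            self.accounts.insert(write.pubkey, write);
        }
    }
}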

Is there an argument for not touching token methods/token support and letting https://github.com/metaplex-foundation/digital-asset-rpc-infrastructure handle that?

There are a few tricky issues to handle within this:

  • Initialising a fresh state: this requires getting a full account state snapshot
  • Ensuring that we can begin the stream at a predictable slot, so that if our account state snapshot is at slot N we can stream N+1, N+2, ...
  • Sourcing account hashes from the validator to ensure that our account state is correct?
  • Handling large inbuilt accounts like Sysvars etc.
  • Detecting things like deletes
  • Detecting out of sync and getting in sync again/recovery (or guaranteeing that out of sync can't happen?)
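
For the out-of-sync bullet, one crude building block is for the consumer to track the last slot it fully processed and treat a jump in the incoming slot sequence as a trigger for recovery. This is only a sketch under simplifying assumptions: real clusters skip slots, so a production check would key off slot-status/block notifications from the stream rather than raw contiguity.

/// Tracks stream continuity on the consumer side (illustrative only).
struct SlotTracker {
    last_processed: u64,
}

enum StreamHealth {
    InSync,
    /// Updates for these slots were never seen; the local state can no longer
    /// be trusted and a re-bootstrap/repair should be triggered.
    Gap { from: u64, to: u64 },
}

impl SlotTracker {
    fn observe(&mut self, incoming_slot: u64) -> StreamHealth {
        if incoming_slot > self.last_processed + 1 {
            return StreamHealth::Gap {
                from: self.last_processed + 1,
                to: incoming_slot - 1,
            };
        }
        self.last_processed = self.last_processed.max(incoming_slot);
        StreamHealth::InSync
    }
}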

There are some existing discussions that are worth looking at:

@linuskendall changed the title from "AccountsDB2.0: base" to "AccountsDB2.0: base layer for account state outside of validator" Nov 23, 2023

linuskendall commented Nov 24, 2023

Some notes from a discussion with @grooviegermanikus on a streaming architecture for account state:

Terms:

  • buffer: a store of N slots of account_writes + blocks
  • producer plugin: a geyser plugin that fills the buffer
  • producer server: a server instance that exposes the grpc protocol for the buffer source
  • accountsdb-consumer: a tool that produces a valid accounts db at slot N and can continue to update this accounts db for the following slots, either for all accounts or a subset of accounts
  • snapshot: a full copy of account state at slot X (or a partial account state for a specific program).

Overall architecture

  1. From validator we would need to stream into some buffer. This buffer could be in memory, in postgres, in kafka or somewhere else.
  2. There should be an interface to consume things from the buffer at any given slot. That is, a consumer should be able to resume a stream at slot X even if current slot on network is X+100.
  3. The software (consumer) that recreates a local copy of the account state should be able to consume a snapshot of accounts and also start a stream from the buffer at the same slot its account snapshot is from.
  4. We need some way to generate snapshots. For now this could be as simple as using solana-ledger-tool, but ideally in the future there should be a method for generating snapshots at regular intervals of all or a subset of accounts.
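
A sketch of the buffer interface implied by points 2 and 3 (the trait and type names are assumptions, not an existing crate; the gRPC surface would mirror this shape):

/// Minimal stand-in for one buffered update; fields elided.
struct AccountWrite;

/// Returned when the requested slot has already been pruned from the buffer.
struct SlotPruned {
    oldest_retained: u64,
}

/// The key property: a consumer can resume at slot X even if the cluster tip
/// is at X+100, as long as X is still retained in the buffer.
trait BufferSource {
    /// Oldest and newest slots currently held in the buffer.
    fn retained_range(&self) -> (u64, u64);

    /// Stream every account_write (plus block/slot markers) from `from_slot`
    /// onwards.
    fn subscribe_from(
        &self,
        from_slot: u64,
    ) -> Result<Box<dyn Iterator<Item = AccountWrite>>, SlotPruned>;
}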

So if the tool we want to create that recreates accountsdb outside of the validator is called accountsdb-consumer, one would start it something like this:

$ wget https://mango-juicer.com/snapshot-1231121-mango.tar.bz2
$ accountsdb-consumer --snapshot snapshot-1231121-mango.tar.bz2 --accounts-include mv3ekLzLbnVPNxjSKvqBpU3ZeZXPQdEC3bp5MDEBG68 --pg-conn-str postgres://localhost/ --grpc-url abc.com:443 --listen :2131

The accountsdb-consumer would rebuild the database starting from 1231121 and also start a grpc subscription that feeds any account_writes from slot 1231121 onwards.
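
A sketch of that startup sequence (function names are placeholders, not an existing crate API); the invariant that matters is that the stream resumes exactly where the snapshot stops, so no update in between is lost:

/// Placeholder types; a real consumer would hold a DB handle and a gRPC client.
struct Snapshot { slot: u64 }
struct AccountWrite;

fn load_snapshot(_path: &str) -> Snapshot { Snapshot { slot: 1_231_121 } }
fn apply_snapshot(_snapshot: &Snapshot) {}
fn apply_write(_write: &AccountWrite) {}
fn subscribe_account_writes(_from_slot: u64) -> impl Iterator<Item = AccountWrite> {
    std::iter::empty()
}

fn bootstrap_and_follow(snapshot_path: &str) {
    let snapshot = load_snapshot(snapshot_path);
    apply_snapshot(&snapshot);
    // Resume the stream at the snapshot slot itself (safe as long as applying
    // a write twice is idempotent); streaming from snapshot.slot + 1 also
    // works, but only if the snapshot is guaranteed to contain every write up
    // to and including its slot.
    for write in subscribe_account_writes(snapshot.slot) {
        apply_write(&write);
    }
}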

Producer plugin

The goal of the producer plugin in the validator would be to only produce a buffer of account_writes and blocks. This could involve inserting things into a postgres database, pushing things to a kafka queue, or building an in-memory buffer. This plugin would be pretty simple. For example, in the postgres case it could involve just inserting account writes into a table:

slot | write_version | account | owner | data

A separate process could prune according to some rule, e.g. n days, n updates, n slots, n disk size, etc.
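
As a sketch of how thin that plugin could stay (the real thing would implement the GeyserPlugin trait from the geyser plugin interface crate; the types and table/column names here are illustrative only):

/// Simplified stand-in for what a geyser account-update notification carries.
struct AccountWriteRow {
    slot: u64,
    write_version: u64,
    account: [u8; 32],
    owner: [u8; 32],
    data: Vec<u8>,
}

/// In the postgres case the plugin does one append per notification and no
/// state resolution at all; pruning is left to the separate process above.
const BUFFER_INSERT: &str =
    "INSERT INTO account_write (slot, write_version, account, owner, data) \
     VALUES ($1, $2, $3, $4, $5)";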

Consumer: gRPC or other means?

In theory if the producer plugin inserts into PG we could have a consumer that also reads directly from the PG buffer. However, I propose for the initial design that we make this a grpc interface. Why? Because most of the interesting logic to resolve account state and produce an accountsdb copy will be in the accountsdb-consumer. If we build an initial consumer which reads gRPC then the source for the buffer is easy to replace (can be either a DB based producer, a Kafka based producer etc.)

However, another option would be to make the consumer some kind of crate that could be imported into any project that wants to write a consumer acting as a source or sink.

Each producer plugin/buffer source would need the corresponding grpc-server that would support streaming the account_writes and blocks from a given slot N. So if we have a geyser plugin called postgres-buffer then we need a corresponding postgres-grpc-server that at least exposes the gRPC methods for account subscriptions, slot subscriptions and/or block subscriptions (not sure about the last two, a starting point would be to just support account subscriptions).

Why not make the accountsdb-consumer directly in the plugin?

This approach is possible, but the main drawback is that it doesn't allow horizontal scaling and requires each accountsdb-consumer to be associated 1:1 with a solana-validator. In theory, allowing the accountsdb-consumer to consume over the network means someone else could provide the gRPC endpoint while the accountsdb-consumer produces the account state database. This would reduce the traffic on the gRPC endpoints (since they don't need to serve a full snapshot but rather just incremental changes).

Producing snapshots

Step 4 could, as noted, use ledger-tool, the built-in solana snapshot mechanism, or something completely different. Ideally I think step 4 should use the validator's bank/accountsdb functionality rather than something created by a downstream tool, because this would ensure some correctness. However, in theory you could also bootstrap from another consumer:

$ accountsdb-consumer --bootstrap abc:2131 --accounts-include mv3ekLzLbnVPNxjSKvqBpU3ZeZXPQdEC3bp5MDEBG68 --pg-conn-str postgres://localhost/ --grpc-url abc.com:443 --listen :2131

Concerns

Performance would be a big concern, specifically how much latency we would have in:

validator -> producer plugin -> buffer -> producer-server -> accountsdb resolver

Potentially validator -> producer plugin -> buffer could be replaced by some form of in-memory setup so the producer-server can serve account state to the resolver without having to copy and serialise/deserialise all over the place. Another potential design could involve packaging the entire process into something that produces an account state in memory that we could then query from.

@grooviegermanikus

Thanks a lot for the writeup!

I wonder if we can put some real-world quantities on this: number of messages, gigabytes to handle.
There might be two interesting dimensions on the quantities:

  1. what segmentation of accounts could look like (e.g. put all vote/stake/sysvar accounts in one segment, put oracles in another segment and allow consumers to choose)
  2. what quantities we should expect at burst events, specifically on epoch change


grooviegermanikus commented Dec 1, 2023

building blocks

  • grpc plugin for account updates
  • confirmation level merge logic
  • fix: 0 entries problem
  • detect missing account updates
    • repair call
  • store the accounts - accounts_store (to be served to downstream), account.data
  • tx linking / caching
  • buffer accounts between grpc and accounts_store
  • tool for getting the compressed accounts
  • access to bootstrap data
  • filter the accounts
  • couple bootstrap + geyser stream
  • read path:
    • need to serve this: get_program_accounts, get_multiple_accounts, get_balance, get_token_largest_accounts, get_account_info, get_slot
    • caching
    • DB connection pool

brainstormed on 2023-12-01
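
For the read path above, the listed RPC methods reduce to lookups and scans over the local copy once it exists. A toy sketch over an in-memory map (the real read path would sit on the PostgreSQL store, with the caching and DB connection pool from the list):

use std::collections::HashMap;

/// Minimal stand-in for one stored account.
struct StoredAccount {
    owner: [u8; 32],
    lamports: u64,
    data: Vec<u8>,
}

struct LocalAccountsDb {
    accounts: HashMap<[u8; 32], StoredAccount>,
    slot: u64,
}

impl LocalAccountsDb {
    fn get_slot(&self) -> u64 {
        self.slot
    }

    fn get_account_info(&self, pubkey: &[u8; 32]) -> Option<&StoredAccount> {
        self.accounts.get(pubkey)
    }

    fn get_balance(&self, pubkey: &[u8; 32]) -> u64 {
        self.accounts.get(pubkey).map_or(0, |a| a.lamports)
    }

    /// A full scan here; the real thing would want an index on owner.
    fn get_program_accounts(&self, program: &[u8; 32]) -> Vec<(&[u8; 32], &StoredAccount)> {
        self.accounts
            .iter()
            .filter(|(_, account)| &account.owner == program)
            .collect()
    }
}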

@grooviegermanikus

data size (PostgreSQL store): 600GB for all of AccountsDB (compressed account state on validator: 40GB)

@grooviegermanikus added the upstream (what is required from validator/cluster?) and downstream (what rpc will provide to users/clients/consumers of the API) labels Dec 4, 2023


vovkman commented Dec 28, 2023

A lot of good points here; a few thoughts from going through it:

DAS for token methods

  • DAS in its current state is quite a large deployment; I would be hesitant to lump this in there because of how important it is to get the lowest layer of data correct

Sourcing account hashes

Deletes

  • Should be more straightforward because we are tracking the entire state? All zero-lamport updates owned by the system program can be processed as deletes.
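
A sketch of that check (types are illustrative; the system program id is the all-zeros pubkey, i.e. 11111111111111111111111111111111 in base58):

const SYSTEM_PROGRAM: [u8; 32] = [0u8; 32];

/// Minimal stand-in for an incoming account update.
struct AccountWrite {
    lamports: u64,
    owner: [u8; 32],
}

/// Zero lamports + owned by the system program => treat as a delete/close of
/// the account in the local store.
fn is_delete(write: &AccountWrite) -> bool {
    write.lamports == 0 && write.owner == SYSTEM_PROGRAM
}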

A general thought on the streaming architecture: I wonder if we should consider, either as a replacement for streaming or in addition to it, the option to pull data from geyser instead of relying on a push stream of data. This has a few benefits:

  1. Removes potential back pressure from N consumers encountering issues
  2. Allows consumers to use the same semantics for both keeping up with and catching up to updates (i.e. fetch N to stay caught up, fetch N-20 to catch up)
  3. Moves retry/state tracking burden to consumers (producer is no longer responsible for ensuring consumer receives current data, it simply serves the range requested)
  4. Opens up other extensions for users who just want to receive periodic snapshots of a particular program
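
A sketch of what such a pull interface could look like (all names are illustrative, not an existing geyser or gRPC API):

/// Minimal stand-in for one buffered update.
struct AccountWrite;

trait PullSource {
    /// Newest slot the producer has buffered.
    fn tip_slot(&self) -> u64;

    /// All account_writes in [from_slot, to_slot], if that range is still
    /// retained; retry and progress tracking stay on the consumer side.
    fn fetch_range(&self, from_slot: u64, to_slot: u64) -> Result<Vec<AccountWrite>, String>;
}

/// Keeping up and catching up use the same call (point 2 above): fetch
/// whatever lies beyond the last slot we have applied.
fn poll_once(source: &dyn PullSource, last_applied: u64) -> Result<Vec<AccountWrite>, String> {
    let tip = source.tip_slot();
    if tip <= last_applied {
        return Ok(Vec::new());
    }
    source.fetch_range(last_applied + 1, tip)
}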

Some downsides

  1. Higher potential for abuse
  2. Need to keep some sort of internal buffer of previous slots (though this is probably necessary either way)
  3. Updates might not be as real time as a gRPC stream/push model
