LedgerEntry context ID / provenance tracking

This is a suggestion about something new we could do in core to minimize risks of data consistency issues.

### Background: LedgerTxn

Stellar core has a somewhat-centralized system for accessing the ledger called `LedgerTxn` (a.k.a. LTX) which is a sort of nested key/value map (mapping `LedgerKey` to `LedgerEntry`) that is used for several overlapping purposes:

1. Atomic commit/rollback (with exception safety) of multiple simultaneous changes.
2. Batching and caching accesses to the underlying database and/or buckets.
3. Minimizing the risk of data consistency errors.

This third task is subtle and the source of a lot of the complexity of the LTX: doing a lookup on an LTX produces a `LedgerTxnEntry` object, which is _not_ a raw `LedgerEntry`. The `LedgerTxnEntry` is instead a special smart pointer type that can be _invalidated at a distance_ by the ltx it came from. In particular when a parent LTX opens a child LTX, all existing `LedgerTxnEntry`s issued by _past lookups on the parent_ are dynamically invalidated, wherever they may be. Any attempt to read or write through them will fail (dynamically, at runtime).

The _purpose_ of it is to avoid two types of data inconsistency that can arise if a child LTX has newer entries than its parent:

1. Stale reads from the parent LTX that should be hidden by the child LTX's newer entry.
2. Lost writes to a parent LTX entry that will be overwritten by the child LTX's newer entry.

The purpose is good! But the _implementation_ of this leaves some things to be desired:

1. It makes use of destructors in a way that's not very obvious to the user. A lot of users don't know when they can or should call `.current()` on a `LedgerTxnEntry`.
2. It is relatively easy to accidentally subvert by capturing a reference to the underlying `LedgerEntry` in a local.
3. The implementation is quite confusing and involves a _lot_ of code and complexity for just this one task.
4. The implementation is tightly coupled to the `LedgerTxn` class itself, and does nothing to prevent similar inconsistencies in contexts where we don't or can't use the `LedgerTxn` class (eg. when operating directly on bucket lists and snapshots, or in the new parallel apply subsystem).

### Alternative: LedgerEntry context IDs / provenances

I propose we explore alternative designs to the LTX, not necessarily replacing it wholesale (or not yet) but as a first step factoring out the _purpose_ of its inconsistency-prevention system into a separate _mechanism_ that:

1. Is much simpler and has much less machinery
2. Is easier to understand, more explicit, and harder to accidentally misuse
3. most importantly: we actually _can_ use everywhere we use `LedgerEntry`s, not just in LTXs

The mechanism I have in mind has two pieces:

1. A `LedgerEntryContext` class that is lightweight and can either be separately inherited-from or manually instantiated anywhere there's a conceptual collection of `LedgerEntry`s that need to be guarded against stale reads or lost updates. This class contains a _number_ (no complex pointer types) that we arrange to uniquely identify the context in the code (exact mechanism of assigning numbers TBD; some mix of static and dynamic numbering, but assume there's some way of uniquely numbering them). We can call these "context IDs" or "provenances" depending on your taste.
2. A `LedgerEntryWrapper` class that _directly contains_ a  `LegerEntry` and so can be passed to and from functions, put in standard container types, buffered, cached, etc. etc. Additionally, this class carries the number of a context that the wrapper is associated with (it is set on construction).

These two pieces would collaborate to dynamically check a similar condition: that nobody reads or writes entries "in a context they shouldn't". Contexts need not be containers! Or if they are containers, they need not be containers shaped exactly like a map from keys to values (for example bucket lists and eviction iterators are more like just sequences or lists of entries).

I can think of at least 3 mechanisms of enforcement of such "context-sensitive access":

1. In order to _read from_ or _write to_ the `LedgerEntry` inside a `LedgerEntryWrapper`, a method like `context.unwrap_read(wrapper, [](LedgerEntry const&) { ... })` or `context.unwrap_write(wrapper, [](LedgerEntry&) {...})` needs to be called on a context. This method checks the context numbers match before allowing access and throws if they do not. Since it takes a closure it's less immediately easy to capture the reference in a local variable (it's not impossible -- at some level there's no way to prevent it -- but it's a little harder!)
2. Contexts may want to activate or de-activate themselves entirely -- masking off the possibility of _any_ access to a parent context while a child is open -- the same way LTXs do, with parent/child relationships tracked by a helper flag/method/destructor if needed. But not all contexts we're concerned about have a direct parent/child coupling in the call tree the same way LTXs do, so this might not be the only way of using them.
3. The assignment `operator=` of `LedgerEntryWrapper` may also want to contain a dynamic check that the assigned number matches the existing number, such that writing back a `LedgerEntryWrapper` to a container will fail unless the numbers match. For this to work there needs to either be a (partial?) order on numbers such that child-context entries _can_ overwrite parent-context entries when being committed-back to them, or else if numbers are opaque there needs to be a way to _intentionally_ re-home a bunch of wrappers (eg. by calling a method `wrapper.set_context(new_context)` on all of them before writing back.

### Longer term: the LTX itself

The LTX is a bit of a maintenance concern. The code is very complex, it is strictly sequential (not parallel friendly), I am very unsure of its exception-safety logic, much of it is now semi-redundant with the parallel apply path and the underlying bucketlist (which already does its own caching and batching), and its SQL caching and batching function (or SQL-access mediation at all) is no longer used for anything other than the orderbook.

While I do not want to dismiss the extent to which the orderbook is still a central use-case, I think it could be a healthy de-risking of the future of LTX client code to start finding ways to accomplish the remainder of what the LTX does without actually needing it. Or alternatively to refactoring its implementation to "meet the same goals" with a simpler internal implementation. I believe that if the LTX implemented this `LedgerEntryContext` interface, LTX client code could fairly easily be ported over to it piece by piece, after which a lot of the internal complexity of the LTX could be eliminated (retaining the pieces it needs to function as a transactional nested map and cache of orderbook state, again at least for now). I think this could even eventually make room for a standard tx-apply path that doesn't use the LTX at all -- only the new parallel apply machinery -- if we could find a parallel-friendly answer for the orderbook.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

LedgerEntry context ID / provenance tracking #4971

Background: LedgerTxn

Alternative: LedgerEntry context IDs / provenances

Longer term: the LTX itself

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

LedgerEntry context ID / provenance tracking #4971

Description

Background: LedgerTxn

Alternative: LedgerEntry context IDs / provenances

Longer term: the LTX itself

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions