Skip to content

liveslots doesn't drop WatchedPromises from slotToVal #10756

Closed
@warner

Description

Liveslots has a bug (#10757) which leaks slotToVal entries. This will happen for any Promise/vpid pair when there are virtual-data references (vrm.getReachablePromiseRefcount(p) !== 0) at the moment when liveslots notices the resolution of the promise (unregisterUnreferencedVPID, as called by .then handlers created by followForKernel (for locally-decided promises) or notify (for kernel-decided promises). The root cause is a missing pillar-falling handler in the removeReachableVref() code that decrements the refcount: when that count reaches zero, it ought to be checking whether the Promise is resolved or not, but doesn't, and there are no other places where slotToVal.delete(vref) might be called, so the entry is leaked until the incarnation ends.

This is currently consuming over 600MB in the mainnet v43-walletFactory right now. Restarting (upgrading) the vat will temporarily reduce the memory usage, but without fixing the leak, the problem will come back in a month or two.

The PromiseWatcher code triggers this bug, by virtue of the sequence of operations in watchPromise:

        watchedPromiseTable.init(vpid, harden([[watcher, ...args]]));

        // Ensure that this vat's promises are rejected at termination.
        if (maybeExportPromise(vpid)) {
          syscall.subscribe(vpid);
        }

        promiseRegistrations.init(vpid, p);
        pseudoThen(p, vpid);

The maybeExportPromise function (in liveslots.js) uses followForKernel to set up the .then handler which could delete the slotToVal entry, if only the refcount were zero. The pseudoThen function sets up the promise-watcher handle which could remove the Promise from the merely-virtual promiseRegistrations table, which reduces the refcount, but doesn't attempt to delete the entry.

The JavaScript spec says that handlers are executed in the order they were added. p.then(handlerA); p.then(handlerB) means that handlerA will run in an earlier turn than handlerB. So this bug should not depend upon which engine we're using (V8 vs XS), or random timing or whatever.

While the full #10757 fix will make this sequence of operations safe (non-leaking), and will fix other cases that might get added in the future, it is likely to be somewhat invasive. Fortunately, there is an easy workaround: swap the order of these two calls. By calling pseudoThen first, and maybeExportPromise second, we can arrange for the refcount-reducing callback to be executed earlier, so the refcount should be zero by the time unregisterUnreferencedVPID() is called, so the latter will go ahead and delete the slotToVal entry.

And because this doesn't depend upon GC behavior or timing, we can satisfactorily unit-test this in packages/swingset-liveslots (under Node, not needing xsnap), by just measuring the slotToVal.size count. When the leak is present, a resolved watched promise will still have a vpid in that table, and the size will be higher. When the leak is fixed, the vpid will be missing, and the size will be lower.

Metadata

Assignees

Labels

SwingSetpackage: SwingSetbugSomething isn't workingliveslotsrequires vat-upgrade to deploy changes

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions