liveslots doesn't drop WatchedPromises from slotToVal #10756
Description
Liveslots has a bug (#10757) which leaks slotToVal
entries. This will happen for any Promise/vpid pair when there are virtual-data references (vrm.getReachablePromiseRefcount(p) !== 0
) at the moment when liveslots notices the resolution of the promise (unregisterUnreferencedVPID
, as called by .then
handlers created by followForKernel
(for locally-decided promises) or notify
(for kernel-decided promises). The root cause is a missing pillar-falling handler in the removeReachableVref()
code that decrements the refcount: when that count reaches zero, it ought to be checking whether the Promise is resolved or not, but doesn't, and there are no other places where slotToVal.delete(vref)
might be called, so the entry is leaked until the incarnation ends.
This is currently consuming over 600MB in the mainnet v43-walletFactory right now. Restarting (upgrading) the vat will temporarily reduce the memory usage, but without fixing the leak, the problem will come back in a month or two.
The PromiseWatcher code triggers this bug, by virtue of the sequence of operations in watchPromise
:
watchedPromiseTable.init(vpid, harden([[watcher, ...args]]));
// Ensure that this vat's promises are rejected at termination.
if (maybeExportPromise(vpid)) {
syscall.subscribe(vpid);
}
promiseRegistrations.init(vpid, p);
pseudoThen(p, vpid);
The maybeExportPromise
function (in liveslots.js) uses followForKernel
to set up the .then
handler which could delete the slotToVal
entry, if only the refcount were zero. The pseudoThen
function sets up the promise-watcher handle which could remove the Promise from the merely-virtual promiseRegistrations
table, which reduces the refcount, but doesn't attempt to delete the entry.
The JavaScript spec says that handlers are executed in the order they were added. p.then(handlerA); p.then(handlerB)
means that handlerA will run in an earlier turn than handlerB. So this bug should not depend upon which engine we're using (V8 vs XS), or random timing or whatever.
While the full #10757 fix will make this sequence of operations safe (non-leaking), and will fix other cases that might get added in the future, it is likely to be somewhat invasive. Fortunately, there is an easy workaround: swap the order of these two calls. By calling pseudoThen
first, and maybeExportPromise
second, we can arrange for the refcount-reducing callback to be executed earlier, so the refcount should be zero by the time unregisterUnreferencedVPID()
is called, so the latter will go ahead and delete the slotToVal
entry.
And because this doesn't depend upon GC behavior or timing, we can satisfactorily unit-test this in packages/swingset-liveslots
(under Node, not needing xsnap
), by just measuring the slotToVal.size
count. When the leak is present, a resolved watched promise will still have a vpid in that table, and the size will be higher. When the leak is fixed, the vpid will be missing, and the size will be lower.