increasing end-of-crank time consumed in testnet4.5 vattp trace, promises not getting GCed? XS bug? #3839
Comments
I've confirmed that the increase in runtime still occurs in a stripped-down environment (no kernel, just vattp and synthesized deliveries). I'm starting to suspect something in XS is failing to mark a Promise as unreferenced, since there's no obvious userspace strong collection of Promises, and a similar bug was fixed in XS several months ago.
I'm pretty sure this is a problem in XS, and I've filed Moddable-OpenSource/moddable#701 with steps to reproduce for the Moddable folks to take a look at.
* fix a major memory leak: 64 bytes per Map `delete()`, 32 per Set `delete()`
* should: closes #3839
* unfortunately Map/Set deletion is now O(N) not O(1)
* possibly fix #3877 "cannot read (corrupted?) snapshot"

Note that this breaks snapshot compatibility, and probably metering compatibility.

closes #3889
We upgrade the XS submodule to the latest version: Moddable-OpenSource/moddable@10cc52e

This fixes a major memory leak: 64 bytes per Map `delete()`, 32 per Set `delete()`. We believe this should: closes #3839

Unfortunately Map/Set deletion is now O(N) not O(1). This version of XS also fixes a bug that might be the cause of #3877 "cannot read (corrupted?) snapshot", but we're not sure.

Note that this breaks snapshot compatibility (snapshots created before this version cannot be loaded by code after this version). It might also break metering compatibility, but the chances seem low enough that we decided to leave the metering version alone.

closes #3889
For reference, the most obvious problem turned out to be unrelated to Promises. XS had a memory leak in which deleting an entry from a Set cost 32 bytes, and 64 bytes for a Map. The ultimate cause was the tombstone entries in the insertion-order linked list used to ensure iterators behaved correctly in the face of deletion. The new version of XS fixes this, although at the cost of O(N) deletion operations.

I ran a simulation of vattp behavior (delivering synthetic messages to an XS worker hosting the vattp code) and observed mostly flat runtime. I ran it for about 10% of the total testnet traffic (the original trace showed a clear linear increase in runtime over that range). There remained some occasional jank that grew linearly, maybe once every 5k deliveries, which I don't understand, but I'm willing to believe it was happening on the V8 side that was driving the simulation. I also ran a much smaller test (100 steps) and inspected XS's memory usage.

I ought to run a more extensive test in which I replay the entire testnet4.5 trace into a new-XS worker and examine the memory and time usage, but I'm more likely to wait until we do a deeper performance investigation of some other vat.
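The leak pattern is easy to exercise in isolation. A minimal sketch (my own illustration, not the actual Moddable-OpenSource/moddable#701 reproduction) is just a loop that churns a single key:

```js
// Churn a single Map entry. Under the leaky XS, each delete() left a
// tombstone (~64 bytes for a Map, ~32 for a Set) in the insertion-order
// list that keeps iterators correct across deletion, so the heap grew
// linearly even though the Map never held more than one live entry.
const m = new Map();
for (let i = 0; i < 100000; i += 1) {
  m.set('hwm', i);
  m.delete('hwm');
}
// With the fixed XS this stays flat, at the cost of O(N) delete().
```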
I'm investigating performance data from the recent phase4.5 ("metering") testnet, specifically vattp (v14). The code in this vat is super-simple, and does not accumulate any significant state over time. Yet for some reason the time it took for each crank increased significantly over the week-long testnet period.
This is a graph of the total delivery time (slogfile `deliver` event until `deliver-result`), vs the delivery number, for just the `deliverInboundAck()` messages:

The code for `deliverInboundAck()` just retrieves a high-water mark number from a `Map` for the named remote, makes one `remove` device invocation for each message number between that number and the incoming ack number, then updates the high-water mark. The observed metering-visible activity (computrons) is a nearly-deterministic function of the number of messages that get removed, over the entire testnet period, so at least the JS engine doesn't think the amount of work is increasing over time.
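For reference, a minimal sketch of that logic (with made-up names — `remotes`, `D()`, `mailboxDevice` — not the actual vattp source) looks like this:

```js
// Hypothetical sketch of the ack path described above.
const remotes = new Map(); // remoteName -> { highestAck, ... }
const D = (device) => device; // placeholder for the device-invocation wrapper
const mailboxDevice = { remove: (_name, _num) => {} }; // placeholder device

function deliverInboundAck(name, ackNum) {
  const remote = remotes.get(name);
  let num = remote.highestAck + 1; // previous high-water mark
  while (num <= ackNum) {
    D(mailboxDevice).remove(name, num); // one device call per newly-acked message
    num += 1;
  }
  remote.highestAck = ackNum; // record the new high-water mark
}
```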
When I graph the amount of time that elapses between the initial `deliver` event and the kernel-side end of the last syscall, I see this shape:

which tells me:
I'm fairly sure that the extra time is being spent during the forced GC that liveslots does at end-of-crank (the same thing we're aiming to do less frequently with #3767). However, given that vattp is not doing anything to justify an increased heap size (no objects are being passed around), my hypothesis is that something inside liveslots is causing the heap to grow, and the only thing I can think of is a bunch of result promises.
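One way to check that split (a sketch, assuming an explicit `gc()` is exposed, e.g. Node run with `--expose-gc`) is to time the forced collection separately from the delivery itself:

```js
// Separate the delivery time from the forced end-of-crank collection time.
// `deliver` is any async function that performs one crank's worth of work;
// globalThis.gc() stands in for the forced end-of-crank collection.
async function timedCrank(deliver) {
  const t0 = performance.now();
  await deliver();
  const t1 = performance.now();
  globalThis.gc();
  const t2 = performance.now();
  return { deliveryMs: t1 - t0, gcMs: t2 - t1 };
}
```

If `gcMs` grows with delivery count while `deliveryMs` stays flat, the extra time really is in the collector.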
There are three pathways through vattp that get called a significant number of times (i.e. more than once per provisioned remote). Two (`deliverInboundMessage` and `deliverInboundAck`) are triggered by the mailbox device. They are "sendOnly", and do not have result promises.

`deliverInboundMessage` emits a single eventual-send and a device invocation for each non-duplicate message:

(as noted, we want to change that `i.receiver~.receive()` to use sendOnly once we have support for it, because we don't really care about the result promise). `deliverInboundAck` does a device invocation, but no eventual-sends.

The third pathway is on the outbound side, when comms tells vattp to transmit a message to the remote:
This does not cause any eventual sends, and the comms message is sendOnly (because comms does not use liveslots).
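To make the result-promise question concrete, here is a rough sketch of the inbound `deliverInboundMessage` path described above (illustrative names only, reusing the placeholders from the earlier ack sketch; the real vattp code differs in details):

```js
import { E } from '@endo/eventual-send'; // E(x).m() is the eventual-send form of `x~.m()`

// Hypothetical sketch: one eventual-send plus one device invocation per
// non-duplicate message. E(...).receive() returns a result promise that
// vattp never uses; a sendOnly variant would not create it.
function deliverInboundMessage(name, newMessages) {
  const i = remotes.get(name); // per-remote record, as in the ack sketch
  for (const [num, body] of newMessages) {
    if (num > i.highestDelivered) {
      E(i.receiver).receive(body); // eventual-send: creates an (unused) result promise
      D(mailboxDevice).ackInbound(name, num); // device invocation (hypothetical method name)
      i.highestDelivered = num;
    }
  }
}
```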
So the only place that's obviously creating a result promise is the `~.receive()` call. Comms should be resolving that (to `undefined`) a moment later, and liveslots is supposed to forget about resolved promises. But if something is broken in that pathway, maybe `pendingPromises` is growing, causing heap growth, and subsequent GC slowdowns.

My plan is to build a benchmark environment that creates a vattp vat (without a kernel), sets up a remote, then feeds a long stream of `deliverInboundMessages`/`deliverInboundAck`/`transmit` messages in, and watches how the execution time changes over time. If that demonstrates the same problem that we recorded on the testnet, the next step is to start instrumenting liveslots, specifically `pendingPromises`, to see if it is growing over time.
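A rough sketch of what that driver loop could look like (every name here — `makeVattpWorker`, `makeAckDelivery`, `pendingPromiseCount` — is hypothetical; the real harness would drive the actual vattp vat):

```js
// Hypothetical benchmark driver: no kernel, just a vattp worker fed a long
// stream of synthetic deliveries, recording wall-clock time per delivery so
// a growing trend (or a growing pendingPromises table) becomes visible.
async function runBenchmark(totalDeliveries) {
  const worker = await makeVattpWorker(); // vattp + liveslots in a worker, no kernel
  await worker.setupRemote('remote1');
  const timings = [];
  for (let n = 1; n <= totalDeliveries; n += 1) {
    const t0 = performance.now();
    await worker.deliver(makeAckDelivery('remote1', n)); // includes end-of-crank GC
    timings.push(performance.now() - t0);
    if (n % 1000 === 0) {
      // watch whether liveslots' pendingPromises table is growing
      console.log(n, timings[timings.length - 1], worker.pendingPromiseCount());
    }
  }
  return timings;
}
```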