device errors should not (always) kill the kernel #4326

warner · 2022-01-19T23:51:53Z

What is the Problem Being Solved?

#4296 and #4297 are cases where userspace managed to cause the timer device to have an error (returning a non-hardened array in one case, doing SO() to a non-Presence in the other). Both caused a kernel panic, halting the chain.

The first one, where vat A did a D(devices.B).method() and method() had an error, should probably cause vat A to observe an Error instead of killing the kernel. The device might be able to avoid the error with careful programming: in this case, method() returned an unhardened list, which is a minor bug in the device itself that could be fixed. But given the rules about what can and cannot be passed to/from devices (e.g. no promises), the Error might be thrown by translators before or after the device code runs, so even a scrupulous device might not be able to prevent all errors. In that case, user code could crash the kernel any time it likes, not good.

This is a case where the calling context is well-defined. It might not be prepared to do anything meaningful with the error, but still it's an appropriate and understandable place to throw.

The potential risk is that we don't know what state this leaves the device in. Destroying the world means we don't need to worry about the state of a single device, but probably overkill.

The second one, where the error happened during an external (beyond the kernel) invocation of the device's external IO surface (BEGIN_BLOCK triggered the "time has passed" method on the timer device), this is outside the context of any particular vat-driven invocation. The only place to report an error to is the external caller. I'm less sure what to do here: a kernel panic doesn't seem unreasonable. If the external caller is watching for an error and is prepared to react to it, then maybe we shouldn't panic the kernel. But in this case, I think BEGIN_BLOCK would halt the chain anyways.

The real question is whether the error propagates, or if it gets dropped (log+ignore is awfully close to just-ignore), and what bad state we might get into as a result. If the error only causes bad service to the party that provoked it, that's fine (@erights 's definition of defensive consistency allows us to provide bad service to a bad caller, as long as they can't prevent us from providing good service to good callers). In #4296, there's no timer-device state being modified. In #4297, it would be safe for the timer device to ignore the failed notification attempt. But I've seen systems in which a repeating timer did some task, and only re-submitted the timer if the task succeeded, which meant that a single failure would halt the timer, causing all sorts of future work to get silently skipped. The symptom was that certain events just didn't happen, with no visible cause, and the only way to fix it was to restart the entire service.

This also came up in my recent work on the #1346 "raw devices", where some helper code is translating object/promise/device references into new Presence-like objects (not using liveslots). I don't have a sensible way to represent foreign device nodes, so my translator asserts that they aren't being given one. But I can't prevent userspace from sending one, which means that the current device manager policy would allow any vat that has access to two different devices to crash the entire kernel. If I change the policy to send back an error to the calling vat instead, this becomes safe, and much easier to deal with from the device code.

Description of the Design

Device invocation from the kernel results in a DeviceResults response of either ['ok', resultCapdata] or ['error', description]. This is translated (by the device's translator) into a KernelDeviceInvocationResults object with the same shape, but krefs instead of (device-)vrefs. Then, the calling vat's translator is used to convert this into a VatInvocationResults object with the same shape, but with (vat-)vrefs.

Currently, both translation steps assert that the first element is ok, and thus don't attempt to do anything about an error.

The change would be to modify the translators to convert error results too, treating the second argument as capdata of an Error object and translating it just as it would an ok response. Also, the code in deviceManager.js which calls dispatch.invoke needs to react to a caught exception by synthesizing a marshalled Error object and returning the ['error', errorCapdata] results, instead of re-throwing.

On the calling vat's side, the doSyscall code in supervisor-helper.js currently reacts to an ['error', ..] -returning syscall by throwing an error that includes some data from the response:

    insistVatSyscallResult(vsr);
    const [type, ...rest] = vsr;
    switch (type) {
      case 'ok': {
        const [data] = rest;
        return data;
      }
      case 'error': {
        const [err] = rest;
        throw Error(`syscall.${fields[0]} failed, prepare to die: ${err}`);
      }
      default:
        throw Error(`unknown result type ${type}`);
    }

and the liveslots code for D(..).method(args) which invokes it looks like:

          const ret = syscall.callNow(slot, prop, serArgs);
          insistCapData(ret);
          const retval = unmeteredUnserialize(ret);
          return retval;

That's perhaps good enough for now. Liveslots will see syscall.callNow throw the syscall.callNow failed, prepare to die: ${err} Error, which will probably render the errorCapdata as [object Object] but won't kill the kernel.

An improvement would be to change the definition of syscall.callNow to not throw an exception, but instead to return the entire VatSyscallResult object ('ok' / 'error' prefix plus capdata), and let liveslots handle both. Liveslots would want to deserialize the second argument in both cases, but ok means it returns the output, while error means it throws the output. Then the vat code would see whatever Error object was created by the device side (modulo the way marshal hides the details).

As for the second problem (externally-provoked errors), I think our current code actually just throws an error that the external caller sees.. I think it might not crash the kernel directly. If #4297 was a chain halt because the BEGIN_BLOCK cosmic-swingset code propagated the error up to the whole chain, then I think we don't need any kernel changes.

Security Considerations

Preventing userspace from killing the kernel would definitely be a security improvement.

Not crashing the kernel might leave us in a weird state, which could enable some sort of attack.

Test Plan

Unit tests using a real kernel+vat+device.

The text was updated successfully, but these errors were encountered:

warner · 2022-01-27T22:11:23Z

My branch currently has the following state:

Liveslots prohibits vat user code from sending Promises in the arguments of a
D() call: this throws a local error.

The vat-to-kernel syscall translator also prohibits vats (including
liveslots) from including promises in the argments of a syscall.callNow,
which should never happen because of the liveslots guard. If that guard let
them through, the translator's error would mark the vat for termination, but
would not panic the kernel. The vat would see the D() call throw an
exception, its delivery would complete, the crankBuffer state changes from
that delivery would be discarded, then the vat would be terminated.

The kernel-to-device delivery translator prohibits devices from receiving
promises. If these are not caught by the earlier translator, this would cause
a kernel panic, as well as marking the calling vat for termination (which
isn't entirely coherent in a panic).

Vat user code could send a foreign device node in a D() call (tell device-1
about something from device-2), which would pass liveslots and the
vat-to-kernel syscall translator, but would be rejected by the
kernel-to-device translator. This will panic the kernel.

Vat user code can send objects to devices without problems.

The device-to-kernel syscall translator prohibits devices from exporting
objects (i.e. return a new o+NN reference from an invocation, or include one
in a sendOnly) or Promises. This will cause the device's sendOnly() call to
throw an error, but will not panic the kernel. If the device does not catch
this error, it will propagate up to the external caller (for inputs), or
signal the calling vat with a catchable error (for syscall.callNow).

but I need to handle the foreign-device-node case, at the very least, because it would still allow any vat possessing a device node to crash the kernel.

Previously, any error thrown by a device invocation would set the kernel to the "panic" state, mark the calling vat as scheduled for termination, then return an error indication to the calling vat, which makes the `syscall.callNow()` throw an exception. From the calling vat's perspective, their `D()` invocation throw a "you killed my kernel, prepare to die" error, the delivery would complete, then the entire kernel would shut down. `controller.run()` rejects its return promise, causing the host application to halt without committing state. This was fail-stop safe against surprises in the device code, but is too difficult to use in practice. Devices could not use assert() to validate caller arguments without giving the calling vat the power to shut down the entire kernel. With this change, errors during device invocation cause the calling vat to get an error, as before, but the error is neither vat-fatal nor kernel-fatal. It is still vat-fatal to pass promises in arguments to devices, where the mistake is caught by the vat's outbound translator, rather than the device's inbound translator. We'll probably relax this restriction later. The device inbound translator used to throw an error if it got a foreign device node, or a promise that somehow got past the guard above. In both cases, the translator error caused a kernel panic. This commit changes the inbound translator to handle both foreign device nodes and promises, which ought to prevent vats from triggering kernel panics via this route. Deviceslots will still reject both, but since that error occurs in the device *invocation*, the calling vat should get a (catchable) error, and the kernel will not panic. GC handling is still really sketchy around devices: do not expect objects passed to a device to ever be dropped. closes #4326

warner · 2022-01-28T08:02:29Z

Ok I think that PR should not give userspace any way to panic the kernel via a device node. There are several places where errors might be thrown:

liveslots prohibits user code from including a promise in the D() arguments, resulting in a local error
the vat-to-kernel outbound translator prohibits promises in syscall.callNow arguments, resulting in a vat-fatal translation error (vatTranslator.js translateCallNow, caught+handled by kernel.js vatSyscallHandler)
the kernel-to-device inbound translator allows all forms of slots (objects, promises, device-nodes), although it doesn't do the full GC protocol, just enough to keep inbound objects pinned (forever)
deviceSlots (for non-raw devices) rejects inbound promises and foreign device nodes, throwing an error which is reported to the calling vat (no kernel panic, not vat-fatal)
any error thrown during the device invocation is reported to the calling vat (no kernel panic, not vat-fatal)
deviceSlots translates all new outbound remotables as device nodes, so don't return a Promise or things will get weird
the device-to-kernel outbound translator rejects new objects and promises, and this would cause a kernel panic, deviceSlots won't ever emit those. A raw device could, though.
the kernel-to-vat inbound translator should accept everything

In general, errors throw in the translators cause a kernel panic, while errors thrown in deviceSlots or the device itself will be reported to the calling vat without panic.

Previously, any error thrown by a device invocation would set the kernel to the "panic" state, mark the calling vat as scheduled for termination, then return an error indication to the calling vat, which makes the `syscall.callNow()` throw an exception. From the calling vat's perspective, their `D()` invocation throw a "you killed my kernel, prepare to die" error, the delivery would complete, then the entire kernel would shut down. `controller.run()` rejects its return promise, causing the host application to halt without committing state. This was fail-stop safe against surprises in the device code, but is too difficult to use in practice. Devices could not use assert() to validate caller arguments without giving the calling vat the power to shut down the entire kernel. With this change, errors during device invocation cause the calling vat to get an error, as before, but the error is neither vat-fatal nor kernel-fatal. It is still vat-fatal to pass promises in arguments to devices, where the mistake is caught by the vat's outbound translator, rather than the device's inbound translator. We'll probably relax this restriction later. The device inbound translator used to throw an error if it got a foreign device node, or a promise that somehow got past the guard above. In both cases, the translator error caused a kernel panic. This commit changes the inbound translator to handle both foreign device nodes and promises, which ought to prevent vats from triggering kernel panics via this route. Deviceslots will still reject both, but since that error occurs in the device *invocation*, the calling vat should get a (catchable) error, and the kernel will not panic. GC handling is still really sketchy around devices: do not expect objects passed to a device to ever be dropped. closes #4326

warner added enhancement New feature or request SwingSet package: SwingSet labels Jan 19, 2022

warner added this to the Mainnet: Phase 1 - RUN Protocol milestone Jan 19, 2022

warner added the MN-1 label Jan 25, 2022

warner self-assigned this Jan 27, 2022

warner mentioned this issue Jan 28, 2022

don't panic the kernel upon a device error #4414

Merged

mergify bot closed this as completed in #4414 Jan 28, 2022

Tartuffo added this to the Mainnet 1 milestone Mar 23, 2022

warner mentioned this issue Apr 5, 2022

Sufficient durability / upgradability of Vat contracts for launch #5014

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

device errors should not (always) kill the kernel #4326

device errors should not (always) kill the kernel #4326

warner commented Jan 19, 2022

warner commented Jan 27, 2022

warner commented Jan 28, 2022

device errors should not (always) kill the kernel #4326

device errors should not (always) kill the kernel #4326

Comments

warner commented Jan 19, 2022

What is the Problem Being Solved?

Description of the Design

Security Considerations

Test Plan

warner commented Jan 27, 2022

warner commented Jan 28, 2022