-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Cranelift: add patchable call instructions. #12101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cranelift: add patchable call instructions. #12101
Conversation
The new `patchable_call` CLIF instruction pairs with the `patchable` ABI, and emits a callsite with one new key property: the MachBuffer carries metadata that describes exactly which byte range to "NOP out" (overwrite with NOP instructions) to disable that callsite. Doing so is semantically valid and explicitly supported. This enables patching of code at runtime to dynamically turn on and off features such as instrumentation or debugging hooks. We plan to use this to implement breakpoints in Wasmtime's guest debugging support. As part of this change, I added a notion of "unit of NOP bytes" to the MachBuffer so that the consumer (e.g., Wasmtime's Cranelift-based code compilation pipeline and metadata-producing logic) can handle patchable callsites without any other special knowledge of the ISA. For the "real metal" ISAs there are perfectly well-defined NOPs to use, but for Pulley, where all opcodes are assigned at compile time by macro magic, I explicitly defined NOP as opcode byte 0 by moving `Nop`'s definition to the top of the list and adding a unit test asserting its encoding. A design note: in principle it would be possible, as an alternative, to treat "patchability" as an orthogonal dimension of all callsites, and emit the metadata describing the instruction-offset range for any callsite with the flag set. The only truly necessary semantic restriction is that there are no return values (because if we turn the callsite off, nothing writes to them); we could support patchability for other ABIs and for the other kinds of call instructions. The `patchable` ABI would then be better described as something like the "no clobbers ABI". I opted not to generalize in this way because it creates some less-tested corners and the generalized form, at least at the MachInst level, is not really much simpler in the end. A testing note: I opted not to implement actual code patching in the `cranelift-tools` filetest runner and test patching callsites in/out via some actuation (e.g. a magic hostcall, like we do for throws) because (i) that's a lot of new plumbing and (ii) we are going to test this very shortly in Wasmtime anyway and (iii) the correctness (or not) of the location-and-length metadata is easy enough to verify in the disassemblies in the compile-tests.
alexcrichton
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Pulley, I'd agree it'd be nice to not have to rely on opcode 0 since that'd enable us to move it around and add payloads as necessary too. I vaguely recall seeing somewhere that some ISA (maybe wasm?) has opcode 0 as a trap which means if you accidentally start executing a bunch of zeros it'll die quickly as opposed to sliding into whatever's next and start executing that.
Otherwise though, I'm realizing that in the context of epochs-with-signal-handlers this is a pretty powerful instruction since, in theory, a patchable load (with the right ABI) could be replaced with a load-from-thing. More generally everything we could want to do with some sort of resumable exception seems like it would ideally want to go through this mechanism (ish) where IR-wise it's a call but runtime-wise we patch it to something that's a context-specific nop and signal handlers get to "rewrite" it to a call.
Given this, would it be worth fleshing out the ABI a bit here? For example is there a downside to saying that the patchable ABI clobbers 1 or 2 registers? Or something along those lines in lieu of attempting to encode the ABI in the function itself or something like that.
I've not thought about this too too deeply per se, but it sort of feels like this is on the cusp of being a framework at least for catch-a-thing-with-a-signal-and-go-back-to-wasm albeit not precisely suitable as-is (due to no clobbers). I realize though that this goal is orthogonal to debugging stuff, so if you'd prefer to not try to think about this now that's also fine too
|
I agree that code-patching could be the basis for a lot of other mechanisms that push cost away from the common case (leaving it to be "just the cost of fetching some NOPs") and toward an uncommon case. In particular I like the idea (that I think you're suggesting?) of having a trap context rewrite all epoch-yield callsites into actual calls, while they're NOPs by default. The one difficulty to keep in mind wrt anything involving signals is that patching the code will also require ownership of the ... but now that I think about that, the existing instruction is already suitable: the ABI is "no clobbers but there are still register args", so a @erikrose there's a potential epoch-yield mechanism that requires no dead loads/stores at all -- happy to discuss further. I'm not sure what other generalizations we could consider here other than the ones mentioned in my commit message above (any callsite / any ABI vs |
|
Ah, actually, I'm thinking in terms of a signal taken at the site itself; a signal taken from an arbitrary point (say |
|
So I guess the instruction one would actually want for the epochs case is |
|
I should clarify I'm also stretching the imagination by using the term "patchable" here for what I'm saying. What I'd roughly envision is that we could have instructions (like epoch yields) which are translated to CLIF as a If we got this working though this would all naturally extend to breakpoints where we could change the size of the patchable call to the size of a breakpoint instruction for that purpose. In general my rough hope is that we wouldn't need a CLIF-instruction-per-use-case and could funnel them through one relatively narrow waist ideally. Until there's a solution for x64 Windows exceptions, however, I don't think it's worth pushing too much on this because anything non-portable is only of niche value to have lots of infrastructure supporting it |
|
A few thoughts:
All that said, I think this discussion is well ahead of what we need to get breakpoints-MVP, so I'd prefer to defer it if that's OK and get review on the PR as-is, unless you think it needs to change in the short term! |
👍 from me. I've added this to the Cranelift meeting where we can continue discussion |
fitzgen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo the pulley nop thing.
Fine with not exercising the actual patching in cranelift-tools, since we have filetests, and all the actual patching will come in time as Wasmtime starts using this stuff.
cb1bac3 to
4451de1
Compare
This is a PR that puts together a bunch of earlier pieces (patchable calls in bytecodealliance#12061 and bytecodealliance#12101, private copies of code in bytecodealliance#12051, and all the prior debug event and instrumentation infrastructure) to implement breakpoints in the guest debugger. These are implemented in the way we have planned in bytecodealliance#11964: each sequence point (location prior to a Wasm opcode) is now a patchable call instruction, patched out (replaced with NOPs) by default. When patched in, the breakpoint callsite calls a trampoline with the `patchable` ABI which then invokes the `breakpoint` hostcall. That hostcall emits the debug event and nothing else. A few of the interesting bits in this PR include: - Implementations of "unpublish" (switch permissions back to read/write from read/execute) for mmap'd code memory on all our platforms. - Infrastructure in the frame-tables (debug info) metadata producer and parser to record "breakpoint patches". - A tweak to the NOP metadata packaged with the `MachBuffer` to allow multiple NOP sizes. This lets us use one 5-byte NOP on x86-64, for example (did you know x86-64 had these?!) rather than five 1-byte NOPs. This PR also implements single-stepping with a global-per-`Store` flag, because at this point why not; it's a small additional bit of logic to do *all* patches in all modules registered in the `Store` when that flag is enabled. A few realizations for future work: - The need for an introspection API available to a debugger to see the modules within a component is starting to become clear; either that, or the "module and PC" location identifier for a breakpoint switches to a "module or component" sum type. Right now, the tests for this feature use only core modules. Extending to components should not actually be hard at all, we just need to build the API for it. - The interaction between inlining and `patchable_call` is interesting: what happens if we inline a `patchable_call` at a `try_call` callsite? Right now, we do *not* update the `patchable_call` to a `try_call`, because there is no `patchable_try_call`; this is fine in the Wasmtime embedding in practice because we never (today!) throw exceptions from a breakpoint handler. This does suggest to me that maybe we should make patchability a property of any callsite, and allow try-calls to be patchable too (with the same restriction about no return values as the only restriction); but happy to discuss that one further.
This is a PR that puts together a bunch of earlier pieces (patchable calls in bytecodealliance#12061 and bytecodealliance#12101, private copies of code in bytecodealliance#12051, and all the prior debug event and instrumentation infrastructure) to implement breakpoints in the guest debugger. These are implemented in the way we have planned in bytecodealliance#11964: each sequence point (location prior to a Wasm opcode) is now a patchable call instruction, patched out (replaced with NOPs) by default. When patched in, the breakpoint callsite calls a trampoline with the `patchable` ABI which then invokes the `breakpoint` hostcall. That hostcall emits the debug event and nothing else. A few of the interesting bits in this PR include: - Implementations of "unpublish" (switch permissions back to read/write from read/execute) for mmap'd code memory on all our platforms. - Infrastructure in the frame-tables (debug info) metadata producer and parser to record "breakpoint patches". - A tweak to the NOP metadata packaged with the `MachBuffer` to allow multiple NOP sizes. This lets us use one 5-byte NOP on x86-64, for example (did you know x86-64 had these?!) rather than five 1-byte NOPs. This PR also implements single-stepping with a global-per-`Store` flag, because at this point why not; it's a small additional bit of logic to do *all* patches in all modules registered in the `Store` when that flag is enabled. A few realizations for future work: - The need for an introspection API available to a debugger to see the modules within a component is starting to become clear; either that, or the "module and PC" location identifier for a breakpoint switches to a "module or component" sum type. Right now, the tests for this feature use only core modules. Extending to components should not actually be hard at all, we just need to build the API for it. - The interaction between inlining and `patchable_call` is interesting: what happens if we inline a `patchable_call` at a `try_call` callsite? Right now, we do *not* update the `patchable_call` to a `try_call`, because there is no `patchable_try_call`; this is fine in the Wasmtime embedding in practice because we never (today!) throw exceptions from a breakpoint handler. This does suggest to me that maybe we should make patchability a property of any callsite, and allow try-calls to be patchable too (with the same restriction about no return values as the only restriction); but happy to discuss that one further.
This is a PR that puts together a bunch of earlier pieces (patchable calls in bytecodealliance#12061 and bytecodealliance#12101, private copies of code in bytecodealliance#12051, and all the prior debug event and instrumentation infrastructure) to implement breakpoints in the guest debugger. These are implemented in the way we have planned in bytecodealliance#11964: each sequence point (location prior to a Wasm opcode) is now a patchable call instruction, patched out (replaced with NOPs) by default. When patched in, the breakpoint callsite calls a trampoline with the `patchable` ABI which then invokes the `breakpoint` hostcall. That hostcall emits the debug event and nothing else. A few of the interesting bits in this PR include: - Implementations of "unpublish" (switch permissions back to read/write from read/execute) for mmap'd code memory on all our platforms. - Infrastructure in the frame-tables (debug info) metadata producer and parser to record "breakpoint patches". - A tweak to the NOP metadata packaged with the `MachBuffer` to allow multiple NOP sizes. This lets us use one 5-byte NOP on x86-64, for example (did you know x86-64 had these?!) rather than five 1-byte NOPs. This PR also implements single-stepping with a global-per-`Store` flag, because at this point why not; it's a small additional bit of logic to do *all* patches in all modules registered in the `Store` when that flag is enabled. A few realizations for future work: - The need for an introspection API available to a debugger to see the modules within a component is starting to become clear; either that, or the "module and PC" location identifier for a breakpoint switches to a "module or component" sum type. Right now, the tests for this feature use only core modules. Extending to components should not actually be hard at all, we just need to build the API for it. - The interaction between inlining and `patchable_call` is interesting: what happens if we inline a `patchable_call` at a `try_call` callsite? Right now, we do *not* update the `patchable_call` to a `try_call`, because there is no `patchable_try_call`; this is fine in the Wasmtime embedding in practice because we never (today!) throw exceptions from a breakpoint handler. This does suggest to me that maybe we should make patchability a property of any callsite, and allow try-calls to be patchable too (with the same restriction about no return values as the only restriction); but happy to discuss that one further.
This is a PR that puts together a bunch of earlier pieces (patchable calls in bytecodealliance#12061 and bytecodealliance#12101, private copies of code in bytecodealliance#12051, and all the prior debug event and instrumentation infrastructure) to implement breakpoints in the guest debugger. These are implemented in the way we have planned in bytecodealliance#11964: each sequence point (location prior to a Wasm opcode) is now a patchable call instruction, patched out (replaced with NOPs) by default. When patched in, the breakpoint callsite calls a trampoline with the `patchable` ABI which then invokes the `breakpoint` hostcall. That hostcall emits the debug event and nothing else. A few of the interesting bits in this PR include: - Implementations of "unpublish" (switch permissions back to read/write from read/execute) for mmap'd code memory on all our platforms. - Infrastructure in the frame-tables (debug info) metadata producer and parser to record "breakpoint patches". - A tweak to the NOP metadata packaged with the `MachBuffer` to allow multiple NOP sizes. This lets us use one 5-byte NOP on x86-64, for example (did you know x86-64 had these?!) rather than five 1-byte NOPs. This PR also implements single-stepping with a global-per-`Store` flag, because at this point why not; it's a small additional bit of logic to do *all* patches in all modules registered in the `Store` when that flag is enabled. A few realizations for future work: - The need for an introspection API available to a debugger to see the modules within a component is starting to become clear; either that, or the "module and PC" location identifier for a breakpoint switches to a "module or component" sum type. Right now, the tests for this feature use only core modules. Extending to components should not actually be hard at all, we just need to build the API for it. - The interaction between inlining and `patchable_call` is interesting: what happens if we inline a `patchable_call` at a `try_call` callsite? Right now, we do *not* update the `patchable_call` to a `try_call`, because there is no `patchable_try_call`; this is fine in the Wasmtime embedding in practice because we never (today!) throw exceptions from a breakpoint handler. This does suggest to me that maybe we should make patchability a property of any callsite, and allow try-calls to be patchable too (with the same restriction about no return values as the only restriction); but happy to discuss that one further.
The new
patchable_callCLIF instruction pairs with thepatchableABI, and emits a callsite with one new key property: the MachBuffer carries metadata that describes exactly which byte range to "NOP out" (overwrite with NOP instructions) to disable that callsite. Doing so is semantically valid and explicitly supported.This enables patching of code at runtime to dynamically turn on and off features such as instrumentation or debugging hooks. We plan to use this to implement breakpoints in Wasmtime's guest debugging support.
As part of this change, I added a notion of "unit of NOP bytes" to the MachBuffer so that the consumer (e.g., Wasmtime's Cranelift-based code compilation pipeline and metadata-producing logic) can handle patchable callsites without any other special knowledge of the ISA.
For the "real metal" ISAs there are perfectly well-defined NOPs to use, but for Pulley, where all opcodes are assigned at compile time by macro magic, I explicitly defined NOP as opcode byte 0 by moving
Nop's definition to the top of the list and adding a unit test asserting its encoding.A design note: in principle it would be possible, as an alternative, to treat "patchability" as an orthogonal dimension of all callsites, and emit the metadata describing the instruction-offset range for any callsite with the flag set. The only truly necessary semantic restriction is that there are no return values (because if we turn the callsite off, nothing writes to them); we could support patchability for other ABIs and for the other kinds of call instructions. The
patchableABI would then be better described as something like the "no clobbers ABI". I opted not to generalize in this way because it creates some less-tested corners and the generalized form, at least at the MachInst level, is not really much simpler in the end.A testing note: I opted not to implement actual code patching in the
cranelift-toolsfiletest runner and test patching callsites in/out via some actuation (e.g. a magic hostcall, like we do for throws) because (i) that's a lot of new plumbing and (ii) we are going to test this very shortly in Wasmtime anyway and (iii) the correctness (or not) of the location-and-length metadata is easy enough to verify in the disassemblies in the compile-tests.