Conversation

@cfallin cfallin commented Dec 1, 2025

The new `patchable_call` CLIF instruction pairs with the `patchable`
ABI, and emits a callsite with one new key property: the MachBuffer
carries metadata that describes exactly which byte range to "NOP out"
(overwrite with NOP instructions) to disable that callsite. Doing so is
semantically valid and explicitly supported.

This enables patching of code at runtime to dynamically turn on and off
features such as instrumentation or debugging hooks. We plan to use this
to implement breakpoints in Wasmtime's guest debugging support.

As part of this change, I added a notion of "unit of NOP bytes" to the
MachBuffer so that the consumer (e.g., Wasmtime's Cranelift-based code
compilation pipeline and metadata-producing logic) can handle patchable
callsites without any other special knowledge of the ISA.
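
Roughly, a consumer could use that metadata along the following lines. This is a minimal sketch; the names here (`PatchableCallSite`, `nop_out`, the exact shape of the NOP-unit value) are illustrative stand-ins, not the actual MachBuffer API:

```rust
/// Illustrative shape of the per-callsite metadata: a byte range within the
/// emitted function body that may be overwritten with NOPs.
struct PatchableCallSite {
    offset: usize, // start of the callsite, relative to the function body
    len: usize,    // length in bytes; a multiple of the NOP unit's length
}

/// Disable a patchable callsite by overwriting its byte range with NOPs,
/// using only the ISA-agnostic "NOP unit" reported by the backend (e.g. a
/// 1-byte NOP on x86-64, a 4-byte NOP on aarch64, opcode byte 0 on Pulley).
fn nop_out(code: &mut [u8], site: &PatchableCallSite, nop_unit: &[u8]) {
    let range = &mut code[site.offset..site.offset + site.len];
    debug_assert_eq!(range.len() % nop_unit.len(), 0);
    for chunk in range.chunks_exact_mut(nop_unit.len()) {
        chunk.copy_from_slice(nop_unit);
    }
}
```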

For the "real metal" ISAs there are perfectly well-defined NOPs to use,
but for Pulley, where all opcodes are assigned at compile time by macro
magic, I explicitly defined NOP as opcode byte 0 by moving `Nop`'s
definition to the top of the list and adding a unit test asserting its
encoding.
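
The shape of that assertion is roughly the following (a minimal sketch; `Opcode` stands in for Pulley's generated opcode enum, and the real test's names and location may differ):

```rust
// Sketch only: assert that Pulley's `Nop` opcode is assigned encoding 0, so a
// patched-out callsite is simply a run of zero bytes.
#[test]
fn nop_is_opcode_zero() {
    assert_eq!(Opcode::Nop as u8, 0);
}
```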

A design note: in principle it would be possible, as an alternative, to
treat "patchability" as an orthogonal dimension of all callsites, and
emit the metadata describing the instruction-offset range for any
callsite with the flag set. The only truly necessary semantic
restriction is that there are no return values (because if we turn the
callsite off, nothing writes to them); we could support patchability for
other ABIs and for the other kinds of call instructions. The `patchable`
ABI would then be better described as something like the "no clobbers
ABI". I opted not to generalize in this way because it creates some
less-tested corners and the generalized form, at least at the MachInst
level, is not really much simpler in the end.

A testing note: I opted not to implement actual code patching in the
`cranelift-tools` filetest runner and test patching callsites in/out via
some actuation (e.g. a magic hostcall, like we do for throws) because
(i) that's a lot of new plumbing and (ii) we are going to test this very
shortly in Wasmtime anyway and (iii) the correctness (or not) of the
location-and-length metadata is easy enough to verify in the
disassemblies in the compile-tests.
@cfallin cfallin requested review from a team as code owners December 1, 2025 06:05
@cfallin cfallin requested review from abrown and fitzgen and removed request for a team and abrown December 1, 2025 06:05
@github-actions github-actions bot added the cranelift, cranelift:area:machinst, cranelift:area:aarch64, cranelift:area:x64, and cranelift:meta labels Dec 1, 2025
@alexcrichton alexcrichton left a comment

For Pulley, I'd agree it'd be nice not to have to rely on opcode 0, since that'd enable us to move it around and add payloads as necessary too. I vaguely recall seeing somewhere that some ISA (maybe Wasm?) has opcode 0 as a trap, which means that if you accidentally start executing a bunch of zeros it dies quickly rather than sliding into whatever's next and executing that.

Otherwise though, I'm realizing that in the context of epochs-with-signal-handlers this is a pretty powerful instruction since, in theory, a patchable load (with the right ABI) could be replaced with a load-from-thing. More generally everything we could want to do with some sort of resumable exception seems like it would ideally want to go through this mechanism (ish) where IR-wise it's a call but runtime-wise we patch it to something that's a context-specific nop and signal handlers get to "rewrite" it to a call.

Given this, would it be worth fleshing out the ABI a bit here? For example is there a downside to saying that the patchable ABI clobbers 1 or 2 registers? Or something along those lines in lieu of attempting to encode the ABI in the function itself or something like that.

I've not thought about this too deeply, but it feels like this is on the cusp of being a framework, at least, for catch-a-thing-with-a-signal-and-go-back-to-wasm, albeit not precisely suitable as-is (due to no clobbers). I realize, though, that this goal is orthogonal to the debugging work, so if you'd prefer not to think about it now, that's fine too.

cfallin commented Dec 1, 2025

I agree that code-patching could be the basis for a lot of other mechanisms that push cost away from the common case (leaving it to be "just the cost of fetching some NOPs") and toward an uncommon case. In particular I like the idea (that I think you're suggesting?) of having a trap context rewrite all epoch-yield callsites into actual calls, while they're NOPs by default.

The one difficulty to keep in mind wrt anything involving signals is that patching the code will also require ownership of the Store, which is something that we don't yet have in a signal context; I guess that's where the modified ABI comes in, so "vmctx is an implicit argument in a known fixed register" is a way around that.

... but now that I think about that, the existing instruction is already suitable: the ABI is "no clobbers but there are still register args", so a patchable_call %epoch_yield(v0) where v0 is vmctx will already have vmctx in rdi or x0 or whatever, even when the callsite is in its default NOP'd-out state. So from that we could get the Store (Instance::enter_from_wasm etc) and from that patch all sites.

@erikrose there's a potential epoch-yield mechanism that requires no dead loads/stores at all -- happy to discuss further.

I'm not sure what other generalizations we could consider here other than the ones mentioned in my commit message above (any callsite / any ABI vs patchable only) but for that dimension I think I want to stick with what we have for the time being. If there's some other instruction we should have here too I'm happy to discuss further of course (maybe as part of a follow-on).

cfallin commented Dec 1, 2025

Ah, actually, I'm thinking in terms of a signal taken at the site itself; a signal taken from an arbitrary point (say SIGALRM on a periodic basis) still has no way to get Store ownership in the general case, so scratch that...

cfallin commented Dec 1, 2025

So I guess the instruction one would actually want for the epochs case is patchable_call_or_load %epoch_yield(v0), v1 with v1 being the page that gets unmapped to cause a SIGSEGV; then in that context we do the patch (and there we do have the Store). Happy to sketch how that should look if @erikrose is curious.

@alexcrichton

I should clarify that I'm stretching the meaning of "patchable" a bit for what I'm describing here. What I'd roughly envision is that we could have instructions (like epoch yields) which are translated to CLIF as a patchable_call instruction that gets machine code for "do the call". During compilation finalization, however, we'd rewrite all those call instructions to "do a load from rdi" or whatever register has the address of the load. At runtime nothing would actually end up being patched (avoiding the code copies that make instantiation expensive), and the signal handler would, in effect, do the patch for the code itself. The only thing that doesn't work in this scheme is the return address on x64, which must be located on the stack, and we can't modify the stack in Windows vectored exception handlers (I even double-checked, and apparently the red zone is 0 bytes on x64).

If we got this working though this would all naturally extend to breakpoints where we could change the size of the patchable call to the size of a breakpoint instruction for that purpose.

In general my rough hope is that we wouldn't need a CLIF-instruction-per-use-case and could ideally funnel all of these through one relatively narrow waist. Until there's a solution for x64 Windows exceptions, however, I don't think it's worth pushing too much on this, because anything non-portable is of only niche value and doesn't justify a lot of infrastructure supporting it.

cfallin commented Dec 1, 2025

A few thoughts:

  • a "call or something else" doesn't require mutating the stack (fortunately, I think!) because within the signal handler we just update the code we're about to resume at; then the actual call executes and does the return-address push itself, so this is strictly better than the difficulties we had with the various call-injection trampolines
  • I am imagining that when used for breakpoints, we do the patching to NOP during code finalization, so the image on disk is all NOPs and we have the bytes to patch in for a breakpoint call stored in metadata; so there's no CoW cost for any of this in the default case
  • That said, I'm not totally sure I have the full picture of what you'd want a breakpoint to be -- currently my thought is NOP-by-default, patched-to-call. Where does the load come in?
  • Making this a more general framework where we could have (say) any two instructions, or patch one to another, or whatever, seems difficult from a compiler architecture point of view. I think the direction I see as feasible would be a "swiss army knife patchable instruction" that knows how to (provides metadata to) become (i) a call, (ii) a load, (iii) a NOP, (iv) maybe something else. Then any use case that needs to patch between any subset of these possible side-effects is subsumed by it.

All that said, I think this discussion is well ahead of what we need to get breakpoints-MVP, so I'd prefer to defer it if that's OK and get review on the PR as-is, unless you think it needs to change in the short term!

@alexcrichton

> All that said, I think this discussion is well ahead of what we need to get breakpoints-MVP, so I'd prefer to defer it if that's OK and get review on the PR as-is, unless you think it needs to change in the short term!

👍 from me. I've added this to the Cranelift meeting where we can continue discussion

@fitzgen fitzgen left a comment

LGTM modulo the pulley nop thing.

Fine with not exercising the actual patching in cranelift-tools, since we have filetests, and all the actual patching will come in time as Wasmtime starts using this stuff.

@cfallin cfallin force-pushed the magical-disappearing-patchable-call-now-you-see-it-now-you-dont branch from cb1bac3 to 4451de1 on December 2, 2025 00:36
@cfallin cfallin enabled auto-merge December 2, 2025 00:37
@cfallin cfallin added this pull request to the merge queue Dec 2, 2025
Merged via the queue into bytecodealliance:main with commit c00e9ea Dec 2, 2025
76 checks passed
@cfallin cfallin deleted the magical-disappearing-patchable-call-now-you-see-it-now-you-dont branch December 2, 2025 01:22
@cfallin cfallin added the wasmtime:debugging label Dec 2, 2025
cfallin added a commit to cfallin/wasmtime that referenced this pull request Dec 6, 2025
This is a PR that puts together a bunch of earlier pieces (patchable
calls in bytecodealliance#12061 and bytecodealliance#12101, private copies of code in bytecodealliance#12051, and all
the prior debug event and instrumentation infrastructure) to implement
breakpoints in the guest debugger.

These are implemented in the way we have planned in bytecodealliance#11964: each
sequence point (location prior to a Wasm opcode) is now a patchable call
instruction, patched out (replaced with NOPs) by default. When patched
in, the breakpoint callsite calls a trampoline with the `patchable` ABI
which then invokes the `breakpoint` hostcall. That hostcall emits the
debug event and nothing else.

A few of the interesting bits in this PR include:
- Implementations of "unpublish" (switch permissions back to read/write
  from read/execute) for mmap'd code memory on all our platforms.
- Infrastructure in the frame-tables (debug info) metadata producer and
  parser to record "breakpoint patches".
- A tweak to the NOP metadata packaged with the `MachBuffer` to allow
  multiple NOP sizes. This lets us use one 5-byte NOP on x86-64, for
  example (did you know x86-64 had these?!) rather than five 1-byte
  NOPs.

This PR also implements single-stepping with a global-per-`Store` flag,
because at this point why not; it's a small additional bit of logic to
do *all* patches in all modules registered in the `Store` when that flag
is enabled.

A few realizations for future work:
- The need for an introspection API available to a debugger to see the
  modules within a component is starting to become clear; either that,
  or the "module and PC" location identifier for a breakpoint switches
  to a "module or component" sum type. Right now, the tests for this
  feature use only core modules. Extending to components should not
  actually be hard at all, we just need to build the API for it.
- The interaction between inlining and `patchable_call` is interesting:
  what happens if we inline a `patchable_call` at a `try_call` callsite?
  Right now, we do *not* update the `patchable_call` to a `try_call`,
  because there is no `patchable_try_call`; this is fine in the Wasmtime
  embedding in practice because we never (today!) throw exceptions from
  a breakpoint handler. This does suggest to me that maybe we should
  make patchability a property of any callsite, and allow try-calls to
  be patchable too (with the same restriction about no return values as
  the only restriction); but happy to discuss that one further.
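
For orientation, the enable-a-breakpoint flow described in that commit message might look roughly like the sketch below. Every name here (`CodeMemory`, `unpublish`, `publish`, `BreakpointPatch`) is a hypothetical stand-in for illustration, not the actual Wasmtime API:

```rust
/// Stand-in for the runtime's code-memory handle (a private copy of the
/// module's machine code, per the referenced private-code-copies PR).
struct CodeMemory {
    bytes: Vec<u8>,
}

impl CodeMemory {
    fn unpublish(&mut self) { /* mprotect: read/execute -> read/write */ }
    fn publish(&mut self) { /* mprotect back to read/execute, flush icache where needed */ }
}

/// Stand-in for one recorded "breakpoint patch": where the patched-out
/// callsite lives and the original call bytes to splice back in.
struct BreakpointPatch {
    offset: usize,       // byte offset of the callsite within the code image
    call_bytes: Vec<u8>, // the call sequence that was NOP'd out at finalization
}

/// Enable a breakpoint at one sequence point: make the code writable, write
/// the recorded call bytes over the NOPs, then re-publish. Disabling does the
/// reverse, writing NOPs back over the same range.
fn enable_breakpoint(code: &mut CodeMemory, patch: &BreakpointPatch) {
    code.unpublish();
    code.bytes[patch.offset..patch.offset + patch.call_bytes.len()]
        .copy_from_slice(&patch.call_bytes);
    code.publish();
}
```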