Conversation

@cfallin cfallin commented Dec 1, 2025

The new `patchable_call` CLIF instruction pairs with the `patchable`
ABI, and emits a callsite with one new key property: the MachBuffer
carries metadata that describes exactly which byte range to "NOP out"
(overwrite with NOP instructions) to disable that callsite. Doing so is
semantically valid and explicitly supported.

This enables patching of code at runtime to dynamically turn on and off
features such as instrumentation or debugging hooks. We plan to use this
to implement breakpoints in Wasmtime's guest debugging support.

As part of this change, I added a notion of "unit of NOP bytes" to the
MachBuffer so that the consumer (e.g., Wasmtime's Cranelift-based code
compilation pipeline and metadata-producing logic) can handle patchable
callsites without any other special knowledge of the ISA.
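
Roughly, a consumer could use that metadata along the following lines. This is a minimal sketch; the names here (`PatchableCallSite`, `nop_out`, the exact shape of the NOP-unit value) are illustrative stand-ins, not the actual MachBuffer API:

```rust
/// Illustrative shape of the per-callsite metadata: a byte range within the
/// emitted function body that may be overwritten with NOPs.
struct PatchableCallSite {
    offset: usize, // start of the callsite, relative to the function body
    len: usize,    // length in bytes; a multiple of the NOP unit's length
}

/// Disable a patchable callsite by overwriting its byte range with NOPs,
/// using only the ISA-agnostic "NOP unit" reported by the backend (e.g. a
/// 1-byte NOP on x86-64, a 4-byte NOP on aarch64, opcode byte 0 on Pulley).
fn nop_out(code: &mut [u8], site: &PatchableCallSite, nop_unit: &[u8]) {
    let range = &mut code[site.offset..site.offset + site.len];
    debug_assert_eq!(range.len() % nop_unit.len(), 0);
    for chunk in range.chunks_exact_mut(nop_unit.len()) {
        chunk.copy_from_slice(nop_unit);
    }
}
```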

For the "real metal" ISAs there are perfectly well-defined NOPs to use,
but for Pulley, where all opcodes are assigned at compile time by macro
magic, I explicitly defined NOP as opcode byte 0 by moving `Nop`'s
definition to the top of the list and adding a unit test asserting its
encoding.
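
The shape of that assertion is roughly the following (a minimal sketch; `Opcode` stands in for Pulley's generated opcode enum, and the real test's names and location may differ):

```rust
// Sketch only: assert that Pulley's `Nop` opcode is assigned encoding 0, so a
// patched-out callsite is simply a run of zero bytes.
#[test]
fn nop_is_opcode_zero() {
    assert_eq!(Opcode::Nop as u8, 0);
}
```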

A design note: in principle it would be possible, as an alternative, to
treat "patchability" as an orthogonal dimension of all callsites, and
emit the metadata describing the instruction-offset range for any
callsite with the flag set. The only truly necessary semantic
restriction is that there are no return values (because if we turn the
callsite off, nothing writes to them); we could support patchability for
other ABIs and for the other kinds of call instructions. The `patchable`
ABI would then be better described as something like the "no clobbers
ABI". I opted not to generalize in this way because it creates some
less-tested corners and the generalized form, at least at the MachInst
level, is not really much simpler in the end.

A testing note: I opted not to implement actual code patching in the
`cranelift-tools` filetest runner and test patching callsites in/out via
some actuation (e.g. a magic hostcall, like we do for throws) because
(i) that's a lot of new plumbing and (ii) we are going to test this very
shortly in Wasmtime anyway and (iii) the correctness (or not) of the
location-and-length metadata is easy enough to verify in the
disassemblies in the compile-tests.
@cfallin cfallin requested review from a team as code owners December 1, 2025 06:05
@cfallin cfallin requested review from abrown and fitzgen and removed request for a team and abrown December 1, 2025 06:05
@github-actions github-actions bot added the cranelift, cranelift:area:machinst, cranelift:area:aarch64, cranelift:area:x64, and cranelift:meta labels Dec 1, 2025
@alexcrichton alexcrichton left a comment

For Pulley, I'd agree it'd be nice not to have to rely on opcode 0, since that'd enable us to move it around and add payloads as necessary too. I vaguely recall seeing somewhere that some ISA (maybe Wasm?) has opcode 0 as a trap, which means that if you accidentally start executing a bunch of zeros it dies quickly rather than sliding into whatever's next and executing that.

Otherwise though, I'm realizing that in the context of epochs-with-signal-handlers this is a pretty powerful instruction since, in theory, a patchable load (with the right ABI) could be replaced with a load-from-thing. More generally everything we could want to do with some sort of resumable exception seems like it would ideally want to go through this mechanism (ish) where IR-wise it's a call but runtime-wise we patch it to something that's a context-specific nop and signal handlers get to "rewrite" it to a call.

Given this, would it be worth fleshing out the ABI a bit here? For example is there a downside to saying that the patchable ABI clobbers 1 or 2 registers? Or something along those lines in lieu of attempting to encode the ABI in the function itself or something like that.

I've not thought about this too deeply, but it feels like this is on the cusp of being a framework, at least, for catch-a-thing-with-a-signal-and-go-back-to-wasm, albeit not precisely suitable as-is (due to no clobbers). I realize, though, that this goal is orthogonal to the debugging work, so if you'd prefer not to think about it now, that's fine too.

cfallin commented Dec 1, 2025

I agree that code-patching could be the basis for a lot of other mechanisms that push cost away from the common case (leaving it to be "just the cost of fetching some NOPs") and toward an uncommon case. In particular I like the idea (that I think you're suggesting?) of having a trap context rewrite all epoch-yield callsites into actual calls, while they're NOPs by default.

The one difficulty to keep in mind wrt anything involving signals is that patching the code will also require ownership of the Store, which is something that we don't yet have in a signal context; I guess that's where the modified ABI comes in, so "vmctx is an implicit argument in a known fixed register" is a way around that.

... but now that I think about that, the existing instruction is already suitable: the ABI is "no clobbers but there are still register args", so a patchable_call %epoch_yield(v0) where v0 is vmctx will already have vmctx in rdi or x0 or whatever, even when the callsite is in its default NOP'd-out state. So from that we could get the Store (Instance::enter_from_wasm etc) and from that patch all sites.

@erikrose there's a potential epoch-yield mechanism that requires no dead loads/stores at all -- happy to discuss further.

I'm not sure what other generalizations we could consider here other than the ones mentioned in my commit message above (any callsite / any ABI vs patchable only) but for that dimension I think I want to stick with what we have for the time being. If there's some other instruction we should have here too I'm happy to discuss further of course (maybe as part of a follow-on).

cfallin commented Dec 1, 2025

Ah, actually, I'm thinking in terms of a signal taken at the site itself; a signal taken from an arbitrary point (say SIGALRM on a periodic basis) still has no way to get Store ownership in the general case, so scratch that...

cfallin commented Dec 1, 2025

So I guess the instruction one would actually want for the epochs case is patchable_call_or_load %epoch_yield(v0), v1 with v1 being the page that gets unmapped to cause a SIGSEGV; then in that context we do the patch (and there we do have the Store). Happy to sketch how that should look if @erikrose is curious.

@alexcrichton

I should clarify that I'm stretching the meaning of "patchable" a bit for what I'm describing here. What I'd roughly envision is that we could have instructions (like epoch yields) which are translated to CLIF as a patchable_call instruction that gets machine code for "do the call". During compilation finalization, however, we'd rewrite all those call instructions to "do a load from rdi" or whatever register has the address of the load. At runtime nothing would actually end up being patched (avoiding the code copies that make instantiation expensive), and the signal handler would, in effect, do the patch for the code itself. The only thing that doesn't work in this scheme is the return address on x64, which must be located on the stack, and we can't modify the stack in Windows vectored exception handlers (I even double-checked, and apparently the red zone is 0 bytes on x64).

If we got this working though this would all naturally extend to breakpoints where we could change the size of the patchable call to the size of a breakpoint instruction for that purpose.

In general my rough hope is that we wouldn't need a CLIF-instruction-per-use-case and could ideally funnel all of these through one relatively narrow waist. Until there's a solution for x64 Windows exceptions, however, I don't think it's worth pushing too much on this, because anything non-portable is of only niche value and doesn't justify a lot of infrastructure supporting it.

cfallin commented Dec 1, 2025

A few thoughts:

  • a "call or something else" doesn't require mutating the stack (fortunately, I think!) because within the signal handler we just update the code we're about to resume at; then the actual call executes and does the return-address push itself, so this is strictly better than the difficulties we had with the various call-injection trampolines
  • I am imagining that when used for breakpoints, we do the patching to NOP during code finalization, so the image on disk is all NOPs and we have the bytes to patch in for a breakpoint call stored in metadata; so there's no CoW cost for any of this in the default case
  • That said, I'm not totally sure I have the full picture of what you'd want a breakpoint to be -- currently my thought is NOP-by-default, patched-to-call. Where does the load come in?
  • Making this a more general framework where we could have (say) any two instructions, or patch one to another, or whatever, seems difficult from a compiler architecture point of view. I think the direction I see as feasible would be a "swiss army knife patchable instruction" that knows how to (provides metadata to) become (i) a call, (ii) a load, (iii) a NOP, (iv) maybe something else. Then any use case that needs to patch between any subset of these possible side-effects is subsumed by it.

All that said, I think this discussion is well ahead of what we need to get breakpoints-MVP, so I'd prefer to defer it if that's OK and get review on the PR as-is, unless you think it needs to change in the short term!

@alexcrichton

> All that said, I think this discussion is well ahead of what we need to get breakpoints-MVP, so I'd prefer to defer it if that's OK and get review on the PR as-is, unless you think it needs to change in the short term!

👍 from me. I've added this to the Cranelift meeting where we can continue discussion

@fitzgen fitzgen left a comment

LGTM modulo the pulley nop thing.

Fine with not exercising the actual patching in cranelift-tools, since we have filetests, and all the actual patching will come in time as Wasmtime starts using this stuff.

@cfallin cfallin force-pushed the magical-disappearing-patchable-call-now-you-see-it-now-you-dont branch from cb1bac3 to 4451de1 on December 2, 2025 00:36
@cfallin cfallin enabled auto-merge December 2, 2025 00:37
@cfallin cfallin added this pull request to the merge queue Dec 2, 2025
Merged via the queue into bytecodealliance:main with commit c00e9ea Dec 2, 2025
76 checks passed
@cfallin cfallin deleted the magical-disappearing-patchable-call-now-you-see-it-now-you-dont branch December 2, 2025 01:22
@cfallin cfallin added the wasmtime:debugging label Dec 2, 2025
cfallin added a commit to cfallin/wasmtime that referenced this pull request Dec 6, 2025
This is a PR that puts together a bunch of earlier pieces (patchable
calls in bytecodealliance#12061 and bytecodealliance#12101, private copies of code in bytecodealliance#12051, and all
the prior debug event and instrumentation infrastructure) to implement
breakpoints in the guest debugger.

These are implemented in the way we have planned in bytecodealliance#11964: each
sequence point (location prior to a Wasm opcode) is now a patchable call
instruction, patched out (replaced with NOPs) by default. When patched
in, the breakpoint callsite calls a trampoline with the `patchable` ABI
which then invokes the `breakpoint` hostcall. That hostcall emits the
debug event and nothing else.

A few of the interesting bits in this PR include:
- Implementations of "unpublish" (switch permissions back to read/write
  from read/execute) for mmap'd code memory on all our platforms.
- Infrastructure in the frame-tables (debug info) metadata producer and
  parser to record "breakpoint patches".
- A tweak to the NOP metadata packaged with the `MachBuffer` to allow
  multiple NOP sizes. This lets us use one 5-byte NOP on x86-64, for
  example (did you know x86-64 had these?!) rather than five 1-byte
  NOPs.

This PR also implements single-stepping with a global-per-`Store` flag,
because at this point why not; it's a small additional bit of logic to
do *all* patches in all modules registered in the `Store` when that flag
is enabled.

A few realizations for future work:
- The need for an introspection API available to a debugger to see the
  modules within a component is starting to become clear; either that,
  or the "module and PC" location identifier for a breakpoint switches
  to a "module or component" sum type. Right now, the tests for this
  feature use only core modules. Extending to components should not
  actually be hard at all, we just need to build the API for it.
- The interaction between inlining and `patchable_call` is interesting:
  what happens if we inline a `patchable_call` at a `try_call` callsite?
  Right now, we do *not* update the `patchable_call` to a `try_call`,
  because there is no `patchable_try_call`; this is fine in the Wasmtime
  embedding in practice because we never (today!) throw exceptions from
  a breakpoint handler. This does suggest to me that maybe we should
  make patchability a property of any callsite, and allow try-calls to
  be patchable too (with the same restriction about no return values as
  the only restriction); but happy to discuss that one further.
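
For orientation, the enable-a-breakpoint flow described in that commit message might look roughly like the sketch below. Every name here (`CodeMemory`, `unpublish`, `publish`, `BreakpointPatch`) is a hypothetical stand-in for illustration, not the actual Wasmtime API:

```rust
/// Stand-in for the runtime's code-memory handle (a private copy of the
/// module's machine code, per the referenced private-code-copies PR).
struct CodeMemory {
    bytes: Vec<u8>,
}

impl CodeMemory {
    fn unpublish(&mut self) { /* mprotect: read/execute -> read/write */ }
    fn publish(&mut self) { /* mprotect back to read/execute, flush icache where needed */ }
}

/// Stand-in for one recorded "breakpoint patch": where the patched-out
/// callsite lives and the original call bytes to splice back in.
struct BreakpointPatch {
    offset: usize,       // byte offset of the callsite within the code image
    call_bytes: Vec<u8>, // the call sequence that was NOP'd out at finalization
}

/// Enable a breakpoint at one sequence point: make the code writable, write
/// the recorded call bytes over the NOPs, then re-publish. Disabling does the
/// reverse, writing NOPs back over the same range.
fn enable_breakpoint(code: &mut CodeMemory, patch: &BreakpointPatch) {
    code.unpublish();
    code.bytes[patch.offset..patch.offset + patch.call_bytes.len()]
        .copy_from_slice(&patch.call_bytes);
    code.publish();
}
```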