
Conversation

@cfallin cfallin commented Dec 6, 2025

This is a PR that puts together a bunch of earlier pieces (patchable calls in #12061 and #12101, private copies of code in #12051, and all the prior debug event and instrumentation infrastructure) to implement breakpoints in the guest debugger.

These are implemented in the way we have planned in #11964: each sequence point (location prior to a Wasm opcode) is now a patchable call instruction, patched out (replaced with NOPs) by default. When patched in, the breakpoint callsite calls a trampoline with the patchable ABI which then invokes the breakpoint hostcall. That hostcall emits the debug event and nothing else.
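For concreteness, here is a minimal sketch of the patch-in/patch-out idea on x86-64. It is illustrative only, not Wasmtime's actual patching code: the function name, the byte-slice interface, and the assumption that offsets are relative to the start of the code buffer are all invented for the example.

```rust
/// Toggle a 5-byte patch site between a single 5-byte NOP (breakpoint
/// disabled) and a `call rel32` to the trampoline (breakpoint enabled).
fn patch_site(code: &mut [u8], site: usize, target: usize, enable: bool) {
    const LEN: usize = 5;
    let bytes = &mut code[site..site + LEN];
    if enable {
        // E8 <rel32>: call, with the displacement measured from the end of
        // the instruction.
        let rel = (target as i64 - (site + LEN) as i64) as i32;
        bytes[0] = 0xE8;
        bytes[1..].copy_from_slice(&rel.to_le_bytes());
    } else {
        // 0F 1F 44 00 00: the canonical 5-byte NOP.
        bytes.copy_from_slice(&[0x0F, 0x1F, 0x44, 0x00, 0x00]);
    }
}
```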

A few of the interesting bits in this PR include:

  • Implementations of "unpublish" (switch permissions back to read/write from read/execute) for mmap'd code memory on all our platforms; a rough sketch of what this boils down to on Unix appears after this list.
  • Infrastructure in the frame-tables (debug info) metadata producer and parser to record "breakpoint patches".
  • A tweak to the NOP metadata packaged with the MachBuffer to allow multiple NOP sizes. This lets us use one 5-byte NOP on x86-64, for example (did you know x86-64 had these?!) rather than five 1-byte NOPs.
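Regarding the first bullet, a minimal sketch of what "unpublish"/"publish" boil down to on Unix-like platforms, assuming the libc crate; this is not the actual CodeMemory implementation (the function name is invented for illustration), and the real code also needs an icache flush on some architectures when re-publishing.

```rust
/// Flip a page-aligned code region between R+X ("published") and R+W
/// ("unpublished") so that breakpoint patches can be written.
unsafe fn set_code_writable(ptr: *mut u8, len: usize, writable: bool) -> std::io::Result<()> {
    let prot = if writable {
        libc::PROT_READ | libc::PROT_WRITE
    } else {
        libc::PROT_READ | libc::PROT_EXEC
    };
    if libc::mprotect(ptr.cast(), len, prot) != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}
```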

This PR also implements single-stepping with a global-per-Store flag, because at this point why not; it's a small additional bit of logic to do all patches in all modules registered in the Store when that flag is enabled.

A few realizations for future work:

  • The need for an introspection API available to a debugger to see the modules within a component is starting to become clear; either that, or the "module and PC" location identifier for a breakpoint switches to a "module or component" sum type. Right now, the tests for this feature use only core modules. Extending to components should not actually be hard at all, we just need to build the API for it.
  • The interaction between inlining and patchable_call is interesting: what happens if we inline a patchable_call at a try_call callsite? Right now, we do not update the patchable_call to a try_call, because there is no patchable_try_call; this is fine in the Wasmtime embedding in practice because we never (today!) throw exceptions from a breakpoint handler. This does suggest to me that maybe we should make patchability a property of any callsite, and allow try-calls to be patchable too (with no return values as the only restriction); but I'm happy to discuss that one further.

@cfallin cfallin requested review from a team as code owners December 6, 2025 23:31
@cfallin cfallin requested review from abrown and dicej and removed request for a team December 6, 2025 23:31
@cfallin cfallin added the wasmtime:debugging label Dec 6, 2025
@cfallin cfallin requested review from alexcrichton and fitzgen and removed request for abrown and dicej December 6, 2025 23:31
@cfallin cfallin force-pushed the wasm-breakpoints branch 2 times, most recently from 5c6fcc7 to f075ebe on December 7, 2025 00:42

cfallin commented Dec 7, 2025

The s390x failure looks like an oversight on my part in the patchable-ABI implementation on that ISA -- the clobber-save code implicitly assumes that the clobber set fits in that ABI's special clobber-save region in each frame, but that's no longer true when everything is clobbered. I'll rework it in a separate PR, then rebase this.

@github-actions github-actions bot added the cranelift, cranelift:area:machinst, cranelift:area:aarch64, cranelift:area:x64, and wasmtime:api labels Dec 7, 2025
Comment on lines +32 to +33
/// A trampoline from the patchable ABI to the given builtin.
PatchableToBuiltinTrampoline = FuncKey::new_kind(0b0100),
Member

Instead of adding a new kind here, I would have expected the ABI of the breakpoint builtin to be different from the rest. There's no use for a non-patchable version of the breakpoint builtin for example, right?

Member Author

That's true currently, yeah; the reason I built it this way is that the ABI is already a notion baked into the FuncKeyKind (explicitly during codegen for each such trampoline, and exposed via Abi too), and it seemed less error-prone to render that detail explicitly than to try to special-case one builtin (which would possibly require plumbing the builtin index into more places and conditionalizing a bunch of stuff). The lazy generation of trampolines on demand during compilation also means we don't ever actually generate a breakpoint trampoline with kind WasmToBuiltin. Maybe I'm missing a place where this leads to an efficiency concern, though?

Member

Personally I feel like functions are easier to understand when they can't have variable ABIs and instead just have a single ABI for any particular definition, so I'd lean towards removing this new kind and updating the ABI definition of the breakpoint builtin.

There might be perf-ish concerns related to Nick's recent refactorings where sets are either dense-or-not and how things are encoded in metadata, but I'm not too concerned about that. I'm mostly worried about accidentally using a non-patchable-abi at some point since everything would more-or-less "just work" until runtime if that happened.

Member Author

I guess I'm still a little skeptical: there are baked-in notions in various places that the "Wasm ABI" means standard tail, and finding any places where we have to add a special case to break that seems worse and more error-prone than saying that there is a new kind of trampoline. I guess I don't see this as a "variable ABI" so much as "we can generate as needed a trampoline of the desired ABI1-to-ABI2 variant", with current cases Wasm-to-host and Patchable-to-host. That seems cleaner to me? Curious what @fitzgen thinks as well...

Comment on lines +450 to +452
if self.mmap.supports_virtual_memory() {
self.mmap.make_readwrite(0..self.mmap.len())?;
}
Member

I think this'll want to return an error with bail! when virtual memory isn't supported

Member Author

The custom-publish case should otherwise work if we don't bail here, and is ordinarily used in non-virtual-memory builds; is there a reason we want to rule that out? This function body is basically derived by starting with publish and then switching the permissions (and cprop'ing out the "if needs executable" because we want to make R/W regardless, e.g. for pulley)

Member

I may be misunderstanding, but it looks like the custom-unpublish case returns early above, meaning it won't reach here. Returning an error if !mmap.supports_virtual_memory() matches publish above, where the operation fails if the host has virtual memory but the mapping doesn't support it. (Technically this should be an assert of some kind, since it would have failed in publish, but that's ok.)

Comment on lines +423 to +432
/// # Safety
///
/// This code memory may not be actively executing when this
/// method is invoked or after it returns until the code is
/// re-[`publish`]ed. Doing so is UB (and will likely cause a
/// segfault, or possibly worse due to icache incoherence).
///
/// If this method returns an error, the code memory is in an
/// undetermined state and must never be executed again.
pub unsafe fn unpublish(&mut self) -> Result<()> {
Member

Personally I'd say that this Safety section and unsafe are unnecessary since, if &mut exists, it's already guaranteed that this has exclusive access over the memory.

Put another way, the unsafety described here (other threads executing code) is unrelated to calling this function, insofar as it should not be possible to acquire &mut CodeMemory while other threads are executing in the code.

Member

Well, either all that, or the clause about other threads should be removed from the unsafe docs since that should not be applicable given the rules of &mut

Member Author

Yeah, I wasn't totally sure what to do here: it's not clear (to me) that executing code in the code memory is meant to be morally equivalent to an immutable borrow from that code. I guess it has to be since that's the only safe API we provide that can read the machine code (.text()) and executing instructions implies reading them. In that sense, though, the Wasm code is technically violating the borrowing invariants because it is fetching the instructions (immutable borrow) and also has a separate mut borrow to the whole Store.

Seen another way: should it be possible to cause a segfault by executing non-executable-bit regions when using only safe APIs on this struct within a hostcall? If this method isn't unsafe, then it is: we get a mut Store as a parameter to a hostcall, we unpublish, we return. Wasm code passed in a mut borrow so that was our right; but ending that borrow does not really end its effect.

(By the way that's what I was trying to capture with the edit function that takes the closure, though a refactor pulled the actual unpublish method out because I wanted to make the edits batched -- I think that's the right call but we need to figure out the above re: safe semantics)
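A toy illustration of the closure-taking edit shape described above; the type and method names here are stand-ins for illustration, not Wasmtime's actual CodeMemory API.

```rust
/// Toy stand-in for the real code-memory type, for illustration only.
struct ToyCodeMemory {
    bytes: Vec<u8>,
    executable: bool,
}

impl ToyCodeMemory {
    // A real implementation would call mprotect (and flush the icache on
    // re-publish); the toy just flips a flag.
    fn unpublish(&mut self) -> std::io::Result<()> {
        self.executable = false;
        Ok(())
    }
    fn publish(&mut self) -> std::io::Result<()> {
        self.executable = true;
        Ok(())
    }

    /// Scoped edit: the code is writable only inside the closure and is
    /// re-published before control returns, so a safe caller never observes
    /// the memory left non-executable.
    fn edit<R>(&mut self, f: impl FnOnce(&mut [u8]) -> R) -> std::io::Result<R> {
        self.unpublish()?;
        let result = f(&mut self.bytes);
        self.publish()?;
        Ok(result)
    }
}
```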

Member Author

(I don't see the word "thread" anywhere in the docs on this method but let me know which part you want clarified and I'm happy to do so!)

Member

Yeah, I've never quite wrapped my head around modeling all of this in Rust's borrowing; it feels a bit like a leaky abstraction. One way to interpret this is that wasm gave the host a &mut Thing borrow and the host did *thing = Thing::default(). Another way to interpret it though is that wasm is holding pointers into Thing (e.g. return addresses on the stack) so it's not valid for wasm to hand out &mut Thing to the host.

On the other hand though a controlled deterministic crash is not something I would consider unsafe. For example aborting the process is a safe operation and here the possible fault of forgetting to call publish is a deterministic "you don't have permissions to execute code there" failure.

So overall I feel like this still doesn't meet the criteria of being unsafe. I'm not aware of UB if this function is called in how it interacts with the rest of the system at least.

As for edit -- it makes sense to have that style of API but I don't think that this is the right layer to implement it. Rather the BreakpointEdit abstraction you added feels more appropriate. Additionally, when designing unsafe primitives I've historically felt that the abstraction needs to be as easy to use as possible without increasing the risk of UB. For example, text_mut doesn't increase the risk of UB; it just makes it easier to use. Effectively, writing non-idiomatic code is in a sense more risky than using standard borrowing/etc. idioms with clearly defined unsafe contracts and such.

As for "thread", sorry about that; I skimmed the docs and just interpreted actively-in-use as threads in the system. What I mean is that &mut CodeMemory provides a static guarantee that it's not actively in use, and thus that part of the unsafety contract can go away, as Rust's type system provides such a guarantee.

Comment on lines +639 to +647
if dirty_modules.insert(store_code_pc) {
// SAFETY: We have a mutable borrow to the `registry`,
// which is part of the `Store`. Code in this store cannot
// run while we hold that borrow. We re-publish when we
// are dropped.
unsafe {
code_memory.unpublish()?;
}
}
Member

I think something around this is going to need some special care, specifically when unpublish returns an error. There are situations that this implementation doesn't currently handle such as:

  • If unpublish fails then it's documented as leaving things in an indeterminate state which means the entire store needs to be locked down and completely unusable (or something like that), but no such enforcement happens here.
  • If unpublish succeeds, there's no guarantee that the future publish in Drop will succeed. I don't think that a future publish failing should result in a panic-during-Drop, since it's easy to get into a double-panic situation that aborts the process.

Overall I don't know how to handle this though. If unpublish fails and things are indeterminate, we can't handle that because returning from wasm requires executing some wasm. If publish fails then things definitely aren't executable, so we can't return to wasm to jump back to the other end of wasm. I hadn't really thought about this until reading this just now, but do you have ideas on how to handle the failures here?

Member

I suppose the closest thing that I can think of for handling this is to document abort-on-error semantics both here during unpublish and in Drop where the re-publish happens. It's on us to ensure that nothing outside of "kernel exceeded its limit of VMAs"-style errors happens, and in such a situation there's not much we can do other than abort, sort of...

Member Author

@cfallin cfallin Dec 9, 2025

Hmm, yeah, I am just realizing that even an ordinary panic isn't sufficient here: that will unwind to the top of the hostcall stack frames, but then we'll catch-unwind and return an error through Wasm, but oops the trampoline isn't executable anymore, so we segfault.

We could almost make this work if we insisted that our trampolines (where the error check and raise that will longjmp back over the activation) were executed in engine code, not store code. Then an error would make Wasm non-executable but leave trampolines executable. We currently use engine-code pointers at some points (as you probably remember from the private-code PR) but for the Wasm-to-host trampolines for builtins, we'll use store-code variants because we emit direct calls from Wasm functions to those trampolines, so that's not a solution (or at least not without making all breakpoint patchable calls indirects and burning another register).

I agree that there's not much we can do if (in the Linux case) the kernel fails to allocate a new struct vma, and also that that's the only error case (ENOMEM generally) that I see in the Linux manpage for mprotect that isn't a bad-parameter sort of error avoided by construction. I could almost make an argument that the re-publish case should only be merging VMAs or keeping the same number, not splitting them (the whole mmap is initially one; initially publishing text splits into three (R, R+X, R); flipping the middle to R+W and back won't cause a merge nor a split); of course we can't rely on that for edge-case arguments about kernel OOM'ing and syscall errors but it should put us slightly more at ease...

Is there an idiomatic way to panic-and-really-abort-the-process even in a panic=unwind build?

Member Author

Alternately, an idea occurs: this all happens on a fiber in practice, since guest-debugging relies on that. Could we suspend out of the fiber and "poison" it somehow? Then we eliminate the re-execution of now-non-executable code by dropping those stackframes on the floor, never returning into them again.

Member

For aborting that'd be std::process::abort, and I agree that it's sort of our only recourse here.

Personally I've always felt that man pages are not a reliable source of information when it comes to "what is the exhaustive set of error conditions that could apply to this function"; I've frequently seen errors not documented in man pages. Given that, I don't think we can reason our way around this and conclude that it shouldn't ever actually panic on Linux, so we're ok. Instead I think we should expect such errors to be rare (if ever) and still ensure something reasonable happens as a result (e.g. aborting).

Another option for aborting would be to funnel things through an extern "C" function pointer which is defined as "cannot unwind" and will catch unwinds and abort the process. That's arguably better since we want to handle panics-while-editing in addition to actual mprotect errors. Or maybe have some sort of flag on BreakpointEdit which indicates a "yes this is a normal drop" vs an unwinding drop.
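For concreteness, a minimal sketch of that "catch the unwind and abort" funnel; the function name, the closure-based shape, and the use of std::panic::catch_unwind here are illustrative assumptions rather than the actual implementation (which might instead use an extern "C" boundary).

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Run the re-publish step; if it panics or returns an error, abort the
/// process rather than risk returning into code that is no longer executable.
fn republish_or_abort(republish: impl FnOnce() -> std::io::Result<()>) {
    match catch_unwind(AssertUnwindSafe(republish)) {
        Ok(Ok(())) => {}
        _ => std::process::abort(),
    }
}
```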

I do think it would be somewhat possible to basically leak the entire fiber but that's definitely resource leakage.


Do you know how other engines cope with this? I would expect other JITs to more-or-less run into the same problem here, modulo that they might not have to return back to code to execute a shim to jump to the start. Otherwise though this seems like something almost all JITs would eventually run into.

Member Author

It looks like SpiderMonkey also landed on the "just abort the process because there's no better option" option -- see AutoWritableJitCodeFallible. Notably the "fallible" bit is to make it writable; see the comment there about how it cannot fail to make it executable again in the destructor, and that giant blinking MOZ_CRASH invocation if makeExecutableAndFlushICache fails.

Member

Ok, that seems convincing; we should probably do something similar, yeah. I'm curious though: what led you to document in unpublish that if it returns an error, things are left in an indeterminate state? I would naively expect to be able to document/implement "if this fails then nothing changes, that's a guarantee", which would mean that starting a breakpoint-edit session is fallible, and it's just dropping the session that's infallible/aborts.

@cfallin cfallin force-pushed the wasm-breakpoints branch 2 times, most recently from fcca65a to 9d69163 on December 9, 2025 09:04

cfallin commented Dec 10, 2025

Fixed s390x in #12148; that commit is also on top here to see the fix in CI, but I'll rebase it out once that merges.
