-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Debug: implement breakpoints and single-stepping. #12133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
5c6fcc7 to
f075ebe
Compare
|
The s390x failure looks like an oversight on my part in the patchable-ABI implementation on that ISA -- the clobber-save code implicitly assumes that clobber set fits in that ABI's special clobber-save region in each frame but that's no longer true when everything is clobbered. I'll rework it in a separate PR then rebase this. |
| /// A trampoline from the patchable ABI to the given builtin. | ||
| PatchableToBuiltinTrampoline = FuncKey::new_kind(0b0100), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding a new kind here, I would have expected the ABI of the breakpoint builtin to be different from the rest. There's no use for a non-patchable version of the breakpoint builtin for example, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true currently, yeah; the reason I built it this way is that ABI is already a notion baked into the FuncKeyKind (explicitly during codegen for each such trampoline, and exposed via Abi too) and it seemed less error-prone to render that detail explicitly than to try to special-case one builtin (which would possibly require plumbing the builtin index into more places and conditionalizing a bunch of stuff). The lazy generation of trampolines on demand during compilation also means we don't ever actually generate a breakpoint trampoline with kind WasmToBuiltin. Maybe I'm missing somewhere this leads to an efficiency concern though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I feel like functions are easier to understand when they can't have variable ABIs and instead just have a single ABI for any particular definition, so I'd lean towards removing this new kind and updating the ABI definition of the breakpoint builtin.
There might be perf-ish concerns related to Nick's recent refactorings where sets are either dense-or-not and how things are encoded in metadata, but I'm not too concerned about that. I'm mostly worried about accidentally using a non-patchable-abi at some point since everything would more-or-less "just work" until runtime if that happened.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm still a little skeptical: there are baked-in notions in various places that the "Wasm ABI" means standard tail, and finding any places where we have to add a special case to break that seems worse and more error-prone than saying that there is a new kind of trampoline. I guess I don't see this as a "variable ABI" so much as "we can generate as needed a trampoline of the desired ABI1-to-ABI2 variant", with current cases Wasm-to-host and Patchable-to-host. That seems cleaner to me? Curious what @fitzgen thinks as well...
| if self.mmap.supports_virtual_memory() { | ||
| self.mmap.make_readwrite(0..self.mmap.len())?; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this'll want to return an error with bail! for when virtual memory isn't supported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The custom-publish case should otherwise work if we don't bail here, and is ordinarily used in non-virtual-memory builds; is there a reason we want to rule that out? This function body is basically derived by starting with publish and then switching the permissions (and cprop'ing out the "if needs executable" because we want to make R/W regardless, e.g. for pulley)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be misunderstanding, but it looks like the custom-unpublished returns early above meaning it won't reach here. Returning an error if !mmap.supports_virtual_memory() is the same as publish above where if the host has virtual memory but the mapping doesn't support it then this operation fails. (technically this should be an assert of some kind since it would have failed in publish, but that's ok)
| /// # Safety | ||
| /// | ||
| /// This code memory may not be actively executing when this | ||
| /// method is invoked or after it returns until the code is | ||
| /// re-[`publish`]ed. Doing so is UB (and will likely cause a | ||
| /// segfault, or possibly worse due to icache incoherence). | ||
| /// | ||
| /// If this method returns an error, the code memory is in an | ||
| /// undetermined state and must never be executed again. | ||
| pub unsafe fn unpublish(&mut self) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I'd say that this Safety section and unsafe is unnecessary since, if &mut exists, it's already guaranteed that this has exclusive access over the memory.
Put another way, the unsafety here, other threads executing code, already shows unsafety unrelated to calling this function insofar as it should not be possible to acquire &mut CodeMemory while other threads are executing in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, either all that, or the clause about other threads should be removed from the unsafe docs since that should not be applicable given the rules of &mut
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I wasn't totally sure what to do here: it's not clear (to me) that executing code in the code memory is meant to be morally equivalent to an immutable borrow from that code. I guess it has to be since that's the only safe API we provide that can read the machine code (.text()) and executing instructions implies reading them. In that sense, though, the Wasm code is technically violating the borrowing invariants because it is fetching the instructions (immutable borrow) and also has a separate mut borrow to the whole Store.
Seen another way: should it be possible to cause a segfault by executing non-executable-bit regions when using only safe APIs on this struct within a hostcall? If this method isn't unsafe, then it is: we get a mut Store as a parameter to a hostcall, we unpublish, we return. Wasm code passed in a mut borrow so that was our right; but ending that borrow does not really end its effect.
(By the way that's what I was trying to capture with the edit function that takes the closure, though a refactor pulled the actual unpublish method out because I wanted to make the edits batched -- I think that's the right call but we need to figure out the above re: safe semantics)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I don't see the word "thread" anywhere in the docs on this method but let me know which part you want clarified and I'm happy to do so!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah modeling this all in Rust's borrowing I've never quite wrapped my head around and it feels a bit like a leaky abstraction. One way to interpret this is that wasm gave the host a &mut Thing borrow and the host did *thing = Thing::default(). Another way to interpret it though is that wasm is holding pointers into Thing (e.g. return addresses on the stack) so it's not valid for wasm to hand out &mut Thing to the host.
On the other hand though a controlled deterministic crash is not something I would consider unsafe. For example aborting the process is a safe operation and here the possible fault of forgetting to call publish is a deterministic "you don't have permissions to execute code there" failure.
So overall I feel like this still doesn't meet the criteria of being unsafe. I'm not aware of UB if this function is called in how it interacts with the rest of the system at least.
As for edit -- it makes sense to have that style of API but I don't think that this is the right layer to implement it. Rather the BreakpointEdit abstraction you added feels more appropriate. Additionally when designing unsafe primitives I've historically felt that the abstraction needs to be as easy-to-use as possible without increasing the risk of UB. For example text_mut doesn't increase the risk of UB it just makes it easier to use. Effectively writing non-idiomatic code is in a sense more risky than using standard borrowing/etc idioms with clearly defined unsafe contracts and such.
As for "thread", sorry about that I skimmed the docs and just interpreted actively-in-use as threads in the system. What I mean is that &mut CodeMemory provides a static guarantee that it's not actively-in-use and thus that part of the unsafety contract can go away as Rust's type system provides such a guarantee.
| if dirty_modules.insert(store_code_pc) { | ||
| // SAFETY: We have a mutable borrow to the `registry`, | ||
| // which is part of the `Store`. Code in this store cannot | ||
| // run while we hold that borrow. We re-publish when we | ||
| // are dropped. | ||
| unsafe { | ||
| code_memory.unpublish()?; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think something around this is going to need some special care, specifically when unpublish returns an error. There are situations that this implementation doesn't currently handle such as:
- If
unpublishfails then it's documented as leaving things in an indeterminate state which means the entire store needs to be locked down and completely unusable (or something like that), but no such enforcement happens here. - If
unpublishsucceeds, there's no guarantee that the futurepublishinDropwill succeed. I don't think that a futuerpublishfailing should result in a panic-during-Dropsince that's easy to get into a double-panic situation that aborts the process.
Overall I don't know how to handle this though. If unpublish fails and things are indeterminate, we can't handle that because returning from wasm requires executing some wasm. If publish fails then things definitely aren't executable, so we can't return to wasm to jump back to the other end of wasm. I hadn't really thought about this until reading this just now, but do you have ideas on how to handle the failures here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose the closest thing that I can think of for handling this is to document abort-on-error semantics both here during unpublish and in Drop where the re-publish happens. It's on us to ensure that nothing outside of like "kernel exceeded limit of VMAs" errors happen and in such a situation there's not much we can do other than abort sort of...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yeah, I am just realizing that even an ordinary panic isn't sufficient here: that will unwind to the top of the hostcall stack frames, but then we'll catch-unwind and return an error through Wasm, but oops the trampoline isn't executable anymore, so we segfault.
We could almost make this work if we insisted that our trampolines (where the error check and raise that will longjmp back over the activation) were executed in engine code, not store code. Then an error would make Wasm non-executable but leave trampolines executable. We currently use engine-code pointers at some points (as you probably remember from the private-code PR) but for the Wasm-to-host trampolines for builtins, we'll use store-code variants because we emit direct calls from Wasm functions to those trampolines, so that's not a solution (or at least not without making all breakpoint patchable calls indirects and burning another register).
I agree that there's not much we can do if (in the Linux case) the kernel fails to allocate a new struct vma, and also that that's the only error case (ENOMEM generally) that I see in the Linux manpage for mprotect that isn't a bad-parameter sort of error avoided by construction. I could almost make an argument that the re-publish case should only be merging VMAs or keeping the same number, not splitting them (the whole mmap is initially one; initially publishing text splits into three (R, R+X, R); flipping the middle to R+W and back won't cause a merge nor a split); of course we can't rely on that for edge-case arguments about kernel OOM'ing and syscall errors but it should put us slightly more at ease...
Is there an idiomatic way to panic-and-really-abort-the-process even in a panic=unwind build?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternately, an idea occurs: this all happens on a fiber in practice, since guest-debugging relies on that. Could we suspend out of the fiber and "poison" it somehow? Then we eliminate the re-execution of now-non-executable code by dropping those stackframes on the floor, never returning into them again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For aborting that'd be std::process::abort, and I agree that it's sort of our only recourse here.
Personally I've always felt that man pages are not a reliable source of information when it comes to "what are the exhaustive set of error conditions that could apply to this function" and instead I frequently have seen errors not documented in man pages. Given that I don't think we can reason our way around this and conclude that it shouldn't ever actually panic on Linux so we're ok. Instead I think we should expect such errors to be rare (if ever) and still ensure something reasonable happens as a result (e.g. aborting)
Another option for aborting would be to funnel things through an extern "C" function pointer which is defined as "cannot unwind" and will catch unwinds and abort the process. That's arguably better since we want to handle panics-while-editing in addition to actual mprotect errors. Or maybe have some sort of flag on BreakpointEdit which indicates a "yes this is a normal drop" vs an unwinding drop.
I do think it would be somewhat possible to basically leak the entire fiber but that's definitely resource leakage.
Do you know how other engines cope with this? I would expect other JITs to more-or-less run into the same problem here modulo they might not have to return back to code to execute a shim to jump to the start. Otherwise though this seems like something almost all JITs would eventually run into.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like SpiderMonkey also landed on the "just abort the process because there's no better option" option -- see AutoWritableJitCodeFallible. Notably the "fallible" bit is to make it writable; see the comment there about how it cannot fail to make it executable again in the destructor, and that giant blinking MOZ_CRASH invocation if makeExecutableAndFlushICache fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok that seems convincing that we should probably do similar yeah. I'm curious though, what led you to document in unpublish that if it returns an error that things are left in an indeterminate state? I would naively expect that to be able to document/implement "if this fails then nothing changes, that's a guarantee" which would mean that starting a breakpoint-edit session is fallible, and it's just dropping the session that's infallible/aborts
fcca65a to
9d69163
Compare
|
Fixed s390x in #12148; that commit is also on top here to see the fix in CI but I'll rebase out once that merges first. |
This is a PR that puts together a bunch of earlier pieces (patchable calls in bytecodealliance#12061 and bytecodealliance#12101, private copies of code in bytecodealliance#12051, and all the prior debug event and instrumentation infrastructure) to implement breakpoints in the guest debugger. These are implemented in the way we have planned in bytecodealliance#11964: each sequence point (location prior to a Wasm opcode) is now a patchable call instruction, patched out (replaced with NOPs) by default. When patched in, the breakpoint callsite calls a trampoline with the `patchable` ABI which then invokes the `breakpoint` hostcall. That hostcall emits the debug event and nothing else. A few of the interesting bits in this PR include: - Implementations of "unpublish" (switch permissions back to read/write from read/execute) for mmap'd code memory on all our platforms. - Infrastructure in the frame-tables (debug info) metadata producer and parser to record "breakpoint patches". - A tweak to the NOP metadata packaged with the `MachBuffer` to allow multiple NOP sizes. This lets us use one 5-byte NOP on x86-64, for example (did you know x86-64 had these?!) rather than five 1-byte NOPs. This PR also implements single-stepping with a global-per-`Store` flag, because at this point why not; it's a small additional bit of logic to do *all* patches in all modules registered in the `Store` when that flag is enabled. A few realizations for future work: - The need for an introspection API available to a debugger to see the modules within a component is starting to become clear; either that, or the "module and PC" location identifier for a breakpoint switches to a "module or component" sum type. Right now, the tests for this feature use only core modules. Extending to components should not actually be hard at all, we just need to build the API for it. - The interaction between inlining and `patchable_call` is interesting: what happens if we inline a `patchable_call` at a `try_call` callsite? Right now, we do *not* update the `patchable_call` to a `try_call`, because there is no `patchable_try_call`; this is fine in the Wasmtime embedding in practice because we never (today!) throw exceptions from a breakpoint handler. This does suggest to me that maybe we should make patchability a property of any callsite, and allow try-calls to be patchable too (with the same restriction about no return values as the only restriction); but happy to discuss that one further.
a8f450c to
2e00ca6
Compare
This is a PR that puts together a bunch of earlier pieces (patchable calls in #12061 and #12101, private copies of code in #12051, and all the prior debug event and instrumentation infrastructure) to implement breakpoints in the guest debugger.
These are implemented in the way we have planned in #11964: each sequence point (location prior to a Wasm opcode) is now a patchable call instruction, patched out (replaced with NOPs) by default. When patched in, the breakpoint callsite calls a trampoline with the
patchableABI which then invokes thebreakpointhostcall. That hostcall emits the debug event and nothing else.A few of the interesting bits in this PR include:
MachBufferto allow multiple NOP sizes. This lets us use one 5-byte NOP on x86-64, for example (did you know x86-64 had these?!) rather than five 1-byte NOPs.This PR also implements single-stepping with a global-per-
Storeflag, because at this point why not; it's a small additional bit of logic to do all patches in all modules registered in theStorewhen that flag is enabled.A few realizations for future work:
patchable_callis interesting: what happens if we inline apatchable_callat atry_callcallsite? Right now, we do not update thepatchable_callto atry_call, because there is nopatchable_try_call; this is fine in the Wasmtime embedding in practice because we never (today!) throw exceptions from a breakpoint handler. This does suggest to me that maybe we should make patchability a property of any callsite, and allow try-calls to be patchable too (with the same restriction about no return values as the only restriction); but happy to discuss that one further.