-
Notifications
You must be signed in to change notification settings - Fork 15.1k
[lldb] Fix a crash in lldb-server during RemoveSoftwareBreakpoint() #148738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[lldb] Fix a crash in lldb-server during RemoveSoftwareBreakpoint() #148738
Conversation
|
@llvm/pr-subscribers-lldb Author: None (royitaqi) ChangesLldb-server crashWe have seen stacks like the following in lldb-server core dumps: Hypothesis of root causeIn However, if any of the validations for the above reads and writes goes wrong, the code returns an error early, skipping the removal of the entry. This leaves the entry behind with a The next call to * We haven't found a way to repro such a next call. This is because both lldb and lldb-dap removes the breakpoint from their internal list when they get any response from the lldb-server (OK or error). Asking the client to delete the breakpoint a second time doesn't trigger the client to send the "$z" gdb packet (remove breakpoint) to lldb-server. FixLift the removal of the map entry to be immediately after the decrement of TestWe haven't found a way to test this change. Manual testing via lldb and lldb-dap doesn't work because the above reason (see the last paragraph). For unit testing, we couldn't find a way to trigger the early returns via writing random bytes into the process' memory to override the breakpoint opcodes. Full diff: https://github.com/llvm/llvm-project/pull/148738.diff 1 Files Affected:
diff --git a/lldb/source/Host/common/NativeProcessProtocol.cpp b/lldb/source/Host/common/NativeProcessProtocol.cpp
index 405acbb5662d6..196f54b93538d 100644
--- a/lldb/source/Host/common/NativeProcessProtocol.cpp
+++ b/lldb/source/Host/common/NativeProcessProtocol.cpp
@@ -366,12 +366,19 @@ Status NativeProcessProtocol::RemoveSoftwareBreakpoint(lldb::addr_t addr) {
if (--it->second.ref_count > 0)
return Status();
+ // Remove the entry from m_software_breakpoints rightaway, so that we don't
+ // leave behind an entry with ref_count == 0 in case one of the following
+ // conditions returns an error. The breakpoint is moved so that it can be
+ // accessed below.
+ SoftwareBreakpoint bkpt = std::move(it->second);
+ m_software_breakpoints.erase(it);
+
// This is the last reference. Let's remove the breakpoint.
Status error;
// Clear a software breakpoint instruction
- llvm::SmallVector<uint8_t, 4> curr_break_op(
- it->second.breakpoint_opcodes.size(), 0);
+ llvm::SmallVector<uint8_t, 4> curr_break_op(bkpt.breakpoint_opcodes.size(),
+ 0);
// Read the breakpoint opcode
size_t bytes_read = 0;
@@ -382,10 +389,10 @@ Status NativeProcessProtocol::RemoveSoftwareBreakpoint(lldb::addr_t addr) {
"addr=0x%" PRIx64 ": tried to read %zu bytes but only read %zu", addr,
curr_break_op.size(), bytes_read);
}
- const auto &saved = it->second.saved_opcodes;
+ const auto &saved = bkpt.saved_opcodes;
// Make sure the breakpoint opcode exists at this address
- if (llvm::ArrayRef(curr_break_op) != it->second.breakpoint_opcodes) {
- if (curr_break_op != it->second.saved_opcodes)
+ if (llvm::ArrayRef(curr_break_op) != bkpt.breakpoint_opcodes) {
+ if (curr_break_op != bkpt.saved_opcodes)
return Status::FromErrorString(
"Original breakpoint trap is no longer in memory.");
LLDB_LOG(log,
@@ -418,7 +425,6 @@ Status NativeProcessProtocol::RemoveSoftwareBreakpoint(lldb::addr_t addr) {
llvm::make_range(saved.begin(), saved.end()));
}
- m_software_breakpoints.erase(it);
return Status();
}
|
|
You should be able to extend the unit test in unittests/Host/NativeProcessProtocolTest.cpp to cover this scenario. |
Thank you, @labath! Added unit tests. Updating the PR description. (Reflection: I should have checked the unit test file in the first place.) |
|
Looks good to me. @labath are you good with this now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
…lvm#148738) # Lldb-server crash We have seen stacks like the following in lldb-server core dumps: ``` [ "__GI___pthread_kill at pthread_kill.c:46", "__GI_raise at raise.c:26", "__GI_abort at abort.c:100", "__assert_fail_base at assert.c:92", "__GI___assert_fail at assert.c:101", "lldb_private::NativeProcessProtocol::RemoveSoftwareBreakpoint(unsigned long) at /redacted/lldb-server:0" ] ``` # Hypothesis of root cause In `NativeProcessProtocol::RemoveSoftwareBreakpoint()` ([code](https://github.com/llvm/llvm-project/blob/19b2dd9d798c124406b0124a1b8debb711675281/lldb/source/Host/common/NativeProcessProtocol.cpp#L359-L423)), a `ref_count` is asserted and reduced. If it becomes zero, the code first go through a series of memory reads and writes to remove the breakpoint trap opcode and to restore the original process code, then, if everything goes fine, removes the entry from the map `m_software_breakpoints` at the end of the function. However, if any of the validations for the above reads and writes goes wrong, the code returns an error early, skipping the removal of the entry. This leaves the entry behind with a `ref_count` of zero. The next call to `NativeProcessProtocol::RemoveSoftwareBreakpoint()` for the same breakpoint[*] would violate the assertion about `ref_count > 0` ([here](https://github.com/llvm/llvm-project/blob/19b2dd9d798c124406b0124a1b8debb711675281/lldb/source/Host/common/NativeProcessProtocol.cpp#L365)), which would cause a crash. [*] We haven't found a *regular* way to repro such a next call in lldb or lldb-dap. This is because both of them remove the breakpoint from their internal list when they get any response from the lldb-server (OK or error). Asking the client to delete the breakpoint a second time doesn't trigger the client to send the `$z` gdb packet to lldb-server. We are able to trigger the crash by sending the `$z` packet directly, see "Manual test" below. # Fix Lift the removal of the map entry to be immediately after the decrement of `ref_count`, before the early returns. This ensures that the asserted case will never happen. The validation errors can still happen, and whether they happen or not, the breakpoint has been removed from the perspective of the lldb-server (same as that of lldb and lldb-dap). # Manual test & unit test See PR. (cherry picked from commit a13712e)
Lldb-server crash
We have seen stacks like the following in lldb-server core dumps:
Hypothesis of root cause
In
NativeProcessProtocol::RemoveSoftwareBreakpoint()(code), aref_countis asserted and reduced. If it becomes zero, the code first go through a series of memory reads and writes to remove the breakpoint trap opcode and to restore the original process code, then, if everything goes fine, removes the entry from the mapm_software_breakpointsat the end of the function.However, if any of the validations for the above reads and writes goes wrong, the code returns an error early, skipping the removal of the entry. This leaves the entry behind with a
ref_countof zero.The next call to
NativeProcessProtocol::RemoveSoftwareBreakpoint()for the same breakpoint[*] would violate the assertion aboutref_count > 0(here), which would cause a crash.[*] We haven't found a regular way to repro such a next call in lldb or lldb-dap. This is because both of them remove the breakpoint from their internal list when they get any response from the lldb-server (OK or error). Asking the client to delete the breakpoint a second time doesn't trigger the client to send the
$zgdb packet to lldb-server. We are able to trigger the crash by sending the$zpacket directly, see "Manual test" below.Fix
Lift the removal of the map entry to be immediately after the decrement of
ref_count, before the early returns. This ensures that the asserted case will never happen. The validation errors can still happen, and whether they happen or not, the breakpoint has been removed from the perspective of the lldb-server (same as that of lldb and lldb-dap).Manual test
Manually verified by adding the following code snippet into this patch, before and after the removal of the map entry in
RemoveSoftwareBreakpoint(), simulating an early return of errors.Before (simulating the current behavior, i.e. without patch): The 2nd attempt to remove the same breakpoint results in
SIGABRT.After (simulating the future behavior, i.e. with patch): The 2nd attempt to remove the same breakpoint results in
E09as expected, not crash.Unit test
Added two tests.
m_software_breakpoints.ReadMemory()in the first removal returns unexpected opcode. The second removal should return an error (and not crash with an assertion failure), because the breakpoint should have been removed fromm_software_breakpoints.The second test case crashes with assertion error without patch and succeeds with patch.