Skip to content

Conversation

@royitaqi
Copy link
Contributor

@royitaqi royitaqi commented Jul 14, 2025

Lldb-server crash

We have seen stacks like the following in lldb-server core dumps:

[
  "__GI___pthread_kill at pthread_kill.c:46",
  "__GI_raise at raise.c:26",
  "__GI_abort at abort.c:100",
  "__assert_fail_base at assert.c:92",
  "__GI___assert_fail at assert.c:101",
  "lldb_private::NativeProcessProtocol::RemoveSoftwareBreakpoint(unsigned long) at /redacted/lldb-server:0"
]

Hypothesis of root cause

In NativeProcessProtocol::RemoveSoftwareBreakpoint() (code), a ref_count is asserted and reduced. If it becomes zero, the code first go through a series of memory reads and writes to remove the breakpoint trap opcode and to restore the original process code, then, if everything goes fine, removes the entry from the map m_software_breakpoints at the end of the function.

However, if any of the validations for the above reads and writes goes wrong, the code returns an error early, skipping the removal of the entry. This leaves the entry behind with a ref_count of zero.

The next call to NativeProcessProtocol::RemoveSoftwareBreakpoint() for the same breakpoint[*] would violate the assertion about ref_count > 0 (here), which would cause a crash.

[*] We haven't found a regular way to repro such a next call in lldb or lldb-dap. This is because both of them remove the breakpoint from their internal list when they get any response from the lldb-server (OK or error). Asking the client to delete the breakpoint a second time doesn't trigger the client to send the $z gdb packet to lldb-server. We are able to trigger the crash by sending the $z packet directly, see "Manual test" below.

Fix

Lift the removal of the map entry to be immediately after the decrement of ref_count, before the early returns. This ensures that the asserted case will never happen. The validation errors can still happen, and whether they happen or not, the breakpoint has been removed from the perspective of the lldb-server (same as that of lldb and lldb-dap).

Manual test

Manually verified by adding the following code snippet into this patch, before and after the removal of the map entry in RemoveSoftwareBreakpoint(), simulating an early return of errors.

// Replace the following 0x value with the breakpoint's address.
bool should_error = addr == 0x55555555513b;
if (should_error)
  return Status::FromErrorString("DEBUG");

Before (simulating the current behavior, i.e. without patch): The 2nd attempt to remove the same breakpoint results in SIGABRT.

(lldb) br del 1
lldb             <  21> send packet: $z0,55555555513b,1#d6
lldb             <   7> read packet: $E09#ae
1 breakpoints deleted; 0 breakpoint locations disabled.

(lldb) process plugin packet send z0,55555555513b,1#d6
lldb             <  24> send packet: $z0,55555555513b,1#d6#93
lldb             <  11> send packet: $qEcho:1#5b
  packet: z0,55555555513b,1#d6
response:
error: UNIMPLEMENTED
Process 1363559 exited with status = -1 (0xffffffff) lldb-server died with signal SIGABRT

After (simulating the future behavior, i.e. with patch): The 2nd attempt to remove the same breakpoint results in E09 as expected, not crash.

(lldb) br del 1
lldb             <  21> send packet: $z0,55555555513b,1#d6
lldb             <   7> read packet: $E09#ae
1 breakpoints deleted; 0 breakpoint locations disabled.

(lldb) process plugin packet send z0,55555555513b,1#d6
lldb             <  24> send packet: $z0,55555555513b,1#d6#93
lldb             <   7> read packet: $E09#ae
  packet: z0,55555555513b,1#d6
response: E09

Unit test

Added two tests.

  1. The happy path: A breakpoint is added once and removed twice. The second removal should return an error, because the breakpoint should have been removed from m_software_breakpoints.
  2. A corner case: A breakpoint is added once and removed twice, but the ReadMemory() in the first removal returns unexpected opcode. The second removal should return an error (and not crash with an assertion failure), because the breakpoint should have been removed from m_software_breakpoints.

The second test case crashes with assertion error without patch and succeeds with patch.

@royitaqi royitaqi requested a review from JDevlieghere as a code owner July 14, 2025 22:18
@llvmbot llvmbot added the lldb label Jul 14, 2025
@royitaqi royitaqi requested review from clayborg, dmpots and serge-sans-paille and removed request for JDevlieghere July 14, 2025 22:18
@llvmbot
Copy link
Member

llvmbot commented Jul 14, 2025

@llvm/pr-subscribers-lldb

Author: None (royitaqi)

Changes

Lldb-server crash

We have seen stacks like the following in lldb-server core dumps:

[
  "__GI___pthread_kill at pthread_kill.c:46",
  "__GI_raise at raise.c:26",
  "__GI_abort at abort.c:100",
  "__assert_fail_base at assert.c:92",
  "__GI___assert_fail at assert.c:101",
  "lldb_private::NativeProcessProtocol::RemoveSoftwareBreakpoint(unsigned long) at /redacted/lldb-server:0"
]

Hypothesis of root cause

In NativeProcessProtocol::RemoveSoftwareBreakpoint() (code), a ref_count is asserted and reduced. If it becomes zero, the code first go through a series of memory reads and writes to remove the breakpoint trap opcode and to restore the original process code, then, if everything goes fine, removes the entry from the map m_software_breakpoints at the end of the function.

However, if any of the validations for the above reads and writes goes wrong, the code returns an error early, skipping the removal of the entry. This leaves the entry behind with a ref_count of zero.

The next call to NativeProcessProtocol::RemoveSoftwareBreakpoint() for the same breakpoint (*) would violate the assertion about ref_count &gt; 0 (here), which would cause a crash.

* We haven't found a way to repro such a next call. This is because both lldb and lldb-dap removes the breakpoint from their internal list when they get any response from the lldb-server (OK or error). Asking the client to delete the breakpoint a second time doesn't trigger the client to send the "$z" gdb packet (remove breakpoint) to lldb-server.

Fix

Lift the removal of the map entry to be immediately after the decrement of ref_count, before the early returns. This ensures that the asserted case will never happen. The validation errors can still happen, and whether they happen or not, the breakpoint has been removed from the perspective of the lldb-server (same as that of lldb and lldb-dap).

Test

We haven't found a way to test this change. Manual testing via lldb and lldb-dap doesn't work because the above reason (see the last paragraph).

For unit testing, we couldn't find a way to trigger the early returns via writing random bytes into the process' memory to override the breakpoint opcodes.


Full diff: https://github.com/llvm/llvm-project/pull/148738.diff

1 Files Affected:

  • (modified) lldb/source/Host/common/NativeProcessProtocol.cpp (+12-6)
diff --git a/lldb/source/Host/common/NativeProcessProtocol.cpp b/lldb/source/Host/common/NativeProcessProtocol.cpp
index 405acbb5662d6..196f54b93538d 100644
--- a/lldb/source/Host/common/NativeProcessProtocol.cpp
+++ b/lldb/source/Host/common/NativeProcessProtocol.cpp
@@ -366,12 +366,19 @@ Status NativeProcessProtocol::RemoveSoftwareBreakpoint(lldb::addr_t addr) {
   if (--it->second.ref_count > 0)
     return Status();
 
+  // Remove the entry from m_software_breakpoints rightaway, so that we don't
+  // leave behind an entry with ref_count == 0 in case one of the following
+  // conditions returns an error. The breakpoint is moved so that it can be
+  // accessed below.
+  SoftwareBreakpoint bkpt = std::move(it->second);
+  m_software_breakpoints.erase(it);
+
   // This is the last reference. Let's remove the breakpoint.
   Status error;
 
   // Clear a software breakpoint instruction
-  llvm::SmallVector<uint8_t, 4> curr_break_op(
-      it->second.breakpoint_opcodes.size(), 0);
+  llvm::SmallVector<uint8_t, 4> curr_break_op(bkpt.breakpoint_opcodes.size(),
+                                              0);
 
   // Read the breakpoint opcode
   size_t bytes_read = 0;
@@ -382,10 +389,10 @@ Status NativeProcessProtocol::RemoveSoftwareBreakpoint(lldb::addr_t addr) {
         "addr=0x%" PRIx64 ": tried to read %zu bytes but only read %zu", addr,
         curr_break_op.size(), bytes_read);
   }
-  const auto &saved = it->second.saved_opcodes;
+  const auto &saved = bkpt.saved_opcodes;
   // Make sure the breakpoint opcode exists at this address
-  if (llvm::ArrayRef(curr_break_op) != it->second.breakpoint_opcodes) {
-    if (curr_break_op != it->second.saved_opcodes)
+  if (llvm::ArrayRef(curr_break_op) != bkpt.breakpoint_opcodes) {
+    if (curr_break_op != bkpt.saved_opcodes)
       return Status::FromErrorString(
           "Original breakpoint trap is no longer in memory.");
     LLDB_LOG(log,
@@ -418,7 +425,6 @@ Status NativeProcessProtocol::RemoveSoftwareBreakpoint(lldb::addr_t addr) {
                llvm::make_range(saved.begin(), saved.end()));
   }
 
-  m_software_breakpoints.erase(it);
   return Status();
 }
 

@labath
Copy link
Collaborator

labath commented Jul 15, 2025

You should be able to extend the unit test in unittests/Host/NativeProcessProtocolTest.cpp to cover this scenario.

@royitaqi
Copy link
Contributor Author

royitaqi commented Jul 15, 2025

You should be able to extend the unit test in unittests/Host/NativeProcessProtocolTest.cpp to cover this scenario.

Thank you, @labath! Added unit tests. Updating the PR description.

(Reflection: I should have checked the unit test file in the first place.)

@clayborg
Copy link
Collaborator

Looks good to me. @labath are you good with this now?

Copy link
Contributor

@dmpots dmpots left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@labath labath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@royitaqi royitaqi merged commit a13712e into llvm:main Jul 16, 2025
9 checks passed
@royitaqi royitaqi deleted the fix-crash-in-remove-software-breakpoint branch August 5, 2025 16:19
JDevlieghere pushed a commit to swiftlang/llvm-project that referenced this pull request Oct 8, 2025
…lvm#148738)

# Lldb-server crash

We have seen stacks like the following in lldb-server core dumps:
```
[
  "__GI___pthread_kill at pthread_kill.c:46",
  "__GI_raise at raise.c:26",
  "__GI_abort at abort.c:100",
  "__assert_fail_base at assert.c:92",
  "__GI___assert_fail at assert.c:101",
  "lldb_private::NativeProcessProtocol::RemoveSoftwareBreakpoint(unsigned long) at /redacted/lldb-server:0"
]
```

# Hypothesis of root cause

In `NativeProcessProtocol::RemoveSoftwareBreakpoint()`
([code](https://github.com/llvm/llvm-project/blob/19b2dd9d798c124406b0124a1b8debb711675281/lldb/source/Host/common/NativeProcessProtocol.cpp#L359-L423)),
a `ref_count` is asserted and reduced. If it becomes zero, the code
first go through a series of memory reads and writes to remove the
breakpoint trap opcode and to restore the original process code, then,
if everything goes fine, removes the entry from the map
`m_software_breakpoints` at the end of the function.

However, if any of the validations for the above reads and writes goes
wrong, the code returns an error early, skipping the removal of the
entry. This leaves the entry behind with a `ref_count` of zero.

The next call to `NativeProcessProtocol::RemoveSoftwareBreakpoint()` for
the same breakpoint[*] would violate the assertion about `ref_count > 0`
([here](https://github.com/llvm/llvm-project/blob/19b2dd9d798c124406b0124a1b8debb711675281/lldb/source/Host/common/NativeProcessProtocol.cpp#L365)),
which would cause a crash.

[*] We haven't found a *regular* way to repro such a next call in lldb
or lldb-dap. This is because both of them remove the breakpoint from
their internal list when they get any response from the lldb-server (OK
or error). Asking the client to delete the breakpoint a second time
doesn't trigger the client to send the `$z` gdb packet to lldb-server.
We are able to trigger the crash by sending the `$z` packet directly,
see "Manual test" below.

# Fix

Lift the removal of the map entry to be immediately after the decrement
of `ref_count`, before the early returns. This ensures that the asserted
case will never happen. The validation errors can still happen, and
whether they happen or not, the breakpoint has been removed from the
perspective of the lldb-server (same as that of lldb and lldb-dap).

# Manual test & unit test

See PR.

(cherry picked from commit a13712e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants