[lldb] Fix a crash in lldb-server during RemoveSoftwareBreakpoint() #148738

royitaqi · 2025-07-14T22:18:33Z

Lldb-server crash

We have seen stacks like the following in lldb-server core dumps:

[
  "__GI___pthread_kill at pthread_kill.c:46",
  "__GI_raise at raise.c:26",
  "__GI_abort at abort.c:100",
  "__assert_fail_base at assert.c:92",
  "__GI___assert_fail at assert.c:101",
  "lldb_private::NativeProcessProtocol::RemoveSoftwareBreakpoint(unsigned long) at /redacted/lldb-server:0"
]

Hypothesis of root cause

In NativeProcessProtocol::RemoveSoftwareBreakpoint() (code), a ref_count is asserted and reduced. If it becomes zero, the code first go through a series of memory reads and writes to remove the breakpoint trap opcode and to restore the original process code, then, if everything goes fine, removes the entry from the map m_software_breakpoints at the end of the function.

However, if any of the validations for the above reads and writes goes wrong, the code returns an error early, skipping the removal of the entry. This leaves the entry behind with a ref_count of zero.

The next call to NativeProcessProtocol::RemoveSoftwareBreakpoint() for the same breakpoint[*] would violate the assertion about ref_count > 0 (here), which would cause a crash.

[*] We haven't found a regular way to repro such a next call in lldb or lldb-dap. This is because both of them remove the breakpoint from their internal list when they get any response from the lldb-server (OK or error). Asking the client to delete the breakpoint a second time doesn't trigger the client to send the $z gdb packet to lldb-server. We are able to trigger the crash by sending the $z packet directly, see "Manual test" below.

Fix

Lift the removal of the map entry to be immediately after the decrement of ref_count, before the early returns. This ensures that the asserted case will never happen. The validation errors can still happen, and whether they happen or not, the breakpoint has been removed from the perspective of the lldb-server (same as that of lldb and lldb-dap).

Manual test

Manually verified by adding the following code snippet into this patch, before and after the removal of the map entry in RemoveSoftwareBreakpoint(), simulating an early return of errors.

// Replace the following 0x value with the breakpoint's address.
bool should_error = addr == 0x55555555513b;
if (should_error)
  return Status::FromErrorString("DEBUG");

Before (simulating the current behavior, i.e. without patch): The 2nd attempt to remove the same breakpoint results in SIGABRT.

(lldb) br del 1
lldb             <  21> send packet: $z0,55555555513b,1#d6
lldb             <   7> read packet: $E09#ae
1 breakpoints deleted; 0 breakpoint locations disabled.

(lldb) process plugin packet send z0,55555555513b,1#d6
lldb             <  24> send packet: $z0,55555555513b,1#d6#93
lldb             <  11> send packet: $qEcho:1#5b
  packet: z0,55555555513b,1#d6
response:
error: UNIMPLEMENTED
Process 1363559 exited with status = -1 (0xffffffff) lldb-server died with signal SIGABRT

After (simulating the future behavior, i.e. with patch): The 2nd attempt to remove the same breakpoint results in E09 as expected, not crash.

(lldb) br del 1
lldb             <  21> send packet: $z0,55555555513b,1#d6
lldb             <   7> read packet: $E09#ae
1 breakpoints deleted; 0 breakpoint locations disabled.

(lldb) process plugin packet send z0,55555555513b,1#d6
lldb             <  24> send packet: $z0,55555555513b,1#d6#93
lldb             <   7> read packet: $E09#ae
  packet: z0,55555555513b,1#d6
response: E09

Unit test

Added two tests.

The happy path: A breakpoint is added once and removed twice. The second removal should return an error, because the breakpoint should have been removed from m_software_breakpoints.
A corner case: A breakpoint is added once and removed twice, but the ReadMemory() in the first removal returns unexpected opcode. The second removal should return an error (and not crash with an assertion failure), because the breakpoint should have been removed from m_software_breakpoints.

The second test case crashes with assertion error without patch and succeeds with patch.

llvmbot · 2025-07-14T22:19:06Z

@llvm/pr-subscribers-lldb

Author: None (royitaqi)

Changes

Lldb-server crash

We have seen stacks like the following in lldb-server core dumps:

[
  "__GI___pthread_kill at pthread_kill.c:46",
  "__GI_raise at raise.c:26",
  "__GI_abort at abort.c:100",
  "__assert_fail_base at assert.c:92",
  "__GI___assert_fail at assert.c:101",
  "lldb_private::NativeProcessProtocol::RemoveSoftwareBreakpoint(unsigned long) at /redacted/lldb-server:0"
]

Hypothesis of root cause

In NativeProcessProtocol::RemoveSoftwareBreakpoint() (code), a ref_count is asserted and reduced. If it becomes zero, the code first go through a series of memory reads and writes to remove the breakpoint trap opcode and to restore the original process code, then, if everything goes fine, removes the entry from the map m_software_breakpoints at the end of the function.

However, if any of the validations for the above reads and writes goes wrong, the code returns an error early, skipping the removal of the entry. This leaves the entry behind with a ref_count of zero.

The next call to NativeProcessProtocol::RemoveSoftwareBreakpoint() for the same breakpoint (*) would violate the assertion about ref_count > 0 (here), which would cause a crash.

* We haven't found a way to repro such a next call. This is because both lldb and lldb-dap removes the breakpoint from their internal list when they get any response from the lldb-server (OK or error). Asking the client to delete the breakpoint a second time doesn't trigger the client to send the "$z" gdb packet (remove breakpoint) to lldb-server.

Fix

Lift the removal of the map entry to be immediately after the decrement of ref_count, before the early returns. This ensures that the asserted case will never happen. The validation errors can still happen, and whether they happen or not, the breakpoint has been removed from the perspective of the lldb-server (same as that of lldb and lldb-dap).

Test

We haven't found a way to test this change. Manual testing via lldb and lldb-dap doesn't work because the above reason (see the last paragraph).

For unit testing, we couldn't find a way to trigger the early returns via writing random bytes into the process' memory to override the breakpoint opcodes.

Full diff: https://github.com/llvm/llvm-project/pull/148738.diff

1 Files Affected:

(modified) lldb/source/Host/common/NativeProcessProtocol.cpp (+12-6)

diff --git a/lldb/source/Host/common/NativeProcessProtocol.cpp b/lldb/source/Host/common/NativeProcessProtocol.cpp
index 405acbb5662d6..196f54b93538d 100644
--- a/lldb/source/Host/common/NativeProcessProtocol.cpp
+++ b/lldb/source/Host/common/NativeProcessProtocol.cpp
@@ -366,12 +366,19 @@ Status NativeProcessProtocol::RemoveSoftwareBreakpoint(lldb::addr_t addr) {
   if (--it->second.ref_count > 0)
     return Status();
 
+  // Remove the entry from m_software_breakpoints rightaway, so that we don't
+  // leave behind an entry with ref_count == 0 in case one of the following
+  // conditions returns an error. The breakpoint is moved so that it can be
+  // accessed below.
+  SoftwareBreakpoint bkpt = std::move(it->second);
+  m_software_breakpoints.erase(it);
+
   // This is the last reference. Let's remove the breakpoint.
   Status error;
 
   // Clear a software breakpoint instruction
-  llvm::SmallVector<uint8_t, 4> curr_break_op(
-      it->second.breakpoint_opcodes.size(), 0);
+  llvm::SmallVector<uint8_t, 4> curr_break_op(bkpt.breakpoint_opcodes.size(),
+                                              0);
 
   // Read the breakpoint opcode
   size_t bytes_read = 0;
@@ -382,10 +389,10 @@ Status NativeProcessProtocol::RemoveSoftwareBreakpoint(lldb::addr_t addr) {
         "addr=0x%" PRIx64 ": tried to read %zu bytes but only read %zu", addr,
         curr_break_op.size(), bytes_read);
   }
-  const auto &saved = it->second.saved_opcodes;
+  const auto &saved = bkpt.saved_opcodes;
   // Make sure the breakpoint opcode exists at this address
-  if (llvm::ArrayRef(curr_break_op) != it->second.breakpoint_opcodes) {
-    if (curr_break_op != it->second.saved_opcodes)
+  if (llvm::ArrayRef(curr_break_op) != bkpt.breakpoint_opcodes) {
+    if (curr_break_op != bkpt.saved_opcodes)
       return Status::FromErrorString(
           "Original breakpoint trap is no longer in memory.");
     LLDB_LOG(log,
@@ -418,7 +425,6 @@ Status NativeProcessProtocol::RemoveSoftwareBreakpoint(lldb::addr_t addr) {
                llvm::make_range(saved.begin(), saved.end()));
   }
 
-  m_software_breakpoints.erase(it);
   return Status();
 }

labath · 2025-07-15T06:30:50Z

You should be able to extend the unit test in unittests/Host/NativeProcessProtocolTest.cpp to cover this scenario.

royitaqi · 2025-07-15T22:00:36Z

You should be able to extend the unit test in unittests/Host/NativeProcessProtocolTest.cpp to cover this scenario.

Thank you, @labath! Added unit tests. Updating the PR description.

(Reflection: I should have checked the unit test file in the first place.)

clayborg · 2025-07-15T22:11:14Z

Looks good to me. @labath are you good with this now?

lldb/unittests/Host/NativeProcessProtocolTest.cpp

dmpots

LGTM

labath

Thanks.

lldb/unittests/Host/NativeProcessProtocolTest.cpp

…lvm#148738) # Lldb-server crash We have seen stacks like the following in lldb-server core dumps: ``` [ "__GI___pthread_kill at pthread_kill.c:46", "__GI_raise at raise.c:26", "__GI_abort at abort.c:100", "__assert_fail_base at assert.c:92", "__GI___assert_fail at assert.c:101", "lldb_private::NativeProcessProtocol::RemoveSoftwareBreakpoint(unsigned long) at /redacted/lldb-server:0" ] ``` # Hypothesis of root cause In `NativeProcessProtocol::RemoveSoftwareBreakpoint()` ([code](https://github.com/llvm/llvm-project/blob/19b2dd9d798c124406b0124a1b8debb711675281/lldb/source/Host/common/NativeProcessProtocol.cpp#L359-L423)), a `ref_count` is asserted and reduced. If it becomes zero, the code first go through a series of memory reads and writes to remove the breakpoint trap opcode and to restore the original process code, then, if everything goes fine, removes the entry from the map `m_software_breakpoints` at the end of the function. However, if any of the validations for the above reads and writes goes wrong, the code returns an error early, skipping the removal of the entry. This leaves the entry behind with a `ref_count` of zero. The next call to `NativeProcessProtocol::RemoveSoftwareBreakpoint()` for the same breakpoint[*] would violate the assertion about `ref_count > 0` ([here](https://github.com/llvm/llvm-project/blob/19b2dd9d798c124406b0124a1b8debb711675281/lldb/source/Host/common/NativeProcessProtocol.cpp#L365)), which would cause a crash. [*] We haven't found a *regular* way to repro such a next call in lldb or lldb-dap. This is because both of them remove the breakpoint from their internal list when they get any response from the lldb-server (OK or error). Asking the client to delete the breakpoint a second time doesn't trigger the client to send the `$z` gdb packet to lldb-server. We are able to trigger the crash by sending the `$z` packet directly, see "Manual test" below. # Fix Lift the removal of the map entry to be immediately after the decrement of `ref_count`, before the early returns. This ensures that the asserted case will never happen. The validation errors can still happen, and whether they happen or not, the breakpoint has been removed from the perspective of the lldb-server (same as that of lldb and lldb-dap). # Manual test & unit test See PR. (cherry picked from commit a13712e)

[lldb] Fix a crash in lldb-server during RemoveSoftwareBreakpoint()

890eb98

royitaqi requested a review from JDevlieghere as a code owner July 14, 2025 22:18

llvmbot added the lldb label Jul 14, 2025

royitaqi requested review from clayborg, dmpots and serge-sans-paille and removed request for JDevlieghere July 14, 2025 22:18

clayborg approved these changes Jul 14, 2025

View reviewed changes

Add unit test

d15f269

royitaqi commented Jul 15, 2025

View reviewed changes

lldb/unittests/Host/NativeProcessProtocolTest.cpp Show resolved Hide resolved

dmpots approved these changes Jul 15, 2025

View reviewed changes

labath approved these changes Jul 16, 2025

View reviewed changes

lldb/unittests/Host/NativeProcessProtocolTest.cpp Show resolved Hide resolved

royitaqi merged commit a13712e into llvm:main Jul 16, 2025
9 checks passed

This was referenced Jul 23, 2025

test abhinavgaba/llvm-project#2

Closed

Add dataFence plugin interface abhinavgaba/llvm-project#3

Closed

royitaqi deleted the fix-crash-in-remove-software-breakpoint branch August 5, 2025 16:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[lldb] Fix a crash in lldb-server during RemoveSoftwareBreakpoint() #148738

[lldb] Fix a crash in lldb-server during RemoveSoftwareBreakpoint() #148738

Uh oh!

royitaqi commented Jul 14, 2025 •

edited

Loading

Uh oh!

llvmbot commented Jul 14, 2025

Lldb-server crash

Hypothesis of root cause

Fix

Test

Uh oh!

labath commented Jul 15, 2025

Uh oh!

royitaqi commented Jul 15, 2025 •

edited

Loading

Uh oh!

clayborg commented Jul 15, 2025

Uh oh!

Uh oh!

dmpots left a comment

Uh oh!

labath left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[lldb] Fix a crash in lldb-server during RemoveSoftwareBreakpoint() #148738

[lldb] Fix a crash in lldb-server during RemoveSoftwareBreakpoint() #148738

Uh oh!

Conversation

royitaqi commented Jul 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Lldb-server crash

Hypothesis of root cause

Fix

Manual test

Unit test

Uh oh!

llvmbot commented Jul 14, 2025

Lldb-server crash

Hypothesis of root cause

Fix

Test

Uh oh!

labath commented Jul 15, 2025

Uh oh!

royitaqi commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clayborg commented Jul 15, 2025

Uh oh!

Uh oh!

dmpots left a comment

Choose a reason for hiding this comment

Uh oh!

labath left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

royitaqi commented Jul 14, 2025 •

edited

Loading

royitaqi commented Jul 15, 2025 •

edited

Loading