Skip to content

[lldb] Fix Linux core file tests hanging on Windows #142143

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 30, 2025

Conversation

DavidSpickett
Copy link
Collaborator

After #141670, TestLinuxCore.py was timing out on our Windows on Arm bot.

Non-Linux core files were ok, as were Linux core files unless it was ppc64le, riscv64 or loongarch.

I eventually noticed that it was attempting to create PlatformLinux many times before trying PlatformAndroid, PlatformMac etc., which it should never need to do.

The tests passed on a Linux host too, to add to the mystery.

Turns out, all I needed to do was mark those architectures as supported in the PlatformLinux constructor. If they're not listed there we get stuck here:

    // Wait for a stopped event since we just posted one above...
	printf("waiting for process to stop...\n");
    lldb::EventSP event_sp;
    StateType state =
        WaitForProcessToStop(std::nullopt, &event_sp, true, listener_sp,
                             nullptr, true, SelectMostRelevantFrame);
    printf("process stopped\n");

Waiting for a stop event that never comes, because it appears we try to treat the core as a real process?

 DynamicLoaderPOSIXDYLD::virtual DynamicLoaderPOSIXDYLD::DidAttach pid 28147 executable '<null executable>', load_offset 0xffffffffffffffff
<...>
 Process::ShouldBroadcastEvent (000002ABC43FF4A0) Restarting process from state: stopped
 Process::PrivateResume() m_stop_id = 1, public state: unloaded private state: stopped
 Process::PrivateResume() got an error "error: elf-core does not support resuming processes".
 Process::ShouldBroadcastEvent (000002ABC43FF4A0) => new state: stopped, last broadcast state: invalid - NO

Some actionable feedback here would be nice, but all I care about for now is that the tests run again.

I have not added riscv32 as that appears to only be supported for Darwin at the moment (I expect someone will get burned by this when it is).

I think debug on these architectures worked if they were also the host arch, if someone tried to remote debug them, I think it would have failed.

After llvm#141670,
TestLinuxCore.py was timing out on our Windows on Arm bot.

Non-Linux core files were ok, as were Linux core files unless
it was ppc64le, riscv64 or loongarch.

I eventually noticed that it was attempting to create
PlatformLinux many times before trying PlatformAndroid,
PlatformMac etc., which it should never need to do.

The tests passed on a Linux host too, to add to the mystery.

Turns out, all I needed to do was mark those architectures
as supported in the PlatformLinux constructor.

If they're not listed there we get stuck here:
```
    // Wait for a stopped event since we just posted one above...
	printf("waiting for process to stop...\n");
    lldb::EventSP event_sp;
    StateType state =
        WaitForProcessToStop(std::nullopt, &event_sp, true, listener_sp,
                             nullptr, true, SelectMostRelevantFrame);
    printf("process stopped\n");
```
Waiting for a stop event that never comes, because it appears we
try to treat the core as a real process?
```
 DynamicLoaderPOSIXDYLD::virtual DynamicLoaderPOSIXDYLD::DidAttach pid 28147 executable '<null executable>', load_offset 0xffffffffffffffff
<...>
 Process::ShouldBroadcastEvent (000002ABC43FF4A0) Restarting process from state: stopped
 Process::PrivateResume() m_stop_id = 1, public state: unloaded private state: stopped
 Process::PrivateResume() got an error "error: elf-core does not support resuming processes".
 Process::ShouldBroadcastEvent (000002ABC43FF4A0) => new state: stopped, last broadcast state: invalid - NO
```

Some sort of actionable feedback here would be nice, but all I care
about for now is that the tests run again.

I have not added riscv32 as that appears to only be supported
for Darwin at the moment (though I expect someone will get
burned by this when it is).

I think debug on these architectures worked if they were also
the host arch, if someone tried to remote debug them, I think
it would fail.
@llvmbot llvmbot added the lldb label May 30, 2025
@DavidSpickett DavidSpickett merged commit 3745e05 into llvm:main May 30, 2025
8 of 11 checks passed
@llvmbot
Copy link
Member

llvmbot commented May 30, 2025

@llvm/pr-subscribers-lldb

Author: David Spickett (DavidSpickett)

Changes

After #141670, TestLinuxCore.py was timing out on our Windows on Arm bot.

Non-Linux core files were ok, as were Linux core files unless it was ppc64le, riscv64 or loongarch.

I eventually noticed that it was attempting to create PlatformLinux many times before trying PlatformAndroid, PlatformMac etc., which it should never need to do.

The tests passed on a Linux host too, to add to the mystery.

Turns out, all I needed to do was mark those architectures as supported in the PlatformLinux constructor. If they're not listed there we get stuck here:

    // Wait for a stopped event since we just posted one above...
	printf("waiting for process to stop...\n");
    lldb::EventSP event_sp;
    StateType state =
        WaitForProcessToStop(std::nullopt, &amp;event_sp, true, listener_sp,
                             nullptr, true, SelectMostRelevantFrame);
    printf("process stopped\n");

Waiting for a stop event that never comes, because it appears we try to treat the core as a real process?

 DynamicLoaderPOSIXDYLD::virtual DynamicLoaderPOSIXDYLD::DidAttach pid 28147 executable '&lt;null executable&gt;', load_offset 0xffffffffffffffff
&lt;...&gt;
 Process::ShouldBroadcastEvent (000002ABC43FF4A0) Restarting process from state: stopped
 Process::PrivateResume() m_stop_id = 1, public state: unloaded private state: stopped
 Process::PrivateResume() got an error "error: elf-core does not support resuming processes".
 Process::ShouldBroadcastEvent (000002ABC43FF4A0) =&gt; new state: stopped, last broadcast state: invalid - NO

Some actionable feedback here would be nice, but all I care about for now is that the tests run again.

I have not added riscv32 as that appears to only be supported for Darwin at the moment (I expect someone will get burned by this when it is).

I think debug on these architectures worked if they were also the host arch, if someone tried to remote debug them, I think it would have failed.


Full diff: https://github.com/llvm/llvm-project/pull/142143.diff

1 Files Affected:

  • (modified) lldb/source/Plugins/Platform/Linux/PlatformLinux.cpp (+3-1)
diff --git a/lldb/source/Plugins/Platform/Linux/PlatformLinux.cpp b/lldb/source/Plugins/Platform/Linux/PlatformLinux.cpp
index dd6490c7141e5..269105208a87a 100644
--- a/lldb/source/Plugins/Platform/Linux/PlatformLinux.cpp
+++ b/lldb/source/Plugins/Platform/Linux/PlatformLinux.cpp
@@ -138,7 +138,9 @@ PlatformLinux::PlatformLinux(bool is_host)
         {llvm::Triple::x86_64, llvm::Triple::x86, llvm::Triple::arm,
          llvm::Triple::aarch64, llvm::Triple::mips64, llvm::Triple::mips64,
          llvm::Triple::hexagon, llvm::Triple::mips, llvm::Triple::mips64el,
-         llvm::Triple::mipsel, llvm::Triple::msp430, llvm::Triple::systemz},
+         llvm::Triple::mipsel, llvm::Triple::msp430, llvm::Triple::systemz,
+         llvm::Triple::loongarch64, llvm::Triple::ppc64le,
+         llvm::Triple::riscv64},
         llvm::Triple::Linux);
   }
 }

@DavidSpickett DavidSpickett deleted the lldb-wincore branch May 30, 2025 13:14
@Jlalond
Copy link
Contributor

Jlalond commented May 30, 2025

Hey @DavidSpickett thanks for fixing the test.

I think you fixed it the right way, but I wanted to share my understanding of how this probably happened

In ProcessElfCore, we check the siginfo status of all the threads

  if (!siginfo_signal_found) {
    // If we don't have signal from SIGINFO use the signal from each threads
    // PRSTATUS note.
    if (prstatus_signal_found) {
      for (auto &thread_data : m_thread_data)
        thread_data.signo = thread_data.prstatus_sig;
    } else if (m_thread_data.size() > 0) {
      // If all else fails force the first thread to be SIGSTOP
      m_thread_data.begin()->signo =
          GetUnixSignals()->GetSignalNumberFromName("SIGSTOP");
    }
  }

So in our case, I've changed this behavior, where siginfo_signal_found will always be set to true if we've extracted any bytes from the PT_NOTE even if we can't use them later due to a lack of a platform class. Because we passed the siginfo bytes check we never populate the signo from the PRSTATUS

Then in ThreadElfCore I think we'd fail to calculate stop info because we don't have a valid SIGNO.

I think a few fixes here:

Currently, we set the stop reason even for signo = 0 so you get a confusing stopped with signal 0 in LLDB. We should always stop for ThreadElfCore, but only make stop info dependent on the actual siginfo.

This same infinite loop also can happen for Minidump, and I haven't had time to fix it. I think in general we can wrap all of this into a nice postmortem thread and have the postmortem thread know it should always stop, but optionally get the signal description.

CC: @labath

@labath
Copy link
Collaborator

labath commented Jun 2, 2025

This PR is correct. It would be nice to avoid this resume-core-file footgun, but I don't have an opinion on the proposed fix.

sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Jun 3, 2025
After llvm#141670, TestLinuxCore.py
was timing out on our Windows on Arm bot.

Non-Linux core files were ok, as were Linux core files unless it was
ppc64le, riscv64 or loongarch.

I eventually noticed that it was attempting to create PlatformLinux many
times before trying PlatformAndroid, PlatformMac etc., which it should
never need to do.

The tests passed on a Linux host too, to add to the mystery.

Turns out, all I needed to do was mark those architectures as supported
in the PlatformLinux constructor. If they're not listed there we get
stuck here:
```
    // Wait for a stopped event since we just posted one above...
	printf("waiting for process to stop...\n");
    lldb::EventSP event_sp;
    StateType state =
        WaitForProcessToStop(std::nullopt, &event_sp, true, listener_sp,
                             nullptr, true, SelectMostRelevantFrame);
    printf("process stopped\n");
```
Waiting for a stop event that never comes, because it appears we try to
treat the core as a real process?
```
 DynamicLoaderPOSIXDYLD::virtual DynamicLoaderPOSIXDYLD::DidAttach pid 28147 executable '<null executable>', load_offset 0xffffffffffffffff
<...>
 Process::ShouldBroadcastEvent (000002ABC43FF4A0) Restarting process from state: stopped
 Process::PrivateResume() m_stop_id = 1, public state: unloaded private state: stopped
 Process::PrivateResume() got an error "error: elf-core does not support resuming processes".
 Process::ShouldBroadcastEvent (000002ABC43FF4A0) => new state: stopped, last broadcast state: invalid - NO
```
Some actionable feedback here would be nice, but all I care about for
now is that the tests run again.

I have not added riscv32 as that appears to only be supported for Darwin
at the moment (I expect someone will get burned by this when it is).

I think debug on these architectures worked if they were also the host
arch, if someone tried to remote debug them, I think it would have
failed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants