Skip to content

[asan][test] Disable asan_lsan_deadlock test #141145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 23, 2025

Conversation

ilovepi
Copy link
Contributor

@ilovepi ilovepi commented May 22, 2025

While the current test exercised the deadlock behavior prior to #131756,
deadlock still can occur intermittently. Since this results in test
flakes in CI, we disable this test until the locking behavior can be
fixed in the runtime. See #140646 for details.

Copy link
Contributor Author

ilovepi commented May 22, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@llvmbot
Copy link
Member

llvmbot commented May 22, 2025

@llvm/pr-subscribers-compiler-rt-sanitizer

Author: Paul Kirth (ilovepi)

Changes

While the current test checks that a deadlock doesn't occur, it also
assumes that lsan reporting happens at the end of the program as part of
the summary. However, it sometimes happens that due to ordering the
stack-buffer-overflow is reported differently. In these cases the error
message will be prefixed with 'ERROR:' instead of 'SUMMARY:'. Since the
verbiage of the reporting isn't the point of the test, we can relax the
check slightly to avoid test flake due to non-determinism in the
execution ordering of the program. Fixes #140646.


Full diff: https://github.com/llvm/llvm-project/pull/141145.diff

1 Files Affected:

  • (modified) compiler-rt/test/asan/TestCases/asan_lsan_deadlock.cpp (+1-1)
diff --git a/compiler-rt/test/asan/TestCases/asan_lsan_deadlock.cpp b/compiler-rt/test/asan/TestCases/asan_lsan_deadlock.cpp
index b856b12325bf6..d9a6f4c07f698 100644
--- a/compiler-rt/test/asan/TestCases/asan_lsan_deadlock.cpp
+++ b/compiler-rt/test/asan/TestCases/asan_lsan_deadlock.cpp
@@ -55,7 +55,7 @@ int main(int argc, char **argv) {
      * 1. ASan's thread registry lock (B) during the reporting 
      * 2. dl_iterate_phdr lock (A) during symbolization
      */
-    // CHECK: SUMMARY: AddressSanitizer: stack-buffer-overflow
+    // CHECK: AddressSanitizer: stack-buffer-overflow
     arr[argc] = 1; // Deliberate OOB access
   });
 

Copy link
Contributor Author

ilovepi commented May 22, 2025

Before landing I do want to get confirmation that the SUMMARY bit wasn't intended to be proof that the deadlock can't happen.

@fmayer
Copy link
Contributor

fmayer commented May 22, 2025

Before landing I do want to get confirmation that the SUMMARY bit wasn't intended to be proof that the deadlock can't happen.

I am pretty sure when it deadlocks it will just time out.

@ilovepi
Copy link
Contributor Author

ilovepi commented May 22, 2025

Before landing I do want to get confirmation that the SUMMARY bit wasn't intended to be proof that the deadlock can't happen.

I am pretty sure when it deadlocks it will just time out.

That's also my take, but wanted to confirm. :)

@fmayer
Copy link
Contributor

fmayer commented May 22, 2025

CC @thurstond (as build watcher for sanitizers)

@Camsyn
Copy link
Contributor

Camsyn commented May 22, 2025

Before landing I do want to get confirmation that the SUMMARY bit wasn't intended to be proof that the deadlock can't happen.

@ilovepi Thanks for your patch. I added this test case to verify a deadlock patch in PR #131756. Unfortunately, the "Summary:" prefix is indeed used to prove that the deadlock has been fixed.
When a deadlock occurs, the error report will be stuck in stack trace symbolizing, after "Error:..." is printed, at which point we will see the following error report:

=================================================================
==2437073==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f94afd00034 at pc 0x561adf3d716c bp 0x7f94aecfdd50 sp 0x7f94aecfdd48
WRITE of size 4 at 0x7f94afd00034 thread T2

If the deadlock is fixed, the "Summary:" will be printed after the stack trace, as follows:

=================================================================
==526272==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fe21c600034 at pc 0x55ad40f8d16c bp 0x7fe21b5fdd50 sp 0x7fe21b5fdd48
WRITE of size 4 at 0x7fe21c600034 thread T2
<STACK TRACE>
...
SUMMARY: AddressSanitizer: stack-buffer-overflow /home/cseadmin/test.cc:59:15 in main::$_0::operator()() const

Therefore, your current fix cannot tell whether the deadlock does occur.

I'm also confused as to why such a simple test case would cause flakiness. However, I managed to fail to reproduce it locally :(, and due to a lack of information, I don't know what the cause of this flakiness is (does the limited resources of the build bot cause ASan bug reporting to time out? or is there another deadlock path?).

@ilovepi
Copy link
Contributor Author

ilovepi commented May 22, 2025

@Camsyn I'm not sure I follow. The failures in our bots don't appear to be due to a timeout. I see some of those on the public build bot, but not on our CI https://ci.chromium.org/ui/p/fuchsia/builders/toolchain.ci/clang-linux-x64/b8714404311534072321/overview (logs https://logs.chromium.org/logs/fuchsia/buildbucket/cr-buildbucket/8714404311534072321/+/u/clang/test/stdout).

I also wonder about a few things after looking at this test.

  1. Is your reasoning about SUMMARY reporting being a reliable signal correct?
    • I'm not sure it is, given that I have bots that don't seem to timeout in this test (at least I don't see that in the log).
  2. How do you ensure lock acquisitions overlap?
    • I don't think there is enough synchronization for you to guarantee any ordering. Depending on scheduling I could believe the thread you launch could finish before the main thread tries to enter its critical section. What happens then?
    • You probably need some kind of coordination between the threads, like maybe std::latch w/ jthread?

The thread that is doing the OOB access probably needs to be parked via std::latch until you're ready to start the critical sections. Otherwise I don't see what prevents the OOB thread from immediately completing in which case, you won't exercise the deadlock scenario. In fact, I wonder if the test just usually doesn't test the deadlock scenario because the launched thread completes quickly and then the summary is reported per expectation.

In any case, I don't think this test is giving a useful signal, so I guess we should disable it until it can be fixed.

@Camsyn
Copy link
Contributor

Camsyn commented May 22, 2025

@ilovepi

Is your reasoning about SUMMARY reporting being a reliable signal correct?

If all went well, then "SUMMARY" should have been printed after "[STACK TRACE]", which is the expected behavior of ASan's reporting -- if no deadlock occurs, ASan should give a complete bug report in this test case.

However, your CI report does not even have "[STACK TRACE]" in. I think this just means that the deadlock was triggered by the test case and was killed by the WatchDog thread when reaching its timeout threshold--60s.

How do you ensure lock acquisitions overlap?

As mentioned in PR #131756, we cannot finely control lock acquisition order under ASan (__asan::ScopedInErrorReport) and LSan (__lsan::__lsan_do_leak_check).
Therefore, as a test case outside of LSan and Asan, I can only make the critical sections of ASan(__asan::ScopedInErrorReport) and LSan (__lsan::__lsan_do_leak_check) are reached at roughly the same time, expecting to trigger the deadlock contained in __lsan::__lsan_do_leak_check (A - > B) and __asan::ScopedInErrorReport (B- > A) with a high probability.

Yes, even without the relevant fix, the test case does not reliably reproduce the deadlock (but when running on an idle system, it is NEARLY stable to trigger the bug). Anyway, for your instance, if the new spawn thread ends early before we enter __lsan_do_leak_check, it will only pass the test harmlessly as it should output "Summary:".


Overall, I think this test case clearly exposes the true deadlock flaw of ASan/LSan, and am not sure if it should be disabled.

@ilovepi
Copy link
Contributor Author

ilovepi commented May 23, 2025

if the deadlock still happens then this test only passes by accident. it either needs to deterministically fail or pass to be enabled. if it the deadlock always triggered, then we could mark it xfail until it was fixed. as is it gives no signal, and in fact seems to give a false sense that the issue is fixed.

@ilovepi ilovepi force-pushed the users/ilovepi/fix-deadlock-test branch from 09b9d1e to 561c4ba Compare May 23, 2025 01:52
@ilovepi ilovepi changed the title [asan][test] Fix flake in asan_lsan_deadlock test [asan][test] Disable asan_lsan_deadlock test May 23, 2025
@ilovepi ilovepi requested a review from fmayer May 23, 2025 05:10
While the current test exercised the deadlock behavior prior to #131756,
deadlock still can occur intermittently. Since this results in test
flakes in CI, we disable this test until the locking behavior can be
fixed in the runtime. See #140646 for details.
@ilovepi ilovepi force-pushed the users/ilovepi/fix-deadlock-test branch from 561c4ba to e88d9f3 Compare May 23, 2025 19:27
Copy link
Contributor Author

ilovepi commented May 23, 2025

I'm going to go ahead and merge this, since the failure just hit our CI again https://ci.chromium.org/ui/p/fuchsia/builders/toolchain.ci/clang-linux-x64/b8714055276410139825/overview.

@Camsyn this test will still be in tree, so when we get the deadlock issue fixed (or we figure out a deterministic way to trigger the deadlock on every run) this can be re-enabled.

Copy link
Contributor Author

ilovepi commented May 23, 2025

Merge activity

  • May 23, 7:53 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • May 23, 7:55 PM UTC: @ilovepi merged this pull request with Graphite.

@ilovepi ilovepi merged commit 1590ef3 into main May 23, 2025
7 of 10 checks passed
@ilovepi ilovepi deleted the users/ilovepi/fix-deadlock-test branch May 23, 2025 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants