Fix infinite wait bug in test_pthread_proxying_cpp.cpp #18358

Merged · 1 commit merged into main from fix-proxying-test-deadlock on Dec 13, 2022

Conversation

tlively (Member) commented Dec 12, 2022

The tests in test_pthread_proxying_cpp proxy functions that increment local
variables. For the async proxying functions, the proxying is followed by a wait
on a condition variable so the test can assert that the proxied work was
completed. However, the incrementing of the local variables was not protected by
the lock used with the condition variable, so it was possible for the variable
to be incremented and the condition variable notified after checking the wait
condition but before waiting. Since the condition variable would not be notified
again, the test would wait on the condition variable forever.

Fix the bug by taking the lock before incrementing the variable in tests where
this could cause problems.

Fixes #18353.
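For illustration, the race described above can be reduced to the following standalone sketch (a mock-up with hypothetical names such as proxied_work, not the actual test code): a waiter blocks on a condition variable until a flag is set, while another thread sets the flag and notifies without holding the lock.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex mutex;
std::condition_variable cond;
int i = 0;

// Stands in for the async-proxied work: it sets the condition and notifies,
// but (as in the buggy test) without holding `mutex`.
void proxied_work() {
  i = 1;              // BUG: condition changed without holding the lock.
  cond.notify_one();  // Can fire after the waiter evaluated its predicate but
                      // before it actually blocked, so the wakeup is lost.
}

int main() {
  std::thread worker(proxied_work);
  std::unique_lock<std::mutex> lock(mutex);
  // If proxied_work runs entirely inside the window between this predicate
  // returning false and the thread blocking inside wait(), no further
  // notification ever arrives and wait() never returns.
  cond.wait(lock, [] { return i == 1; });
  worker.join();
}
```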

@tlively tlively requested a review from kripken December 12, 2022 23:21
@tlively tlively requested a review from sbc100 December 13, 2022 00:12
tlively added a commit that referenced this pull request Dec 13, 2022
It had previously been disabled because of flakes on wasm2js, but those flakes
are probably fixed by #18358.
```cpp
{
  std::unique_lock<std::mutex> lock(mutex);
  i = 3;
}
executor = std::this_thread::get_id();
cond.notify_one();
```
sbc100 (Collaborator)
You don't need to hold the lock when calling notify too?

tlively (Member, Author)
No, it's actually a pessimization to hold the lock while calling notify because then the other thread wakes up just to immediately block on acquiring the lock.

sbc100 (Collaborator)
TIL!

```cpp
{
  std::unique_lock<std::mutex> lock(mutex);
  i = 3;
}
```
sbc100 (Collaborator)
Despite reading your description, I'm having trouble seeing how putting a lock around this one assignment fixes the issue. Would making i an atomic also fix the issue?

sbc100 (Collaborator)
After reading https://stackoverflow.com/questions/17101922/do-i-have-to-acquire-lock-before-calling-condition-variable-notify-one I think I understand. I had never known this about cond vars.

tlively (Member, Author)
No, the lock is important because of the condition variable. The condition variable needs to atomically check the condition and block if the condition is not met. The mechanism for combining those into an atomic action is the lock, so if the condition can change without the lock being held, the condition variable doesn't work as intended.

kripken (Member) commented Dec 13, 2022

I thought the idea was that the outside holds the mutex so that the thread blocks until the right time, but reading it again, I think I misunderstood it before as the outside grabs the mutex after firing off the async event, so they actually race. So I'm also not sure about this change, and not sure why it was failing before...

tlively (Member, Author) commented Dec 13, 2022

Right, since the proxying is async, the proxied work races with the following code on the proxying thread. To resolve the race, we use explicit condition variables to wait for the proxied work to be completed. The bug was that we weren't using the condition variables properly. Condition variables only work if the condition is protected by the same lock passed to wait, and in this case it was not.

sbc100 (Collaborator) commented Dec 13, 2022

You can go ahead and land without waiting for the restarted sockets tests if you like.. that failure was clearly unrelated.

@tlively tlively merged commit c17542f into main Dec 13, 2022
@tlively tlively deleted the fix-proxying-test-deadlock branch December 13, 2022 00:57
tlively added a commit that referenced this pull request Dec 19, 2022
It had previously been disabled because of flakes on wasm2js, but those flakes
are probably fixed by #18358.
tlively added a commit that referenced this pull request Dec 19, 2022
It had previously been disabled because of flakes on wasm2js, but those flakes
are probably fixed by #18358.
Successfully merging this pull request may close these issues.

Flaky test: test_pthread_proxying_cpp
3 participants