Avoid acquiring another read lock while holding one to avoid potential deadlock #6200
Conversation
I'd need to check the drop rules, but the compiler may insert a drop for …

Nice find!!
fc0a8b5 to 059e4d7 (Compare)
@mergify queue
🛑 The pull request has been removed from the queue
@mergify requeue
Nice find! But why did this show up in the das branch and not on stable?
@mergify requeue
✅ This pull request will be re-embarked automatically
🛑 The pull request has been removed from the queue
My guess is this is a super rare edge case. It's really hard to hit because there's no long-running operation between the two read locks, only a simple time calculation and some logging, and it can only happen if another thread attempts to acquire a write lock before the second read lock is acquired.
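For illustration, a minimal standalone sketch of that interleaving (not Lighthouse code; it assumes a parking_lot-style `RwLock`, whose fair queueing makes new `read()` calls wait whenever a writer is already queued, so re-acquiring a read lock on the same thread can deadlock):

```rust
// Illustrative repro of the suspected interleaving; this program intentionally hangs.
use parking_lot::RwLock;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    let cache = Arc::new(RwLock::new(0u64));

    // Writer (playing the role of `do_update`): arrives while the first read guard is held.
    let writer = {
        let cache = Arc::clone(&cache);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            let mut guard = cache.write(); // queued behind the outstanding read guard
            *guard += 1;
        })
    };

    let first = cache.read();                  // first read lock, kept in scope
    thread::sleep(Duration::from_millis(100)); // window in which the writer queues up
    let second = cache.read();                 // blocks behind the queued writer: deadlock
    println!("{} {}", *first, *second);        // never reached
    writer.join().unwrap();
}
```

If no writer shows up inside that window, the second `read()` succeeds immediately, which matches why this is so hard to hit in practice.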
@mergify refresh
✅ Pull request refreshed
@mergify requeue
✅ This pull request will be re-embarked automatically
🛑 The pull request has been removed from the queue
@mergify requeue
✅ This pull request will be re-embarked automatically
🛑 The pull request has been removed from the queue
@mergify refresh
✅ Pull request refreshed
@mergify queue
🛑 The pull request has been removed from the queue
@mergify requeue
✅ This pull request will be re-embarked automatically
✅ The pull request has been merged automatically at 9b3b730
…l deadlock (sigp#6200) * Avoid acquiring another read lock to avoid potential deadlock.
Issue Addressed
While testing PeerDAS I ran into a scenario that looked like a deadlock: the Lighthouse process was still alive but had stopped responding and was no longer logging anything.
The backtrace shows that one of the threads is waiting for a read lock, while another one is waiting for a write lock, potentially here.
I'm not sure if I've got to the bottom of the problem yet, but I suspect the code here is causing the deadlock, as it tries to acquire a read lock while we already hold one in scope. If do_update attempts to acquire a write lock while the deposit cache read lock is held, it will wait for that read lock to be released. However, the code below attempts to acquire the read lock again while the old read lock is still in scope, so the write lock stays stuck waiting for the first guard, the second read lock blocks behind the write lock, and this function deadlocks:

lighthouse/beacon_node/eth1/src/service.rs, line 1132 in 19b3ab3
This part of the code hasn't changed in ages though, so I could be wrong or running into an edge case.
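As a sketch of the fix pattern the PR title describes (hold a single read guard and reuse it, or copy what you need out of it, rather than calling `read()` a second time), assuming the same parking_lot-style lock; the names below are illustrative, not the actual `service.rs` code:

```rust
use parking_lot::RwLock;

// Illustrative only: take the read lock once, copy the needed value out of the
// guard, and drop the guard before doing anything else. With no second `read()`,
// a queued writer can no longer wedge this thread against itself.
fn latest_entry(cache: &RwLock<Vec<u64>>) -> Option<u64> {
    let guard = cache.read();           // single read acquisition
    let latest = guard.last().copied(); // copy the value out while the guard is held
    drop(guard);                        // release before any further work
    latest
}

fn main() {
    let cache = RwLock::new(vec![1, 2, 3]);
    assert_eq!(latest_entry(&cache), Some(3));
}
```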