Using rayon under a Mutex can lead to deadlocks #592
Solutions that come to my mind:
Mostly, my expectation is that the answer here would be "don't do that", but it is sort of disappointing, since we would like you to be able to add parallelism freely. That doesn't realistically work around blocking things though, and mutexes fit that model of course.
I think that using a (Of course, you may not have control of all the relevant code.)
Why do you call this global? It should install in whatever threadpool you call it on. As long as that's distinct from the threadpool that grabbed the mutex, and there's nothing in it that calls back to the mutex's threadpool, I think it would work.
I think this would only help if they had separate stacks, otherwise your
Sorry, I misread it as
Oh, actually, I think
I've run into this problem again (or rather, I never fixed it and was just lucky not to hit the deadlock in the meantime). I tried:

```rust
with_mutex(|| {
    rayon::ThreadPoolBuilder::new()
        .build()
        .unwrap()
        .install(|| {
            parallel_processing()
        })
})
```

but as you've said, it didn't work around the issue. All threads from the global pool are waiting for the mutex (which is entirely expected in my case), but then all the threads from the second pool end up blocked as well.
Is there any hope of this being fixed in Rayon? I don't know if I should wait for a fix, or remove Rayon from the mutex-using part of the code.
I don't have a sense of what the fix should be, never mind giving any ETA on it.
The fixes that I suppose could solve it:
That use would still be a deadlock-loop if the lock holder -- or anything the lock holder is waiting for -- is blocked by you on the call stack. They'll never complete until you return to them.
I think this is feasible.
If you're really only reading the mutexed data in the thread pool, you could also consider an RwLock.
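A minimal sketch of that suggestion, assuming the shared data really is only read inside the pool (the function and values here are illustrative):

```rust
use std::sync::{Arc, RwLock};
use rayon::prelude::*;

// With an RwLock, many rayon threads can hold read guards at once, so
// read-only parallel work is not serialized behind a single Mutex holder.
// (Recursive read locks can still block if a writer is waiting, so this
// only helps when the parallel section takes no write locks.)
fn sum_in_parallel(shared: &Arc<RwLock<Vec<i32>>>) -> i32 {
    let data = shared.read().unwrap();
    data.par_iter().copied().sum()
}

fn main() {
    let shared = Arc::new(RwLock::new((1..=100).collect::<Vec<i32>>()));
    println!("{}", sum_in_parallel(&shared));
}
```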
If this isn't going to be fixed, I would suggest documenting it as a known issue and adding a warning to the documentation. I just spent nearly two hours debugging a deadlock before finding this issue 😢
I was able to get something like this working by adding a feature for a full-blocking thread pool:

```rust
use std::sync::{Arc, Mutex};
use rayon::ThreadPoolBuilder;
use rayon::iter::IntoParallelRefIterator;
use rayon::iter::ParallelIterator;

fn mutex_and_par(mutex: Arc<Mutex<Vec<i32>>>, blocking_pool: &rayon::ThreadPool) {
    // Lock the mutex and collect items using the full-blocking thread pool
    let vec = mutex.lock().unwrap();
    let result: Vec<i32> = blocking_pool.install(|| vec.par_iter().cloned().collect());
    println!("{:?}", result);
}

#[test]
fn test_issue592() {
    let collection = vec![1, 2, 3, 4, 5];
    let mutex = Arc::new(Mutex::new(collection));
    // full_blocking() is the feature added for this experiment; it is not part of released rayon.
    let blocking_pool = ThreadPoolBuilder::new()
        .full_blocking()
        .num_threads(4)
        .build()
        .unwrap();
    let dummy_collection: Vec<i32> = (1..=100).collect();
    dummy_collection.par_iter().for_each(|_| {
        mutex_and_par(mutex.clone(), &blocking_pool);
    });
}
```

See for more details:
https://users.rust-lang.org/t/using-rayon-in-code-locked-by-a-mutex/20119
@kornelski presented the following pseudo-code representing a possible deadlock:
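A minimal sketch of that kind of code, reusing the names from the walkthrough below (kornelski's original pseudo-code may have differed), is:

```rust
use std::sync::Mutex;
use rayon::prelude::*;

// Each outer job takes the lock and then runs *more* rayon work while holding it.
fn mutex_and_par(mutex: &Mutex<Vec<i32>>) {
    let vec = mutex.lock().unwrap();
    // Nested rayon call while the mutex is still held.
    let _doubled: Vec<i32> = vec.par_iter().map(|x| x * 2).collect();
}

fn main() {
    let mutex = Mutex::new((1..=5).collect::<Vec<i32>>());
    let outer: Vec<i32> = (1..=100).collect();
    // Outer parallel loop: every iteration locks the mutex and spawns nested jobs.
    // Running this can genuinely deadlock, which is the failure mode analyzed below.
    outer.par_iter().for_each(|_| mutex_and_par(&mutex));
}
```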
I think the way this fails is something like:

1. `for_each(mutex_and_par)` calls `mutex_and_par`, which takes the lock
2. `.par_iter()` splits the locked data into a few jobs
3. `collect()` waits on one of those jobs
4. while waiting, that thread steals another `for_each(mutex_and_par)` call, which tries to take the lock again

If serialized, this code would never be a problem, but a work-stealing threadpool can cause it to implicitly recurse the lock. I'm not sure what we can do about this, or if there's any better advice we can give than just "don't call rayon under a mutex"...
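One way to follow that advice, sketched here under the assumption that the protected data is cheap to clone: hold the lock only long enough to copy the data out, drop the guard, and then do the rayon work on the copy so no lock is held inside the pool.

```rust
use std::sync::Mutex;
use rayon::prelude::*;

// Copy the data out under the lock, release the lock, then parallelize
// over the private copy; rayon never runs while the mutex is held.
fn process_without_holding_lock(mutex: &Mutex<Vec<i32>>) -> i32 {
    let snapshot: Vec<i32> = {
        let guard = mutex.lock().unwrap();
        guard.clone()
    }; // guard dropped here, before any rayon call
    snapshot.par_iter().map(|x| x * x).sum()
}

fn main() {
    let mutex = Mutex::new((1..=10).collect::<Vec<i32>>());
    println!("{}", process_without_holding_lock(&mutex));
}
```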