Skip to content

[SYCL] Removed mutex leading to deadlock #889

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Dec 5, 2019

Conversation

KarachunIvan
Copy link
Contributor

Imagine the situation, when there is a global buffer, and several threads
trying to write some data in it. Each thread creates a host accessor to the
buffer and writes its id into the memory.
When a host accessor is created, SYCL RT creates two commands:
UpdateHostRequirement and EmptyCommand.
So SYCL execution graph will look like this:

... -> [EmpCmd T2] -> [UpdReq T2] -> [EmpCmd T1] -> [UpdReq T1] -> [AllocaBuf]

An EmptyCommand acts like mutex: it blocks all operations execution which were
created after host accessor creation while host accessors is "alive".
Let's assume that thread #2 first started an execution by enqueueing UpdReq T2
command. But the command cannot be started since its execution is blocked by
EmpCmd T1. Since thread #2 locked the mutex in Scheduler::waitForEvent,
thread #1 hangs on this mutex, which leads to deadlock.

Signed-off-by: Ivan Karachun ivan.karachun@intel.com

constexpr size_t threadCount = 4;
std::size_t data[size];
TestDeadLock<std::size_t, 1> Task(data, size);
Task.execute(threadCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppose a deadlock happens.
Will the test framework deal with it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but I guess the task might be killed by lit infrastructure due to execution timeout.

@romanovvlad
Copy link
Contributor

Imagine the situation, when there is a global buffer, and several threads
trying to write some data in it. Each thread creates a host accessor to the
buffer and writes its id into the memory.
When a host accessor is created, SYCL RT creates two commands:
UpdateHostRequirement and EmptyCommand.
So SYCL execution graph will look like this:

... -> [EmpCmd T2] -> [UpdReq T2] -> [EmpCmd T1] -> [UpdReq T1] -> [AllocaBuf]

An EmptyCommand acts like mutex: it blocks all operations execution which were
created after host accessor creation while host accessors is "alive".
Let's assume that thread #2 first started an execution by enqueueing UpdReq T2
command. But the command cannot be started since its execution is blocked by
EmpCmd T1. Since thread #2 locked the mutex in Scheduler::waitForEvent,
thread #1 hangs on this mutex, which leads to deadlock.

Signed-off-by: Ivan Karachun ivan.karachun@intel.com

Description contains to many details not really related to the issue.
Suggest simplifying to something like this:

  1. thread1: add nodes to the graph for host accessor A1 to the buffer B
  2. thread2: add nodes to the graph for host accessor A2 to the buffer B
  3. thread2: wait for host accessor A2 nodes to complete
  4. thread1: wait for host accessor A1 nodes to complete

wait on step 3 blocks the mutex, but waits because for host accessor A1 to be destructed.
wait on step 4 cannot complete because the mutex is locked.

The deadlock appeared under following circumstances:
1) thread1: adds nodes to the graph for host accessor A1 to the buffer B;
2) thread2: adds nodes to the graph for host accessor A2 to the buffer B;
3) thread2: waits for host accessor A2 nodes to complete;
4) thread1: waits for host accessor A1 nodes to complete.

On step 3 thread2 locks a mutex in `Scheduler::waitForEvent`
and waits for destruction of host accessor A1.
Actions on step 4 cannot be completed because thread1 waits for the
mutex to be unlocked.

Signed-off-by: Ivan Karachun <ivan.karachun@intel.com>
@KarachunIvan KarachunIvan force-pushed the ivankara/private/thread_safety branch from e93584e to 5326535 Compare December 2, 2019 12:32
@KarachunIvan
Copy link
Contributor Author

Imagine the situation, when there is a global buffer, and several threads
trying to write some data in it. Each thread creates a host accessor to the
buffer and writes its id into the memory.
When a host accessor is created, SYCL RT creates two commands:
UpdateHostRequirement and EmptyCommand.
So SYCL execution graph will look like this:
... -> [EmpCmd T2] -> [UpdReq T2] -> [EmpCmd T1] -> [UpdReq T1] -> [AllocaBuf]
An EmptyCommand acts like mutex: it blocks all operations execution which were
created after host accessor creation while host accessors is "alive".
Let's assume that thread #2 first started an execution by enqueueing UpdReq T2
command. But the command cannot be started since its execution is blocked by
EmpCmd T1. Since thread #2 locked the mutex in Scheduler::waitForEvent,
thread #1 hangs on this mutex, which leads to deadlock.
Signed-off-by: Ivan Karachun ivan.karachun@intel.com

Description contains to many details not really related to the issue.
Suggest simplifying to something like this:

  1. thread1: add nodes to the graph for host accessor A1 to the buffer B
  2. thread2: add nodes to the graph for host accessor A2 to the buffer B
  3. thread2: wait for host accessor A2 nodes to complete
  4. thread1: wait for host accessor A1 nodes to complete

wait on step 3 blocks the mutex, but waits because for host accessor A1 to be destructed.
wait on step 4 cannot complete because the mutex is locked.

Commit message was updated.

Signed-off-by: Ivan Karachun <ivan.karachun@intel.com>
@KarachunIvan KarachunIvan force-pushed the ivankara/private/thread_safety branch from 5326535 to 09c03e1 Compare December 2, 2019 12:42
Signed-off-by: Ivan Karachun <ivan.karachun@intel.com>
Signed-off-by: Ivan Karachun <ivan.karachun@intel.com>
@KarachunIvan KarachunIvan force-pushed the ivankara/private/thread_safety branch from 080a729 to 16d0b02 Compare December 4, 2019 10:19
Copy link
Contributor

@s-kanaev s-kanaev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


template <typename Func, typename... Args>
void enqueue(Func func, Args... args) {
MThreadPool.push_back(std::thread(func, args...));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor

Suggested change
MThreadPool.push_back(std::thread(func, args...));
MThreadPool.emplace_back(std::forward(func), std::forward(args)...);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied

Signed-off-by: Ivan Karachun <ivan.karachun@intel.com>
@KarachunIvan KarachunIvan force-pushed the ivankara/private/thread_safety branch from 18da51d to b55d511 Compare December 4, 2019 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants