WIP: parallel task runtime #22631

Closed · wants to merge 4 commits

Conversation

@kpamnany (Contributor) commented Jun 30, 2017

This replaces the existing fork-join threading infrastructure with a parallel task runtime (partr) that implements parallel depth-first scheduling. This model fully supports nested parallelism.

The default remains the original threading code. Enable partr by setting JULIA_PARTR := 1 in your Make.user.

The core idea is simple -- Julia tasks can now be run by any thread. The task scheduler attempts to order task execution depth-first for provably better cache efficiency, and for true nested parallelism.

However, since tasks already exist in Julia and are used in a number of places, this PR first introduces the infrastructure that will enable parallel tasks, while (hopefully) preserving the serial semantics of the existing task interface. This PR does not introduce any new interface calls for parallel tasks -- those will come in future PRs.

All test cases pass with JULIA_PARTR off (as they should). With JULIA_PARTR on, all test cases are currently passing on Linux and OS X.

Cc: @JeffBezanson, @vtjnash, @yuyichao, @ViralBShah, @vchuravy, @anton-malakhov.

@kpamnany added the multithreading (Base.Threads and related functionality) label Jun 30, 2017
@jtravs (Contributor) commented Jun 30, 2017

Any chance you could give a very simple example of what the interface would look like in user code?

@kpamnany (Contributor, Author) commented

The Julia interface is not designed yet. While elaborations are possible, the essence of the interface is similar to Cilk's, so something like:

t1 = @spawn foo(1, 2)      # foo(1, 2) will run asynchronously, possibly in another thread
res1 = @sync t1            # res1 will get the return value of foo(1, 2)
t2 = @parfor (+) i = 1:10  # iterations may run in parallel
    i - 1
end
res2 = @sync t2            # res2 == 45

@tknopp (Contributor) commented Jun 30, 2017

@kpamnany: Does this mean I can have a background thread/task running in parallel with the foreground thread? In other words: can I run a thread asynchronously?

@kpamnany (Contributor, Author) commented

@tknopp: yes, that's what spawn will do. If you spawn a task and there's more than one thread, it will start running right away. It will continue to run until a yield point (another spawn, a sync, a parfor, or an explicit yield).

while (jl_atomic_load_acquire(&tiarg->state) == TI_THREAD_INIT)
    jl_cpu_pause();

// Assuming the functions called below doesn't contain unprotected GC
Review comment from a Member:
"doesn't" -> "don't"?

STATIC_INLINE uint64_t cong(uint64_t max, uint64_t unbias, uint64_t *seed)
{
    while ((*seed = 69069 * (*seed) + 362437) > unbias)
        ;
Review comment from a Member:

Semicolon on separate line as a ghost of the empty loop body or unintentional extra space?
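For context, the loop is rejection sampling to remove modulo bias: the LCG keeps advancing until it yields a value at or below the precomputed unbiased threshold. A commented sketch of the same idea (a hypothetical helper, not this PR's code; it assumes unbias is the largest value below which % max is uniform):

#include <stdint.h>

/* Hypothetical sketch of the unbiased bounded draw: advance a Marsaglia
 * CONG-style LCG, rejecting values above `unbias` so that the final
 * `% max` does not skew toward small results. */
static uint64_t bounded_rand(uint64_t max, uint64_t unbias, uint64_t *seed)
{
    do {
        *seed = 69069 * (*seed) + 362437;  /* advance the LCG */
    } while (*seed > unbias);              /* reject the biased tail */
    return *seed % max;                    /* now uniform in [0, max) */
}

Written as a do/while, the empty loop body (and its lone semicolon) disappears.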

src/partr.c Outdated

init_started_thread();

// Assuming the functions called below doesn't contain unprotected GC
Review comment from a Member:

"doesn't" -> "don't" here as well?

@ViralBShah (Member) commented

Seems worthwhile to experiment with the Projects feature for this one.

@davidanthoff (Contributor) commented

Just out of curiosity, is this something that might make it into 1.0?

@ViralBShah (Member) commented

We will try to get it into 1.0 if it is ready before the feature freeze, but we will not hold 1.0 for it. I am personally hopeful that it will be ready by 1.0. Hope that helps.

@StefanKarpinski (Member) commented Jul 11, 2017

If it can't make it into 1.0 in complete form, we can at least include it in experimental form and try our damnedest to leave room for it in the 1.x series; we really don't want to have to wait until 2.0 for full-on threading support.

@amitmurthy (Contributor) commented

@kpamnany, will the spawned tasks be able to perform asynchronous I/O via the libuv event loop? Is the plan to have the main Julia thread run the event loop and perform all compute-only tasks on separate threads?

Currently, in the Distributed module, incoming requests are executed in separate tasks, and each invocation can also make additional remote calls. This leads to all I/O being blocked while a worker is busy with a computation. Will it be possible for incoming requests to be executed via a threadpool, with any I/O calls routed internally to the main Julia thread running the event loop?

@kpamnany (Contributor, Author) commented

@amitmurthy: who runs the event loop (and how) is a good question. As a general statement, irrespective of Julia, unless you reserve a thread for I/O, it is possible for requests to be serviced late/very slowly. But you don't always have/want a thread to reserve. Ideally, this should be a program choice.

Executing tasks are not preempted. The API entry points (spawn, sync, and parfor) may cause the calling task to yield. However, this runtime allows for sticky tasks, i.e. tasks that only run on the thread that started them. Sticky tasks do not yield in spawn and parfor. So, you can create a sticky I/O task and drive the event loop from it. It's pretty straightforward to allow tasks to perform asynchronous I/O requests, but it isn't obvious how to get completion notifications. I'm not entirely sure how to do this right now but @vtjnash and @JeffBezanson have probably thought this out in greater detail (they suggested sticky tasks).
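For illustration, a minimal libuv sketch of what such a dedicated I/O driver could look like at the C level; the function names are hypothetical and this is a C-level analogue of the sticky I/O task, not this PR's code:

#include <uv.h>

/* Hypothetical sketch: a thread reserved for driving a libuv event loop,
 * the C-level analogue of the "sticky I/O task" described above. Compute
 * threads would hand it work via uv_async handles. */
static void io_thread_main(void *arg)
{
    uv_loop_t *loop = (uv_loop_t *)arg;
    uv_run(loop, UV_RUN_DEFAULT);  /* runs while any handles/requests are active */
}

int start_io_thread(uv_thread_t *tid, uv_loop_t *loop)
{
    return uv_thread_create(tid, io_thread_main, loop);
}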

Clearly it would be a useful enhancement to this runtime to add the ability to trigger a task based on an event. But that gets us into having to define events and decide semantics for event mux/demux, which opens many questions: are there system events? Can multiple tasks be triggered by the same event? What about the conjunction or disjunction of multiple events? I'm not sure we should go down this rabbit hole right now.

@JeffBezanson (Member) commented

Our existing Tasks can already be triggered by events, so we're already in the rabbit hole. We can't fully leave this up to applications; we need to make some default choice for people.

@StefanKarpinski (Member) commented

It seems like the default should probably be to have a sticky I/O thread since most applications don't need all of the threads. For really high performance situations where one wants to defer I/O until the I/O thread wakes up, we should probably have people opt into that.

@amitmurthy (Contributor) commented

I googled a bit about integrating libuv and multithreading; see
http://docs.libuv.org/en/v1.x/threading.html, http://docs.libuv.org/en/v1.x/async.html,
and https://nikhilm.github.io/uvbook/threads.html#threads

I would like to try out the following simple model in parallel with the work being done in this PR (a rough C sketch of steps 4 and 5 follows below).

  1. The main Julia thread continues to handle all I/O and the event infrastructure.
  2. Provide an API to run a 0-argument Julia function on a thread selected from a threadpool. Let's call this a compute thread.
  3. The compute thread forwards all I/O and event-handling calls (sleep, Timers, notify, etc.) to the main Julia thread.
  4. I/O request forwarding from compute_thread -> main_thread is done via a multiple-writer, single-reader queue, with uv_async_send used to notify the event loop. All compute threads push their I/O requests onto this queue, which is processed by the main_thread (running the event loop).
  5. compute_threads are notified of I/O completion events via a regular system condition variable (uv_cond_t and uv_cond_signal), one condition variable per compute_thread.
  6. Julia code running on a compute_thread is therefore not a Julia Task in the regular sense. It is just Julia code running on a separate thread. All calls requiring libuv facilities are forwarded to the main_thread, and the compute_thread waits for completion (on a system condition variable).
  7. The fact that the Julia I/O API is designed as a blocking interface (while being fully event-driven and asynchronous under the hood) makes this model much easier to implement.

At the very least it will help in getting a handle on libuv event loop integration in a multi-threaded environment.
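Under the assumptions above, steps 4 and 5 might look roughly like this in C with libuv. Every name is illustrative, not part of this PR, and for brevity the request body runs synchronously on the main thread; real I/O would signal completion from its libuv callback instead:

#include <uv.h>

/* Hypothetical sketch of steps 4 and 5: a compute thread pushes a request
 * onto a shared multi-writer queue, wakes the main thread's event loop
 * with uv_async_send, and blocks on its own condition variable until the
 * main thread signals completion. */

typedef struct io_request {
    void (*run)(struct io_request *);  /* work to perform on the main thread */
    int done;
    uv_mutex_t lock;
    uv_cond_t cond;                    /* one per compute thread (step 5) */
    struct io_request *next;
} io_request_t;

static io_request_t *queue_head;       /* multi-writer, single-reader queue (step 4) */
static uv_mutex_t queue_lock;
static uv_async_t queue_notify;        /* registered with the main loop via uv_async_init */

/* Called from a compute thread. */
void forward_io(io_request_t *req)
{
    uv_mutex_lock(&queue_lock);
    req->next = queue_head;
    queue_head = req;
    uv_mutex_unlock(&queue_lock);
    uv_async_send(&queue_notify);      /* safe to call from any thread */

    uv_mutex_lock(&req->lock);
    while (!req->done)
        uv_cond_wait(&req->cond, &req->lock);
    uv_mutex_unlock(&req->lock);
}

/* uv_async callback; runs on the main thread inside the event loop. */
void drain_queue(uv_async_t *handle)
{
    (void)handle;
    uv_mutex_lock(&queue_lock);
    io_request_t *req = queue_head;
    queue_head = NULL;
    uv_mutex_unlock(&queue_lock);
    while (req) {
        io_request_t *next = req->next; /* req may be freed once signaled */
        req->run(req);                  /* issue the actual libuv work */
        uv_mutex_lock(&req->lock);
        req->done = 1;
        uv_cond_signal(&req->cond);     /* wake the waiting compute thread */
        uv_mutex_unlock(&req->lock);
        req = next;
    }
}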

@JeffBezanson (Member) commented

Thanks @amitmurthy, that sounds basically good. I suspect this can work with normal Tasks, though. When a Task (running on any thread) wants to do I/O, it queues its request and yields. When the I/O completes, the requesting Task can be restarted as usual.

int last_arriver(arriver_t *, int);
void *reduce(arriver_t *, reducer_t *, void *(*rf)(void *, void *), void *, int);
#endif // JULIA_ENABLE_PARTR

Review comment from a Member:

All of these should maybe be in a scheduler.h; they probably shouldn't be called by miscellaneous run time system code.

@JeffBezanson (Member) commented

@kpamnany jl_switchto(task) should work for starting and resuming a task. Fortunately we are already storing per-thread stack address information, so if all tasks are put in sticky queues this might work now.

@rveltz commented Sep 27, 2017

Any chance of making this into a package first?

@anton-malakhov commented
Hi folks! #iamintel, here to help Kiran push multithreading forward as he has transitioned to other projects. He has asked me to work on the libuv-related parts while he finishes some others.

@amitmurthy, are you working on the approach you suggested on Jul 19th?

It looks good enough, though it is still vulnerable to the situation where the main thread blocks in compute-intensive user code, stalling all I/O and event handling for that time. Moving the event loop onto a separate dedicated I/O thread would work, but it would add overhead to the single-threaded case if requests must always be packed and sent to other threads.

Studying the Go runtime, I'd like to follow its flexibility in scheduling event polling on any available thread. Unfortunately, that is not possible with the current libuv implementation, but it is still possible to run multiple loops in parallel, as libuv recommends for multithreading. libuv may also evolve to support multithreading better, in which case an implementation that explicitly communicates with a single uv_loop would require deep refactoring again. Thus, I'd suggest starting with one uv_loop per thread and, as a next step, implementing a 'loop stealing/syncing' mechanism that would allow a request to migrate from its originating thread to another by stealing or mailing the whole uv_loop instance or the underlying handles. Do you see any issues with this design?
Of course, if you have already implemented your idea, I can work on something more important for the next release. We can also continue on Slack's #parallel.
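As a rough illustration of the "uv_loop per thread" starting point (all names hypothetical; loop stealing/mailing would be layered on later):

#include <stdlib.h>
#include <uv.h>

/* Hypothetical sketch: each thread lazily creates and owns a private
 * event loop. uv_key_create(&loop_key) must be called once at startup. */
static uv_key_t loop_key;              /* thread-local storage slot for the loop */

uv_loop_t *thread_local_loop(void)
{
    uv_loop_t *loop = (uv_loop_t *)uv_key_get(&loop_key);
    if (loop == NULL) {
        loop = malloc(sizeof(uv_loop_t));
        uv_loop_init(loop);            /* fresh loop, private to this thread */
        uv_key_set(&loop_key, loop);
    }
    return loop;
}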

@ViralBShah (Member) commented

@amitmurthy is off the grid for the next couple of weeks.

@c42f (Member) commented Oct 24, 2018

Shall I merge #29791 in here?

@kpamnany (Contributor, Author) commented

Looks like a big patch, and this is a big patch too with some time pressure to merge. @JeffBezanson can make the call.

@Keno (Member) commented Oct 25, 2018

@kpamnany note that this branch has conflicts with master. @c42f's PR resolves those conflicts.

@kpamnany (Contributor, Author) commented

Ah, I see now: it is already on master, so it has to be merged. It'd be best if we could get runtests to complete successfully on this branch before adding new code to fix, though. Unless the new code can help?

@c42f (Member) commented Oct 25, 2018

The new code is likely to help in the circumstance that you're losing stack traces of exceptions thrown in tasks due to context switching. Other than that, it only resolves the conflicts with master.

@JeffBezanson (Member) commented

OK, I think I found the next problem: wait/isready/n_avail depend on the length of c.putters, but the partr code doesn't use that array any more. I will change them to use the state of cond_put.

@JeffBezanson (Member) commented

Now other tests pass but there is a mystery crash in the embedding test. 😡

src/task.c Outdated
static void record_backtrace(jl_ptls_t ptls) JL_NOTSAFEPOINT
{
    // storing bt_size in ptls ensures roots in bt_data will be found
    ptls->bt_size = rec_backtrace(ptls->bt_data, JL_MAX_BT_SIZE);
Review comment from a Member:

I agree this is nice to have factored out (especially for having a place to put the note about rooting). I only removed it for symmetry with the equivalent code which had removed it in partr.c. We should probably just call record_backtrace there as well.

Reply from a Member:

Either is fine, I was just minimizing the diff for now.

@kpamnany (Contributor, Author) commented

Such good progress! The Win* failures are mystifying -- I see a lot of fatal: cannot change to '/cygdrive/c/projects/julia': No such file or directory messages?

We're still going to turn JULIA_PARTR off before merge I would think?

@kcajf (Contributor) commented Dec 5, 2018

I don't understand all the complexity and current state of this PR, so please forgive me if I sound impatient or pushy, but I was wondering what the chances are of seeing some part of this released in the upcoming 1.1? Are there any easy tasks that someone like me, who isn't very familiar with the language-implementation side of Julia, could work on to help push things along?

@StefanKarpinski (Member) commented Dec 5, 2018

The remaining blocker is unresolved bugs, mainly on Windows. So if you have a Windows system (or anything else, actually), you could clone this branch, build it, and run all the tests. If there are crashes, try to debug them. I don't expect that will be particularly easy, but help is welcome.

Note that you probably also need to be on #30186 or one of the other Channel API revision branches; I'm a bit unclear on which one you should be on at this point.

@JeffBezanson (Member) commented

This branch still ought to work on its own, at least with 1 thread.

kpamnany and others added 3 commits December 5, 2018 14:37
Added partr code. Abstracted interface to threading infrastructure.
@JeffBezanson (Member) commented

win32: generate_precompile hanging
win64: a node 1 test is hanging, either precompile, SharedArrays, or Distributed

@kpamnany (Contributor, Author) commented

Refactored into #30806, #30838 and #31398.
