Description
Consider a setup where:
- There are multiple worker processes (possibly on different servers in a cluster).
- There are multiple threads in each process (possibly different amounts in different servers).
It isn't trivial to create such a setup: one needs to tweak the launching of worker processes so that they are multi-threaded. It would be easy if there were a command-line flag for `julia` that specified the number of threads, as requested in JuliaLang/julia#34309. But it is still possible to create such a setup today with a bit of effort, and it is useful, since all the threads in each worker process benefit from automatic shared memory for "everything", rather than being restricted to constructs such as `SharedArray`. Of course, this means one needs to be careful.
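As a rough illustration of the launch tweak, here is a minimal sketch for local workers only. It assumes that locally launched workers inherit the environment of the parent process, so setting `JULIA_NUM_THREADS` around `addprocs` is enough; remote workers on other servers would need their own mechanism, which is not shown here.

```julia
using Distributed

# Assumption: local workers inherit the parent's environment, so setting
# JULIA_NUM_THREADS around addprocs makes each worker start with 2 threads.
new_workers = withenv("JULIA_NUM_THREADS" => "2") do
    addprocs(2)
end

# Ask each worker how many threads it actually started with.
thread_counts = Dict(w => remotecall_fetch(Threads.nthreads, w) for w in new_workers)
println(thread_counts)

# Clean up the workers started by this sketch.
rmprocs(new_workers)
```

Checking the reported thread counts is worthwhile, since whether the setting takes effect depends on how the workers are launched.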
In such a scenario, the current behavior is very clear:
- A `@threads` loop uses the threads of the current (main or worker) process.
- A `@distributed` loop and `pmap` use a single thread in each worker process.
This has the advantage of simplicity and clarity. It also allows using a nested `@threads` in each iteration of `@distributed` or `pmap` to utilize all the threads in all the machines.
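The nesting that is possible today can be sketched as follows. The function name `sum_squares_threaded` and the chunking are illustrative assumptions; the point is only that each `pmap` task runs a `@threads` loop inside whichever process receives it.

```julia
using Distributed
@everywhere using Base.Threads

# Hypothetical per-chunk work: each pmap task spreads its chunk over the
# threads of the process it runs on.
@everywhere function sum_squares_threaded(chunk)
    partial = zeros(Int, length(chunk))
    @threads for i in eachindex(chunk)
        partial[i] = chunk[i]^2   # each iteration writes to its own slot
    end
    return sum(partial)
end

# One chunk per pmap task; pmap hands chunks to processes dynamically.
chunks = [1:250, 251:500, 501:750, 751:1000]
total = sum(pmap(sum_squares_threaded, chunks))
println(total)
```

Note that the chunk boundaries have to be chosen by hand here, which is the manual step the proposed constructs would remove.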
However, it would also be useful to have `@distributed_threads` and `pmap_threads`.
A `@distributed_threads` loop would statically allocate the same number of iterations to each thread across all the machines; that is, it would allocate more iterations to worker processes with more threads, and then internally use `@threads` to execute these on each worker process's threads. This would be the natural extension of `@distributed`, which uses static allocation of iterations to processes.
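To make the static allocation concrete, here is a minimal sketch of how iterations could be divided in proportion to thread counts. The helper `split_by_threads` is hypothetical and not part of `Distributed`; a real `@distributed_threads` might split the range differently.

```julia
# Hypothetical helper: split 1:n_items into one contiguous range per worker,
# with sizes proportional to that worker's number of threads.
function split_by_threads(n_items::Integer, threads_per_worker::Vector{Int})
    total_threads = sum(threads_per_worker)
    ranges = UnitRange{Int}[]
    start = 1
    assigned_threads = 0
    for t in threads_per_worker
        assigned_threads += t
        # Cumulative share of the items owed to the workers seen so far.
        stop = div(n_items * assigned_threads, total_threads)
        push!(ranges, start:stop)
        start = stop + 1
    end
    return ranges
end

# Three workers with 1, 2 and 3 threads sharing 12 iterations.
println(split_by_threads(12, [1, 2, 3]))
```

Each worker would then run a plain `@threads` loop over its own range, so every thread ends up with roughly the same number of iterations.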
A `pmap_threads` would dynamically allocate tasks to each thread across all machines. The batch size, if specified, would apply individually to each thread. It might be useful to add a second parameter, a batch group size (a positive number of batches), so that each worker process gets a whole group of batches at once and uses its threads to execute the smaller batches, reducing the amount of cross-process coordination required. This would be the natural extension of `pmap`, which uses dynamic allocation of iterations to processes.
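One possible shape for the batch-group idea is sketched below. The names `pmap_threads_sketch`, `run_batch_group`, `batch_size` and `group_size` are hypothetical, and the sketch assumes a non-empty input; it only approximates the proposal by layering `Threads.@spawn` on top of the existing `pmap`.

```julia
using Distributed

# Hypothetical worker-side step: split one group into batches and run each
# batch as a task that the threads of this process pick up dynamically.
@everywhere function run_batch_group(f, items, batch_size)
    batches = collect(Iterators.partition(items, batch_size))
    tasks = [Threads.@spawn map(f, batch) for batch in batches]
    return reduce(vcat, fetch.(tasks))   # results stay in input order
end

# Hypothetical driver: each pmap task carries group_size batches at once,
# so cross-process coordination happens once per group, not once per batch.
function pmap_threads_sketch(f, items; batch_size=1, group_size=1)
    groups = collect(Iterators.partition(items, batch_size * group_size))
    nested = pmap(g -> run_batch_group(f, g, batch_size), groups)
    return reduce(vcat, nested)
end

squares = pmap_threads_sketch(x -> x^2, 1:10; batch_size=2, group_size=2)
println(squares)
```

A larger `group_size` trades load balancing across processes for fewer round trips, which is the tension the second parameter is meant to expose.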