### Real Life Example of a Data Race

In this [research project (private repo)](https://github.com/floswald/FMig.jl/pull/17/commits/119b34bbf621374be187a023399d74a5e16c3934) we encountered a situation very similar to the one above.

## 3. Distributed Computing

The [manual](https://docs.julialang.org/en/v1/manual/distributed-computing/) is again very helpful here.

So, by default we have process number 1 (which is the one where we type stuff in).

Distributed programming in Julia is built on two primitives: remote references and remote calls. A remote reference is an object that can be used from any process to refer to an object stored on a particular process. A remote call is a request by one process to call a certain function on certain arguments on another (possibly the same) process.
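As a minimal illustration of those two primitives (a sketch, assuming Julia was started with at least one worker, e.g. `julia -p 2`): `remotecall` returns a `Future`, which is a remote reference, and `fetch` waits for and retrieves its value:

```julia
using Distributed   # assuming julia was started with `julia -p 2`

# remote call: ask worker 2 to evaluate rand(2,2); returns immediately
f = remotecall(rand, 2, 2, 2)   # f is a Future, a remote reference

# fetch blocks until the result is ready and moves it to this process
A = fetch(f)                    # a 2×2 Matrix{Float64}
```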

In the manual you can see a low-level API which allows you to call a function directly on a remote worker, but most of the time that is not what you want. We will concentrate on the higher-level API here. One big issue in this context is:


### Code and Data Availability

We must ensure that the code we want to execute is available on the process that runs the computation. That sounds fairly obvious. But now try the following: first we define a new function at the REPL, and we call it on the master, as usual:

```julia
julia> function new_rand(dims...)
           return 3 * rand(dims...)
       end
new_rand (generic function with 1 method)

julia> new_rand(2,2)
2×2 Matrix{Float64}:
 0.407347  2.23388
 1.29914   0.985813
```

Now we want to spawn that function on `:any` available process, and we immediately `fetch` the result to trigger execution:

```julia
julia> fetch(@spawnat :any new_rand(2,2))
ERROR: On worker 3:
UndefVarError: `#new_rand` not defined
Stacktrace:
```

It seems that worker 3, where the job was sent with `@spawnat`, does not know about our function `new_rand`. 🧐
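One quick fix, sketched here under the assumption that the workers are already running: define the function on *all* processes with `@everywhere`:

```julia
using Distributed   # assuming julia was started with `julia -p 2`

# `@everywhere` evaluates the definition on the master *and* on every worker
@everywhere function new_rand(dims...)
    return 3 * rand(dims...)
end

# now every worker knows `new_rand`, so the remote call succeeds
fetch(@spawnat :any new_rand(2,2))
```

This works for quick experiments, but it does not scale well to larger code bases.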

Probably the best approach is to define your functions inside a module, as we already discussed. This way you will find it easy to share code and data across worker processes. Let's define this module in a file in the current directory, and call it `DummyModule.jl`:

```julia
module DummyModule

export MyType, new_rand

mutable struct MyType
    a::Int
end

function new_rand(dims...)
    return 3 * rand(dims...)
end

println("loaded") # just to check

end
```

Restart julia with `-p 2`. Now, to load this module on all processes, we use the `@everywhere` macro. In short, we have this situation:

```julia
floswald@PTL11077 ~/compecon> ls
DummyModule.jl
floswald@PTL11077 ~/compecon> julia -p 2

julia> @everywhere include("DummyModule.jl")
loaded                          # message from process 1
      From worker 2:    loaded  # message from process 2
      From worker 3:    loaded  # message from process 3

julia>
```

Now, in order to use the code, we need to bring it into scope with `using`. Here is the master process:

```julia
julia> using .DummyModule   # leading `.` for a locally defined module

julia> MyType(9)
MyType(9)

julia> fetch(@spawnat 2 MyType(9))
ERROR: On worker 2:
UndefVarError: `MyType` not defined
Stacktrace:

julia> fetch(@spawnat 2 DummyModule.MyType(7))
Main.DummyModule.MyType(7)
```
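Worker 2 has loaded the file, but has not brought the module into scope, which is why the unqualified `MyType` fails there while the qualified name works. A small sketch of one fix, assuming the session above: run `using` on every process, after which the short name works remotely too:

```julia
julia> @everywhere using .DummyModule

julia> fetch(@spawnat 2 MyType(7))
Main.DummyModule.MyType(7)
```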

Also, we can execute a function on a worker directly. Note that `remotecall_fetch` is like `fetch(remotecall(...))`, but more efficient:

```julia
# calls function new_rand from the DummyModule
# on worker 2;
# 3, 3 are the arguments passed to `new_rand`
julia> remotecall_fetch(DummyModule.new_rand, 2, 3, 3)
3×3 Matrix{Float64}:
 0.097928  1.93653  1.16355
 1.58353   1.40099  0.511078
 2.11274   1.38712  1.71745
```

### Data Movement

I would recommend to:

* keep data movement to a minimum,
* avoid global variables,
* define all required data within the `module` you load on the workers, such that each of them has access to all required data (this may not be feasible if you require huge amounts of input data).
* Example: [private repo again]
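The first point is nicely illustrated by a classic example from the Julia manual (a sketch, assuming worker 2 exists): in the first version the matrix is built on the master and shipped to the worker; in the second it is built where it is used, so only the fetched result ever travels:

```julia
using Distributed   # assuming julia was started with `julia -p 2`

# Method 1: construct locally, then move the whole matrix to worker 2
A = rand(1000, 1000)
Bref = @spawnat 2 A^2            # A is serialized and sent to the worker
fetch(Bref)

# Method 2: construct on the worker itself; no input data is sent
Bref2 = @spawnat 2 rand(1000, 1000)^2
fetch(Bref2)
```

For cheap operations the difference is negligible, but for large inputs Method 2 avoids moving a full 1000×1000 matrix across processes.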

### Example Setup: Real-World Project

Suppose we have the following structure on an HPC cluster:

```bash
floswald@PTL11077 ~/.j/d/LandUse (rev2)> tree -L 1
.
├── Manifest.toml
├── Project.toml
├── slurm_runner.run
├── run.jl
├── src
└── test
```

with this content for the file `run.jl`:

```julia
using Distributed

println("some welcome message from master")

# add 10 worker processes from the running master;
# notice that we start them in the same project environment!
addprocs(10, exeflags = "--project=.")

# make sure all packages are available everywhere
@everywhere using Pkg
@everywhere Pkg.instantiate()

# load the code for our application
@everywhere using LandUse

LandUse.bigtask(some_arg1 = 10, some_arg2 = 4)
```

The corresponding submit script for the HPC scheduler (SLURM in this case) would then just call this file:

```bash
#!/bin/bash
#SBATCH --job-name=landuse
#SBATCH --output=est.out
#SBATCH --error=est.err
#SBATCH --partition=ncpulong
#SBATCH --nodes=1
#SBATCH --cpus-per-task=11   # same number as we addproc'ed, + 1 for the master
#SBATCH --mem-per-cpu=4G     # memory per cpu-core

julia --project=. run.jl
```

### JuliaHub

The best alternative out there, IMHO, is [JuliaHub](https://juliahub.com). Instead of `run.jl`, you would have this:

```julia
using Distributed
using JSON3
using CSV
using DelimitedFiles

@everywhere using bk

results = bk.bigjob() # runs a parallel map over workers with `pmap` or similar

# bigjob writes plots and other output to path_results

# output path
path_results = "$(@__DIR__)/outputs"
mkpath(path_results)
ENV["RESULTS_FILE"] = path_results

# write text results (numbers etc) to JSON
open("results.json", "w") do io
    JSON3.pretty(io, results)
end

ENV["RESULTS"] = JSON3.write(results)
```

### Parallel Map and Loops

This is the most relevant use case in most of our applications.
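A minimal sketch of a parallel map, assuming a few workers and a hypothetical expensive function `simulate` standing in for a real model run: `pmap` farms out one call per element across the available workers and collects the results:

```julia
using Distributed
addprocs(4)   # start four workers, if none are running yet

# the function must be defined on all workers;
# `simulate` is a hypothetical stand-in for an expensive model run
@everywhere function simulate(seed)
    sum(rand(1_000)) / 1_000 + seed   # some fake work
end

# distribute the 20 calls over the 4 workers and gather the results
results = pmap(simulate, 1:20)
length(results)   # 20
```

`pmap` is best for relatively few, relatively expensive calls; for many tiny iterations the scheduling overhead dominates and `@distributed` loops are usually the better fit.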
