@@ -179,7 +179,7 @@ this on `utop`.
# let pool = Task.setup_pool ~num_additional_domains:3
val pool : Task.pool = <abstr>
- ```
+ ```
We have created a new task pool with three new domains. The parent domain is
also part of this pool, thus making it a pool of four domains. After the pool is
set up, we can use this pool to execute all tasks we want to run in parallel. The
@@ -285,7 +285,7 @@ to be executed.
Parallel for also has an optional parameter ` chunk_size ` . It determines the
granularity of tasks when executing them on multiple domains. If no parameter
is given for ` chunk_size ` , a default chunk size is determined which performs
- well in most cases. Only if the default chunk size doesn't work well, it is
+ well in most cases. Only if the default chunk size doesn't work well, is it
recommended to experiment with different chunk sizes. The ideal ` chunk_size `
depends on a combination of factors:
@@ -297,7 +297,7 @@ iterations divided by the number of cores. On the other hand, if the amount of
time taken is different for every iteration, the chunks should be smaller. If
the total number of iterations is a sizeable number, a ` chunk_size ` like 32 or
16 is safe to use, whereas if the number of iterations is low, say 10, a
- ` chunk_size ` of 1 would perform best.
+ ` chunk_size ` of 1 would perform best.

* ** Machine:** Optimal chunk size varies across machines and it is recommended
to experiment with a range of values to find out what works best on yours.
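To make this concrete, here is a minimal sketch (not from the tutorial itself) that passes an explicit `chunk_size` to `Task.parallel_for`; it assumes a pool set up as above and a deliberately tiny loop with the hypothetical array `squares`:

```ocaml
open Domainslib

let () =
  let pool = Task.setup_pool ~num_additional_domains:3 in
  let n = 10 in
  let squares = Array.make n 0 in
  (* only 10 iterations, so chunk_size:1 hands out one iteration at a time;
     for large loops the default chunk size is usually good enough *)
  Task.parallel_for pool ~chunk_size:1 ~start:0 ~finish:(n - 1)
    ~body:(fun i -> squares.(i) <- i * i);
  Task.teardown_pool pool
```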
@@ -350,14 +350,14 @@ let parallel_matrix_multiply_3 pool m1 m2 m3 =
let t = Array.make_matrix size size 0 in (* stores m1*m2 *)
let res = Array.make_matrix size size 0 in

- Task.parallel_for pool ~chunk_size:(size/num_domains) ~start:0 ~finish:(size - 1) ~body:(fun i ->
+ Task.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i ->
for j = 0 to size - 1 do
for k = 0 to size - 1 do
t.(i).(j) <- t.(i).(j) + m1.(i).(k) * m2.(k).(j)
done
done);

- Task.parallel_for pool ~chunk_size:(size/num_domains) ~start:0 ~finish:(size - 1) ~body:(fun i ->
+ Task.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i ->
for j = 0 to size - 1 do
for k = 0 to size - 1 do
res.(i).(j) <- res.(i).(j) + t.(i).(k) * m3.(k).(j)
@@ -505,7 +505,7 @@ The above example would be essentially blocking indefinitely because the `send`
does not have a corresponding receive. If we instead create a bounded channel
with buffer size n, it can store up to n objects in the channel without a
corresponding receive, beyond which sending would block. We can try it
- with the same example as above just by changing the buffer size to 1.
+ with the same example as above just by changing the buffer size to 1.

``` ocaml
open Domainslib
@@ -611,7 +611,7 @@ let _ =
worker (update results) ();
Array.iter Domain.join domains;
Array.iter (Printf.printf "%d ") results
- ```
+ ```

We have created an unbounded channel ` c ` which will act as a store for all the
tasks. We'll pay attention to two functions here: ` create_work ` and ` worker ` .
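In outline, the pattern is roughly the following simplified sketch (not the exact listing above; it assumes `Chan.make_unbounded`, `Chan.send` and `Chan.recv_poll` from Domainslib):

```ocaml
open Domainslib

let c = Chan.make_unbounded ()

(* create_work: push every task (here just an index) into the channel *)
let create_work n =
  for i = 0 to n - 1 do
    Chan.send c i
  done

(* worker: keep polling the channel and running tasks until it is empty *)
let rec worker f () =
  match Chan.recv_poll c with
  | Some task -> f task; worker f ()
  | None -> ()

let () =
  create_work 100;
  let domains =
    Array.init 3 (fun _ -> Domain.spawn (worker (Printf.printf "%d ")))
  in
  worker (Printf.printf "%d ") ();
  Array.iter Domain.join domains
```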
@@ -659,7 +659,7 @@ that if a lot more time is spent outside the function we'd like to parallelise,
the maximum speedup we could achieve would be lower.
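This is just Amdahl's law: if a fraction p of the runtime can be parallelised across n cores, the overall speedup is bounded by 1 / ((1 - p) + p/n). A quick back-of-the-envelope sketch (the `max_speedup` helper is only illustrative):

```ocaml
(* upper bound on speedup predicted by Amdahl's law,
   for a parallelisable fraction p of the runtime and n cores *)
let max_speedup ~p ~n = 1. /. ((1. -. p) +. p /. float_of_int n)

let () =
  (* e.g. 90% of the time in the hotspot and 4 cores: at most ~3.08x overall *)
  Printf.printf "%.2f\n" (max_speedup ~p:0.9 ~n:4)
```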
Profiling serial code can help us discover the hotspots where we might want to
- introduce parallelism.
+ introduce parallelism.

```
Samples: 51K of event 'cycles:u', Event count (approx.): 28590830181
@@ -791,7 +791,7 @@ Shared Data Cache Line Table (2 entries, sorted on Total HITMs)
----------- Cacheline ---------- Total Tot ----- LLC Load Hitm ----- ---- Store Reference ---- --- Loa
Index Address Node PA cnt records Hitm Total Lcl Rmt Total L1Hit L1Miss Lc
0 0x7f2bf49d7dc0 0 11473 13008 94.23% 1306 1306 0 1560 595 965 ◆
- 1 0x7f2bf49a7b80 0 271 368 5.48% 76 76 0 123 76 47
+ 1 0x7f2bf49a7b80 0 271 368 5.48% 76 76 0 123 76 47
```

As evident from the report, there's quite a lot of false sharing happening in
@@ -953,7 +953,7 @@ So far we have only found that there is an imbalance in task distribution
in the code; we'll need to change our code accordingly to make the task
distribution more balanced, which could increase the speedup.

- ---
+ ---

Performance debugging can be quite tricky at times. If you could use some help in
debugging your Multicore OCaml code, feel free to create an issue in the