Commit 5013139

Nudge people to the default chunk_size setting
1 parent 4c9da3e commit 5013139

3 files changed: 15 additions, 15 deletions

README.md

Lines changed: 10 additions & 10 deletions
@@ -179,7 +179,7 @@ this on `utop`.

 # let pool = Task.setup_pool ~num_additional_domains:3
 val pool : Task.pool = <abstr>
-```
+```
 We have created a new task pool with three new domains. The parent domain is
 also part of this pool, thus making it a pool of four domains. After the pool is
 set up, we can use this pool to execute all tasks we want to run in parallel. The
@@ -285,7 +285,7 @@ to be executed.
 Parallel for also has an optional parameter `chunk_size`. It determines the
 granularity of tasks when executing them on multiple domains. If no parameter
 is given for `chunk_size`, a default chunk size is determined which performs
-well in most cases. Only if the default chunk size doesn't work well, it is
+well in most cases. Only if the default chunk size doesn't work well, is it
 recommended to experiment with different chunk sizes. The ideal `chunk_size`
 depends on a combination of factors:

@@ -297,7 +297,7 @@ iterations divided by the number of cores. On the other hand, if the amount of
 time taken is different for every iteration, the chunks should be smaller. If
 the total number of iterations is a sizeable number, a `chunk_size` like 32 or
 16 is safe to use, whereas if the number of iterations is low, like say 10, a
-`chunk_size` of 1 would perform best.
+`chunk_size` of 1 would perform best.

 * **Machine:** Optimal chunk size varies across machines and it is recommended
 to experiment with a range of values to find out what works best on yours.
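
The `chunk_size` guidance in this hunk can be made concrete with a small standalone sketch. It does not use Domainslib: `chunks` and `even_split_chunk_size` are hypothetical helpers, written only to show how a chunk size partitions an iteration range, and the library's real default heuristic may well differ.

```ocaml
(* Illustrative only; not Domainslib code.
   Split the inclusive range [start, finish] into consecutive chunks of at
   most [chunk_size] iterations each, returned as (low, high) pairs. *)
let chunks ~chunk_size ~start ~finish =
  if chunk_size <= 0 then invalid_arg "chunks: chunk_size must be positive";
  let rec go lo acc =
    if lo > finish then List.rev acc
    else
      let hi = min finish (lo + chunk_size - 1) in
      go (hi + 1) ((lo, hi) :: acc)
  in
  go start []

(* Hypothetical helper: the even split the old examples hard-coded
   (iterations divided by the number of domains), clamped to at least 1. *)
let even_split_chunk_size ~num_domains ~iterations =
  max 1 (iterations / num_domains)

let () =
  (* 10 iterations with a chunk size of 4 gives three tasks. *)
  chunks ~chunk_size:4 ~start:0 ~finish:9
  |> List.iter (fun (lo, hi) -> Printf.printf "[%d..%d] " lo hi);
  print_newline ()
```

With 10 iterations and a chunk size of 4, the last chunk is shorter than the others. A chunk size of 1 yields one task per iteration, which matches the advice above for small iteration counts.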
@@ -350,14 +350,14 @@ let parallel_matrix_multiply_3 pool m1 m2 m3 =
   let t = Array.make_matrix size size 0 in (* stores m1*m2 *)
   let res = Array.make_matrix size size 0 in

-  Task.parallel_for pool ~chunk_size:(size/num_domains) ~start:0 ~finish:(size - 1) ~body:(fun i ->
+  Task.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i ->
     for j = 0 to size - 1 do
       for k = 0 to size - 1 do
         t.(i).(j) <- t.(i).(j) + m1.(i).(k) * m2.(k).(j)
       done
     done);

-  Task.parallel_for pool ~chunk_size:(size/num_domains) ~start:0 ~finish:(size - 1) ~body:(fun i ->
+  Task.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i ->
     for j = 0 to size - 1 do
       for k = 0 to size - 1 do
         res.(i).(j) <- res.(i).(j) + t.(i).(k) * m3.(k).(j)
@@ -505,7 +505,7 @@ The above example would be essentially blocking indefinitely because the `send`
 does not have a corresponding receive. If we instead create a bounded channel
 with buffer size n, it can store up to [n] objects in the channel without a
 corresponding receive, exceeding which the sending would block. We can try it
-with the same example as above just by changing the buffer size to 1.
+with the same example as above just by changing the buffer size to 1.

 ```ocaml
 open Domainslib
@@ -611,7 +611,7 @@ let _ =
   worker (update results) ();
   Array.iter Domain.join domains;
   Array.iter (Printf.printf "%d ") results
-```
+```

 We have created an unbounded channel `c` which will act as a store for all the
 tasks. We'll pay attention to two functions here: `create_work` and `worker`.
@@ -659,7 +659,7 @@ that if a lot more time is spent outside the function we'd like to parallelise,
 the maximum speedup we could achieve would be lower.

 Profiling serial code can help us discover the hotspots where we might want to
-introduce parallelism.
+introduce parallelism.

 ```
 Samples: 51K of event 'cycles:u', Event count (approx.): 28590830181
@@ -791,7 +791,7 @@ Shared Data Cache Line Table (2 entries, sorted on Total HITMs)
 ----------- Cacheline ---------- Total Tot ----- LLC Load Hitm ----- ---- Store Reference ---- --- Loa
 Index Address Node PA cnt records Hitm Total Lcl Rmt Total L1Hit L1Miss Lc
 0 0x7f2bf49d7dc0 0 11473 13008 94.23% 1306 1306 0 1560 595 965 ◆
-1 0x7f2bf49a7b80 0 271 368 5.48% 76 76 0 123 76 47
+1 0x7f2bf49a7b80 0 271 368 5.48% 76 76 0 123 76 47
 ```

 As evident from the report, there's quite a lot of false sharing happening in
@@ -953,7 +953,7 @@ So far we have only found that there is an imbalance in task distribution
 in the code, we'll need to change our code accordingly to make the task
 distribution more balanced, which could increase the speedup.

----
+---

 Performance debugging can be quite tricky at times. If you could use some help in
 debugging your Multicore OCaml code, feel free to create an issue in the

code/task/matrix_multiplication_multicore.ml

Lines changed: 2 additions & 2 deletions
@@ -2,15 +2,15 @@ open Domainslib

 let num_domains = try int_of_string Sys.argv.(1) with _ -> 1
 let n = try int_of_string Sys.argv.(2) with _ -> 1024
-let chunk_size = try int_of_string Sys.argv.(3) with _ -> (n/num_domains)
+let chunk_size = try int_of_string Sys.argv.(3) with _ -> 0

 let parallel_matrix_multiply pool a b =
   let i_n = Array.length a in
   let j_n = Array.length b.(0) in
   let k_n = Array.length b in
   let res = Array.make_matrix i_n j_n 0 in

-  Task.parallel_for pool ~chunk_size:chunk_size ~start:0 ~finish:(i_n - 1) ~body:(fun i ->
+  Task.parallel_for pool ~chunk_size ~start:0 ~finish:(i_n - 1) ~body:(fun i ->
     for j = 0 to j_n - 1 do
       for k = 0 to k_n - 1 do
         res.(i).(j) <- res.(i).(j) + a.(i).(k) * b.(k).(j)
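
The new fallback of `0` in this file appears to act as a sentinel meaning "let the library choose", which assumes `Task.parallel_for` treats a non-positive `chunk_size` as unset. A possible alternative, sketched below with only the standard library, keeps the command-line argument as an `option` and forwards it with `?chunk_size`. The names `int_arg` and `describe` are hypothetical; `describe` merely stands in for any function that takes an optional `?chunk_size` parameter.

```ocaml
(* Hypothetical helper: read the positional argument at [index] as an int.
   Returns None when the argument is absent or is not a valid integer. *)
let int_arg argv index =
  if index >= 0 && index < Array.length argv then int_of_string_opt argv.(index)
  else None

(* Hypothetical stand-in for a function with an optional ?chunk_size
   parameter; it only reports which case it received. *)
let describe ?chunk_size () =
  match chunk_size with
  | Some c -> Printf.sprintf "explicit chunk_size %d" c
  | None -> "library default"

let () =
  (* [chunk_size] has type [int option]; [?chunk_size] forwards it as-is,
     so an absent argument leaves the optional parameter unset. *)
  let chunk_size = int_arg Sys.argv 3 in
  print_endline (describe ?chunk_size ())
```

This avoids picking a magic number at the call site, at the cost of threading an `int option` through the program. Whether that is preferable to the sentinel used in the commit is a judgment call.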

code/task/three_matrix_multiplication.ml

Lines changed: 3 additions & 3 deletions
@@ -2,21 +2,21 @@ open Domainslib

 let num_domains = try int_of_string Sys.argv.(1) with _ -> 1
 let n = try int_of_string Sys.argv.(2) with _ -> 1024
-let chunk_size = try int_of_string Sys.argv.(3) with _ -> (n/num_domains)
+let chunk_size = try int_of_string Sys.argv.(3) with _ -> 0

 let parallel_matrix_multiply_3 pool m1 m2 m3 =
   let size = Array.length m1 in
   let t = Array.make_matrix size size 0 in (* stores m1*m2 *)
   let res = Array.make_matrix size size 0 in

-  Task.parallel_for pool ~chunk_size:(size/num_domains) ~start:0 ~finish:(size - 1) ~body:(fun i ->
+  Task.parallel_for pool ~chunk_size ~start:0 ~finish:(size - 1) ~body:(fun i ->
     for j = 0 to size - 1 do
       for k = 0 to size - 1 do
         t.(i).(j) <- t.(i).(j) + m1.(i).(k) * m2.(k).(j)
       done
     done);

-  Task.parallel_for pool ~chunk_size:(size/num_domains) ~start:0 ~finish:(size - 1) ~body:(fun i ->
+  Task.parallel_for pool ~chunk_size ~start:0 ~finish:(size - 1) ~body:(fun i ->
     for j = 0 to size - 1 do
       for k = 0 to size - 1 do
         res.(i).(j) <- res.(i).(j) + t.(i).(k) * m3.(k).(j)
