Description
Since 5.0 we have observed that a combination of Domains and Threads can cause either segfaults or dead/live-locks in the MingW Windows port. We have observed the issue when testing both backends (native code and bytecode) but it seems easier to trigger in bytecode mode. We suspect that both kinds of failures may be caused by the same underlying problem.
The test itself generates a combination of Domains and Threads as a dependency tree, encoded as a record of arrays.
The generation part depends on QCheck (for now).
To recreate:
- install a 5.1 MingW ocaml
- install the dune and qcheck-core packages
- clone this branch: https://github.com/ocaml-multicore/multicoretests/tree/reproduce-threadomain
dune build src/threadomain/threadomain.bc
while _build/default/src/threadomain/threadomain.bc -v -s 377546401; do :; done
The last line above simply repeats a bytecode version of the test until failure:
$ while _build/default/src/threadomain/threadomain.bc -v -s 377546401; do :; done
random seed: 377546401
generated error fail pass / total time test name
[✓] 3 0 0 3 / 3 6.0s Mash up of threads and domains
================================================================================
success (ran 1 tests)
[...]
random seed: 377546401
generated error fail pass / total time test name
[ ] 2 0 0 2 / 3 2.3s Mash up of threads and domainsSegmentation fault
A live(or dead)lock shows up as no further progress (and no QCheck callbacks being executed to update the test status) after 2 seconds or so:
random seed: 377546401
generated error fail pass / total time test name
[✓] 3 0 0 3 / 3 6.2s Mash up of threads and domains
================================================================================
success (ran 1 tests)
random seed: 377546401
generated error fail pass / total time test name
[ ] 2 0 0 2 / 3 2.3s Mash up of threads and domains
For a while we have observed these timeouts and crashes occasionally on this test in our CI, but have struggled to cook up reproduction steps: ocaml-multicore/multicoretests#203
To get a sense of the behaviour, here's a summary of 5 runs:
- segfault on iteration 1
- dead/live-lock on iteration 6
- segfault on iteration 18
- dead/live-lock on iteration 5
- segfault on iteration 18
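A tally like the one above can be collected with a small wrapper around the loop. This is only a sketch: it assumes bash and GNU coreutils `timeout` are available in the MingW/Cygwin shell, the function name `run_until_failure` and the 60-second limit are made up for illustration, and the exit status reported for a segfault may differ between platforms.

```shell
#!/usr/bin/env bash
# Sketch: repeat a command until it fails or hangs, and report which happened.
# Assumptions: bash plus GNU coreutils `timeout`; the name and limits are illustrative.

run_until_failure() {
  local limit="$1"; shift          # per-run time limit in seconds
  local i=0 status
  while :; do
    i=$((i + 1))
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    case "$status" in
      0)   continue ;;             # run passed, try again
      124) echo "dead/live-lock (timeout) on iteration $i"; return 1 ;;
      *)   echo "crash (exit status $status) on iteration $i"; return 1 ;;
    esac
  done
}

# Hypothetical usage with the fixed seed from this report:
#   run_until_failure 60 _build/default/src/threadomain/threadomain.bc -v -s 377546401
```

`timeout` exits with status 124 when it has to kill the command, which is what separates a hang from a crash here. A passing run takes about 6 seconds in the logs above, so the limit should be comfortably larger than that.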
Above I use the seed 377546401, which triggers the failures on my machine/setup.
I initially found this particular seed by running the same loop with random seeds:
while _build/default/src/threadomain/threadomain.bc -v ; do :; done
Eventually this crashed on the 22nd iteration with random seed: 377546401, which made me pass that with -s 377546401.
To recreate, others may have more luck following the same process rather than simply reusing my seed.
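For anyone repeating the random-seed search, the failing seed can be pulled out of the last run's output automatically. The following is a sketch under some assumptions: it relies on the `random seed: N` line shown in the logs above, the helper names `extract_seed` and `find_failing_seed` are invented here, and it only stops on a crash (a hang still has to be interrupted by hand or wrapped in a timeout).

```shell
#!/usr/bin/env bash
# Sketch: rerun a test with random seeds until it fails, then print the failing seed.
# Assumption: the test prints a line of the form "random seed: <number>".

# Print the last seed found in QCheck-style output read from stdin.
extract_seed() {
  sed -n 's/^random seed: \([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Rerun "$@" until it exits non-zero, keeping only the log of the latest run.
find_failing_seed() {
  local log
  log=$(mktemp)
  while "$@" >"$log" 2>&1; do :; done
  extract_seed <"$log"
  rm -f "$log"
}

# Hypothetical usage:
#   seed=$(find_failing_seed _build/default/src/threadomain/threadomain.bc -v)
#   _build/default/src/threadomain/threadomain.bc -v -s "$seed"
```

Overwriting the log on every iteration means only the output of the failing run is inspected, so the seed that is printed is the one that triggered the failure.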
Credit to @shym for having written this nice torture instrument 😄