Skip to content

Adjust PBTs based on recommended_domain_count #112

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jun 1, 2023

Conversation

jmid
Copy link
Contributor

@jmid jmid commented May 23, 2023

I noticed the CI run of #110 had 32-bit Failure("failed to allocate domain") in test/task_one_dep.ml

I believe this is because the test does not take into account the available cores on the target machine.
This PR therefore adjusts the test to do so.

I hope this is sufficient to please the (32-bit) CI so that we can avoid disabling certain platforms like #111 proposes 🤞

@jmid
Copy link
Contributor Author

jmid commented May 23, 2023

On the positive side, this roughly halves the Windows CI time.
On the negative side there was still a "failed to allocate domain" error...
In principle we could just adjust the tested property to make this acceptable behaviour.

I believe the 5.0 x86_32-bit ocaml-ci error is due to using (modes native) - which will not be supported.
Since dune should choose the "best mode available" simply deleting these lines should be sufficient to avoid the error.
However I'm wondering whether the native lines were added to disable the tests on non-native platforms? 🤔

@jmid
Copy link
Contributor Author

jmid commented May 23, 2023

I believe the 5.0 x86_32-bit ocaml-ci error is due to using (modes native) - which will not be supported.

I now see 5.0-ppc64 and 5.0-s390x are going to fail for the same reason.
Perhaps the proper fix would be to disable the tests (not the package) on bytecode-only architectures?

@Sudha247
Copy link
Contributor

Thanks for taking a look at this @jmid!

However I'm wondering whether the native lines were added to disable the tests on non-native platforms?

I think native modes were added for performance reasons (@kayceesrk can correct me if I'm wrong) - it should be okay to let dune decide what to run. Would that fix the failures on bytecode-only platforms?

@kayceesrk
Copy link
Contributor

I think native modes were added for performance reasons (@kayceesrk can correct me if I'm wrong) - it should be okay to let dune decide what to run. Would that fix the failures on bytecode-only platforms?

I don't think it is useful to run performance-sensitive workloads in bytecode mode. And, domainslib is for performance. I can't imagine bytecode runs for anything using domainslib being useful.

@kayceesrk
Copy link
Contributor

That said, if you just want compatibility, you can remove to (modes native).

@jmid
Copy link
Contributor Author

jmid commented May 24, 2023

Would that fix the failures on bytecode-only platforms?

It would at least fix the Build failed errors due to Error: No rule found for test/off_by_one.exe such as the ones in https://ocaml.ci.dev/github/ocaml-multicore/domainslib/commit/480029b010decd3e913cfcb3c9c2669327a4b9b3/variant/debian-11-5.0_s390x_opam-2.1

Now, with the s390x native backend just merged yesterday, that will save us from having to roll a new Domainslib release to support this reestablished native platform when it get released with 5.1 or 5.2.

It will not fix the Failed to allocate domain error that we are still seeing.
I suspect this might be caused by dune runtest running parallel tests in parallel 🤔

There is something odd to me with all tests except the QCheck ones disabled in bytecode mode.
Disabling all the tests (not the package) on bytecode-only architectures strikes me as more consistent and future proof.
It should be possible to do so in the opam-file.

@Sudha247
Copy link
Contributor

It will not fix the Failed to allocate domain error that we are still seeing.
I suspect this might be caused by dune runtest running parallel tests in parallel

Hm, possibly. What you mean is running parallel tests in parallel ends up spawning more than the limit - wondering if this could happen in native runs too?

Disabling all the tests (not the package) on bytecode-only architectures strikes me as more consistent and future proof.

I'm not sure why we should support architectures where we know the testsuite doesn't go through. Is it for the sake of completeness or am I missing something?

@kayceesrk
Copy link
Contributor

If there is an easy way to disable 32-bit builds, ppc64, s390x in the ocaml-ci, we should do it. s390x will land in 5.1. ppc64 sometime in the future.

@Sudha247
Copy link
Contributor

If there is an easy way to disable 32-bit builds, ppc64, s390x in the ocaml-ci, we should do it.

#111 does it - but in a restrictive way by making the package unavailable in all those platforms. I'm open to considering a less restrictive way to do it, if that's beneficial.

@jmid
Copy link
Contributor Author

jmid commented May 25, 2023

Great, I think fewer restrictions is key to enable broader adoption 👍

I've made 3 changes:

  • bd42572 accepts Failure "failed to allocate domain" from Domain.spawn which helps on 32-bit architectures where fewer domains are available
  • 89a03c0 removes the (modes native) annotations from the other tests as it causes ocaml-ci to signal test failures
  • b060a68 enables the two longest-running PBTs on only 64-bit architectures (32-bit ones are bytecode). They will trigger on 64-bit bytecode platforms though (e.g., ppc64 which we expect to eventually have its native backend restored)

Let's see how this fares on ocaml-ci... 🍿

@jmid
Copy link
Contributor Author

jmid commented May 25, 2023

CI summary (while I wait for the ppc64 run to complete):

  • test/backtrace.ml is failing on arm32, ppc64, s390x, x86_32
  • test/off_by_one.ml is failing with Failure "failed to allocate domain" on arm32, x86_32
  • test/LU_decomposition_multicore.ml is failing with with a debug runtime assertion failure on arm32:
### OCaml runtime: debug mode ###
[00] file runtime/fiber.c; line 182 ### Assertion failed: Stack_high(stack) - Stack_base(stack) == wosize || Stack_high(stack) - Stack_base(stack) == wosize + 1
  • The backtrace failure seems related to using bytecode. Here's some info from the option-returning backtrace_slots: https://v2.ocaml.org/releases/5.0/api/Printexc.html#1_Manipulationofbacktraceinformation I would have expected the test to be compiled with debug info -g by default under dune 🤔
  • The off_by_one failure is natural on 32-bit platforms with Max_domains = 16. It should be easy to accept this behaviour.
  • The decomposition failure sounds more serious...

@jmid
Copy link
Contributor Author

jmid commented May 25, 2023

The ppc64 run ended up failing with a time out, so 0637fb1 also disables the two long running PBTs on ppc64 (64 bit bytecode for the moment).

07ca1ff adjusts test/off_by_one to accept a Failure "failed to allocate domain".

This leaves the backtrace option failure and the decomposition assertion failure.

@jmid
Copy link
Contributor Author

jmid commented May 26, 2023

After confirming locally I've now opened an issue for the LU_decomposition assertion failure on arm32 here: ocaml/ocaml#12267

@jmid
Copy link
Contributor Author

jmid commented May 26, 2023

I would welcome input on the backtrace (bytecode) failure, where it seems Printexc.get_raw_backtrace returns None.

@Sudha247
Copy link
Contributor

We don't know yet if the backtrace test failure is a compiler bug. We have an issue for the LU decomposition failure. Anyways, I think we can disable backtrace (and open an issue if needed) and LU_decompositon tests on bytecode-only platforms and merge this PR.

@jmid
Copy link
Contributor Author

jmid commented May 31, 2023

I've now dug a bit further and it seems to be a dune issue on bytecode only switches: ocaml/dune#7845
I've also disabled the backtrace test on bytecode platforms in b10075a.

Let's see how this fares on the CI 🍿

@Sudha247
Copy link
Contributor

Sudha247 commented Jun 1, 2023

The CI is happy finally! Thanks for the work, Jan. Merging this.

@Sudha247 Sudha247 merged commit f4e08c0 into ocaml-multicore:main Jun 1, 2023
@jmid jmid deleted the pbt-adjustments branch June 1, 2023 09:32
Sudha247 added a commit to Sudha247/opam-repository that referenced this pull request Jul 18, 2023
CHANGES:

* Add parallel_find (ocaml-multicore/domainslib#90, @gasche)
* Update CI (ocaml-multicore/domainslib#93, @Sudha247)
* Optimisation to work-stealing (ocaml-multicore/domainslib#96, @art-w)
* Improve docs presentation (ocaml-multicore/domainslib#99, @metanivek)
* Property based tests (ocaml-multicore/domainslib#100, jmid)
* Task: avoid double handler installation (ocaml-multicore/domainslib#101, @gasche & @clef-men)
* Fix a benign data-race in Chan reported by ocaml-tsan (ocaml-multicore/domainslib#103, @art-w)
* Dune, opam, and GitHub Actions fixes (ocaml-multicore/domainslib#105, @MisterDA)
* domain local await support (ocaml-multicore/domainslib#107, @polytypic)
* Windows run on GitHub Actions (ocaml-multicore/domainslib#110, @Sudha247)
* Adjust PBTs based on recommended_domain_count (ocaml-multicore/domainslib#112, @jmid)
* Test condition tweaks (ocaml-multicore/domainslib#113, @jmid)
nberth pushed a commit to nberth/opam-repository that referenced this pull request Jun 18, 2024
CHANGES:

* Add parallel_find (ocaml-multicore/domainslib#90, @gasche)
* Update CI (ocaml-multicore/domainslib#93, @Sudha247)
* Optimisation to work-stealing (ocaml-multicore/domainslib#96, @art-w)
* Improve docs presentation (ocaml-multicore/domainslib#99, @metanivek)
* Property based tests (ocaml-multicore/domainslib#100, jmid)
* Task: avoid double handler installation (ocaml-multicore/domainslib#101, @gasche & @clef-men)
* Fix a benign data-race in Chan reported by ocaml-tsan (ocaml-multicore/domainslib#103, @art-w)
* Dune, opam, and GitHub Actions fixes (ocaml-multicore/domainslib#105, @MisterDA)
* domain local await support (ocaml-multicore/domainslib#107, @polytypic)
* Windows run on GitHub Actions (ocaml-multicore/domainslib#110, @Sudha247)
* Adjust PBTs based on recommended_domain_count (ocaml-multicore/domainslib#112, @jmid)
* Test condition tweaks (ocaml-multicore/domainslib#113, @jmid)
shonfeder pushed a commit to ocaml/opam-repository that referenced this pull request Apr 10, 2025
CHANGES:

* Upgrade to Saturn 1.0 (ocaml-multicore/domainslib#129, @Sudha247)
* Update README.md instruction to use OCaml 5.1.0 (ocaml-multicore/domainslib#123, @punchagan)
* Fix Saturn.Queue function (ocaml-multicore/domainslib#121, @Sudha247)
* Make parallel_scan work on noncommutative functions (ocaml-multicore/domainslib#118, @aytao)
* Test condition tweaks (ocaml-multicore/domainslib#113, @jmid)
* Adjust PBTs based on recommended_domain_count (ocaml-multicore/domainslib#112, @jmid)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants