Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bootstrap is not reproducible wrt -j value #9507

Closed
bmwiedemann opened this issue Dec 14, 2023 · 9 comments · Fixed by #9735 or ocaml/opam-repository#25240
Closed

bootstrap is not reproducible wrt -j value #9507

bmwiedemann opened this issue Dec 14, 2023 · 9 comments · Fixed by #9735 or ocaml/opam-repository#25240
Assignees
Labels

Comments

@bmwiedemann
Copy link

While working on reproducible builds for openSUSE, I found that our ocaml-dune 3.12.1 package varies depending on the amount of parallelism.

Expected Behavior

Build output should be deterministic - as it still was in the previous version 3.11.1

Actual Behavior

Binaries produced by -j 1 and -j 4 vary even in length.
A diff of strings output looks thus:

 camlDune_rules__Package_db__fun_2964                                           
-camlOpamSystem__303
 camlDune_engine__Action_exec__fun_5160
 camlDune_engine__Action_builder0__code_begin
+camlStdune__Signal__3410
 camlDune_re__Core__witness_2586
 camlDune_rules__Modules__fun_4126
+camlNotty_uucp_data__16639
 camlDune_rpc_impl__fun_970
[...snipped...]

one created a _boot/compiled_ml_files and the other a _boot/mods_list. Build logs also differ a lot.

Reproduction

  • PR with a reproducing test:
  1. on openSUSE or Debian
#!/bin/sh
osc co openSUSE:Factory/ocaml-dune && cd $_ || exit 1
for N in 1 2 ; do
    osc build -j $N --vm-type=kvm --noservice --no-preinstallimage --keep-pkg=RPMS.$N
    unrpm RPMS.$N/ocaml-dune-*.x86_64.rpm
    strings usr/bin/dune > $N.strings
done
diff -u {1,2}.strings

Specifications

  • Version of dune (output of dune --version): 3.12.1
  • Version of ocaml (output of ocamlc --version): 4.14.1
  • Operating system (distribution and version): openSUSE Tumbleweed 20231213

Additional information

@bmwiedemann
Copy link
Author

-ocaml boot/bootstrap.ml --verbose ${jobs}                                      
+ocaml boot/bootstrap.ml

did not improve the outcome, but I found that the below helps to avoid the race:

-ocaml boot/bootstrap.ml --verbose ${jobs}                                      
+ocaml boot/bootstrap.ml -j 1

(--verbose did not make a difference)

@emillon
Copy link
Collaborator

emillon commented Dec 18, 2023

Thanks for the report. At first I thought it was something like #9152, but it looks like this particular issue is about dune's bootstrap process itself. I'll have a look.

@emillon emillon self-assigned this Dec 18, 2023
@emillon
Copy link
Collaborator

emillon commented Dec 21, 2023

I had a look at this. The core of the issue is this check:

https://github.com/ocaml/dune/blob/3.12.1/boot/duneboot.ml#L1194

For -j 1, there's a special case where instead of running the custom concurrency engine, we instead build using a single command. I guess that this is done for performance reasons, since this goes from 37s to 28s on my machine.

Build output should be deterministic - as it still was in the previous version 3.11.1

That seemed weird since I don't think we touched that recently, so I did some testing with the following test:

#!/usr/bin/env bash
rm -rf _boot
ocaml boot/bootstrap.ml --verbose -j 1 2>/dev/null
cp _boot/dune.exe dune1.exe
rm -rf _boot
ocaml boot/bootstrap.ml --verbose -j 8 2>/dev/null
cp _boot/dune.exe dune8.exe
if cmp dune1.exe dune8.exe; then
    echo "identical"
elif (( $(stat -c%s dune1.exe)==$(stat -c%s dune8.exe)));then
  echo "same size"
else
  echo "completely different"
fi

For 3.6.0, 3.11.1 and 3.12.1 this prints "completely different", so I think that this has been the case since the new bootstrap process got introduced in #2854. And I verified that removing the optimization prints "identical".

As for how to fix this: this is easy to add (remove the if concurrency = 1 test) but this causes a tradeoff between performance and reproducibility, and I'm not sure how we should make the choice.

  • we could try to make both build plans create the same binary. TBH, I don't think that's possible. I don't think that there are huge concerns that the binaries are going to act differently, but they're built from different sequences of commands.
  • we could say it's not an issue. -j1 and -jx return different results, but -j1 will all return the same thing and -jx will always return the same thing (I don't think that's the right answer)
  • we could remove the optimization
  • otherwise, we need a way to tweak the behaviour. let's say we add a -prefer flag, -prefer performance will use the single command build, and -prefer reproducibility will use the sequential build (using the concurrency engine running at most 1 job). which one should be the default, then?
    • if performance is the default (equivalent to today), it means that downstream users including opensuse need to adapt their build instructions to get reproducibility
    • if reproducible is the default, this means that users doing sequential builds of dune will need a way to get that extra juice, but it's unlikely they can plumb that to opam (assuming they're using opam).

Having written that, I think that the easiest way for now is to introduce a -reproducible flag to the bootstrap process to disable the optimization, so that -reproducible -j 1 and -reproducible -j 4 will create the same binary. If this sounds like a good idea we can try to make that the default later. What do you think?

@rgrinberg
Copy link
Member

rgrinberg commented Dec 21, 2023

Getting rid of the optimization sounds like the simplest thing tbh. It's problematic in other contexts - such as on systems with limited memory. A flag like -reproducible is another way of saying -avoid-buggy-behavior, which doesn't sound like a good interface.

@rgrinberg
Copy link
Member

If building in a single command is important to someone, we could always move it to a separate -build-single-command option.

@bmwiedemann
Copy link
Author

You are right. I re-ran the test with 3.11.1 and it produced the same variations today.
A default -reproducible option sounds good.

I checked the different (j1/j4) commands used to build dune.exe and besides -no-alias-deps -w -49-6 the diff is between the ordering in -args mods_list vs -args compiled_ml_files - so I wonder if both produced the same output with a consistently ordered input list.

@emillon
Copy link
Collaborator

emillon commented Dec 22, 2023

I checked the difference between the binaries using https://github.com/noseglasses/elf_diff. In the separately-built version, the compiler managed to do some optimizations and some functions in OpamFormula are different.
I tried to add some flags to make sure that the outputs match but it seems impossible.
So yes, let's get rid of the optimization by default. We still need the single build command for windows anyway.

emillon added a commit to emillon/dune that referenced this issue Dec 22, 2023
Fixes ocaml#9507

In `duneboot.ml`, there are 2 strategies to build `dune.exe`:
- separate compilation (build all modules and link them together in
  several commands)
- single command compilation (pass all modules to the compiler driver)

Separate compilation can be executed in parallel when `-j` is set
accordingly, which speeds up the build.
But when only one job can be executed at once (`-j 1`), it is faster to
use a single command.

So, we would switch to single command build when `-j 1` was passed.
However, this is surprising because it produces a different binary
depending on the number of parallel jobs (see ocaml#9507). So this disables
this optimization by default; it can be enabled by hand by passing
`--use-single-command`.

(NB: this does not apply to Windows, which always uses single command
compilation)
emillon added a commit to emillon/dune that referenced this issue Dec 22, 2023
Fixes ocaml#9507

In `duneboot.ml`, there are 2 strategies to build `dune.exe`:
- separate compilation (build all modules and link them together in
  several commands)
- single command compilation (pass all modules to the compiler driver)

Separate compilation can be executed in parallel when `-j` is set
accordingly, which speeds up the build.
But when only one job can be executed at once (`-j 1`), it is faster to
use a single command.

So, we would switch to single command build when `-j 1` was passed.
However, this is surprising because it produces a different binary
depending on the number of parallel jobs (see ocaml#9507). So this disables
this optimization by default; it can be enabled by hand by passing
`--use-single-command`.

(NB: this does not apply to Windows, which always uses single command
compilation)

Signed-off-by: Etienne Millon <me@emillon.org>
@emillon emillon changed the title possible build-time race bootstrap is not reproducible wrt -j value Dec 22, 2023
@rgrinberg
Copy link
Member

We still need the single build command for windows anyway.

Could you remind me why? Seems odd and probably means that we can't build dune on a windows box with low memory.

@emillon
Copy link
Collaborator

emillon commented Jan 3, 2024

I actually don't know. I opened #9613 and it doesn't seem to be necessary.

@rgrinberg rgrinberg added the bug label Jan 8, 2024
emillon added a commit to emillon/dune that referenced this issue Jan 8, 2024
Fixes ocaml#9507

In `duneboot.ml`, there are 2 strategies to build `dune.exe`:
- separate compilation (build all modules and link them together in
  several commands)
- single command compilation (pass all modules to the compiler driver)

Separate compilation can be executed in parallel when `-j` is set
accordingly, which speeds up the build.
But when only one job can be executed at once (`-j 1`), it is faster to
use a single command.

So, we would switch to single command build when `-j 1` was passed.
However, this is surprising because it produces a different binary
depending on the number of parallel jobs (see ocaml#9507). So this disables
this optimization by default; it can be enabled by hand by passing
`--use-single-command`.

(NB: this does not apply to Windows, which always uses single command
compilation)

Signed-off-by: Etienne Millon <me@emillon.org>
emillon added a commit to emillon/dune that referenced this issue Jan 15, 2024
Fixes ocaml#9507

The bootstrap process can use 2 strategies:
- parallel: run compile commands in parallel and link the rest
- single-command: run ocamlopt with a long list of arguments

The single-command strategy is used if win32 or if `-j 1` is set
(implicitly or explicitly).

One problem is that: parallel and single-command create different
binaries (ocaml#9507).

The assumption was that single-command would be faster than `-j 1` on
Linux and any parallel build on Windows. However, from a quick benchmark
the assumption is true on Linux but false on Windows.

The tiny savings in the case of Linux where only a single core is
available are not enough to justify the extra code path and
reproducibility gotcha.

Signed-off-by: Etienne Millon <me@emillon.org>
emillon added a commit to emillon/dune that referenced this issue Jan 16, 2024
Fixes ocaml#9507

The bootstrap process can use 2 strategies:
- parallel: run compile commands in parallel and link the rest
- single-command: run ocamlopt with a long list of arguments

The single-command strategy is used if win32 or if `-j 1` is set
(implicitly or explicitly).

One problem is that: parallel and single-command create different
binaries (ocaml#9507).

The assumption was that single-command would be faster than `-j 1` on
Linux and any parallel build on Windows. However, from a quick benchmark
the assumption is true on Linux but false on Windows.

The tiny savings in the case of Linux where only a single core is
available are not enough to justify the extra code path and
reproducibility gotcha.

Signed-off-by: Etienne Millon <me@emillon.org>
emillon added a commit that referenced this issue Jan 16, 2024
Fixes #9507

The bootstrap process can use 2 strategies:
- parallel: run compile commands in parallel and link the rest
- single-command: run ocamlopt with a long list of arguments

The single-command strategy is used if win32 or if `-j 1` is set
(implicitly or explicitly).

One problem is that: parallel and single-command create different
binaries (#9507).

The assumption was that single-command would be faster than `-j 1` on
Linux and any parallel build on Windows. However, from a quick benchmark
the assumption is true on Linux but false on Windows.

The tiny savings in the case of Linux where only a single core is
available are not enough to justify the extra code path and
reproducibility gotcha.

Signed-off-by: Etienne Millon <me@emillon.org>
emillon added a commit to emillon/opam-repository that referenced this issue Feb 9, 2024
CHANGES:

### Added

- Introduce a `(dynamic_include ..)` stanza. This is like `(include foo)` but
  allows `foo` to be the target of a rule. Currently, there are some
  limitations on the stanzas that can be generated. For example, public
  executables, libraries are currently forbidden. (ocaml/dune#9913, @rgrinberg)

- Introduce `$ dune promotion list` to print the list of available promotions.
  (ocaml/dune#9705, @moyodiallo)

- If Sherlodoc is installed, add a search bar in generated HTML docs (ocaml/dune#9772,
  @EmileTrotignon)

- Add `only_sources` field to `copy_files` stanza (ocaml/dune#9827, fixes ocaml/dune#9709,
  @jchavarri)

- The `(foreign_library)` stanza now supports the `(enabled_if)` field. (ocaml/dune#9914,
  @nojb)

### Fixed

- Fix `$ dune install -p` incorrectly recognizing packages that are supposed to
  be filtered (ocaml/dune#9879, fixes ocaml/dune#4814, @rgrinberg)

- subst: correctly handle opam files in opam/ subdirectory (ocaml/dune#9895, fixes ocaml/dune#9862,
  @emillon)

- Odoc private rules are not set up if a library is not available due to
  `enabled_if` (ocaml/dune#9897, @rgrinberg and @jchavarri)

### Changed

- When dune language 3.14 is enabled, resolve the binary in `(run %{bin:..}
  ..)` from where the binary is built. (ocaml/dune#9708, @rgrinberg)

- boot: remove single-command bootstrap. This was an alternative bootstrap
  strategy that was used in certain conditions. Removal makes the bootstrap a
  bit slower on Linux when only a single core is available, but bootstrap is
  now reproducible in all cases. (ocaml/dune#9735, fixes ocaml/dune#9507, @emillon)
emillon added a commit to emillon/opam-repository that referenced this issue Feb 12, 2024
CHANGES:

### Added

- Introduce a `(dynamic_include ..)` stanza. This is like `(include foo)` but
  allows `foo` to be the target of a rule. Currently, there are some
  limitations on the stanzas that can be generated. For example, public
  executables, libraries are currently forbidden. (ocaml/dune#9913, @rgrinberg)

- Introduce `$ dune promotion list` to print the list of available promotions.
  (ocaml/dune#9705, @moyodiallo)

- If Sherlodoc is installed, add a search bar in generated HTML docs (ocaml/dune#9772,
  @EmileTrotignon)

- Add `only_sources` field to `copy_files` stanza (ocaml/dune#9827, fixes ocaml/dune#9709,
  @jchavarri)

- The `(foreign_library)` stanza now supports the `(enabled_if)` field. (ocaml/dune#9914,
  @nojb)

### Fixed

- Fix `$ dune install -p` incorrectly recognizing packages that are supposed to
  be filtered (ocaml/dune#9879, fixes ocaml/dune#4814, @rgrinberg)

- subst: correctly handle opam files in opam/ subdirectory (ocaml/dune#9895, fixes ocaml/dune#9862,
  @emillon)

- Odoc private rules are not set up if a library is not available due to
  `enabled_if` (ocaml/dune#9897, @rgrinberg and @jchavarri)

### Changed

- When dune language 3.14 is enabled, resolve the binary in `(run %{bin:..}
  ..)` from where the binary is built. (ocaml/dune#9708, @rgrinberg)

- boot: remove single-command bootstrap. This was an alternative bootstrap
  strategy that was used in certain conditions. Removal makes the bootstrap a
  bit slower on Linux when only a single core is available, but bootstrap is
  now reproducible in all cases. (ocaml/dune#9735, fixes ocaml/dune#9507, @emillon)
nberth pushed a commit to nberth/opam-repository that referenced this issue Jun 18, 2024
CHANGES:

### Added

- Introduce a `(dynamic_include ..)` stanza. This is like `(include foo)` but
  allows `foo` to be the target of a rule. Currently, there are some
  limitations on the stanzas that can be generated. For example, public
  executables, libraries are currently forbidden. (ocaml/dune#9913, @rgrinberg)

- Introduce `$ dune promotion list` to print the list of available promotions.
  (ocaml/dune#9705, @moyodiallo)

- If Sherlodoc is installed, add a search bar in generated HTML docs (ocaml/dune#9772,
  @EmileTrotignon)

- Add `only_sources` field to `copy_files` stanza (ocaml/dune#9827, fixes ocaml/dune#9709,
  @jchavarri)

- The `(foreign_library)` stanza now supports the `(enabled_if)` field. (ocaml/dune#9914,
  @nojb)

### Fixed

- Fix `$ dune install -p` incorrectly recognizing packages that are supposed to
  be filtered (ocaml/dune#9879, fixes ocaml/dune#4814, @rgrinberg)

- subst: correctly handle opam files in opam/ subdirectory (ocaml/dune#9895, fixes ocaml/dune#9862,
  @emillon)

- Odoc private rules are not set up if a library is not available due to
  `enabled_if` (ocaml/dune#9897, @rgrinberg and @jchavarri)

### Changed

- When dune language 3.14 is enabled, resolve the binary in `(run %{bin:..}
  ..)` from where the binary is built. (ocaml/dune#9708, @rgrinberg)

- boot: remove single-command bootstrap. This was an alternative bootstrap
  strategy that was used in certain conditions. Removal makes the bootstrap a
  bit slower on Linux when only a single core is available, but bootstrap is
  now reproducible in all cases. (ocaml/dune#9735, fixes ocaml/dune#9507, @emillon)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment