dune process hangs on macos #8083

Closed
jchavarri opened this issue Jun 29, 2023 · 33 comments · Fixed by ocaml/opam-repository#24073

@jchavarri
Collaborator

jchavarri commented Jun 29, 2023

Recently, we updated the Ahrefs codebase to the latest version of Dune (specifically, commit 71024ef) from the version we had been using since April (7bb6de7), which was working fine.

After upgrading, all our Linux CI agents work perfectly fine, but on the macOS agents we have noticed that dune processes stay running forever, with CPU usage over 90%.

Here is an example of a dune process that has been running for 55+ minutes with CPU at 100%:

$ ps aux | grep dune
USER              PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
user              3759 100.0  0.0 34302304    448   ??  R     3:20PM  55:12.57 /Users/user/builds/macvm100/ahrefs/monorepo-extra/_opam/bin/dune build -p dune-private-libs -j 7 @install
user              3737   0.0  1.1 34302560  93844   ??  S     3:19PM   0:04.48 /Users/user/builds/macvm100/ahrefs/monorepo-extra/_opam/bin/dune build -p dune-private-libs -j 7 @install
user             36569   0.0  0.0 34122844    856 s000  S+    4:15PM   0:00.00 grep dune

Looking at different occurrences of the issue, I couldn't find any pattern in the packages where it appears. What I can say is that it happens on different versions of macOS: the agent used in the command above runs 12.3.1, but it also happens on 12.6.1 and 13.0.1.

I have tried to gather some information about what exactly the hanging dune process is doing. Calling sample 3759 (using the process id from ps aux above) shows a lot of nested callbacks through camlDune_engine__Process__fun_4514; I'm not sure if this is expected. You can find the full output of that command here:

sample-3759.txt

Is there anything I could do to help diagnose the problem?

@jchavarri
Collaborator Author

Here is another result from running ps aux and then sample on one pid, from a different case on a different agent, in case it helps. Multiple processes can be seen due to different retries (it seems they don't get killed properly on timeout either):

ps aux | grep dune
USER              PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
user            51794  54.8  0.0  4369616    140   ??  R    Wed12PM 714:10.99 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p extunix -j 7 @install
user            63680  49.1  0.0  4381524    140   ??  R    Wed12PM 713:24.35 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p atdgen -j 7 @install
user            82840  49.0  0.0  4389020    128   ??  R    Tue08PM 1571:55.95 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p base_quickcheck -j 7
user            70042  48.9  0.0  4413384    128   ??  R    Tue08PM 1574:25.93 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p odoc -j 7
user            65598  48.9  0.0  4385204    140   ??  R    Tue08PM 1576:08.01 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ocamlgraph -j 7 @install
user            43954  48.9  0.0  4417832    128   ??  R    Tue08PM 1575:22.95 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p merlin-lib -j 7
user            60861  48.8  0.0  4384232    140   ??  R    Wed12PM 704:43.24 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p base_quickcheck -j 7
user            73751  48.7  0.0  4420104    416 s000  R+    8:32AM 185:55.48 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p camomile -j 7 --promote-install-files=false @install
user            55107  40.5  0.0  4410496    424   ??  R    Wed03PM 603:09.13 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ocamlformat -j 7 @install
user            27690  40.3  0.0  4414336    428   ??  R    Wed03PM 612:21.65 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ocamlformat -j 7 @install
user            69702  40.2  0.0  4376748    140   ??  R    Tue08PM 1570:39.99 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p extunix -j 7 @install
user            37138  40.0  0.0  4377368    384   ??  R    Wed03PM 608:47.87 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p atdgen-www -j 7
user            63794  39.6  0.0  4377456    140   ??  R    Wed12PM 713:28.93 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p atdgen-www -j 7
user            60278  39.5  0.0  4377096    132   ??  R    Wed01PM 647:21.11 /Users/user/builds/macvm202/ahrefs/dune-trim/_opam/.opam-switch/build/dune.20230627.git/./_boot/dune.exe build dune.install --release --profile dune-bootstrap -j 7
user            79056  39.0  0.0  4383864    128   ??  R    Tue08PM 1573:04.25 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ppx_deriving -j 7
user            42147  37.3  0.0  4418300    140   ??  R    Tue08PM 1577:02.57 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p camomile -j 7 @install
user            40195  36.7  0.0  4381080    424   ??  R    Wed03PM 608:59.14 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p devkit -j 7 @install
user            43079  36.6  0.0  4410240    384   ??  R     5:35PM 550:35.02 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p ocamlformat -j 7 @install
user            83921  36.3  0.0  4377224    400   ??  R    Wed03PM 592:26.90 /Users/user/builds/macvm202/ahrefs/dune-trim/_opam/.opam-switch/build/dune.20230627.git/./_boot/dune.exe build dune.install --release --profile dune-bootstrap -j 7
user            44929  35.5  0.0  4419452    140   ??  R    Wed12PM 714:35.21 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p camomile -j 7 @install
user            73733   0.1  0.8  4419836  64652 s000  S+    8:32AM   0:34.97 /Users/user/builds/macvm202/ahrefs/monorepo-extra/_opam/bin/dune build -p camomile -j 7 --promote-install-files=false @install
user            10302   0.0  0.0  4334020    736 s001  R+    4:29PM   0:00.00 grep dune

sample-27690.txt

@nojb
Collaborator

nojb commented Jun 29, 2023

Bisecting is probably your best bet. Before doing that, you may want to test commit 886ff3b, which caused some issues (on Windows, but you never know).
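
A sketch of that workflow, using the commits mentioned above (the reproducer being whatever triggers the hang, e.g. the failing opam install):

$ git bisect start
$ git bisect bad 71024ef     # known bad (latest)
$ git bisect good 7bb6de7    # known good (April)
# at each step: build dune, run the reproducer, then mark it:
$ git bisect good            # or: git bisect bad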

@jchavarri
Collaborator Author

OK, I am trying to bisect with older commits now; it will take a while, as the issue doesn't always happen.

Before doing that, you may want to test commit 886ff3b, which caused some issues (on Windows, but you never know).

Hm, I tested this commit and its parent, and both seemed to work fine. Then I noticed the commit is not actually on the main branch, is that right?

@nojb
Collaborator

nojb commented Jun 30, 2023

Hm, I tested this commit and its parent, and both seemed to work fine. Then I noticed the commit is not actually on the main branch, is that right?

Yes, it is on main, but not as the same commit: 886ff3b was cherry-picked to the 3.8 release branch.

@rgrinberg
Member

rgrinberg commented Jun 30, 2023

Are these builds hanging in watch mode?

Could you run dtruss -p to see which syscalls are being invoked?

@jchavarri
Collaborator Author

jchavarri commented Jun 30, 2023

Are these builds hanging in watch mode?

They happen in regular, non-watch mode, mostly when installing many packages with opam (where dune is invoked a lot). It might also happen in watch mode, but I haven't been able to reproduce it there, and I have not received any reports from other developers about that mode.

Could you run dtruss -p to see which syscalls are being invoked?

$ sudo dtruss -p 24983
dtrace: system integrity protection is on, some features will not be available

SYSCALL(args) 		 = return
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90)		 = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90)		 = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90)		 = 0 0
....

This continues forever (around 10-15 new lines per second).

@rgrinberg
Member

I don't even understand how this is possible:

select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0

From the manual:

select() returns the number of ready descriptors that are contained in the
descriptor sets, or -1 if an error occurred. If the time limit expires,
select() returns 0

How can the time limit expire if we set it to block indefinitely?

I also don't understand why they stay running forever. If select is constantly returning, it should eventually be shut off by the finalizer.
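
For reference, a minimal sketch (illustrative, not Dune's actual code) of the contract we rely on: with a negative timeout, select should block until a watched fd is ready and never report zero ready descriptors:

(* With a negative timeout, Unix.select is supposed to block until some
   watched fd is ready, so a 0-ready return should be impossible. *)
let () =
  let r, w = Unix.pipe () in
  ignore (Unix.write_substring w "x" 0 1);
  let ready, _, _ = Unix.select [ r ] [] [] (-1.0) in
  assert (ready = [ r ]) (* only returns once [r] is readable *)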

@rgrinberg
Member

Try this patch, see if it makes a difference:

diff --git a/src/dune_async_io/async_io.ml b/src/dune_async_io/async_io.ml
index 501a5d2f5..c8f465b70 100644
--- a/src/dune_async_io/async_io.ml
+++ b/src/dune_async_io/async_io.ml
@@ -242,13 +242,7 @@ let with_io scheduler f =
           ~f:(fun () -> select_loop t)
           ~finally:(fun () -> Mutex.unlock t.mutex))
   in
-  Fiber.Var.set t_var t (fun () ->
-      Fiber.finalize f ~finally:(fun () ->
-          Mutex.lock t.mutex;
-          t.running <- false;
-          interrupt t;
-          Mutex.unlock t.mutex;
-          Fiber.return ()))
+  Fiber.Var.set t_var t (fun () -> f)
 
 let with_ f =
   let+ t = Fiber.Var.get_exn t_var in

@jchavarri
Collaborator Author

I can confirm that I've been able to run multiple package installations (1000+ packages each time, most of them using dune) without issues by pinning Dune to e1f3e19 (the commit before 12a0268).

So, while we can't confirm that 12a0268 is the offender, I think it's quite certain that something changed between that commit and 71024ef that caused the regression.

Try this patch, see if it makes a difference:

I will try this next.

@jchavarri
Collaborator Author

Maybe related? https://beesbuzz.biz/code/5739-The-problem-with-select-vs-poll

Fwiw we set ulimit -n to a very high value (65536) on the CI agents where we are seeing the problems.

@rgrinberg
Member

OCaml's select will error if we go over the fd limit.

Also, dune doesn't use that many fd's concurrently anyway, so it's unlikely we'll ever run into the issues described in that article.

Finally, if that were the problem, it wouldn't be macOS-specific.

Another piece of advice would be to print all the fd's returned by select when it's looping. You can print them as integers after doing Obj.magic.

@jchavarri
Collaborator Author

Another piece of advice would be to print all the fd's returned by select when it's looping. You can print them as integers after doing Obj.magic.

You mean something like this?

diff --git a/src/dune_async_io/async_io.ml b/src/dune_async_io/async_io.ml
index bbfa01011..a7d6aa4c0 100644
--- a/src/dune_async_io/async_io.ml
+++ b/src/dune_async_io/async_io.ml
@@ -190,6 +190,10 @@ let rec select_loop t =
          [t.pipe_write] will interrupt this select *)
       Unix.select read write [] (-1.0)
     in
+    print_endline
+      (Printf.sprintf "Readers: %s\nWriters: %s"
+         (String.concat ~sep:", " (List.map ~f:Obj.magic readers))
+         (String.concat ~sep:", " (List.map ~f:Obj.magic writers)));
     assert (ex = []);
     (* Before we acquire the lock, it's possible that new tasks were added.
        This is fine. *)

I want to be sure I log the right things, as it takes me ~1h for each one of these experiments.

@emillon
Collaborator

emillon commented Jun 30, 2023

Also, dune doesn't use that many fd's concurrently anyway, so it's unlikely we'll ever run into the issues described in that article.

If I understand correctly, the problem with select is not only that you can't use it with many fds; it's also that you can't use it with fds that have a high value (above FD_SETSIZE, typically 1024). So it can be an issue even if there's no fd leak or many concurrently open fds.
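
A minimal sketch of that pitfall (assuming ulimit -n has been raised well above FD_SETSIZE, as on the CI agents mentioned above; OCaml's select stub is expected to reject such an fd with EINVAL rather than misbehave):

(* select cannot watch an fd whose numeric value is >= FD_SETSIZE
   (commonly 1024), even when only a single fd is watched. *)
let () =
  (* burn fd numbers so the next fd gets a high value;
     assumes ulimit -n is well above 1024 *)
  let burned =
    List.init 1100 (fun _ -> Unix.openfile "/dev/null" [ Unix.O_RDONLY ] 0)
  in
  let r, _w = Unix.pipe () in
  (match Unix.select [ r ] [] [] 0.0 with
  | _ -> print_endline "select accepted the high-valued fd"
  | exception Unix.Unix_error (e, _, _) ->
      Printf.printf "select failed: %s\n" (Unix.error_message e));
  List.iter Unix.close burned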

@nojb
Collaborator

nojb commented Jun 30, 2023

select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x700005DD7F90) = 0 0

Another odd thing: there are two uses of Unix.select in Dune, and both have a non-empty set of read descriptors, which is inconsistent with these calls.

@rr0gi

rr0gi commented Jun 30, 2023

(it could be sleep implemented via empty select)

@rgrinberg
Member

You mean something like this? [the diff quoted above]

I want to be sure I log the right things, as it takes me ~1h for each one of these experiments.

Yes, that's correct. I would add a similar log message before the select as well.

@rgrinberg
Member

(it could be sleep implemented via empty select)

I think that's correct. We have a background thread to handle timers that polls at 10 Hz.

That solves the mystery of why we're calling select so much.
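
For illustration, the idiom looks roughly like this (a sketch, not Dune's actual timer code):

(* A sleep implemented as an "empty" select: no fds watched, only a timeout.
   In dtruss output this shows up as select(0x0, 0x0, 0x0, 0x0, ...). *)
let sleep seconds = ignore (Unix.select [] [] [] seconds)

(* A timer thread polling at 10 Hz would then loop like: *)
let rec timer_loop check_timers =
  sleep 0.1;
  check_timers ();
  timer_loop check_timers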

Still no idea what's burning all the CPU, though.

@jchavarri can you try attaching lldb and giving us all the backtraces?

@jchavarri
Collaborator Author

There were two dune processes installing melange at the point when things got stuck, so I took info from both. If there are other commands I can run with lldb, please let me know; I have essentially no idea what I'm doing:

$ ps aux | grep dune
j                14097  99.1  0.0 34353444    748 s000  R+    5:33PM   6:52.41 dune build -p melange -j 15 @install
j                30372  99.0  0.0 34258868    120   ??  R    10:33AM 364:30.91 dune build -p melange-json -j 15 @install
j                30102   0.0  0.0 34256820   4772   ??  S    10:33AM   0:05.25 dune build -p melange-json -j 15 @install
j                46493   0.0  0.0 34129560    588 s008  R+    5:43PM   0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox dune
j                13495   0.0  0.0 34353444   5004 s000  S+    5:33PM   0:01.94 dune build -p melange -j 15 @install

$ lldb -p 14097
(lldb) process attach --pid 14097
dune was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 14097 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x68)
    frame #0: 0x0000000108610108 dune`caml_thread_reinitialize at st_stubs.c:409:51 [opt]
Target 0: (dune) stopped.

Executable module set to "/Users/j/Development/github/ahrefs/monorepo/_opam/bin/dune".
Architecture set to: x86_64h-apple-macosx-.
(lldb) frame info
frame #0: 0x0000000108610108 dune`caml_thread_reinitialize at st_stubs.c:409:51 [opt]
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x68)
  * frame #0: 0x0000000108610108 dune`caml_thread_reinitialize at st_stubs.c:409:51 [opt]
    frame #1: 0x00007ff81a5f002f libsystem_pthread.dylib`_pthread_atfork_child_handlers + 69
    frame #2: 0x00007ff81a4df645 libsystem_c.dylib`fork + 84
    frame #3: 0x0000000108619926 dune`spawn_unix(v_env=<unavailable>, v_cwd=<unavailable>, v_prog=<unavailable>, v_argv=<unavailable>, v_stdin=<unavailable>, v_stdout=<unavailable>, v_stderr=17, v_use_vfork=1, v_setpgid=4440459752) at spawn_stubs.c:439:43 [opt]
    frame #4: 0x0000000108535f20 dune`camlSpawn__spawn_inner_993 + 176
    frame #5: 0x00000001084a982d dune`camlDune_engine__Process__fun_4561 + 1501
    frame #6: 0x0000000108189c48 dune`caml_apply2 + 40
    frame #7: 0x000000010852a465 dune`camlFiber__Scheduler__exec_815 + 53
    frame #8: 0x000000010852b02f dune`camlFiber__Scheduler__fun_1357 + 79
    frame #9: 0x00000001085253ba dune`camlFiber__loop_386 + 74
    frame #10: 0x00000001084b1fbb dune`camlDune_engine__Scheduler__run_3164 + 171
    frame #11: 0x00000001084b243d dune`camlDune_engine__Scheduler__run_and_cleanup_3346 + 13
    frame #12: 0x00000001084b3e78 dune`camlDune_engine__Scheduler__fun_5457 + 408
    frame #13: 0x00000001085934f8 dune`camlCmdliner_term__fun_633 + 104
    frame #14: 0x000000010858e677 dune`camlCmdliner_eval__run_parser_537 + 39
    frame #15: 0x000000010858f729 dune`camlCmdliner_eval__eval_value_inner_1711 + 601
    frame #16: 0x000000010818a338 dune`camlMain__entry + 1592
    frame #17: 0x0000000108180259 dune`caml_program + 11913
    frame #18: 0x000000010864c7e1 dune`caml_start_program + 73
    frame #19: 0x0000000108623ea2 dune`caml_startup_common(argv=0x00007ff7b7d833c8, pooling=<unavailable>) at startup_nat.c:160:9 [opt]
    frame #20: 0x0000000108623f1b dune`caml_main [inlined] caml_startup_exn(argv=<unavailable>) at startup_nat.c:167:10 [opt]
    frame #21: 0x0000000108623f14 dune`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:172 [opt]
    frame #22: 0x0000000108623f14 dune`caml_main(argv=<unavailable>) at startup_nat.c:179 [opt]
    frame #23: 0x0000000108623f8c dune`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
    frame #24: 0x00007ff81a2ba310 dyld`start + 2432

$ lldb -p 13495
(lldb) process attach --pid 13495
Process 13495 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007ff81a5aefe6 libsystem_kernel.dylib`read + 10
libsystem_kernel.dylib`read:
->  0x7ff81a5aefe6 <+10>: jae    0x7ff81a5aeff0            ; <+20>
    0x7ff81a5aefe8 <+12>: movq   %rax, %rdi
    0x7ff81a5aefeb <+15>: jmp    0x7ff81a5af9ef            ; cerror
    0x7ff81a5aeff0 <+20>: retq
Target 0: (dune) stopped.

Executable module set to "/Users/j/Development/github/ahrefs/monorepo/_opam/bin/dune".
Architecture set to: x86_64h-apple-macosx-.
(lldb) frame info
frame #0: 0x00007ff81a5aefe6 libsystem_kernel.dylib`read + 10
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff81a5aefe6 libsystem_kernel.dylib`read + 10
    frame #1: 0x00000001086199a4 dune`spawn_unix(v_env=<unavailable>, v_cwd=<unavailable>, v_prog=<unavailable>, v_argv=<unavailable>, v_stdin=<unavailable>, v_stdout=<unavailable>, v_stderr=17, v_use_vfork=1, v_setpgid=4440459752) at spawn_stubs.c:456:18 [opt]
    frame #2: 0x0000000108535f20 dune`camlSpawn__spawn_inner_993 + 176
    frame #3: 0x00000001084a982d dune`camlDune_engine__Process__fun_4561 + 1501
    frame #4: 0x0000000108189c48 dune`caml_apply2 + 40
    frame #5: 0x000000010852a465 dune`camlFiber__Scheduler__exec_815 + 53
    frame #6: 0x000000010852b02f dune`camlFiber__Scheduler__fun_1357 + 79
    frame #7: 0x00000001085253ba dune`camlFiber__loop_386 + 74
    frame #8: 0x00000001084b1fbb dune`camlDune_engine__Scheduler__run_3164 + 171
    frame #9: 0x00000001084b243d dune`camlDune_engine__Scheduler__run_and_cleanup_3346 + 13
    frame #10: 0x00000001084b3e78 dune`camlDune_engine__Scheduler__fun_5457 + 408
    frame #11: 0x00000001085934f8 dune`camlCmdliner_term__fun_633 + 104
    frame #12: 0x000000010858e677 dune`camlCmdliner_eval__run_parser_537 + 39
    frame #13: 0x000000010858f729 dune`camlCmdliner_eval__eval_value_inner_1711 + 601
    frame #14: 0x000000010818a338 dune`camlMain__entry + 1592
    frame #15: 0x0000000108180259 dune`caml_program + 11913
    frame #16: 0x000000010864c7e1 dune`caml_start_program + 73
    frame #17: 0x0000000108623ea2 dune`caml_startup_common(argv=0x00007ff7b7d833c8, pooling=<unavailable>) at startup_nat.c:160:9 [opt]
    frame #18: 0x0000000108623f1b dune`caml_main [inlined] caml_startup_exn(argv=<unavailable>) at startup_nat.c:167:10 [opt]
    frame #19: 0x0000000108623f14 dune`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:172 [opt]
    frame #20: 0x0000000108623f14 dune`caml_main(argv=<unavailable>) at startup_nat.c:179 [opt]
    frame #21: 0x0000000108623f8c dune`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
    frame #22: 0x00007ff81a2ba310 dyld`start + 2432

@jchavarri
Collaborator Author

The process got stuck with the patch suggested in #8083 (comment), so that doesn't seem to fix it.

@jchavarri
Collaborator Author

Here's the info with bt all:

lldb.txt

@anmonteiro
Collaborator

anmonteiro commented Jul 3, 2023

This is reproducible inside the dune codebase on macOS + OCaml 4.14:

$ make dev
$ git clone git@github.com:Gbury/dolmen.git

*edit the dune file to add dolmen to the excluded dirs*

$ cd dolmen
$ ../_build/install/default/bin/dune build -p dolmen
... hangs

@anmonteiro
Collaborator

anmonteiro commented Jul 3, 2023

Yes, that's correct. I would add a similar log message before the select as well.

I added both of these log messages, and the second one (after the select call) never prints. It looks like the select call never returns.

Adding the print before the select call shows just one reader, with an fd of 5 (which, looking at the code, is t.pipe_read):

Readers: 5
Writers:

Next, I tried giving select a 2 second timeout (instead of -1.0, i.e. unbounded), which causes the recursive function to be called again, but Dune still doesn't make progress in the build.
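
For reference, the experiment was roughly this (a hypothetical diff; the real change is to the select call in src/dune_async_io/async_io.ml shown earlier):

-      Unix.select read write [] (-1.0)
+      (* 2 s timeout instead of blocking forever: select now returns,
+         but the build still makes no progress *)
+      Unix.select read write [] 2.0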

@anmonteiro
Collaborator

You mean something like this? [the diff quoted above]

Almost. You need string_of_int after calling Obj.magic.
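
For reference, the corrected conversion would look something like this (a sketch using the standard library; Dune's vendored stdlib uses labelled arguments like ~sep and ~f):

(* On Unix, Unix.file_descr is an int under the hood, so Obj.magic yields
   the raw fd number; string_of_int then makes it printable. *)
let string_of_fd (fd : Unix.file_descr) = string_of_int (Obj.magic fd : int)

let log_fds readers writers =
  Printf.printf "Readers: %s\nWriters: %s\n%!"
    (String.concat ", " (List.map string_of_fd readers))
    (String.concat ", " (List.map string_of_fd writers))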

@anmonteiro
Collaborator

I just reverted #7418 and the hang is still there, so I'm ruling that out as the cause.

@anmonteiro
Collaborator

anmonteiro commented Jul 3, 2023

90f1d2428b89a8053beaa152e435a5b0c758880f is the first bad commit
commit 90f1d2428b89a8053beaa152e435a5b0c758880f
Author: Rudi Grinberg <me@rgrinberg.com>
Date:   Tue Jun 13 10:56:28 2023 +0100

    feature: enable async sandboxing/digests (#7947)

    These seem to speed up builds in practice

    Signed-off-by: Rudi Grinberg <me@rgrinberg.com>

 CHANGES.md                          | 3 +++
 src/dune_config/config.ml           | 4 ++--
 test/blackbox-tests/test-cases/dune | 3 +++
 3 files changed, 8 insertions(+), 2 deletions(-)

Seems like the culprit is #7947.

@anmonteiro
Collaborator

anmonteiro commented Jul 3, 2023

@jchavarri can you try your repro with 3.9 and setting the following environment variable?

DUNE_CONFIG__BACKGROUND_SANDBOXES=disabled

This seems to be the root cause.
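
For example, a hypothetical invocation (any dune-driven build from the failing installs above would do):

$ DUNE_CONFIG__BACKGROUND_SANDBOXES=disabled dune build -p melange -j 15 @install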

@rdavison
Contributor

rdavison commented Jul 3, 2023

@jchavarri can you try your repro with 3.9 and setting the following environment variable?

DUNE_CONFIG__BACKGROUND_SANDBOXES=disabled

This seems to be the root cause.

Setting this variable worked for me.

@emillon emillon mentioned this issue Jul 6, 2023
@anmonteiro
Collaborator

I don’t think this was fixed

@anmonteiro anmonteiro reopened this Jul 7, 2023
emillon added a commit to emillon/opam-repository that referenced this issue Jul 7, 2023
This is due to ocaml/dune#8083 which is mitigated in 3.9.1.
@emillon
Collaborator

emillon commented Jul 7, 2023

Correct. There are mitigations in 3.9.1, but the right fix will probably come from #8090.

@nojb
Collaborator

nojb commented Jul 7, 2023

Correct. There are mitigations in 3.9.1, but the right fix will probably come from #8090.

I'm following from a distance, but is the issue understood? If yes, can you say a few words about it? Thanks!

@anmonteiro
Collaborator

Still not fixed :)

@anmonteiro anmonteiro reopened this Jul 24, 2023
@dkalinichenko-js
Collaborator

Wait, why did GitHub close the issue? I just synced the Jane Street opam repository, which is not even under the OCaml organization.

@emillon
Collaborator

emillon commented Jul 24, 2023

Maybe if you have rights on this repo, it's enough to close it; I'm not sure.

@rgrinberg rgrinberg added this to the 3.10.0 milestone Aug 20, 2023
nberth pushed a commit to nberth/opam-repository that referenced this issue Jun 18, 2024
CHANGES:

- Disable background operations and threaded console on MacOS and other Unixes
  where we rely on fork. (ocaml/dune#8100, ocaml/dune#8121, fixes ocaml/dune#8083, @rgrinberg, @emillon)

- Initialize async IO thread lazily. (ocaml/dune#8122, @emillon)
nberth pushed a commit to nberth/opam-repository that referenced this issue Jun 18, 2024
This is due to ocaml/dune#8083 which is mitigated in 3.9.1.