dune
process hangs on macos
#8083
Comments
Here is another result from running
|
Bisecting is probably your best bet. Before doing that you may want to test commit 886ff3b that caused some issues (on Windows, but you never know). |
Ok, I am trying to bisect now with older commits, it will take a while as the issue doesn't happen always.
Hm, I tested this commit and its parent, and both seemed to work fine. Then I noticed the commit is not actually on the |
Yes, it is, but it is not the same commit because it was cherry-picked to the release |
Are these builds hanging in watch mode? Could you run |
They happen in regular, non-watch mode, mostly when installing many packages with opam (where
continues forever (around 10-15 new lines / second). |
I don't even understand how this is possible:
From the manual:
How can the time limit expire if I set it to block indefinitely? I also don't understand why they stay running forever. If select is constantly returning, it should eventually be shut off by the finalizer. |
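For reference, the two behaviours in question can be reproduced in isolation. This is a minimal sketch in Python, whose select.select wraps the same select(2) call; it is not dune's code, and the pipe and the wait_readable helper are made up for illustration. A finite timeout expires with nothing reported, whereas a descriptor at end-of-file is reported readable on every call even when the wait is unbounded, which is one generic way a select loop can keep returning.

```python
import os
import select
import time


def wait_readable(fd, timeout):
    """Return True if fd is reported readable before the timeout expires.

    timeout=None blocks indefinitely, like a negative timeout passed to
    OCaml's Unix.select; a float is a limit in seconds.
    """
    readable, _, _ = select.select([fd], [], [], timeout)
    return fd in readable


r, w = os.pipe()

# Nothing has been written yet: select gives up once the 50 ms limit expires.
start = time.monotonic()
idle = wait_readable(r, 0.05)
elapsed = time.monotonic() - start

# Once the only writer is closed, the read end is at end-of-file.
# select reports it readable on every call, with no timeout involved,
# so a loop that never reads or closes the fd would keep spinning.
os.close(w)
eof_results = [wait_readable(r, None) for _ in range(3)]
at_eof = os.read(r, 1) == b""
os.close(r)
```

Whether something like this happens in the hanging dune process is not established here; the sketch only shows that constant returns do not require the time limit to expire.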
Try this patch, see if it makes a difference:
|
I can confirm that I've been able to run multiple successful installations of packages (1000+ pkgs are installed each time, most of them using dune) without issues by pinning the version of Dune to e1f3e19 (the commit before 12a0268). So, while we can't confirm that 12a0268 is the offender, I'm fairly sure that something changed between that commit and 71024ef that caused the regression.
I will try this next. |
Maybe related? https://beesbuzz.biz/code/5739-The-problem-with-select-vs-poll Fwiw we set |
OCaml's select will error if we go over the fd limit. Also, dune doesn't use that many fds concurrently anyway, so it's unlikely we'll ever run into the issues in that article. Finally, if that was the problem, it wouldn't be macOS-specific. Another piece of advice would be to print all the fds returned by select when it's looping. You can print them as integers after doing |
You mean something like this?
I want to be sure I log the right things, as it takes me ~1h for each one of these experiments. |
If I understand correctly, the problem with select is not only that you can't use it with many fds; it's also that you can't use it with fds that have a high value. So it can be an issue even if there's no fd leak or many concurrently open fds. |
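The distinction between the number of open fds and their numeric value can be checked directly. A minimal sketch in Python (an illustration, not dune's OCaml code): CPython's select.select raises ValueError for any descriptor number at or above FD_SETSIZE, even if only one descriptor is being watched. The value 100_000 is a hypothetical descriptor number chosen to exceed that limit, which is typically 1024.

```python
import os
import select


def select_accepts(fd):
    """Return True if select can be asked about this descriptor number."""
    try:
        select.select([fd], [], [], 0)  # timeout 0: poll once, do not block
        return True
    except ValueError:
        # CPython refuses numbers >= FD_SETSIZE before calling select(2),
        # because they would not fit in the fixed-size fd_set bitmap.
        return False


r, w = os.pipe()
small_ok = select_accepts(r)       # a low-numbered, open descriptor
huge_ok = select_accepts(100_000)  # hypothetical high descriptor number
os.close(r)
os.close(w)
```

So a single descriptor is enough to hit the limit if its number is high, which matches the reading above. An error of this kind is raised on every platform, though, so by itself it would not explain a macOS-only hang.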
Another odd thing: there are two uses of |
(it could be sleep implemented via empty select) |
Yes that's correct. I would add a similar log message before the select as well. |
I think that's correct. We have a background thread to handle timers that polls at 10 Hz. Mystery solved why we're calling select so much. Still no idea what's burning all the CPU though. @jchavarri can you try attaching lldb and giving us all the backtraces |
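To illustrate why a 10 Hz timer thread explains the frequent select calls but not the CPU usage, here is a minimal sketch in Python, not dune's implementation. It assumes the sleep is done with an empty select, as guessed above; timer_loop and TICK are hypothetical names.

```python
import select
import threading
import time

TICK = 0.1  # 10 Hz, the polling rate mentioned above


def timer_loop(stop, ticks):
    """Wake up every TICK seconds until stop is set, recording each wake-up."""
    while not stop.is_set():
        # With no descriptors to watch, select only waits for the timeout,
        # so it acts as a sleep (this works on Unix, not on Windows).
        select.select([], [], [], TICK)
        ticks.append(time.monotonic())


stop = threading.Event()
ticks = []
worker = threading.Thread(target=timer_loop, args=(stop, ticks), daemon=True)
worker.start()
time.sleep(0.55)  # let the timer thread run for about half a second
stop.set()
worker.join()
```

Each iteration spends nearly all of its time blocked in the kernel, so a loop like this makes about ten calls per second while using almost no CPU.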
There were two
|
the process got stuck with the patch suggested in #8083 (comment), so that doesn't seem to fix it. |
Here's the info with |
This is reproducible inside the dune codebase on macOS + OCaml 4.14
|
I added both these log messages, and the 2nd one (after the Adding the print before the select call yields just one reader with a fd of 5 (which, looking at the code, is
Next, I tried adding a 2 second (instead of -1.0, unbounded) wait to |
Almost. You need |
I just reverted #7418 and the hang is still there, so I'm ruling that out as the cause. |
seems like the culprit is #7947 |
@jchavarri can you try your repro with 3.9 and setting the following environment variable?
This seems to be the root cause. |
Setting this variable worked for me. |
I don’t think this was fixed |
This is due to ocaml/dune#8083 which is mitigated in 3.9.1.
Correct. There are mitigations in 3.9.1 but the right fix will come from #8090 probably. |
I'm following from a distance, but is the issue understood? If yes, can you say a few words about it? Thanks! |
Still not fixed :) |
Wait, why did GitHub close the issue? I just synced the Jane Street opam repository, which is not even under the OCaml organization. |
Maybe if you have rights on this repo it's enough to close it, I'm not sure. |
CHANGES: - Disable background operations and threaded console on MacOS and other Unixes where we rely on fork. (ocaml/dune#8100, ocaml/dune#8121, fixes ocaml/dune#8083, @rgrinberg, @emillon) - Initialize async IO thread lazily. (ocaml/dune#8122, @emillon)
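The changelog entry ties the mitigation to platforms where dune relies on fork, without spelling out the mechanism. One general property of fork in threaded programs, shown here as a minimal Python sketch rather than as an account of dune's bug, is that only the thread calling fork exists in the child. The helper count_threads_in_child and the idle background thread are hypothetical, made up for this illustration.

```python
import os
import threading


def count_threads_in_child():
    """Fork and return how many Python threads the child process sees."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork was copied.
        try:
            os.close(r)
            os.write(w, str(threading.active_count()).encode())
        finally:
            os._exit(0)  # leave immediately, skipping the parent's cleanup
    # Parent: read the child's answer, then reap the child.
    os.close(w)
    data = os.read(r, 16)
    os.close(r)
    os.waitpid(pid, 0)
    return int(data)


stop = threading.Event()
background = threading.Thread(target=stop.wait, daemon=True)
background.start()

threads_in_parent = threading.active_count()  # main thread plus background
threads_in_child = count_threads_in_child()

stop.set()
background.join()
```

Anything the missing threads were responsible for, such as releasing a lock or servicing a timer, never happens in the child. Whether that is what went wrong here is an assumption; the thread itself only records that disabling the background threads on these platforms mitigated the hang.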
Recently, we updated Ahrefs codebase to the latest version of Dune (to be more specific, commit 71024ef) from the previous version we were using from back in April (7bb6de7), which was working fine.
After upgrading, all our Linux CI agents work perfectly fine, but on the macOS agents we have noticed that dune processes stay running forever, with CPU usage over 90%.
Here is an example of a dune process that has been running for 55+ minutes with CPU at 100%:
Looking at different occurrences of the issue, I couldn't find any pattern in the packages where it appears. What I can say is that it happens on different versions of macOS. In particular, the agent used in the command above runs 12.3.1, but it also happens on 12.6.1 and 13.0.1.
I have tried to gather some information about what exactly the hanging dune process is doing. Calling sample 3759 with the process id gathered with ps aux above shows a lot of nested callbacks with camlDune_engine__Process__fun_4514; I'm not sure if this is expected. You can find the full output of that command here: sample-3759.txt
Is there anything I could do to help diagnose the problem?