fatal job exception raised on pending jobs when reloading Fluxion modules #1215

Open
grondo opened this issue Jun 4, 2024 · 1 comment

grondo commented Jun 4, 2024

While reloading fluxion on elcap, several pending jobs were canceled with a fatal job exception such as:

[Jun04 14:42] exception type="alloc" severity=0 note="alloc denied due to type=\"match error\"" userid=765
[  +0.000608] clean

For reference, here are the logs from the time of the module reload:

[Jun04 14:42] broker[0]: rmmod sched-fluxion-resource
[ +14.008927] sched-fluxion-resource[0]: responding to post-shutdown sched-fluxion-resource.cancel
[ +14.009019] broker[0]: module sched-fluxion-resource exited
[ +14.012128] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.014486] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.015532] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.045507] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.087013] broker[0]: rmmod resource
[ +14.087290] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.103970] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.104489] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.104968] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.105501] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.105973] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.106463] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.122417] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122435] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122442] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122447] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122451] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122456] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122461] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122465] sched-fluxion-qmanager[0]: responding to post-shutdown sched.disconnect
[ +14.122469] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122474] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122479] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122483] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122488] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122492] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.disconnect
[ +14.122496] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122500] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122505] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122510] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122514] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122518] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122529] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122534] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122538] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122543] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122546] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122550] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122554] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122558] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122563] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122580] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122585] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122590] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122594] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122599] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122603] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122608] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122612] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122635] sched-fluxion-qmanager[0]: responding to post-shutdown sched.cancel
[ +14.122639] sched-fluxion-qmanager[0]: responding to post-shutdown sched.cancel
[ +14.122642] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122648] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122652] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.139690] broker[0]: module sched-fluxion-qmanager exited
[ +14.139745] job-manager[0]: alloc: stop due to disconnect: Success
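To make the volume of requests answered after shutdown easier to see, the log above can be tallied by module and topic. The following is a minimal sketch, not part of Fluxion or flux-core: the function name `tally_post_shutdown` and the regular expression are illustrative assumptions based only on the line format shown in this excerpt, and `sample` is a small subset of the lines above.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt in this issue, e.g.:
# [ +14.122417] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
POST_SHUTDOWN = re.compile(
    r"\]\s+(?P<module>[\w-]+)\[\d+\]: "
    r"responding to post-shutdown (?P<topic>\S+)"
)


def tally_post_shutdown(lines):
    """Count 'responding to post-shutdown' messages per (module, topic)."""
    counts = Counter()
    for line in lines:
        match = POST_SHUTDOWN.search(line)
        if match:
            counts[(match.group("module"), match.group("topic"))] += 1
    return counts


# A few lines copied from the log above; other message types are ignored.
sample = [
    "[ +14.008927] sched-fluxion-resource[0]: responding to post-shutdown sched-fluxion-resource.cancel",
    "[ +14.012128] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented",
    "[ +14.122417] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping",
    "[ +14.122435] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping",
    "[ +14.122496] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free",
    "[ +14.139690] broker[0]: module sched-fluxion-qmanager exited",
]

if __name__ == "__main__":
    # Print the most frequent (module, topic) pairs first.
    for (module, topic), n in tally_post_shutdown(sample).most_common():
        print(f"{n:4d}  {module}  {topic}")
```

Passing the full log (for example, the lines of a saved `flux dmesg` capture) instead of `sample` gives a per-topic count of the `sched.free`, `sched.cancel`, and `sched.ping` requests that were still queued when the modules exited.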

grondo commented Jun 4, 2024

Note that in this particular case, we had to kill `flux module remove sched-fluxion-qmanager`, which was hanging due to the leaked alloc requests issue (I can't find that issue right now; feel free to link it here if found).
