Add wait_for slurmdbd port #129


Merged
merged 3 commits into master from fix/slurmdbd-restart on Mar 9, 2022

Conversation

@m-bull (Contributor) commented Mar 4, 2022

Ensure that the slurmdbd service is accessible on its specified port after a restart, before restarting any other services that might depend on slurmdbd being accessible.

This fixes an issue where slurmctld is restarted before slurmdbd is listening on its port, causing systemctl restart slurmctld to fail with the following messages in syslog:

Mar  3 17:02:58 matt-slurm-control-0 slurmctld[63748]: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Mar  3 17:02:58 matt-slurm-control-0 slurmctld[63748]: error: Sending PersistInit msg: Connection refused
Mar  3 17:02:58 matt-slurm-control-0 slurmctld[63748]: fatal: You are running with a database but for some reason we have no TRES from it.  This should only happen if the database is down and you don't have any state files.
Mar  3 17:02:58 matt-slurm-control-0 systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Mar  3 17:02:58 matt-slurm-control-0 systemd[1]: slurmctld.service: Failed with result 'exit-code'.

This wait_for approach is already taken when restarting the slurmctld daemon.
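
For reference, a minimal sketch of the kind of wait_for task this adds (the task name, variable names such as openhpc_slurmdbd_port, and the defaults are illustrative assumptions, not the role's actual code; 6819 is slurmdbd's stock DbdPort):

# Illustrative sketch only, not the PR diff
- name: Wait for slurmdbd to respond on its port
  ansible.builtin.wait_for:
    host: "{{ openhpc_slurmdbd_host | default('127.0.0.1') }}"  # assumed variable name
    port: "{{ openhpc_slurmdbd_port | default(6819) }}"         # assumed variable name; 6819 is slurmdbd's default
    timeout: 60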

@m-bull m-bull added the bug Something isn't working label Mar 4, 2022
@m-bull m-bull force-pushed the fix/slurmdbd-restart branch from ae45894 to e9dac90 Compare March 7, 2022 12:33
@sjpb (Collaborator) left a comment


I suspect the actual problem is a mis-ordering here, which currently does:

  • "Ensure slurmdbd is started and running" -> starts slurmdbd
  • "Flush handlers" -> always restarts slurmdbd without checking it is up, then restarts slurmctld/slurmd

Maybe under some (as yet undetermined) circumstances slurmdbd is not restarting fast enough. So the proposed check is fine as far as it goes (although see the specific comment too), but I think the flush should be moved BEFORE the "Ensure slurmdbd ..." task. That avoids the restart bounce (as is done for slurmctld/slurmd) and might actually be enough to fix it on its own, although I think the proposed check is still worthwhile as it will catch an actual failure to start rather than just a slow startup.
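
A rough sketch of the ordering being suggested, assuming the task names quoted above and purely illustrative module arguments (the placement of meta: flush_handlers is the point):

# Run any already-notified "Restart slurmdbd" handler first ...
- name: Flush handlers
  ansible.builtin.meta: flush_handlers

# ... so this task then only needs to start slurmdbd if it is not already running,
# avoiding the start-then-immediately-restart bounce.
- name: Ensure slurmdbd is started and running
  ansible.builtin.service:
    name: slurmdbd
    state: started
    enabled: true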

@m-bull (Contributor, Author) commented Mar 9, 2022

Tested new changes using Azimuth - problem is still fixed and appliances still provision!

@sjpb sjpb merged commit cc31876 into master Mar 9, 2022
@sjpb sjpb deleted the fix/slurmdbd-restart branch March 9, 2022 16:51