[utils] Rewrite slurm.pl from scratch #4314
base: master
Conversation
The new version invokes `sbatch`, passing the batch file on stdin, and waits for completion of the script without polling flag files, instead passing the `--wait` switch to `sbatch` (which still does the polling, but more efficiently, over an already established RPC channel to the active controller). In this mode, `sbatch` prints the new job ID immediately upon submission, and then later exits with return code 0 on success, or prints an error message and exits with a non-zero code.

Slurm's `sbatch` command has a hardcoded polling schedule for waiting. Specifically, it checks whether the job has completed after 2s, then increases the wait time by 2s every poll cycle until it maxes out at 10s (i.e., the wait times are 2, 4, 6, 8, 10, 10, ... seconds). Because of this, even very short Kaldi batches often incur an extra 5s wait delay on average. The following patch reduces the poll interval to 1s without growth: https://github.com/burrmill/burrmill/blob/v0.5-beta.2/lib/build/slurm/sbatch.19.patch

It has been tested to apply cleanly up to Slurm v20.1, and is unlikely to break in the future, given it's a 2-line change. In any case, please open an issue in the https://github.com/burrmill/burrmill repository if your Slurm source does not take the patch.

You do not need administrative access to the cluster; you can just build your own version of `sbatch` and place it on the PATH. Internally it uses only Slurm RPC calls and does not require many dependencies. The Slurm RPC library .so file must already be available (use `ldd $(which sbatch)` to locate it); build against it with headers from the Slurm repository. Whether to discuss this with the cluster IT admin is, of course, your decision. Your arguments are that (a) Kaldi jobs are sometimes extremely short, sometimes 10 seconds, while it's not uncommon in the HPC world to submit multinode MPI jobs running for days, and (b) the old flag-file polling put a heavier load on the system at any poll rate: Slurm RPCs are incomparably more efficient than placing flag files on NFS or another shared FS.

Configuration I used: I defined a Gres resource named 'cuda' with subtypes 'p100', 'v100', 'T4', etc., in a single partition named 'gpu'. The CPU nodes were shaped with 10 vCPUs (i.e., 5 jobs at a time) and 12.5 GB of RAM -- pretty small, but enough unless used for composing a large HCLG. This is what `--whole-nodes` is for: give a whole node to such tasks. Now they added C2 machines, but these could not be custom-configured last time I checked. Take the smallest that can handle the HCLG-style stuff, either 4 or 8 vCPUs. These do not offer much benefit on FST tasks, but they are real monsters on matrix ops.
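Not part of the PR, but to make the waiting mechanism concrete: a minimal sketch (hypothetical job name and output path, not the PR's actual code) of feeding a batch script to `sbatch` on stdin with `--wait` and propagating the exit status from Perl.

```perl
#!/usr/bin/env perl
# Illustration only, not the PR's code: submit a batch script on stdin with
# 'sbatch --wait' and propagate its exit status.
use strict;
use warnings;

my $batch = <<'EOB';
#!/bin/bash
#SBATCH --job-name=toy_example
#SBATCH --output=toy_example.log
echo "running on $(hostname) as job $SLURM_JOB_ID"
EOB

# With no script file on the command line, sbatch reads the script from stdin.
# '--wait' makes sbatch block until the job terminates and exit non-zero if
# the job failed.
open(my $sb, '|-', 'sbatch', '--wait') or die "cannot run sbatch: $!";
print $sb $batch;
close($sb);              # returns only after sbatch (and hence the job) exits
my $status = $? >> 8;
die "batch job failed with status $status\n" if $status != 0;
print "batch job succeeded\n";
```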
In egs/wsj/s5/utils/parallel/slurm.pl:

```perl
}

if (exists $cli_options{"config"}) {
  $config = $cli_options{"config"};
```
In the original slurm.pl, $config is pre-set to "conf/slurm.conf", which results in the config being read automatically if it exists. That is also the default behavior of queue.pl, so it might be preferable to stick with that behavior here too...
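To make the suggestion concrete, a minimal sketch of the queue.pl-like behavior being proposed (hypothetical, not the PR's code; `$cli_options` is borrowed from the snippet above): default to conf/slurm.conf, read it only if it exists, and fail only when a config was explicitly requested but is missing.

```perl
# Sketch of the suggested default-config handling (illustration only).
my $config = "conf/slurm.conf";   # implicit default, as in queue.pl
my $config_specified = 0;
if (exists $cli_options{"config"}) {
  $config = $cli_options{"config"};
  $config_specified = 1;
}
if (-f $config) {
  # ... parse the config file here ...
} elsif ($config_specified) {
  die "slurm.pl: config file '$config' was given but does not exist\n";
}
# Otherwise fall back silently to the built-in defaults, as queue.pl does.
```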
The old hardcoded defaults in egs/wsj/s5/utils/parallel/slurm.pl, removed by this PR:

```perl
# option num_threads=* --cpus-per-task $0
# option num_threads=1 # Do not add anything to qsub_opts
# option max_jobs_run=* -tc $0
# default gpu=0
# option gpu=0 -p shared
# option gpu=* -p gpu #this has to be figured out

#print STDERR "$0 " . join(" ", @argv) . "\n";

my $qsub_opts = "";
my $sync = 0;
my $num_threads = 1;
my $max_jobs_run;
my $gpu = 0;

my $config = "conf/slurm.conf";
```
see this....
Will fix, thanks for spotting! What's your opinion on the default hardcoded configuration in the file: should we keep it, or drop it and always require a config file?
My take is that no hardcoded defaults can be sensible in any way; tuning a distributed configuration is always required, in my experience. I'd rather ship a default config file. IMO, the mere act of copying the file into conf/ will draw attention to it. I'll also comment it thoroughly.
Also, I want to make a pinned post to kaldi-help a few days before merging. Like, nearly right now. Stuff can break, and we don't know how many Slurm users are out there.
Well, we usually make the hardcoded config work on the CLSP cluster... YMMV
:)
BTW, if you have time, there is another issue with it, but I didn't find time to trace it... It seems like sometimes the return code of the tasks is either ignored or not propagated outside of Slurm.
y.
I do not know what CLSP is. Yes, I want to release this ASAP, please try to track it down. How does it not propagate? What do you see? I need to know the symptoms of the problem too, not only your hypothesis. You kinda started in the middle :)
I meant the CLSP cluster... :)
Essentially, I have noticed that if a task fails and returns a non-zero exit code, Slurm "ignores" it and still seems to return zero. At least I had a couple of tasks failing while the scripts kept trying to continue... And IIRC those were scripts that are tested, so it shouldn't be the scripts' fault.
That's all I know for now... I won't be able to track it down today -- sorry -- I'm trying to catch up with some other stuff I was ignoring last week.
y.
I'll enable more logging. Currently, I can see details only for jobs completed within the last 30 minutes, and I do not even flush logs to disk.
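One hypothetical way to narrow this down (not part of the PR, and assuming Slurm accounting via sacct/slurmdbd is available): after a suspect run, ask the accounting database directly for per-step states and exit codes, and compare them with what the wrapper reported.

```perl
#!/usr/bin/env perl
# Hypothetical diagnostic: list non-COMPLETED steps of a job with their exit
# codes, as recorded by Slurm accounting.
use strict;
use warnings;

my $jobid = shift @ARGV or die "usage: $0 <jobid>\n";
# -n: no header, -P: parsable pipe-delimited output, -o: fields to print.
open(my $acct, '-|', 'sacct', '-n', '-P', '-j', $jobid,
     '-o', 'JobID,State,ExitCode') or die "cannot run sacct: $!";
while (my $line = <$acct>) {
  chomp $line;
  my ($id, $state, $exit) = split /\|/, $line;
  next if !defined $state || $state eq 'COMPLETED';
  print "step $id: state=$state exit=$exit\n";
}
close($acct);
```

If sacct shows FAILED steps while the wrapper reported success, the problem would be in the wrapper's status propagation rather than in the recipe scripts themselves.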
And since I do not know what CLSP is, I obviously do not know what the CLSP cluster is. Is this cluster used by many Kaldi users? Enough, that is, to justify a hardcoded config?
I guess I meant that any configuration that is functional in at least one place is fine... People can copy the config out and modify it as conf/slurm.conf...
y.
This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.
Remember that we recently merged #4675; you might want to incorporate any needed changes from there.
I'm sitting on it because of internal Slurm changes. The script submits a job with the `--wait` switch to `sbatch`. Another thing is that polling is problematic, the more so the older the controller version is. They seriously took on the lock contention only from v20.
I suppose this isn't critical if the existing script works OK.
@dpovey, you're probably right. I'm trying to overengineer it so that it Slurms better than Slurm itself does. Probably no point. If it's 30s minimum, so be it. This version is much cleaner than the one currently on master, good enough. Let me get back to it next week and release.