Job Priority Plugin Implementation Notes
After discussing the advantages and drawbacks of creating an independent Flux module devoted to calculating job priority, we decided to make job priority an optional plugin to the sched module.
The prototype job priority plugin delivers the functionality described in the design document. The plugin offers an API of just three functions:

- `priority_setup()` is called once after the plugin is loaded and before it is first used. It initializes the plugin and reads in the charge account / user information.
- `prioritize_jobs()` is called at the top of every scheduling loop. It calculates and assigns a priority value to every pending job.
- `record_job_usage()` is called for every job that terminates. The computing resource usage value (nominally cpu-seconds) is extracted from the job record and charged to the account under which the user's job ran.
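The lifecycle above can be sketched as follows. This is an illustrative Python sketch only — the actual plugin is loaded by the sched module and written against its C API — and the record fields, argument names, and usage formula here are assumptions for illustration.

```python
# Illustrative sketch of the plugin's three entry points and when the
# scheduler would call them; names and fields are assumptions.

def priority_setup(associations_file):
    """Called once at load time: read the charge-account/user records."""
    accounts = {}
    with open(associations_file) as f:
        for line in f:
            # one '|'-separated record per line: account|shares|parent|user|
            account, shares, parent, user = line.rstrip("\n").split("|")[:4]
            accounts[(account, user)] = {"shares": int(shares or 0),
                                         "parent": parent, "usage": 0.0}
    return accounts

def prioritize_jobs(pending_jobs, accounts, compute_priority):
    """Called at the top of every scheduling loop: assign each job a priority."""
    for job in pending_jobs:
        job["priority"] = compute_priority(job, accounts)

def record_job_usage(job, accounts):
    """Called when a job terminates: charge cpu-seconds to its account."""
    key = (job["account"], job["user"])
    accounts[key]["usage"] += job["nprocs"] * job["runtime_s"]
```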
The plugin implements the design's scheme of computing the job priority as a weighted sum of a number of factors. Following the design, any number of factors can contribute to the priority value; the six factors described in the design are implemented, but more could be added. Each component of the formula is defined by the component's name, its weight, and a function that returns the component's factor.
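A minimal sketch of that structure is below. The six factor names come from this document; the weights and the stand-in factor functions are assumptions chosen only to illustrate the name/weight/function layout.

```python
# Each component of the priority formula: (name, weight, factor function).
# Factor functions nominally return a value in [0.0, 1.0]; these are stand-ins.
FACTORS = [
    ("fairshare", 100000, lambda job: job.get("fairshare", 0.5)),
    ("wait_time",   1000, lambda job: min(job.get("wait_s", 0) / 86400.0, 1.0)),
    ("job_size",     100, lambda job: job.get("size_factor", 0.0)),
    ("user",          10, lambda job: job.get("user_factor", 0.0)),
    ("queue",          0, lambda job: 0.0),  # stub, like the prototype's
    ("qos",            0, lambda job: 0.0),  # stub, like the prototype's
]

def compute_priority(job):
    """Priority = sum over components of weight * factor."""
    return sum(weight * fn(job) for _name, weight, fn in FACTORS)
```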
The calculations for the queue and quality-of-service (QoS) factors are stub functions for the time being; there is as yet no formal decision on how queues and QoS will be implemented in Flux.
The wait-time, job-size, and user factors are implemented in a rudimentary way. This will suffice for the time being to demonstrate their utility.
The function that calculates the fair-share factor is the most complicated. It returns a factor between 0.0 and 1.0 that represents how much computing resource has been consumed by past jobs charged to the account associated with the job, relative to the number of shares of computing resources the user has purchased or been promised. A value of 0.5 represents an account whose accrued usage is commensurate with its assigned shares. A value of 1.0 represents an account that has yet to be charged any usage (no jobs have run charging that account), while values between 0.0 and 0.5 represent over-serviced accounts.
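One mapping with exactly these properties — and the one used by Slurm's classic fair-share algorithm — is F = 2^(-U/S), where U is normalized usage and S is normalized shares. The sketch below assumes that formula; whether the prototype uses this exact curve is not stated in these notes.

```python
def fairshare_factor(norm_usage, norm_shares):
    """Map normalized usage U and normalized shares S to a factor in (0.0, 1.0].

    F = 2 ** (-U / S):
      F == 1.0 when the account has accrued no usage,
      F == 0.5 when usage exactly matches the assigned shares,
      F -> 0.0 as the account becomes over-serviced.
    """
    if norm_shares <= 0.0:
        return 0.0  # no shares: lowest fair-share standing (an assumption)
    return 2.0 ** (-norm_usage / norm_shares)
```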
Following the plans of prior discussions, we have agreed to leverage Slurm's database to house the user/charge-account "associations". This will be a read-only operation; no job data will be written by Flux to the Slurm database.
The user/account hierarchy will be retrieved from the Slurm database and written to a file using a very specific format: "account", "shares", "parent account", and "user name" - one record per line with '|' as a field separator. This is done by invoking the following command:
```
sacctmgr -n -p show assoc cluster=$LCSCHEDCLUSTER format=account,share,parentn,user > associations
```
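With `-p`, sacctmgr terminates every field with `|`, so each line of the resulting file holds one record. Hypothetical records (these accounts and users are invented for illustration) might look like:

```
root|1|||
physics|100|root||
physics|100|root|alice|
```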
Creating the associations file must be done manually or by a cron job; at this point, the priority plugin requires it to be present. The advantage of this approach is that there is always a file to read, even if the Slurm database is down or unreachable. Future enhancements call for a generic association record loader using site-specific adapter plugins for reading user/account/shares information; the invocation of the sacctmgr command would then move to such an adapter. Or perhaps Flux itself will someday include a facility to store user and account associations and shares.
There are a number of features that have not yet been implemented as of this writing.
The half-life decay of the accrued resource usage is not done. This is relatively straightforward: each accrued usage value must be reduced by a decay factor on a periodic basis.
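One way the periodic reduction could work is sketched below, assuming usage is decayed in place on a timer; the interval handling and data layout are assumptions.

```python
def decay_usage(accounts, dt_s, half_life_s):
    """Reduce every account's accrued usage for an elapsed interval of dt_s
    seconds.  Scaling by 2 ** (-dt / half_life) each interval is equivalent
    to halving the accrued usage every half_life seconds."""
    factor = 2.0 ** (-dt_s / half_life_s)
    for record in accounts.values():
        record["usage"] *= factor
```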
Also, no work has been done on persistence. While the user/account/shares data can be read from the associations file when the plugin loads, accrued usage values are not persisted; under the current implementation they are zeroed out every time the priority plugin is loaded.
In addition, work is needed to create the RPCs that service client commands to display the factors contributing to each job's priority value (analogous to Slurm's sprio command), as well as a detailed display of the factors used to generate the fair-share factor (analogous to Slurm's sshare command).
Finally, tests need to be created that load the priority plugin, run jobs through the system, and verify the computed priority values.