-
Notifications
You must be signed in to change notification settings - Fork 15
rfc20: add nslots
to Rv1 spec
#466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfc20: add nslots
to Rv1 spec
#466
Conversation
I do want to have the number of slots in there, but this actually exposes some things we probably need to fix. We need the number of slots, the shape of the slots and the location of them in the end. Looking at the branch using this it looks like slots per node is being calculated assuming the number on each node is constant for example, which is not necessarily true. It also leaves the shell to recalculate that shape. If this is useful in the interim while we figure out how to do that, I'm good with it and fluxion could pretty easily set it, but I'm not sure how much it buys us. |
I still think this would be useful in the interim, as the restrictions in Jobspec v1 mean that everything else is essentially trivially determined once the number of slots is included, so I believe that e.g. the shell or job-list services would be able to fully determine their necessary information without having to access the jobspec resources array for allocated jobs. Thus, even such a "simple" addition to Rv1 makes several other things much easier, and philosophically helps separate which tasks should be looking for information about the request in the jobspec (for pending jobs) vs tasks that should be using the concrete information in R once the job has been allocated (and not having to parse slot counts from the jobspec in all cases). In the longer term I completely agree that much more information will need to be included in the resource set in the future. My two cents would be to design a specification for a "canonical" R which could fully represent everything in a concretized canonical jobspec. And if it's still necessary to have a more limited Rv2 as a stepping stone, it would then be defined as a restricted subset of the canonical form, in the same way jobspec v1 is a restricted subset of canonical jobspec. My initial instinct would be to represent resources in R using exactly the same structure as the canonical jobspec resources array, only with concrete information now specifcied. i.e a |
That's a fair point. As I said I don't see any problem with it, it's easy enough to supply. @milroy do you have any thoughts here? |
These changes look good to me with respect to the interim need, and I'm fine with this PR as is. I agree with @trws that fluxion can be adapted to support The flexible match policy to be added by this PR: flux-framework/flux-sched#1296 can violate some of the assumptions in flux-framework/flux-core#6632 and will require more information about each slot to enable correct calculation. However, some of that difficulty can be alleviated by regularizing the slot shape by returning a single "best" shape per match. @zekemorton is working on that approach now. |
One thought I had while updating flux-framework/flux-core#6632 was that it would probably be better to ultimately have this as a mandatory field for a scheduler allocation, i.e. "A conforming scheduler SHALL include this in its allocations" (instead of SHOULD). It would still be an optional field in general, since not all resource sets are scheduler allocations, but this would mean that once it is implemented in fluxion the flux-core shell init could be further simplified to not have to worry about a future scheduler not setting As @garlick said in #402 (comment) "Move RFCs forward, let code catch up" 😁 |
Problem: the number of slots decided on by the scheduler is not currently recorded in its allocated R Add an (optional) nslots key to the execution dict of the Rv1 spec which SHALL be included in a scheduler allocation
c8f1c4f
to
8da980e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, looks good and lets get this merged!
I agree about making the nslots required so we can remove the conditional logic eventually in the job shell.
Problem: the number of slots decided on by the scheduler is not currently recorded in its allocated R
Add an (optional) nslots key to the execution dict of the Rv1 spec which SHOULD be included in a scheduler allocation
This update documents the design change proposed in flux-framework/flux-core#6632 (specifically this comment)if this idea seems like it would also be feasible to implement in
fluxion