Deferred allocation #987
Comments
To make sure I understand the basics (without getting into too much complexity yet) of the deferred allocation capability, this is a three-part process:
Is that basically correct? |
I've confirmed that by manipulating flux-sched/resource/traversers/dfu.cpp (line 277 in 35d3c96), setting at = 3600 in dfu_traverser_t::run, and performing a match allocate:
resource-query> match allocate t/data/resource/jobspecs/basics/test001.yaml
---------------core35[1:x]
------------socket1[1:x]
---------node1[1:s]
------rack0[1:s]
---tiny0[1:s]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=RESERVED
INFO: SCHEDULED AT=3600
INFO: =============================
Of course, there will be a decent amount of development required to add new |
Actually that is not complicated.
As discussed with @grondo during last week's team meeting, we still need to decide how to proceed with this part. The current state of PR #1013 uses the optional flux-sched/resource/traversers/dfu_impl.hpp Line 104 in 90f8229
An example test jobspec looks like this:

    version: 9999
    resources:
      - type: cluster
        count: 1
        with:
          - type: rack
            count: 1
            with:
              - type: node
                count: 1
                with:
                  - type: slot
                    count: 1
                    label: default
                    with:
                      - type: socket
                        count: 1
                        with:
                          - type: core
                            count: 1
    # a comment
    attributes:
      system:
        duration: 3600
        # optional deferred keys
        deferred_start: 1800
        deferred_from: 0
    tasks:
      - command: [ "app" ]
        slot: default
        count:
          per_slot: 1

My sense is that while this may work well for automated submission, it will be hard for manual submission. @jameshcorbett and @ryanday36 might have good input here. |
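For automated submission, the two optional keys could be attached to an existing jobspec programmatically rather than by editing YAML by hand. The following is only a minimal sketch: `add_deferred_keys` is a hypothetical helper (not part of flux-sched or flux-core), and it assumes the `deferred_start` and `deferred_from` keys live under `attributes.system`, as the example jobspec above suggests.

```python
import copy


def add_deferred_keys(jobspec: dict, deferred_start: int, deferred_from: int = 0) -> dict:
    """Return a copy of `jobspec` with the deferred keys set.

    Hypothetical helper: key names are taken from the example jobspec;
    their placement under attributes.system is an assumption.
    """
    if deferred_start < 0 or deferred_from < 0:
        raise ValueError("deferred times must be non-negative")
    spec = copy.deepcopy(jobspec)  # leave the caller's jobspec untouched
    system = spec.setdefault("attributes", {}).setdefault("system", {})
    system["deferred_start"] = deferred_start
    system["deferred_from"] = deferred_from
    return spec


base = {"version": 9999, "attributes": {"system": {"duration": 3600}}}
deferred = add_deferred_keys(base, deferred_start=1800)
print(deferred["attributes"]["system"])
```

This does not address the manual-submission concern; it only shows that the keys are easy to add when the jobspec is generated by a tool.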
The problem is that you need to be able to define those attributes without writing a YAML file every time? We are working on a shape spec for resources (flux-framework/rfc#371); maybe we need the same for system attributes? Ping @trws |
Would the submit time (called |
There is already a facility for specifying system attributes on the command line of the submission commands (See documentation of |
That is a great idea. I was going to suggest something similar in that |
I think that I was also thinking more about what keyword would make sense for this. I'm leaning toward something more like 'reserve_time' or 'reserve_start', or maybe 'require_start' since it will raise an exception on the job if it can't start at that time. |
The
It would be nice to support something similar here. If we can add whatever option we call this to the jobspec RFC, then perhaps it would make sense to expose this as a similar option in the submission commands? Or, would it be too kludgy to add some kind of sentinel to |
@grondo why should we require users to figure out timestamps / timezones? Isn't it easier (or minimally should be an option) to provide relative times? E.g., what if you are doing some kind of flux proxy to an instance in a different timezone and then you get it wrong (or minimally have to convert which is a hairball I don't think we want to dive into). A suggestion - if begin time is already a thing (and indeed it's actually a time to begin) why not have a Reference for time pain: https://gist.github.com/timvisee/fcda9bbdff88d45cc9061606b4b923ca ⏲️ 😱 |
I'm confused. As shown above, the interface does not require users to actually specify the timestamp. The begin time can be specified as an offset or absolute time or any other format supported by |
Oh I see, if you add + it is an offset? Sorry I'm just really stupid. |
I'll just see myself out, I'm not really helping anyone. |
I think I'm having one of those days myself FWIW. |
I didn't know about I just realized I obfuscated a crucial detail with I could certainly implement what @grondo suggested from the |
Problem: support a scheduling request for an allocation to occur at a specific time in the future.
Currently, a reservation of resources occurs as early as possible. However, to support workflows that benefit from running tasks across heterogeneous platforms, it is desirable to synchronize multiple allocations across different child instances, such that tasks 1-10 run on corona while tasks 11-20 "simultaneously" run on another cluster managed by Flux.
To support such use cases, two things are needed.
One is the deferred allocation capability, and the other is a means to query the allocation delay.
A parent instance can query its remote child instances to find the earliest time at which all the children can allocate the requested resources. It should then be possible to allocate synchronously across instances.
Pushing the reservation time back should also take backfilling into account.
To be clear, this is not the same as trying to allocate at the earliest time after a specific point in time.
I am not entirely sure whether the existing issue #963 is the latter case or the same as this.
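
To make the distinction concrete, here is a toy sketch of the two capabilities and of how a parent might combine them. It is not flux-sched code: the window model and the names `earliest_start`, `can_start_at`, and `synchronized_start` are all hypothetical, and each child is assumed to expose its free time windows as `(start, end)` pairs in seconds.

```python
from typing import Dict, List, Optional, Tuple

Window = Tuple[int, int]  # hypothetical: a free (start, end) interval on a child


def earliest_start(windows: List[Window], duration: int, not_before: int = 0) -> Optional[int]:
    """Delay query: earliest time >= not_before at which `duration` fits."""
    for start, end in sorted(windows):
        candidate = max(start, not_before)
        if candidate + duration <= end:
            return candidate
    return None


def can_start_at(windows: List[Window], duration: int, at: int) -> bool:
    """Deferred allocation: the job must start exactly at `at`, or not at all."""
    return any(start <= at and at + duration <= end for start, end in windows)


def synchronized_start(children: Dict[str, List[Window]], duration: int) -> Optional[int]:
    """Find a common start time by re-querying every child from the latest answer."""
    t = 0
    while True:
        starts = [earliest_start(w, duration, t) for w in children.values()]
        if any(s is None for s in starts):
            return None  # at least one child can never host the request
        latest = max(starts)
        if all(s == latest for s in starts):
            return latest  # every child can start at this exact time
        t = latest  # strictly later than before, so the loop terminates


children = {
    "corona": [(0, 1000), (3600, 10000)],
    "other": [(1800, 9000)],
}
t = synchronized_start(children, duration=3600)
print(t, all(can_start_at(w, 3600, t) for w in children.values()))
```

In this toy model `can_start_at` fails when the exact time is unavailable, whereas `earliest_start` slides forward to the next fit; that is the difference between the deferred allocation requested here and "earliest after a point in time". Backfilling is not modeled.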