job submissions are serialized and not interactively performant #1159
Job submissions are serialized across a large system, and each job submission takes on the order of 10 seconds to complete. This means that submitting more than a handful of jobs can take minutes, and all other users are blocked from submitting their own jobs during this time.

It appears that the same is true for flux resource commands, and that they share the same "big lock" as flux job commands. A single flux job submission behind a flux resource list command on a large system took 32 seconds to complete.
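A minimal reproduction sketch, assuming flux-core's Python bindings are available on the system; the job count and the trivial `true` payload are arbitrary choices for illustration:

```python
# Hypothetical reproduction: time a small burst of submissions to see
# whether each one takes on the order of 10 seconds.
import time

import flux
from flux.job import JobspecV1

h = flux.Flux()  # connect to the enclosing Flux instance
spec = JobspecV1.from_command(["true"])  # trivial job payload

start = time.monotonic()
for i in range(10):
    t0 = time.monotonic()
    jobid = flux.job.submit(h, spec)
    print(f"submit {i}: {time.monotonic() - t0:.1f}s "
          f"(total {time.monotonic() - start:.1f}s), id={jobid}")
```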
Comments

There are multiple issues going on here which may make job submissions appear to be serialized.

Related: #1001
There is no "big lock", however the current version of flux-resource does need to query the scheduler for the scheduler state of resources. Since the scheduler is single threaded, it can only do one thing at once. This was solved by flux-framework/flux-core#5796. There are also some performance issues in The flux-resource performance fixes will be available in flux-core v0.61.0, which is scheduled to be released 2024-04-02. |
There are also some minor improvements in flux-sched v0.33.0. I think the system in question is still at v0.32.0.
We now ingest jobs at ~450 jobs/s, and can schedule and start them at a stable rate of ~100 jobs/s on a consumer-grade laptop. The sched loop is also completely decoupled from submission, so even if scheduling is slowed down, submission should remain fast. The resource status commands are also processed in a different thread and cached on large systems. Is this still an issue @adamdbertsch?
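A sketch of how ingest rate can be measured independently of scheduling, assuming flux-core's Python bindings; `flux.job.submit_async` returns a future immediately, so submission throughput is observable even while the scheduler is busy (the job count here is arbitrary):

```python
# Hypothetical throughput check: submit asynchronously, then collect
# jobids; the ingest rate does not depend on how fast jobs are scheduled.
import time

import flux
from flux.job import JobspecV1

h = flux.Flux()
spec = JobspecV1.from_command(["true"])

n = 1000
t0 = time.monotonic()
futures = [flux.job.submit_async(h, spec) for _ in range(n)]
jobids = [f.get_id() for f in futures]  # blocks until all are ingested
elapsed = time.monotonic() - t0
print(f"ingested {n} jobs in {elapsed:.1f}s ({n / elapsed:.0f} jobs/s)")
```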