job submissions are serialized and not interactively performant #1159

Closed
adamdbertsch opened this issue Mar 29, 2024 · 3 comments

Comments

@adamdbertsch

Job submissions are serialized across a large system, and each job submission takes on the order of 10 s to complete. This means that submitting more than a handful of jobs can take minutes, and all other users are blocked from submitting their own jobs during this time.

It appears that the same is true for flux resource commands, and that they share the same "big lock" as flux job commands. A single flux job submission queued behind a flux resource list command on a large system took 32 s to complete.

@grondo
Contributor

grondo commented Mar 29, 2024

> Job submissions are serialized across a large system, and each job submission takes on the order of 10 s to complete

There are multiple issues going on here which may make job submissions appear to be serialized:

  • The job feasibility validator plugin contacts the scheduler to determine job feasibility. This request may be blocked behind a scheduling loop, and will in turn block the next scheduling loop.
  • Match policies in Fluxion such as hinodex and lonodex have known performance issues. We should ensure the system is using firstnodex.

If the firstnodex policy doesn't resolve the slow submission performance, we may want to disable the feasibility plugin until feasibility performance issues can be addressed by Fluxion developers.

Related #1001
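
The two suggestions above can be turned into a quick configuration check. The following is a minimal sketch, not part of Flux: it assumes the broker configuration has already been parsed into a Python dict, and it assumes the match policy lives under `sched-fluxion-resource.match-policy` and the validator list under `ingest.validator.plugins`. Both key names should be verified against the flux-config man pages for the installed versions. The helper name `check_submission_config` is hypothetical.

```python
# Policies that this thread names as having known performance issues.
SLOW_POLICIES = {"hinodex", "lonodex"}


def check_submission_config(config):
    """Return a list of warnings for a parsed Flux config dict.

    The table and key names used here are assumptions; check them
    against the documentation for your flux-core/flux-sched versions.
    """
    warnings = []

    policy = config.get("sched-fluxion-resource", {}).get("match-policy")
    if policy in SLOW_POLICIES:
        warnings.append(
            f"match-policy {policy!r} has known performance issues; "
            "consider 'firstnodex'"
        )
    elif policy is None:
        warnings.append(
            "match-policy is not set explicitly; confirm which default applies"
        )

    plugins = config.get("ingest", {}).get("validator", {}).get("plugins", [])
    if "feasibility" in plugins:
        warnings.append(
            "feasibility validator is enabled; each submission queries the scheduler"
        )
    return warnings


# Hypothetical configuration, for illustration only.
example = {
    "sched-fluxion-resource": {"match-policy": "lonodex"},
    "ingest": {"validator": {"plugins": ["jobspec", "feasibility"]}},
}

for warning in check_submission_config(example):
    print(warning)
```

A configuration that sets firstnodex and omits the feasibility validator produces no warnings from this sketch.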

> It appears that the same is true for flux resource commands, and that they share the same "big lock" as flux job commands

There is no "big lock"; however, the current version of flux-resource does need to query the scheduler for the scheduler state of resources. Since the scheduler is single-threaded, it can only do one thing at a time.

This was solved by flux-framework/flux-core#5796.

There are also some performance issues in flux-resource itself which were addressed in flux-framework/flux-core#5823 and flux-framework/flux-core#5824.

The flux-resource performance fixes will be available in flux-core v0.61.0, which is scheduled to be released 2024-04-02.
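
To illustrate why a single-threaded scheduler can look like a "big lock" from the outside, here is a toy model of one server handling requests in arrival order. It is a sketch only and does not model Flux internals; the request names and timings are made-up values, not measurements from the system in this issue.

```python
def fifo_latencies(requests):
    """Latency of each request when one thread serves them in arrival order.

    requests: list of (name, arrival_time_s, service_time_s), sorted by arrival.
    """
    clock = 0.0
    latencies = {}
    for name, arrival, service in requests:
        start = max(clock, arrival)  # wait until the single thread is free
        clock = start + service
        latencies[name] = clock - arrival
    return latencies


# Hypothetical numbers: a slow resource query arrives just before a
# cheap feasibility check for a job submission.
requests = [
    ("resource-list", 0.0, 20.0),
    ("feasibility-check", 1.0, 0.5),
]

print(fifo_latencies(requests))
# {'resource-list': 20.0, 'feasibility-check': 19.5}
```

The feasibility check needs only 0.5 s of work in this example, but its observed latency is dominated by the request ahead of it, which matches the symptom reported for a submission queued behind a resource query.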

@grondo
Contributor

grondo commented Mar 29, 2024

There are also some minor improvements in flux-sched v0.33.0. I think the system in question is still at v0.32.0.

@trws
Member

trws commented Jul 31, 2024

We now ingest jobs at ~450 jobs/s, and can schedule and start them at a stable rate of ~100 jobs/s on a consumer-grade laptop. The sched loop is also completely decoupled from submission, so even if scheduling slows down, submission should remain fast. In addition, the resource status commands are now processed in a different thread, and their results are cached on large systems. Is this still an issue @adamdbertsch?
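
The decoupling described above can be pictured with a small producer/consumer sketch. This is an illustration of the general pattern only, not how Flux implements it: the `Ingest` class, its methods, and the jobspec dict are all hypothetical. Submission only enqueues the job and returns an ID, while a separate thread drains the queue at its own pace.

```python
import queue
import threading


class Ingest:
    """Toy ingest front end with a scheduling loop on its own thread."""

    def __init__(self):
        self._pending = queue.Queue()
        self._next_id = 0
        self.scheduled = []
        worker = threading.Thread(target=self._sched_loop, daemon=True)
        worker.start()

    def submit(self, jobspec):
        """Enqueue a job and return its ID without waiting for scheduling."""
        job_id = self._next_id
        self._next_id += 1
        self._pending.put((job_id, jobspec))
        return job_id

    def _sched_loop(self):
        while True:
            job_id, _jobspec = self._pending.get()
            # Placeholder for the (possibly slow) resource match.
            self.scheduled.append(job_id)
            self._pending.task_done()

    def drain(self):
        """Block until every submitted job has been handled."""
        self._pending.join()


ingest = Ingest()
ids = [ingest.submit({"command": ["hostname"]}) for _ in range(5)]
ingest.drain()
print(ids, ingest.scheduled)
```

Because `submit` never waits on the scheduling thread, a slow match step delays when jobs start but not how quickly submissions are accepted.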

trws closed this as completed on Aug 27, 2024

3 participants