Job submission slows down on Hetchy #1001
By contrast, on Corona the timings are very consistent.
Oh, I didn't quite catch that it was job submission that was slowing down here! That is quite unexpected and should not happen. In general, responses to a … We'll have to get to the bottom of this!
I did notice on rank 0 the broker was at 100% CPU. A perf report shows Fluxion using 98% of the cycles:
So my guess is that the slowness in job submission is due to the … Note that the feasibility checks are reasonably fast until you start running jobs; then this script reproduces the slowness with just the validator:

# no running jobs
$ for i in `seq 1 10`; do /usr/bin/time --format="%e" sh -c 'flux mini run --dry-run -N1 hostname | flux job-validator --plugins=feasibility,jobspec --jobspec-only' ; done
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
{"errnum": 0}
0.48
# with running jobs
[flux@hetchy7:~]$ for i in `seq 1 10`; do /usr/bin/time --format="%e" sh -c 'flux mini run --dry-run -N1 hostname | flux job-validator --plugins=feasibility,jobspec --jobspec-only' ; done
{"errnum": 0}
2.04
{"errnum": 0}
2.65
{"errnum": 0}
3.19
{"errnum": 0}
3.72
{"errnum": 0}
4.24
{"errnum": 0}
9.65
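An observation on the numbers above (an editor's note, not part of the original comment): the "with running jobs" timings grow by a roughly constant ~0.5 s per run, which is consistent with a per-running-job cost in the feasibility check. A quick sketch of the increments:

```python
# Validator wall-clock times (seconds) from the "with running jobs" loop
# above, excluding the final 9.65 s outlier.
times = [2.04, 2.65, 3.19, 3.72, 4.24]

# Consecutive differences: each submission adds roughly half a second.
deltas = [round(b - a, 2) for a, b in zip(times, times[1:])]
print(deltas)  # [0.61, 0.54, 0.53, 0.52]
```

The last sample jumping to 9.65 s suggests the cost is not purely linear once enough jobs are running.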
To the bottom!
Rabbit systems in general will, but at the moment Hetchy doesn't. Fluxion doesn't know anything about the rabbits. So that isn't the culprit. I'll test more when I get back from travel.
Ah, thanks for that information. If there's nothing special about the hetchy resource graph at the moment, then this issue has the potential to affect any system. I'll transfer this issue to flux-sched, because I'm fairly certain the flux-coral2 bits have nothing to do with the problem (I even removed the two jobtap plugins just to test).
A good test may be to try reloading …
I started a test instance with the same R as configured on hetchy and could not reproduce the issue, so the cause here isn't the specific configuration of resources. Not sure how to debug the live system. |
In case it proves useful in recreating this, the current resource state and properties (queue) config is: …
This problem is reproducible by collecting some of the config from hetchy:

[job-manager]
plugins = [
{ load = "perilog.so" },
{ load = "/opt/lib64/flux/job-manager/plugins/cray_pals_port_distributor.so", conf = { port-min = 11000, port-max = 12000 } },
{ load = "/opt/lib64/flux/job-manager/plugins/dws-jobtap.so" }
]
[policy.limits]
duration = "24h"
[queues.windom]
requires = ["windom"]
[queues.bardpeak]
requires = ["bardpeak"]
[resource]
path = "/etc/flux/system/R"
#exclude = "hetchy[7,12]"
norestrict = true
noverify = true
[sched-fluxion-qmanager]
# easy backfill
queue-policy = "easy"
[sched-fluxion-resource]
# node exclusive starting from low node ids
match-policy = "lonodex"
match-format = "rv1_nosched"

system.R:

{
"version": 1,
"execution": {
"R_lite": [
{
"rank": "0",
"children": {
"core": "0-63"
}
},
{
"rank": "1",
"children": {
"core": "0-127"
}
},
{
"rank": "2-16",
"children": {
"core": "0-127"
}
},
{
"rank": "17-18",
"children": {
"core": "0-63",
"gpu": "0-7"
}
}
],
"starttime": 0.0,
"expiration": 0.0,
"nodelist": [
"hetchy[7,12-29]"
],
"properties": {
"windom": "2-16",
"bardpeak": "17-18"
}
}
}

Instructions: …
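As a sanity check on the R document above (an editor's addition, not from the thread), expanding the R_lite idset ranges yields the same totals that `flux resource list` reports later in the thread: 19 nodes, 2240 cores, 16 GPUs.

```python
# execution.R_lite from the system.R shown above.
r_lite = [
    {"rank": "0",     "children": {"core": "0-63"}},
    {"rank": "1",     "children": {"core": "0-127"}},
    {"rank": "2-16",  "children": {"core": "0-127"}},
    {"rank": "17-18", "children": {"core": "0-63", "gpu": "0-7"}},
]

def span(idset):
    """Count the ids in a simple idset like "0-63", "5", or "1,3-4"."""
    n = 0
    for part in idset.split(","):
        lo, _, hi = part.partition("-")
        n += int(hi or lo) - int(lo) + 1
    return n

nodes = sum(span(e["rank"]) for e in r_lite)
cores = sum(span(e["rank"]) * span(e["children"]["core"]) for e in r_lite)
gpus = sum(span(e["rank"]) * span(e["children"]["gpu"])
           for e in r_lite if "gpu" in e["children"])
print(nodes, cores, gpus)  # 19 2240 16
```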
Getting a similar result from …
When I was looking at …

I'm not sure it's relevant to this, but I can imagine the account-priority-update giving fluxion some garbage data when there's no accounting db, so I thought I'd mention it. I deleted the …
I don't think that could be it. The accounting scripts communicate with the …
FYI, I didn't get different results running perf with …
Here's a script that acts as a reproducer, run from a top-level flux-sched builddir:

#!/bin/sh
flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module remove resource
flux kvs put -r resource.R=- </etc/flux/system/R
flux config load < ./conf.toml
flux module load resource noverify
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux queue status
flux resource list
flux module list | grep sched
flux mini submit --queue=windom --cc=1-100 --setattr=exec.test.run_duration=1ms --quiet --watch --progress --jps hostname

conf.toml:

[job-manager]
plugins = [
{ load = "perilog.so" },
{ load = "/opt/lib64/flux/job-manager/plugins/cray_pals_port_distributor.so", conf = { port-min = 11000, port-max = 12000 } },
{ load = "/opt/lib64/flux/job-manager/plugins/dws-jobtap.so" }
]
[policy.limits]
duration = "24h"
[queues.windom]
requires = ["windom"]
[queues.bardpeak]
requires = ["bardpeak"]
[resource]
path = "/etc/flux/system/R"
#exclude = "hetchy[7,12]"
norestrict = true
noverify = true
[sched-fluxion-qmanager]
# easy backfill
queue-policy = "easy"
[sched-fluxion-resource]
# node exclusive starting from low node ids
match-policy = "lonodex"
match-format = "rv1_nosched"
[tbon]
tcp_user_timeout = "2m"

The reproducer can now be run like:

$ FLUX_MODULE_PATH_PREPEND=$(pwd)/resource/modules/.libs flux start -s 1 ./issue#1001.sh
grondo@hetchy12:~/git/flux-sched$ FLUX_MODULE_PATH_PREPEND=$(pwd)/resource/modules/.libs flux start -s 1 ./issue#1001.sh
Failed to open drm root directory /sys/class/drm.: No such file or directory
Failed to open drm root directory /sys/class/drm.: No such file or directory
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
STATE PROPERTIES NNODES NCORES NGPUS NODELIST
free 2 192 0 hetchy[7,12]
free windom 15 1920 0 hetchy[13-27]
free bardpeak 2 128 16 hetchy[28-29]
allocated 0 0 0
down 0 0 0
sched-fluxion-qmanager 6225784 c4a4497 0 R sched
sched-fluxion-resource 36422384 e68a19e 0 R
PD:0 R:0 CD:100 F:0 │███████████████████████████████████│100.0% 0.8 job/s

As suggested by @trws, I set the queue-depth to 2 and re-ran the test, which shows a marked improvement:

$ grep queue-depth conf.toml
queue-params.queue-depth = 2
$ FLUX_MODULE_PATH_PREPEND=$(pwd)/resource/modules/.libs flux start -s 1 ./issue#1001.sh
Failed to open drm root directory /sys/class/drm.: No such file or directory
Failed to open drm root directory /sys/class/drm.: No such file or directory
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
STATE PROPERTIES NNODES NCORES NGPUS NODELIST
free 2 192 0 hetchy[7,12]
free windom 15 1920 0 hetchy[13-27]
free bardpeak 2 128 16 hetchy[28-29]
allocated 0 0 0
down 0 0 0
sched-fluxion-qmanager 6225784 c4a4497 0 R sched
sched-fluxion-resource 36422384 e68a19e 0 R
PD:0 R:0 CD:100 F:0 │███████████████████████████████████│100.0% 28.5 job/s

Note in the …
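For scale (editor's arithmetic, not in the original comment), the queue-depth change took throughput from 0.8 job/s to 28.5 job/s:

```python
baseline = 0.8  # job/s with the default queue-depth
tuned = 28.5    # job/s with queue-params.queue-depth = 2
speedup = tuned / baseline
print(f"{speedup:.1f}x faster")  # roughly 35.6x faster
```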
@trws made some good suggestions about further performance improvements for graph traversal during the last Fluxion hackathon. I tested them with the …
milroy1@docker-desktop:/usr/src$ ./test.sh
flux-module: broker.rmmod sched-fluxion-qmanager: No such file or directory
flux-module: broker.rmmod sched-fluxion-resource: No such file or directory
2023-02-22T22:38:11.782934Z sched-simple.err[0]: exiting due to resource update failure: the resource module was unloaded
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
STATE QUEUE NNODES NCORES NGPUS NODELIST
free 2 192 0 hetchy[7,12]
free windom 15 1920 0 hetchy[13-27]
free bardpeak 2 128 16 hetchy[28-29]
allocated 0 0 0
down 0 0 0
sched-fluxion-resource 37767648 16eaad7 0 R
sched-fluxion-qmanager 8646664 6d1d3e2 0 R sched
PD:0 R:0 CD:100 F:0 │████████████████████████████████████████████████████████|100.0% 2.5 job/s
milroy1@docker-desktop:/usr/src$ ./test.sh
flux-module: broker.rmmod sched-fluxion-qmanager: No such file or directory
flux-module: broker.rmmod sched-fluxion-resource: No such file or directory
2023-02-22T22:34:21.454179Z sched-simple.err[0]: exiting due to resource update failure: the resource module was unloaded
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
STATE QUEUE NNODES NCORES NGPUS NODELIST
free 2 192 0 hetchy[7,12]
free windom 15 1920 0 hetchy[13-27]
free bardpeak 2 128 16 hetchy[28-29]
allocated 0 0 0
down 0 0 0
sched-fluxion-qmanager 8646664 6d1d3e2 0 R sched
sched-fluxion-resource 37739848 88cc2bf 0 R
PD:0 R:0 CD:100 F:0 │████████████████████████████████████████████████████████│100.0% 2.4 job/s
milroy1@docker-desktop:/usr/src$ ./test.sh
flux-module: broker.rmmod sched-fluxion-qmanager: No such file or directory
flux-module: broker.rmmod sched-fluxion-resource: No such file or directory
2023-02-22T23:47:22.846822Z sched-simple.err[0]: exiting due to resource update failure: the resource module was unloaded
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
STATE QUEUE NNODES NCORES NGPUS NODELIST
free 2 192 0 hetchy[7,12]
free windom 15 1920 0 hetchy[13-27]
free bardpeak 2 128 16 hetchy[28-29]
allocated 0 0 0
down 0 0 0
sched-fluxion-qmanager 8646664 6d1d3e2 0 R sched
sched-fluxion-resource 38006288 a02a228 0 R
PD:0 R:0 CD:100 F:0 │████████████████████████████████████████████████████████│100.0% 1.6 job/s
milroy1@docker-desktop:/usr/src$ ./test.sh
flux-module: broker.rmmod sched-fluxion-qmanager: No such file or directory
flux-module: broker.rmmod sched-fluxion-resource: No such file or directory
2023-02-22T23:55:15.292809Z sched-simple.err[0]: exiting due to resource update failure: the resource module was unloaded
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
STATE QUEUE NNODES NCORES NGPUS NODELIST
free 2 192 0 hetchy[7,12]
free windom 15 1920 0 hetchy[13-27]
free bardpeak 2 128 16 hetchy[28-29]
allocated 0 0 0
down 0 0 0
sched-fluxion-resource 37887904 e1cc9e0 0 R
sched-fluxion-qmanager 8646664 6d1d3e2 0 R sched
PD:0 R:0 CD:100 F:0 │████████████████████████████████████████████████████████│100.0% 2.0 job/s
milroy1@docker-desktop:/usr/src$ ./test.sh
flux-module: broker.rmmod sched-fluxion-qmanager: No such file or directory
flux-module: broker.rmmod sched-fluxion-resource: No such file or directory
2023-02-23T00:05:26.933375Z sched-simple.err[0]: exiting due to resource update failure: the resource module was unloaded
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
STATE QUEUE NNODES NCORES NGPUS NODELIST
free 2 192 0 hetchy[7,12]
free windom 15 1920 0 hetchy[13-27]
free bardpeak 2 128 16 hetchy[28-29]
allocated 0 0 0
down 0 0 0
sched-fluxion-qmanager 8646664 6d1d3e2 0 R sched
sched-fluxion-resource 37832352 9ac6420 0 R
PD:0 R:0 CD:100 F:0 │████████████████████████████████████████████████████████│100.0% 1.9 job/s
Wow, almost no impact, or even a negative impact. That's unfortunate, but really good to know. Thanks to help from @grondo, I found some things in the Constraints that might be impacting us. Once that bug is squashed we can take another pass at the performance. I'm guessing we're not getting a win from the unordered maps because of the hashing cost, which we could fix with pre-hashed or interned strings. It might be worth doing the string rework and then circling back to the data types, even if that's more painful. =/
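The interned-strings idea above can be illustrated in miniature (an editor's sketch: Fluxion itself is C++, and the names below are hypothetical, not Fluxion APIs). Interning maps every distinct string to one canonical object, so equality in a hash table degenerates to an identity check and the hash need only be computed once.

```python
import sys

# Intern the resource-type names once, e.g. at graph-construction time.
CORE = sys.intern("core")
GPU = sys.intern("gpu")

# A key built dynamically (so it isn't the same literal object) still
# interns to the identical canonical object:
key = sys.intern("".join(["co", "re"]))
assert key is CORE  # identity check, no character-by-character compare
print(key is CORE)  # True
```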
@milroy Are you running this in a docker container?
Yes. I've also tested the reproducer directly on hetchy. In my tests the reproducer runs faster in the container on my laptop, but hetchy and the container exhibit similar performance characteristics.
OK, so it's not faster in the container than on hetchy for current Fluxion master (in fact, it's faster on hetchy), but really close:

[milroy1@hetchy12:flux-sched]$ ./tests.sh
Failed to open drm root directory /sys/class/drm.: No such file or directory
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
STATE PROPERTIES NNODES NCORES NGPUS NODELIST
free 2 192 0 hetchy[7,12]
free windom 15 1920 0 hetchy[13-27]
free bardpeak 2 128 16 hetchy[28-29]
allocated 0 0 0
down 0 0 0
sched-fluxion-resource 36405416 db97e26 0 R
sched-fluxion-qmanager 8496408 3022bef 0 R sched
PD:0 R:0 CD:100 F:0 │████████████████████████████████████████████████████████│100.0% 2.6 job/s

Case 3 above on hetchy:

[milroy1@hetchy12:flux-sched]$ ./tests.sh
Failed to open drm root directory /sys/class/drm.: No such file or directory
bardpeak: Scheduling is started
windom: Scheduling is started
bardpeak: Job submission is enabled
bardpeak: Scheduling is started
windom: Job submission is enabled
windom: Scheduling is started
STATE PROPERTIES NNODES NCORES NGPUS NODELIST
free 2 192 0 hetchy[7,12]
free windom 15 1920 0 hetchy[13-27]
free bardpeak 2 128 16 hetchy[28-29]
allocated 0 0 0
down 0 0 0
sched-fluxion-qmanager 8496408 3022bef 0 R sched
sched-fluxion-resource 37031616 df9e2f5 0 R
PD:0 R:0 CD:100 F:0 │████████████████████████████████████████████████████████│100.0% 1.8 job/s
@jameshcorbett can we close this issue? PR #1007 fixes the job slowdown, but it doesn't comprehensively solve Fluxion performance problems. I can create a new issue to continue the investigation into general Fluxion performance.
I think this issue still applies, since Hetchy (and all our systems) are using the node-exclusive policy? We're still at a 10-20x slowdown. It might be nice to keep all the history in one issue; however, if you'd really like to create a new issue, that's fine with me.
Out of curiosity, has anyone run a perf test on the versions before we noticed this, to see whether we had a regression on lonodex or whether it's something that just showed up? I know the hetchy config and queues are part of it, but if we had low-but-predictable performance with lonodex before, we might not have seen it.
Is the submission of sequential jobs still slowing down on Hetchy? That is, is the example @jameshcorbett gave in the first comment, where submission times increase from <1s to 24s, still occurring with …
Note that, as shown by the results in #1009, this performance issue also occurs with or without node-exclusive scheduling when moderate amounts of resources are involved in scheduling (in the examples, 2000 nodes).
This should be addressed at this point. @jameshcorbett, does this still repro for you in any cases?
Nope, closing. |