test impact of using rv1 vs rv1_nosched on instance performance #1009
Here's a first attempt at a parameter study that investigates different job sizes and counts with each match format. The suite of tests was launched on corona via the following script:

A couple of the parameter combinations caused an issue that may need to be investigated. The nodes/job=1000 cases for njobs=1000, 2000, and 8000 all failed due to running out of space. Other instances reached a maximum RSS of ~2G, so I'm not yet sure what the issue was (more investigation needed!). Perhaps we are caching R in memory somewhere with the full scheduling key intact. Note in the data below that a 1000-node exclusive R takes 17MiB - that is one R object 😲.

#!/bin/bash
MATCH_FORMAT=${MATCH_FORMAT:-rv1}
NJOBS=${NJOBS:-100}
NNODES=${NNODES:-16}
printf "MATCH_FORMAT=${MATCH_FORMAT} NJOBS=$NJOBS NODES/JOB=$NNODES\n"
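
# Reload the scheduler stack so the configuration below (including
# match-format) takes effect from a clean slate.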
flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module remove resource
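
# Define fluxion settings and a fake 2000-node resource set purely in
# config; noverify means it will not be checked against real hardware.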
flux config load <<EOF
[sched-fluxion-qmanager]
queue-policy = "easy"
[sched-fluxion-resource]
match-format = "$MATCH_FORMAT"
[queues.debug]
requires = ["debug"]
[queues.batch]
requires = ["batch"]
[resource]
noverify = true
norestrict = true
[[resource.config]]
hosts = "test[0-1999]"
cores = "0-47"
gpus = "0-8"
[[resource.config]]
hosts = "test[0-1899]"
properties = ["batch"]
[[resource.config]]
hosts = "test[1900-1999]"
properties = ["debug"]
EOF
flux config get | jq '."sched-fluxion-resource"'
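
# Load the (fake) resource set with verification disabled and all ranks
# forced up, then load the fluxion scheduler modules.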
flux module load resource noverify monitor-force-up
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux queue start --all --quiet
flux resource list
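
# Time NJOBS jobs of NNODES nodes each; exec.test.run_duration routes jobs
# through mock execution, so nothing actually runs on the fake nodes.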
t0=$(date +%s.%N)
flux submit -N$NNODES --queue=batch --cc=1-$NJOBS \
--setattr=exec.test.run_duration=1ms \
--quiet --wait hostname
ELAPSED=$(echo $(date +%s.%N) - $t0 | bc -l)
THROUGHPUT=$(echo $NJOBS/$ELAPSED | bc -l)
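
# Size of R for the last job, plus content store object count and
# sqlite database file size.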
R_SIZE=$(flux job info $(flux job last) R | wc -c)
OBJ_COUNT=$(flux module stats content-sqlite | jq .object_count)
DB_SIZE=$(flux module stats content-sqlite | jq .dbfile_size)
printf "%-12s %5d %4d %8.2f %8.2f %12d %12d %12d\n" \
$MATCH_FORMAT $NJOBS $NNODES $ELAPSED $THROUGHPUT \
$R_SIZE $OBJ_COUNT $DB_SIZE
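
The exact commands used to drive the suite aren't captured above. A hypothetical driver loop (the script name and the particular parameter values here are assumptions, not from the original runs) might look like:

for fmt in rv1 rv1_nosched; do
  for nnodes in 1 16 100 1000; do
    for njobs in 1000 2000 8000; do
      MATCH_FORMAT=$fmt NNODES=$nnodes NJOBS=$njobs \
        flux start -s 1 ./rv1-test.sh
    done
  done
done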
Here are the initial results (the table itself did not survive; the columns reported by the script are match format, njobs, nodes/job, elapsed time, throughput, R size, object count, and db size).
Ran similar tests without node-exclusive matching; the only difference was a single change to the setup above.
I was going to try this same set of experiments, and @garlick and I were wondering after the fact how this happens to work with Fluxion at all. It turns out that when Fluxion marks "all" resources down, it uses the instance size to decide which ranks to mark (flux-sched/resource/modules/resource_match.cpp, lines 1114 to 1132 at 32f74d6).
That sounds like a bug, if ironically a useful one for testing this.
Agree. Issue opened: #1040
Ok, figured out the core issue.
FYI - I edited the test script above.
@grondo this might be a separate issue, but would it be possible to mock the state of nodes too? E.g., that some subset in the list is DOWN? The context here is bursting - we want to mock nodes that don't exist as DOWN and then provide some actually existing nodes (so hopefully we can accomplish something similar without the entire thing being a mock!)
The default state of nodes is down. Is there a situation where they need to be forced down after having been mocked up or actually up (like shrinking back down)? Anyway, this is not a scheduler issue per se, so I'd suggest opening a flux-core issue (if there is an issue).
If the default state is down, how do the jobs in these examples fake-run? Where is the logic that allows them to do that?
This option: the exec.test.run_duration attribute set via --setattr in the test script above.
Yes! I derived that from here: https://github.com/flux-framework/flux-core/blob/49c59b57bb99852831745cd4cc1052eb56194365/src/modules/job-exec/testexec.c#L68 but I don't understand how it actually works - how it allows the job to run (and be scheduled on nodes that don't actually exist) rather than just determining that the resources are not available. I think maybe I'm asking about a level of detail deeper than that attribute?
The default state of nodes in the scheduler is supposed to be down until the resource module reports them up. To really force resources up, the resource module's monitor-force-up option is what does it (assuming that is what you were actually asking about?).
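
The relevant invocation from the test script above is:

flux module load resource noverify monitor-force-up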
Ah, so the above wasn't supposed to happen - without the bug the resources would remain down, is that correct? And the reason it was happening is the resource_match.cpp code referenced above. Ok, so assuming we set monitor-force-up, how does the scheduler learn that the resources are up?
The resource.acquire protocol is described in RFC 28. Basically, the scheduler starts up and asks resource "what resources do I have?" Resource says: here's a pile of nodes/cores/whatever, and none of them are up. Oh, now two are up. Oh, now four are up (or all are up). The scheduler is simultaneously receiving "alloc" requests from the job manager asking for a resource allocation for pending jobs. So the scheduler's job is to decide which resources to allocate to the jobs requesting them, and it should only allocate the resources that are up, of course. Does that help?
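
A quick way to watch this from the shell (not part of the original exchange; output details vary by flux-core version):

flux resource list     # scheduler's view: free / allocated / down resources
flux resource status   # administrative view: drained, excluded, offline nodes, etc.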
So in this case, all the scheduler knows is that the lead broker is up, and the lead broker is said to have all of the resources of the fake nodes - this is the part about the resource spec being able to come from a config file, which is what we did with the [[resource.config]] entries above.
So this response is just going to reflect what we put in that broker config, and we don't verify any of it (that would normally be done with hwloc?) because we added noverify.
And then, because we are in this mock mode, there isn't an actual job run: it just schedules (assigns the job to some fake nodes), waits for the run duration, and then calls it completed? So does the command even matter?
I think you got it! The actual command shouldn't matter. If you mix real and fake resources, the scheduler doesn't know which is which, so it'll be fine if you are mocking execution, and sometimes fine and sometimes not if you aren't.
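
For the bursting case, a sketch of what mixing a few real hosts with fake ones might look like (the hostnames here are made up, not from this thread):

flux config load <<EOF
[resource]
noverify = true
[[resource.config]]
hosts = "real[0-3],fake[0-99]"
cores = "0-47"
EOF

As in the test script above, the resource and fluxion modules would then need to be reloaded for the new config to take effect.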
I'll try this next! Thanks for answering my questions!
This thread, and this question, are probably good ones to chat about with @tpatki, @JaeseungYeom, and maybe @milroy too. We've talked about needing to work up a newer-generation simulator for some of the Fractale work, if that happens, and this would fall under that pretty neatly.
OK. I feel like we should take it out of this issue though as the original topic is pretty important and the results presented above are significant and deserve some attention. Maybe open a flux-core issue after discussing requirements? Could be a team meeting topic if needed?
To that point, I'm trying to repro some of this. Just to be sure: you got a lot of job-manager/jobtap errors too, right @grondo? That's what I'm getting even for a small run.
No, I don't see any of those jobtap errors in my runs. Looks like you are inside a docker container? Let me try in that environment (I was running on corona).
@trws - I can't reproduce the errors above in the latest flux-sched docker container.
Thanks @grondo. I'm trying to get an environment where I can get a decent perf trace anyway, so I'm building one up in a long-term VM (the containers are proving a bit of a problem for getting perf to work at all). Hopefully that will just take care of it; we'll see.
@trws if you just need a small setup, the flux operator with kind works nicely (what I've been using to mess around).
The trick is the mismatch between the container package versions and the kernel versions I have handy, and the general difficulty of doing kernel event tracing in a container. Much as it's a bit of one-off work, doing the tracing this way will save me time in the long run.
Ok, let me know if you still see any issues.
A from-scratch rebuild on an aarch64 Debian bookworm VM with current versions of packages got rid of all the errors. Makes me want to know where they came from, but this is a much better place to start. I do get a nasty warning-turned-error out of Boost Graph because of a new gcc-12 warning that's getting triggered despite the header being in a system location; not sure how that's getting past the usual suppression of warnings from system headers.
In a planning meeting, the idea of running with the rv1 match format enabled in production was discussed as a stopgap solution for #991. However, the performance or other impact of that change was not known. We should characterize any impact of this configuration so we can make decisions based on results.