Partial cancel not releasing rabbit resources (?) #1284
I reloaded the resource and fluxion modules and scheduling went back to working as expected at first, but then as I ran jobs they eventually became stuck in SCHED.
The issue seems to have been introduced between 0.36.1 and 0.37.0.
I suspect this line is reached with ..., in which case an additional check followed by goto done; is probably warranted. Do you have a reproducer for this issue?
I can reproduce in the flux-coral2 environment locally or on LC clusters, but there are a bunch of plugins loaded. The simplest thing I have is the following, I think. It reproduces with the following R+JGF (it may work more easily if you rename your docker container to have the hostname ...) and jobspecs like:

version: 9999
resources:
  - type: slot
    count: 1
    label: default
    with:
      - type: ssd
        count: 1
        exclusive: true
      - type: node
        count: 1
        exclusive: true
        with:
          - type: slot
            label: task
            count: 1
            with:
              - type: core
                count: 1
# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: task
    count:
      per_slot: 1

However I don't know how to get it to ignore the ...
Is this coming from the ...?
Problem: issue flux-framework#1284 identified a scenario where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the mod_plan function when an allocation is found and mod_data.mod_type == job_modify_t::PARTIAL_CANCEL. Add a goto statement to return 0 under this circumstance.
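For illustration, here is a minimal, compilable sketch of the kind of early-exit check this commit message describes. The names job_modify_t, PARTIAL_CANCEL, mod_plan, and the goto done pattern come from the commit message; the surrounding types, arguments, and control flow are assumptions made so the example stands alone, not the actual flux-sched traverser code.

```cpp
// Compilable sketch of the early-exit check described above. Everything
// except the names taken from the commit message is an assumption.
#include <iostream>

enum class job_modify_t { CANCEL, PARTIAL_CANCEL };

struct modify_data_t {
    job_modify_t mod_type = job_modify_t::CANCEL;
};

// Stand-in for the traverser's mod_plan(): if an allocation already exists
// and this is a partial cancel, report success and skip the rest of the
// planner updates instead of falling through to an error path.
static int mod_plan (bool allocation_found, const modify_data_t &mod_data)
{
    int rc = -1;

    if (allocation_found
        && mod_data.mod_type == job_modify_t::PARTIAL_CANCEL) {
        rc = 0;
        goto done;  // the proposed "goto done;" early exit
    }

    // ... remaining planner/span bookkeeping would go here ...
    rc = 0;

done:
    return rc;
}

int main ()
{
    modify_data_t mod_data;
    mod_data.mod_type = job_modify_t::PARTIAL_CANCEL;
    std::cout << "rc=" << mod_plan (true, mod_data) << std::endl;  // rc=0
    return 0;
}
```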
Problem: flux-framework/flux-sched/issues/1284 came up in production and was not caught beforehand because the testsuite never exhausts the available rabbit resources. Add a test that runs three back-to-back 10TiB rabbit jobs to exhaust all the rabbit resources.
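As a rough illustration, a sharness-style sketch of the kind of test this describes: rabbit jobs run back to back so that a leaked rabbit allocation would leave a later job stuck in SCHED. The jobspec filename rabbit-10tib-jobspec.yaml is a hypothetical placeholder, and the real test in the flux-coral2 testsuite will differ in detail.

```sh
# Sketch only: assumes a hypothetical rabbit-10tib-jobspec.yaml requesting
# 10TiB of rabbit storage, so that consecutive jobs exhaust the rabbits if
# allocations are not released.
test_expect_success 'back-to-back 10TiB rabbit jobs all run to completion' '
	job1=$(flux job submit --flags=waitable rabbit-10tib-jobspec.yaml) &&
	flux job wait ${job1} &&
	job2=$(flux job submit --flags=waitable rabbit-10tib-jobspec.yaml) &&
	flux job wait ${job2} &&
	job3=$(flux job submit --flags=waitable rabbit-10tib-jobspec.yaml) &&
	flux job wait ${job3}
'
```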
After our discussion with Tom, I'm almost certain this issue is related to the ssds not being mapped to a broker rank.
Are the rank values in this JGF representative of a production cluster (i.e., each rank is ...)?

Here's my first crack at a reproducer (...):

#!/bin/bash
#
# Ensure fluxion cancels ssds as expected
#
log() { printf "issue#1284: $@\n" >&2; }
TEST_SIZE=2
log "Unloading modules..."
flux module remove sched-simple
flux module remove resource
flux config load <<EOF
[resource]
noverify = true
norestrict = true
path="${SHARNESS_TEST_SRCDIR}/R.json"
EOF
flux module load resource monitor-force-up
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux module list
flux module remove job-list
flux queue start --all --quiet
flux resource list
flux resource status
log "Running test jobs"
flux submit --flags=waitable \
--setattr=exec.test.run_duration=0.01 \
${SHARNESS_TEST_SRCDIR}/ssd-jobspec.yaml
flux submit --flags=waitable \
--setattr=exec.test.run_duration=0.01 \
${SHARNESS_TEST_SRCDIR}/ssd-jobspec.yaml
flux job wait -av
flux submit --flags=waitable \
--setattr=exec.test.run_duration=0.01 \
${SHARNESS_TEST_SRCDIR}/ssd-jobspec.yaml
flux jobs -a

With ... I see:

Sep 12 01:21:56.737938 UTC sched-fluxion-resource.err[0]: grow_resource_db_jgf: unpack_parent_job_resources: Invalid argument
Sep 12 01:21:56.737943 UTC sched-fluxion-resource.err[0]: update_resource_db: grow_resource_db: Invalid argument
Sep 12 01:21:56.737945 UTC sched-fluxion-resource.err[0]: update_resource: update_resource_db: Invalid argument
Sep 12 01:21:56.737990 UTC sched-fluxion-resource.err[0]: populate_resource_db_acquire: update_resource: Invalid argument
Sep 12 01:21:56.737991 UTC sched-fluxion-resource.err[0]: populate_resource_db: loading resources using resource.acquire
Sep 12 01:21:56.737996 UTC sched-fluxion-resource.err[0]: init_resource_graph: can't populate graph resource database
Sep 12 01:21:56.737997 UTC sched-fluxion-resource.err[0]: mod_main: can't initialize resource graph database

Also, it seems like the first test should be to allocate the whole cluster and then see if another job can be scheduled after the resources are released. I think this is the jobspec we want:

version: 9999
resources:
  - type: slot
    count: 1
    label: default
    with:
      - type: rack
        count: 1
        with:
          - type: ssd
            count: 36
          - type: node
            count: 1
            with:
              - type: core
                count: 10
# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "hostname" ]
    slot: task
    count:
      per_slot: 1

Is this the right approach? Any ideas what may be wrong with ...?
Problem: there are no tests for issue flux-framework#1284. Add one.
@milroy I have a branch in my fork that reproduces the issue: https://github.com/jameshcorbett/flux-sched/tree/issue-1284

Interestingly, while fooling around with it I noticed that the issue only comes up if the jobspec has a top-level "slot". If instead it has "ssd" and "node" at the top level, it didn't seem to have the same problem.
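For reference, a sketch of the "no top-level slot" variant being described here, obtained by rearranging the jobspec from earlier in the thread so that ssd and node sit directly under resources. This is only an illustration of the shape; it is not the exact jobspec from the linked branch.

```yaml
# Same resources as the earlier jobspec, but with ssd and node at the top
# level instead of nested under a top-level "slot" (illustrative only).
version: 9999
resources:
  - type: ssd
    count: 1
    exclusive: true
  - type: node
    count: 1
    exclusive: true
    with:
      - type: slot
        label: task
        count: 1
        with:
          - type: core
            count: 1
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: task
    count:
      per_slot: 1
```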
Ok @milroy, I think I have an improved reproducer at https://github.com/jameshcorbett/flux-sched/tree/issue-1284
With this patch that @trws and I talked about, applied onto the branch linked above, I see errors like: ...
@jameshcorbett I cloned your fork and ran ... The logs indicate that the test is using ... However, enabling the jgf-based partial cancel unveils more problems. The default match format used by the test is not compatible with jgf-based partial cancel (it appears to be ...).
I'll continue investigating and will report back as I find out more.
Update: tests 5101 and 5102 succeed if I specify the ...:

test_expect_success 'an ssd jobspec can be allocated' '
flux module remove sched-simple &&
flux module remove resource &&
flux config load <<EOF &&
[resource]
noverify = true
norestrict = true
path="${SHARNESS_TEST_SRCDIR}/R.json"
[sched-fluxion-resource]
prune-filters = "cluster:ssd,rack:ssd"
EOF
<...>

It appears that allowing ...
OK, interesting. I will have to try it out. I just realized that we're switching the EAS clusters to the ...
I was a bit pessimistic in the previous comment. I figured out a way to support the external rank partial cancellation with ...
Problem: there are no tests for issue flux-framework#1284. Add two tests: one with a jobspec that uses a top-level 'slot' and one that does not.
Problem: the current tests for flux-framework#1284 do not check to ensure partial cancel behaves as desired with ssd pruning filters. Add tests with the ssd pruning filters at all ancestor graph vertices.
Snipped results of flux dmesg on hetchy: ...

Also, I think I observed that rabbit resources are not released by the scheduler when jobs complete. For instance, I ran a one-node rabbit job, and then tried to submit another one, only for it to become stuck in SCHED.
Any thoughts on what might be going on, @milroy?