
Partial cancel not releasing rabbit resources (?) #1284

Closed
jameshcorbett opened this issue Aug 27, 2024 · 15 comments · Fixed by #1292

Comments

@jameshcorbett
Member

jameshcorbett commented Aug 27, 2024

Snipped results of flux dmesg on hetchy:

2024-08-27T01:46:53.149076Z sched-fluxion-resource.err[0]: run_remove: dfu_traverser_t::remove (id=152883667495027712): mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149167Z sched-fluxion-resource.err[0]: ssd0.
2024-08-27T01:46:53.149175Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149181Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149186Z sched-fluxion-resource.err[0]: ssd1.
2024-08-27T01:46:53.149190Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149194Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149199Z sched-fluxion-resource.err[0]: ssd2.
2024-08-27T01:46:53.149204Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149208Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149218Z sched-fluxion-resource.err[0]: ssd3.
2024-08-27T01:46:53.149225Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149234Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149244Z sched-fluxion-resource.err[0]: ssd4.
2024-08-27T01:46:53.149251Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149257Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149267Z sched-fluxion-resource.err[0]: ssd5.
2024-08-27T01:46:53.149279Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149287Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149295Z sched-fluxion-resource.err[0]: ssd7.
2024-08-27T01:46:53.149303Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149309Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149316Z sched-fluxion-resource.err[0]: ssd6.
2024-08-27T01:46:53.149324Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149333Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149342Z sched-fluxion-resource.err[0]: ssd8.
2024-08-27T01:46:53.149349Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149355Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149363Z sched-fluxion-resource.err[0]: ssd9.
2024-08-27T01:46:53.149369Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149377Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149385Z sched-fluxion-resource.err[0]: ssd10.
2024-08-27T01:46:53.149391Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149397Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149405Z sched-fluxion-resource.err[0]: ssd11.
2024-08-27T01:46:53.149415Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149421Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149429Z sched-fluxion-resource.err[0]: ssd12.
2024-08-27T01:46:53.149436Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149443Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149451Z sched-fluxion-resource.err[0]: ssd13.
2024-08-27T01:46:53.149457Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149464Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149472Z sched-fluxion-resource.err[0]: ssd14.
2024-08-27T01:46:53.149480Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149487Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149495Z sched-fluxion-resource.err[0]: ssd15.
2024-08-27T01:46:53.149502Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149508Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149516Z sched-fluxion-resource.err[0]: ssd16.
2024-08-27T01:46:53.149522Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149528Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span aft
2024-08-27T01:46:53.149544Z sched-fluxion-resource.err[0]: partial_cancel_request_cb: remove fails due to match error (id=152883667495027712): Success
2024-08-27T01:46:53.150544Z sched-fluxion-qmanager.err[0]: remove: .free RPC partial cancel failed for jobid 152883667495027712: Invalid argument
2024-08-27T01:46:53.150564Z sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=parrypeak id=152883667495027712): Invalid argument
2024-08-27T01:50:11.281162Z sched-fluxion-qmanager.debug[0]: feasibility_request_cb: feasibility succeeded
2024-08-27T01:50:52.232045Z sched-fluxion-qmanager.debug[0]: feasibility_request_cb: feasibility succeeded

Also I think I observed that rabbit resources are not released by the scheduler when jobs complete. For instance, I ran a one-node rabbit job, and then tried to submit another one only for it to become stuck in SCHED.

Any thoughts on what might be going on, @milroy?

@jameshcorbett
Member Author

I reloaded the resource and fluxion modules, and at first scheduling went back to working as expected, but as I ran more jobs they eventually became stuck in SCHED.

[  +2.103872] job-manager[0]: scheduler: hello
[  +2.104034] job-manager[0]: scheduler: ready unlimited
[  +2.104099] sched-fluxion-qmanager[0]: handshaking with job-manager completed
[Aug27 12:59] sched-fluxion-resource[0]: find_request_cb: find succeeded
[ +15.464330] sched-fluxion-resource[0]: find_request_cb: find succeeded
[Aug27 13:00] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[Aug27 13:05] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[  +4.494890] job-manager[0]: housekeeping: fMjSa3tyHyZ started
[Aug27 13:07] job-manager[0]: housekeeping: fMjUstxpcZm started
[ +31.506371] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[Aug27 13:09] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[ +18.366731] job-manager[0]: housekeeping: fMjVuq84egF started
[Aug27 13:12] job-manager[0]: housekeeping: fMjSa3tyHyZ complete
[  +0.000729] sched-fluxion-resource[0]: run_remove: dfu_traverser_t::remove (id=153988281996936192): mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
[  +0.000771] sched-fluxion-resource[0]: ssd0.
[  +0.000777] sched-fluxion-resource[0]: Success.
[  +0.000787] sched-fluxion-resource[0]: partial_cancel_request_cb: remove fails due to match error (id=153988281996936192): Success
[  +0.000971] sched-fluxion-qmanager[0]: remove: .free RPC partial cancel failed for jobid 153988281996936192: Invalid argument
[  +0.000994] sched-fluxion-qmanager[0]: jobmanager_free_cb: remove (queue=parrypeak id=153988281996936192): Invalid argument

@jameshcorbett
Member Author

The issue seems to have been introduced between 0.36.1 and 0.37.0.

@milroy
Member

milroy commented Sep 4, 2024

I suspect this line is reached with mod_type == job_modify_t::PARTIAL_CANCEL:


in which case an additional check followed by goto done; is probably warranted. Do you have a reproducer for this issue?
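
A minimal, self-contained sketch of the control flow I have in mind (the names mod_data, allocations, and mod_plan_sketch are stand-ins for illustration, not the actual traverser code):

// Sketch (NOT flux-sched code) of the proposed control flow: if an
// allocation is still present and the modification is a PARTIAL_CANCEL,
// skip the schedule/span removal and return success.
#include <cstdint>
#include <map>

enum class job_modify_t { CANCEL, PARTIAL_CANCEL };

struct mod_data_t {
    job_modify_t mod_type;
};

// 'allocations' stands in for the per-vertex allocation table; all names
// here are hypothetical.
static int mod_plan_sketch (const std::map<uint64_t, int64_t> &allocations,
                            uint64_t jobid,
                            const mod_data_t &mod_data)
{
    int rc = -1;

    if (allocations.find (jobid) != allocations.end ()
        && mod_data.mod_type == job_modify_t::PARTIAL_CANCEL) {
        // An allocation remaining after vtx_cancel is expected during a
        // partial cancel, so it should not be treated as an error.
        rc = 0;
        goto done;
    }
    // ... schedule and span removal would happen here in the real code ...
    rc = 0;
done:
    return rc;
}

int main ()
{
    std::map<uint64_t, int64_t> allocations = {{42, 1}};
    mod_data_t md{job_modify_t::PARTIAL_CANCEL};
    return mod_plan_sketch (allocations, 42, md);  // exits 0
}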

@jameshcorbett
Member Author

I can reproduce it in the flux-coral2 environment locally or on LC clusters, but there are a bunch of plugins loaded. I think the simplest reproducer I have is the following.

It reproduces on the following R+JGF:
R.json

(it may work more easily if you rename your docker container to have the hostname compute-01)

with jobspecs like:

version: 9999
resources:
  - type: slot
    count: 1
    label: default
    with:
    - type: ssd
      count: 1
      exclusive: true
    - type: node
      count: 1
      exclusive: true
      with:
      - type: slot
        label: task
        count: 1
        with:
        - type: core
          count: 1
# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: task
    count:
      per_slot: 1

However, I don't know how to get it to ignore the parse_jobspec: job ƒ2kTi2FyZ invalid jobspec; Unsupported resource type 'ssd' errors. In the flux-coral2 environment this isn't an issue because the jobspec is modified after submission.

@grondo
Contributor

grondo commented Sep 4, 2024

However I don't know how to get it to ignore the parse_jobspec: job ƒ2kTi2FyZ invalid jobspec; Unsupported resource type 'ssd'

Is this coming from the job-list module? If so, you can probably safely ignore it, or just unload job-list.

milroy added a commit to milroy/flux-sched that referenced this issue Sep 5, 2024
Problem: issue flux-framework#1284 identified a scenario where rabbits are not
released due to a traverser error during partial cancellation. The
traverser should skip the rest of the mod_plan function when an
allocation is found and mod_data.mod_type ==
job_modify_t::PARTIAL_CANCEL.

Add a goto statement to return 0 under this circumstance.
jameshcorbett added a commit to jameshcorbett/flux-coral2 that referenced this issue Sep 5, 2024
Problem: flux-framework/flux-sched/issues/1284 came up in production
and was not caught beforehand because the testsuite never exhausts
the available rabbit resources.

Add a test that runs three back-to-back 10TiB rabbit jobs to exhaust
all the rabbit resources.
@milroy
Member

milroy commented Sep 12, 2024

After our discussion with Tom, I'm almost certain this issue is related to the ssds not being mapped to a broker rank.

It reproduces on the following R+JGF:
R.json

Are the rank values in this JGF representative of a production cluster (i.e., each rank is -1)?
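
For reference, the kind of JGF vertex metadata I'm asking about looks roughly like this (field names from memory of the JGF reader; illustrative only, not copied from your R.json):

{
  "id": "42",
  "metadata": {
    "type": "ssd",
    "basename": "ssd",
    "name": "ssd0",
    "rank": -1,
    "exclusive": true,
    "size": 1,
    "paths": { "containment": "/cluster0/rack0/ssd0" }
  }
}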

Here's my first crack at a reproducer (t/issues/t1284-cancel-ssds.sh):

#!/bin/bash
#
#  Ensure fluxion cancels ssds as expected
#

log() { printf "issue#1284: %s\n" "$*" >&2; }

TEST_SIZE=2

log "Unloading modules..."
flux module remove sched-simple
flux module remove resource

flux config load <<EOF
[resource]
noverify = true
norestrict = true
path="${SHARNESS_TEST_SRCDIR}/R.json"
EOF

flux module load resource monitor-force-up
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux module list
flux module remove job-list 
flux queue start --all --quiet
flux resource list
flux resource status

log "Running test jobs"
flux submit --flags=waitable \
		--setattr=exec.test.run_duration=0.01 \
		${SHARNESS_TEST_SRCDIR}/ssd-jobspec.yaml
flux submit --flags=waitable \
		--setattr=exec.test.run_duration=0.01 \
		${SHARNESS_TEST_SRCDIR}/ssd-jobspec.yaml

flux job wait -av

flux submit --flags=waitable \
		--setattr=exec.test.run_duration=0.01 \
		${SHARNESS_TEST_SRCDIR}/ssd-jobspec.yaml

flux jobs -a

With R.json set to the file you provided, @jameshcorbett, I'm getting an error when trying to initialize the resource graph from R.json:

Sep 12 01:21:56.737938 UTC sched-fluxion-resource.err[0]: grow_resource_db_jgf: unpack_parent_job_resources: Invalid argument
Sep 12 01:21:56.737943 UTC sched-fluxion-resource.err[0]: update_resource_db: grow_resource_db: Invalid argument
Sep 12 01:21:56.737945 UTC sched-fluxion-resource.err[0]: update_resource: update_resource_db: Invalid argument
Sep 12 01:21:56.737990 UTC sched-fluxion-resource.err[0]: populate_resource_db_acquire: update_resource: Invalid argument
Sep 12 01:21:56.737991 UTC sched-fluxion-resource.err[0]: populate_resource_db: loading resources using resource.acquire
Sep 12 01:21:56.737996 UTC sched-fluxion-resource.err[0]: init_resource_graph: can't populate graph resource database
Sep 12 01:21:56.737997 UTC sched-fluxion-resource.err[0]: mod_main: can't initialize resource graph database

Also, it seems like the first test should be to allocate the whole cluster and then see if another job can be scheduled after the resources are released. I think this is the jobspec we want:

version: 9999
resources:
  - type: slot
    count: 1
    label: default
    with:
    - type: rack
      count: 1
      with:
      - type: ssd
        count: 36
      - type: node
        count: 1
        with:
        - type: core
          count: 10
# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "hostname" ]
    slot: default
    count:
      per_slot: 1

Is this the right approach? Any ideas what may be wrong with R.json?

jameshcorbett added a commit to jameshcorbett/flux-sched that referenced this issue Sep 16, 2024
Problem: there are no tests for issue flux-framework#1284.

Add one.
@jameshcorbett
Member Author

@milroy I have a branch in my fork that repros the issue https://github.com/jameshcorbett/flux-sched/tree/issue-1284

Interestingly, while fooling around with it I noticed that the issue only comes up if the jobspec has a top-level "slot". If instead it has "ssd" and "node" at the top level, it doesn't seem to have the same problem.
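
For example, something along these lines, with ssd and node as top-level resources instead of being nested under a slot (a sketch of the shape I mean, not the exact jobspec I ran):

version: 9999
resources:
  - type: ssd
    count: 1
    exclusive: true
  - type: node
    count: 1
    exclusive: true
    with:
    - type: slot
      label: task
      count: 1
      with:
      - type: core
        count: 1
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: task
    count:
      per_slot: 1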

@jameshcorbett
Member Author

Ok @milroy I think I have an improved reproducer at https://github.com/jameshcorbett/flux-sched/tree/issue-1284

@jameshcorbett
Member Author

With this patch that @trws and I talked about

diff --git a/qmanager/policies/base/queue_policy_base.hpp b/qmanager/policies/base/queue_policy_base.hpp
index 6fa2e44d..e9fd1166 100644
--- a/qmanager/policies/base/queue_policy_base.hpp
+++ b/qmanager/policies/base/queue_policy_base.hpp
@@ -666,7 +666,7 @@ class queue_policy_base_t : public resource_model::queue_adapter_base_t {
                     // during cancel
                     auto job_sp = job_it->second;
                     m_jobs.erase (job_it);
-                    if (final && !full_removal) {
+                    if (true) {
                         // This error condition indicates a discrepancy between core and sched.
                         flux_log_error (flux_h,
                                         "%s: Final .free RPC failed to remove all resources for "

applied to the branch linked above, I see errors like:

Oct 08 02:39:39.086386 UTC sched-fluxion-qmanager.err[0]: remove: Final .free RPC failed to remove all resources for jobid 278065577984: Success
Oct 08 02:39:39.086651 UTC sched-fluxion-resource.debug[0]: cancel_request_cb: nonexistent job (id=278065577984)
Oct 08 02:39:39.086839 UTC sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=default id=278065577984): Protocol error

@milroy
Member

milroy commented Oct 14, 2024

@jameshcorbett I cloned your fork and ran t5101-issue-1284.t. Several things are going on.

The logs indicate that the test is using the JGF reader format to read R.json, but then it calls the rv1exec reader for cancel. That's because partial cancel is hard-coded to use only rv1exec, since that was the only reader needed in production when I implemented rank-based partial cancel. I should have remembered that.

However, enabling JGF-based partial cancel reveals more problems. The default match format used by the test (it appears to be rv1_nosched) is not compatible with JGF-based partial cancel. Unfortunately, specifying a match format compatible with JGF partial cancel results in yet another error upon graph initialization:

flux-config: error converting TOML to JSON: Invalid argument

I'll continue investigating and will report back as I find out more.
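
For context, the sort of configuration I was experimenting with looks roughly like this (illustrative only; not the exact TOML that produced the error above, and the match-format value shown is an assumption):

[sched-fluxion-resource]
match-format = "rv1"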

@milroy
Member

milroy commented Oct 14, 2024

Update: tests 5101 and 5102 succeed if I specify the ssd pruning filter as follows:

test_expect_success 'an ssd jobspec can be allocated' '
	flux module remove sched-simple &&
	flux module remove resource &&
	flux config load <<EOF &&
[resource]
noverify = true
norestrict = true
path="${SHARNESS_TEST_SRCDIR}/R.json"
[sched-fluxion-resource]
prune-filters = "cluster:ssd,rack:ssd"
EOF
<...>

It appears that allowing core pruning filters in the graph causes cancellations to fail.

@jameshcorbett
Member Author

OK, interesting. I will have to try it out. I just realized that we're switching the EAS clusters from rv1_nosched to the rv1 match format. Is partial cancel going to break, or are we going to have other issues?

@milroy
Member

milroy commented Oct 17, 2024

I was a bit pessimistic in the previous comment. I figured out a way to support external, rank-based partial cancellation with the rv1_nosched/rv1exec formats and linked PR #1292 to close this issue.

jameshcorbett added a commit to jameshcorbett/flux-sched that referenced this issue Oct 17, 2024
Problem: there are no tests for issue flux-framework#1284.

Add two tests: one with a jobspec that uses a top-level 'slot'
and one that does not.
milroy added a commit to milroy/flux-sched that referenced this issue Oct 24, 2024
Problem: the current tests for
flux-framework#1284 do not check
to ensure partial cancel behaves as desired with ssd pruning filters.

Add the tests with the ssd pruning filters at all ancestor graph
vertices.
@mergify mergify bot closed this as completed in #1292 Nov 2, 2024