Partial cancel not releasing rabbit resources (?) #1284
I reloaded the resource and fluxion modules and scheduling went back to working as expected at first, but then as I ran jobs they eventually became stuck in SCHED.
The issue seems to have been introduced between 0.36.1 and 0.37.0.
I suspect this line is reached with ..., in which case an additional check followed by goto done; is probably warranted. Do you have a reproducer for this issue?
I can reproduce in the flux-coral2 environment locally or on LC clusters, but there are a bunch of plugins loaded. The simplest thing I have is the following, I think. It reproduces with the following R+JGF (it may work more easily if you rename your docker container to have the hostname ...) and jobspecs like:

version: 9999
resources:
  - type: slot
    count: 1
    label: default
    with:
      - type: ssd
        count: 1
        exclusive: true
      - type: node
        count: 1
        exclusive: true
        with:
          - type: slot
            label: task
            count: 1
            with:
              - type: core
                count: 1
# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: task
    count:
      per_slot: 1

However I don't know how to get it to ignore the ...
Is this coming from the ...?
Problem: issue flux-framework#1284 identified a scenario where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the mod_plan function when an allocation is found and mod_data.mod_type == job_modify_t::PARTIAL_CANCEL. Add a goto statement to return 0 under this circumstance.
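For illustration, here is a minimal, compilable sketch of the kind of early-exit check this commit message describes. The names job_modify_t, PARTIAL_CANCEL, mod_plan, and the goto done pattern come from the commit message; the surrounding types, arguments, and control flow are assumptions made so the example stands alone, not the actual flux-sched traverser code.

```cpp
// Compilable sketch of the early-exit check described above. Everything
// except the names taken from the commit message is an assumption.
#include <iostream>

enum class job_modify_t { CANCEL, PARTIAL_CANCEL };

struct modify_data_t {
    job_modify_t mod_type = job_modify_t::CANCEL;
};

// Stand-in for the traverser's mod_plan(): if an allocation already exists
// and this is a partial cancel, report success and skip the rest of the
// planner updates instead of falling through to an error path.
static int mod_plan (bool allocation_found, const modify_data_t &mod_data)
{
    int rc = -1;

    if (allocation_found
        && mod_data.mod_type == job_modify_t::PARTIAL_CANCEL) {
        rc = 0;
        goto done;  // the proposed "goto done;" early exit
    }

    // ... remaining planner/span bookkeeping would go here ...
    rc = 0;

done:
    return rc;
}

int main ()
{
    modify_data_t mod_data;
    mod_data.mod_type = job_modify_t::PARTIAL_CANCEL;
    std::cout << "rc=" << mod_plan (true, mod_data) << std::endl;  // rc=0
    return 0;
}
```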
Problem: flux-framework/flux-sched/issues/1284 came up in production and was not caught beforehand because the testsuite never exhausts the available rabbit resources. Add a test that runs three back-to-back 10TiB rabbit jobs to exhaust all the rabbit resources.
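As a rough illustration, a sharness-style sketch of the kind of test this describes: rabbit jobs run back to back so that a leaked rabbit allocation would leave a later job stuck in SCHED. The jobspec filename rabbit-10tib-jobspec.yaml is a hypothetical placeholder, and the real test in the flux-coral2 testsuite will differ in detail.

```sh
# Sketch only: assumes a hypothetical rabbit-10tib-jobspec.yaml requesting
# 10TiB of rabbit storage, so that consecutive jobs exhaust the rabbits if
# allocations are not released.
test_expect_success 'back-to-back 10TiB rabbit jobs all run to completion' '
	job1=$(flux job submit --flags=waitable rabbit-10tib-jobspec.yaml) &&
	flux job wait ${job1} &&
	job2=$(flux job submit --flags=waitable rabbit-10tib-jobspec.yaml) &&
	flux job wait ${job2} &&
	job3=$(flux job submit --flags=waitable rabbit-10tib-jobspec.yaml) &&
	flux job wait ${job3}
'
```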
After our discussion with Tom, I'm almost certain this issue is related to the ssds not being mapped to a broker rank.
Are the rank values in this JGF representative of a production cluster (i.e., each rank is ...)?

Here's my first crack at a reproducer (...):

#!/bin/bash
#
# Ensure fluxion cancels ssds as expected
#
log() { printf "issue#1284: $@\n" >&2; }
TEST_SIZE=2
log "Unloading modules..."
flux module remove sched-simple
flux module remove resource
flux config load <<EOF
[resource]
noverify = true
norestrict = true
path="${SHARNESS_TEST_SRCDIR}/R.json"
EOF
flux module load resource monitor-force-up
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux module list
flux module remove job-list
flux queue start --all --quiet
flux resource list
flux resource status
log "Running test jobs"
flux submit --flags=waitable \
--setattr=exec.test.run_duration=0.01 \
${SHARNESS_TEST_SRCDIR}/ssd-jobspec.yaml
flux submit --flags=waitable \
--setattr=exec.test.run_duration=0.01 \
${SHARNESS_TEST_SRCDIR}/ssd-jobspec.yaml
flux job wait -av
flux submit --flags=waitable \
--setattr=exec.test.run_duration=0.01 \
${SHARNESS_TEST_SRCDIR}/ssd-jobspec.yaml
flux jobs -a

With ... I see:

Sep 12 01:21:56.737938 UTC sched-fluxion-resource.err[0]: grow_resource_db_jgf: unpack_parent_job_resources: Invalid argument
Sep 12 01:21:56.737943 UTC sched-fluxion-resource.err[0]: update_resource_db: grow_resource_db: Invalid argument
Sep 12 01:21:56.737945 UTC sched-fluxion-resource.err[0]: update_resource: update_resource_db: Invalid argument
Sep 12 01:21:56.737990 UTC sched-fluxion-resource.err[0]: populate_resource_db_acquire: update_resource: Invalid argument
Sep 12 01:21:56.737991 UTC sched-fluxion-resource.err[0]: populate_resource_db: loading resources using resource.acquire
Sep 12 01:21:56.737996 UTC sched-fluxion-resource.err[0]: init_resource_graph: can't populate graph resource database
Sep 12 01:21:56.737997 UTC sched-fluxion-resource.err[0]: mod_main: can't initialize resource graph database

Also, it seems like the first test should be to allocate the whole cluster and then see if another job can be scheduled after the resources are released. I think this is the jobspec we want:

version: 9999
resources:
  - type: slot
    count: 1
    label: default
    with:
      - type: rack
        count: 1
        with:
          - type: ssd
            count: 36
          - type: node
            count: 1
            with:
              - type: core
                count: 10
# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "hostname" ]
    slot: task
    count:
      per_slot: 1

Is this the right approach? Any ideas what may be wrong with ...?
Problem: there are no tests for issue flux-framework#1284. Add one.
@milroy I have a branch in my fork that reproduces the issue: https://github.com/jameshcorbett/flux-sched/tree/issue-1284

Interestingly, while fooling around with it I noticed that the issue only comes up if the jobspec has a top-level "slot". If instead it has "ssd" and "node" at the top level, it didn't seem to have the same problem.
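For reference, a sketch of the "no top-level slot" variant being described here, obtained by rearranging the jobspec from earlier in the thread so that ssd and node sit directly under resources. This is only an illustration of the shape; it is not the exact jobspec from the linked branch.

```yaml
# Same resources as the earlier jobspec, but with ssd and node at the top
# level instead of nested under a top-level "slot" (illustrative only).
version: 9999
resources:
  - type: ssd
    count: 1
    exclusive: true
  - type: node
    count: 1
    exclusive: true
    with:
      - type: slot
        label: task
        count: 1
        with:
          - type: core
            count: 1
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: task
    count:
      per_slot: 1
```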
Ok @milroy, I think I have an improved reproducer at https://github.com/jameshcorbett/flux-sched/tree/issue-1284
With this patch that @trws and I talked about, applied onto the branch linked above, I see errors like: ...
@jameshcorbett I cloned your fork and ran ... The logs indicate that the test is using ... However, enabling the jgf-based partial cancel unveils more problems. The default match format used by the test is not compatible with jgf-based partial cancel (it appears to be ...).
I'll continue investigating and will report back as I find out more.
Update: tests 5101 and 5102 succeed if I specify the ...:

test_expect_success 'an ssd jobspec can be allocated' '
flux module remove sched-simple &&
flux module remove resource &&
flux config load <<EOF &&
[resource]
noverify = true
norestrict = true
path="${SHARNESS_TEST_SRCDIR}/R.json"
[sched-fluxion-resource]
prune-filters = "cluster:ssd,rack:ssd"
EOF
<...>

It appears that allowing ...
OK, interesting. I will have to try it out. I just realized that we're switching the EAS clusters to the ...
I was a bit pessimistic in the previous comment. I figured out a way to support the external rank partial cancellation with ...
Problem: there are no tests for issue flux-framework#1284. Add two tests: one with a jobspec that uses a top-level 'slot' and one that does not.
Problem: the current tests for flux-framework#1284 do not check to ensure partial cancel behaves as desired with ssd pruning filters. Add tests with the ssd pruning filters at all ancestor graph vertices.
Snipped results of flux dmesg on hetchy: ...

Also, I think I observed that rabbit resources are not released by the scheduler when jobs complete. For instance, I ran a one-node rabbit job, and then tried to submit another one, only for it to become stuck in SCHED.
Any thoughts on what might be going on, @milroy?