remove: Final .free RPC failed to remove all resources for jobid 1395310723072: Success #1301

Closed
garlick opened this issue Sep 26, 2024 · 3 comments

garlick commented Sep 26, 2024

Problem: spurious error?

I'm getting this error on flux-sched-0.38.0-7-g2bd4253d whenever I run something that uses cores on more than one node, but not whole nodes:

Sep 26 11:56:14.362759 PDT sched-fluxion-qmanager.err[0]: remove: Final .free RPC failed to remove all resources for jobid 3453455695872: Success
Sep 26 11:56:14.363202 PDT sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=default id=3453455695872): Protocol error

For example, on a 2-node allocation with 4 cores per node, I can run on 1, 2, 3, 4, or 8 cores with no error, but 5, 6, or 7 cores generate that error.

FWIW:

$ flux module trace sched-fluxion-qmanager
2024-09-26T12:03:57.006 sched-fluxion-qmanager rx > sched.alloc [462]
2024-09-26T12:03:57.007 sched-fluxion-qmanager tx > sched-fluxion-resource.match_multi [678]
2024-09-26T12:03:57.009 sched-fluxion-qmanager rx < sched-fluxion-resource.match_multi [370]
2024-09-26T12:03:57.009 sched-fluxion-qmanager rx < sched-fluxion-resource.match_multi [0]
2024-09-26T12:03:57.009 sched-fluxion-qmanager tx > kvs.commit [450]
2024-09-26T12:03:57.012 sched-fluxion-qmanager rx < kvs.commit [72]
2024-09-26T12:03:57.012 sched-fluxion-qmanager tx < sched.alloc [294]
2024-09-26T12:03:57.173 sched-fluxion-qmanager rx > sched.free [254]
2024-09-26T12:03:57.173 sched-fluxion-qmanager tx > sched-fluxion-resource.partial-cancel [286]
2024-09-26T12:03:57.173 sched-fluxion-qmanager rx < sched-fluxion-resource.partial-cancel [19]
2024-09-26T12:03:57.173 sched-fluxion-qmanager tx > log.append [155]
2024-09-26T12:03:57.174 sched-fluxion-qmanager tx > sched-fluxion-resource.cancel [24]
2024-09-26T12:03:57.174 sched-fluxion-qmanager rx < sched-fluxion-resource.cancel [3]
2024-09-26T12:03:57.174 sched-fluxion-qmanager tx > log.append [143]

The resources are actually freed though, and I can reallocate them.

trws self-assigned this Oct 2, 2024

garlick commented Oct 3, 2024

easy reproducer:

$ flux start -s2 sh -c 'flux run -n $(($(flux resource list -no {ncores})-1)) true'
Oct 03 14:40:01.444840 PDT sched-fluxion-qmanager.err[0]: remove: Final .free RPC failed to remove all resources for jobid 12733906944: Success
Oct 03 14:40:01.444937 PDT sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=default id=12733906944): Protocol error


grondo commented Oct 4, 2024

I'm seeing this on my cluster as well. In my case I'm allocating full nodes, but my nodes have different numbers of cores. The trigger for this bug seems to be multiple entries in the R_lite array, i.e. at least two ranks with different core idsets assigned. For example, this R:

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "2-3",
        "children": {
          "core": "0-3"
        }
      },
      {
        "rank": "4",
        "children": {
          "core": "0-7"
        }
      }
    ],
    "nodelist": [
      "pi[1-2,4]"
    ],
    "properties": {
      "cm4": "2-3",
      "rk1": "4"
    },
    "starttime": 1727999001,
    "expiration": 1728002601
  }
}

triggered the error, while this one

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "2-3",
        "children": {
          "core": "0-3"
        }
      }
    ],
    "nodelist": [
      "pi[1-2]"
    ],
    "properties": {
      "cm4": "2-3"
    },
    "starttime": 1727998990,
    "expiration": 1728002590
  }
}

did not.
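
To make the failure mode concrete, here is a toy illustration (not flux-sched code) of the pre-fix behavior shown in the patch in the next comment, where only the last R_lite entry's "rank" string survives to be decoded:

// Toy sketch: with the decode happening after the loop over R_lite entries,
// only the last entry's "rank" string is decoded, so for the failing R above
// ranks 2-3 never reach the removal set.
#include <cstdio>
#include <cstdlib>
#include <flux/idset.h>

int main (void)
{
    const char *rank_strings[] = { "2-3", "4" };  // from the failing R above
    const char *ranks = NULL;

    for (const char *s : rank_strings)   // each iteration overwrites `ranks`
        ranks = s;

    struct idset *ids = idset_decode (ranks);          // decodes only "4"
    char *str = idset_encode (ids, IDSET_FLAG_RANGE);
    printf ("ranks recorded for removal: %s\n", str);  // prints "4"; 2-3 are missing
    free (str);
    idset_destroy (ids);
    return 0;
}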


grondo commented Oct 4, 2024

This seems to fix the error for me:

diff --git a/resource/readers/resource_reader_rv1exec.cpp b/resource/readers/resource_reader_rv1exec.cpp
index d630d239..9ed64626 100644
--- a/resource/readers/resource_reader_rv1exec.cpp
+++ b/resource/readers/resource_reader_rv1exec.cpp
@@ -961,15 +961,15 @@ int resource_reader_rv1exec_t::partial_cancel_internal (resource_graph_t &g,
             errno = EINVAL;
             goto error;
         }
+        if (!(r_ids = idset_decode (ranks)))
+            goto error;
+        rank = idset_first (r_ids);
+        while (rank != IDSET_INVALID_ID) {
+            mod_data.ranks_removed.insert (rank);
+            rank = idset_next (r_ids, rank);
+        }
+        idset_destroy (r_ids);
     }
-    if (!(r_ids = idset_decode (ranks)))
-        goto error;
-    rank = idset_first (r_ids);
-    while (rank != IDSET_INVALID_ID) {
-        mod_data.ranks_removed.insert (rank);
-        rank = idset_next (r_ids, rank);
-    }
-    idset_destroy (r_ids);

Though there's probably a much more efficient way to do this (e.g., build the full idset first, then insert into ranks_removed).
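
A minimal sketch of that alternative, assuming libidset's idset_add() in-place union is available and that ranks_removed is a std::set as in the diff above; the helper name and parameters are hypothetical:

// Hypothetical helper, not actual flux-sched code: decode every R_lite
// entry's "rank" string into one accumulated idset, then copy its members
// into the removal set in a single pass.
#include <cstdint>
#include <set>
#include <string>
#include <vector>
#include <flux/idset.h>

static int accumulate_ranks (const std::vector<std::string> &rank_strings,
                             std::set<uint64_t> &ranks_removed)
{
    struct idset *all = idset_create (0, IDSET_FLAG_AUTOGROW);
    if (!all)
        return -1;
    for (const auto &s : rank_strings) {
        struct idset *ids = idset_decode (s.c_str ());
        if (!ids || idset_add (all, ids) < 0) {  // idset_add(): in-place union
            idset_destroy (ids);
            idset_destroy (all);
            return -1;
        }
        idset_destroy (ids);
    }
    unsigned int rank = idset_first (all);
    while (rank != IDSET_INVALID_ID) {
        ranks_removed.insert (rank);
        rank = idset_next (all, rank);
    }
    idset_destroy (all);
    return 0;
}

This touches ranks_removed only once per job, after all R_lite entries have been seen, rather than inserting rank-by-rank inside the parsing loop.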

grondo added a commit to grondo/flux-sched that referenced this issue Oct 4, 2024
Problem: The rv1exec reader partial cancel support doesn't work when
there are multiple entries in the execution R_lite array. This is because
resource_reader_rv1exec_t::partial_cancel_internal() doesn't accumulate
ranks as it loops over the R_lite array. This results in the nuisance log
message

  remove: Final .free RPC failed to remove all resources for jobid...

for every job that doesn't have the same core or gpu ids allocated for
every rank.

Accumulate ranks while looping over entries in R_lite instead of
throwing away all but the last ranks idset.

Fixes flux-framework#1301
grondo added a commit to grondo/flux-sched that referenced this issue Oct 4, 2024
Problem: The Fluxion testsuite does not contain a test that does
end-to-end testing of resource release handling, including partial
release as triggered by the Flux housekeeping service.

Add a new test t1026-rv1-partial-release.t which implements a first
cut of this testing.

Add to the test specific cases that trigger issue flux-framework#1301.
mergify bot closed this as completed in 2b4b6ff Oct 5, 2024