remove: Final .free RPC failed to remove all resources for jobid 1395310723072: Success #1301

Closed
garlick opened this issue Sep 26, 2024 · 3 comments

garlick commented Sep 26, 2024

Problem: spurious error?

I'm getting this error on flux-sched-0.38.0-7-g2bd4253d whenever I run something that uses cores on more than one node, but not whole nodes:

Sep 26 11:56:14.362759 PDT sched-fluxion-qmanager.err[0]: remove: Final .free RPC failed to remove all resources for jobid 3453455695872: Success
Sep 26 11:56:14.363202 PDT sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=default id=3453455695872): Protocol error

For example, on a 2-node allocation with 4 cores per node, I can run on 1, 2, 3, 4, or 8 cores with no error, but 5, 6, or 7 cores generate that error.

FWIW:

$ flux module trace sched-fluxion-qmanager
2024-09-26T12:03:57.006 sched-fluxion-qmanager rx > sched.alloc [462]
2024-09-26T12:03:57.007 sched-fluxion-qmanager tx > sched-fluxion-resource.match_multi [678]
2024-09-26T12:03:57.009 sched-fluxion-qmanager rx < sched-fluxion-resource.match_multi [370]
2024-09-26T12:03:57.009 sched-fluxion-qmanager rx < sched-fluxion-resource.match_multi [0]
2024-09-26T12:03:57.009 sched-fluxion-qmanager tx > kvs.commit [450]
2024-09-26T12:03:57.012 sched-fluxion-qmanager rx < kvs.commit [72]
2024-09-26T12:03:57.012 sched-fluxion-qmanager tx < sched.alloc [294]
2024-09-26T12:03:57.173 sched-fluxion-qmanager rx > sched.free [254]
2024-09-26T12:03:57.173 sched-fluxion-qmanager tx > sched-fluxion-resource.partial-cancel [286]
2024-09-26T12:03:57.173 sched-fluxion-qmanager rx < sched-fluxion-resource.partial-cancel [19]
2024-09-26T12:03:57.173 sched-fluxion-qmanager tx > log.append [155]
2024-09-26T12:03:57.174 sched-fluxion-qmanager tx > sched-fluxion-resource.cancel [24]
2024-09-26T12:03:57.174 sched-fluxion-qmanager rx < sched-fluxion-resource.cancel [3]
2024-09-26T12:03:57.174 sched-fluxion-qmanager tx > log.append [143]

The resources are actually freed though, and I can reallocate them.

trws self-assigned this Oct 2, 2024

garlick commented Oct 3, 2024

easy reproducer:

$ flux start -s2 sh -c 'flux run -n $(($(flux resource list -no {ncores})-1)) true'
Oct 03 14:40:01.444840 PDT sched-fluxion-qmanager.err[0]: remove: Final .free RPC failed to remove all resources for jobid 12733906944: Success
Oct 03 14:40:01.444937 PDT sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=default id=12733906944): Protocol error


grondo commented Oct 4, 2024

I'm seeing this on my cluster as well. In my case I'm allocating full nodes, but my nodes have different numbers of cores. The trigger for this bug seems to be multiple entries in the R_lite array, i.e. at least two ranks with different core idsets assigned. For example, this R:

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "2-3",
        "children": {
          "core": "0-3"
        }
      },
      {
        "rank": "4",
        "children": {
          "core": "0-7"
        }
      }
    ],
    "nodelist": [
      "pi[1-2,4]"
    ],
    "properties": {
      "cm4": "2-3",
      "rk1": "4"
    },
    "starttime": 1727999001,
    "expiration": 1728002601
  }
}

triggered the error, while this one

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "2-3",
        "children": {
          "core": "0-3"
        }
      }
    ],
    "nodelist": [
      "pi[1-2]"
    ],
    "properties": {
      "cm4": "2-3"
    },
    "starttime": 1727998990,
    "expiration": 1728002590
  }
}

did not.
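
To make the failure mode concrete, here is a toy illustration (not flux-sched code) of the pre-fix behavior shown in the patch in the next comment, where only the last R_lite entry's "rank" string survives to be decoded:

// Toy sketch: with the decode happening after the loop over R_lite entries,
// only the last entry's "rank" string is decoded, so for the failing R above
// ranks 2-3 never reach the removal set.
#include <cstdio>
#include <cstdlib>
#include <flux/idset.h>

int main (void)
{
    const char *rank_strings[] = { "2-3", "4" };  // from the failing R above
    const char *ranks = NULL;

    for (const char *s : rank_strings)   // each iteration overwrites `ranks`
        ranks = s;

    struct idset *ids = idset_decode (ranks);          // decodes only "4"
    char *str = idset_encode (ids, IDSET_FLAG_RANGE);
    printf ("ranks recorded for removal: %s\n", str);  // prints "4"; 2-3 are missing
    free (str);
    idset_destroy (ids);
    return 0;
}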


grondo commented Oct 4, 2024

This seems to fix the error for me:

diff --git a/resource/readers/resource_reader_rv1exec.cpp b/resource/readers/resource_reader_rv1exec.cpp
index d630d239..9ed64626 100644
--- a/resource/readers/resource_reader_rv1exec.cpp
+++ b/resource/readers/resource_reader_rv1exec.cpp
@@ -961,15 +961,15 @@ int resource_reader_rv1exec_t::partial_cancel_internal (resource_graph_t &g,
             errno = EINVAL;
             goto error;
         }
+        if (!(r_ids = idset_decode (ranks)))
+            goto error;
+        rank = idset_first (r_ids);
+        while (rank != IDSET_INVALID_ID) {
+            mod_data.ranks_removed.insert (rank);
+            rank = idset_next (r_ids, rank);
+        }
+        idset_destroy (r_ids);
     }
-    if (!(r_ids = idset_decode (ranks)))
-        goto error;
-    rank = idset_first (r_ids);
-    while (rank != IDSET_INVALID_ID) {
-        mod_data.ranks_removed.insert (rank);
-        rank = idset_next (r_ids, rank);
-    }
-    idset_destroy (r_ids);

Though there's probably a much more efficient way to do this (e.g., build the full idset first, then insert into ranks_removed).
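
A minimal sketch of that alternative, assuming libidset's idset_add() in-place union is available and that ranks_removed is a std::set as in the diff above; the helper name and parameters are hypothetical:

// Hypothetical helper, not actual flux-sched code: decode every R_lite
// entry's "rank" string into one accumulated idset, then copy its members
// into the removal set in a single pass.
#include <cstdint>
#include <set>
#include <string>
#include <vector>
#include <flux/idset.h>

static int accumulate_ranks (const std::vector<std::string> &rank_strings,
                             std::set<uint64_t> &ranks_removed)
{
    struct idset *all = idset_create (0, IDSET_FLAG_AUTOGROW);
    if (!all)
        return -1;
    for (const auto &s : rank_strings) {
        struct idset *ids = idset_decode (s.c_str ());
        if (!ids || idset_add (all, ids) < 0) {  // idset_add(): in-place union
            idset_destroy (ids);
            idset_destroy (all);
            return -1;
        }
        idset_destroy (ids);
    }
    unsigned int rank = idset_first (all);
    while (rank != IDSET_INVALID_ID) {
        ranks_removed.insert (rank);
        rank = idset_next (all, rank);
    }
    idset_destroy (all);
    return 0;
}

This touches ranks_removed only once per job, after all R_lite entries have been seen, rather than inserting rank-by-rank inside the parsing loop.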

grondo added a commit to grondo/flux-sched that referenced this issue Oct 4, 2024
Problem: The rv1exec reader partial cancel support doesn't work when
there are multiple entries in the execution R_lite array. This is because
resource_reader_rv1exec_t::partial_cancel_internal() doesn't accumulate
ranks as it loops over the R_lite array. This results in the nuisance log
message

  remove: Final .free RPC failed to remove all resources for jobid...

for every job that doesn't have the same core or gpu ids allocated for
every rank.

Accumulate ranks while looping over entries in R_lite instead of
throwing away all but the last ranks idset.

Fixes flux-framework#1301
grondo added a commit to grondo/flux-sched that referenced this issue Oct 4, 2024
Problem: The Fluxion testsuite does not contain a test that does
end-to-end testing of resource release handling, including partial
release as triggered by the Flux housekeeping service.

Add a new test t1026-rv1-partial-release.t which implements a first
cut of this testing.

Add to the test specific cases that trigger issue flux-framework#1301.
mergify bot closed this as completed in 2b4b6ff Oct 5, 2024