remove: Final .free RPC failed to remove all resources for jobid 1395310723072: Success #1301
easy reproducer:
I'm seeing this on my cluster as well. In my case I'm allocating full nodes, but my nodes have different numbers of cores. The trigger for this bug seems to be when there are multiple entries in the R_lite array. For example, this R showed the error:

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "2-3",
        "children": {
          "core": "0-3"
        }
      },
      {
        "rank": "4",
        "children": {
          "core": "0-7"
        }
      }
    ],
    "nodelist": [
      "pi[1-2,4]"
    ],
    "properties": {
      "cm4": "2-3",
      "rk1": "4"
    },
    "starttime": 1727999001,
    "expiration": 1728002601
  }
}

while this one does not:

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "2-3",
        "children": {
          "core": "0-3"
        }
      }
    ],
    "nodelist": [
      "pi[1-2]"
    ],
    "properties": {
      "cm4": "2-3"
    },
    "starttime": 1727998990,
    "expiration": 1728002590
  }
}
This seems to fix the error for me:

diff --git a/resource/readers/resource_reader_rv1exec.cpp b/resource/readers/resource_reader_rv1exec.cpp
index d630d239..9ed64626 100644
--- a/resource/readers/resource_reader_rv1exec.cpp
+++ b/resource/readers/resource_reader_rv1exec.cpp
@@ -961,15 +961,15 @@ int resource_reader_rv1exec_t::partial_cancel_internal (resource_graph_t &g,
errno = EINVAL;
goto error;
}
+ if (!(r_ids = idset_decode (ranks)))
+ goto error;
+ rank = idset_first (r_ids);
+ while (rank != IDSET_INVALID_ID) {
+ mod_data.ranks_removed.insert (rank);
+ rank = idset_next (r_ids, rank);
+ }
+ idset_destroy (r_ids);
}
- if (!(r_ids = idset_decode (ranks)))
- goto error;
- rank = idset_first (r_ids);
- while (rank != IDSET_INVALID_ID) {
- mod_data.ranks_removed.insert (rank);
- rank = idset_next (r_ids, rank);
- }
- idset_destroy (r_ids);

Though there's probably a much more efficient way to do this (e.g., build the complete idset first, then insert into mod_data.ranks_removed once).
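For reference, here is a minimal self-contained sketch of that more efficient approach. This is illustrative only, not the patch that was merged: the helper name accumulate_ranks and the std::set parameter standing in for mod_data.ranks_removed are hypothetical, and it assumes a flux-core libidset that provides idset_add() for in-place union.

#include <set>
#include <vector>
#include <flux/idset.h>

// Sketch: decode each R_lite entry's "rank" string, union the ids into one
// accumulator idset, then populate the removed-ranks set in a single pass.
static int accumulate_ranks (const std::vector<const char *> &rank_strs,
                             std::set<unsigned int> &ranks_removed)
{
    struct idset *all = idset_create (0, IDSET_FLAG_AUTOGROW);
    if (!all)
        return -1;
    for (const char *ranks : rank_strs) {
        struct idset *ids = idset_decode (ranks);
        if (!ids || idset_add (all, ids) < 0) {  // idset_add: in-place union
            idset_destroy (ids);                 // destroy is NULL-safe
            idset_destroy (all);
            return -1;
        }
        idset_destroy (ids);
    }
    for (unsigned int rank = idset_first (all); rank != IDSET_INVALID_ID;
         rank = idset_next (all, rank))
        ranks_removed.insert (rank);
    idset_destroy (all);
    return 0;
}

This trades a set insertion per rank per entry for one idset union per entry plus a single final walk, and it naturally deduplicates ranks if entries ever overlap.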
Problem: The rv1exec reader partial cancel support doesn't work when there are multiple entries in the execution R_lite array. This is because resource_reader_rv1exec_t::partial_cancel_internal() doesn't accumulate ranks as it loops over the R_lite array. This results in the nuisance log message "remove: Final .free RPC failed to remove all resources for jobid..." for every job that doesn't have the same core or gpu ids allocated for every rank.

Accumulate ranks while looping over entries in R_lite instead of throwing away all but the last ranks idset.

Fixes flux-framework#1301
Problem: The Fluxion testsuite does not contain a test that does end-to-end testing of resource release handling, including partial release as triggered by the Flux housekeeping service.

Add a new test t1026-rv1-partial-release.t which implements a first cut of this testing. Add to the test specific cases that trigger issue flux-framework#1301.
Problem: spurious error?
I'm getting this error on flux-sched-0.38.0-7-g2bd4253d whenever I run something that uses cores on more than one node, but not whole nodes. For example, on a 2-node allocation with 4 cores per node, I can run on 1, 2, 3, 4, and 8 cores with no error, but 5, 6, or 7 cores generate that error.
FWIW: the resources are actually freed though, and I can reallocate them.
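For illustration, here is a hypothetical R for the 5-core case (hostnames, rank numbers, and timestamps are made up). Five cores spanning two 4-core nodes split unevenly, so the allocation produces two R_lite entries with different core sets, which is exactly the multiple-entry trigger described in the comments above:

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0",
        "children": {
          "core": "0-3"
        }
      },
      {
        "rank": "1",
        "children": {
          "core": "0"
        }
      }
    ],
    "nodelist": [
      "node[0-1]"
    ],
    "starttime": 0,
    "expiration": 0
  }
}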