-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add resource Id remapping support #773
Conversation
@Mergefyio rebase |
@Mergifyio rebase |
Command
|
4341eca
to
7113842
Compare
This is good for your review. While testing this on rzansel with actual GPUs, I found a couple of problems with flux-core. I posted a PR for one problem: flux-framework/flux-core#3376 But I don't think we have any immediate solution for the other: flux-framework/flux-core#3375 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @dongahn, but may need a few minor changes that I highlighted.
@@ -0,0 +1,207 @@ | |||
/*****************************************************************************\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nits: typos in commit message:
infon (ref_id is remapped to remapped_id) for each
-> info (ref_id is remapped to remapped_id) for each
of remapping info infomration per each.
-> of remapping info.
which them allows data querying to be performed
-> which then allows data querying to be performed
such a way that we can find the corresponding
-> in such a way that we can find the corresponding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about this. I will clean up the commit message. Thanks!
m_remap[exec_target_range] = std::map<const std::string, | ||
std::map<uint64_t, | ||
uint64_t>> (); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you can avoid the subsequent [exec_target_range].find
and [exec_target_range][name_type].find
lookups by doing something like:
m_remap[exec_target_range] = std::map<const std::string, | |
std::map<uint64_t, | |
uint64_t>> (); | |
m_remap[exec_target_range] = std::map<const std::string, | |
std::map<uint64_t, | |
uint64_t>> (); | |
m_remap[exec_target_range][name_type] = std::map<uint64_t, | |
uint64_t> (); | |
m_remap[exec_target_range][name_type][ref_id] = remapped_id; | |
goto success; |
There may be a cleaner way to accomplish this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed, I picked readability and simplicity over optimization for the initial condition. I think it makes sense to add this in this case so I will try to add this. But I have a slight reservation on the use of goto
statement for success as a unconventional control flow can be confusing. So I will try to avoid it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@milroy: Looking at the code, I am actually having a second thought.
It feels like doing this optimization will make the control flow more complex. It is like loop hoisting used in hot loop optimization. But this isn't a hotspot as this initial iteration happens only once per each execution target range (in a typical case, this would be called just once). So I have a slight preference to readability (with simpler control flow) over minor performance improvement.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's not a hotspot, it's probably better to lean toward readability.
rc = -1; | ||
break; | ||
} | ||
if (remap_id > std::numeric_limits<int>::max ()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is necessary because hwloc_obj::logical_index
is unsigned
. Couldn't we use something smaller than uint64_t
for remap_id (as originating in resource/readers/resource_namespace_remapper.cpp
), or is there another advantage for using the 64 bit type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to make this class as general as possible. I considered two ways. 1) make this class templated; or 1) use the largest data type. Going with 1) will have some side effects one of which is to have the implementations in header files. So for a little class like this, I thought it would be okay to go with 2). From our discussion yesterday, you seem to be okay with it. Unless I hear otherwise, I will keep this as is. Thanks.
t/scripts/flux-ion-resource.py
Outdated
@@ -83,6 +83,10 @@ def rpc_find (self, criteria, find_format=None): | |||
|
|||
def rpc_status (self): | |||
return self.f.rpc ("sched-fluxion-resource.status").get () | |||
|
|||
def rpc_namespace_info (self, rank, type_name, Id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: IMHO Id
is too close to id
, which is a built-in Python function. This isn't a necessary change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had no idea id
was a keyword. I will go with identity
.
7113842
to
39ab316
Compare
39ab316
to
470e79b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just some comments below, none of them are required. Thanks @dongahn! Huge improvement for GPU users.
auto m_remap_iter = m_remap.find (exec_target_range); | ||
|
||
if (m_remap_iter == m_remap.end ()) { | ||
m_remap[exec_target_range] = std::map<const std::string, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would using the m_remap.emplace(exec_target_range, std::map<....>)
help here, since "insertion only takes place if no other element in the container has a key equivalent to the one being emplaced "? I don't think it would be any more efficient, but I think it would shrink the find
, if
, and [
lines down into a single line (for both the inner and outer maps). Just a thought, not a required change.
@@ -0,0 +1,91 @@ | |||
#!/bin/sh | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor typo in commit message: shen
-> when
470e79b
to
f8bca2d
Compare
auto m_remap_iter = m_remap.find (exec_target_range); | ||
|
||
if (m_remap_iter == m_remap.end ()) { | ||
auto ret = m_remap.emplace (exec_target_range, | ||
std::map<const std::string, | ||
std::map<uint64_t, | ||
uint64_t>> ()); | ||
if (!ret.second) { | ||
errno = ENOMEM; | ||
goto error; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are using emplace, it is my understanding that you don't need the find
or the if
:
auto m_remap_iter = m_remap.find (exec_target_range); | |
if (m_remap_iter == m_remap.end ()) { | |
auto ret = m_remap.emplace (exec_target_range, | |
std::map<const std::string, | |
std::map<uint64_t, | |
uint64_t>> ()); | |
if (!ret.second) { | |
errno = ENOMEM; | |
goto error; | |
} | |
auto ret = m_remap.emplace (exec_target_range, | |
std::map<const std::string, | |
std::map<uint64_t, | |
uint64_t>> ()); |
Also IIUC, insert
and emplace
only return false for ret.second
when the element already existed in the map, not when there is a memory error, so if you keep those checks, I think ENOMEM
is a misnomer. EEXISTS
might be better. I'm basing this off the docs in cplusplus.com, so I very well may be wrong about that. [1] [2]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I see now that you are using the result of the find further down. So ignore my comment about not needing the find
. Sorry about that.
The bit about ENOMEM
still stands though. Assuming that ret.second
is only false when the element already exists (and not when out of memory), I don't think you need that check at all (the find already proved the element doesn't exist).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered EEXIST
but that condition will never occur because of the enclosing condition: if (m_remap_iter == m_remap.end ()) {
So I used ENOMEM
as catch-all. Maybe it is better to just not checking ret.second
since it should not occur?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is better to just not checking ret.second since it should not occur?
Yeah, I would agree with that. Your call though. My previous approval still stands, so feel free to put MWP on this PR when you are happy with it. It'll probably need a rebase too for the new GitHub actions to take affect (I'm still seeing Travis on the status of this).
f8bca2d
to
0b4c129
Compare
Problem: The rank and resource Id namespace of a nested flux instance is different from that of its parent. Thus, the resource reader of a nested instance may need to remap the rank and/or certain resource Ids as it populates its resource graph store from the resources emitted from its parent. Add a header and C++ source file that introduces the resource_namespace_remapper_t class. It is designed to add and query resource Id or rank remapping information efficiently both in terms of performance and memory overheads. Using the add() method, users should first pass the resource type (e.g., "core", "gpu") or "rank" as the "name_type" argument along with the Id remapping info (ref_id is remapped to remapped_id) for each unique range of flux ranks (e.g., 0-3). resource_namespace_remapper_t keeps each remapping data using distinct_range_t as key to a std::map member object. IOW, we treat each distinct rank range which has the same remapping info as an equivalent set and keep only one copy of remapping info. Furthermore, the distinct_range_t class implements "operator<()" so that this mapping info can be kept in std::map in ascending rank-range order, which then allows data querying to be performed in O(logN) time. However, the method for querying is the query() method, whose semantics is per individual rank based. To support retrieval of remapping information for a individual rank, we implement "operator<()" in such a way that we can find the corresponding record when the rank intersects a rank-range key.
Make resource_namespace_remapper available for all reader classes by integrating it into the base reader class as a public member object.
Problem: the hwloc reader of a nested flux instance discovers the GPU resources using hwloc. Hwloc uses the logical Id space for each discovered GPU resource, which does not work with GPU affinity at execution time (i.e., CUDA_VISIBLE_DEVICES). Solution: If remap information has been established, hwloc reader remaps the logical Id of the self-discovered GPU devices into the remapped Id. The sched-fluxion-resource will be modified to establish this remap by using RV1 execution key coming from core's resource.acquire interface.
Establish GPU Id remapping by using RV1 execution key coming from core's resource.acquire interface. This is used by the hwloc reader to remap the self-discovered GPU Ids, which are logical, into physical.
Problem: it is nearly impossible to test hwloc's reader's GPU Id remapping on non-GPU systems using fake hwloc xml files, which makes it difficult to develop a CI test. Add ns-info (or namespace info) RPC to sched-fluxion-resource to support further testing. This RPC returns remapping information corresponding to <rank, resource-type-name, ref-id>.
Add a subcommand into flux-ion-resource: ns-info Rank Type Id to get to help invoke ns-info RPC within sched-fluxion-resource.
0b4c129
to
61b1c39
Compare
FYI -- CI failed because flux-core picked up the comma separated list format fix for |
Use a Sierra hwloc xml which has 4 GPUs. Get a nested allocation using flux mini alloc with 2 GPUs and run - flux ion-resource ns-info 0 gpu 0 - flux ion-resource ns-info 0 gpu 1 to check how the self-discovered GPU Id 0 and 1 are remapped to. When we use high Id first match policy, they should map to 2 and 3 because the Fluxion scheduler will select GPU 2 and 3 (i.e., CUDA_VISIBLE_DEVICE=2,3 passed into the nested instance). Similarly, when we use low Id first match policy, they should map to 0 and 1.
61b1c39
to
45c4143
Compare
Codecov Report
@@ Coverage Diff @@
## master #773 +/- ##
========================================
- Coverage 73.5% 73.4% -0.1%
========================================
Files 79 81 +2
Lines 8173 8316 +143
========================================
+ Hits 6011 6109 +98
- Misses 2162 2207 +45
|
Ok. that did the trick. Thanks @milroy and @SteVwonder for your thorough review! |
Thanks @dongahn for putting this together! |
This PR adds basic support for resource Id namespace remapping and solves hwloc GPU ID remapping.
class resource_namespace_remapper_t
)sched-fluxion-resource
to establish correct GPU remapping for hwloc reader. The hwloc reader of a nested flux instance discovers the GPU resources using hwloc. However, hwloc uses the logical Id space for each discovered GPU resource, which does not work with GPU affinity at execution time (i.e.,CUDA_VISIBLE_DEVICES
).