Add resource Id remapping support #773

Merged: 10 commits merged into flux-framework:master on Dec 17, 2020

Conversation

@dongahn dongahn commented Nov 24, 2020

This PR adds basic support for resource Id namespace remapping and solves the hwloc GPU Id remapping problem.

  • Add initial support to aid in remapping resource Ids and ranks efficiently for nested instances (i.e., class resource_namespace_remapper_t)
  • Integrate it into the base resource reader class to make it available for all of the reader classes
  • Fix incorrect GPU numbering of our hwloc reader using this support
  • Modify sched-fluxion-resource to establish correct GPU remapping for the hwloc reader. The hwloc reader of a nested flux instance discovers the GPU resources using hwloc. However, hwloc uses the logical Id space for each discovered GPU resource, which does not work with GPU affinity at execution time (i.e., CUDA_VISIBLE_DEVICES); see the sketch after this list.
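
As a rough illustration of the mismatch (made-up numbers, not project code): if the parent grants physical GPUs 2 and 3 to a nested instance (CUDA_VISIBLE_DEVICES=2,3), the nested instance's hwloc reader still discovers those devices as logical Ids 0 and 1, so a remap table along these lines is what needs to be established:

    #include <cstdint>
    #include <iostream>
    #include <map>

    int main ()
    {
        // Nested-instance (hwloc logical) view -> parent-namespace (physical) Id.
        std::map<uint64_t, uint64_t> gpu_remap = { {0, 2}, {1, 3} };
        for (const auto &kv : gpu_remap)
            std::cout << "logical gpu" << kv.first
                      << " -> physical gpu" << kv.second << std::endl;
        return 0;
    }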

@dongahn dongahn requested review from SteVwonder and milroy November 28, 2020 23:25
@dongahn dongahn changed the title from "[WIP] Add resource Id remapping support" to "Add resource Id remapping support" Nov 28, 2020

dongahn commented Nov 28, 2020

@Mergefyio rebase


dongahn commented Nov 28, 2020

@Mergifyio rebase


mergify bot commented Nov 28, 2020

Command rebase: success

Branch has been successfully rebased


dongahn commented Nov 28, 2020

This is now ready for your review.

While testing this on rzansel with actual GPUs, I found a couple of problems with flux-core. I posted a PR for one problem: flux-framework/flux-core#3376

But I don't think we have any immediate solution for the other: flux-framework/flux-core#3375

@milroy milroy left a comment

This looks great @dongahn, but it may need a few minor changes, which I highlighted.

@@ -0,0 +1,207 @@
/*****************************************************************************\

Member

Nits: typos in commit message:

  "infon (ref_id is remapped to remapped_id) for each"
  -> "info (ref_id is remapped to remapped_id) for each"

  "of remapping info infomration per each."
  -> "of remapping info."

  "which them allows data querying to be performed"
  -> "which then allows data querying to be performed"

  "such a way that we can find the corresponding"
  -> "in such a way that we can find the corresponding"

Member Author

Sorry about this. I will clean up the commit message. Thanks!

Comment on lines 119 to 121
    m_remap[exec_target_range] = std::map<const std::string,
                                          std::map<uint64_t,
                                                   uint64_t>> ();

Member

It looks like you can avoid the subsequent [exec_target_range].find and [exec_target_range][name_type].find lookups by doing something like:

Suggested change (replacing the quoted lines above):

    m_remap[exec_target_range] = std::map<const std::string,
                                          std::map<uint64_t,
                                                   uint64_t>> ();
    m_remap[exec_target_range][name_type] = std::map<uint64_t,
                                                     uint64_t> ();
    m_remap[exec_target_range][name_type][ref_id] = remapped_id;
    goto success;

There may be a cleaner way to accomplish this.
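
For reference, a standalone illustration (not project code) of the related operator[] behavior: std::map::operator[] default-constructs a missing mapped value, which is what lets chained assignments like the suggestion above skip the explicit find () lookups.

    #include <cstdint>
    #include <map>
    #include <string>

    int main ()
    {
        std::map<std::string, std::map<uint64_t, uint64_t>> remap;
        remap["gpu"][0] = 2;   // creates remap["gpu"] on the fly, then inserts 0 -> 2
        remap["gpu"][1] = 3;   // inner map already exists; just inserts 1 -> 3
        return 0;
    }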

Member Author

As we discussed, I picked readability and simplicity over optimization for the initial condition. I think the optimization makes sense in this case, so I will try to add it. But I have a slight reservation about using a goto statement to jump to success, as unconventional control flow can be confusing, so I will try to avoid that.

Member Author

@milroy: Looking at the code, I am actually having second thoughts.

It feels like doing this optimization will make the control flow more complex. It is like the loop hoisting used in hot-loop optimization, but this isn't a hotspot: this initialization happens only once per execution target range (in a typical case, it would be called just once). So I have a slight preference for readability (with simpler control flow) over a minor performance improvement.

Thoughts?

Member

Since it's not a hotspot, it's probably better to lean toward readability.

resource/modules/resource_match.cpp (outdated comment, resolved)
        rc = -1;
        break;
    }
    if (remap_id > std::numeric_limits<int>::max ()) {

Member

This is necessary because hwloc_obj::logical_index is unsigned. Couldn't we use something smaller than uint64_t for remap_id (as originating in resource/readers/resource_namespace_remapper.cpp), or is there another advantage to using the 64-bit type?
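
A standalone sketch of the kind of guard being discussed, using a hypothetical helper name (narrow_remap_id) rather than the project's actual code: remap_id stays uint64_t for generality, so it is range-checked before being narrowed to a signed int.

    #include <cstdint>
    #include <cerrno>
    #include <limits>

    int narrow_remap_id (uint64_t remap_id, int &out)
    {
        if (remap_id > static_cast<uint64_t> (std::numeric_limits<int>::max ())) {
            errno = EOVERFLOW;
            return -1;
        }
        out = static_cast<int> (remap_id);
        return 0;
    }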

Member Author

I wanted to make this class as general as possible. I considered two ways: 1) make this class templated; or 2) use the largest data type. Going with 1) would have some side effects, one of which is having to put the implementation in header files. So for a little class like this, I thought it would be okay to go with 2). From our discussion yesterday, you seem to be okay with that. Unless I hear otherwise, I will keep this as is. Thanks.

@@ -83,6 +83,10 @@ def rpc_find (self, criteria, find_format=None):

    def rpc_status (self):
        return self.f.rpc ("sched-fluxion-resource.status").get ()

    def rpc_namespace_info (self, rank, type_name, Id):

Member

Nit: IMHO Id is too close to id, which is a built-in Python function. This isn't a necessary change.

Member Author

I had no idea id was a built-in. I will go with identity.

@codecov codecov bot deleted a comment from codecov-io Dec 2, 2020

@SteVwonder SteVwonder left a comment

LGTM! Just some comments below, none of them are required. Thanks @dongahn! Huge improvement for GPU users.

    auto m_remap_iter = m_remap.find (exec_target_range);

    if (m_remap_iter == m_remap.end ()) {
        m_remap[exec_target_range] = std::map<const std::string,

Member

Would using m_remap.emplace(exec_target_range, std::map<....>) help here, since "insertion only takes place if no other element in the container has a key equivalent to the one being emplaced"? I don't think it would be any more efficient, but I think it would shrink the find, if, and [ lines down into a single line (for both the inner and outer maps). Just a thought, not a required change.

@@ -0,0 +1,91 @@
#!/bin/sh

Member

Minor typo in commit message: shen -> when

Comment on lines 116 to 126
    auto m_remap_iter = m_remap.find (exec_target_range);

    if (m_remap_iter == m_remap.end ()) {
        auto ret = m_remap.emplace (exec_target_range,
                                    std::map<const std::string,
                                             std::map<uint64_t,
                                                      uint64_t>> ());
        if (!ret.second) {
            errno = ENOMEM;
            goto error;
        }

Member

Since you are using emplace, it is my understanding that you don't need the find or the if:

Suggested change (replacing the quoted lines above):

    auto ret = m_remap.emplace (exec_target_range,
                                std::map<const std::string,
                                         std::map<uint64_t,
                                                  uint64_t>> ());

Also IIUC, insert and emplace only return false for ret.second when the element already existed in the map, not when there is a memory error, so if you keep those checks, I think ENOMEM is a misnomer. EEXIST might be better. I'm basing this off the docs on cplusplus.com, so I very well may be wrong about that. [1] [2]
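
For reference, a small standalone program (not project code) showing the emplace return semantics in question: .second is false only when an equivalent key is already present, and the stored value is left untouched; an out-of-memory condition would surface as a std::bad_alloc exception rather than through the return value.

    #include <cstdint>
    #include <iostream>
    #include <map>

    int main ()
    {
        std::map<uint64_t, uint64_t> m;

        auto r1 = m.emplace (0, 42);
        std::cout << r1.second << std::endl;   // 1: key 0 was inserted

        auto r2 = m.emplace (0, 99);
        std::cout << r2.second << std::endl;   // 0: key 0 already present
        std::cout << m[0] << std::endl;        // 42: existing value untouched
        return 0;
    }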

Member

Sorry, I see now that you are using the result of the find further down. So ignore my comment about not needing the find. Sorry about that.

The bit about ENOMEM still stands though. Assuming that ret.second is only false when the element already exists (and not when out of memory), I don't think you need that check at all (the find already proved the element doesn't exist).

Member Author

I considered EEXIST but that condition will never occur because of the enclosing condition: if (m_remap_iter == m_remap.end ()) {

So I used ENOMEM as a catch-all. Maybe it is better just not to check ret.second since that condition should not occur?

Member

> Maybe it is better just not to check ret.second since that condition should not occur?

Yeah, I would agree with that. Your call though. My previous approval still stands, so feel free to put MWP on this PR when you are happy with it. It'll probably need a rebase too for the new GitHub Actions to take effect (I'm still seeing Travis in the status checks of this).

Problem: The rank and resource Id namespace of a
nested flux instance is different from that of its
parent. Thus, the resource reader of a nested
instance may need to remap the rank and/or certain
resource Ids as it populates its resource graph
store from the resources emitted from its parent.

Add a header and C++ source file that introduce
the resource_namespace_remapper_t class. It is
designed to add and query resource Id or rank
remapping information efficiently in terms of
both performance and memory overhead.

Using the add() method, users should first pass
the resource type (e.g., "core", "gpu") or "rank" as
the "name_type" argument along with the Id remapping
info (ref_id is remapped to remapped_id) for each
unique range of flux ranks (e.g., 0-3).

resource_namespace_remapper_t stores each piece of
remapping data using distinct_range_t as the key to a
std::map member object. In other words, we treat each
distinct rank range that shares the same remapping
info as an equivalence set and keep only one copy
of the remapping info.

Furthermore, the distinct_range_t class implements
"operator<()" so that this mapping info can be kept
in std::map in ascending rank-range order,
which then allows data querying to be performed
in O(logN) time.

However, the method for querying is the query() method,
whose semantics are per individual rank.
To support retrieval of remapping information
for an individual rank, we implement "operator<()"
in such a way that we can find the corresponding
record when the rank intersects a rank-range key.
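
A minimal standalone sketch of the interval-as-key idea
described above, using a hypothetical range_t rather than
the actual distinct_range_t implementation, and assuming
the stored rank ranges are disjoint: two ranges compare
equivalent when they intersect, so a point lookup with a
degenerate range finds the stored range containing it in
O(logN) via std::map::find.

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>

    struct range_t {
        uint64_t lo, hi;                 // inclusive rank range, e.g., 0-3
        bool operator< (const range_t &o) const {
            return hi < o.lo;            // "less than" only if strictly before
        }
    };

    int main ()
    {
        // rank range -> (type name -> (reference Id -> remapped Id))
        std::map<range_t,
                 std::map<std::string, std::map<uint64_t, uint64_t>>> remap;
        remap[{0, 3}]["gpu"][0] = 2;     // for ranks 0-3, gpu0 -> gpu2
        remap[{0, 3}]["gpu"][1] = 3;     // for ranks 0-3, gpu1 -> gpu3

        // Query rank 2: the degenerate range {2, 2} intersects {0, 3}, so
        // neither key compares less than the other and find () returns it.
        auto it = remap.find (range_t{2, 2});
        if (it != remap.end ())
            std::cout << it->second["gpu"][1] << std::endl;   // prints 3
        return 0;
    }
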
Make resource_namespace_remapper available for
all reader classes by integrating it into
the base reader class as a public member object.

Problem: the hwloc reader of a nested flux instance
discovers the GPU resources using hwloc. Hwloc uses
the logical Id space for each discovered GPU
resource, which does not work with GPU affinity
at execution time (i.e., CUDA_VISIBLE_DEVICES).

Solution: if remap information has been established,
the hwloc reader remaps the logical Id of each
self-discovered GPU device to the remapped Id.
sched-fluxion-resource will be modified
to establish this remap using the RV1 execution
key coming from core's resource.acquire interface.

Establish GPU Id remapping by using the RV1 execution
key coming from core's resource.acquire interface.
This is used by the hwloc reader to remap the
self-discovered GPU Ids, which are logical,
into physical Ids.

Problem: it is nearly impossible to test the hwloc
reader's GPU Id remapping on non-GPU systems
using fake hwloc xml files, which makes it
difficult to develop a CI test.

Add an ns-info (namespace info) RPC to
sched-fluxion-resource to support further testing.
This RPC returns the remapping information
corresponding to <rank, resource-type-name, ref-id>.

Add a subcommand (ns-info Rank Type Id) to
flux-ion-resource to help invoke the ns-info
RPC within sched-fluxion-resource.
@dongahn dongahn added the merge-when-passing label (mergify.io - merge PR automatically once CI passes) Dec 17, 2020

dongahn commented Dec 17, 2020

FYI -- CI failed because flux-core picked up the comma-separated list format fix for CUDA_VISIBLE_DEVICES. I will quickly adjust the failing test. flux-framework/flux-core#3376

Use a Sierra hwloc xml which has 4 GPUs.

Get a nested allocation using flux mini alloc
with 2 GPUs and run
  - flux ion-resource ns-info 0 gpu 0
  - flux ion-resource ns-info 0 gpu 1
to check what the self-discovered GPU
Ids 0 and 1 are remapped to.

When we use the high-Id-first match policy,
they should map to 2 and 3 because
the Fluxion scheduler will select GPUs 2 and 3
(i.e., CUDA_VISIBLE_DEVICES=2,3 passed into
the nested instance).

Similarly, when we use the low-Id-first match
policy, they should map to 0 and 1.

codecov bot commented Dec 17, 2020

Codecov Report

Merging #773 (45c4143) into master (a8fe07c) will decrease coverage by 0.0%.
The diff coverage is 69.7%.

@@           Coverage Diff            @@
##           master    #773     +/-   ##
========================================
- Coverage    73.5%   73.4%   -0.1%     
========================================
  Files          79      81      +2     
  Lines        8173    8316    +143     
========================================
+ Hits         6011    6109     +98     
- Misses       2162    2207     +45     
Impacted Files Coverage Δ
resource/readers/resource_reader_base.hpp 100.0% <ø> (ø)
resource/readers/resource_reader_hwloc.cpp 80.2% <40.0%> (-2.5%) ⬇️
resource/readers/resource_namespace_remapper.cpp 66.1% <66.1%> (ø)
resource/modules/resource_match.cpp 74.8% <77.1%> (+<0.1%) ⬆️
resource/readers/resource_namespace_remapper.hpp 100.0% <100.0%> (ø)

@mergify mergify bot merged commit 59210e0 into flux-framework:master Dec 17, 2020

dongahn commented Dec 17, 2020

OK, that did the trick. Thanks @milroy and @SteVwonder for your thorough review!

@SteVwonder

Thanks @dongahn for putting this together!
