hwloc and LIKWID Versions #682
To be more precise: with likwid@4.3.2 and hwloc@1.11.11 I get
|
Without likwid but with hwloc@2.0.2 I get
|
@Spielix You're right, we should document the minimum requirements. We should also support hwloc v2 eventually. I'll see what I can do. In the meantime, do you need support for either likwid or hwloc-2? |
Like I said, first of all I would like to know what difference it makes to have these. I mean, when one knows a library, one may be able to imagine what it might be used for in DASH, but not every use case one can imagine may be implemented (yet), etc. |
@Spielix Ahh I see, sorry I didn't fully grasp your question. You should be able to safely build DASH without these two libraries. They are mainly used to query information on the machine you're running on, e.g., the number of cores and the size of memory. In most cases, none of that is crucial for using DASH, though, and DASH will fall back to Linux APIs to query some of this information if neither Likwid nor hwloc is available. You're safe disabling both Likwid and hwloc entirely and building without them... |
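For readers wondering what such a Linux fallback can look like: the following is a minimal sketch, not DASH's actual code, of querying the core count and memory size through `sysconf`. The `demo_*` function names are hypothetical, and `_SC_NPROCESSORS_ONLN` and `_SC_PHYS_PAGES` are glibc extensions rather than strict POSIX.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: number of cores currently online, or -1 on failure. */
long demo_num_cores(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Hypothetical helper: physical memory in bytes, or -1 on failure. */
long long demo_mem_bytes(void)
{
    long pages     = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages < 0 || page_size < 0) {
        return -1;
    }
    return (long long)pages * (long long)page_size;
}

/* Prints both values; intended to be called once at startup. */
void demo_print_machine_info(void)
{
    printf("cores online: %ld\n", demo_num_cores());
    printf("physical memory: %lld bytes\n", demo_mem_bytes());
}
```

This only covers the coarse numbers mentioned above. Topology details such as which cores share a cache or a NUMA node are the kind of information a library like hwloc adds on top.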
When I build using hwloc@1.11.11 and run dash-test-mpi with e.g.
When I leave away the |
I would love to get some comment on this. Is this
|
It is hard to say what is going on just from the error you posted. Could you give some more information about your platform, your MPI, and which test exactly fails? |
Well, you not knowing where it comes from is pretty much enough information for me to just drop hwloc. In case you want to dig further:
|
It seems that the locality part of the runtime trips over something in your setup. Unfortunately, I do not know enough about that part to quickly figure things out. Here is the relevant code (https://github.com/dash-project/dash/blob/development/dart-impl/base/src/locality.c#L611):
    if (group_subdomain_tag_len <= group_parent_domain_tag_len) {
      /* Indicates invalid parameters, usually caused by multiple units
       * mapped to the same domain to be grouped.
       */
      DART_LOG_ERROR("dart__base__locality__domain_group ! "
                     "group subdomain %s with invalid parent domain %s",
                     group_subdomain_tags[sd], group_parent_domain_tag);
AFAICS, the hwloc part is only really relevant if you plan to split teams based on hardware information (grouping all units on one node into a team for example). If not, it's safe to ignore hwloc... |
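To make the length check in the quoted snippet easier to follow, here is a small standalone sketch of the idea. It assumes that domain tags are dot-separated paths such as `.0` and `.0.1`, so a subdomain's tag must be strictly longer than its parent's. That tag format and the function name `demo_is_strict_subdomain` are assumptions for illustration, not taken from the DART sources.

```c
#include <string.h>

/* Returns 1 if subdomain_tag names a domain strictly below parent_tag,
 * 0 otherwise. Assumes dot-separated tags such as ".0" and ".0.1". */
int demo_is_strict_subdomain(const char *parent_tag, const char *subdomain_tag)
{
    size_t parent_len = strlen(parent_tag);
    size_t sub_len    = strlen(subdomain_tag);

    /* Same kind of test as in the quoted code: a subdomain tag that is
     * not longer than the parent tag cannot lie below the parent. */
    if (sub_len <= parent_len) {
        return 0;
    }
    /* The parent tag has to be a prefix of the subdomain tag ... */
    if (strncmp(parent_tag, subdomain_tag, parent_len) != 0) {
        return 0;
    }
    /* ... and the prefix has to end at a component boundary. */
    return subdomain_tag[parent_len] == '.';
}
```

Under these assumptions, two units mapped to the same domain would produce a subdomain tag equal to the parent tag, which fails the length test. That matches the situation the comment in the quoted code describes.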
Maybe @fuchsto can shed some light on what is going wrong here? |
@Spielix You mentioned that the test run takes significantly longer if you place four units on the same node. That is surprising because most of the tests are single-threaded. Can you make sure that the processes are not bound to the same core? Can you try running with The amount of output is expected, that's the normal test output. |
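One way to check the binding from inside a process on Linux is to read the `Cpus_allowed_list` field from `/proc/self/status`. The sketch below assumes a Linux system with `/proc` mounted; the function name `demo_allowed_cpu_list` is hypothetical and not part of DASH or MPI.

```c
#include <stdio.h>
#include <string.h>

/* Copies the kernel's list of CPUs this process may run on (for example
 * "0-7") into out. Returns 0 on success, -1 if the field was not found. */
int demo_allowed_cpu_list(char *out, size_t out_size)
{
    static const char key[] = "Cpus_allowed_list:";
    char line[512];
    int rc = -1;

    if (out == NULL || out_size == 0) {
        return -1;
    }
    FILE *fp = fopen("/proc/self/status", "r");
    if (fp == NULL) {
        return -1;
    }
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (strncmp(line, key, sizeof(key) - 1) != 0) {
            continue;
        }
        /* Skip the whitespace between the key and the value. */
        const char *value = line + sizeof(key) - 1;
        while (*value == ' ' || *value == '\t') {
            ++value;
        }
        /* Copy the value without the trailing newline, truncating if needed. */
        size_t len = strcspn(value, "\n");
        if (len >= out_size) {
            len = out_size - 1;
        }
        memcpy(out, value, len);
        out[len] = '\0';
        rc = 0;
        break;
    }
    fclose(fp);
    return rc;
}
```

If every unit prints this list at startup and all four report the same single CPU, the processes are pinned to one core, which would explain a slowdown even for single-threaded tests.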
The EDIT: There is roughly a factor of two in runtime. |
This node has 2x Intel Xeon Silver 4110 @2.10GHz. With error it takes about 3 minutes, without it takes about 1.5 minutes. |
@Spielix If you don't specify |
I guess the test that is failing is one of the ones not being run with only one slot? |
It wouldn't be too surprising if this was a setup issue, as the admins are mostly working on single nodes. We had problems with the MPI setup before. Although I would have thought that this would only show when one uses more than one node... |
Can you try to launch one unit per node to see if the issue persists there? (if multi-node runs are part of your use-case) |
I can, but the only type I have 4 nodes of is knl, so the single cores are very slow (the network is slow, too). When I tried it I got a seemingly different error. |
As you can see in the new issue I have found out what stopped the non-hwloc run from working (I thought that it couldn't be that slow/that much to test). I actually didn't use the right branch. With the |
Fix compilation issues #682 hwloc 2.x and likwid
It would be very cool to have minimum/maximum version numbers for these in the README. For both libraries there can be build problems if the wrong version is loaded. E.g. the build fails with hwloc@2.0.2 but works with hwloc@1.11.11. If I then add likwid@4.3.2, there are build errors again. I have no other versions of likwid installed, and I would like to know which one will work (newer or older than 4.3.2) before I try to build likwid.
Also, it would be nice to know exactly how DASH benefits from these (and other) libraries, to know if they are worth the effort for a given project.
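Documented bounds could also be checked mechanically. The sketch below compares `major.minor.patch` strings numerically, which matters because a plain string comparison would rank a hypothetical `1.9.0` above `1.11.11`. The type and function names are hypothetical. The only data points taken from this report are that hwloc@1.11.11 builds and hwloc@2.0.2 does not; the real supported range is not known here.

```c
#include <stdio.h>

/* Hypothetical version triple; missing components default to 0. */
typedef struct {
    int major;
    int minor;
    int patch;
} demo_version;

/* Parses "major[.minor[.patch]]". Returns 0 on success, -1 on failure. */
int demo_parse_version(const char *text, demo_version *out)
{
    out->major = 0;
    out->minor = 0;
    out->patch = 0;
    int fields = sscanf(text, "%d.%d.%d", &out->major, &out->minor, &out->patch);
    return fields >= 1 ? 0 : -1;
}

/* Returns a negative, zero, or positive value if a is older than,
 * equal to, or newer than b. */
int demo_compare_versions(demo_version a, demo_version b)
{
    if (a.major != b.major) {
        return a.major < b.major ? -1 : 1;
    }
    if (a.minor != b.minor) {
        return a.minor < b.minor ? -1 : 1;
    }
    if (a.patch != b.patch) {
        return a.patch < b.patch ? -1 : 1;
    }
    return 0;
}
```

A build script could parse the installed version and an upper bound such as `2.0.0`, then refuse to enable the library when the comparison is not negative.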