add method to query HW size #630
base: master
Conversation
Signed-off-by: Jeff Hammond <jehammond@nvidia.com>
Force-pushed from 19de193 to 8bd9160.
Signed-off-by: Jeff Hammond <jeff.science@gmail.com>
This is gross, but I need #define _GNU_SOURCE before any other headers, or the CPU_ISSET macro isn't defined. Signed-off-by: Jeff Hammond <jeff.science@gmail.com>
Signed-off-by: Jeff Hammond <jeff.science@gmail.com>
Signed-off-by: Jeff Hammond <jeff.science@gmail.com>
Signed-off-by: Jeff Hammond <jeff.science@gmail.com>
Sorry @jeffhammond, I hijacked the PR description to flesh out some ideas related to the larger problem of thread pinning.
To be clear, the final version will not abort. But I could not figure out how to set the BLIS threading variables correctly. That's why there's a preprocessor warning saying "please help me".
@jeffhammond once we're in the OMP region, it's kind of too late to do anything except, as in
@jeffhammond can you walk me through the logic of the PR as implemented? I'm not sure how the CPU mask alone can reliably be used to detect oversubscription.
I agree. We can do slightly better than serialization if nested is enabled, but I don't think that's an important thing to spend time on.
```
@@ -199,6 +200,12 @@ void bli_l3_thread_decorator_thread_check
     )
{
    dim_t n_threads_real = omp_get_num_threads();
    dim_t n_threads_hwmask;
    if ( omp_in_parallel() ) {
```
@devinamatthews here, we look for the number of HW threads available to the thread if in OpenMP already, or the process, if not in OpenMP already, and if the user is trying to use more SW threads than HW threads, we abort.
Right, but this requires the user to have set the CPU mask (or OMP affinity?) correctly. What if I just run BLIS in OpenMP (with nesting) and set both OMP_NUM_THREADS and BLIS_NUM_THREADS equal to the number of cores? Or is this only meant to guard against a more specific use case?
```
    bli_thrcomm_init( n_threads_hwmask, gl_comm );
    bli_rntm_set_num_threads_only( n_threads_hwmask, rntm );
#warning HELP ME HERE
    bli_rntm_set_ways_only( 1, 1, 1, 1, 1, rntm );
```
@devinamatthews this is the part where I need help. I can just do serialization, but I was thinking about trying to do slightly better, e.g., if I have 40 cores and the user wants to run on 80 threads, I can set things up to use 40 threads properly.
Reducing the number of active threads after thread creation is possible, but not easy. If this logic can go before the OpenMP region then things are much easier.
The affinity-mask HW thread count is compared to the user-specified SW thread count.
Ah, OK, so the user could in principle reduce the number of HW cores "visible" to BLIS by setting a mask, but if no action is taken then BLIS just sees all cores. OK.
@jeffhammond sorry to be dense, but I still don't see how this check fixes the oversubscription problem in #588. Each "instance" of BLIS would still pass the check, but the total number of threads is more than the number of cores. Detecting nested OMP and dropping down to one core could work, but would have to be configurable. Otherwise, maybe we could check
I've asked @egaudry to test with his application, which may be setting affinity masks via MPI or HWLOC. But I can say that it works to detect oversubscription for me. If I set OMP_NUM_THREADS=16 on an Intel CPU with 8 HW threads, it aborts. When I set OMP_NUM_THREADS=8, it's fine. I added a test that can do other things.
Yes, that should work fine, but setting OMP_NUM_THREADS=BLIS_NUM_THREADS=4 with nesting would not abort and yet would oversubscribe. Isn't this the more pressing problem?
Here are my suggested solutions:
Also, I suggest an environment variable |
The token idea doesn't work if BLIS runs alongside other things that use threads. But I'll see what I can do. I disagree that nesting is the higher priority. The bug that motivated all of this was not because of nested OpenMP. |
@egaudry @jeffhammond the issue in #588 was explained here. How is that not nested OMP? |
Okay, it was nested OpenMP but also affinity. I can fix them both.
Sorry, I haven't had the time to check on my side. @jeffhammond is correct that the issue was seen because an affinity mask was set. To clarify, the slowdown occurred because more threads than available physical cores were used in BLIS within a region protected by an OpenMP lock.
This is my first step towards addressing the oversubscription encountered in #588.
This adds a utility to figure out how many CPUs (hardware threads) are available in the affinity mask of the calling context. This is a stronger restriction than the number of available software threads.
@egaudry
Signed-off-by: Jeff Hammond jehammond@nvidia.com
Open questions (not all need to be addressed in this PR):
- Should the check also compare against the OpenMP limit (e.g. omp_get_max_threads())?
- Should BLIS abort when n_threads > n_hw_cores?