pthreads doesn't seem to work #288
Comments
@Lephar Sorry to hear about your frustrations with getting multithreading to work as intended, and thanks for reaching out to us. Way 1 should have worked. Way 2 should have also worked, though the efficiency might not have been great. The third way also should have worked. You did not indicate the dimensions of your Also, can you give me some information about your hardware? Also, what compiler, and what version of that compiler, are you using? It sounds like you're using Linux, so that rules out a lot that can typically go wrong. I have some other ideas/things to try, but let's start with this for now.
Hello, you are right, those are some important details I skipped. m = n = k = 4000: both are 4000x4000 square matrices filled with random data. It takes around 2.7 to 2.8 seconds on a single thread of my Intel i7 8750H (6 cores, 12 threads). I also watched CPU usage during the calculations, and it stays around 8% the whole time (about 1/12 of CPU capacity). Both the library and the program are compiled with the same gcc version, gcc 8.2.1 (x64). I included blis/blis.h and linked with the -lblis option on gcc. I can also upload the full code when I have access to my PC. Oh, and it is nothing near frustration, this is an awesome library :D
@Lephar Thanks for your kind words about BLIS. I'm glad you sound mostly satisfied, this hiccup aside. Before I suggest what to try next, a few comments:
Now, some things to try. First, I'd like to standardize the test environment, so I'm going to ask you to run the testsuite. With BLIS built (and configured with
The testsuite is comprehensive, but you don't need to run the whole thing. Instead, you can limit which tests are run by editing
I changed the first line to a
And to smooth the results, let's take the best of three trials by changing line 11 to:
Now, I'd like you to set
Now run the testsuite:
The first crucial portion of the output ends around line 66:
This confirms that BLIS used the parallelization scheme I specified. The second crucial part is the actual performance at the end of the output. Here's what I get on my 4-core Haswell system:
This performance is about right given that single-threaded Let me know what you see in the testsuite output, and that might tell us more about what's going on. PS: Another interesting data point would be to go through all of the motions above, with the only difference being that you build BLIS via
One more important thing: we need a reference point from which to measure speedup. So after running with
(Alternatively, you can
That was very helpful, thank you for your time. I did exactly as you said and got some interesting results;
So it is definitely working. I also tried some extra cases and found that 6 threads are indeed optimal, as you said. The gflops values of the multithreaded tests varied by about ±20 between executions, even when I set the number of repeats higher than 3, but I still got the idea.
This could be because BLIS does not make any attempt to bind threads to cores via CPU affinity when configured with pthreads. Unfortunately, pthreads has no "native" mechanism for specifying affinity; you would have to call an operating system function such as I am planning to add a section to the Multithreading.md documentation that discusses affinity, particularly via OpenMP.
Yes, this behavior is intentional. We had to decide which variable(s) would take precedence if both ways were employed (automatic and manual). We decided that any specification of the manual way should override the automatic way. Sorry you had to discover this empirically. This policy was a deliberate choice on our part, and it should have been included in the Multithreading.md documentation. I am planning to add a paragraph on the topic. I will look into the remaining issue regarding
@Lephar I meant to comment on this in my previous reply: I found this to be a bit surprising, though entirely believable. Sometimes, oversubscribing threads relative to the number of physical cores causes downright awful performance, and it seems like that was happening in your case. (@dnparikh Your initial intuition looks correct in this case: his over-subscription did choke the CPU.) I think I figured out the problem with
@Lephar Happily, I realized that our
@Lephar Also, please try out 93d5631, which hopefully contains a permanent fix to the issue. It also includes additions to the Multithreading.md documentation. Thanks to users like you, we are able to find little issues like this that might otherwise go unnoticed. We sincerely appreciate your feedback. :)
Yeah, I was expecting a performance penalty (or inconsistency between runs) caused by cache invalidation when using multithreading; I was just not expecting it to happen at around 8 to 16 threads on 4000x4000 matrices. But these operations work on very contiguous memory after all, so I should have realized they would be affected by locality much more than other operations.
Well, this is the logical behaviour. It was my oversight to set them to defaults instead of unsetting them.
Yes, adding
Great. I'll consider this issue closed. If you encounter any other problems, simply open another issue and let us know what you're seeing. I was about to invite you to join our blis-devel mailing list, but then I noticed you were already a member. :) Thanks for your interest in BLIS. Please keep in touch.
After following the multithreading docs and several failed attempts, I decided to open this issue, but the source of the problem may very well be my lack of understanding of the library internals. In any case, it would be a good idea to add some simple multithreading examples to examples/tapi, since there are no complete code examples in Multithreading.md. There are not many tutorials or much documentation on the web either, since this is a relatively new library.
Anyway, bli_dgemm() works perfectly fine on a single core. But none of the ways specified in the multithreading docs has any effect on the computation time, and bli_thread_get_num_threads() always returns -1. Here is what I did:
Configured as follows before make (also tried with auto instead of x86_64, and --enable-threading=pthreads instead of -t pthreads)
Way 1:
Way 2:
Also added bli_thread_set_num_threads() to my source and tried with and without the environment variables:
The code compiles and runs fine, with no syntax or linking errors, and still produces correct results on the matrices; the problem is that threading has no effect at all. Tested on Arch Linux (4.19 kernel) with both the AUR package and the 0.5.0 version downloaded from this GitHub page. Same results.