multithread the MD5/IQM model CPU code using OpenMP #1838
base: master
For now there is a code that is guarded by a |
Force-pushed from 8151314 to cf6377b.
I tested on 2-thread, 8-thread and 32-thread machines. On the 2-thread and 8-thread machines, using all the threads gave the best performance, while on the 32-thread machine performance improved when adding threads up to 16 and then dropped beyond 16, so I capped the automatic thread detection at 16. I assume that past 16 threads the thread management becomes too costly and cancels the benefit of dispatching the work. The cvar range allows up to 32 threads for those who want to test this.
Hmm, no, the 8-thread machine performs better with 6 threads, so I'll add a more complex heuristic.
Force-pushed from 0cf9734 to a89f545.
On a machine with 8 cores and only running the game so the framerate is more stable, with the parallel code for that part I get I'll drop the legacy sequential code for that part as well. |
Well, no, I still had a doubt, so I used some hud to draw a framerate curve, and added a cvar to switch the code, and this doesn't change anything. One problem is that I probably don't run that code at all. And I added a logger, it never prints anything. Anyway the code isn't as heavy as |
Force-pushed from 542bda9 to c2e029c.
I noticed something very interesting on that 8-core machine, which is a laptop. By default without the threading it does |
Using that same 8-core laptop, when using the |
Force-pushed from 7e7418d to 0a5f50e.
While I was at it, I parallelized some parts of the MD3, MD5 and IQM loading code as well. The MD3 parallelization makes the diff a bit noisy because, unlike the MD5 and IQM code that I cleaned up a long time ago, the MD3 code was full of reused function-scope variables (“global to functions”) that would create race conditions once the code is parallelized.
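To illustrate the kind of race described above, here is a minimal sketch. It is not the actual MD3 loader: `Vec3` and `ScaleVertices` are hypothetical names. A scratch variable declared once at function scope would be shared by every thread of a parallel loop, while declaring it inside the loop body makes it private to each iteration.

```cpp
#include <vector>

// Hypothetical vertex type, for illustration only.
struct Vec3 { float x, y, z; };

// Scales every vertex in place. The scratch variable lives inside the loop
// body, so each iteration (and therefore each thread) gets its own copy.
// Declaring `scaled` before the loop instead would share it between threads
// and create a data race once the pragma is active.
void ScaleVertices( std::vector<Vec3>& verts, float scale )
{
	// Signed index: older OpenMP implementations require it.
	const long count = static_cast<long>( verts.size() );

	// Ignored by compilers that are not building with OpenMP enabled.
	#pragma omp parallel for
	for ( long i = 0; i < count; i++ )
	{
		Vec3 scaled = verts[ i ]; // private to this iteration
		scaled.x *= scale;
		scaled.y *= scale;
		scaled.z *= scale;
		verts[ i ] = scaled;
	}
}
```

Each iteration only touches its own element, so the result is the same with or without OpenMP.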
Force-pushed from 8394d74 to b51db2f.
Now that I think about it, it's probably possible to template the chunking as well. |
The latest version just uses OMP as a basic thread pool. You can find various simple thread pool implementations that are just a couple hundred lines of code, so I will try hooking up the code to one of those to see if we can drop the dependency. I bet the problem with your non-library-using chunked implementation was just that it spends too much time creating and destroying threads, which is solved by a thread pool. |
That's welcome, but if for some reason other implementations don't perform as well as OMP, libgomp isn't really an annoying dependency on either Linux or MSYS2.
So I roughly tested it by using the number of vertices as the number of chunks, with chunks of size 1. I have a hard time seeing a difference on my workstation using 16 threads, but that's because I have other things running alongside; both (chunked or not) currently run at 400~410 fps. On my 8-thread laptop I see a small difference: the chunked implementation sometimes tops at 183 fps, while the non-chunked implementation doesn't go higher than 179 fps. I reproduced this multiple times.
Force-pushed from b51db2f to 6a37cac.
I unchunked the code. If we want to investigate chunking, we can do it later, and if we do it we should do it in the template instead. |
Doing that simplified the code and I finally topped at 182 fps on the 8-thread laptop with the unchunked code. So I guess we don't have to care about chunking it. |
Force-pushed from 6a37cac to 8d455c0.
I tried the compiler's built-in loop parallelization with |
Also I removed the load-time OMP commits from that branch. I don't think that's worthwhile because ~97% of model loading time is spent on textures; the vertex data is hardly worth optimizing. We should try to avoid incurring costs of OMP when we are not actually going to use it. If we have fully GPU-based vertex skinning, we shouldn't start up the threads. And we shouldn't link OMP into the server which doesn't use it. |
I decided not to bother trying the thread pool since the pragma-based approach with OMP actually seems the least intrusive: that way, there are no lambdas which would make the single-threaded version less efficient. And the amount of extra code is minimal. Also MSVC supposedly implements OMP, so I will try that later. |
Ah yes, since I don't chunk anymore, we don't need a lambda either. However, you're not setting the thread count before running the loop, and I noticed that when it isn't set right before the loop, the number of threads being used is unpredictable.
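As a rough sketch of the pragma-based approach with the thread count set right before the loop (an illustration under assumptions, not the code from either branch): `ScaleWeights` is a hypothetical function, while `omp_set_num_threads()` and the `_OPENMP` macro are standard OpenMP.

```cpp
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#endif

// Doubles every weight in place, requesting `threads` workers for the loop.
void ScaleWeights( std::vector<float>& weights, int threads )
{
	const long count = static_cast<long>( weights.size() );

#ifdef _OPENMP
	// Request the team size right before the loop.
	omp_set_num_threads( threads > 0 ? threads : 1 );
#else
	(void) threads; // sequential build: the value is unused
#endif

	// Without OpenMP this pragma is ignored and the loop runs sequentially.
	#pragma omp parallel for
	for ( long i = 0; i < count; i++ )
	{
		weights[ i ] *= 2.0f;
	}
}
```

No lambda is involved, so the single-threaded build compiles to a plain loop.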
Changing the number of threads at runtime wouldn't work yet on my branch, but |
I don't know if that's related, but with |
How are you determining that "the amount of threads being used is unpredictable"? It makes sense that turning on |
By printing the output of In my previous experiments I got very weird things, like |
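One way to observe the team size is sketched below. This is an assumption about the method, since the call referred to in the comment above is not preserved here. `CountActiveThreads` is a hypothetical helper; `omp_get_num_threads()` is the standard OpenMP call that reports the size of the current team when called inside a parallel region.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the number of threads that actually ran a parallel region,
// or 1 when built without OpenMP.
int CountActiveThreads()
{
	int active = 1;

	#pragma omp parallel
	{
#ifdef _OPENMP
		// Only one thread of the team writes the shared result.
		#pragma omp single
		active = omp_get_num_threads();
#endif
	}

	return active;
}
```

Logging this value once per frame or per model would show whether the runtime honours the requested count.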
Force-pushed from 0dded86 to 3baf36d.
I removed the |
I also removed the commits parallelizing the loading code; that can be discussed later.
On the 8-thread laptop I now top at 185 fps, the frametime curve is much smoother, throttling starts later, and the framerate drop due to throttling is more gradual (it keeps the higher framerates much longer).
Yes, if we can use OMP that would be very good, it's very easy to integrate in our code, and the code just builds without problem when OMP is missing. |
Just as a test I commented out the |
Force-pushed from 3baf36d to 1138716.
OK, it works if I do it in |
I added a log line in |
Force-pushed from 1138716 to 8355d47.
I added the CMake stuff for MSVC. |
It doesn't work yet on Windows, because The other thing I am asking is to avoid adding the I found that luckily the threads are not really started if they are not used (with GNU OpenMP at least), so it is not important to avoid requesting them when GPU vertex skinning is fully supported. |
That doesn't sound like guaranteed behaviour. |
`Omp` facilities in framework
For now it is disabled by default; one should use the `-DUSE_OPENMP=ON` cmake option to enable it.
In the future I plan to progressively enable it:
- we would have to modify the release validation script to accept the fact the executable depends on `libgomp.so`. The `libgomp.so` library is as standard as the glibc so it's fine.
- we have to modify the release validation script to accept the fact the executable depends on `libgomp.dll`, and we would have to modify the release build script to package `libgomp.dll`. The `libgomp.dll` is provided by MSYS2 so it's fine.
I don't plan to enable it on macOS as I've heard that macOS doesn't ship the LLVM's `libomp.so` by default.
Such enablement will be done on later PRs.
The purpose of adding OpenMP support is to optionally speed up operations with it; the same operations should still work without it.
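A minimal sketch of what such an optional facility could look like, assuming a small wrapper that falls back to single-threaded answers when OpenMP is not compiled in. The `OmpSketch` namespace and its function names are hypothetical and are not the engine's actual `Omp` API; `_OPENMP` and `omp_get_max_threads()` are standard OpenMP.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

namespace OmpSketch // hypothetical name, for illustration only
{
	// True when this translation unit was compiled with OpenMP enabled.
	constexpr bool IsEnabled()
	{
#ifdef _OPENMP
		return true;
#else
		return false;
#endif
	}

	// Upper bound on the threads a parallel region may use; 1 without OpenMP.
	inline int GetMaxThreads()
	{
#ifdef _OPENMP
		return omp_get_max_threads();
#else
		return 1;
#endif
	}
}
```

Callers never need their own `#ifdef`: they ask the wrapper and get a sequential answer when the feature is off.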
Then it implements parallelization of the MD5 and IQM CPU code.
This was investigated on:
It uses a chunked implementation, as tests demonstrated it was the fastest one.
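For reference, chunking here means splitting the per-vertex work into contiguous ranges, one per worker. The sketch below shows one common way to compute such ranges; `ChunkRange` is a hypothetical helper, not the code from this branch.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Returns the half-open range [begin, end) covered by chunk `index`
// when `count` items are split into `chunks` nearly equal parts.
std::pair<std::size_t, std::size_t> ChunkRange( std::size_t count, std::size_t chunks, std::size_t index )
{
	assert( chunks > 0 && index < chunks );

	const std::size_t base = count / chunks;  // minimum items per chunk
	const std::size_t extra = count % chunks; // the first `extra` chunks get one more item

	const std::size_t begin = index * base + ( index < extra ? index : extra );
	const std::size_t end = begin + base + ( index < extra ? 1 : 0 );

	return { begin, end };
}
```

Each worker then processes only the vertices in its own range, so no two workers write to the same element.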
Using a beefy computer with 16 threads enabled, I got this performance difference with the chunked implementation on the same heavy scene:
91fps
438fps
Of course, the performance difference is expected to be lower on older CPUs, which usually run alongside older GPUs whose limitations enforce that CPU codepath, but it is now demonstrated that such parallelization scales well. This can move some devices from the `slow` to the `playable` category, or from the `playable` to the `passed` category.
A good way to test that is to follow those instructions:
This will spawn the human player and move it to the alien base entrance, where all the IQM buildable models from the alien base will be rendered because they are in vis, with at least two IQM animated acid tubes actually in direct sight, plus the MD5 first-person rifle in the foreground. Starting from that, one can also shoot the acid tubes and empty the rifle magazines to play additional animations: the acid tube death and the rifle first-person shoot and reload.
One can test various numbers of threads this way:
The `0` default value lets the engine pick the number of threads by itself; other values enforce the number of threads.
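The selection rule can be sketched as follows. The limits come from the discussion (automatic detection capped at 16 threads, explicit values allowed up to 32). `ResolveThreadCount` and the constant names are hypothetical, and the later, more complex heuristic mentioned in the conversation is not reproduced here.

```cpp
#include <algorithm>
#include <thread>

// Assumed limits, taken from the discussion rather than from the code.
constexpr int kAutoThreadCap = 16; // cap for automatic detection
constexpr int kMaxThreads = 32;    // upper bound for explicit values

// Pure helper: `requested` is the setting value, `hardware` the detected thread count.
int ResolveThreadCount( int requested, int hardware )
{
	if ( requested <= 0 )
	{
		// 0 means "let the engine pick": use the hardware count, capped,
		// and never go below one thread.
		return std::max( 1, std::min( hardware, kAutoThreadCap ) );
	}

	// Explicit values are enforced, within the allowed range.
	return std::min( requested, kMaxThreads );
}

int ResolveThreadCount( int requested )
{
	// hardware_concurrency() may return 0 when the value is unknown.
	return ResolveThreadCount( requested, static_cast<int>( std::thread::hardware_concurrency() ) );
}
```

Keeping the hardware count as a parameter makes the rule easy to check for machines other than the one running the test.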