
Conversation

illwieckz
Member

illwieckz commented Oct 2, 2025

  • Add Omp facilities in framework
  • Use OpenMP to multithread the MD5/IQM model CPU code

For now it is disabled by default; one should use the -DUSE_OPENMP=ON CMake option to enable it.

In the future I plan to progressively enable it:

  • Expected: Enable it on Linux with GCC;
    we would have to modify the release validation script to accept that the executable depends on libgomp.so.
    The libgomp.so library is as standard as glibc, so it's fine.
  • Probable: Enable it on Windows with MinGW;
    we would have to modify the release validation script to accept that the executable depends on libgomp.dll,
    and we would have to modify the release build script to package libgomp.dll.
    The libgomp.dll library is provided by MSYS2, so it's fine.

I don't plan to enable it on macOS, as I've heard that macOS doesn't ship LLVM's libomp by default.

Such enablement will be done in later PRs.

The purpose of adding OpenMP abilities is to make it possible to optionally speed up operations with it, while the same operations keep working without it.
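To make that concrete, here is a minimal sketch (not the actual engine code; the function and data are made up for illustration): with OpenMP enabled the pragma splits the loop across threads, and without OpenMP the pragma is simply ignored, so the exact same loop runs sequentially.

```cpp
// Minimal sketch, not the actual engine code. With -fopenmp (GCC/Clang) or
// /openmp (MSVC) the loop iterations are split across threads; without OpenMP
// the pragma is ignored and the same loop runs sequentially, so the result is
// identical either way.
#include <cstddef>

void ScaleVertexArray( float* positions, std::size_t numFloats, float scale )
{
	#pragma omp parallel for
	for ( long i = 0; i < static_cast<long>( numFloats ); i++ )
	{
		positions[ i ] *= scale; // each element is independent, no synchronization needed
	}
}
```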

It then parallelizes the MD5 and IQM model CPU code.

This was investigated on:

It uses a chunked implementation, as tests demonstrated it was the fastest one.
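For reference, a rough sketch of what a chunked dispatch can look like (hypothetical names, not the exact code from this PR): the range is split into one contiguous chunk per thread and each OpenMP thread processes its own chunk.

```cpp
// Rough sketch of chunked dispatch, with hypothetical names; the PR's actual
// implementation may differ. The range [0, count) is cut into numChunks
// contiguous chunks and each OpenMP thread runs the work function on one chunk.
#include <algorithm>
#include <cstddef>
#include <functional>

void RunChunked( std::size_t count, std::size_t numChunks,
                 const std::function<void( std::size_t begin, std::size_t end )>& work )
{
	numChunks = std::max<std::size_t>( 1, std::min( numChunks, count ) );
	std::size_t chunkSize = ( count + numChunks - 1 ) / numChunks; // round up

	#pragma omp parallel for
	for ( long c = 0; c < static_cast<long>( numChunks ); c++ )
	{
		std::size_t begin = c * chunkSize;
		std::size_t end = std::min( count, begin + chunkSize );
		work( begin, end ); // each chunk is independent of the others
	}
}
```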

Using a beefy computer with 16 threads enabled, I got this performance difference with the chunked implementation on the same heavy scene:

| Before | After  |
| ------ | ------ |
| 91fps  | 438fps |

Of course, the performance difference is expected to be lower on older CPUs, which usually run alongside older GPUs whose limitations enforce that CPU code path, but it is now demonstrated that such parallelization scales well. This can move some devices from the slow to the playable category, or from the playable to the passed category.

A good way to test this is to follow these instructions:

/set r_vboModels off
/devmap plat23
/team h; class rifle; delay 1s setviewpos 1920 1920 20 0 0

This will spawn the human player and move it to the alien base entrance, where all the IQM buildable models from the alien base will be rendered because they are in vis, with at least two animated IQM acid tubes in direct sight, plus the MD5 first-person rifle in the foreground. From there one can also shoot the acid tubes and empty the rifle magazines to play additional animations: the acid tube death and the first-person rifle shoot and reload.

One can test various amounts of threads this way:

/set common.ompThreads 4

The default value of 0 lets the engine pick the number of threads by itself; other values enforce that number of threads.

@illwieckz added the A-Renderer, T-Feature-Request (Proposed new feature) and T-Improvement (Improvement for an existing feature) labels on Oct 2, 2025
@illwieckz
Member Author

For now there is some code guarded by a NO_MT_IF_NO_TBNTOQ define. This is because that code doesn't use R_TBNtoQtangents(), and I want to test whether the code path that doesn't use R_TBNtoQtangents() is slow enough to benefit from the parallelism (it looks like it is, but I will test more).

@illwieckz force-pushed the illwieckz/multithreaded-cpu-model branch from 8151314 to cf6377b on October 2, 2025 at 01:17
@illwieckz
Member Author

illwieckz commented Oct 2, 2025

I tested 2-thread, 8-thread and 32-thread machines. On the 2-thread and 8-thread machines, maxing out the threads gave more performance, while on the 32-thread machine the performance kept going up when adding threads up to 16, then slowed down beyond 16 threads, so I capped the automatic thread detection at 16. I assume that past 16 threads the thread management becomes too costly and destroys the benefit of dispatching the work. The cvar range allows up to 32 threads for those who want to experiment with it.
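For illustration, a hedged sketch of that selection logic (hypothetical helper name, not necessarily the PR's exact code): an explicit cvar value is used as-is within the allowed range, while the automatic mode caps the detected hardware concurrency at 16.

```cpp
// Sketch of the thread selection described above, with a hypothetical helper
// name. A cvar value above 0 is used as-is (the cvar range allows up to 32),
// while 0 falls back to the detected hardware concurrency capped at 16, since
// more than 16 threads measurably hurt performance on a 32-thread machine.
#include <algorithm>
#include <thread>

int PickThreadCount( int cvarValue )
{
	if ( cvarValue > 0 )
	{
		return std::min( cvarValue, 32 );
	}

	int detected = static_cast<int>( std::thread::hardware_concurrency() );
	return std::clamp( detected, 1, 16 );
}
```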

@illwieckz
Member Author

I tested 2-thread, 8-thread and 32-thread machines. On the 2-thread and 8-thread machines, maxing out the threads gave more performance.

Hmm, no, on the 8-thread machine it performs better with 6 threads; I'll add a more complex heuristic then.

@illwieckz force-pushed the illwieckz/multithreaded-cpu-model branch 2 times, most recently from 0cf9734 to a89f545 on October 2, 2025 at 01:41
@illwieckz
Member Author

illwieckz commented Oct 2, 2025

For now there is some code guarded by a NO_MT_IF_NO_TBNTOQ define. This is because that code doesn't use R_TBNtoQtangents(), and I want to test whether the code path that doesn't use R_TBNtoQtangents() is slow enough to benefit from the parallelism (it looks like it is, but I will test more).

On a machine with 8 cores, running only the game so the framerate is more stable, I get 85 fps with the parallel code for that part and 80 fps with the legacy code. That confirms what I observed on my main machine (430 fps vs 410 fps, where the framerate was much less stable due to other applications running alongside, so there was room for doubt). The win isn't that big on that part, but it's measurable.

I'll drop the legacy sequential code for that part as well.

@illwieckz
Member Author

On a machine with 8 cores, running only the game so the framerate is more stable, I get 85 fps with the parallel code for that part and 80 fps with the legacy code. That confirms what I observed on my main machine (430 fps vs 410 fps, where the framerate was much less stable due to other applications running alongside, so there was room for doubt). The win isn't that big on that part, but it's measurable.

I'll drop the legacy sequential code for that part as well.

Well, no, I still had a doubt, so I used a HUD that draws a framerate curve and added a cvar to switch between the code paths, and it doesn't change anything. One problem is that I probably don't run that code at all.

And I added a logger; it never prints anything. Anyway, that code isn't as heavy as R_TBNtoQtangents() but it isn't cheap either, so I'll probably keep the parallelized code.

@illwieckz force-pushed the illwieckz/multithreaded-cpu-model branch from 542bda9 to c2e029c on October 2, 2025 at 02:18
@illwieckz
Member Author

I noticed something very interesting on that 8-core machine, which is a laptop. By default, without the threading, it does 65fps but the CPU isn't maxing out its temperature. The moment I enable the threading, the performance jumps to 140fps, but then the temperature maxes out and the performance slowly decreases until it reaches 85fps, where it stays (and the temperature isn't maxed out anymore).

@illwieckz
Member Author

illwieckz commented Oct 2, 2025

Using that same 8-core laptop with the powersave governor, to make sure the CPU doesn't throttle due to temperature (it is already at the lowest frequency anyway), enabling the parallelism switches from 1 thread to 6 threads and the performance jumps from a stable 16fps to a stable 40fps, which is exactly a 2.5× boost. That's good! And the temperature remains the same.

@illwieckz force-pushed the illwieckz/multithreaded-cpu-model branch 4 times, most recently from 7e7418d to 0a5f50e on October 2, 2025 at 05:00
@illwieckz
Member Author

illwieckz commented Oct 2, 2025

While I was at it, I parallelized some parts of the MD3, MD5 and IQM loading code as well. The parallelization of the MD3 loading code is a bit noisy in terms of diff size because, unlike the MD5 and IQM code that I cleaned up a long time ago, the MD3 code was full of reused “global to functions” variables that would create race conditions once the code is parallelized.
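To illustrate the hazard being described, a simplified example (made-up variables, not the actual MD3 loader): a file-scope scratch variable reused by the loop body works sequentially but becomes a data race once iterations run on several threads, while a per-iteration local does not.

```cpp
// Simplified illustration of the hazard described above, not the actual MD3
// loader code. A file-scope scratch variable shared across the loop body is a
// data race once iterations run on several threads; a local variable is not.
static float lengthScratch; // reused "global to functions" scratch: racy when parallelized

void ComputeLengthsRacy( const float* values, float* out, int count )
{
	#pragma omp parallel for
	for ( int i = 0; i < count; i++ )
	{
		lengthScratch = values[ i ] * values[ i ]; // several threads write the same variable
		out[ i ] = lengthScratch;                  // may read another thread's value
	}
}

void ComputeLengthsSafe( const float* values, float* out, int count )
{
	#pragma omp parallel for
	for ( int i = 0; i < count; i++ )
	{
		float scratch = values[ i ] * values[ i ]; // each iteration owns its own scratch
		out[ i ] = scratch;
	}
}
```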

@illwieckz force-pushed the illwieckz/multithreaded-cpu-model branch 2 times, most recently from 8394d74 to b51db2f on October 2, 2025 at 05:09
@illwieckz
Member Author

Now that I think about it, it's probably possible to template the chunking as well.

@slipher
Member

slipher commented Oct 2, 2025

The latest version just uses OMP as a basic thread pool. You can find various simple thread pool implementations that are just a couple hundred lines of code, so I will try hooking up the code to one of those to see if we can drop the dependency. I bet the problem with your non-library-based chunked implementation was just that it spent too much time creating and destroying threads, which is solved by a thread pool.
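For context on how small such a pool can be, a bare-bones sketch (illustrative only, not a proposed implementation; a real parallel-for on top of it would also need a way to wait for all enqueued jobs, e.g. a counter or latch):

```cpp
// Bare-bones worker pool: threads are created once and then reused, which
// avoids the per-call cost of creating and destroying threads. Illustrative
// only; a usable parallel-for would also need completion tracking.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool
{
public:
	explicit ThreadPool( unsigned numThreads )
	{
		for ( unsigned i = 0; i < numThreads; i++ )
		{
			workers.emplace_back( [this] { WorkerLoop(); } );
		}
	}

	~ThreadPool()
	{
		{
			std::lock_guard<std::mutex> lock( mutex );
			stopping = true;
		}
		wakeup.notify_all();

		for ( std::thread& worker : workers )
		{
			worker.join();
		}
	}

	void Enqueue( std::function<void()> job )
	{
		{
			std::lock_guard<std::mutex> lock( mutex );
			jobs.push( std::move( job ) );
		}
		wakeup.notify_one();
	}

private:
	void WorkerLoop()
	{
		while ( true )
		{
			std::function<void()> job;
			{
				std::unique_lock<std::mutex> lock( mutex );
				wakeup.wait( lock, [this] { return stopping || !jobs.empty(); } );

				if ( stopping && jobs.empty() )
				{
					return;
				}

				job = std::move( jobs.front() );
				jobs.pop();
			}
			job(); // run outside the lock so other workers can pick up jobs
		}
	}

	std::vector<std::thread> workers;
	std::queue<std::function<void()>> jobs;
	std::mutex mutex;
	std::condition_variable wakeup;
	bool stopping = false;
};
```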

@illwieckz
Member Author

The latest version just uses OMP as a basic thread pool. You can find various simple thread pool implementations that are just a couple hundred lines of code, so I will try hooking up the code to one of those to see if we can drop the dependency.

That's welcome, but if it happens that other implementations don't perform as well as OMP, libgomp isn't really an annoying dependency on either Linux or MSYS2.

@illwieckz
Member Author

illwieckz commented Oct 2, 2025

Indeed, I may not have tested the non-chunked code with the pragma yet.

So I just roughly tested it by using the vertex count as the number of chunks, i.e. with a chunk size of 1.

I have a hard time seeing a difference on my 16-thread workstation, but that's because I have other things running alongside; both versions (chunked or not) currently run at 400~410 fps.

On my 8-thread laptop I see a small difference: with the chunked implementation it sometimes tops out at 183 fps, while with the non-chunked implementation it doesn't go higher than 179 fps. I reproduced this multiple times.

@illwieckz force-pushed the illwieckz/multithreaded-cpu-model branch from b51db2f to 6a37cac on October 2, 2025 at 21:26
@illwieckz
Member Author

I unchunked the code.

If we want to investigate chunking, we can do it later, and if we do, we should do it in the template instead.

@illwieckz
Member Author

Doing that simplified the code, and I finally topped out at 182 fps on the 8-thread laptop with the unchunked code. So I guess we don't have to care about chunking it.

@illwieckz force-pushed the illwieckz/multithreaded-cpu-model branch from 6a37cac to 8d455c0 on October 2, 2025 at 21:38
@slipher
Member

slipher commented Oct 2, 2025

I tried the compiler's built-in loop parallelization with #pragma omp parallel for -- see the slipher/omp-for branch. This gives me a measurable performance boost over the lambda-based dispatch.

@slipher
Member

slipher commented Oct 2, 2025

Also I removed the load-time OMP commits from that branch. I don't think that's worthwhile because ~97% of model loading time is spent on textures; the vertex data is hardly worth optimizing. We should try to avoid incurring costs of OMP when we are not actually going to use it. If we have fully GPU-based vertex skinning, we shouldn't start up the threads. And we shouldn't link OMP into the server which doesn't use it.

@slipher
Member

slipher commented Oct 2, 2025

I decided not to bother trying the thread pool since the pragma-based approach with OMP actually seems the least intrusive: that way, there are no lambdas which would make the single-threaded version less efficient. And the amount of extra code is minimal. Also MSVC supposedly implements OMP, so I will try that later.
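To make the trade-off concrete, a hedged sketch with made-up names (not the actual code from either branch): the lambda-based dispatch routes every iteration through a callable even in a single-threaded build, while the pragma form leaves a plain loop the compiler can inline and vectorize as usual when OpenMP is off.

```cpp
// Hedged sketch with hypothetical names, illustrating the trade-off described
// above rather than the actual branch code.
#include <cstddef>
#include <functional>

// Lambda-based dispatch: even the single-threaded build pays for the
// std::function indirection around the loop body.
void DispatchRange( std::size_t count, const std::function<void( std::size_t )>& body )
{
	#pragma omp parallel for
	for ( long i = 0; i < static_cast<long>( count ); i++ )
	{
		body( i );
	}
}

// Pragma form: when OpenMP is disabled this is just a plain loop that the
// compiler can inline and vectorize as usual.
void ScaleDirect( float* data, std::size_t count, float scale )
{
	#pragma omp parallel for
	for ( long i = 0; i < static_cast<long>( count ); i++ )
	{
		data[ i ] *= scale;
	}
}
```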

@illwieckz
Member Author

Ah yes, since I don't chunk anymore, we don't need a lambda anymore either.

Though, you're not setting the thread count before running the loop, and I noticed that when not setting it right before running the loop, the amount of threads being used is unpredictable.

@slipher
Member

slipher commented Oct 2, 2025

Though, you're not setting the thread count before running the loop, and I noticed that when not setting it right before running the loop, the amount of threads being used is unpredictable.

Changing the number of threads at runtime wouldn't work yet on my branch, but Omp::Init called on startup does set the number of threads, so it should work fine as long as you don't toggle the cvars.
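As a hedged sketch of what such an init could look like (an assumption about the shape, not necessarily what the branch actually does); note that omp_set_num_threads() only affects the default for parallel regions started from the calling thread, which is likely why the setting later had to be repeated on the render thread when r_smp is enabled:

```cpp
// Assumption sketch of an Omp::Init-style setup, not necessarily what the
// branch does. omp_set_num_threads() sets the default team size used by
// subsequent parallel regions started from the calling thread.
#ifdef _OPENMP
#include <omp.h>
#endif

namespace Omp {

void Init( int requestedThreads )
{
#ifdef _OPENMP
	if ( requestedThreads > 0 )
	{
		omp_set_num_threads( requestedThreads );
	}
#else
	(void) requestedThreads; // built without OpenMP: nothing to configure
#endif
}

} // namespace Omp
```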

@illwieckz
Member Author

I don't know if that's related, but with r_smp it's unpredictable. Even when setting it at the start of each frame, this isn't enough.

@slipher
Member

slipher commented Oct 3, 2025

How are you determining that "the amount of threads being used is unpredictable"? It makes sense that turning on r_smp would throw off timing measurements by having another thread unpredictably running at the same time. So don't do that!

@illwieckz
Member Author

By printing the output of omp_get_num_threads() and also by looking at the number of busy threads in htop.

In my previous experiments I got very weird things, like omp_get_num_threads() returning 2 when I had set 16, etc.
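A likely explanation for the weird readings (stated as an assumption, not a diagnosis of this branch): omp_get_num_threads() reports the team size of the parallel region it is called from, so it returns 1 outside any parallel region and can report surprising values in serialized or nested contexts, while omp_get_max_threads() is the call that reports how many threads the next parallel region would use. A small sketch:

```cpp
// Standard OpenMP query calls; the printed values depend on the configuration.
#ifdef _OPENMP
#include <omp.h>
#endif
#include <cstdio>

int main()
{
#ifdef _OPENMP
	omp_set_num_threads( 16 );

	// Outside a parallel region this always reports 1.
	std::printf( "outside, omp_get_num_threads(): %d\n", omp_get_num_threads() );

	// This reports how many threads the next parallel region would use.
	std::printf( "outside, omp_get_max_threads(): %d\n", omp_get_max_threads() );

	#pragma omp parallel
	{
		#pragma omp single
		{
			// Inside the region this reports the actual team size.
			std::printf( "inside, omp_get_num_threads(): %d\n", omp_get_num_threads() );
		}
	}
#else
	std::printf( "built without OpenMP\n" );
#endif
	return 0;
}
```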

@illwieckz force-pushed the illwieckz/multithreaded-cpu-model branch 2 times, most recently from 0dded86 to 3baf36d on October 3, 2025 at 00:35
@illwieckz
Member Author

illwieckz commented Oct 3, 2025

I removed the if (BUILD_SERVER), etc. from the CMake file. I also removed the if (NOT BUILD_CGAME), etc. from it, because I guess it would prevent using OpenMP when building native games in the same CMake build as the engine, since we don't build native games as subprojects.

@illwieckz
Member Author

I also removed the commits parallelizing the loading code; that can be discussed later.

@illwieckz
Member Author

On the 8-thread laptop I now top out at 185fps, the frametime curve is much smoother, the throttling starts later, and the framerate decline due to throttling is slower (it keeps the higher framerates much longer).

@illwieckz
Member Author

illwieckz commented Oct 3, 2025

I decided not to bother trying the thread pool since the pragma-based approach with OMP actually seems the least intrusive.

Yes, if we can use OMP that would be very good: it's very easy to integrate into our code, and the code just builds without problems when OMP is missing.

@illwieckz
Member Author

Just as a test, I commented out the EnlistThreads() calls; the engine then spawns 32 threads and the framerate is 1fps.

@illwieckz force-pushed the illwieckz/multithreaded-cpu-model branch from 3baf36d to 1138716 on October 3, 2025 at 01:24
@illwieckz
Member Author

Just as a test, I commented out the EnlistThreads() calls; the engine then spawns 32 threads and the framerate is 1fps.

OK, it works if I do it in RB_RenderThread().

@illwieckz
Member Author

I added a log line to /gfxinfo reporting OpenMP support and the number of threads it will currently use.
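For reference, a hedged sketch of what such a report could look like (the actual /gfxinfo wording and the engine's logging API may differ):

```cpp
// Sketch only: uses printf instead of the engine's logging facilities.
#ifdef _OPENMP
#include <omp.h>
#endif
#include <cstdio>

void PrintOmpInfo()
{
#ifdef _OPENMP
	std::printf( "OpenMP: supported, using up to %d threads\n", omp_get_max_threads() );
#else
	std::printf( "OpenMP: not compiled in\n" );
#endif
}
```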

@illwieckz force-pushed the illwieckz/multithreaded-cpu-model branch from 1138716 to 8355d47 on October 3, 2025 at 02:48
@illwieckz
Member Author

I added the CMake stuff for MSVC.

@slipher
Member

slipher commented Oct 4, 2025

I added the CMake stuff for MSVC.

It doesn't work yet on Windows, because Omp.h and omp.h are considered the same filename (Windows filesystems are case-insensitive).

The other thing I am asking is to avoid adding the libgomp dependency to the server, which does not use it. So the new files should be placed in the renderer source list and the cvar should have an r_ prefix to reflect the module which uses it.

I found that luckily the threads are not really started if they are not used (with GNU OpenMP at least), so it is not important to avoid requesting them when GPU vertex skinning is fully supported.

@VReaperV
Contributor

VReaperV commented Oct 4, 2025

I found that luckily the threads are not really started if they are not used (with GNU OpenMP at least), so it is not important to avoid requesting them when GPU vertex skinning is fully supported.

That doesn't sound like guaranteed behaviour.
