Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix threading issues in resource loading #73647

Merged
merged 3 commits into from
Feb 23, 2023

Conversation

RandomShaper
Copy link
Member

@RandomShaper RandomShaper commented Feb 20, 2023

This PR is an alternative to #73623, aiming to fix the same issues spotted there, but in a different way; namely:

  • The race condition where the task semaphore has a chance of being destroyed before the other thread can wait on it is solved here by using the sync primitive that can naturally fix the problem, which is a condition variable. In the current implementation, the problem is that a mutex is unlocked and then a semaphore is awaited. Those operations should make an atomic operation, which is precisely what a condition variable does. Therefore ConditionVariable is introduced, together with a new flavor of binary mutex needed to be used with it.
  • The problem that a load task has been initiated in a non-thready context (regular resource load or load in thread, but with no subthreads) and then involved in load threads created later is solved here by allowing the condition variable (formerly, a semaphore) needed to await on the task to be created lazily. Such approach keeps the current benefit of having one-threaded cases free of having to create and deal with the condition variable.

This needs testing. And I beg a thorough review of this code, since this is slippery ground.

NOTE: Given the complex nature of the changes, I'm keeping them in separate commits.

@RandomShaper
Copy link
Member Author

@mrTag, would you please test this approach with your game and tell me the outcome?

@RandomShaper RandomShaper force-pushed the fix_threaded_load branch 2 times, most recently from e9e709a to ba2c460 Compare February 20, 2023 20:19
@YuriSizov YuriSizov modified the milestones: 4.0, 4.1 Feb 20, 2023
@RandomShaper
Copy link
Member Author

So isn't this in time for 4.0? This PR fixes serious core issues.

@akien-mga
Copy link
Member

It's a serious core PR with potential for regression too, minutes before RC3 and one week before stable. Are you confident enough in the implementation and is it worth the risk? Can we get @reduz to approve it right now before RC3?

@RandomShaper
Copy link
Member Author

It's indeed a double edge sword, and very sharp in both sides. If @reduz could review the changes and confirm they indeed only but fortify the implementation (i.e., I didn't misinterpret the code), I could do some stress testing. Then I'd be confident enough to move forward with this.

@akien-mga
Copy link
Member

I discussed this briefly with @reduz, he agrees that this kind of changes is needed but it's part of a much bigger work package on thread safety that he wants to discuss with you. So that's definitely more in scope for 4.1 at this stage.

@mrTag
Copy link
Contributor

mrTag commented Feb 21, 2023

@RandomShaper I just tested our game with this branch and it worked without problems. With the current master branch loading would fail pretty much every time for us (with the "referenced nonexistent resource" error), so that is a pretty good indicator that your version fixes the problem! And it is definitely a cleaner solution than in my pull request, I didn't know about conditionals and their use for cases like this. I'll be using this branch for our game from now on and I'll get back to you when any problems arise. That includes testing it on linux (I only tested it on Windows for now).

And I agree: it is a pretty serious issue. In my experience with the ResourceLoader, load_threaded_request is currently only usable in the simplest of cases (i.e. only one resource at a time). Which is a valid use case, since it enables asynchronous loading of one scene while showing a loading screen. But I completely understand the need for more testing here.

@RandomShaper
Copy link
Member Author

Made #73675, which addresses the issues in a much simpler way.

Copy link
Member

@akien-mga akien-mga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved by reduz for 4.0:

reduz: I think #73647 is the way to go, I also need to implement condition variables in the thread worker pool
Akien: Alright, so we merge now for 4.0 RC4 and hope it doesn't wreak havoc? and poke Pedro with a Hellrule burning pitchfork if it does 😛
reduz: I would go with that yeah
This falls into the category of temporary fix until we rewrite it properly after 4.0 is out

@Anixias
Copy link

Anixias commented Feb 23, 2023

Just thought I'd throw out here that it still occasionally freezes the engine in my case (Windows 10), seemingly failing when starting the loading process. It does work more often than not now, though.

@mrTag
Copy link
Contributor

mrTag commented Feb 24, 2023

I did some more testing, this time also on my linux machine. The test case is: loading into a level in our game, which triggers lots of parallel load_threaded_requests, with lots of overlapping external resources in the scenes. I tested both use_sub_threads true and false.

On my windows machine I didn't have any problems, ran the test 10 times with sub_threads and 10 times without.

On my linux machine (which is also not the best, hardware wise!), I had 2 different problems: the "referenced nonexistent resource" error and the deadlock, resulting in an endless loading screen (no freeze, though). Here is the tally with 10 runs per category (sub_threads and no sub_threads):
sub_threads: endless load: 0 error: 8 no problems: 2
no sub_threads: endless load: 4 error: 0 no problems: 6

Then I remembered, that I changed some if conditions in my PR from if (load_task.semaphore) to if (load_task.thread) and that still makes sense in this PR: not every ThreadLoadTask should mess with the thread_waiting_count and thread_loading_count variables. The ones that were created in the load function didn't increase those numbers and so they shouldn't decrease them. So I changed that (like this: ChasingCarrots@61feb7c ) and ran the tests under linux again:
sub_threads: endless load: 0 error: 6 no problems: 4
no sub_threads: endless load: 0 error: 0 no problems: 10
The endless loading screen went away! I didn't have any problems when not using sub threads. The errors when using sub_threads are still there, though.

Does that change make sense to you @RandomShaper ?

@akien-mga
Copy link
Member

Then I remembered, that I changed some if conditions in my PR from if (load_task.semaphore) to if (load_task.thread) and that still makes sense in this PR: not every ThreadLoadTask should mess with the thread_waiting_count and thread_loading_count variables. The ones that were created in the load function didn't increase those numbers and so they shouldn't decrease them. So I changed that (like this: ChasingCarrots@61feb7c ) and ran the tests under linux again:

I bisected #73983 as caused by this PR, so I tried your proposed change in hope that it would solve that issue, but it doesn't seem to. (Doesn't mean that it's not correct, I just checked if by chance it would fix it too.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants