Skip to content

Conversation

@bwrsandman
Copy link
Contributor

@bwrsandman bwrsandman commented Jan 18, 2019

Current Status on master

On Linux, the Xenia app and many of the test ui tests do not run correctly due to waits on WaitHandles flowing through and returning early.

Before

Fix status of this PR

I have been working on this PR for about a year and it now has reached a point where I consider it done and ready for review.

This PR implements most of threading_posix.cc by TDD to make sure behaviour is the same as win32.
This is a follow-up to (#1108) in order to get xenia to actually run on Linux. Following this, a few GTK-Vulkan fixes need to be done as well as implementing POSIX file system stubs. With those, a GTK window actually appears and responds to events.

After

Fix description

Borrowed commit from #1076 for Linux debug symbols.

Fixed occurrences of micro/millisecond confusion on timers.

Posix threads do not support suspension in the same way as windows threads. Real-time interrupts used to give this functionality and to implement thread callback when in alertable state.
More real-time interrupts used to signal timers.
Since there are more than two signals used, SIGUSR1 and SIGUSR2 are not used. Signals are allocated between SIGRTMIN and SIGRTMAX which are dynamic and can change per machine or kernel version but should start from SIGUSR2 + 1.

Not all systems support real-time signals. bits/signum.h indicates
that they are supported by overriding __SIGRTMAX to a value greater
than __SIGRTMIN. These constants give the kernel-level hard limits,
but some real-time signals may be used internally by glibc. Do not
use these constants in application code; use SIGRTMIN and SIGRTMAX
(defined in signal.h) instead.

/* Return number of available real-time signal with highest priority.  */
extern int __libc_current_sigrtmin (void) __THROW;
/* Return number of available real-time signal with lowest priority.  */
extern int __libc_current_sigrtmax (void) __THROW;

#define SIGRTMIN        (__libc_current_sigrtmin ())
#define SIGRTMAX        (__libc_current_sigrtmax ())

Real-time interrupts handlers are installed in a lazy-load fashion. This could be moved to an initialize phase to avoid lazy-load, but no obvious location for this was found.

Waiting is implemented using static condition variables. To do this, each WaitHandle is also an instance of PosixConditionBase, the latter which has the same functionality of wait and wait multiple. This implementation was more successful than using the existing PosixFdHandle which was then removed in order to use a more generalizable super class which gave the added bonus of waiting on multiple types of primitives at the same time since they all use the same conditional variables.

Alertable state is set as a thread local static variable.

I ran these tests multiple times concurrently in Windows and Linux on my machine, my laptop and on travis. I will also test it on my LattePanda Alpha SBC to test a lower-end machine. Throughout the development, there have been fails which are not always reproducible and I brought this number down with fixes and hopefully there aren't any rare bugs left.

Things left to check later

Suspending an already suspended thread was not tested nor implemented and left for a later time.

Implementing affinity and priority was attempted but in the end abandoned due to Linux's technicalities.
I could be wrong but there doesn't seem to be any way to set a thread's priority which is higher than the main thread's without running as a super user so we could only set it to Normal or Lower.

Sleeping for times lower than 1ms works on my setups but causes deadlocks on Travis. This is no problem for the unit tests. The times were just increased by factors of 10 to 100. This, however, means that tests which run in about 5s could potentially run in < 1s if that deadlock issue is fixed.

There will always be fundamental differences between Windows and Linux which could be tested against. My hope is that the existence of these unit tests can make pin pointing these differences more easy.

@Razzile
Copy link
Member

Razzile commented Jan 18, 2019

Hot damn this is amazing work!

@bwrsandman
Copy link
Contributor Author

Note to reviewers:
Please let me know if there are any changes to be made regarding architecture, code style, naming, spelling or if there are any tests to add.

If you check out the branch and run into issues, please send me a coredump if one is produced. Let me know what distro you run, which kernel version, the compiler with version, machine specs (cpu, number of cores, memory, load) and reproduceability. Having all this information is incredibly useful to find and fix threading bugs.

Any comment is appreciated and I will try to make requested changes within a week.

@chris-hawley
Copy link
Contributor

chris-hawley commented Jan 19, 2019

First of all. This is excellent work. This has been a long-standing missing component of the emulator.

I cloned your repository and then rebased it against the master for the latest fixes. I implemented the filesystem headers required and tried running it. I had some issues with it crashing and ran it under valgrind a few times and noticed this. To be clear, the crash wasn't caused by this output. The vulkan backend was crashing. Given my troubles with Vulkan, could you verify this behavior and fix if necessary? I have also added the three functions that were required to get the filesystem working.

==15421== Thread 2 :
==15421== Invalid read of size 1
==15421==    at 0x201FB4: xe::threading::PosixCondition<xe::threading::Thread>::ThreadStartRoutine(void*) (threading_posix.cc:906)
==15421==    by 0x55BEA9C: start_thread (in /usr/lib/libpthread-2.28.so)
==15421==    by 0x5D98B22: clone (in /usr/lib/libc-2.28.so)
==15421==  Address 0x87beab0 is 32 bytes inside a block of size 48 free'd
==15421==    at 0x48389AB: free (vg_replace_malloc.c:530)
==15421==    by 0x201F73: xe::threading::PosixCondition<xe::threading::Thread>::ThreadStartRoutine(void*) (threading_posix.cc:901)
==15421==    by 0x55BEA9C: start_thread (in /usr/lib/libpthread-2.28.so)
==15421==    by 0x5D98B22: clone (in /usr/lib/libc-2.28.so)
==15421==  Block was alloc'd at
==15421==    at 0x483777F: malloc (vg_replace_malloc.c:299)
==15421==    by 0x569AE19: operator new(unsigned long) (in /usr/lib/libc++.so.1.0)
==15421==    by 0x203EFA: xe::threading::PosixThread::Initialize(xe::threading::Thread::CreationParameters, std::function<void ()>) (threading_posix.cc:836)
==15421==    by 0x20227A: xe::threading::Thread::Create(xe::threading::Thread::CreationParameters, std::function<void ()>) (threading_posix.cc:938)
==15421==    by 0x1ECA0D: xe::Logger::Logger(std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > const&) (logging.cc:71)
==15421==    by 0x1EBDF2: xe::InitializeLogging(std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > const&) (logging.cc:245)
==15421==    by 0x1C9AFE: main (main_posix.cc:37)
#include "xenia/base/assert.h"
#include "xenia/base/filesystem.h"
#include "xenia/base/logging.h"
#include "xenia/base/string.h"

#include <dirent.h>
#include <fcntl.h>
#include <ftw.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <libgen.h>
#include <pwd.h>
#include <iostream>

namespace xe {
namespace filesystem {

std::wstring GetExecutablePath() {
	char buff[FILENAME_MAX] = "";
	readlink("/proc/self/exe", buff, FILENAME_MAX);
  std::string s(buff);
  return to_wstring(s);
}

std::wstring GetExecutableFolder() {
  auto path = GetExecutablePath();
  return xe::find_base_path(path);
}

std::wstring GetUserFolder() {
  struct passwd *pw = getpwuid(getuid());
  char *homedir = pw->pw_dir;
  std::string home(homedir);
  return to_wstring(home+"/.local/share/");
}

@gibbed
Copy link
Member

gibbed commented Jan 19, 2019

Consider this high on my priority list, I'll review it as soon as I am able. Great work!

@bwrsandman
Copy link
Contributor Author

bwrsandman commented Jan 19, 2019

@chris-hawley I plan to tackle vulkan on linux next.
I have only spent a little time looking at the issues with vulkan (to get xenia-app running) as most of my time was spent at a lower level testing against the threading unit tests directly.

Other than the filesystem stubs which are easy to fix with hard-coded values, I identified two issues :

  • The vulkan GTK window is not setup correctly
    • The xcb surface requirement isn't asked for (a9c2053)
    • The xcb connection/native_platform_handle isn't set (0e5af7d)
  • The VkDebugReportCallbackCreateInfoEXT isn't properly set (32a0e25)

I am pretty familiar with Vulkan and graphics so I plan on taking a more in-depth look at the implementation and following up with fixes in another pull request.

As for:

==15421== Thread 2 :
==15421== Invalid read of size 1
...

I haven't seen this message before, do you have a branch I can check out?

@chris-hawley
Copy link
Contributor

I'll create a branch called linux-testing in my xenia fork.

@bwrsandman
Copy link
Contributor Author

bwrsandman commented Jan 19, 2019

Great. By the way, you can see my Vulkan work and changes on this branch. Subject to change.

@chris-hawley
Copy link
Contributor

branch is all set to go

@bwrsandman
Copy link
Contributor Author

bwrsandman commented Jan 19, 2019

@chris-hawley The is due to the xcb connection/native_platform_handle not being set (0e5af7d) and the crash happens within either gtk or the graphics driver (in my case nvidia).

If you cherry-pick a9c2053 and 0e5af7d from my branch, you won't get those issues.

Anyways, vulkan crashing is a separate issue which will be resolved with another PR.

@chris-hawley
Copy link
Contributor

that fixed the vulkan crashing.

@chris-hawley
Copy link
Contributor

Also, we have a discord if you would rather do this back and forth there. discord

@Triang3l
Copy link
Member

Triang3l commented Jan 19, 2019

GetUserFolder should have the following logic (@Prism019's comment on #1316 is related to this):

  • If the environment variable XDG_DATA_HOME is set, the path is $XDG_DATA_HOME/Xenia.
  • If it's not set:
    • If the environment variable HOME is set, the path is $HOME/.local/share/Xenia.
    • Optionally (I don't know how realistic this situation is, other apps don't always handle it), if HOME is not set, use getpwuid(getuid())->pw_dir as $HOME (this is not always the same as $HOME — the user may override the home directory, but $HOME takes precedence in this case).

@bwrsandman bwrsandman mentioned this pull request Jan 27, 2019
@bwrsandman bwrsandman force-pushed the linux_threads branch 2 times, most recently from d4970e2 to ea2edef Compare February 3, 2019 03:01
@bwrsandman
Copy link
Contributor Author

I fixed some memory leaks that were pointed out by valgrind.
They occurred due to threads not being joined after being cancels as in the case of Terminate().
There was a unit test not joining with threads.
There was a case of use after delete of thread start_data.

Copy link
Member

@DrChat DrChat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preliminary review with some cosmetic requests.
This is really cool, and using conditional variables is a very interesting solution! Good work!

@bwrsandman
Copy link
Contributor Author

bwrsandman commented Mar 9, 2019

Thanks for the review @DrChat. I have made the requested changes and rebased on master.
Have you had a chance to run it yourself?
Please let me know if there's anything else to do.

@DrChat
Copy link
Member

DrChat commented May 17, 2019

So, I didn't merge this at the time because of the complex nature of this pull request - it's modifying a delicate part of the emulator, which can introduce very subtle bugs.

The changes here look pretty well thought out though, and I'm leaning towards merging this. One thing though: What about that deadlock on Travis from the unit tests?

@bwrsandman
Copy link
Contributor Author

// TODO(bwrsandman): due_times of under 1ms deadlock under travis

This was something which only happened on Travis. I wasn't able to get to the bottom of this since none of my machines could replicate. If I had to guess it's the nature of the container and the kernel used is somehow interpreting less than 1 ms as infinite.
I would love to have this fixed as it could speed up the tests by up to 1000x (switching ms to ns).
Unfortunately, since I can't replicate it outside of travis, it's difficult.

@bwrsandman
Copy link
Contributor Author

Rebased to latest master

Test logical_processor_count() 3 times to test static return value stays
correct.
Run EnableAffinityConfiguration(). No asserts possible.
Test setting thread id, test using uint32_t max to reset.
Test setting thread name. No asserts possible.
Test running MaybeYield(). No obvious more complex test case.
Test running SyncMemory(). No obvious more complex test case.
Add Sleep Test for 50ms.
Fix Sleep under linux that was using microseconds as nanoseconds.
Factor timespec creation to template function using div/mod and nanoseconds
from duration cast.
Implement HighResolutionTimer for Posix by using native timers.
Callbacks are triggered with realtime interrupts if they are supported.
Create an enum to track user-defined interrupts as well as an initializer and
handler to register these interrupts per thread.
Add test cases for timers for both single and multiple.
Fix Sleep function to continue sleeping if interrupted by system.
Add local .gdbinit to ignore signal 34 which is used by high res timer
Remove atomic boolean in fence. Variable signaled_ is already protected
by mutex.
Remove wait loop with single predicate wait protected with mutex.

Add Fence Signal and Wait tests
Test signaling without waiting.
Test signaling before waiting.
Test signaling twice before waiting.
Test synchronizing threads with fence.

Few REQUIRES were used to test as there are no return codes.
A failing test may hang indefinitely or cause a segfault which would still
register as a fail.
Linux: Remove copy and destroy call in make_unique invokation which closes
handles on all events.
Testing: Add Wait test for Events set and unset.
Remove file-descriptor specific wait implementation to PosixFdHandle class
which breaks on waits of non-fd handles.
Replace with PosixConditionHandle and extend to support auto reset and
initial values.
Simplify mutex and conditional variable use with stdlib versions which
wrap these primitives but provide better C++ interface.
Test Event and Reset
Make conditional_variable and mutex static and create generalisation of
Wait for vector of handles.
Use std::any for waitany and std::all for waitall
@bwrsandman bwrsandman force-pushed the linux_threads branch 2 times, most recently from cd681c3 to 3363dd5 Compare November 15, 2020 15:55
bwrsandman and others added 16 commits November 15, 2020 11:04
Add PosixConditionBase as base class for Waitables to use common
primitives mutex and conditional variable
Add abstract signaled() and post_execution() to use single WaitMultiple
implementation.
Test acquiring and releasing semaphores on same and on different threads.
Test previous_count values.
Test WaitAll and WaitAny.

Add tests for invalid semaphore creation parameters but disactivated as
they do not pass on any platform. These should be enabled and the
implementations fixed to match documentation.
Keep track of recursive locks with owner and count of locks.
Only allow recursive locks from same thread and increment count.
Only allow first locks from when count is zero.

Test acquiring and releasing mutant on same and on different threads.
Test Release return values.
Test WaitAll and WaitAny.
Test Manual Reset and Synchronization timers single threaded.
Test Cancelling timers.
Test WaitMultiple.
Ignore real-time event 35 in .gdbinit which is used to signal timer.

Callbacks don't seem to be called so testing them is difficult.
Use real-time event interrupt to communicate suspend in timely manner.
Use conditional_variable to implement suspend wait and resume trigger.

Ignore real-time event 36 in .gdbinit which is used to signal suspend.

Test suspending threads.
Add thread local bool for alertable state.
Use real-time event interrupt to run callback.
Fix sleep duration from miliseconds (microseconds / 1000) to seconds in sleep
command.
Add note for future implementation.

Ignore real-time event 37 in .gdbinit which is used to signal callback.

Test AlertableSleep
Test Thread QueueUserCallback.
TODO: Test alerted wait result when using IO functions.
Implement TLSHandle with pthread_key_t.

Test Alloc, Free, Get and Set.
Add Signal abstract function to handles.
Test SignalAndWait.
Give other threads access to initially suspended threads by signalling
conditional variable before waiting for state to be changed again.
Shorten names to 16.
Rename Win32 to Windowing.
Shorten GraphicsSystem thread names due to 16 length limit of pthread.
Without this change, both show up as GraphicsSystem.
Remove redundant "Worker" and "Thread" from names.
Remove redundant thread handle from thread name.
Add suspend count to thread implementation.
Increment suspend count on suspend and decrement on resume.
Wait on suspend count to be decremented to 0.
Return suspend count on suspend and on resume before incr/decr.
Fix naming of resume suspend count to make clear that suspend count is
before incr/decr.
Add test.
Move wait implementation to not use native_handle.
Implement native_handle for each primitive using posix natives.
@gibbed gibbed merged commit 68cf47e into xenia-project:master Nov 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants