Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPC-GAP: Allow configuration of TLS and improve native TLS performance on macOS #3502

Merged
merged 1 commit into from
Nov 14, 2019

Conversation

rbehrends
Copy link
Contributor

@rbehrends rbehrends commented Jun 14, 2019

This is a tracking PR for progress on HPC-GAP changes. Aside from older changes in this PR, we also aim at improving our implementations of thread-local storage, specifically so that HPC-GAP can be used as a library.

Our current implementation requires HPC-GAP to be run as the main program, which in particular prevents its use as a library. Alternative portable implementations are considerably slower on systems that do not properly support efficient thread-local storage (such as macOS and CygWin).

We're attempting to address this problem in two ways.

One issue is implementations that rely on pthread_getspecific() directly or indirectly do not treat such function calls as pure functions and thus do not get used for common subexpression elimination or hoisted out of loops. As those systems all rely on gcc/clang, we can tag pthread_getspecific() as __attribute__((pure)), which will immediately lead to a measurable performance improvement.

On macOS, we can go one step further. While there is no guaranteed stable ABI, pthread_getspecific() has consistently been implemented by movq %gs:(,%rdi, 8), %rax. We therefore can verify that pthread_getspecific() actually uses this implementation (memcmp() against the machine code) and then use a direct assembly language implementation that has nearly the same performance characteristics as our current implementation.

In addition, we store the key in a static variable that is initialized with an __attribute__((constructor)__ function. This is generally slightly faster than putting it in a non-static variable on macOS in particular, as we're saving one level of indirection.

Also, in the worst case, this code functions with just pthread_getspecific() and no other mechanism available, it is just slower.

Todo:

  • Configure check for __attribute__((pure)) and definition of a macro for it.
  • Configure check for __attribute__((constructor)) and definition of a macro for it.
  • Configure check for it being acceptable to define __pthread_getspecific() as pure.
  • Configure check for macOS supporting this particular implementation.
  • Potentially distinguish between building HPC-GAP as a library and a standalone program?
  • Efficient Cygwin implementation.

@rbehrends rbehrends added topic: HPC-GAP Issues and PRs related to HPC-GAP do not merge PRs which are not yet ready to be merged (e.g. submitted for discussion, or test results) topic: kernel labels Jun 14, 2019
@fingolfin
Copy link
Member

My understanding is that this contains PR #2845, correct? If so, it really would be nice if we could get that PR finished and merged, before building new stuff atop of it...

@rbehrends
Copy link
Contributor Author

Strictly speaking, this PR does not actually depend on the other one, but is committed on top of that branch for convenience.

I am already working on cleaning up the other PR as part of my current work to make multi-threaded Oscar a possibility, so that we can move on from that.

@rbehrends rbehrends force-pushed the hpcgap-alt-tls branch 2 times, most recently from d99b7e5 to 83ee9a8 Compare September 19, 2019 10:53
configure.ac Outdated Show resolved Hide resolved
configure.ac Show resolved Hide resolved
configure.ac Outdated Show resolved Hide resolved
configure.ac Outdated Show resolved Hide resolved
etc/ci-prepare.sh Outdated Show resolved Hide resolved
src/hpc/tls.h Outdated Show resolved Hide resolved
src/hpc/tls.h Outdated Show resolved Hide resolved
src/hpc/tls.h Outdated Show resolved Hide resolved
src/hpc/tlsconfig.h Outdated Show resolved Hide resolved
src/hpc/tlsconfig.h Outdated Show resolved Hide resolved
configure.ac Outdated Show resolved Hide resolved
@rbehrends rbehrends force-pushed the hpcgap-alt-tls branch 2 times, most recently from 7eb6ab0 to f132902 Compare September 20, 2019 11:54
.gitignore Outdated Show resolved Hide resolved
configure.ac Outdated Show resolved Hide resolved
src/hpc/threadapi.c Outdated Show resolved Hide resolved
src/precord.c Outdated Show resolved Hide resolved
@rbehrends
Copy link
Contributor Author

This is now a revised version that mirrors the Mono logic. It can also now be more easily extended to equally support Windows implementations.

@rbehrends rbehrends force-pushed the hpcgap-alt-tls branch 4 times, most recently from bd302fb to 51464e8 Compare October 2, 2019 12:12
Copy link
Member

@fingolfin fingolfin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit 'Update .gitignore and .travis.yml files from master.' needs a new commit message.

Otherwise, this is shaping up well, but we need to figure out how to get it into a mergeable state, see my comments.

src/hpc/tls.h Outdated Show resolved Hide resolved
src/system.h Outdated
#endif



Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very unimportant suggestion, feel free to ignore: but I try to restrict to at most 2 empty lines in a row.

Suggested change

configure.ac Show resolved Hide resolved
configure.ac Outdated Show resolved Hide resolved
configure.ac Outdated Show resolved Hide resolved
configure.ac Show resolved Hide resolved
Comment on lines 53 to 55
unward/bin/unward --inplace src
# commit the result to prevent the docomp test from triggering
git commit -m "Unward" src
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this means people who want to use the code in this PR still need to have unward, and run it, and then have lots of modifications in their tree. That's IMHO the major blocker for merging this PR, and until we resolve it, there is no way this can be merged. So I see two ways forward:

  1. We run unward and commit the modifications once and for all, and move on (my understanding always was that this is the plan?)
  2. We determine that we will have to keep using unward for the foreseeable. In that case, we will have to use it like ward, and thus need to restore the parts of the build system which transparently handled applying ward (now: unward) to sources followed by compilation (instead of working inlaces, the results would be placed into gen/src/; and build rules would ensure gen/src/FOO.c gets rebuilt whenever src/FOO.c or one of its header dependencies is touched.

Thing is, we need people to use this code so it gets tested, so keeping it in a PR indefinitely is not a good way forward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is something that I don't have a good solution for myself. My plan was to not put the unward changes in for a transition period. The reason is that if we need to change something about how they work (e.g. include an option for disabling guards for performance reasons), they mess up diffs and git blame, as they're pretty extensive.

But that leads to the problem that if you build HPC-GAP without unward, you get a near immediate crash, as an unnecessary read or write guard fails (one that gets removed by unward).

Possible solution: add a --without-guards configure option. That allows builds to (unsafely) work without unward for the transition period and would also later allow HPC-GAP to be built without guards for performance reasons if performance is critical.

If we were to put unward into the build system, it already supports a -o <dir> option to place the transformed sources (and all untransformed ones, too) into a separate directory, say gen/. This takes less than a second, is incremental (it doesn't overwrite files that didn't get changed and therefore doesn't change their time stamps) and thus make-friendly, and is in principle possible.

But this would again introduce an external dependency, which I'm not very enthusiastic about (even though unward, unlike ward, is itself dependency-free and only requires a C++-98 or later compiler). A large part of the raison d'être for unward is, after all, that it can be dropped from the build system eventually and does not continue to be a maintenance burden.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds as if you are aiming for option 1 eventually. But the big question then is: when would we be ready for it? What's the hard criterium for making that decision? I keep coming back to this very same questions since when you first started talking about unward. You keep saying it's meant to be used once, but also keep saying that it may need to be changed, and then re-run. Unless there is a clear goal as to when it's "done", I don't see a solution out of this dilemma, other than to accept that the notion of "unward will be run once and then discarded" is hopeless. Hence option 2.

At the same time, it's also totally unclear to me what major changes to unward output you think might be needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two possible changes that I'm contemplating right now and that may require changes in the output:

  1. Allow disabling of guards at compile time.
  2. Add #ifdef HPCGAP around the inline functions, so that they don't get compiled when not built with HPC-GAP. This was unnecessary when they were still static, but now they're not static anymore and should not show up in the code regardless.

My goal with unward was to keep the choice open-ended. This is why I specifically supported both the following options:

  1. Make the process idempotent, so that you can rerun unward if necessary.
  2. Add a -o option to allow for use as a preprocessor in the build system (i.e. like ward).

This means that we wouldn't be forced by design to pick one or the other.

As for the process, I was hoping to basically get the following done:

  1. Add the current PR (cleaned up) to GAP.
  2. Ask a couple of people to test that it works with unward (there aren't many who are using HPC-GAP, but there are a couple at least).
  3. If that's the case, then after a short delay (couple of weeks) add the unward changes in a separate patch.

I am totally fine with rolling them into the current PR, too, if you think that's the best way forward. In the end, you probably have a better view of the big picture as far as GAP is concerned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rbehrends (CC: @stevelinton @ChrisJefferson ) regarding your first two points, it seems you already added --enable-guards / --disable-guards (which is perfectly fine by me). Note sure around which inline functions you want(ed?) to add #ifdef HPCGAP, but I guess it's fine, too.

Regarding how to proceed: My main concern about merging this PR is how it would affect people who are using HPC-GAP right now, but who do not yet know about unward and thus are not using it. Will they get a horribly broken HPC-GAP, or just one that's more or less on the same level as what we have in master right now anyway?

If it was the latter, there is indeed nothing major stopping this from being merged (my remaining remarks are all very minor nitpicks and should be easy to resolve).

However, it seems to be the former (starting GAP built from your branch w/o running unward causes an immediate segfault due to violated guards). That makes me somewhat more reserved about the idea of merging this as-is. It'd be still OK by me if there was a very concrete timeline for it, and also a plan B should we determined that result of the merge is unusable. But my main concern is who the "couple of people" would be we'd ask to test it, and what makes us think they'd be more likely to test it then, than they would be to test this PR / your branch if we asked them explicitly to test it?

Until we come up with a good strategy to deal with that, it might indeed be best to split this commit into two: a non-controversial commit we could merge right away (which would contain the TLS changes, the renaming of HAVE_NATIVE_TLS to USE_NATIVE_TLS and a bunch of other minor changes), and the rest (mainly: the guards changes).

src/gasman.h Outdated Show resolved Hide resolved
src/gasman.h Outdated Show resolved Hide resolved
@rbehrends rbehrends force-pushed the hpcgap-alt-tls branch 4 times, most recently from 791ac67 to d9203e4 Compare October 7, 2019 12:45
@fingolfin
Copy link
Member

@rbehrends the commit here is titled "Make HPC-GAP work without Ward." but that's not really what this is about, given that we already are not using ward...

src/hpc/tls.h Outdated Show resolved Hide resolved
src/hpc/tls.h Outdated Show resolved Hide resolved
src/hpc/tls.h Show resolved Hide resolved
src/hpc/tls.h Outdated Show resolved Hide resolved
configure.ac Outdated Show resolved Hide resolved
src/hpc/thread.c Show resolved Hide resolved
src/hpc/thread.c Outdated Show resolved Hide resolved
src/hpc/thread.c Outdated Show resolved Hide resolved
src/hpc/thread.c Outdated Show resolved Hide resolved
Comment on lines 53 to 55
unward/bin/unward --inplace src
# commit the result to prevent the docomp test from triggering
git commit -m "Unward" src
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rbehrends (CC: @stevelinton @ChrisJefferson ) regarding your first two points, it seems you already added --enable-guards / --disable-guards (which is perfectly fine by me). Note sure around which inline functions you want(ed?) to add #ifdef HPCGAP, but I guess it's fine, too.

Regarding how to proceed: My main concern about merging this PR is how it would affect people who are using HPC-GAP right now, but who do not yet know about unward and thus are not using it. Will they get a horribly broken HPC-GAP, or just one that's more or less on the same level as what we have in master right now anyway?

If it was the latter, there is indeed nothing major stopping this from being merged (my remaining remarks are all very minor nitpicks and should be easy to resolve).

However, it seems to be the former (starting GAP built from your branch w/o running unward causes an immediate segfault due to violated guards). That makes me somewhat more reserved about the idea of merging this as-is. It'd be still OK by me if there was a very concrete timeline for it, and also a plan B should we determined that result of the merge is unusable. But my main concern is who the "couple of people" would be we'd ask to test it, and what makes us think they'd be more likely to test it then, than they would be to test this PR / your branch if we asked them explicitly to test it?

Until we come up with a good strategy to deal with that, it might indeed be best to split this commit into two: a non-controversial commit we could merge right away (which would contain the TLS changes, the renaming of HAVE_NATIVE_TLS to USE_NATIVE_TLS and a bunch of other minor changes), and the rest (mainly: the guards changes).

@rbehrends
Copy link
Contributor Author

I'll address the remaining comments tomorrow, but about how to go forward with a merge: we could flip the default for --disable-guards around to have guards disabled by default; this would preserve the status quo and you would have to ./configure --enable-hpcgap --enable-guards and run unward to get guards semantics. Once we merge the unward changes, we could then enable guards by default.

@fingolfin
Copy link
Member

Still has guards failures here:

gs:=[ExtraspecialGroup( 27, 3 ), DihedralGroup(32)];
G:=DirectProduct(gs);
A:=AutomorphismGroup(G);

gives me

Error, Attempt to read object 4666793776 of type plain list without having read access in
  AddDictionary( dict, img, Length( orb ) ); at GAPROOT/lib/grpmat.gi:306 called from
DoSparseLinearActionOnFaithfulSubset( grp, OnRight, sort ) at GAPROOT/lib/grpmat.gi:494 called from
NicomorphismOfGeneralMatrixGroup( G, false, false ) at GAPROOT/lib/grpmat.gi:535 called from
...

Using lldb to figure out where that plist comes from revealed that it's cacheBag; specifically, the failure is here:

   1874
   1875	#ifdef HPCGAP
   1876	    // reset the method cache if necessary
-> 1877	    if (ELM_PLIST(cacheBag, 1) != methods) {
   1878	        Obj * cache = BASE_PTR_PLIST(cacheBag);
   1879	        cache[0] = methods;
   1880	        memset(cache + 1, 0, SIZE_OBJ(cacheBag)-2*sizeof(Obj));

The method cache is per-thread, so the access should be fine; but I think we never set a region on those lists, hence the access error. @rbehrends can you take a look?

@rbehrends
Copy link
Contributor Author

@fingolfin Yes, I'm working on that. There are a couple more (and oddly enough, one I see only locally, not on Travis).

@rbehrends rbehrends force-pushed the hpcgap-alt-tls branch 2 times, most recently from 15d1803 to cd73786 Compare October 23, 2019 13:33
@rbehrends rbehrends changed the title WIP: alternate TLS implementation for HPC-GAP. WIP: HPC-GAP improvements (different guard implementation, TLS configurability) Oct 23, 2019
@fingolfin
Copy link
Member

There is still an error in the same place as before (the AutomorphismGroup example I pasted above), but now for another reason; it only happens if autpgrp is loaded. When this happens, for some reason, we run into this code inside the GAP library, in lib/pquot.gi:505:

    wt := qs!.collector![SCP_WEIGHTS];
    wt{n+[1..nhwg]} := [1..nhwg] * 0 + class;  # <- guard is violated here

The guard is violated because wt is in the read-only region. I have not yet figured out why it is read-only (as I am about to change trains now, so no time)

@fingolfin
Copy link
Member

(This also means there is an easy workaround, but we should first better understand what's going on, so we don't break multi threaded stuff)

diff --git a/lib/pquot.gi b/lib/pquot.gi
index e30e26d0c..6c39cf5c8 100644
--- a/lib/pquot.gi
+++ b/lib/pquot.gi
@@ -502,8 +502,9 @@ UpdateWeightInfo := function( qs )
     class := class + 1;
     qs!.collector![SCP_CLASS] := class;
 
-    wt := qs!.collector![SCP_WEIGHTS];
+    wt := ShallowCopy(qs!.collector![SCP_WEIGHTS]);
     wt{n+[1..nhwg]} := [1..nhwg] * 0 + class;
+    qs!.collector![SCP_WEIGHTS] := MakeImmutable(wt);
 
     avc2 := [1..n]+0;
     for g in [1..n] do

@rbehrends
Copy link
Contributor Author

There is also a problem with the FactInt package in tst/testinstall/hpc/tasks.tst in lines 35 and 37. The calls to CallAsTask( NF, 7, [ 1 ] ) and CallAsTask( NF, 7, [ 1, 2 ] ) error out because the function IntegerFactorization in FactInt-1.6.2/lib/general.gi accesses a global cache (FACTINT_SMALLINTCACHE), which is created in the main thread as thread-local data.

@fingolfin
Copy link
Member

@rbehrends some optimization opportunities:

  • CURR_FUNC accesses only data and objects which are thread local, so perhaps we can turn off ward in it?
  • likewise, NARG_FUNC, CALL_0ARGS, etc. all shouldn't need ward (and hence no UNSAFE variants) because function and function bodies are essentially immutable after being created. Well, with some caveats:
    • for T_FUNCTION, all members of struct FuncBag with exception of the name are constant after creation; for operations, a few of the additional members (such as the methods cache, and the extra field) are not, but that's easy to take care of.
    • for T_BODY, the content is constant except that the profiling code wants to set bits in statements/expressions once they are executed for the first time. But surely we can take care of that with an atomic operation?

My suggestion here would be to change T_FUNCTION and T_BODY to not be in the public region by default; instead, we create them thread local, and then once we are done setting them up, change them to the readonly region (resp. for T_BODY, we might even be ok marking them as read-only, with the caveat that atomically setting those profiling bits must still work)

@rbehrends
Copy link
Contributor Author

This is a general optimization opportunity for all objects that are immutable. Currently, we have these tagged via MakeBagTypePublic() and that means that any read or write guard check exits early via the fast path. However, even the fast path incurs some overhead, so we should definitely excise the unnecessary checks.

I would, however, propose to postpone such optimizations until everything is working and integrated, unless you think there are good reasons to include them in this PR?

@fingolfin
Copy link
Member

No, there is no need to include those in this PR.

BTW, there are more reasons for this than optimization: also safety / correctness. Right now, we create a lot of bags in the public region, then modify them extensively. This is OK because those objects are not visible to the outside while we are modifying them; and then later on, we never modify them (or only in a "safe") fashion. Or so we claim, but there is no good way to actually check that. But if we instead created those objects thread local, and only made them public once the setup is complete, then we could (at least in principle) forbid ADDR_OBJ / PTR_BAG for objects in the public region (raising a guard error or so), only allowing CONST_ADDR_OBJ / CONST_PTR_BAG and UNSAFE_ADDR_OBJ / UNSAFE_PTR_BAG. Of course the devil is in the details, but that would be the direction I'd try to follow.

@coveralls
Copy link

coveralls commented Oct 27, 2019

Coverage Status

Coverage increased (+0.0002%) to 84.513% when pulling e1913b8 on rbehrends:hpcgap-alt-tls into 3cca408 on gap-system:master.

@fingolfin
Copy link
Member

Reminder: once this PR is merged, we need to update https://github.com/gap-system/pkg-ci-scripts/blob/master/build_gap.sh so that unward is run when testing against HPC-GAP.

@rbehrends rbehrends changed the title WIP: HPC-GAP improvements (different guard implementation, TLS configurability) HPC-GAP: Allow configuration of TLS and improve native TLS performance on macOS Nov 13, 2019
@fingolfin fingolfin merged commit 3111e67 into gap-system:master Nov 14, 2019
@fingolfin fingolfin removed the do not merge PRs which are not yet ready to be merged (e.g. submitted for discussion, or test results) label Nov 14, 2019
@ThomasBreuer ThomasBreuer added the release notes: not needed PRs introducing changes that are wholly irrelevant to the release notes label Feb 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release notes: not needed PRs introducing changes that are wholly irrelevant to the release notes topic: HPC-GAP Issues and PRs related to HPC-GAP topic: kernel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants