HPC-GAP: Allow configuration of TLS and improve native TLS performance on macOS #3502

rbehrends · 2019-06-14T12:04:15Z

This is a tracking PR for progress on HPC-GAP changes. Aside from older changes in this PR, we also aim at improving our implementations of thread-local storage, specifically so that HPC-GAP can be used as a library.

Our current implementation requires HPC-GAP to be run as the main program, which in particular prevents its use as a library. Alternative portable implementations are considerably slower on systems that do not properly support efficient thread-local storage (such as macOS and CygWin).

We're attempting to address this problem in two ways.

One issue is implementations that rely on pthread_getspecific() directly or indirectly do not treat such function calls as pure functions and thus do not get used for common subexpression elimination or hoisted out of loops. As those systems all rely on gcc/clang, we can tag pthread_getspecific() as __attribute__((pure)), which will immediately lead to a measurable performance improvement.

On macOS, we can go one step further. While there is no guaranteed stable ABI, pthread_getspecific() has consistently been implemented by movq %gs:(,%rdi, 8), %rax. We therefore can verify that pthread_getspecific() actually uses this implementation (memcmp() against the machine code) and then use a direct assembly language implementation that has nearly the same performance characteristics as our current implementation.

In addition, we store the key in a static variable that is initialized with an __attribute__((constructor)__ function. This is generally slightly faster than putting it in a non-static variable on macOS in particular, as we're saving one level of indirection.

Also, in the worst case, this code functions with just pthread_getspecific() and no other mechanism available, it is just slower.

Todo:

Configure check for __attribute__((pure)) and definition of a macro for it.
Configure check for __attribute__((constructor)) and definition of a macro for it.
Configure check for it being acceptable to define __pthread_getspecific() as pure.
Configure check for macOS supporting this particular implementation.
Potentially distinguish between building HPC-GAP as a library and a standalone program?
Efficient Cygwin implementation.

fingolfin · 2019-06-17T12:09:11Z

My understanding is that this contains PR #2845, correct? If so, it really would be nice if we could get that PR finished and merged, before building new stuff atop of it...

rbehrends · 2019-06-17T20:24:49Z

Strictly speaking, this PR does not actually depend on the other one, but is committed on top of that branch for convenience.

I am already working on cleaning up the other PR as part of my current work to make multi-threaded Oscar a possibility, so that we can move on from that.

configure.ac

etc/ci-prepare.sh

src/hpc/tls.h

src/hpc/tlsconfig.h

configure.ac

.gitignore

configure.ac

src/hpc/threadapi.c

src/precord.c

rbehrends · 2019-09-30T09:53:15Z

This is now a revised version that mirrors the Mono logic. It can also now be more easily extended to equally support Windows implementations.

fingolfin

The commit 'Update .gitignore and .travis.yml files from master.' needs a new commit message.

Otherwise, this is shaping up well, but we need to figure out how to get it into a mergeable state, see my comments.

src/hpc/tls.h

fingolfin · 2019-10-02T11:41:36Z

src/system.h

+#endif
+
+
+


Very unimportant suggestion, feel free to ignore: but I try to restrict to at most 2 empty lines in a row.

Suggested change

configure.ac

fingolfin · 2019-10-02T13:36:40Z

etc/ci-prepare.sh

+  unward/bin/unward --inplace src
+  # commit the result to prevent the docomp test from triggering
+  git commit -m "Unward" src


So this means people who want to use the code in this PR still need to have unward, and run it, and then have lots of modifications in their tree. That's IMHO the major blocker for merging this PR, and until we resolve it, there is no way this can be merged. So I see two ways forward:

We run unward and commit the modifications once and for all, and move on (my understanding always was that this is the plan?)

We determine that we will have to keep using unward for the foreseeable. In that case, we will have to use it like ward, and thus need to restore the parts of the build system which transparently handled applying ward (now: unward) to sources followed by compilation (instead of working inlaces, the results would be placed into gen/src/; and build rules would ensure gen/src/FOO.c gets rebuilt whenever src/FOO.c or one of its header dependencies is touched.

Thing is, we need people to use this code so it gets tested, so keeping it in a PR indefinitely is not a good way forward.

So, this is something that I don't have a good solution for myself. My plan was to not put the unward changes in for a transition period. The reason is that if we need to change something about how they work (e.g. include an option for disabling guards for performance reasons), they mess up diffs and git blame, as they're pretty extensive.

But that leads to the problem that if you build HPC-GAP without unward, you get a near immediate crash, as an unnecessary read or write guard fails (one that gets removed by unward).

Possible solution: add a --without-guards configure option. That allows builds to (unsafely) work without unward for the transition period and would also later allow HPC-GAP to be built without guards for performance reasons if performance is critical.

If we were to put unward into the build system, it already supports a -o <dir> option to place the transformed sources (and all untransformed ones, too) into a separate directory, say gen/. This takes less than a second, is incremental (it doesn't overwrite files that didn't get changed and therefore doesn't change their time stamps) and thus make-friendly, and is in principle possible.

But this would again introduce an external dependency, which I'm not very enthusiastic about (even though unward, unlike ward, is itself dependency-free and only requires a C++-98 or later compiler). A large part of the raison d'être for unward is, after all, that it can be dropped from the build system eventually and does not continue to be a maintenance burden.

This sounds as if you are aiming for option 1 eventually. But the big question then is: when would we be ready for it? What's the hard criterium for making that decision? I keep coming back to this very same questions since when you first started talking about unward. You keep saying it's meant to be used once, but also keep saying that it may need to be changed, and then re-run. Unless there is a clear goal as to when it's "done", I don't see a solution out of this dilemma, other than to accept that the notion of "unward will be run once and then discarded" is hopeless. Hence option 2.

At the same time, it's also totally unclear to me what major changes to unward output you think might be needed?

There are two possible changes that I'm contemplating right now and that may require changes in the output:

Allow disabling of guards at compile time.

Add #ifdef HPCGAP around the inline functions, so that they don't get compiled when not built with HPC-GAP. This was unnecessary when they were still static, but now they're not static anymore and should not show up in the code regardless.

My goal with unward was to keep the choice open-ended. This is why I specifically supported both the following options:

Make the process idempotent, so that you can rerun unward if necessary.

Add a -o option to allow for use as a preprocessor in the build system (i.e. like ward).

This means that we wouldn't be forced by design to pick one or the other.

As for the process, I was hoping to basically get the following done:

Add the current PR (cleaned up) to GAP.

Ask a couple of people to test that it works with unward (there aren't many who are using HPC-GAP, but there are a couple at least).

If that's the case, then after a short delay (couple of weeks) add the unward changes in a separate patch.

I am totally fine with rolling them into the current PR, too, if you think that's the best way forward. In the end, you probably have a better view of the big picture as far as GAP is concerned.

@rbehrends (CC: @stevelinton @ChrisJefferson ) regarding your first two points, it seems you already added --enable-guards / --disable-guards (which is perfectly fine by me). Note sure around which inline functions you want(ed?) to add #ifdef HPCGAP, but I guess it's fine, too.

Regarding how to proceed: My main concern about merging this PR is how it would affect people who are using HPC-GAP right now, but who do not yet know about unward and thus are not using it. Will they get a horribly broken HPC-GAP, or just one that's more or less on the same level as what we have in master right now anyway?

If it was the latter, there is indeed nothing major stopping this from being merged (my remaining remarks are all very minor nitpicks and should be easy to resolve).

However, it seems to be the former (starting GAP built from your branch w/o running unward causes an immediate segfault due to violated guards). That makes me somewhat more reserved about the idea of merging this as-is. It'd be still OK by me if there was a very concrete timeline for it, and also a plan B should we determined that result of the merge is unusable. But my main concern is who the "couple of people" would be we'd ask to test it, and what makes us think they'd be more likely to test it then, than they would be to test this PR / your branch if we asked them explicitly to test it?

Until we come up with a good strategy to deal with that, it might indeed be best to split this commit into two: a non-controversial commit we could merge right away (which would contain the TLS changes, the renaming of HAVE_NATIVE_TLS to USE_NATIVE_TLS and a bunch of other minor changes), and the rest (mainly: the guards changes).

src/gasman.h

fingolfin · 2019-10-16T08:16:02Z

@rbehrends the commit here is titled "Make HPC-GAP work without Ward." but that's not really what this is about, given that we already are not using ward...

src/hpc/tls.h

configure.ac

src/hpc/thread.c

fingolfin · 2019-10-16T08:53:32Z

etc/ci-prepare.sh

+  unward/bin/unward --inplace src
+  # commit the result to prevent the docomp test from triggering
+  git commit -m "Unward" src


@rbehrends (CC: @stevelinton @ChrisJefferson ) regarding your first two points, it seems you already added --enable-guards / --disable-guards (which is perfectly fine by me). Note sure around which inline functions you want(ed?) to add #ifdef HPCGAP, but I guess it's fine, too.

Regarding how to proceed: My main concern about merging this PR is how it would affect people who are using HPC-GAP right now, but who do not yet know about unward and thus are not using it. Will they get a horribly broken HPC-GAP, or just one that's more or less on the same level as what we have in master right now anyway?

If it was the latter, there is indeed nothing major stopping this from being merged (my remaining remarks are all very minor nitpicks and should be easy to resolve).

However, it seems to be the former (starting GAP built from your branch w/o running unward causes an immediate segfault due to violated guards). That makes me somewhat more reserved about the idea of merging this as-is. It'd be still OK by me if there was a very concrete timeline for it, and also a plan B should we determined that result of the merge is unusable. But my main concern is who the "couple of people" would be we'd ask to test it, and what makes us think they'd be more likely to test it then, than they would be to test this PR / your branch if we asked them explicitly to test it?

Until we come up with a good strategy to deal with that, it might indeed be best to split this commit into two: a non-controversial commit we could merge right away (which would contain the TLS changes, the renaming of HAVE_NATIVE_TLS to USE_NATIVE_TLS and a bunch of other minor changes), and the rest (mainly: the guards changes).

rbehrends · 2019-10-16T09:36:30Z

I'll address the remaining comments tomorrow, but about how to go forward with a merge: we could flip the default for --disable-guards around to have guards disabled by default; this would preserve the status quo and you would have to ./configure --enable-hpcgap --enable-guards and run unward to get guards semantics. Once we merge the unward changes, we could then enable guards by default.

fingolfin · 2019-10-21T05:25:21Z

Still has guards failures here:

gs:=[ExtraspecialGroup( 27, 3 ), DihedralGroup(32)];
G:=DirectProduct(gs);
A:=AutomorphismGroup(G);

gives me

Error, Attempt to read object 4666793776 of type plain list without having read access in
  AddDictionary( dict, img, Length( orb ) ); at GAPROOT/lib/grpmat.gi:306 called from
DoSparseLinearActionOnFaithfulSubset( grp, OnRight, sort ) at GAPROOT/lib/grpmat.gi:494 called from
NicomorphismOfGeneralMatrixGroup( G, false, false ) at GAPROOT/lib/grpmat.gi:535 called from
...

Using lldb to figure out where that plist comes from revealed that it's cacheBag; specifically, the failure is here:

   1874
   1875	#ifdef HPCGAP
   1876	    // reset the method cache if necessary
-> 1877	    if (ELM_PLIST(cacheBag, 1) != methods) {
   1878	        Obj * cache = BASE_PTR_PLIST(cacheBag);
   1879	        cache[0] = methods;
   1880	        memset(cache + 1, 0, SIZE_OBJ(cacheBag)-2*sizeof(Obj));

The method cache is per-thread, so the access should be fine; but I think we never set a region on those lists, hence the access error. @rbehrends can you take a look?

rbehrends · 2019-10-21T08:43:08Z

@fingolfin Yes, I'm working on that. There are a couple more (and oddly enough, one I see only locally, not on Travis).

fingolfin · 2019-10-23T17:11:11Z

There is still an error in the same place as before (the AutomorphismGroup example I pasted above), but now for another reason; it only happens if autpgrp is loaded. When this happens, for some reason, we run into this code inside the GAP library, in lib/pquot.gi:505:

    wt := qs!.collector![SCP_WEIGHTS];
    wt{n+[1..nhwg]} := [1..nhwg] * 0 + class;  # <- guard is violated here

The guard is violated because wt is in the read-only region. I have not yet figured out why it is read-only (as I am about to change trains now, so no time)

fingolfin · 2019-10-23T17:12:58Z

(This also means there is an easy workaround, but we should first better understand what's going on, so we don't break multi threaded stuff)

diff --git a/lib/pquot.gi b/lib/pquot.gi
index e30e26d0c..6c39cf5c8 100644
--- a/lib/pquot.gi
+++ b/lib/pquot.gi
@@ -502,8 +502,9 @@ UpdateWeightInfo := function( qs )
     class := class + 1;
     qs!.collector![SCP_CLASS] := class;
 
-    wt := qs!.collector![SCP_WEIGHTS];
+    wt := ShallowCopy(qs!.collector![SCP_WEIGHTS]);
     wt{n+[1..nhwg]} := [1..nhwg] * 0 + class;
+    qs!.collector![SCP_WEIGHTS] := MakeImmutable(wt);
 
     avc2 := [1..n]+0;
     for g in [1..n] do

rbehrends · 2019-10-23T21:26:38Z

There is also a problem with the FactInt package in tst/testinstall/hpc/tasks.tst in lines 35 and 37. The calls to CallAsTask( NF, 7, [ 1 ] ) and CallAsTask( NF, 7, [ 1, 2 ] ) error out because the function IntegerFactorization in FactInt-1.6.2/lib/general.gi accesses a global cache (FACTINT_SMALLINTCACHE), which is created in the main thread as thread-local data.

fingolfin · 2019-10-24T10:05:48Z

@rbehrends some optimization opportunities:

CURR_FUNC accesses only data and objects which are thread local, so perhaps we can turn off ward in it?
likewise, NARG_FUNC, CALL_0ARGS, etc. all shouldn't need ward (and hence no UNSAFE variants) because function and function bodies are essentially immutable after being created. Well, with some caveats:
- for T_FUNCTION, all members of struct FuncBag with exception of the name are constant after creation; for operations, a few of the additional members (such as the methods cache, and the extra field) are not, but that's easy to take care of.
- for T_BODY, the content is constant except that the profiling code wants to set bits in statements/expressions once they are executed for the first time. But surely we can take care of that with an atomic operation?

My suggestion here would be to change T_FUNCTION and T_BODY to not be in the public region by default; instead, we create them thread local, and then once we are done setting them up, change them to the readonly region (resp. for T_BODY, we might even be ok marking them as read-only, with the caveat that atomically setting those profiling bits must still work)

rbehrends · 2019-10-24T12:08:59Z

This is a general optimization opportunity for all objects that are immutable. Currently, we have these tagged via MakeBagTypePublic() and that means that any read or write guard check exits early via the fast path. However, even the fast path incurs some overhead, so we should definitely excise the unnecessary checks.

I would, however, propose to postpone such optimizations until everything is working and integrated, unless you think there are good reasons to include them in this PR?

fingolfin · 2019-10-25T10:48:25Z

No, there is no need to include those in this PR.

BTW, there are more reasons for this than optimization: also safety / correctness. Right now, we create a lot of bags in the public region, then modify them extensively. This is OK because those objects are not visible to the outside while we are modifying them; and then later on, we never modify them (or only in a "safe") fashion. Or so we claim, but there is no good way to actually check that. But if we instead created those objects thread local, and only made them public once the setup is complete, then we could (at least in principle) forbid ADDR_OBJ / PTR_BAG for objects in the public region (raising a guard error or so), only allowing CONST_ADDR_OBJ / CONST_PTR_BAG and UNSAFE_ADDR_OBJ / UNSAFE_PTR_BAG. Of course the devil is in the details, but that would be the direction I'd try to follow.

coveralls · 2019-10-27T12:17:09Z

Coverage increased (+0.0002%) to 84.513% when pulling e1913b8 on rbehrends:hpcgap-alt-tls into 3cca408 on gap-system:master.

fingolfin · 2019-10-30T11:39:04Z

Reminder: once this PR is merged, we need to update https://github.com/gap-system/pkg-ci-scripts/blob/master/build_gap.sh so that unward is run when testing against HPC-GAP.

rbehrends added topic: HPC-GAP Issues and PRs related to HPC-GAP do not merge PRs which are not yet ready to be merged (e.g. submitted for discussion, or test results) topic: kernel labels Jun 14, 2019

rbehrends force-pushed the hpcgap-alt-tls branch 2 times, most recently from d99b7e5 to 83ee9a8 Compare September 19, 2019 10:53

fingolfin reviewed Sep 19, 2019

View reviewed changes

ChrisJefferson reviewed Sep 20, 2019

View reviewed changes

configure.ac Outdated Show resolved Hide resolved

rbehrends force-pushed the hpcgap-alt-tls branch 2 times, most recently from 7eb6ab0 to f132902 Compare September 20, 2019 11:54

fingolfin reviewed Sep 20, 2019

View reviewed changes

.gitignore Outdated Show resolved Hide resolved

fingolfin reviewed Sep 20, 2019

View reviewed changes

configure.ac Outdated Show resolved Hide resolved

fingolfin reviewed Sep 20, 2019

View reviewed changes

src/hpc/threadapi.c Outdated Show resolved Hide resolved

fingolfin reviewed Sep 20, 2019

View reviewed changes

src/precord.c Outdated Show resolved Hide resolved

rbehrends force-pushed the hpcgap-alt-tls branch from f132902 to de252d1 Compare September 30, 2019 09:46

rbehrends force-pushed the hpcgap-alt-tls branch 4 times, most recently from bd302fb to 51464e8 Compare October 2, 2019 12:12

fingolfin reviewed Oct 2, 2019

View reviewed changes

rbehrends force-pushed the hpcgap-alt-tls branch 4 times, most recently from 791ac67 to d9203e4 Compare October 7, 2019 12:45

fingolfin reviewed Oct 16, 2019

View reviewed changes

rbehrends force-pushed the hpcgap-alt-tls branch from d9203e4 to 4167799 Compare October 17, 2019 12:00

rbehrends force-pushed the hpcgap-alt-tls branch from 4167799 to 5e87c82 Compare October 18, 2019 19:35

rbehrends force-pushed the hpcgap-alt-tls branch 2 times, most recently from 15d1803 to cd73786 Compare October 23, 2019 13:33

rbehrends changed the title ~~WIP: alternate TLS implementation for HPC-GAP.~~ WIP: HPC-GAP improvements (different guard implementation, TLS configurability) Oct 23, 2019

rbehrends mentioned this pull request Oct 23, 2019

Garbage collection causes crash in HPCGAP gap-packages/JupyterKernel#109

Closed

fingolfin mentioned this pull request Oct 23, 2019

WIP: New HPC-GAP guard checking without using ward #2845

Open

fingolfin force-pushed the hpcgap-alt-tls branch from 9db1e68 to 3fd7fb5 Compare October 27, 2019 11:16

rbehrends force-pushed the hpcgap-alt-tls branch from e36cbb5 to 92c3aa3 Compare November 7, 2019 10:21

This was referenced Nov 7, 2019

lib: make GroupByQuotientSystem more robust #3731

Merged

HPC-GAP: Fix CloseInput() in threads without stdin #3732

Merged

Configurable TLS support for HPC-GAP.

e1913b8

rbehrends force-pushed the hpcgap-alt-tls branch from 92c3aa3 to e478074 Compare November 8, 2019 13:34

rbehrends mentioned this pull request Nov 11, 2019

WIP: Making HPC-GAP work with Julia #3734

Open

6 tasks

fingolfin approved these changes Nov 13, 2019

View reviewed changes

rbehrends changed the title ~~WIP: HPC-GAP improvements (different guard implementation, TLS configurability)~~ HPC-GAP: Allow configuration of TLS and improve native TLS performance on macOS Nov 13, 2019

rbehrends force-pushed the hpcgap-alt-tls branch from e478074 to e1913b8 Compare November 13, 2019 15:37

fingolfin merged commit 3111e67 into gap-system:master Nov 14, 2019

fingolfin removed the do not merge PRs which are not yet ready to be merged (e.g. submitted for discussion, or test results) label Nov 14, 2019

fingolfin mentioned this pull request Oct 13, 2020

HPC-GAP: Add configure option to set HAVE_NATIVE_TLS? #1885

Closed

ThomasBreuer added the release notes: not needed PRs introducing changes that are wholly irrelevant to the release notes label Feb 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HPC-GAP: Allow configuration of TLS and improve native TLS performance on macOS #3502

HPC-GAP: Allow configuration of TLS and improve native TLS performance on macOS #3502

rbehrends commented Jun 14, 2019 •

edited

Loading

fingolfin commented Jun 17, 2019

rbehrends commented Jun 17, 2019

rbehrends commented Sep 30, 2019

fingolfin left a comment

fingolfin Oct 2, 2019

fingolfin Oct 2, 2019

rbehrends Oct 4, 2019

fingolfin Oct 4, 2019

rbehrends Oct 4, 2019

fingolfin Oct 16, 2019

fingolfin commented Oct 16, 2019

fingolfin Oct 16, 2019

rbehrends commented Oct 16, 2019

fingolfin commented Oct 21, 2019

rbehrends commented Oct 21, 2019

fingolfin commented Oct 23, 2019

fingolfin commented Oct 23, 2019

rbehrends commented Oct 23, 2019

fingolfin commented Oct 24, 2019

rbehrends commented Oct 24, 2019

fingolfin commented Oct 25, 2019

coveralls commented Oct 27, 2019 •

edited

Loading

fingolfin commented Oct 30, 2019

		#endif

HPC-GAP: Allow configuration of TLS and improve native TLS performance on macOS #3502

HPC-GAP: Allow configuration of TLS and improve native TLS performance on macOS #3502

Conversation

rbehrends commented Jun 14, 2019 • edited Loading

fingolfin commented Jun 17, 2019

rbehrends commented Jun 17, 2019

rbehrends commented Sep 30, 2019

fingolfin left a comment

Choose a reason for hiding this comment

fingolfin Oct 2, 2019

Choose a reason for hiding this comment

fingolfin Oct 2, 2019

Choose a reason for hiding this comment

rbehrends Oct 4, 2019

Choose a reason for hiding this comment

fingolfin Oct 4, 2019

Choose a reason for hiding this comment

rbehrends Oct 4, 2019

Choose a reason for hiding this comment

fingolfin Oct 16, 2019

Choose a reason for hiding this comment

fingolfin commented Oct 16, 2019

fingolfin Oct 16, 2019

Choose a reason for hiding this comment

rbehrends commented Oct 16, 2019

fingolfin commented Oct 21, 2019

rbehrends commented Oct 21, 2019

fingolfin commented Oct 23, 2019

fingolfin commented Oct 23, 2019

rbehrends commented Oct 23, 2019

fingolfin commented Oct 24, 2019

rbehrends commented Oct 24, 2019

fingolfin commented Oct 25, 2019

coveralls commented Oct 27, 2019 • edited Loading

fingolfin commented Oct 30, 2019

rbehrends commented Jun 14, 2019 •

edited

Loading

coveralls commented Oct 27, 2019 •

edited

Loading