grpauto.tst spends a lot of time in CmpVec8BitVec8Bit #1868
Comments
As to the last part: allowing matrices whose rows are not separate objects is of course a central aspect of the matrixobj project, so I consider this another strong motivation to work on it.
I'm a bit surprised that this ends up as a timing issue, since the code (I think I'm talking about the same code you observe) uses element lists to save on element tests. Roughly, a generating set for a subgroup is built up in a spinning algorithm; the basic operation is thus to test membership of an element in the group built so far. If the group has at most 50000 elements, an element list is used instead of an algorithmic element test (which would mean translating to a permutation group, i.e. first obtaining the permutation image of a matrix). Now a lot of such tests are done (and indeed their number can be reduced, see 16de963, which might help a bit).

What I'm confused about is what would be better:

a) An algorithmic element test: I cannot see how this would improve matters, as it involves a lot of arithmetic.

b) Hashing: the current hash routine for matrices builds on FF vectors, doing arithmetic on vector entries and calling `Int` to move into characteristic zero. I cannot imagine this being faster. Alternatively, one could find a few positions in which all the matrices differ and extract these into a vector for hashing.

c) Convert the matrices to vectors by concatenation?

d) The planned matrix objects?

If someone has a better idea on how to speed up these redundancy tests, I'd be interested to hear it.
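Option (b)'s alternative and option (c) both amount to flattening a small matrix into something cheap to compare and hash. As a hedged illustration (the function name and layout are hypothetical, not GAP kernel API), packing the entries of a small matrix over GF(q) in base q yields a single machine word, so equality becomes an integer compare and the key itself can serve as a hash value:

```c
#include <stdint.h>

/* Hypothetical helper, not a GAP kernel function: encode the n entries of a
 * small matrix over GF(q) (given row-major as integers 0..q-1) in base q.
 * Two matrices are equal iff their keys are equal, and the key itself can
 * be used as a hash value.  The caller must ensure q^n fits in 64 bits. */
static uint64_t pack_small_mat(const uint8_t *entries, int n, unsigned q)
{
    uint64_t key = 0;
    for (int i = 0; i < n; i++)
        key = key * q + entries[i];   /* positional base-q encoding */
    return key;
}
```

For a 2x2 matrix over GF(7) the key ranges over 7^4 = 2401 values, i.e. it fits in 12 bits, matching the estimate elsewhere in this thread.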
Just some rambling and somewhat incoherent thoughts and remarks: I don't think the options @hulpke lists are necessarily mutually exclusive.

For top performance, I believe one must force all matrices to use the same representation. This makes it possible to effectively cache the comparison function, and to use a single hash function (which then does not need to stay compatible with other representations, and thus can be super tuned). Also, using "true" compressed matrices will help in certain cases, e.g. the 2x2 GF(7) matrices Steve mentioned, which would fit into 12 bits. (If the matrix rep is tight like that, it might of course be better to use a balanced tree or something similar, like a sorted list, instead of a hash table. Heck, if a matrix only takes up 12 bits, we can have a 2^12 = 4096-bit set (512 bytes) which records for each possible matrix whether it's in there or not; but of course having such tiny vector spaces is a very special case.)

Not needing to store separate row objects would save indirections, cache misses and lots of overhead: on a 64-bit system, every positional object has a 16-byte header, plus another 8 bytes for a pointer to the type, plus the master pointer; and the bag size is effectively rounded up to a multiple of 8. So even if the actual data is just 2 bytes, we store a whopping 32 bytes (nicely confirmed by …). By squashing the rows together, and thus having them effectively share a single type, we avoid lots of indirections and can also squeeze more data into the cache.

Of course one can then still easily beat this with custom code, which e.g. implements a "list of compressed matrices of identical type", where each matrix is not a full object/bag, but rather just a few reserved bytes in a single larger object; i.e. something vaguely like a "tensor", but optimized for this use case, i.e. looking up whether a given matrix is contained (accessing "subobjects" would be relatively slow, as a new "real" matrix object would have to be created on the fly). Anyway, this would save a lot of additional overhead and ensure everything is stored together, hopefully in the same cache line(s). Of course for matrices of larger dimension resp. over larger fields, different approaches might be better; there probably is no single "perfect" data structure for this, and (as so often) various heuristics would have to be applied to choose a (hopefully) good one.
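The 12-bit observation above can be made concrete: 2x2 matrices over GF(7) have 7^4 = 2401 possible entry tuples, so a packed key fits in 12 bits, and a bit set over all 2^12 keys answers membership in O(1) without touching any matrix object. A minimal sketch (illustrative only, not proposed kernel code; it assumes matrices have already been packed into 12-bit integer keys):

```c
#include <stdint.h>

/* Illustrative membership bit set for 2x2 matrices over GF(7).  Each
 * matrix is assumed to be packed into a 12-bit key (7^4 = 2401 < 2^12).
 * One bit per possible matrix: 4096 bits = 512 bytes total. */
static uint8_t seen[512];   /* 4096-bit set, zero-initialised */

static void bitset_add(uint16_t key)
{
    seen[key >> 3] |= (uint8_t)(1u << (key & 7));
}

static int bitset_has(uint16_t key)
{
    return (seen[key >> 3] >> (key & 7)) & 1;
}
```

The whole structure is half a kilobyte, so repeated membership tests stay in cache; as noted above, this only works for such tiny vector spaces.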
So "Convert the matrices to vectors by concatenation" would be helpful right now (it avoids the row objects), and with … If you then ensure all those concatenated vectors / matrix objects use the same rep, you can use a hash table (or whatever) with a fixed and efficient hash function, and also look up and reuse the equality method, thus avoiding method lookup.
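A sketch of the "reuse the equality method" idea: once every matrix in a list is known to share one representation, the comparison function can be resolved a single time and then called through a cached pointer for every element test. The names below are illustrative stand-ins for GAP's method dispatch, not real API:

```c
#include <stdlib.h>
#include <string.h>

typedef int (*cmp_fn)(const void *a, const void *b);

/* toy comparison for a representation storing each matrix in 4 bytes */
static int cmp_bytes4(const void *a, const void *b)
{
    return memcmp(a, b, 4);
}

/* stand-in for a per-representation method table lookup (hypothetical) */
static cmp_fn lookup_cmp_method(void)
{
    return cmp_bytes4;
}

/* membership test in a sorted list via the cached comparator: no
 * per-element method dispatch, just direct calls through the pointer */
static int in_sorted_list(const unsigned char *key,
                          const unsigned char (*list)[4], size_t len,
                          cmp_fn cmp)
{
    return bsearch(key, list, len, 4, cmp) != NULL;
}
```

The point of the sketch is only that the dispatch cost is paid once per list, not once per comparison.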
Just to report a few more experiments: Alexander's patch does reduce the time in CmpVec8Bit… a bit, but adds much more time in permutation multiplication. I tried some changes to … This does support the thrust of the matrix objects work to allow a whole matrix (or even a whole set of matrices) to reside in one object.
I think at this point we're in a situation where a significant improvement would come from hashing. The current hashing of matrices over a finite field happens through (probably badly written) ad-hoc library code.

If such kernel functionality becomes available, I'd be happy to rewrite this (and other) matrix group code to improve performance as has been outlined.
Yes, this is definitely something I want too, and part of our (well, mine at least) motivation for creating the …
Just adding this as an issue to raise and share the results of some poking around I've been doing. No immediate problem as such.
Observed behaviour
Profiling teststandard with gprof reveals that a lot of time is being spent in the kernel function `CmpVec8BitVec8Bit`: about 3 billion calls, taking a total of 138s on mandel. A few weeks ago there were some other outliers as well, like `QuoIntPerm2` and `ElmsBlist`, but we've dealt with those.

Anyway, the comparisons turn out to come from `testextra/grpauto.tst`, where part of the automorphism group code (`CompatiblePairs`) ends up doing a lot of calls like `m in sorted_list_of_matrices`, where `m` is typically a small compressed matrix and `sorted_list_of_matrices` is also often quite short. The kernel delegates from `\in` to `Position`, which ends up doing a binary search, which in turn does lots of comparisons of compressed matrices, which then compare the rows.

Each individual call is not very expensive, but it involves accessing quite a lot of bits of data that may be in different cache lines: master pointer, object, master pointer of type and type id for the matrices, then master pointer, object, master pointer of field info and field info for each vector. Once it's done that, it compares one or two bytes of data in each row; if they're not equal it also has to convert the entries in the final byte into FFEs and compare those. All in all a lot of work and cache pressure to compare two 2x2 matrices over GF(7), although experiments suggest that this is still about 30% faster than comparing uncompressed matrices.
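The cost analysis above can be illustrated by contrasting the two storage layouts in C (purely illustrative structs, not the actual GAP kernel representations): per-row pointers force one indirection, and a potential cache miss, per row, while a flat buffer compares with a single `memcmp`:

```c
#include <stddef.h>
#include <string.h>

/* Purely illustrative structs, not the actual GAP kernel layouts. */
struct row_mat  { int nrows; int rowbytes; unsigned char **rows; };
struct flat_mat { int nrows; int rowbytes; unsigned char *data;  };

/* One pointer dereference (and potential cache miss) per row. */
static int cmp_row_mat(const struct row_mat *a, const struct row_mat *b)
{
    for (int i = 0; i < a->nrows; i++) {
        int c = memcmp(a->rows[i], b->rows[i], (size_t)a->rowbytes);
        if (c != 0)
            return c;
    }
    return 0;
}

/* All data contiguous: a single memcmp over the whole matrix. */
static int cmp_flat_mat(const struct flat_mat *a, const struct flat_mat *b)
{
    return memcmp(a->data, b->data, (size_t)a->nrows * (size_t)a->rowbytes);
}
```

Note that `memcmp` on packed bytes yields a consistent total order, though not necessarily the same order as comparing field elements entry by entry as the kernel comparison does; for sorted-list or hash-set membership any consistent order suffices.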
By the way, if you're trying to replicate this, you want to remove `tst/testextra/switch_obj.tst`, which does a LOT of forced garbage collections and distorts the profile. In the rest of the tests, GC time is generally under 10%, a little more for permutation-heavy code.

Expected behaviour
I guess the expected behaviour is to run faster, but it's far from clear what the best way to achieve that is. It may be that `CompatiblePairs` can use a hash table instead, which means many matrices would not need to be accessed at all because they have the wrong hash value, or it could use a permutation representation for smaller matrix groups, or avoid the work some other way.

A nicer solution would be an alternative compressed matrix representation designed for small matrices. If you don't mind the rows of the matrix not being objects, then you can design much nicer representations, but this might break quite a lot of code like `for row in mat do for elm in row do .... od; od;`.

`\in` could also be a bit cleverer about not delegating to `Position` for very short lists, which might save one or two comparisons for lists of length 1 and maybe 2.
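As a sketch of the hash-table suggestion (illustrative C, with all names hypothetical; it assumes matrices have already been packed into integer keys as discussed earlier in the thread): an open-addressing set rejects most non-members after a single hash computation and one probe, without comparing any stored matrix data beyond its key:

```c
#include <stdint.h>

/* Illustrative open-addressing hash set over packed matrix keys.  CAP must
 * be a power of two, comfortably larger than the number of keys stored
 * (fine for the small groups discussed here). */
#define CAP 1024
static uint64_t slots[CAP];
static uint8_t  used[CAP];

static uint64_t hash_key(uint64_t k)
{
    return k * 0x9e3779b97f4a7c15u;   /* cheap multiplicative mix */
}

/* Insert key; returns 1 if it was new, 0 if already present.  Membership
 * testing is the same loop: a miss usually terminates on the first probe. */
static int set_insert(uint64_t key)
{
    uint64_t i = (hash_key(key) >> 32) & (CAP - 1);
    while (used[i]) {
        if (slots[i] == key)
            return 0;                 /* already present */
        i = (i + 1) & (CAP - 1);      /* linear probing */
    }
    used[i] = 1;
    slots[i] = key;
    return 1;
}
```

Compared with the binary search over a sorted list of matrix objects described above, the expected number of full comparisons per lookup drops from O(log n) to close to one, at the cost of keeping the table sparse.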