faster Toom-Cook 3 algorithms #265
Conversation
If what you are saying about the cutoff points holds, this is good 👍 How is the test coverage? Maybe add a tiny test specifically calling s_mul_karatsuba and s_mul_toom and comparing the results for very large numbers?
I have updated my home machine, if you remember, and since then some things work… a bit differently, to say the least. Can you do me the favour and, if you have access to Bionic (virtual won't work here), run
EDIT II: So it is my algorithm?
The results of the run with profiling are almost exactly what I got with my old system. That's why I proudly talked about halving the cutoffs.
Mmh…
*scratches head*
Was already working on it (it's all in
Force-pushed bd63311 to 787da71
Got to something acceptable by reducing the number of variables to one, but it is still not known what the precise cause is. New version:
Force-pushed 787da71 to b9dbf76
@czurnieden This is also still work in progress? I added a few labels to the PRs; I hope you don't mind. Feel free to remove them if you think you are done!
@czurnieden I saw you mentioned gprof: I recommend perf these days, which doesn't require recompilation and has other advantages AFAIK.
I'm waiting for the approval & merging of #280 to rebase on that.
Am just used to it. Don't need more than a short glimpse at the output to see what the problem/bottleneck is. And recompiling? I don't use it very often, once or twice a year max, so that doesn't matter.
Ah, ok. I profile much more often and like having these CPU perf-counter infos. I like to have timings per asm instruction from time to time ;)
Force-pushed 38a137f to 8b5f4a6
With
Oh, that dreaded
Force-pushed 8b5f4a6 to 5303d8e
I thought about removing it before. But it blows up the size of my commits. I already changed like 16K lines :-P
Force-pushed 5303d8e to bfe08db
It is a "nice to have", but I'm pretty sure any of the better IDEs around are able to do it, and do it much better. It should be generated on demand only. I'll take a look at it later.
I would keep generating it and just gitignore it and remove it from git. But we also have tommath_class.h, which is also bad, but not as bad.
Was quickly done, see #283
Yeah, true.
@minad Fine for you to merge even with the multi-clears instead of a single one?
There is no
probably
Force-pushed bfe08db to bacbacb
Force-pushed bacbacb to 8056a60
Just a technical question: I wonder about that. What is causing the slowdown? More allocations? If you hit the malloc cache, I wouldn't expect such a big difference. I would have expected all the time to be eaten up by the numerics code.
I know that realloc is a bit slower than a fresh allocation, but this is quite a lot, indeed!
Well, realloc can be much faster, but if you really have to reallocate, it is as slow as free+malloc. However, with modern allocators, allocations are usually taken from a "hot" cache and are not that slow. In particular, in your tuner you are doing the same thing many times over, so there should be plenty of touched memory ready.
If allocations are really an issue and mp_init_size_multi would solve that, why not consider it?
Okay, if both of you are wondering about this, I'll merge later.
I haven't really looked, I am only speculating (I can look later). If you allocate a lot of memory before you free anything again, it will get very slow since the OS has to supply the memory. But in this case you have many alloc/free pairs, right? How much memory are we talking about, btw? Are your blocks larger than page size, or maybe multiple times larger? Then malloc could directly do mmap/munmap. In particular, munmap is unbelievably slow on Linux (I had a case where I did the munmapping in a background thread for that reason). Did you strace? Edit: So if we do mp_init+mp_grow, it means a very small allocation plus maybe a very large one. The result should be fully dominated by mp_grow, since mp_init will just give to and take from the cache.
This change brought the tune build job down seriously, from >7 min to <2 min? That's pretty impressive!
I had no time yet to look into it; most of this is based on educated guesses. I also need to read through the documentation, because I just updated from the old LTS to the new one, that is, I skipped one LTS release. New kernel, new libc, new everything. I'll need a couple of days to get sufficiently up to date with all the gory details, and there are a lot of entrails to dig through ;-)
You complained, I delivered ;-)
Oh no. Now we have to reorder the Travis YAML again.
should I merge or leave this open until the investigation by @czurnieden is finished? |
My only complaints are those many labels and mp_clear. But I don't want to stand in the way here. The investigation and maybe other improvements could still happen after merging.
@czurnieden you can decide whether you want to put in more effort or if it's fine like this for you, please rebase and remove the WIP label if you think it's ready :)
and that's also true
Force-pushed 8056a60 to 228e487
It works, it is fast, it is ugly.
But I just reba…oh. Re. the investigation: it seems as if the new glibc does indeed have a new memory management. I will compile an old one (glibc-2.23) for a short test and dig through the sources if that test supports my suspicions. That will take a couple of days; compiling glibc alone is no small task. But the result of that investigation won't be of any great relevance for my code (it would make me wonder if it were at all), and if it is: "maybe other improvements could still happen after merging". The main branch here has not been named "develop" without a reason ;-)
@czurnieden I did an experiment concerning allocations and running your tuner. I managed to get rid of all allocations in mp_init and of all allocations smaller than MP_MIN_PREC. With allocations: 13.2s, without allocations: 12.4s. So all these (often needless) small allocations we are doing during init cost us around 5%. This is roughly what I tried:

```c
typedef struct {
    mp_digit sd[MP_STATIC_PREC]; /* use this for small allocs */
    int used, alloc;
    mp_sign sign;
    mp_digit *dp;
} mp_int;
```
Then I did another experiment where I replaced malloc with a small shim which takes allocations smaller than MP_PREC from a simple cache (a singly linked list). With that I got 12.6s in one run. Not as good, but still like 5%. I will test LD_PRELOAD'ing jemalloc next. So far my experiment at least shows that glibc malloc is not as good as I expected for our usage pattern. I thought the glibc malloc cache was better.
jemalloc: 11.8s. But take the numbers with a grain of salt; no statistics, just a rough picture. jemalloc, it seems, has a very good cache. Better at least than my naive one, and MP_STATIC_PREC gives nearly no advantage.
The Toom-Cook algorithms do not need a lot of small allocations, the numbers are quite large, so your 5% is probably all you can get without specialized memory management. TomsFastMath could make much better use of a built-in memory manager, but there is already a discussion about that, if I remember correctly. But I did as you told me and ran some tracing programs. The most useful for C&P'ing the results here is probably
The first four lines, the ones with the
Single rounds with the version of tc3mul and tc3sqr I started this branch with:
As you can see: TC3-sqr is quite an outlier.
Single rounds with the version of tc3mul and tc3sqr I started this branch with:
Here Kara-mul is the outlier. With the exception of Kara-mul, it is also the result I got with my old system. (Got glibc 2.23 compiled, *phew*: no difference. Will try it with an old kernel later.) With the current versions and
Single rounds with mp_init_size:
And with mp_grow
Current version with LTM -O2, tune -O3
Same with -O2 (both)
No difference between [LTM -O2, tune -O3] and both -O2, all well inside the error bars. It would have made me wonder if there were, but now I have it that
The most significant outlier is TC3-sqr within the very first measurement, so I'll start from there.
I don't understand what you are saying here. I just tested a somewhat specialized memory management and I tested jemalloc (probably the best allocator around) and got between 5% and 10% speedup. At some point it is not possible to win more, since what is expensive is not the allocation but touching the memory (stalling the CPU and, in the worst case, mapping pages which have not been touched before). And interestingly, this is aligned with what you are measuring. There is the memset call which costs a lot of time and for some reason does not appear anymore in the last version. We are zeroing the upper digits quite often and this could be a problem. At least some of the memsetting can now be disabled by disabling #255. EDIT: I did another experiment where I disabled many of the MP_ZERO_DIGITS memsets (not all of them, such that things don't break; many functions rely on the upper bits being zero). I kept calloc in mp_init_size, however. With this I got around 4% to 6% speedup when running tune_it. This is only due to touching the memory less often. EDIT2: But if I recall correctly, I never saw any such outliers as you do. It is always ordered like kmul<ksqr<tmul<tsqr, with somewhat reasonable numbers (like a factor of two between the different muls and the different squares, respectively). What I found are relatively consistent speedups of around 5% for the various memory tunings (reduced number of mallocs, better jemalloc, touching memory less often).
With "specialized" I meant an ingrained MM, carefully tailored to LTM, where LTM can even give feedback. Feedback would be useful in the example of Toom-Cook, where the (max) size of the variables is known in advance and they could tell the MM to keep their memory until the end, overwrite instead of doing deep copies, et cetera. And the other way around, too: rewrite LTM to make the MM's life easier. I would expect a gain of 30% and more in the case of Toom-Cook. There are most likely a lot of even sillier things possible, but I'm not as up to date with the current state of MM as you are; it is a long time ago (10, 15 years?) that I had to write my own malloc because what was available at that time was not usable for my application.
Now that I had the time to look it up: there are others that claim to be better ;-) But all of those benchmarks you can find with Google's help only answer the question "Better for what?", and that's the crux of the biscuit here.
It is the
Yeah, but LTM is used for cryptography, and security does not come without some cost. We can't allow data to linger in memory for longer than absolutely necessary, so that's not an option.
That is a brutal one and I'm pretty sure you would recall it ;-) The rest is all explainable; the changes in speed are expected and I can even explain the difference between
I could spend a whole day testing everything and the kitchen sink, but I think it's time now for the assembler dump.
What one could try is some kind of region allocator which just reserves a region at the beginning of some bigger operation and never frees; everything is freed at the end of the operation. This is the only optimization I can imagine which could give an advantage. But instead of optimizing allocation behaviour, the lowest-hanging fruit I am seeing is reducing the number of MP_ZERO_DIGITS calls.
I am pretty sure this won't help us much. The mallocs usually already use fine-grained buckets and you might even be able to tune them via some settings. The only thing you could win is by inlining some kind of fast-path code: access the bucket if there is something in it and, in the slow path, call into the allocator. I think the Linux kernel slab allocator does something like this.
Sure, it always depends on the use case. But according to my recent experience, these general-purpose allocators are already so good at selecting and returning a chunk of memory that accessing the memory is the far more expensive operation. But good allocators ensure that you allocate recently used chunks first, since they are already hot in the cache etc. Where these allocators usually differ is in their fragmentation, their long-runtime behaviour and their behaviour with many threads etc. Not relevant to our microbenchmarks ;)
We could at least make it an option. It is like a 5% cost which is basically added to every fundamental operation. From the crypto/mitigation point of view, I think it would also be ok to just leave the upper digits uninitialized or let them contain garbage, since the mp_ints probably live only for a short period of time and the lower digits still contain the probably much more valuable data. What is important is zeroing the memory before free (and this is what we are doing). But if we wanted to make it configurable, it would be a bit hard. I tried it, but I am not familiar with how most of the algorithms work. I disabled some ZERO_DIGITS, checked valgrind for uninitialized accesses and looked a bit around the control flow, while trying not to break the testsuite.
Maybe enable MP_USE_MEMSET for benchmarking? This makes it clear where time is just eaten up by zeroing. I introduced this option for exactly that reason.
If you are digging around further, give perf a try :)
The list was just an example of what can be done if one is willing to go to the extreme, not a suggestion for LTM ;-)
No, LTM is not only meant to be used in cryptographic programs, it actually gets used in them, so no fiddling here, at least not with the default behaviour. The cost involved for that kind of security is normally accepted as unavoidable if it is not too much, and 5-10% is "not too much" for most people. (Thanks to MS? ;-) ) It is tempting, yes, but I wouldn't touch cryptographic code without a very good reason. We can only make it optional and offer some kind of
Yeah, a lot of work.
For the autotuner only:
For the testsuite (run 10 times)
A bit worse with the system memset if the numbers are very large, but that drops into insignificance if we assume that the testsuite comes closer to normal use than the autotuner. You have the overhead of the function call and the checks & balances, but memset is also highly optimized.
For sure I would only make it optional. However, I would actually like to know what the reason is for doing this zeroing. What kind of attacks are prevented by it? If an attacker has the possibility to read leaked data from the upper digits, I assume the probability is high that they can also read leaked data from the lower digits. The usual mitigation of zeroing before free (or even that a hardened malloc always zeros everything) is in place in order to prevent leaking data from a critical, cryptographic and hopefully safe part of an application to a potentially unsafe part of the application. Hardened allocators randomize the allocated addresses or segregate allocations into some kind of pools in order to prevent leaking data from one part to the other. Zeroing the memory reduces the time in which the critical data is around. This is helpful if the data could leak in other ways (swapping or some more subtle channels?). But in the lower digits we still have the finished computation lying around, so this argument seems kind of moot. And against swapping you should use locked memory, for example; we are also not doing that by default. For all these reasons I concluded that zeroing the upper digits is kind of useless from a security PoV, in contrast to zeroing before free.
I don't think there will be a big difference. For small memsets we lose, for big ones we win using memset. But during profiling, using a memset call helps to see the associated costs; otherwise the cost will just be assigned to the function doing the memset manually. This is what I meant. However, as we've seen before, at -O3 loops are replaced by memset and vice versa (or by an inline rep stosb or something similar).
Because we have absolutely no information about, nor safe influence on, the environment, we have to assume the worst (within reasonable limits, of course).
If you have an operation that tries to remove information, parts of that information can still reside in the upper digits. Oversimplified example: 123456789 % 12345701 = 12345480 (overwriting:
Ah, now I see what you meant.
Yes, probably this is a strong enough argument. But given the other facts about memory allocation and unlocked memory, it might still not be worth it. But as I said, if we were to remove ZERO_DIGITS, we would make it optional either way. But I won't do it any time soon, since I don't think it is a pressing issue and the potential for breaking stuff is too great.
Just brought the implementation up to the state of the art; nothing else changed.
Halves the cut-off point and therefore brings it into the expected region. The expected region is a bit less than twice the Karatsuba cut-off point, and that is where it is now.
This code was part of #227.