Alternative algorithm in mp_n_root_ex #202
Conversation
Force-pushed from 4e87a21 to 6c33402
The most important question here is: which of the algorithms is constant time? That's what matters for dropbear, heimdal and all the other LTC users.
(It doesn't get used in LTC, at least I couldn't grep it.)
Right, it was only heimdal that used it.
Now you made me curious. I couldn't find it used in their master branch, but that doesn't mean anything, of course. Do you know more? It's not that much work to change it, but if it needs to be cryptographically secure it needs to be correct in every aspect, and checking that is a lot of work.
Force-pushed from e8ab5a9 to 5f7aa4b
Now you made me curious as well, and I just realized that I should've checked some stuff a long time ago already...
Rebased and updated because of #294. I took the liberty of posting a question at crypto.stackexchange to find out whether such a plain truncating nth-root function is used in any cryptographic algorithm that is used by more than half a dozen people. I still doubt it. A modular nth-root maybe, but a vanilla one? No, I don't think so.
@czurnieden Does it make sense to cut out the part which is only useful for very big numbers (Halley)? Maybe we can avoid introducing too much complexity. I would suggest replacing the original algorithm with the new one if it proves worth it.
Like I suggested? Cut out completely or just bracket out? The cutoff on my machine is somewhere in the million-bit range, and at that level you certainly wouldn't use LTM anymore, so I would go for a complete cut.
Yes, complete cut please. Does the same apply to FFT multiplication in your other PR, btw?
That's a good criterion :) The same criterion can be used to argue against crypto-sensitive code in LTM. I think LTM is good for crypto exploration but not for production use. At least it shouldn't be used in cases where side channels and timings are relevant. LTM is used for big integers, for example, in language runtimes (tcl, perl6, probably others, ...) where small integers are promoted. This is also my main use case.
Was sure you would say that, so I already did it tonight ;-)
The six-million-bit cutoff? You can take a look at bncore.c where I listed some of the cutoffs. The MP_28BIT on 64-bit arch mix I use has a cutoff of 78,400 bits for multiplication (with TC4 and TC5) and 126,000 bits for squaring. It is much higher with the default 60-bit large mp_digit, but that is mainly caused by the five 12-bit slices. It is much lower with four 15-bit slices (not yet implemented), but the upper cutoff would be quite low, too. The nth-root function, on the other hand: who needs it regularly with such large numbers? My question at stackexchange is now nearly a day old, so it can be safely assumed that there is no mainstream cryptographic use for a plain nth-root. But I will prepare one to be easily C&P'd into it if that changes (and publish it as a gist or something similar in the meantime). There is also optional (preprocessor) code in it for single digits. It is only useful with large MP_xBIT and more expensive to test (needs an extra round in Travis) because it is a compile-time option.
What I would suggest: we make it a separate function mp_mul_fft and then the user has the choice, since in this case the cutoffs are so far off. In particular, you can assume that users working with very large numbers don't use MP_8BIT but rather MP_64BIT. Furthermore, I think it is also reasonable to assume that people needing the best performance for such numbers, for scientific use, either have their own code or simply use GMP. I see the advantage of tommath in the freedom and simplicity of embedding it in other projects, since there is no linking restriction.
Yes. And if a use case comes up we might hear about it from a user.
Hmm, I would recommend you don't spend time on something which is probably not useful. But use your own judgement, I have no idea ;)
I would probably cut it too. I think single-digit optimizations are mostly useful for binary operations, since these are the ones where the big+small case should be optimized.
All functions in LTM are separate functions?
The values for MP_28BIT on a 64-bit arch fit well into the series kara->tc3->tc4->tc5; the soft-128-bit ruins it a bit for MP_60BIT, admittedly.
Probably MP_28BIT, at least that's what I do. It works well on 32-bit (MP_16BIT is also an option for 32-bit, but the difference is small) and is faster with larger numbers on 64-bit.
The size of the stripped GMP libgmp.a is ~850kb + ~500kb libgmp10 + ~50kb libgmp++4 + libc; libtommath.a with tc4, tc5 and FFT stripped is ~260kb and depends on libc, which can be one of the mini-libcs. Actually, we need only ( My own version of LTM, with a lot more bells and whistles, some of which don't even belong in it, is just shy of 600kb, stripped. I think we could deliver a bigint library for general use (no frills, no tchotchkes) that can handle the occasional large number (and print it, too) in much less than 500kb stripped, and that with a no-headache license and full backwards compatibility, too. I'm pretty sure a viable alternative to GMP with such advantages would "sell" well. Additions needed to reach that goal:
That's all. About an additional 150-200k in size (probably less, haven't measured it). Optional without tc4, tc5, fft, and Newton division (B&Z is needed for fast read/print) to save about 72kb (measured). (This should go to the ABI thread, I think)
I wasted so much time in my youth and still have no children ;-)
OK.
I mean: do not add a branch in mp_mul. s_mp_karatsuba_mul, for example, is not a separate function in the sense of the public API; this is what I meant. However, since we have compile-time configuration, people could also decide whether they want it to be included. But be aware that the bundled lib in distros, for example, mostly uses the default compile settings, so good choices should be made here.
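For illustration only, the two options being discussed look roughly like this. This is a sketch, not LTM code: FFT_MUL_CUTOFF and s_mp_fft_mul are made-up names, the other helper and cutoff names are simplified from what bncore.c and the current sources actually use, and sign handling is omitted.

```c
/* Option 1 (sketch): one more size-based branch inside mp_mul, mirroring
   the existing Karatsuba/Toom-Cook selection driven by bncore.c cutoffs. */
int mp_mul(const mp_int *a, const mp_int *b, mp_int *c)
{
   int min_used = (a->used < b->used) ? a->used : b->used;

   if (min_used >= FFT_MUL_CUTOFF) {          /* hypothetical new cutoff  */
      return s_mp_fft_mul(a, b, c);           /* hypothetical internal fn */
   }
   if (min_used >= TOOM_MUL_CUTOFF) {
      return s_mp_toom_mul(a, b, c);
   }
   if (min_used >= KARATSUBA_MUL_CUTOFF) {
      return s_mp_karatsuba_mul(a, b, c);
   }
   return s_mp_mul_digs(a, b, c, a->used + b->used + 1);
}

/* Option 2 (sketch): no branch in mp_mul at all; expose the routine as its
   own public function and let callers with huge operands decide. */
int mp_mul_fft(const mp_int *a, const mp_int *b, mp_int *c);
```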
I generally agree with your sentiment. However, I don't think the points you are mentioning are the important selling points. What I consider important:
But concerning adoption, not too much should be expected. This library is already pretty mature and I am unsure whether there are many new potential users. It would have been interesting to get the library inside huge projects like Firefox, for example (JS bigint support), but I've just read recently that they wrote their own bigint lib (or took the one from V8, whatever). Concerning GMP: it probably has a better design since it separates non-allocating low-level functions from high-level functions. In tommath we only have high-level functions. Maybe if fastmath were merged in partially, things could be improved. But this is a mega project.
Here I think the medium-sized values are more important for general use, so FFT is probably special use.
You are right, this is probably a weak point as of now.
I don't think there is an issue here?
Nice to have, but this is already something for special use. May I ask what you use the library for? Are you doing some experiments, or is it mostly just for fun implementing nice algorithms? Maybe people are also using the library for some number-theoretic experiments, but in that case they would probably go with the fastest lib around.
Watch out! Incoherent ramblings of a grumpy old man below! ;-)
That would make it completely useless. No normal user starts to hack LTM to get FFT in.
And not to forget to add the branches in
What does "clean" mean? Do you have an objective criterion? "Safe"? For doing what? Or: against what? Or: for whom? And what is "obvious" depends highly on the point of view, sadly. I'm a member over at the Stackoverflow/Stackexchange community and I've seen things… *sigh*
"Correctness" is something that can be proven (to some extent). Some of the algorithms used in LTM are, some are not. Do you plan to fill in the blanks? (Restrictions of decidability hold, of course.) "Good testing"? What does "good" mean for you in this context? What must be tested and what can be dropped? We cannot test everything, that is for sure. We can only test all edge-cases, and even the first "all" in this sentence is quite bold. Some of the current tests just test some random values and hope for the best (and I don't exclude myself here), and for others there is no properly documented analysis of the edge-cases. There should be no need to run several hundred thousand tests against another bigint implementation, assuming that the other bigint implementation is correct and that several hundred thousand tests will give any insight into the correctness of the innards of LTM. We are only wasting Travis' computing time with this spray&pray.
We don't need "more tests", we just need the number of tests that will cover all edge-cases plus and minus one. Not more and definitely not less. That is a lot of work including documentation, and I know how much you, no, we all hate documentation if we have to write it ourselves ;-) but we won't come to an end otherwise. There is also a bit of pragmatism allowed, though, e.g. using Valgrind to test the use of malloc and friends.
Yes, it is possible. But easy? Really? That is a well-known problem and many have tried their hand at it, including me. So I can tell you that autoconf and friends are hated by all, and cmake is not flexible enough (might have changed in the meantime, I haven't followed its growth in the last years). BTW: can a stripped-down LTM still be tested with
What do you mean by "fast"? It is fast enough for me, with the exceptions I listed and wrote code for to make LTM fit my purposes. "Competitive"? Why should we compete with GMP on speed?
Then why are you refurbishing it? ;-)
You'll never know until you try.
It is a different design, yes, but is it better? I don't know. One of the (many) reasons LTM is designed that way is readability, because it was also meant to have some sort of pedagogical utility. I don't know what came first, the book or LTM, but they belong together.
What are "medium-sized values"? General use is exactly that: general, so nothing is more important than anything else, or you are back to some kind of specialization.
(probably a misunderstanding, I mean number conversion)
So, no primesieve? Factorizing is a by-product once you have a fast sieve, and adding a second-level algorithm like e.g. Pollard's rho makes it fast enough for numbers with a second-largest factor of up to about 80 bits. Doesn't sound like much, but it is quite useful and adds just 3k (stripped) to the lib.
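As a concrete illustration (a sketch against the public LTM API, not the actual sieve code): one bounded Pollard-rho attempt with Floyd cycle detection and f(x) = x^2 + 1, meant to run after trial division by the sieved small primes.

```c
#include <tommath.h>

/* Hypothetical helper, not LTM code: one iteration of x <- x^2 + 1 (mod n). */
static int rho_step(mp_int *x, const mp_int *n)
{
   int err;
   if ((err = mp_sqrmod(x, n, x)) != MP_OKAY) return err;  /* x = x^2 mod n */
   return mp_add_d(x, 1u, x);                              /* x = x + 1     */
}

/* Returns MP_OKAY and stores a proper factor of n in "factor" on success,
   MP_VAL if no factor was found within max_iter iterations. */
static int pollard_rho(const mp_int *n, mp_int *factor, int max_iter)
{
   mp_int x, y, d;
   int err, i;

   if ((err = mp_init_multi(&x, &y, &d, NULL)) != MP_OKAY) return err;
   mp_set(&x, 2u);
   mp_set(&y, 2u);
   mp_set(&d, 1u);

   for (i = 0; (i < max_iter) && (mp_cmp_d(&d, 1u) == MP_EQ); i++) {
      if ((err = rho_step(&x, n)) != MP_OKAY) goto LBL_ERR;   /* tortoise  */
      if ((err = rho_step(&y, n)) != MP_OKAY) goto LBL_ERR;   /* hare ...  */
      if ((err = rho_step(&y, n)) != MP_OKAY) goto LBL_ERR;   /* ... twice */
      if ((err = mp_sub(&x, &y, &d)) != MP_OKAY) goto LBL_ERR;
      if ((err = mp_abs(&d, &d)) != MP_OKAY)     goto LBL_ERR;
      if ((err = mp_gcd(&d, n, &d)) != MP_OKAY)  goto LBL_ERR; /* gcd(|x-y|, n) */
   }

   /* a proper factor was found iff 1 < d < n */
   if ((mp_cmp_d(&d, 1u) == MP_GT) && (mp_cmp(&d, n) == MP_LT)) {
      err = mp_copy(&d, factor);
   } else {
      err = MP_VAL;
   }

LBL_ERR:
   mp_clear_multi(&x, &y, &d, NULL);
   return err;
}
```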
Several things. It was a blueprint for a Javascript bigint lib; it is one of the voters, together with GMP and Landon C. Noll's calc, to decide whether a result may be wrong; the base for rational fixed-point math (it was a bit faster than Hauser's softfloat on that special FPU-less hardware, and not many special functions were needed in the first place, so it was quickly written); a base for simple cryptography (here the license was the main reason); the bigint base for my own little language; a CPU thermal test: computing, printing and rereading 10^6! in a loop on every (isolated) core to get a pretty picture with the FLIR; and whatever I have forgotten, too ;-).
Most likely not. Experimental math aims for correctness, not the highest speed. And in this day and age you just throw more metal at it—it's cheap now and won't strain the grant too much.
Force-pushed from 66024f8 to 09bcb79
I mean you expose both mp_mul and mp_mul_fft.
Sure, these are subjective criteria. But for safety there are well-defined criteria, e.g. type safety, which prevents the user from misusing functions. This helps correctness. Unfortunately C is not very good at that, but we can do better, as for example in #258. Clean is maybe not a good word, think coherent etc. For example, having separate functions for doing different things. ioctl is a very bad design in that respect, and mp_expt_ex with its fast parameter is bad. Furthermore, I added the two's-complement functions like mp_tc_and while there still was mp_and. This is not good. What we have now is better. But I cannot define these things, I can only give you examples.
Obvious in the sense of types. I think
I would love to have things proven. There are different levels of verification. You are probably talking about proving that certain algorithms work theoretically, but there is also the possibility to prove the code top to bottom. But then program extraction would be the better solution instead of hand written code. Basically what you did by hand from pari/gp was program extraction. But I don't mean those things. This is too much effort and completely out of scope. We should aim for correctness by having good test coverage. Randomised tests are good for that but surely not perfect.
Edge cases + randomised tests are pretty good imho. I would also like to have tested the special code paths separately, as you added in #280.
I disagree.
Edge cases are important, we agree. But randomised tests are actually good if you roughly verify the structure of the algorithm at the same time.
Valgrind and sanitizers are very important since they also test for out-of-bounds access etc.
Why is it not possible to test TC separately for a wide range of numbers?
Hehe, I think so, and I am actually using this for embedding. It is a quite manual process, however; I mostly have to hand-select the functions I want and use tommath's dependency graph.
You mean configuring a stripped-down version? I am rolling my own, and in my experience it is quite easy.
Potentially yes, if we added the ifdefs or MP_HAS from #262 to the test suite. In my case I have a separate test suite, however, which tests my stuff and the selected LTM functions on a higher level.
Could be, I have no idea. But in GMP there are some asm routines; we are not going to beat those. fastmath was made specifically because ltm is not fast.
Maybe not. Probably not. I just named criteria that people commonly use. I am using ltm despite it probably being slower.
I don't know, but this is certainly a good selling point for ltm. Another one I've read is that ltm does fine-grained error checking. For example, GMP exits in case of malloc failure.
Because I needed additional functionality and then I realised that I am not perfectly happy with the current status. But things got better, I am much happier now ;)
100% agree. The pedagogical utility is great. But users who just use the library don't care about that I guess. So there are different groups of users who want and need different things.
This is fuzzy. I am not writing precisely here, but why should I? Our discussion is also quite loose about various things :)
Ah ok. You are right.
I mentioned factorizing and primesieve together for that reason. I think they should be there, but I am not sure how many users need it. Has someone asked for this? But if you have a use case for a project of yours, it is already a perfectly valid reason to include it here. And maybe users will come, so I am not against this.
Cool, many things :) So it is basically your go-to big-number library? Then I guess you already have a good picture of what you want inside the lib. I am mostly using it for language bigints. Maybe it will also get inside GHC.
It highly depends on the experiment you are doing. If the computation is slow and scales with the size of your experiment, then you probably go with a fast lib. For example, if you want to scan large ranges of numbers, e.g. these seti@home-style experiments.
The smoothing needs some, maybe even all, of the Toom-Cook functions, so it might get a bit more complicated with duplicated code and all. But I haven't even started yet; my original code for smoothing is quite a tangled mess and needs a good thorough clean-up first (I also have a hook in the balancing, which shouldn't be there). For a bit more information about the concept see e.g. Brent et al., "Faster Multiplication in GF(2)". Yes, it is a simple concept, but the devil is, as always, in the details.
Ah, I see. Good.
Mainly the first, but there are no new algorithms in LTM[1], so there is literature to point to. Proving standard C is not possible; it would need a subset of C. I think you have something like CADiZ in mind? (ISO/IEC 13568 and its corrigenda.) Or the LaTeX Z-tool. Not to forget good old Spivey. But you are right, of course, that would be way out of scope.
What do these random tests find that the "edge-cases plus one and minus one" approach (meaning one test for the edge case that must succeed, one beyond it that must fail, and one inside that must also succeed) does not? Yes, there might be a typo in the software causing it to fail (that is, give wrong output) for one or more singular inputs. With the very large domains of an arbitrary-precision function (infinite in theory, quite finite in practice) the chance of finding that input with a handful of random tests is very low. It silences the conscience, yes, but it is a deceitful silence in my opinion. If the domain is small enough it should be tested completely. I did that for my version of mp_radix_size, where I tested all possible inputs for 8, 16, and 32 bit. Problem: you can't do that in Travis, at least not regularly.
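For illustration, an exhaustive run of the 16-bit case could look like the sketch below (not the test that was actually used; it assumes mp_set_int is available and that mp_radix_size reports the buffer size including the terminating NUL):

```c
#include <stdio.h>
#include <stdlib.h>
#include <tommath.h>

/* Sketch of an exhaustive test: compare mp_radix_size (base 10) against the
   decimal length computed by snprintf for every 16-bit value. */
int main(void)
{
   mp_int a;
   unsigned long v;
   int size;

   if (mp_init(&a) != MP_OKAY) return EXIT_FAILURE;

   for (v = 0; v <= 0xFFFFuL; v++) {
      int expected = snprintf(NULL, 0, "%lu", v) + 1;   /* digits + NUL */
      if (mp_set_int(&a, v) != MP_OKAY)            goto FAIL;
      if (mp_radix_size(&a, 10, &size) != MP_OKAY) goto FAIL;
      if (size != expected) {
         fprintf(stderr, "mismatch at %lu: got %d, expected %d\n",
                 v, size, expected);
         goto FAIL;
      }
   }
   mp_clear(&a);
   return EXIT_SUCCESS;

FAIL:
   mp_clear(&a);
   return EXIT_FAILURE;
}
```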
Why? I'm not against testing against a different bigint implementation, far from it; I even use it myself, although for single results only and against two others, not just one. But we can't do that in Travis, because Noll's … I'm only against the spray&pray method. We need to sit down and evaluate which tests are actually needed, and if you want to do some random tests on top, OK, but just running 333,333 tests (a special reason for this number?) is wasteful.
It is, of course; it was just an example of untested edge-cases where it is debatable whether they really need to be tested, because they don't get used in the first place, but it would be possible to press them into accepting that input. An edge-case, so to say ;-)
They have the advantage of far fewer architectures they need to run on.
That's quite a lot. But if you look at their Toom-Cook thresholds… no, these are the bare numbers; they lack a multiplier, and that multiplier is the x in Toom-Cook x-way. So their TC3 mul threshold for x86_64 is 81*3 = 243, which is what we have (they use the full 64 bits of the limbs), and they really go to town with their optimizing. Plain C is probably the language with the highest number of supported architectures, and that is also a large advantage. For the price of some higher runtime, admittedly.
It gets easier if you do it regularly, but it is quite overwhelming for a first-time user.
No, make a configurator where the user can tick off some boxes for what they need and the script automatically calculates all dependencies and generates all files, such that the only thing the user has to do at the end is to type … It doesn't have to be some shiny Qt GUI, a simple ncurses frontend is more than sufficient, just … A large part of the logic is already implemented, we can (ab)use the …
That was one of the more serious questions, because we have no control over what gets stripped by the users.
It would be nice if the error got an address, that is, which function deep down failed first. The input is not always reproducible, such that a simple … Now that you have introduced …
The source of featuritis; it's easy to fall for it, I have some experience in that regard ;-)
At least one: me ;-)
Not that I know of, no. [sieve] The sieve adds about 3.8k (stripped) to the lib, 7k with Pollard-rho, minus the 2.7k (stripped) for the prime table that isn't needed anymore. I hoped that it would speed up prime generation, but it was neutral at best, slightly slower at worst.
No, I use all three. They all have different pros and cons, I decide per use-case.
Wouldn't call these experimental math ;-) So, enough of my rantings for tonight ;-)
[1] My nth-root one is, in theory, but it is just a port of a known technique for floats to integers. There might be a paper in it, but that would need a complete number-theoretical analysis. *phew*
[2] seti@home started in 1999, folding@home in 2000
I think it gives you a good assurance of correctness. It is called QuickCheck testing. I am pretty sure there are papers which measure the effectiveness of QuickCheck testing in the presence of a rough verification of the structure of the algorithm. This will capture more cases than, for example, only testing edge cases, where we could just return 0 on all other inputs ;) Not a measurement, but a real-world experiment, where they formally verified or performed a Haskell->Coq translation: https://arxiv.org/abs/1803.06960. However, they found zero bugs, since the lib had already been quick-checked.
I fully agree. For small inputs, exhaustive. For large inputs randomised. There are also fuzzers which take the randomised approach btw and find weird code paths using a guided randomised method. This way they can somehow make the huge search tractable.
Yes, but we also take what we have. If you want to spend time on it, feel free :) We could however consider restricting the test-vs-mtest travis jobs to the develop branch as I did for valgrind? Would you agree with that?
Yes, I would like such edgy cases to be tested even if they are not used :)
:D
I think that using only C, one can go quite fast. I think you can get within a factor of 1.5x or 2x with plain C if the code is written properly. It could, however, be that there are routines where SIMD is used to great effect. In that case one would have to write the C code in a specific way, and maybe the autovectorizer would not get it. The reason why I think they are using asm is also that they want to use the full limbs and have to catch the overflow. And __builtin_add_overflow is not standard C. Could that be? Hmm, well, you can also catch overflow in standard C, but I think it won't be fast enough.
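For reference, the two ways of catching the carry discussed here look roughly like this (a minimal sketch, not LTM code; __builtin_add_overflow is a GCC/Clang extension, not standard C):

```c
#include <stdint.h>

/* Portable C: unsigned addition wraps modulo 2^64, so the carry out of
   a + b can be recovered by checking whether the sum wrapped around. */
static uint64_t add_carry_portable(uint64_t a, uint64_t b, unsigned *carry)
{
   uint64_t sum = a + b;
   *carry = (sum < a);        /* wrapped -> there was a carry */
   return sum;
}

/* GCC/Clang builtin: the compiler can map this directly onto the CPU's
   add-with-carry flag. */
static uint64_t add_carry_builtin(uint64_t a, uint64_t b, unsigned *carry)
{
   uint64_t sum;
   *carry = (unsigned)__builtin_add_overflow(a, b, &sum);
   return sum;
}
```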
Yes, this is what I meant ;) We could add something like this and it wouldn't be much work. I basically need something like this; I only did the selection manually for now. Maybe such an autoconfigurator would be useful for more people.
If we get MP_HAS we could also replace the T macro in the test suite such that it checks for the corresponding BN_MP_X_C macro, and then we would not clutter anything. However, a set of basic functions would always be needed (mp_cmp etc.).
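One possible shape of that guard (a sketch, not the actual demo/test.c code): every compiled-in source file comes with a BN_*_C define, so a stripped-down build could simply turn the corresponding test into a no-op.

```c
#include <stdlib.h>   /* EXIT_SUCCESS */

/* Sketch: guard a test with the BN_*_C define of the function it exercises,
   so a build without that function still compiles and the test is skipped. */
#ifdef BN_MP_N_ROOT_C
static int test_mp_n_root(void)
{
   /* ... exercise mp_n_root here ... */
   return EXIT_SUCCESS;
}
#else
static int test_mp_n_root(void)
{
   return EXIT_SUCCESS;   /* mp_n_root compiled out, nothing to test */
}
#endif
```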
No, I don't want that. The error checking is good enough now. And I write my code such that no failures occur. It is only nice to have the possibility to do full error checking without hard failure. I don't need it as a debugging instrument. For this I have other better tools.
Yes I think it is acceptable.
I meant the math searches; wasn't there something about abc?
(I just committed a rough sketch for a quite primitive configurator in #301)
You should test edge-case plus one inside and one beyond. It can still return wrong results for every other input, so we test some random input. It can still return wrong results for every other input, so we test some more random input. It can still return wrong results for every other input, so we test some more…
Of course I want to spend some time, I just have none.
Yes, that would be better. Although I don't think that Travis does it without any compensation, we should be parsimonious with other people's money.
I will comb through and add them if I find more.
It is fast enough in almost all cases, if you are a bit careful about what you are doing. I never had the need to go to bare metal for speed in the last 10 or so years, only for saving space, if the compiler doesn't get in your way. I haven't had time yet to analyse the compiler output, but I found the culprits for the difference in timings between -O2 and -O3: the GCC options
Yes, but I think the main reasons are speed and, to be honest, that it is easier to do all that in assembler, where you know for a fact what will happen in the CPU (tries to inconspicuously throw a blanket over SPECTRE et al. ;-) ), without checking every compiler that is able to compile GMP, which adds to correctness.
Ah, now I know what I have overlooked in #301,
So
I know what you mean but the way you worded it sounds a bit…uhm… ;-)
It is not only for you or me, it is for the user who wants to know why something failed. If they go to Stackoverflow it is no problem, but if they come here it would be easier if you could ask them for a number (which they never have) and more easily send them home ;-) But seriously: you sometimes need to do remote debugging where there are not necessarily programmers in the transmission path who know what they are talking about. Or they are pretty sure it is your program that is wrong and not their code. If you don't know what I'm talking about: feel lucky. But I do not need it in LibTomMath itself, I can easily patch it myself if necessary. Especially since you introduced …
[bn_sieve] The functions that would get their API changed are … Should I do it in the …
Close, see #160 for the reason
Added an alternative method, much faster, by several orders of magnitude.
The default method uses the code from #189.
The faster method, available by setting the macro LTM_USE_FASTER_NTH_ROOT (like it is done in mp_div), uses some rounds of bisection to determine the best seed for the following recursive method, which is either Newton's (default) or Halley's method for input with a size over 100000 bits. Although much faster than the default method, the object file is 3-4 times larger than that of the default method, depending on compiler and compiler flags.
Timing the isolated test in demo/test.c for mp_n_root:
with n_root from #189: ~14 seconds
with the faster version: ~0.025 seconds
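For readers who just want the gist, below is a rough sketch of the Newton part only, written against the public LTM API; it is not the PR's code, and the bisection seeding, the Halley variant for very large inputs, and the handling of negative bases are left out.

```c
#include <tommath.h>

/* Sketch: integer n-th root via Newton's iteration
 *    x <- ((n-1)*x + a / x^(n-1)) / n
 * seeded with 2^ceil(bits(a)/n), which is >= floor(a^(1/n)); the sequence
 * then decreases monotonically, so we stop when it stops decreasing. */
static int nth_root_newton(const mp_int *a, mp_digit n, mp_int *root)
{
   mp_int x, x_prev, t;
   int err;

   /* trivial cases: 0 and 1 are their own n-th roots */
   if (mp_cmp_d(a, 1u) != MP_GT) {
      return mp_copy(a, root);
   }

   if ((err = mp_init_multi(&x, &x_prev, &t, NULL)) != MP_OKAY) {
      return err;
   }

   /* seed: 2^ceil(bits/n) is an upper bound for a^(1/n) */
   if ((err = mp_2expt(&x, (mp_count_bits(a) + (int)n - 1) / (int)n)) != MP_OKAY) {
      goto LBL_ERR;
   }

   for (;;) {
      if ((err = mp_copy(&x, &x_prev)) != MP_OKAY)       goto LBL_ERR;
      if ((err = mp_expt_d(&x, n - 1u, &t)) != MP_OKAY)  goto LBL_ERR; /* t = x^(n-1)     */
      if ((err = mp_div(a, &t, &t, NULL)) != MP_OKAY)    goto LBL_ERR; /* t = a / x^(n-1) */
      if ((err = mp_mul_d(&x, n - 1u, &x)) != MP_OKAY)   goto LBL_ERR; /* x = (n-1)*x     */
      if ((err = mp_add(&x, &t, &x)) != MP_OKAY)         goto LBL_ERR; /* x = x + t       */
      if ((err = mp_div_d(&x, n, &x, NULL)) != MP_OKAY)  goto LBL_ERR; /* x = x / n       */
      if (mp_cmp(&x, &x_prev) != MP_LT) {
         break;   /* no longer decreasing: x_prev is floor(a^(1/n)) */
      }
   }
   err = mp_copy(&x_prev, root);

LBL_ERR:
   mp_clear_multi(&x, &x_prev, &t, NULL);
   return err;
}
```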