Alternative algorithm in mp_n_root_ex #202

Closed

Conversation

czurnieden
Contributor

Added an alternative method that is faster by several orders of magnitude.

The default method uses the code from #189.
The faster method, enabled by defining the macro LTM_USE_FASTER_NTH_ROOT (as is done in mp_div), uses a few rounds of bisection to determine a good starting seed for the subsequent iterative method, which is Newton's method by default and Halley's method for inputs larger than 100,000 bits.

Although much faster, the resulting object file is 3-4 times larger than that of the default method, depending on compiler and compiler flags.

Timing the isolated test in demo/test.c for mp_n_root:
with n_root from #189: ~14 seconds
with the faster version: ~0.025 seconds
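
To sketch the idea (my own simplified illustration on a plain 64-bit word, not the mp_int code of this PR, which also switches to Halley's method for very large inputs): bisection brackets the root, and Newton's iteration x <- ((n-1)*x + a/x^(n-1))/n finishes from the upper end of the bracket.

#include <stdint.h>
#include <stdio.h>

/* x^n, saturating to UINT64_MAX on overflow; good enough for comparisons */
static uint64_t ipow_sat(uint64_t x, unsigned n)
{
   uint64_t r = 1u;
   while (n-- > 0u) {
      if ((x != 0u) && (r > UINT64_MAX / x)) return UINT64_MAX;
      r *= x;
   }
   return r;
}

/* floor(a^(1/n)): a few rounds of bisection for a seed, Newton to finish */
static uint64_t nth_root(uint64_t a, unsigned n)
{
   uint64_t lo, hi, x, y, q;
   unsigned i, bits = 0u;

   if (a == 0u) return 0u;
   if ((n <= 1u) || (a == 1u)) return a;
   if (n >= 64u) return 1u;                         /* a < 2^64 <= 2^n */

   for (x = a; x != 0u; x >>= 1) bits++;            /* bit length of a */
   lo = 1u;
   hi = (uint64_t)1u << ((bits + n - 1u) / n);      /* 2^ceil(bits/n) >= root */

   /* bisection: shrink the bracket around the root; the number of rounds is
      arbitrary here, the PR computes how many rounds actually pay off */
   for (i = 0u; (i < 8u) && (lo < hi); i++) {
      uint64_t mid = lo + ((hi - lo) / 2u);
      if (ipow_sat(mid, n) <= a) lo = mid;          /* root is at least mid */
      else                       hi = mid - 1u;     /* root is below mid    */
   }

   /* Newton's iteration on f(x) = x^n - a, started from the upper end of the
      bracket; it decreases monotonically and stops at floor(a^(1/n)) */
   x = hi;
   for (;;) {
      q = a;
      for (i = 0u; i < n - 1u; i++) q /= x;         /* floor(a / x^(n-1)) */
      y = ((uint64_t)(n - 1u) * x + q) / n;
      if (y >= x) break;
      x = y;
   }
   return x;
}

int main(void)
{
   printf("%llu\n", (unsigned long long)nth_root(1000000000000ULL, 3u)); /* 10000 */
   return 0;
}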

@sjaeckel
Member

sjaeckel commented Apr 7, 2019

the most important question here is: which of the algorithms are constant time? that's what matters for dropbear, heimdal and all the other ltc users

@czurnieden
Contributor Author

the most important question here is: which of the algorithms are constant time? that's what matters for dropbear, heimdal and all the other ltc users

(It doesn't get used in LTC, at least I couldn't grep it.)
In the cryptographic sense of "constant time"? None of them, including the original one. All their runtimes depend on the input. The algorithm that is affected least of the three is the largest, most optimized one.
You could change the bisection to always do x steps for every input of size n, but that would be quite slow for large roots.
But we could use the same algorithm as the new one and add restrictions: a fixed number of bisection rounds followed by a fixed number of Newton rounds, instead of a computed number of bisection rounds and an input-dependent number of Newton rounds.
Mmh…

@sjaeckel
Member

sjaeckel commented Apr 7, 2019

Right it was only heimdal who used it

@czurnieden
Contributor Author

Right it was only heimdal who used it

Now you made me curious. I couldn't find it used in their master branch but that doesn't mean anything, of course. Do you know more?

It's not that much work to change it, but if it needs to be cryptographically secure it needs to be correct in every aspect, and checking that is a lot of work.
So, if it is not really needed…
But a short note in the documentation would be nice, I think. Please remind me if I forget it, thanks.

czurnieden force-pushed the faster_n_root branch 2 times, most recently from e8ab5a9 to 5f7aa4b on April 8, 2019 00:02
@sjaeckel
Member

sjaeckel commented Apr 8, 2019

Now you made me curious. I couldn't find it used in their master branch but that doesn't mean anything, of course. Do you know more?

Now you made me curious as well and I just realized that I should've checked some stuff already a long time ago...

@czurnieden
Contributor Author

czurnieden commented May 25, 2019

Rebased and updated because of #294

I took the liberty of posting a question at crypto.stackexchange to find out if such a plain truncating nth-root function is used in any cryptographic algorithm that is used by more than half a dozen people.

I still doubt it. A modular nth-root maybe, but a vanilla one? No, I don't think so.

@minad
Member

minad commented May 25, 2019

@czurnieden Does it make sense to cut out the part which is only useful for very big numbers (Halley)? Maybe we can avoid introducing too much complexity. I would suggest replacing the original algorithm with the new one if it proves worth it.

@czurnieden
Contributor Author

Does it make sense to cut out the part which is only useful for very big numbers (Halley)?

Like I suggested?

Cut out completely or just bracket out?

The cutoff on my machine is somewhere in the million-bit range, and that's at a level where you certainly wouldn't use LTM anymore, so I would go for a complete cut.

@minad
Member

minad commented May 26, 2019

The cutoff on my machine is somewhere in the million-bit range, and that's at a level where you certainly wouldn't use LTM anymore, so I would go for a complete cut.

Yes, complete cut please. Does the same apply to FFT multiplication btw in your other PR?

@minad
Member

minad commented May 26, 2019

that's at a level where you certainly wouldn't use LTM anymore

That's a good criterion :)

The same criterion can be used to argue against crypto sensitive code in ltm. I think ltm is good for crypto exploration but not for production use. At least it shouldn't be used in cases where side channels and timings are relevant.

LTM is used for big integers for example in language runtimes (tcl, perl6, probably others, ...) where small integers are promoted. This is also my main use case.

@czurnieden
Contributor Author

Yes, complete cut please.

Was sure you would say that, so I already did it tonight ;-)

Does the same apply to FFT multiplication btw in your other PR?

The six-million-bit cutoff?
Depends highly on the architecture and the size of MP_xBIT, but they are lower.
But fast multiplication of large numbers is something everyone expects from a bigint library and it gets used, even with LTM.
I use it, for example, and it is not much slower than e.g. GMP, but I use MP_28BIT on a 64-bit arch, which is faster for larger numbers.

You can take a look at bncore.c where I listed some of the cutoffs. The MP_28BIT-on-64-bit mix I use has a cutoff of 78,400 bits for multiplication (with TC4 and TC5) and 126,000 bits for squaring. It is much higher with the default 60-bit mp_digit, but that is mainly caused by the five 12-bit slices. It would be much lower with four 15-bit slices (not yet implemented), but then the upper cutoff would be quite low, too.
BTW: these are safe, "naked" cutoffs, that is, points where FFT is always faster; no smoothing is implemented yet, which would bring the cutoffs down.
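
For readers who want to picture how such cutoffs get consumed: the dispatch inside a multiplication routine would look roughly like the sketch below. The identifiers are modelled on LTM's naming but should be read as illustrative stand-ins, in particular the MP_FFT_* constants and s_mul_fft, which do not exist; the point is only that an FFT branch needs both a lower and an upper bound.

/* hypothetical dispatch on the number of used digits; the actual cutoff
   values would live in bncore.c and depend on MP_xBIT and the architecture */
static mp_err s_mul_dispatch(const mp_int *a, const mp_int *b, mp_int *c)
{
   int size = MP_MAX(a->used, b->used);

   if ((size >= MP_FFT_MUL_CUTOFF) && (size <= MP_FFT_MUL_UPPER_CUTOFF)) {
      return s_mul_fft(a, b, c);      /* only pays off inside this window */
   }
   if (size >= TOOM_MUL_CUTOFF) {
      return s_mp_toom_mul(a, b, c);
   }
   if (size >= KARATSUBA_MUL_CUTOFF) {
      return s_mp_karatsuba_mul(a, b, c);
   }
   return s_mp_mul_digs(a, b, c, a->used + b->used + 1);
}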

The nth-root function, on the other hand: who needs it regularly with such large numbers?

My question at stackexchange is now nearly a day old, so it can be safely assumed that there is no mainstream cryptographic use for a plain nth-root. But I will prepare one that can easily be C&P'd in if that changes (and publish it as a gist or something similar in the meantime).

There is also optional (preprocessor) code in it for single digits. It is only useful with a large MP_xBIT and more expensive to test (it needs an extra round in Travis) because it is a compile-time option.
Cut that, too?

@minad
Member

minad commented May 26, 2019

The six-million-bit cutoff?
Depends highly on the architecture and the size of MP_xBIT, but they are lower.
But fast multiplication of large numbers is something everyone expects from a bigint library and it gets used, even with LTM.
I use it, for example, and it is not much slower than e.g. GMP, but I use MP_28BIT on a 64-bit arch, which is faster for larger numbers.
You can take a look at bncore.c where I listed some of the cutoffs. The MP_28BIT-on-64-bit mix I use has a cutoff of 78,400 bits for multiplication (with TC4 and TC5) and 126,000 bits for squaring. It is much higher with the default 60-bit mp_digit, but that is mainly caused by the five 12-bit slices. It would be much lower with four 15-bit slices (not yet implemented), but then the upper cutoff would be quite low, too.
BTW: these are safe, "naked" cutoffs, that is, points where FFT is always faster; no smoothing is implemented yet, which would bring the cutoffs down.

What I would suggest - we make it a separate function mp_mul_fft and then the user has the choice. Since in this case the cutoffs are so much off. In particular - you can assume that users working with very large numbers don't use MP_8BIT but rather MP_64BIT. Furthermore I think it is also reasonable to assume that people needing the best performance for such numbers for scientific use either have their own code or simply use GMP. I see the advantage of tommath in the freedom and simplicity to embed in other projects since there is no linking restriction.

The nth-root function, on the other hand: who needs it regularly with such large numbers?

Yes. And if a use case comes up we might hear about it from a user.

My question at stackexchange is now nearly a day old, so it can be safely assumed that there is no mainstream cryptographic use for a plain nth-root. But I will prepare one that can easily be C&P'd in if that changes (and publish it as a gist or something similar in the meantime).

Hmm, I would recommend you don't spend time on sth which is probably not useful. But use your own judgement, I have no idea ;)

There is also optional (preprocessor) code in it for single digits. It is only useful with a large MP_xBIT and more expensive to test (it needs an extra round in Travis) because it is a compile-time option.
Cut that, too?

I would probably cut it too. I think single digit optimizations are mostly useful for binary operations, since these are the ones where the big+small case should be optimized.

@czurnieden
Contributor Author

What I would suggest - we make it a separate function mp_mul_fft and then the user has the choice.

All functions in LTM are separate functions?
(FFT itself is independent of anything but needs double, the glue is in separate files)

Since in this case the cutoffs are so much off.

The values for MP_28BIT on a 64-bit arch fit well into the series kara->tc3->tc4->tc5; the soft-128-bit ruins it a bit for MP_60BIT, admitted.

you can assume that users working with very large numbers don't use MP_8BIT but rather MP_64BIT.

Probably MP_28BIT, at least I do it. Works well on 32-bit (MP_16BIT is also an option for 32-bit, but the difference is small) and is faster with larger numbers on 64-bit.

I see the advantage of tommath in the freedom and simplicity to embed in other projects since there is no linking restriction.

The size of the stripped GMP libgmp.a is ~850kb + ~500kb libgmp10 + ~50kb libgmp++4 + libc; libtommath.a with tc4, tc5 and FFT, stripped, is ~260kb and depends on libc, which can be one of the mini-libcs. Actually, we need only (objdump -T .libs/libtommath.so.1.1.0 | grep GLIBC): read, free, memcpy (for GCC -O3), (__cxa_finalize), realloc, malloc, open (w/o LTM_NO_FILE), memset (for GCC -O3), (__errno_location), fgetc (w/o LTM_NO_FILE), fputc (w/o LTM_NO_FILE), close (w/o LTM_NO_FILE), calloc. These can be implemented by hand if necessary, although I would at least use a 3rd-party malloc. libgmp and its dependencies have quite a bit more together.

My own version of LTM with a lot more bells and whistles, some of them don't even belong in it, is just shy of 600kb, stripped. I think we could deliver a bigint library for general use (no frills, no tchotchkes) that can handle the occasional large number (and print it, too) in much less than 500kb stripped and that with a no-headache-license and full backwards compatibility, too.

I'm pretty sure a viable alternative to GMP with such advantages would "sell" well.

Additions needed to reach that goal

  • faster multiplication (tc4, tc5, fft)
  • fast division (Burnikel&Ziegler, Newton)
  • fast fread/fprint
  • primesieve
  • factorizing

That's all. About an additional 150-200k in size (probably less, haven't measured it).
Everything has been written already and is running in my version of LTM for several years now (doesn't mean that they are bugfree, of course, but the teething problems are gone by now).
Work needed: adapt to current style and format and weave them in, ca 3-4 hours.

Optional without tc4, tc5, fft, and Newton division (B&Z is needed for fast read/print) to save about 72kb (measured).

(This should go to the ABI thread, I think)

Hmm, I would recommend you don't spend time on sth which is probably not useful.

I wasted so much time in my youth and still have no children ;-)

I would probably cut it too.

OK.

@minad
Member

minad commented May 26, 2019

All functions in LTM are separate functions? (FFT itself is independent of anything but needs double, the glue is in separate files)

I mean - do not add a branch in mp_mul. s_mp_karatsuba_mul is for example not a separate function in the sense of the public api, this is what I meant.

However since we have compile time configuration people could also decide if they want it to be included. But be aware that the bundled lib in distros for example uses the default compile settings mostly, so good choices should be made here.

I'm pretty sure a viable alternative to GMP with such advantages would "sell" well.

I generally agree with your sentiment. However I don't think the points you are mentioning are the important selling points. What I consider important:

  • Clean, safe and obvious API
  • Correctness and good testing (This is mostly the case I guess, maybe we need more tests)
  • Easy to embed and strip down (this is already possible using all this macro machinery)
  • The basic arithmetic functions must be fast (I don't think they are competitive with gmp yet?)

But concerning adoption not too much should be expected. This library is already pretty mature and I am unsure if there are many new potential users. It would have been interesting to get the library inside huge projects like firefox for example (js bigint support), but I've just read recently that they wrote their own bigint lib (or took the one from V8, whatever).

Concerning GMP - it probably has a better design since it separates between non allocating low level functions and high level functions. In tommath we only have highlevel functions. Maybe if fastmath would be merged in partially, things could be improved. But this is a mega project.

faster multiplication (tc4, tc5, fft)

Here I think the medium sized values are more important for general use, so fft is probably special use.

fast division (Burnikel&Ziegler, Newton)

You are right, this is probably a weak point as of now.

fast fread/fprint

I don't think there is an issue here?

primesieve
factorizing

Nice to have but this is already something for special use.

May I ask what you use the library for? Are you doing some experiments or is it mostly just for fun implementing nice algorithms? Maybe people are also using the library for some number theoretic experiments, but in that case they would probably go with the fastest lib which is around.

@czurnieden
Contributor Author

Watch out! Incoherent ramblings of a grumpy old man below! ;-)

I mean - do not add a branch in mp_mul.

That would make it completely useless. No normal user starts to hack LTM to get FFT in.

However since we have compile time configuration people could also decide if they want it to be included.

And one must not forget to add the branches in bn_mp_mul.c and bn_mp_sqr.c manually, because there ain't any.
And you need branches there because FFT has an upper limit, too, and doing it in any other way would be complicated, error-prone and unwanted.
Smoothing would also get quite difficult.

Clean, safe and obvious API

What does "clean" mean? Do you have an objective criterium?

"safe"? For doing what? Or: against what? Or: for whom?

And what "obvious" is depends highly on the point of view, sadly. I'm a member over at the Stackoverflow/Stackexchange community and I've seen things… *sigh*

Correctness and good testing

"Correctness" is something that can be proven (to some extent). Some of the algorithms used in LTM are, some are not. Do you plan to fill in the blanks? (Restrictions of decidability holds, of course)

"good testing"? What does "good" mean for you in this context? What must be tested and what can be dropped? We cannot test everything, that is for sure. We can only test all edge-cases and the first "all" in this sentence is quite bold even.

Some of the current tests just test some random values and hope for the best (and I don't exclude myself here), and there is no properly documented analysis of the edge-cases of others.

There should be no need to try several hundred thousand tests against another bigint implementation assuming that the other bigint implementation is correct and that several hundred thousand tests will give any insight into the correctness of the innards of LTM. We are only wasting Travis' computing time with this spray&pray.

(This is mostly the case I guess, maybe we need more tests)

We don't need "more tests", we just need the amount of tests that will cover all edge-cases plus and minus one. Not more and definitely not less.

That is a lot of work including documentation—and I know how much you, no, we all hate documentation if we have to write it ourselves ;-) —but we won't come to an end otherwise.

But there is also a bit of pragmatism allowed, e.g.: using Valgrind to test the use of malloc and friends.
And some functions do not have any point that is edgy in any kind or form or that is easily reachable.
For example the Toom-Cook functions. They have a minimum size for the input that does not get tested, not even in the code. Will TCsquare-3-way fail with an input of 2 mp_digits? Depending on where the type in the algorithm is, it can take several tries to make it fail if the input is small. Did you find the typo in the last sentence? ;-)

Easy to embed and strip down (this is already possible using all this macro machinery)

Yes, it is possible. But easy? Really?

But that is a well-known problem and many have broken their teeth on it, including me. So I can tell you that autoconf and friends are hated by all and cmake is not flexible enough (might have changed in the meantime, I haven't followed its development in the last years).
It is either "ignoring" (current status) or "roll our own" which is a lot of work, even if you don't aim at making it shiny at all, so nobody wants to do it, especially if the gain for the author(s) is so minimal.

BTW: can a stripped down LTM still be tested with ./test?

The basic arithmetic functions must be fast (I don't think they are competitive with gmp yet?)

What do you mean by "fast"? It is fast enough for me with the exceptions I listed and wrote code for to make LTM fit for my purposes.

"competitive"? Why should we compete with GMP on speed?
Does GMP run on a 16-bit machine?

This library is already pretty mature

Then why are you refurbishing it? ;-)

and I am unsure if there are many new potential users.

You'll never know until you try.

Concerning GMP - it probably has a better design since it separates between non allocating low level functions and high level functions. In tommath we only have highlevel functions.

It is a different design, yes, but it is better? I don't know.

One of the (many) reasons LTM is designed that way is readability because it was also meant to have some sort of pedagogical utility. I don't know what came first, the book or LTM but they belong together.

Here I think the medium sized values are more important for general use

What are "medium sized values"? General use is exactly that: general, so nothing is more important than others or you are back to some kind of specialization.

I don't think [fread/fwrite] there is an issue here?

(probably a misunderstanding, I mean number conversion)
Once you come to larger numbers and have many of them it gets significant. LTM's number conversion is abysmally slow, and doubly so because, well, mp_int2otherbases happens twice.

Nice to have but this is already something for special use.

So, no primesieve?

Factorizing is a by-product once you have a fast sieve, and adding a second-level algorithm like e.g. Pollard-Rho makes it fast enough for numbers with a second-largest factor of up to about 80 bits. Sounds not much but is quite useful and adds just 3k (stripped) to the lib.
And a number theoretical lib without factorization?
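
Just to make concrete what such a second-level algorithm on top of a sieve and trial division looks like, here is a rough Pollard-Rho step written against the public LTM API (a sketch only: fixed polynomial x^2 + 1, a hard iteration budget, no retry with a different constant; the function name rho_factor and its interface are mine, not from the bn_sieve branch):

#include "tommath.h"

/* try to find a non-trivial factor of n with Pollard's rho, f(x) = x^2 + 1;
   returns MP_OKAY and the factor in 'f' on success, MP_VAL if the budget
   ran out without finding one */
static mp_err rho_factor(const mp_int *n, mp_int *f, int max_iter)
{
   mp_int x, y, d;
   mp_err err;
   int i;

   if ((err = mp_init_multi(&x, &y, &d, NULL)) != MP_OKAY) {
      return err;
   }
   mp_set(&x, 2u);
   mp_set(&y, 2u);
   mp_set(&d, 1u);

   for (i = 0; (i < max_iter) && (mp_cmp_d(&d, 1u) == MP_EQ); i++) {
      /* tortoise: x = x^2 + 1 mod n */
      if ((err = mp_sqrmod(&x, n, &x)) != MP_OKAY)   goto LBL_ERR;
      if ((err = mp_add_d(&x, 1u, &x)) != MP_OKAY)   goto LBL_ERR;
      /* hare: two steps of the same map */
      if ((err = mp_sqrmod(&y, n, &y)) != MP_OKAY)   goto LBL_ERR;
      if ((err = mp_add_d(&y, 1u, &y)) != MP_OKAY)   goto LBL_ERR;
      if ((err = mp_sqrmod(&y, n, &y)) != MP_OKAY)   goto LBL_ERR;
      if ((err = mp_add_d(&y, 1u, &y)) != MP_OKAY)   goto LBL_ERR;
      /* d = gcd(|x - y|, n) */
      if ((err = mp_sub(&x, &y, &d)) != MP_OKAY)     goto LBL_ERR;
      if ((err = mp_abs(&d, &d)) != MP_OKAY)         goto LBL_ERR;
      if ((err = mp_gcd(&d, n, &d)) != MP_OKAY)      goto LBL_ERR;
   }

   if ((mp_cmp_d(&d, 1u) == MP_EQ) || (mp_cmp(&d, n) == MP_EQ)) {
      err = MP_VAL;                 /* no factor found within the budget */
   } else {
      err = mp_copy(&d, f);
   }
LBL_ERR:
   mp_clear_multi(&x, &y, &d, NULL);
   return err;
}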

May I ask what you use the library for?

Several things. It was a blueprint for a Javascript bigint lib; it is one of the voters, together with GMP and Landon C. Noll's calc, to decide if a result may be wrong; the base for rational fixed-point math (it was a bit faster than Hauser's softfloat on that special FPU-less hardware and not many special functions were needed in the first place, so it was quickly written); a base for simple cryptography (here the license was the main reason); the bigint base for my own little language; a CPU thermal test: computing and printing and rereading 10^6! in a loop on every (isolated) core to get a pretty picture with the Flir; and whatever I have forgotten, too ;-).
Oh, nearly missed: some of the things above generated income.

Maybe people are also using the library for some number theoretic experiments, but in that case they would probably go with the fastest lib which is around.

Most likely not. Experimental math aims for correctness, not the highest speed. And in this day and age you just throw more metal at it—it's cheap now and won't strain the grant too much.

czurnieden force-pushed the faster_n_root branch 2 times, most recently from 66024f8 to 09bcb79 on May 27, 2019 16:53
@minad
Member

minad commented May 27, 2019

That would make it completely useless. No normal user starts to hack LTM to get FFT in.

I mean you expose both mp_mul and mp_mul_fft.

What does "clean" mean? Do you have an objective criterium?
"safe"? For doing what? Or: against what? Or: for whom?

Sure, these are subjective criteria. But for safety there are well-defined criteria, e.g. type safety, which prevents the user from misusing functions. This helps correctness. Unfortunately C is not very good at that, but we can do better, as for example in #258. Clean is maybe not a good word, think coherent etc. For example having separate functions for doing different things. ioctl is a very bad example, and mp_expt_ex with this fast parameter is bad. Furthermore I added these two's complement functions (mp_tc_and) while there still was mp_and. This is not good. What we have now is better. But I cannot define these things, I can only give you examples.

And what "obvious" is depends highly on the point of view, sadly. I'm a member over at the Stackoverflow/Stackexchange community and I've seen things… sigh

Obvious in the sense of types. I think mp_err is much more helpful than int. And it is better for static analysis.
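
A toy illustration of that point (my own throwaway declarations, not the actual libtommath definitions):

/* an enumerated result type documents the possible outcomes at the type
   level, and tools can warn when such a return value is ignored or misused */
typedef enum { MY_OKAY = 0, MY_ERR_MEM = -2, MY_ERR_VAL = -3 } my_err;

my_err my_root(const mp_int *a, int n, mp_int *c);     /* outcome is explicit  */
int    my_root_old(const mp_int *a, int n, mp_int *c); /* could mean anything  */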

"Correctness" is something that can be proven (to some extent). Some of the algorithms used in LTM are, some are not. Do you plan to fill in the blanks? (Restrictions of decidability holds, of course)

I would love to have things proven. There are different levels of verification. You are probably talking about proving that certain algorithms work theoretically, but there is also the possibility to prove the code top to bottom. But then program extraction would be the better solution instead of hand written code. Basically what you did by hand from pari/gp was program extraction.

But I don't mean those things. This is too much effort and completely out of scope. We should aim for correctness by having good test coverage. Randomised tests are good for that but surely not perfect.

"good testing"? What does "good" mean for you in this context? What must be tested and what can be dropped? We cannot test everything, that is for sure. We can only test all edge-cases and the first "all" in this sentence is quite bold even.

Edge cases + randomised tests are pretty good imho. I would also like to have tested the special code paths separately, as you added in #280.

There should be no need to try several hundred thousand tests against another bigint implementation assuming that the other bigint implementation is correct and that several hundred thousand tests will give any insight into the correctness of the innards of LTM. We are only wasting Travis' computing time with this spray&pray.

I disagree.

We don't need "more tests", we just need the amount of tests that will cover all edge-cases plus and minus one. Not more and definitely not less.

Edge cases are important, we agree. But randomised tests are actually good if you roughly verify the structure of the algorithm at the same time.

But there is also a bit of pragmatism allowed, e.g.: using Valgrind to test the use of malloc and friends.

Valgrind and sanitizers are very important since they also test for out-of-bounds access etc.

For example the Toom-Cook functions. They have a minimum size for the input that does not get tested, not even in the code. Will TCsquare-3-way fail with an input of 2 mp_digits? Depending on where the type in the algorithm is, it can take several tries to make it fail if the input is small. Did you find the typo in the last sentence? ;-)

Why is it not possible to test TC separately for a wide range of numbers?

Yes, it is possible. But easy? Really?

Hehe, I find it so, and I am actually using this for embedding. It is a quite manual process, however; I mostly have to hand-select the functions I want and use tommath's dependency graph.

It is either "ignoring" (current status) or "roll our own" which is a lot of work, even if you don't aim at making it shiny at all, so nobody wants to do it, especially if the gain for the author(s) is so minimal.

You mean configuring a stripped down version? I am rolling my own, and according to my experience it is quite easy.

BTW: can a stripped down LTM still be tested with ./test?

Potentially yes, if we added the ifdefs or MP_HAS from #262 to the test suite. In my case I have a separate test suite, however, which tests my stuff and the selected ltm functions on a higher level.

What do you mean by "fast"? It is fast enough for me with the exceptions I listed and wrote code for to make LTM fit for my purposes.

Could be, I have no idea. But in GMP there are some asm routines, we are not beating those. fastmath was made specifically since ltm is not fast.

"competitive"? Why should we compete with GMP on speed?

Maybe not. Probably not. I just named criteria that people commonly use. I am using ltm despite being probably slower.

Does GMP run on a 16-bit machine?

I don't know, but this is certainly a good selling point for ltm. Another one I've read is that ltm does fine grained error checking. For example GMP exits in case of malloc failure.

Then why are you refurbishing it? ;-)

Because I needed additional functionality and then I realised that I am not perfectly happy with the current status. But things got better, I am much happier now ;)

One of the (many) reasons LTM is designed that way is readability because it was also meant to have some sort of pedagogical utility. I don't know what came first, the book or LTM but they belong together.

100% agree. The pedagogical utility is great. But users who just use the library don't care about that I guess. So there are different groups of users who want and need different things.

What are "medium sized values"? General use is exactly that: general, so nothing is more important than others or you are back to some kind of specialization.

This is fuzzy. I am not writing precisely here, but why should I? Our discussion is also quite loose about various things :)

I don't think [fread/fwrite] there is an issue here?
(probably a misunderstanding, I mean number conversion)
Once you come to larger numbers and have many of them it gets significant. LTM's number conversion is abysmally slow, and doubly so because, well, mp_int2otherbases happens twice.

Ah ok. You are right.

Nice to have but this is already something for special use.
So, no primesieve?
Factorizing is a by-product once you have a fast sieve, and adding a second-level algorithm like e.g. Pollard-Rho makes it fast enough for numbers with a second-largest factor of up to about 80 bits. Sounds not much but is quite useful and adds just 3k (stripped) to the lib.
And a number theoretical lib without factorization?

I mentioned factorizing and primesieve together for that reason. I think they should be there, but I am not sure how many users need it. Has someone asked for this? But if you have a use case for a project of yours, it is already a perfectly valid reason to include it here. And maybe users will come, so I am not against this.

Several things. It was a blueprint for a Javascript bigint lib; it is one of the voters, together with GMP and Landon C. Noll's calc, to decide if a result may be wrong; the base for rational fixed-point math (it was a bit faster than Hauser's softfloat on that special FPU-less hardware and not many special functions were needed in the first place, so it was quickly written); a base for simple cryptography (here the license was the main reason); the bigint base for my own little language; a CPU thermal test: computing and printing and rereading 10^6! in a loop on every (isolated) core to get a pretty picture with the Flir; and whatever I have forgotten, too ;-).
Oh, nearly missed: some of the things above generated income.

Cool many things :) So it is basically your go-to big number library? Then I guess you already have a good picture what you want inside the lib. I am mostly using it for language bigints. Maybe it will also get inside GHC.

Most likely not. Experimental math aims for correctness, not the highest speed. And in this day and age you just throw more metal at it—it's cheap now and won't strain the grant too much.

It highly depends on the experiment you are doing. If the computation is slow and scales with the size of your experiment, then you probably go with a fast lib. For example if you want to scan large ranges of numbers. E.g. like these seti@home style experiments.

@czurnieden
Contributor Author

More ~~rancid rants~~ ~~incoherent ramblings~~ brainstorming:

I mean you expose both mp_mul and mp_mul_fft.

The smoothing needs some, maybe even all, of the Toom-Cook functions, so it might get a bit more complicated, with duplicated code and all. But I haven't even started yet; my original code for smoothing is quite a tangled mess and needs a good thorough clean-up first (I also have a hook in the balancing, which shouldn't be there).

For a bit more information about the concept see e.g. Brent et al., "Faster Multiplication in GF(2)[x]". Yes, it is a simple concept, but the devil is, as always, in the details.

Obvious in the sense of types. I think mp_err is much more helpful than int

Ah, I see. Good.

I would love to have things proven. There are different levels of verification. You are probably talking about proving that certain algorithms work theoretically, but there is also the possibility to prove the code top to bottom.

Mainly the first, but there are no new algorithms in LTM[1] so there is literature to point to.

Proving standard C is not possible; it would need a subset of C.
For more see e.g. Eschertech's C/C++ Verification Blog (they are using MISRA-C if I'm not mistaken), or a less proprietary approach with CompCert (Inria).

I think you have something like Cadiz in mind? (ISO/IEC 13568 and its corrigenda) Or the LaTeX Z-tool. Not to forget good old Spivey.
Yes, only Z-notation here, but there are others, of course.

But you are right, of course, that would be way out of scope.
But if we did it we would get really famous and might even get mentioned at ./ ! ;-)

Randomised tests are good for that but surely not perfect.

What do these random tests find that the "edge-cases plus one and minus one" (meaning one test for the edge case that must succeed, one beyond it that must fail, and one inside that also must succeed) do not? Yes, there might be one typo in the software causing it to fail (that is, giving wrong output) for one or more singular inputs. With the very large domains of an arbitrary-precision function (infinite in theory, quite finite in praxi) the chance to find that input with a handful of random tests is very low.

It silences the conscience, yes, but it is a deceitful silence in my opinion.

If the domain is small enough it should be tested completely. I did that for my version of mp_radix_size where I tested all possible input for 8, 16, and 32 bit. Problem: you can't do that in Travis, at least not regularly.
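
For illustration, exhausting such a small domain is only a handful of lines; everything here (radix_size16 and the snprintf reference) is a stand-in for the idea, not the actual test code:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* stand-in for the function under test: decimal digits of a 16-bit value */
static int radix_size16(uint16_t v)
{
   int d = 1;
   while (v >= 10u) { v /= 10u; d++; }
   return d;
}

int main(void)
{
   /* the domain is tiny, so check every single input against an
      independent reference instead of sampling it randomly */
   for (uint32_t v = 0u; v <= UINT16_MAX; v++) {
      char buf[16];
      int expect = snprintf(buf, sizeof(buf), "%u", (unsigned)v);
      assert(radix_size16((uint16_t)v) == expect);
   }
   puts("all 65536 inputs checked");
   return 0;
}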

We are only wasting Travis' computing time with this spray&pray.

I disagree.

Why? I'm not against testing against a different bigint implementation, far from it, I even use it myself, although for single results only and against two others, not just one. But we can't do that in Travis, because Noll's calc is not in Ubuntu. No, wait, I take that back, it is, the package is called apcalc-common! Both pari/gp and calc take input from stdin, so it might be an idea.

I'm only against the spray&pray method. We need to sit down and evaluate which tests are actually needed and if you want to do some random tests on top, ok, but just running 333,333 tests (A special reason for this number?) is wasteful.

Why is it not possible to test TC separately for a wide range of numbers?

It is, of course, it was just an example for untested edge-cases where it is debatable if they really need to be tested because they don't get used in the first place but it would be possible to press them to accept that input. An edge-case so to say ;-)

But in GMP there are some asm routines, we are not beating those.

They have the advantage of much less architectures they need to run on.
Which is what I wanted to say, but I'm wrong:

ls -d /home/czurnieden/GMP/gmp-6.1.2/mpn/*/
/home/czurnieden/GMP/gmp-6.1.2/mpn/alpha/
/home/czurnieden/GMP/gmp-6.1.2/mpn/arm/
/home/czurnieden/GMP/gmp-6.1.2/mpn/arm64/
/home/czurnieden/GMP/gmp-6.1.2/mpn/cray/
/home/czurnieden/GMP/gmp-6.1.2/mpn/generic/
/home/czurnieden/GMP/gmp-6.1.2/mpn/ia64/
/home/czurnieden/GMP/gmp-6.1.2/mpn/lisp/
/home/czurnieden/GMP/gmp-6.1.2/mpn/m68k/
/home/czurnieden/GMP/gmp-6.1.2/mpn/m88k/
/home/czurnieden/GMP/gmp-6.1.2/mpn/minithres/
/home/czurnieden/GMP/gmp-6.1.2/mpn/mips32/
/home/czurnieden/GMP/gmp-6.1.2/mpn/mips64/
/home/czurnieden/GMP/gmp-6.1.2/mpn/pa32/
/home/czurnieden/GMP/gmp-6.1.2/mpn/pa64/
/home/czurnieden/GMP/gmp-6.1.2/mpn/power/
/home/czurnieden/GMP/gmp-6.1.2/mpn/powerpc32/
/home/czurnieden/GMP/gmp-6.1.2/mpn/powerpc64/
/home/czurnieden/GMP/gmp-6.1.2/mpn/s390_32/
/home/czurnieden/GMP/gmp-6.1.2/mpn/s390_64/
/home/czurnieden/GMP/gmp-6.1.2/mpn/sh/
/home/czurnieden/GMP/gmp-6.1.2/mpn/sparc32/
/home/czurnieden/GMP/gmp-6.1.2/mpn/sparc64/
/home/czurnieden/GMP/gmp-6.1.2/mpn/thumb/
/home/czurnieden/GMP/gmp-6.1.2/mpn/vax/
/home/czurnieden/GMP/gmp-6.1.2/mpn/x86/
/home/czurnieden/GMP/gmp-6.1.2/mpn/x86_64/

That's quite a lot.
But no 16-bit! ;-)
There's not much in it for the more obscure architectures, sometimes only a list of architecture-specific parameters. x86_64 has the most assembler code.

But if you look at their Toom-Cook thresholds… no, these are the bare numbers; they lack a multiplier, and that multiplier is the x in Toom-Cook x-way. So their TC3-mul threshold for x86_64 is 81*3 = 243, which is what we have (they use the full 64 bits of the limbs and really go to town with their optimizing).
So we are not that slow, are we? ;-)

Plain C is probably the language with the highest number of supported architectures and that is also a large advantage. For the price of some higher runtime, admitted.

Hehe, I find it so, and I am actually using this for embedding.

It gets easier if you do it regularly but it is quite overwhelming for a first time user.

You mean configuring a stripped down version?

No, make a configurer where the user can tick off some boxes for what they need and the script automatically calculates all dependencies and generates all files, such that the only thing the user has to do at the end is to type make && make install.

It doesn't have to be some shiny QT GUI, a simple ncurses frontend is more than sufficient, just make menuconfig and tick boxes.

A large part of the logic is already implemented: we can (ab)use the tommath_class.h generator, we can generate the correct makefiles, and if test.c gets its ifdefs as you suggested we can also use it as a test for the reduced libtommath.

BTW: can a stripped down LTM still be tested with ./test?

Potentially yes, if we added the ifdefs or MP_HAS from #262 to the test suite.

That was one of the more serious questions because we have no control over what gets stripped by the users. test.c would get quite cluttered with preprocessor brackets, but I think it should be done at some time in the near future, for example when the fate of #262 is settled.

Another one I've read is that ltm does fine grained error checking

It would be nice if the error got an address, that is, which function deep down failed first. The input is not always reproducible, so a simple bt in gdb will not always give you the name of the culprit.

Now that you have introduced mp_err we could do that more easily. An int has at least 16 bits (if standard compliant); a single nibble is sufficient for the number of different errors needed, which leaves at least 2^12 = 4096 values for numbering the functions, and we have 122 functions (including static ones). And it can be automated. The biggest drawback would be the size of the table holding the translation from the hashes to the real names: over 3,300 characters for the names alone (including the mp_ prefix, but dropping that would only save 366 characters). Or go the MS/IBM way and just return a number, and the user has to ask Google what it means.
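
A sketch of the encoding idea (entirely hypothetical, nothing like this exists in LTM): keep the error class in the low nibble and put an automatically generated function id into the remaining bits.

/* hypothetical layout: bits 0..3 = error class, bits 4..15 = function id */
#define ERR_MAKE(cls, fid)  ((int)(((fid) << 4) | ((cls) & 0xF)))
#define ERR_CLASS(e)        ((e) & 0xF)
#define ERR_FUNC(e)         (((e) >> 4) & 0xFFF)

enum { E_OKAY = 0, E_MEM = 1, E_VAL = 2 };      /* error classes (illustrative) */
enum { FID_MP_DIV = 17, FID_MP_INVMOD = 42 };   /* ids would be generated       */

/* deep inside a function: return ERR_MAKE(E_MEM, FID_MP_DIV);
   and the (generated) translation table for the bug report: */
static const char *err_func_name(int e)
{
   switch (ERR_FUNC(e)) {
      case FID_MP_DIV:    return "mp_div";
      case FID_MP_INVMOD: return "mp_invmod";
      default:            return "?";
   }
}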

So there are different groups of users who want and need different things.

The source of featuritis, it's easy to fall for it, I have some experience to that regard ;-)

I mentioned factorizing and primesieve together for that reason. I think they should be there, but I am not sure how many users need it.

At least one: me ;-)

Has someone asked for this?

Not that I know, no.

[sieve]

The sieve adds about 3.8k (stripped) to the lib, 7k with Pollard-Rho minus the 2.7k (stripped) for the prime-table that isn't needed anymore. I hoped that it would speed up prime generation but it was neutral at best, slightly slower at worst.
A fast prime-sieve is highly useful for a lot of things and if I look at the numbers it doesn't look that bad. If we ignore Pollard-Rho (which also needs some glue and some trial-division to get factorization) we could get a prime-sieve for the cost of about 1k (stripped) extra size, a small increase in runtime for prime generation but also an API change (that list got smaller now, after all of the deprecations, need to take a closer look if there are any left at all)
I think it is acceptable, what is your opinion?
(I haven't rebased bn_sieve for some time, uh, oh.)

So it is basically your go-to big number library?

No, I use all three. They all have different pros and cons, I decide per use-case.
Or mood of the day ;-)

E.g. like these seti@home style experiments

Wouldn't call these experimental math ;-)
That seti@home was once one of my pet-peeves, because there were so many useful distributed projects (https://en.wikipedia.org/wiki/List_of_distributed_computing_projects [2]) that looking for aliens was quite a waste of computing power.
And then came bitcoin and friends. *sigh*
(Yes, I know, it's mostly ASICs these days, but still)

So, enough of my rantings for tonight ;-)

[1] My nth-root one is new, in theory, but it is just a port of a known technique for floats to integers. There might be a paper in it, but that would need a complete number-theoretical analysis. *phew*

[2] seti@home started in 1999, folding@home in 2000

@minad
Member

minad commented May 28, 2019

Randomised tests are good for that but surely not perfect.
It silences the conscience, yes, but it is a deceitful silence in my opinion.

I think it gives you a good assurance of correctness. It is called QuickCheck testing. I am pretty sure there are papers which measure the effectiveness of QuickCheck testing in the presence of a rough verification of the structure of the algorithm. This will capture more cases than for example only testing edge cases, where we could just return 0 on all other inputs ;)

Not a measurement, but a real-world experiment, where they performed a formal verification via Haskell->Coq translation: https://arxiv.org/abs/1803.06960. However, they found zero bugs, since the lib had already been QuickChecked.

If the domain is small enough it should be tested completely. I did that for my version of mp_radix_size where I tested all possible input for 8, 16, and 32 bit. Problem: you can't do that in Travis, at least not regularly.

I fully agree. For small inputs, exhaustive. For large inputs randomised. There are also fuzzers which take the randomised approach btw and find weird code paths using a guided randomised method. This way they can somehow make the huge search tractable.
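
For what it's worth, a randomised round-trip property of the kind we are discussing is only a few lines against the public API (a sketch; the iteration count and digit sizes are arbitrary):

#include "tommath.h"
#include <stdlib.h>
#include <stdio.h>

/* property: for random a and b != 0 it must hold that (a*b)/b == a, remainder 0 */
int main(void)
{
   mp_int a, b, c, q, r;
   int i;

   if (mp_init_multi(&a, &b, &c, &q, &r, NULL) != MP_OKAY) return EXIT_FAILURE;

   for (i = 0; i < 1000; i++) {
      if (mp_rand(&a, 1 + (i % 64)) != MP_OKAY) goto fail;
      if (mp_rand(&b, 1 + (i % 32)) != MP_OKAY) goto fail;
      if (mp_iszero(&b)) continue;
      if (mp_mul(&a, &b, &c) != MP_OKAY) goto fail;
      if (mp_div(&c, &b, &q, &r) != MP_OKAY) goto fail;
      if ((mp_cmp(&q, &a) != MP_EQ) || (!mp_iszero(&r))) goto fail;
   }
   puts("ok");
   mp_clear_multi(&a, &b, &c, &q, &r, NULL);
   return EXIT_SUCCESS;
fail:
   mp_clear_multi(&a, &b, &c, &q, &r, NULL);
   return EXIT_FAILURE;
}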

Why? I'm not against testing against a different bigint implementation, far from it, I even use it myself although for single results only and against two others, not just one. But we can't do that in Travis, because Noll's calc is not in Ubuntu. No, wait, I take that back, it is, the packet is called apcalc-common! Both pari/gp and calc take input from stdin, so it might be an idea.
I'm only against the spray&pray method. We need to sit down and evaluate which tests are actually needed and if you want to do some random tests on top, ok, but just running 333,333 tests (A special reason for this number?) is wasteful.

Yes, but we also take what we have. If you want to spend time on it, feel free :)

We could however consider restricting the test-vs-mtest travis jobs to the develop branch as I did for valgrind? Would you agree with that?

It is, of course, it was just an example for untested edge-cases where it is debatable if they really need to be tested because they don't get used in the first place but it would be possible to press them to accept that input. An edge-case so to say ;-)

Yes, I would like such edgy cases to be tested even if they are not used :)

They have the advantage of much less architectures they need to run on.
Which is what I wanted to say, but I'm wrong:

:D

Plain C is probably the language with the highest number of supported architectures and that is also a large advantage. For the price of some higher runtime, admitted.

I think only using C, one can go quite fast. I think you can achieve factor 2x or factor 1.5x with plain C if the code is written properly. It could however be that there are routines where SIMD is used to great effect. In that case one would have to write the C code in a specific way and maybe the autovectorizer would not get it.

The reason why I think they are using asm is also that they want to use the full limbs and have to catch overflow. And __builtin_add_overflow is not standard C. Could that be? Hmm, well you can also catch overflow in standard C but I think it won't be fast enough.
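
For the record, carry detection on full-width words is possible in portable C, it just costs an extra compare per word (a generic sketch, not GMP's or LTM's code; whether the compiler turns it into an add-with-carry instruction is another question):

#include <stdint.h>

/* add two full 64-bit words plus an incoming carry in portable C;
   the comparison against one operand detects the wrap-around */
static uint64_t add_with_carry(uint64_t a, uint64_t b, unsigned *carry)
{
   uint64_t s = a + b + (uint64_t)*carry;
   /* with an incoming carry the sum can wrap around to exactly 'a'
      (b == UINT64_MAX), hence the <= in that case */
   *carry = (*carry != 0u) ? (s <= a) : (s < a);
   return s;
}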

You mean configuring a stripped down version?
No, make a configurer where the user can tick off some boxes for what they need and the script automatically calculates all dependencies and generates all files, such that the only thing the user has to do at the end is to type make && make install.

Yes, this is what I meant ;) We could add sth like this and it wouldn't be much work. I basically need something like this - I only did the selection manually for now. Maybe such an autoconfigurer would be useful for more people.

That was one of the more serious questions because we have no control over what gets stripped by the users. test.c would get quite cluttered with preprocessor brackets, but I think it should be done at some time in the near future, for example when the fate of #262 is settled.

If we get MP_HAS we could also replace the T macro in the testsuite such that it checks for the corresponding BN_MP_X_C macro and then we would not clutter anything. However a set of basic functions would be needed always (mp_cmp etc).
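
A sketch of how that could look, assuming MP_HAS(X) from #262 expands to a compile-time 0/1 depending on whether BN_X_C is defined (the table layout is invented for illustration, and the test bodies themselves would still need guarding; this is not the current demo/test.c):

/* hypothetical test table: entries carry the MP_HAS() result and the runner
   simply skips tests whose function is not compiled in, instead of wrapping
   every entry in #ifdef/#endif brackets */
#define T(n, x)  { #n, test_##n, MP_HAS(x) }

struct unit_test { const char *name; int (*run)(void); int available; };

static const struct unit_test tests[] = {
   T(mp_mul,    MP_MUL),
   T(mp_sqrt,   MP_SQRT),
   T(mp_n_root, MP_N_ROOT),
};

/* runner: execute tests[i].run() only if tests[i].available is non-zero */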

Another one I've read is that ltm does fine grained error checking
It would be nice if the error got an address, that is, which function deep down failed first. The input is not always reproducible, so a simple bt in gdb will not always give you the name of the culprit.

No, I don't want that. The error checking is good enough now. And I write my code such that no failures occur. It is only nice to have the possibility to do full error checking without hard failure. I don't need it as a debugging instrument. For this I have other better tools.

The sieve adds about 3.8k (stripped) to the lib, 7k with Pollard-Rho minus the 2.7k (stripped) for the prime-table that isn't needed anymore. I hoped that it would speed up prime generation but it was neutral at best, slightly slower at worst.
A fast prime-sieve is highly useful for a lot of things and if I look at the numbers it doesn't look that bad. If we ignore Pollard-Rho (which also needs some glue and some trial-division to get factorization) we could get a prime-sieve for the cost of about 1k (stripped) extra size, a small increase in runtime for prime generation but also an API change (that list got smaller now, after all of the deprecations, need to take a closer look if there are any left at all)
I think it is acceptable, what is your opinion?
(I haven't rebased bn_sieve for some time, uh, oh.)

Yes I think it is acceptable.

Wouldn't call these experimental math ;-)
That seti@home was once one of my pet-peeves, because there were so many useful distributed projects (https://en.wikipedia.org/wiki/List_of_distributed_computing_projects [2]) that looking for aliens was quite a waste of computing power.
And then came bitcoin and friends. sigh
(Yes, I know, it's mostly ASICs these days, but still)

I meant the math searches, wasn't there sth about abc?

@czurnieden
Contributor Author

(I just committed a rough sketch for a quite primitive configurator in #301)

This will capture more cases than for example only testing edge cases, where we could just return 0 on all other inputs

You should test the edge-case plus one inside and one beyond. It can still return wrong results for every other input, so we test some random input. It can still return wrong results for every other input, so we test some more random input. It can still return wrong results for every other input, so we test some more…
Yes, I know, hyperbole, but the question remains: where to stop? What does the literature say? (You are most likely more up to date in this regard than me.)

If you want to spend time on it, feel free

Of course I want to spend some time, I just have none.
But that will change.
I hope ;-)

We could however consider restricting the test-vs-mtest travis jobs to the develop branch as I did for valgrind? Would you agree with that?

Yes, that would be better.

Although I don't think that Travis does it without any compensation we should be parsimonious with other people's money.

Yes, I would like such edgy cases to be tested even if they are not used

I will comb through and add them if I find more.

I think only using C, one can go quite fast.

It is fast enough in almost all cases, if you are a bit careful about what you are doing. I have never had the need to go to the bare metal for speed in the last 10 or so years, only for saving space.

That is, if the compiler doesn't get in your way. I haven't had time yet to analyse the compiler output, but I found the culprits for the difference in timings between -O2 and -O3: the GCC options -ftree-loop-distribute-patterns, -ftree-loop-vectorize, and for GCC 9.1.0 -fversion-loops-for-strides.

The reason why I think they are using asm is also that they want to use the full limbs and have to catch overflow. And __builtin_add_overflow is not standard C. Could that be?

Yes, but I think the main reasons are speed and, to be honest, that it is easier to do all that in assembler, where you know for a fact what will happen in the CPU (tries to inconspicuously throw a blanket over SPECTRE et al. ;-) ), without checking every compiler that is able to compile GMP, which adds to correctness.
BTW: it is possible to ask them ;-)

If we get MP_HAS we could also replace the T macro in the testsuite such that it checks for the corresponding BN_MP_X_C macro and then we would not clutter anything.

Ah, now I know what I have overlooked in #301: tommath_class.h needs an update, too.

However a set of basic functions would be needed always (mp_cmp etc).

So make test_standalone et al. needs a bit more work.
Mmh…

And I write my code such that no failures occur.

I know what you mean but the way you worded it sounds a bit…uhm… ;-)

I don't need it as a debugging instrument. For this I have other better tools.

It is not only for you or me, it is for the user who wants to know why something failed. If they go to Stackoverflow it is no problem, but if they come here it would be easier if you could ask them for a number, which they never have, so you can more easily send them home ;-)

But seriously: you sometimes need to do remote debugging where there are not necessarily programmers in the transmission path who know what they are talking about. Or they are pretty sure it is your program that is wrong and not their code. If you don't know what I'm talking about: feel lucky.

But I do not need it in LibTomMath itself, I can easily patch it in myself if necessary. Especially since you introduced mp_err et al., which makes such things very easy.

[bn_sieve]

The functions that would get their API changed are bn_mp_prime_is_prime.c and bn_s_mp_prime_is_divisible.c, where the latter is already deprecated. The change is an additional argument for the pointer to the sieve.
Additionally, bn_mp_prime_next_prime.c would need to work with the sieve instead of the table, and I haven't benchmarked that yet.
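
To make the API change concrete, the prototypes would shift roughly like this (hypothetical shapes, the real ones depend on how mp_sieve ends up looking in the bn_sieve branch):

/* today (simplified) */
mp_err mp_prime_is_prime(const mp_int *a, int t, int *result);
mp_err mp_prime_next_prime(mp_int *a, int t, int bbs_style);

/* with an externally managed sieve passed in (hypothetical) */
mp_err mp_prime_is_prime_ex(const mp_int *a, int t, int *result, mp_sieve *sieve);
mp_err mp_prime_next_prime_ex(mp_int *a, int t, int bbs_style, mp_sieve *sieve);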

Should I do it in the bn_sieve branch or use a new one (which would need bn_sieve to be merged already to make things simpler)?

@minad
Member

minad commented Feb 20, 2020

Close, see #160 for the reason

@minad minad closed this Feb 20, 2020