made set_double arch/mem-layout independent #160

Conversation
I don't like this implementation since it is probably much slower. This was the reason why I started accessing the floating point bits directly. However I would be interested in some benchmarks. Do you have some measurements? If too slow, I would like to keep the current version available and use it on the platforms where the current version works (e.g. I tested it on x86_64, x86, arm and arm64).
@minad It is too slow for what exactly?
@czurnieden For my use case. But at first I would like to know if it is slower and by how much.
By too slow I mean the slowdown compared to the current version. It is just a matter of what you're willing to accept. Factor 2 or 10? But if the slowdown is only marginal and the other function is easier to maintain and more portable, why not?
The while-loop does the work, so it is faster for very small input (< 10.0 or so), but the loop might run up to 1024 times. It is, as I mentioned, completely unoptimized. A variant that skips the loop completely is possible, but it needs memory that can hold the intermediate value. It would then be as fast as your version but still independent of architecture and memory layout (with the exception mentioned above). Want me to do it?
@czurnieden If you have time and it is fun for you to work on it, sure, go on! Maybe you have other more interesting things to hack on ;) For me the current version works well. But I am also using tommath only on the aforementioned very mainstream architectures (therefore for me it would also be acceptable to just restrict compilation to a whitelist of supported platforms). My requirements are just that a new version of the function should produce exactly the same outputs as the current one. Furthermore I would prefer not to have a slowdown of more than about 1.5 if we get other advantages in return.
Knowing what it does in theory seldom reflects the behaviour in the real world. So I did a little bit of benchmarking.
Mine (as it is, still not optimized at all) is about 5 times slower than yours, 4 times without the recursions (which involve very large numbers, hence more loops). Even a bit less than I would have expected, but in the expected range.
Allowing compiler optimization and optimizing my homegrown version a bit changes the picture. Please bear in mind that the actual runtime for all of the 2,906,007 calls is ~2.067 sec., ~1.888 sec., and ~1.075 sec. for the respective variants. Nearly one and a half million calls per second with a single core of an elderly AMD A8-6600K is too slow? Even with the slowest but most portable method you still get over half a million calls per second!
If you pay too much for every operation you perform, it accumulates quickly. But I agree that this function would hardly become a bottleneck. However, I don't see the advantages of replacing the current version with a more portable one. The current function makes assumptions about the float layout; however, it is also simpler than what you are proposing.
I would not add a long double version since I've not seen much long double usage anywhere. The float versions might be good for platforms which don't support double. |
But concerning the performance issue: I prefer to keep the current version since I have yet to see where it causes real problems. The current version is the reasonable implementation for every modern platform.
current version:
version in this PR:
LTM is not restricted to what Debian supports; it needs to support a whole plethora of platforms.
Yepp. Did you patch it?
The reason for this issue was an unexpected, well, forgotten behaviour of the M68K. The patch I offered was denied by you because you decided that my patch, using an algorithm that is not only commonly used in the industry (e.g. in Boost) but also provably correct for our domain, was not good enough for you.
LTM needs to support more than that.
Have you patched that?
I wouldn't call a nearly bare-metal operation simple. Loops, on the other hand, are taught to beginners.
Fast in what relation? A turtle is also fast in relation to a common garden snail as a MIG is fast in relation to a common jetliner.
ARM can switch between big and little endian. Does it work with both? LTM can run on an S/390 and its relatives, e.g. IBM System z, pSeries, iSeries. Does your version run on S/390 or any of its relatives, too? Did you check both AMD-64 and Intel-64? LTM is small enough to run on most of the bigger MCUs. Most of them don't have an FPU but some have partial support (at least that's what Digikey told me within the allotted 5 minutes of research). What about softfloat support? You were lucky that Qemu's M68K simulation took Hauser's program, which allows for gutting. I do not know if it would work with an actual M68881/M68882 or any other softfloat implementation/compiler. LTM needs to work with a lot of compilers, not all of which are standard-compliant (as far as I know, the only one with 100% C11 coverage is Intel's compiler), and some don't even come close, so LTM needs to rely on as little standard compliance as reasonably possible and check for the rest.
No, it doesn't assume it, it tests for it. You assume it without testing it. FLT_RADIX isn't necessarily 2 (IBM 360; for values see e.g. www.netlib.org/fp/dtoa.c) and the restriction to 2 is only temporary, I'm already working on it. It's FLT_RADIX=16 there, a power of 2, so it is not that much work, but testing on the S/360 is quite a chore.
I wouldn't call loops complex. The bigint is used to avoid stdint.h (some users do not want it and I'm not here to judge). It is easily replaced if you want but it needs a bit more work for long double. And it is not that much faster to be honest.
A user asked for it in #159 and as it was no work to implement it…
And more.
Why have you not tested it?
I would be cautious with "every", you never know what is to come, but otherwise: yes, you can say that for modern platforms. But LTM is used on old platforms, too. Probably even more so, because modern platforms with a lot of HP under the hood most likely use GMP if they are OK with the licence (I think even Mathematica used GMP), so it is more or less only the license that makes LTM attractive for modern platforms.

LTM is lightweight, and can be made even more so if you don't need all functions; it runs on everything and your kitchen sink, and it is simple, everyone can hack it (even I can ;-) ), but there is one thing that it is not: it is not highly optimized for speed. I have a lot in my backlog but it will need more memory, so some may find their way into the main version, some into a separate branch - maybe needforspeed, or would we get in trouble with EA for that? ;-)

It is also the main reason that I used a branch named withfloats, and I'm a bit with Tom St Denis here, who did not want floats in LTM in the first place: 'It'd be nice to not have any float code "standard" in LTM.' (in #3 at the end)

No, I'm sorry, but LTM has to serve a lot of different users with different demands. Changing something is a complicated process and not always welcomed by everyone. At least not in the beginning.
Floating point functions are optional. This was what was agreed upon when the float PR was merged.
I proposed to enable the float functions via whitelisting, which would solve that.
Hmm. I don't see why we need an epsilon test. On compliant platforms the result is exact.
The current version is shorter and it performs simpler low level operations. Shifting is simpler and faster than multiplication and addition.
Fast in relation to the other version. As I said before, if you accept a slowdown of factor two everywhere, your system will be half as fast as before.
Yes, it will work if you recompile. I don't know any system which performs the switching at runtime. We all have better things to do than to take care of such things.
What is Intel-64? Itanium? I suspect it to run on Itanium too, but the platform is irrelevant as of now.
Ok, I did not see that someone asked for long double support. I would probably still keep it out of the library. @sjaeckel What do you think?
That I'm currently on mobile and only back at home on Sunday :-)
@czurnieden Sorry, it seems I edited your comment while answering. I tried to revert to the old version manually, since the horrible interface does not support reverting :(
@czurnieden Thanks for the changes! I am happier now :) I would still drop long double support since I've only seen it used rarely.
...assumes IEEE754
LTM needs to support more than that.
there are issues with detection of supported platform in the preprocessor
[etc.]
...but there is more.
You all are *assuming* the platform that is running LTM is the one that created the double. That is NOT always true. One example: I have had to manipulate floats received in binary files. Whether the sender was IEEE754 compliant or not was known to me, but not to my compiler.
To condition the compile based solely on the platform being IEEE754 is incorrect. There are no FP operations performed so it matters not if the FPU is IEEE754 compliant or not. Nothing will blow up!
Any conditioning should be based on the state of the data being in the correct format. Unfortunately you cannot know that until runtime, and even then you cannot really know. Nor can the get/set functions really know the intended format except through arguments such as FLT_RADIX and DBL_MAX_EXP passed to the functions.
Please, mp_get_double() and mp_set_double() should ALWAYS be compiled (not conditioned), and the functions' signatures should be extended to include the necessary arguments. (The existing signatures could be transformed to functions that set supported default values before calling the extended signature versions.)
|
@buggywhip Yes, we assume that and it is the right thing to do. You are doing something wrong when you store invalid doubles received from an untrusted source inside a C variable of type double. However, what you might want is a function that works on such raw external representations directly.
There's always more and always will be, no matter what. ;-)
There are none if the FPU is compliant, but there are if it is not. There is also the lingering problem of rounding, which we didn't even touch yet, but which can only be handled with a reasonable amount of work if the machine that runs LTM is IEEE-compliant in that regard, or at least offers some method to read the rounding mode. But that's not what you meant, I think?
The conditioning that exists now is just a filter to restrict the input to something the function knows how to handle. That makes the function short and simple. For larger values of "short" and "simple" ;-) To be able to handle input with the least amount of restriction we need some more information: let the input be a floating point number given by its sign, significand, radix, and exponent. The other function that does it in reverse needs the same information, of course. So a function that can do it would take all of those as explicit parameters. That would be general enough to cover most of the cases (just because I couldn't come up with something that it doesn't cover doesn't mean there isn't anything lurking in the dark waiting for the moment to spread utter despair and hopelessness all over the quivering mass of innocent little programmers—but I digress) and the user can easily roll their own little wrappers to do their deeds. At least that's what I think you want from reading:
Not that I wouldn't like the idea, I do, but: who wants to write it? Any volunteers? Yes, I can already hear you saying that you don't need such a broad function, but a large general function with a handful of small wrappers is almost always easier to maintain than a handful of medium, closely related functions; and the number "handful" will grow in size, no matter what. I think, in contrast to my colleague, that it still would be a "nice to have", but just that: a "nice to have" with the "nice to have" priority that comes along with it. So: no reason to give up all hope, but it might take a while and might not even find its way into LTM in the end.
Well, it was just C&P and a very small bit of adjustment.
LTM, including the floating point operations, runs well on the M68K, where the FPUs have that little quirk but are otherwise compliant enough to be included. And it is only for the test and only for the M68K, which is still in use (it's a tough little CPU that takes a knock or two and is still available).

Ah…what else was there… the current S/390 OS is z/OS 1.1.0 and the compiler there has a switch to get IEEE-754 compliant floats (but only in Unix mode?).

I'm currently setting up a 16-bit test environment (only FreeDOS with qemu) with DJGPP, BCC, Mars, and OpenWatcom for some full 16-bit tests. This may take a while ;-) Yes, I am aware of the limitations of that project, but it's better than nothing and, best of all: fun to do!

That was a long one, sorry for that.
This is similar to the function
You are not the one being lazy here. You are doing the stuff and I am coming and shooting it down ;)
You are not the one being lazy here. You are doing the stuff and I am coming and shooting it down ;) Concerning the epsilon test, I am OK with using it for M68K, but not for x86_64 and processors where the result is precise. Btw, you didn't write about
Wasn't written in less than 5 minutes, so: yes, I want to keep it ;-) The CPUs supporting it still have quite a large distribution and it makes the thing more complete. I think I'll add a little measure to check for it. So, no quad precision, but I will add a line or two to include more architectures in the detection.
It probably is not only the M68K. See for example the GCC manual for -mfpmath. But that needs some more testing. I don't think I have actual i386 hardware with an FPU, so qemu must do it for now.
Approve with minor changes
Successfully tested this branch with an architecture that has it.
Guest:
Host:
(oldstable only, couldn't get Debian 9.8.0 to boot in qemu-system-arm, was too impatient)
Works with SPARC64
Cool! Do you have a sparc or is this a qemu instance? You mentioned qemu a few times before, so I assume that. May I ask you a favor: could you somehow share these qemu images you produced for the different architectures? I know I could produce them myself, and I also did that a few times, but it still requires a lot of effort imho. This would allow me to also run the test on those - in particular sparc and sparc64 are interesting because of big endian etc. I think there is a website where some guy offers many images for different architectures, but they were very outdated.
Right now I have access to x86, x86_64, arm, arm64 hardware, but not more (probably also some mips somewhere, but I don't know). And none of those is genuinely big endian :( It seems also that Debian dropped architectures and there are not many big endians left?
Yeah, why do I even ask *sigh* ;-) OK, started it: home page with the first paragraph from https://www.libtom.net/LibTomMath/
Started https://github.com/libtom/libtommath/wiki/Successes-and-Failures-of-Building-LibTomMath A small caveat: GitHub's markup does not support code blocks inside (nested?) lists. Either find a way to use Markdown to do so or use HTML as I did. It also seems you cannot use Markdown inside HTML lists; you need to use e.g. HTML links.
Learned that Travis has quite a long timeout for a single job: 50 minutes, so I added the ability to run Valgrind to
cast.ldbl = b;
exp = (int)cast.ldbl_guts.exponent;
Can we avoid using bitfields here? I would use shift and masking to access bits since bitfields are not well specified. I use bitfields only for packing if the layout does not matter.
Can we avoid using bitfields here?
Yes, we can (I think). But I found bit-fields simpler.
bitfields are not well specified
They are quite well defined in the standard (vid. e.g. 6.7.2.1 in INCITS/ISO/IEC 9899-2011[2012] for a start), but this one uses "or some other implementation-defined type[s]", so it is not "standard" in the narrowest sense, admitted; but long double is not "fully standard" either.
That's why I have it restricted to the two compilers GCC and clang. The MinGW port of GCC already needed some arm-twisting to make it compliant. MSVC can also be persuaded, at least that's what they told me, but I don't have one to test it, and it is also moot because MSVC does not support long double as far as I remember.
And it is for long double, a type in widespread use but restricted to the x87 architecture. The number of different hardware implementations is two. I think this code handles both of them sufficiently.
Yes, the world would be a happier place if long double went extinct and got replaced with a proper binary128. Or binary96 if you insist. But it won't happen, at least not in the near future. Maybe together with 128-bit registers?
@czurnieden I am going to close your outdated PRs for now, since they would need a lot of rebasing and the discussion somehow died off without a merge a while ago. I am also at least partially responsible for that due to my refactorings etc. For now it would probably be best if we concentrate on fixing a few correctness issues and clean up the remaining API issues (#465 etc). But since you still seem to have a lot in your pipeline, it would be nice to have some kind of overview of what you have and what would be nice to have upstreamed here. But if possible, the PRs should be small and easy to review, such that the time frame for a merge is shorter and the PRs don't start to bit rot again.
It is a bit slower (although completely unoptimized yet) but independent of anything but float.h and the macros DBL_MAX_EXP and FLT_RADIX therein. The same code can be used for float with minimal changes (just change DBL_MAX_EXP to FLT_MAX_EXP). The same could be said for long double, but that needs some more tests because of the slightly different format of long double, and not everyone has it.

It is now independent of the type of the architecture and its memory layout. The values of DBL_MAX_EXP and FLT_RADIX could be hardcoded to avoid the inclusion of float.h, but the existence of float.h is a good sign for the availability of at least some floating point functions.