Test, or block, big-endian CPUs #2362

ChrisJefferson · 2018-04-14T17:02:34Z

#2292 shows we have a bug generating random numbers on big-endian CPUs.

I'm going to suggest a major change. Unless someone is willing to run regular tests, including package tests, then I suggest we print a big warning, or maybe even refuse to build, on big-endian machines. At the moment we can't have confidence we are producing correct answers, which is worrying for software like GAP.

fingolfin · 2018-07-04T16:22:00Z

@dimpase I wonder if you have access to a big endian system, given that you reported #2292? And if so, whether it would be possible to run regular automated tests on it, and / or give one of us SSH access to it for limited testing?

dimpase · 2018-07-04T16:36:28Z

I'll be happy to ask for access to the SPARC Solaris system for you; otherwise I can run automatic tests if setting them up is not too hard (i.e. if a setup is more or less ready; I don't want to mess around writing scripts for this...)

PS. The idea of blocking big-endian doesn't sound right to me...

fingolfin · 2018-07-04T16:38:21Z

@dimpase if you can get me SSH access, I can probably fix #2292

dimpase · 2018-07-04T16:46:33Z

@fingolfin just emailed the admin of SPARC at Warwick, cc'd to you.

markuspf · 2018-07-06T09:45:07Z

Out of interest: What's the motivation to use SPARCs?

dimpase · 2018-07-06T09:54:29Z

@markuspf - people are interested in running Sagemath on a SPARC (SPARCs are still being made, you know...) And this is one of the few big-endian platforms currently in use, so this gives added value (as well as the main problem to solve) to this port.

markuspf · 2018-07-06T10:06:18Z

@dimpase that doesn't really answer my question; what's "those people"s motivation to use SPARCs?

This is not about confrontation or dismissal, this is about curiosity.

dimpase · 2018-07-06T10:12:51Z

@markuspf - I understand that they sit (in Warwick, say) with hardware which can potentially be used by students running various things via Jupyter, for example. These are quite big machines, with hundreds of CPUs and hundreds of GB of RAM.

jdemeyer · 2018-07-06T10:26:19Z

My impression is that SPARCs are really slow in terms of raw computational speed. They are designed as servers, not for doing math.

fingolfin · 2018-07-06T10:32:41Z

My personal background is that I grew up on big endian systems (first Motorola 68k, later PowerPC); I also have a background (from other projects) in writing highly portable software; as such the notion of "just dropping support for big endian" is highly troubling for me, so I'd like to keep support for them in, even if we may not "officially support" them anymore (i.e.: we may not claim they work, and print a message saying something like "use at your own risk"). All that said, I thoroughly hope that big endian machines will simply die out in the coming decades. However, right now, new SPARC processors still get developed, so this will not be soon (though one might hope that they switch to using them in LE mode; but that, too, is unlikely for the kind of users SPARC systems have).

But as @jdemeyer points out, I doubt that most SPARC users will have any interest in using GAP, and vice versa. If you want to do HPC, don't buy SPARC. So, I am not really particularly interested in SPARC support myself, and kind of think it's a waste of time. It can be helpful to debug big endian issues, though, which I am interested in, as explained above. As long as I waste my personal time on it, I don't see how it incurs any cost to GAP. Though of course it would be far better if those people actually interested in SPARC support would put in some coding effort themselves and/or hired somebody to do it...

jdemeyer · 2018-07-06T10:39:19Z

Let me also add that recent PowerPC chips support both little- and big-endian. In Ghent we have a POWER8 machine (running in little-endian mode) and that's a fast machine that I use for real work.

ChrisJefferson · 2019-01-26T13:35:30Z

Update: Is anyone willing to take on official testing of bigendian? If not, I will shortly submit a PR which makes GAP print a warning on bigendian systems on startup saying the results of GAP should not be trusted.

dimpase · 2019-01-26T13:39:54Z

As we have access to a bigendian system, I suppose I should volunteer to set up testing. Is there any guide, preferably short, explaining this?

ChrisJefferson · 2019-01-26T13:42:05Z

@alex-konovalov will know best how to run regular testing (I don't know if it will be possible/reasonable to link it into the nightly testing already done).

fingolfin · 2019-07-11T13:48:52Z

It seems our murmur hash is also endian-dependent, which I learned not because somebody reported it, but rather because I stumbled by accident over https://src.fedoraproject.org/rpms/gap/c/ac34d6a95cf8d22267f9314fb0d7ef44ac5d65ef?branch=master. Ah well, as so often happens in the sad history of Linux package distributions: somebody notices an issue downstream (in this case: Fedora) and works around it, but never reports it back upstream (i.e., to us in this case). Extra sad: apparently, they do have the resources to test GAP on different endian architectures, but that's not helpful for us as long as there is zero communication :-(.
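To illustrate how a word-at-a-time hash can become endian-dependent, here is a minimal Python sketch. It is not GAP's murmur implementation: `mix_words`, its constants and the sample bytes are hypothetical and chosen only for illustration. The point is that the same four bytes decode to different 32-bit words depending on byte order, so a hash that reads native words sees different input on different CPUs, whereas decoding with one explicit byte order gives the same value everywhere.

```python
import struct

MASK32 = 0xFFFFFFFF


def mix_words(data: bytes, word_format: str, seed: int = 0) -> int:
    """Toy 32-bit word-at-a-time hash (illustration only, not GAP's code).

    word_format is a struct format for one 32-bit word:
    "<I" decodes little-endian, ">I" decodes big-endian.
    """
    # Pad with zero bytes so the input splits into complete 4-byte words.
    padded = data + b"\x00" * (-len(data) % 4)
    h = seed & MASK32
    for offset in range(0, len(padded), 4):
        (k,) = struct.unpack_from(word_format, padded, offset)
        k = (k * 0xCC9E2D51) & MASK32          # scramble the word
        h ^= k
        h = ((h << 13) | (h >> 19)) & MASK32   # rotate left by 13 bits
        h = (h * 5 + 0xE6546B64) & MASK32
    return h


data = b"\x01\x02\x03\x04"

# What a naive "read the next native word" loop sees on each kind of CPU:
as_little_endian_cpu = mix_words(data, "<I")
as_big_endian_cpu = mix_words(data, ">I")
print(hex(as_little_endian_cpu), hex(as_big_endian_cpu))

# A portable variant always decodes with one fixed byte order ("<I" here),
# regardless of the CPU it runs on.
portable = mix_words(data, "<I")
```

The design choice in the portable variant is to treat the input as a byte sequence with a defined word encoding, rather than as whatever the CPU's native word layout happens to be.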

This is super frustrating. I am really tempted to give in and agree with @ChrisJefferson now, throw in the towel, explicitly add code to stop GAP from compiling on bigendian, and call it a day.

I am guessing the person behind this is GitHub user @jamesjer (apologies if this is wrong). Perhaps they'd be willing to let us know how come there are all these patches and yet we never hear about them? Are we doing something wrong that prevents it? I'd love to hear.

There are other weird things, like

https://src.fedoraproject.org/rpms/gap/blob/master/f/gap-stat.patch which adds kernel functions ObjInt_LongLong and ObjInt_ULongLong (what's wrong with ObjInt_Int8 and ObjInt_UInt8 ???),
https://src.fedoraproject.org/rpms/gap/blob/master/f/gap-ref.patch where I have no idea what it is about
https://src.fedoraproject.org/rpms/gap/blob/master/f/gap-doc.patch which documents functions that we chose to not document (bad: users are told that we promise to not change things we explicitly documented; so this is kinda like making a promise in our name without authorisation)
https://src.fedoraproject.org/rpms/gap/blob/master/f/gap-help.patch which seems to be useful for upstream.

Don't get me wrong: I appreciate the work packagers invest into packaging GAP for Fedora, Debian, and many other distributions. I just wish it was less of a one-way street and there was more communication :-(.

jamesjer · 2019-07-12T02:50:51Z

I replied to the bulk of this in private email to Professor Horn. I will let him choose which parts, if any, he wishes to make public. I do have shell access to an s390x machine. I do not have administrator privileges on that machine, so I cannot give anyone else access. If you would like me to run tests, I am happy to do so. Please let me know the nature of the tests that you would like run, and I will schedule them onto my calendar at regular intervals.

dimpase · 2019-07-12T06:38:57Z

they do have the resources to test GAP on different endian architectures, but that's not helpful for us

@alex-konovalov : as you could see, my offer to set up testing on Sparc was not followed through. The Sparc we have access to is still there and running.

ChrisJefferson · 2019-07-12T07:17:18Z

@dimpase : The problem is we don't really have a good guide. We currently use travis for our day-to-day testing, so one could go through all the travis jobs, find the shell scripts they run, and run those on master and stable branches. You can observe the jobs running on travis by looking at:

https://travis-ci.org/gap-system/gap - for main tests
https://travis-ci.org/gap-infra/gap-docker-pkg-tests-master - for packages

I'd be very happy for you to run those tests, check the results regularly, and report back issues which arise just on big-endian machines. That would take a fair chunk of time, and wouldn't be entirely automated, so it's not really a reasonable request. On the other hand I'm not sure anyone else would care about tidying things up to make it easier.

Various devs have been making the tests easier and more automated to run, and that work continues, and at some future point we might get to the point where this is easier. However, there will always be the problem that many packages often fail due to changes in master, and someone has to look at what the problem is, and where it should be reported.

Fundamentally, in my mind, the problem isn't just computing power, it's person power. I could probably buy a big-endian machine if I cared, but someone has to keep looking at the tests every week (every month at least), track down bugs, and do the week-long deep dives when something REALLY weird happens. GAP has so many other issues already that I'm not sure spending time keeping big-endian support going is a good investment, unless there is someone who really wants it.

ChrisJefferson · 2019-07-12T07:22:12Z

Oh, one other thing. Before releases I run test install with the configure option --enable-memory-checking. That takes a fairly modern PC about 30 CPU days. I also run in valgrind with --enable-valgrind, which takes about 10 CPU days. They check GAP's internals, and check for various memory issues. I don't know if they are ever likely to fail on a big-endian machine, given they pass on little-endian machines.

fingolfin · 2019-07-14T12:46:05Z

Folks, I'd like to clarify here that my previous comment regarding the package patches in Fedora was badly written; in particular, I did not mean to attack @jamesjer with that, who I think is doing an outstanding job with the GAP packages for Fedora. That said, of course what I intend with an action sadly does not always match how it is perceived on the outside :-/. So my apologies to @jamesjer here, who took the time to carefully explain to me in a private email why he is not always able to send patches back upstream (in a nutshell, because he's doing a job that really should be taken care of by several people, not one person in their free time). BTW, I just noticed that he even made at least one big endian related PR #782 !

Of course that doesn't solve the bigendian issue; I think @ChrisJefferson explained it quite well. Nothing short of automated testing, the results of which we must be able to see easily, will help with this in the long run.

I still think we should not explicitly sabotage building on big endian, but we might add a banner text which is shown when starting GAP on big endian systems, and which explains that big endian is not fully supported, may give bad results, and that help is wanted.
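As a rough sketch of the banner idea, written in Python rather than in GAP's own kernel code: the function name `startup_banner` and the banner wording are hypothetical, and only `sys.byteorder` is a real standard-library attribute. The check is made at startup and produces a warning instead of refusing to run.

```python
import sys
from typing import Optional

# Hypothetical wording; the thread only says what the banner should convey.
BANNER = (
    "WARNING: big endian systems are not fully supported. "
    "Results may be wrong. Help with testing is wanted."
)


def startup_banner(byteorder: str = sys.byteorder) -> Optional[str]:
    """Return the warning text on big endian hosts, otherwise None.

    sys.byteorder is "big" or "little"; it is a parameter here only so
    that both branches can be exercised on any machine.
    """
    return BANNER if byteorder == "big" else None


message = startup_banner()
if message is not None:
    print(message, file=sys.stderr)
```

Warning rather than aborting keeps big endian builds usable for debugging while making the support status visible to users.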

dimpase · 2019-07-14T13:34:07Z

I am wondering whether it would be feasible to install gitlab's CI system on the Solaris system we have access to, so that tests may be automated. (It might be easier on a bigendian system running Linux, though). A first step would be to create a gitlab config to run CI of GAP on their (hosted) CI. Sage does have some kind of CI settings for gitlab, so we can try to coordinate this.

olexandr-konovalov · 2019-07-17T17:46:13Z

@dimpase @ChrisJefferson and all: a collection of badges for the regular Travis CI tests is in this README: https://github.com/gap-system/gap-distribution/blob/master/README.md - it has more than the two mentioned above.

fingolfin · 2019-11-15T09:37:37Z

Travis just added support for multiple new architectures, among them IBM s390x (read this blog post for details), which is big endian (under Linux at least). That means we now have a chance to CI test big endian systems; I started work on PR #3744. There are some diffs for HASH_FUNC_FOR_PPERM which are not so surprising (and we can now argue whether it is worth fixing them, or whether it is OK that hashes work differently on different archs). Indeed, many other languages actually intentionally use different seeds for their hashes on each run, to protect against hash/dictionary poisoning; of course we are not really concerned about this, and there is some benefit in having results reproducible across systems... Alas, we already accept the difference between 32 and 64 bit hashes, so...

ChrisJefferson · 2019-11-15T10:29:53Z

Personally, I'd actually prefer shuffling the hashes on each run, or trying to make them fixed on everything. I would have a small preference for shuffling.

On new software I now try to either say orders are always fixed and we promise that, or they are always shuffled, as anything else (as I'm sure many people have experienced) is a source of subtle bugs.
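The two policies (always fixed, or always shuffled) can be sketched in Python as follows. This is an illustration under assumptions, not a proposal for GAP's kernel: `keyed_hash`, `make_hasher` and `FIXED_KEY` are hypothetical names, and BLAKE2b from the standard library merely stands in for whatever hash function would really be used.

```python
import hashlib
import secrets
from typing import Callable

# Hypothetical constant key for the "always fixed" policy.
FIXED_KEY = b"\x00" * 16


def keyed_hash(data: bytes, key: bytes) -> int:
    """64-bit keyed hash of data.

    The digest is a byte string, and it is converted to an integer with
    an explicit byte order, so the result does not depend on the CPU.
    """
    digest = hashlib.blake2b(data, digest_size=8, key=key).digest()
    return int.from_bytes(digest, "little")


def make_hasher(shuffle: bool) -> Callable[[bytes], int]:
    """Build a hash function following one of the two policies.

    shuffle=False: same values on every run and every machine.
    shuffle=True:  a fresh random key per hasher, so values (and any
                   iteration order derived from them) change between runs.
    """
    key = secrets.token_bytes(16) if shuffle else FIXED_KEY
    return lambda data: keyed_hash(data, key)


fixed = make_hasher(shuffle=False)
shuffled = make_hasher(shuffle=True)
print(fixed(b"(1,2,3)"), shuffled(b"(1,2,3)"))
```

With the shuffled policy, code that accidentally relies on hash order tends to fail quickly and visibly on every platform; with the fixed policy, results are reproducible across systems. Python itself takes the shuffled route for `str` and `bytes` hashes by default.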

dimpase · 2020-04-19T00:03:18Z

On Sat, Jan 26, 2019 at 9:39 PM Dima Pasechnik wrote: As we have access to a bigendian system, I suppose I should volunteer to set up testing. Is there any guide, preferably short, explaining this?

Unfortunately our access to Sun is no more. So it's great that Travis may be used instead.

dimpase · 2021-01-21T11:10:17Z

Unfortunately our access to Sun is no more

I spoke too soon. The Sparc is still available; access was merely broken by a system upgrade (to Solaris 11.4), and the person responsible forgot to reply to my email from late March 2020 (well, we know, interesting times). Ping me if you want access to it.

dimpase · 2022-09-25T20:28:39Z

The Sparc is still available,

Not any more, gone about a year ago.

fingolfin added the kind: general proposed change label Mar 26, 2019

fingolfin mentioned this issue Nov 14, 2019

Add Travis tests on arm64, ppc64le, s390x #3744

Merged

fingolfin closed this as completed in #3744 Apr 18, 2020
