Test, or block, big-endian CPUs #2362

Closed
ChrisJefferson opened this issue Apr 14, 2018 · 27 comments · Fixed by #3744

Comments

@ChrisJefferson
Contributor

#2292 shows we have a bug generating random numbers on big-endian CPUs.

I'm going to suggest a major change. Unless someone is willing to run regular tests, including package tests, then I suggest we print a big warning, or maybe even refuse to build, on big-endian machines. At the moment we can't have confidence we are producing correct answers, which is worrying for software like GAP.

@fingolfin
Member

@dimpase I wonder if you have access to a big endian machine, given that you reported #2292? And if so, whether it would be possible to run regular automated tests on it, and/or give one of us SSH access to it for limited testing?

@dimpase
Member

dimpase commented Jul 4, 2018

I'll be happy to ask for access to the SPARC Solaris system for you; otherwise I can run automatic tests if setting them up is not too hard (i.e. if a setup is more or less ready; I don't want to mess around writing scripts for this...)

PS. The idea of blocking big-endian doesn't sound right to me...

@fingolfin
Member

@dimpase if you can get me SSH access, I can probably fix #2292

@dimpase
Member

dimpase commented Jul 4, 2018

@fingolfin just emailed the admin of SPARC at Warwick, cc'd to you.

@markuspf
Member

markuspf commented Jul 6, 2018

Out of interest: What's the motivation to use SPARCs?

@dimpase
Member

dimpase commented Jul 6, 2018

@markuspf - people are interested in running Sagemath on a SPARC (SPARCs are still being made, you know...) And this is one of the few big-endian platforms currently in use, which gives this port added value (on top of being the main problem to solve).

@markuspf
Member

markuspf commented Jul 6, 2018

@dimpase that doesn't really answer my question; what's "those people"s motivation to use SPARCs?

This is not about confrontation or dismissal, this is about curiosity.

@dimpase
Member

dimpase commented Jul 6, 2018

@markuspf - I understand that they sit (in Warwick, say) with hardware which can potentially be used, say, by students running various things via Jupyter, say... These are quite big machines, with hundreds of CPUs and hundreds of GB of RAM.

@jdemeyer

jdemeyer commented Jul 6, 2018

My impression is that SPARCs are really slow in terms of raw computational speed. They are designed as servers, not for doing math.

@fingolfin
Member

My personal background is that I grew up on big endian systems (first Motorola 68k, later PowerPC); I also have a background (from other projects) in writing highly portable software; as such the notion of "just dropping support for big endian" is highly troubling for me, so I'd like to keep support for them in, even if we may not "officially support" them anymore (i.e.: we may not claim they work, and print a message saying something like "use at your own risk"). All that said, I thoroughly hope that big endian machines will simply die out in the coming decades. However, right now, new SPARC processors still get developed, so this will not be soon (though one might hope that they switch to using it in LE mode; but that, too, is unlikely for the kind of users of SPARC systems).

But as @jdemeyer points out, I doubt that most SPARC users will have any interest in using GAP, and vice versa. If you want to do HPC, don't buy SPARC. So, I am not really particularly interested in SPARC support myself, and kind of think it's a waste of time. It can be helpful to debug big endian issues, though, which I am interested in, as explained above. As long as I waste my personal time on it, I don't see how it incurs any cost to GAP. Though of course it would be far better if those people actually interested in SPARC support would put in some coding effort themselves and/or hired somebody to do it...

@jdemeyer

jdemeyer commented Jul 6, 2018

Let me also add that recent PowerPC chips support both little- and big-endian. In Ghent we have a POWER8 machine (running in little-endian mode) and that's a fast machine that I use for real work.

@ChrisJefferson
Contributor Author

Update: Is anyone willing to take on official testing of bigendian? If not, I will shortly submit a PR which makes GAP print a warning on bigendian systems on startup saying the results of GAP should not be trusted.

@dimpase
Member

dimpase commented Jan 26, 2019 via email

@ChrisJefferson
Contributor Author

@alex-konovalov will know best how to run regular testing (I don't know if it will be possible/reasonable to link it into the nightly testing already done).

@fingolfin
Member

fingolfin commented Jul 11, 2019

It seems our murmur hash is also endian dependent, which I learned not because somebody reported it, but rather because I stumbled by accident over https://src.fedoraproject.org/rpms/gap/c/ac34d6a95cf8d22267f9314fb0d7ef44ac5d65ef?branch=master. Ah well, as so often happens in the sad history of Linux package distributions: somebody notices an issue downstream (in this case: Fedora) and works around it, but never reports it back to upstream (i.e., to us in this case). Extra sad: apparently, they do have the resources to test GAP on different endian architectures, but that's not helpful for us as long as there is zero communication :-(.
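For readers unfamiliar with how a block-based hash becomes endian dependent, here is a minimal sketch in Python. It is not GAP's murmur hash and not the Fedora patch; `read_block` and `toy_hash` are hypothetical names, and the mixing step is a toy. The only point is that reading the same four bytes as a 32-bit integer in native order gives different values on little- and big-endian machines, so any hash built on such reads differs too, whereas reading in a fixed order does not.

```python
import sys


def read_block(data: bytes, offset: int, byteorder: str) -> int:
    """Interpret 4 bytes starting at `offset` as an unsigned 32-bit integer."""
    return int.from_bytes(data[offset:offset + 4], byteorder)


def toy_hash(data: bytes, byteorder: str) -> int:
    """Toy block hash (NOT MurmurHash): mixes 32-bit blocks read in `byteorder`.

    Trailing bytes that do not fill a whole block are ignored for brevity.
    """
    h = 0
    for offset in range(0, len(data) - len(data) % 4, 4):
        k = read_block(data, offset, byteorder)
        # XOR the block in, then multiply by an odd constant modulo 2**32.
        h = ((h ^ k) * 0x9E3779B1) & 0xFFFFFFFF
    return h


data = b"GAP!"
native = toy_hash(data, sys.byteorder)  # mimics a raw native-order memory load
portable = toy_hash(data, "little")     # fixed byte order: same on every host
print(f"native={native:#010x} portable={portable:#010x}")
```

On a little-endian host the two printed values agree; on a big-endian host they would not, which is the kind of difference described above. Fixing the byte order of the reads is one common way to make such a hash identical across architectures.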

This is super frustrating. I am really tempted to give in and agree with @ChrisJefferson now, throw in the towel, explicitly add code to stop GAP from compiling on bigendian, and call it a day.

I am guessing the person behind this is GitHub user @jamesjer (apologies if this is wrong). Perhaps they'd be willing to let us know how come there are all these patches and yet we never hear about them? Are we doing something wrong that prevents it? I'd love to hear.

There are other weird things, like

Don't get me wrong: I appreciate the work packagers invest into packaging GAP for Fedora, Debian, and many other distributions. I just wish it was less of a one-way street and there was more communication :-(.

@jamesjer
Contributor

I replied to the bulk of this in private email to Professor Horn. I will let him choose which parts, if any, he wishes to make public. I do have shell access to an s390x machine. I do not have administrator privileges on that machine, so I cannot give anyone else access. If you would like me to run tests, I am happy to do so. Please let me know the nature of the tests that you would like run, and I will schedule them onto my calendar at regular intervals.

@dimpase
Member

dimpase commented Jul 12, 2019

> they do have the resources to test GAP on different endian architectures, but that's not helpful for us

@alex-konovalov : as you could see, my offer to set up testing on Sparc was not followed through. The Sparc we have access to is still there and running.

@ChrisJefferson
Contributor Author

ChrisJefferson commented Jul 12, 2019

@dimpase : The problem is we don't really have a good guide. We currently use travis for our day-to-day testing, so one could go through all the travis jobs, find the shell scripts they run, and run those on master and stable branches. You can observe the jobs running on travis by looking at:

https://travis-ci.org/gap-system/gap - for main tests
https://travis-ci.org/gap-infra/gap-docker-pkg-tests-master - for packages

I'd be very happy for you to run those tests, check the results regularly, and report back issues which arise just on big-endian machines. That would take a fair chunk of time, and wouldn't be entirely automated, so it's not really a reasonable request. On the other hand I'm not sure anyone else would care about tidying things up to make it easier.

Various devs have been making the tests easier and more automated to run, and that work continues, and at some future point we might get to the point where this is easier. However, there will always be the problem that many packages often fail due to changes in master, and someone has to look at what the problem is, and where it should be reported.

Fundamentally, in my mind, the problem isn't just computing power, it's person power. I could probably buy a big-endian machine if I cared, but someone has to keep looking at the tests every week (or every month at least), track down bugs, and do the week-long deep dives when something REALLY weird happens. GAP already has so many other issues that I'm not sure spending time keeping big-endian support going is a good investment, unless there is someone who really wants it.

@ChrisJefferson
Contributor Author

ChrisJefferson commented Jul 12, 2019

Oh, one other thing. Before releases I run test install with the configure option --enable-memory-checking. That takes a fairly modern PC about 30 CPU days. I also run in valgrind with --enable-valgrind, which takes about 10 CPU days. They check GAP's internals, and check for various memory issues. I don't know if they are ever likely to fail on a big-endian machine, given they pass on little-endian machines.

@fingolfin
Member

Folks, I'd like to clarify here that my previous comment regarding the package patches in Fedora was badly written; in particular, I did not mean to attack @jamesjer with that, who I think is doing an outstanding job with the GAP packages for Fedora. That said, of course what I intend with an action sadly does not always match how it is perceived on the outside :-/. So my apologies to @jamesjer here, who took the time to carefully explain to me in a private email why he is not always able to send patches back upstream (in a nutshell, because he's doing a job that really should be taken care of by several people, not one person in their free time). BTW, I just noticed that he even made at least one big endian related PR #782 !

Of course that doesn't solve the bigendian issue; I think @ChrisJefferson explained it quite well. Nothing short of automated testing, the results of which we must be able to see easily, will help with this in the long run.

I still think we should not explicitly sabotage building on big endian, but we might add a banner text which is shown when starting GAP on big endian systems, and which explains that big endian is not fully supported, may give bad results, and that help is wanted.
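As a rough illustration of the banner idea (not GAP's actual startup code, which is written in C and GAP's own language), here is a minimal Python sketch. The function name `big_endian_banner` and the banner wording are assumptions made up for this example; only `sys.byteorder` is a real standard-library attribute.

```python
import sys


def big_endian_banner(byteorder: str = sys.byteorder) -> str:
    """Return a warning banner for big-endian hosts, or "" otherwise.

    `byteorder` is a parameter only so the function can be exercised on
    any machine; it defaults to the byte order of the running host.
    """
    if byteorder != "big":
        return ""
    return (
        "WARNING: this is a big-endian system.\n"
        "Big-endian is not fully supported and may give wrong results.\n"
        "Help with testing is wanted."
    )


# At startup: print the banner to stderr only when there is one to show.
banner = big_endian_banner()
if banner:
    print(banner, file=sys.stderr)
```

Returning the text instead of printing it directly keeps the check testable on little-endian machines, which is where most development happens.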

@dimpase
Member

dimpase commented Jul 14, 2019 via email

@olexandr-konovalov
Member

@dimpase @ChrisJefferson and all: a collection of badges to regular Travis CI tests is in this README: https://github.com/gap-system/gap-distribution/blob/master/README.md - it has more than the two mentioned above.

@fingolfin
Member

Travis just added support for multiple new architectures, among them IBM s390x (read this blog post for details), which is big endian (under Linux at least). That means we now have a chance to CI test big endian systems; I started work on PR #3744. There are some diffs for HASH_FUNC_FOR_PPERM, which are not so surprising (and we can now argue whether it is worth fixing them, or whether it is OK that hashes work differently on different archs). Indeed, many other languages actually intentionally use different seeds for their hashes each run, to protect against hash/dictionary poisoning; of course we are not really concerned about this, and there is some benefit in having results reproducible across systems... Alas, we already accept the difference between 32 and 64 bit hashes, so...

@ChrisJefferson
Contributor Author

Personally, I'd actually prefer shuffling the hashes on each run, or trying to make them fixed on everything. I would have a small preference for shuffling.

On new software I now try to either say orders are always fixed and we promise that, or they are always shuffled, as anything else (as I'm sure many people have experienced) is a source of subtle bugs.
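To make the "always fixed" versus "always shuffled" choice concrete, here is a small Python sketch. It uses a seeded 64-bit FNV-1a hash as a stand-in, not GAP's hash functions; `seeded_fnv1a`, `FIXED_SEED` and `PER_RUN_SEED` are hypothetical names, and mixing the seed into the offset basis is just one simple way to seed it. Because it consumes the input one byte at a time, the result does not depend on the host's byte order.

```python
import secrets

FNV_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV-1a offset basis
FNV_PRIME = 0x100000001B3        # standard 64-bit FNV prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def seeded_fnv1a(data: bytes, seed: int = 0) -> int:
    """64-bit FNV-1a over `data`, with `seed` XORed into the starting state."""
    h = (FNV_OFFSET ^ seed) & MASK64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


# Policy A: fixed seed, so hashes are reproducible on every run and machine.
FIXED_SEED = 0

# Policy B: a fresh seed per process, so no code can come to rely on
# a particular hash value or iteration order.
PER_RUN_SEED = secrets.randbits(64)

key = b"some permutation"
print(f"fixed:   {seeded_fnv1a(key, FIXED_SEED):#018x}")
print(f"per-run: {seeded_fnv1a(key, PER_RUN_SEED):#018x}")
```

Running this twice prints the same "fixed" value and (almost certainly) a different "per-run" value, which is the behaviour each policy promises.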

@dimpase
Member

dimpase commented Apr 19, 2020 via email

@dimpase
Member

dimpase commented Jan 21, 2021

> Unfortunately our access to Sun is no more

I spoke too soon. The Sparc is still available, just due to a system upgrade (to Solaris 11.4) the access was broken, and the person responsible forgot to reply to my email from late March 2020 (well, we know, interesting times). Ping me if you want access to it.

@dimpase
Member

dimpase commented Sep 25, 2022

> The Sparc is still available,

Not any more, gone about a year ago.
