Skip to content

[ML] Upgrade to gcc 10.3 for Linux compilation #2028

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Sep 20, 2021

Conversation

droberts195
Copy link
Contributor

@droberts195 droberts195 commented Sep 15, 2021

gcc 10.3 contains the fix for the bug that hinders
compilation of PyTorch on aarch64.

The binutils package used for compiling the final
distribution is also upgraded, from version 2.34 to
version 2.37, and the version used for bootstrapping
the cross compiler from version 2.25 to version 2.27.
patchelf is upgraded from version 0.10 to version
0.13.

gcc 10.3 contains the fix for the bug that hinders
compilation of PyTorch on aarch64.

The binutils package is also upgraded, from version
2.34 to version 2.37, and patchelf is upgraded from
version 0.10 to version 0.13.
Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@droberts195
Copy link
Contributor Author

A retest now should work on linux-x86_64, and only fail on linux-aarch64.

@droberts195
Copy link
Contributor Author

retest

@droberts195 droberts195 marked this pull request as ready for review September 16, 2021 15:03
@droberts195
Copy link
Contributor Author

There are two test failures for linux-aarch64 now:

[Exception] - critical check distanceToSorted(selectedForTree) < 0.005 has failed [0.0069714285714285711 >= 0.0050000000000000001]
 == [File] - CBoostedTreeTest.cc
 == [Line] -1127
[Exception] - critical check maths::CBasicStatistics::mean(LLC) > 0.5 * maths::CBasicStatistics::mean(LLR) has failed [-4.8614985805830226 <= -4.5513714779611059]
 == [File] - CXMeansOnlineTest.cc
 == [Line] -757

So it seems that gcc 10.3 has changed the way some maths calculations are done, but only on aarch64.

@droberts195
Copy link
Contributor Author

droberts195 commented Sep 16, 2021

There is also another problem to be investigated, which is why are we accidentally shipping libasan.so.6 on linux-aarch64 when cross-compiling?

@droberts195
Copy link
Contributor Author

Looking in more detail at:

[Exception] - critical check maths::CBasicStatistics::mean(LLC) > 0.5 * maths::CBasicStatistics::mean(LLR) has failed [-4.8614985805830226 <= -4.5513714779611059]
 == [File] - CXMeansOnlineTest.cc
 == [Line] -757

The code actually prints out the relevant values immediately before the assertion, so we can see the values for all 4 platforms where we ran tests in the last PR build for this PR:

linux-x86_64

2021-09-16 17:49:40,018030 UTC [3558] DEBUG CXMeansOnlineTest.cc@755 gaussian log(L)  = -9.10274
2021-09-16 17:49:40,018121 UTC [3558] DEBUG CXMeansOnlineTest.cc@756 clustered log(L) = -3.31045

linux-aarch64

2021-09-16 17:46:06,446539 UTC [3931] DEBUG CXMeansOnlineTest.cc@755 gaussian log(L)  = -9.10274
2021-09-16 17:46:06,446593 UTC [3931] DEBUG CXMeansOnlineTest.cc@756 clustered log(L) = -4.8615

darwin-aarch64

2021-09-16 17:23:40,575188 UTC [37993] DEBUG CXMeansOnlineTest.cc@755 gaussian log(L)  = -9.10274
2021-09-16 17:23:40,575213 UTC [37993] DEBUG CXMeansOnlineTest.cc@756 clustered log(L) = -3.39478

windows-x86_64

2021-09-16 17:57:48,388588 UTC [6672] DEBUG CXMeansOnlineTest.cc@755 gaussian log(L)  = -9.10274
2021-09-16 17:57:48,388588 UTC [6672] DEBUG CXMeansOnlineTest.cc@756 clustered log(L) = -2.89212

This shows there's a lot of variation in the clustered log value. linux-aarch64 now falls outside the tolerated limit.

The last successful PR build (obviously not for this PR, but on the main branch) for linux-aarch64 using gcc 9.3 produced these values:

2021-09-15 19:32:01,414091 UTC [3771] DEBUG CXMeansOnlineTest.cc@755 gaussian log(L)  = -9.10274
2021-09-15 19:32:01,414141 UTC [3771] DEBUG CXMeansOnlineTest.cc@756 clustered log(L) = -4.03154

So linux-aarch64 always had the most negative value of all platforms for clustered log. It's just that upgrading to gcc 10.3 has finally pushed it past the point where the assertion fails.

/cc @tveasey

Latest CentOS 7 uses binutils 2.27, so the cross compiler should
be bootstrapped with this too, so that the native and cross
components use matching versions.

(Original CentOS 7 used the older binutils version 2.25, which
is why the Dockerfile originally used that.)
@droberts195 droberts195 merged commit f24bc79 into elastic:main Sep 20, 2021
@droberts195 droberts195 deleted the upgrade-gcc branch September 20, 2021 08:16
droberts195 added a commit to droberts195/ml-cpp that referenced this pull request Sep 20, 2021
The base image needs to match what the source code
expects, which is version 19 since elastic#2028 was merged.
droberts195 added a commit that referenced this pull request Sep 20, 2021
The base image needs to match what the source code
expects, which is version 19 since #2028 was merged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants