Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Clang Linux build to CI pipeline #10767

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

czentgr
Copy link
Collaborator

@czentgr czentgr commented Aug 15, 2024

The linux build is refactored to be able to switch between clang and gcc based builds of Velox.
The gcc based linux build is run on pull and push.
The clang based linux build is added to the scheduled jobs and executed on schedule
only.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 15, 2024
Copy link

netlify bot commented Aug 15, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 5fc50ea
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/671004ec8ec0f30008f99712

@czentgr
Copy link
Collaborator Author

czentgr commented Aug 15, 2024

The workflow isn't triggered. Maybe a chicken and egg problem?

@majetideepak FYI.
@assignUser Can you please take a look? Do you have any idea on how to test it?

@assignUser
Copy link
Collaborator

@czentgr the job was triggerd but failed during parsing which means it doesn't show up in the check list (not a great choice on gh's part...) https://github.com/facebookincubator/velox/actions/runs/10411698998/workflow#L53
you'll have to use either full owner/repo@version syntax or local file syntaxt with ./... https://docs.github.com/en/actions/sharing-automations/reusing-workflows#calling-a-reusable-workflow

@czentgr
Copy link
Collaborator Author

czentgr commented Aug 19, 2024

@assignUser Thank you! I fixed the path. I overlooked the part with the path in the doc.

@czentgr czentgr force-pushed the cz_add_clang_build branch 11 times, most recently from a45f534 to f12993e Compare August 21, 2024 18:20
@czentgr
Copy link
Collaborator Author

czentgr commented Aug 21, 2024

@assignUser I hope it is nothing bad but got two odd issues in my recent run:

linux-adapter with clang (the one most interested in) failed with network issue (artifact read timed out).
https://github.com/facebookincubator/velox/actions/runs/10495295073/job/29073427054?pr=10767

Run gh run download $STASH_RUN_ID --name $STASH_NAME --dir "$STASH_DIR" -R "$REPO"
error downloading ccache-linux-adapters-10767_merge: error writing zip archive: read tcp 172.18.0.2:52660->20.209.227.33:443: read: connection timed out

And MacOS14 failed during upload
https://github.com/facebookincubator/velox/actions/runs/10495295084/job/29073426792?pr=10767

with

 node:events:497
      throw er; // Unhandled 'error' event
      ^

Error: EMFILE: too many open files, open '/Users/runner/work/velox/velox/.ccache/e/a/belcl8n03chag2f4bkp5u4di8puknssR'
Emitted 'error' event on ReadStream instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -24,
  code: 'EMFILE',
  syscall: 'open',
  path: '/Users/runner/work/velox/velox/.ccache/e/a/belcl8n03chag2f4bkp5u4di8puknssR'

I will retry later.

@assignUser
Copy link
Collaborator

I have deleted all active caches for the mac 14 build (on main), that should fix this issue.

@czentgr
Copy link
Collaborator Author

czentgr commented Aug 22, 2024

The build issue for ubuntu-clang Linux with adapters is fixed in PR #10800.

@czentgr czentgr marked this pull request as ready for review August 22, 2024 15:20
@czentgr czentgr changed the title [WIP] Add Clang Linux build to CI pipeline Add Clang Linux build to CI pipeline Aug 22, 2024
@assignUser
Copy link
Collaborator

It seems the macos14 issues is related to a bug in the artifacts action, looking into it.

@czentgr czentgr force-pushed the cz_add_clang_build branch 2 times, most recently from eee46a1 to 3ead727 Compare August 28, 2024 20:09
@czentgr
Copy link
Collaborator Author

czentgr commented Aug 29, 2024

The release build with clang15 causes one of the tests in velox_dwio_common_test to SEGV. The issue only occurs with a release build.

Running main() from /builddir/build/BUILD/googletest-release-1.11.0/googletest/src/gtest_main.cc
Note: Google Test filter = ParallelForTest.E2E
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ParallelForTest
[ RUN      ] ParallelForTest.E2E

Program received signal SIGSEGV, Segmentation fault.
0x0000000000cfe2a0 in folly::InlineLikeExecutor::InlineLikeExecutor (this=0x10b8318 <folly::InlineExecutor::instance_slow()::instance>, __vtt_parm=0x8, __in_chrg=<optimized out>) at /root/deps/folly/folly/executors/InlineExecutor.h:27
27      class InlineLikeExecutor : public virtual Executor {
Missing separate debuginfos, use: dnf debuginfo-install cyrus-sasl-lib-2.1.27-21.el9.x86_64 double-conversion-3.1.5-6.el9.x86_64 gflags-2.2.2-9.el9.x86_64 glibc-2.34-120.el9.x86_64 gmock-1.11.0-1.el9.x86_64 gtest-1.11.0-1.el9.x86_64 keyutils-libs-1.6.3-1.el9.x86_64 krb5-libs-1.21.1-3.el9.x86_64 libatomic-11.5.0-2.el9.x86_64 libbrotli-1.0.9-6.el9.x86_64 libcom_err-1.46.5-5.el9.x86_64 libcurl-7.76.1-29.el9.x86_64 libdwarf-0.3.4-1.el9.1.x86_64 libevent-2.1.12-6.el9.x86_64 libgcc-11.5.0-2.el9.x86_64 libgsasl-1.10.0-3.el9.x86_64 libicu-67.1-9.el9.x86_64 libidn-1.38-4.el9.x86_64 libidn2-2.3.0-7.el9.x86_64 libnghttp2-1.43.0-6.el9.x86_64 libpsl-0.21.1-5.el9.x86_64 libselinux-3.6-1.el9.x86_64 libsodium-1.0.18-8.el9.x86_64 libssh-0.10.4-13.el9.x86_64 libunistring-0.9.10-15.el9.x86_64 libxml2-2.9.13-6.el9.x86_64 libzstd-1.5.1-2.el9.x86_64 lz4-libs-1.9.3-5.el9.x86_64 openssl-libs-3.2.2-2.el9.x86_64 pcre2-10.40-6.el9.x86_64 re2-20211101-20.el9.x86_64 xz-libs-5.2.5-8.el9.x86_64 zlib-1.2.11-41.el9.x86_64
(gdb) bt
#0  0x0000000000cfe2a0 in folly::InlineLikeExecutor::InlineLikeExecutor (this=0x10b8318 <folly::InlineExecutor::instance_slow()::instance>, __vtt_parm=0x8, __in_chrg=<optimized out>)
    at /root/deps/folly/folly/executors/InlineExecutor.h:27
#1  0x0000000000cfe302 in folly::InlineExecutor::InlineExecutor (this=0x10b8318 <folly::InlineExecutor::instance_slow()::instance>, __in_chrg=<optimized out>, __vtt_parm=<optimized out>)
    at /root/deps/folly/folly/executors/InlineExecutor.h:35
#2  0x0000000000cfe356 in folly::Indestructible<folly::InlineExecutor>::Storage::Storage<, folly::InlineExecutor>(std::in_place_t) (this=0x10b8318 <folly::InlineExecutor::instance_slow()::instance>)
    at /root/deps/folly/folly/Indestructible.h:152
#3  0x0000000000cfe28c in folly::Indestructible<folly::InlineExecutor>::Indestructible<folly::InlineExecutor, folly::InlineExecutor> (this=0x10b8318 <folly::InlineExecutor::instance_slow()::instance>)
    at /root/deps/folly/folly/Indestructible.h:73
#4  0x0000000000cfe1ed in folly::InlineExecutor::instance_slow () at /root/deps/folly/folly/executors/InlineExecutor.cpp:24
#5  0x0000000000952ca6 in ParallelForTest_E2E_Test::TestBody() ()
#6  0x00007ffff796ee0c in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [clone .constprop.0] () from /lib64/libgtest.so.1.11.0
#7  0x00007ffff794f826 in testing::Test::Run() () from /lib64/libgtest.so.1.11.0
#8  0x00007ffff794f9f0 in testing::TestInfo::Run() () from /lib64/libgtest.so.1.11.0
#9  0x00007ffff794faf9 in testing::TestSuite::Run() () from /lib64/libgtest.so.1.11.0
#10 0x00007ffff795efc5 in testing::internal::UnitTestImpl::RunAllTests() () from /lib64/libgtest.so.1.11.0
#11 0x00007ffff795c7c8 in testing::UnitTest::Run() () from /lib64/libgtest.so.1.11.0
#12 0x00007ffff7fac154 in main () from /lib64/libgtest_main.so.1.11.0
#13 0x00007ffff6229590 in __libc_start_call_main () from /lib64/libc.so.6
#14 0x00007ffff6229640 in __libc_start_main_impl () from /lib64/libc.so.6
#15 0x00000000008f8395 in _start ()
(gdb) p $_siginfo
$1 = {si_signo = 11, si_errno = 0, si_code = 1, _sifields = {_pad = {8, 0 <repeats 27 times>}, _kill = {si_pid = 8, si_uid = 0}, _timer = {si_tid = 8, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 8,
      si_uid = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 8, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x8, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}},
    _sigpoll = {si_band = 8, si_fd = 0}}}

A SEGV at signal address 0x8. So a nullptr that tries to access at offset 0x8.
Perhaps some optimization in clang that causes the issue from the test due to instruction reordering.

The issue occurs when calling the constructor of a static object

static Indestructible<InlineExecutor> instance;

Further investigation shows that the VTT InlineExecutor symbol for a release build is not linked.
Command: nm velox_dwio_common_test -C | grep InlineExecutor

It has no address:

                 v VTT for folly::InlineExecutor

while for a debug build it shows:

000000000185d3e0 V VTT for folly::InlineExecutor

The symbol not having an address would result in the SEGV as it results in a NULL pointer access.

I don't know why this is compiled like by Clang15 and not linked properly. It seems the symbol is optimized away. It could also have something to do with the fact that it is a "cold" symbol.
It could also be that folly needs a code change to avoid the optimization. This is a base class constructor call that cannot be made. Perhaps the InlineLikeExecutor needs an explicit constructor definition.

Because this is a test executable I've added a workaround to compile this file with no optimizations which resolves the issue. We can revisit this later - perhaps open a new issue to investigate this further at a later point in time.

@czentgr czentgr force-pushed the cz_add_clang_build branch 2 times, most recently from 3566ec6 to d0a535e Compare August 29, 2024 21:58
@czentgr
Copy link
Collaborator Author

czentgr commented Sep 12, 2024

The current fuzzer failures are a known issue for which an issue is already open:

Reason: Unicode characters are not supported for conversion to integer types

@czentgr
Copy link
Collaborator Author

czentgr commented Sep 19, 2024

@assignUser @majetideepak FYI. I believe this is ready for review. We want this merged so we can also automatically get the C++20 PR run with Clang 15 (I've done this manually but would be nice to confirm in the pipeline).

# been optimized away by Clang15 and may be some kind of bug. Changing the
# optimization level to 0 results in proper creation/linkage and successful
# execution of the test. Review if this is still necessary when upgrading Clang.
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we still need this after the folly update being made?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That I will have to check.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the folly upgrade is likely to be rolled back due to remote functions issues lets keep this for now and re-evaluate later.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@czentgr Does this test fail with the new folly upgrade made?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is still a problem. I re-ran to be sure (I did a fresh install etc).

.github/workflows/linux-build-gcc.yml Outdated Show resolved Hide resolved
Copy link
Collaborator

@assignUser assignUser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting use of workflow_call! I agree that we should test both clang and gcc but want to note that this adds ~ 4 hours of ci time (cold cache). We do have other jobs that use gcc (like the benchmark job) so maybe we run on clang (for sanitizers) only in PRs and both on merge?

.github/workflows/linux-build-base.yml Outdated Show resolved Hide resolved
.github/workflows/linux-build-base.yml Outdated Show resolved Hide resolved
.github/workflows/linux-build-base.yml Outdated Show resolved Hide resolved
@czentgr
Copy link
Collaborator Author

czentgr commented Sep 25, 2024

@assignUser Not sure you can help with this but there is a problem with the pipeline (Build and Push Adapters job) and accessing a container image

Pulling binfmt Docker image
  /usr/bin/docker pull tonistiigi/binfmt:latest
  Error response from daemon: unauthorized: authentication required
Error: The process '/usr/bin/docker' failed with exit code 1

Copy link
Collaborator

@assignUser assignUser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is green, looks good to me I just want to repeat a comment from my last review about the additional CI time this brings (+4hs on cold cache!).

We do have other jobs that use gcc (like the benchmark job) so maybe we run on clang (for sanitizers) only in PRs and both on merge?

@majetideepak
Copy link
Collaborator

CI is green, looks good to me I just want to repeat a comment from my last review about the additional CI time this brings (+4hs on cold cache!).

I agree. Should we run this as part of the nightly runs instead?

@czentgr czentgr force-pushed the cz_add_clang_build branch 2 times, most recently from 88dbb9d to 2a7505d Compare October 4, 2024 22:28
@czentgr
Copy link
Collaborator Author

czentgr commented Oct 4, 2024

I've moved the Clang build job to the scheduled workflow file.

Copy link
Collaborator

@assignUser assignUser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this should be fine for now, once we focus more on clang we can switch things around.

.github/workflows/scheduled.yml Outdated Show resolved Hide resolved
The linux build is refactored to be able to switch between clang and gcc based builds of Velox.
The gcc based linux build is run on pull and push.
The clang based linux build is added to the scheduled jobs and executed on schedule
only.
@majetideepak
Copy link
Collaborator

I have one comment here #10767 (comment)
I will add the merge label once that is addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants