Illegal instructions emitted when compiling certain C libraries using 'zig cc'. #7636
Description
I've been trying to debug this issue for the last few days. The behavior is consistent when compiling and running test programs that use certain C libraries such as LMDB, libmdbx, or sqlite3.
To make reproduction simpler, I'll focus on LMDB since it has the least code (only one include path and two C source files to compile).
These are the versions of clang and gcc I tested with:
$ clang --version
clang version 7.1.0 (tags/RELEASE_710/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /nix/store/3m76ry913ky4zb2frdbic3wa7gr69084-clang-7.1.0/bin
$ gcc --version
gcc (GCC) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The test program (test.c) is as follows:
#include <stdio.h>
#include <assert.h>
#include "lmdb.h"

int main(int argc, char *argv[]) {
    MDB_env *env;
    MDB_dbi dbi;
    MDB_val key, data;
    MDB_txn *txn;
    MDB_cursor *cursor;
    char sval[32];

    assert(mdb_env_create(&env) == MDB_SUCCESS);
    assert(mdb_env_set_maxdbs(env, 2) == MDB_SUCCESS);
    assert(mdb_env_open(env, "./testdb", MDB_NOSUBDIR | MDB_WRITEMAP, 0664) == MDB_SUCCESS);
    assert(mdb_txn_begin(env, NULL, 0, &txn) == MDB_SUCCESS);
    assert(mdb_dbi_open(txn, "test", MDB_CREATE | MDB_DUPSORT | MDB_DUPFIXED, &dbi) == MDB_SUCCESS);

    for (uint64_t i = 0; i < 4096; i++) {
        key.mv_data = "index";
        key.mv_size = sizeof(key.mv_data) - 1;
        data.mv_data = (void *)(&i);
        data.mv_size = 8;
        assert(mdb_put(txn, dbi, &key, &data, 0) == MDB_SUCCESS);
    }

    mdb_close(env, dbi);
    mdb_env_close(env);
    return 0;
}
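A side note on the test program: key.mv_data is a pointer, so sizeof(key.mv_data) - 1 is the pointer size minus one, not the length of "index". The sketch below isolates that difference. The struct val type is a hypothetical stand-in for MDB_val so that it builds without lmdb.h, and both function names are made up for illustration. I don't know whether this has anything to do with the crash.

```c
#include <stddef.h>

/* Hypothetical stand-in for LMDB's MDB_val, so the sketch builds without lmdb.h. */
struct val {
    size_t size;
    void  *data;
};

/* What the test program computes: the size of the pointer member, minus one. */
size_t size_from_pointer(void) {
    struct val key;
    key.data = "index";
    return sizeof(key.data) - 1;   /* sizeof(void *) - 1 */
}

/* The length of the string literal itself, without the terminating NUL. */
size_t size_from_literal(void) {
    return sizeof("index") - 1;    /* 6 bytes including NUL, minus 1 */
}
```

On a 64-bit target the first function returns 7 and the second returns 5, so the key handed to mdb_put would cover more bytes than the literal contains.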
These are the commands I used for compiling the test program:
$ clang test.c libraries/liblmdb/mdb.c libraries/liblmdb/midl.c -pthread -I libraries/liblmdb -o test
$ gcc test.c libraries/liblmdb/mdb.c libraries/liblmdb/midl.c -pthread -I libraries/liblmdb -o test
When compiled with either clang or gcc, the program runs to completion and exits successfully.
Now, if I use zig cc instead:
$ zig cc test.c libraries/liblmdb/mdb.c libraries/liblmdb/midl.c -pthread -I libraries/liblmdb -o test
$ ./test
Illegal instruction (core dumped)
Weird. Let's open it up in gdb:
Program received signal SIGILL, Illegal instruction.
0x00000000002136d3 in mdb_xcursor_init1 ()
(gdb) bt
#0 0x00000000002136d3 in mdb_xcursor_init1 ()
#1 0x0000000000209578 in mdb_cursor_put ()
#2 0x0000000000215e56 in mdb_put ()
#3 0x0000000000205443 in main ()
The backtrace isn't much help. I did some printf debugging: no assertions were hit, and I found no dangling or null pointers in mdb_xcursor_init1. So, let's look at what's going on at the assembly level.
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 0x2136c5 <mdb_xcursor_init1+725> lea -0xdfbc(%rip),%rax # 0x205710 <mdb_cmp_long> │
│ 0x2136cc <mdb_xcursor_init1+732> mov %rax,(%rcx) │
│ 0x2136cf <mdb_xcursor_init1+735> vzeroupper │
│ 0x2136d2 <mdb_xcursor_init1+738> ret │
│ >0x2136d3 <mdb_xcursor_init1+739> ud2 │
│ 0x2136d5 data16 nopw %cs:0x0(%rax,%rax,1) │
│ 0x2136e0 <mdb_xcursor_init2> test $0x7,%dil │
│ 0x2136e4 <mdb_xcursor_init2+4> jne 0x213830 <mdb_xcursor_init2+336> │
│ 0x2136ea <mdb_xcursor_init2+10> test %rdi,%rdi │
│ 0x2136ed <mdb_xcursor_init2+13> je 0x213830 <mdb_xcursor_init2+336> │
│ 0x2136f3 <mdb_xcursor_init2+19> add $0x10,%rdi │
│ 0x2136f7 <mdb_xcursor_init2+23> test $0x7,%dil │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7ffff7c8c7 In: mdb_xcursor_init1 L?? PC: 0x2136d3
(gdb) layout asm
... and for some reason, there is a ud2 (undefined instruction) right after ret.
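For context on why hitting ud2 produces this particular crash: GCC and Clang lower __builtin_trap() to ud2 on x86-64, and executing it raises SIGILL. Here is a minimal sketch that forks a child, lets it execute the trap, and reports the signal that killed it. The function name signal_from_trap is made up for illustration, and whether the ud2 in my binary actually comes from a trap like this is an assumption on my part.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that executes a compiler trap and report which signal
 * killed it. Returns the signal number, or -1 if fork/waitpid failed
 * or the child was not terminated by a signal.
 */
int signal_from_trap(void) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        __builtin_trap();   /* GCC/Clang builtin; lowered to ud2 on x86-64 */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

On x86-64 Linux I would expect this to return SIGILL, matching the crash above. Other architectures may lower the trap differently and report another signal.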
Let's see what the assembly looks like around the same code region in the binary emitted by gcc:
(gdb) b mdb_xcursor_init1
Breakpoint 1 at 0x40de0b
(gdb) r
Starting program: /home/lith/Desktop/lmdb-zig/test
warning: File "/nix/store/bpgdx6qqqzzi3szb0y3di3j3660f3wkj-glibc-2.31/lib/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/nix/store/isy60my0ijjzh49rscgdb1i2457nf7lp-gcc-9.3.0-lib".
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Breakpoint 1, 0x000000000040de0b in mdb_xcursor_init1 ()
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 0x40e012 <mdb_xcursor_init1+523> jne 0x40e023 <mdb_xcursor_init1+540> │
│ 0x40e014 <mdb_xcursor_init1+525> mov -0x8(%rbp),%rax │
│ 0x40e018 <mdb_xcursor_init1+529> movq $0x407511,0x1c8(%rax) │
│ 0x40e023 <mdb_xcursor_init1+540> nop │
│ 0x40e024 <mdb_xcursor_init1+541> leave │
│ 0x40e025 <mdb_xcursor_init1+542> ret │
│ 0x40e026 <mdb_xcursor_init2> push %rbp │
│ 0x40e027 <mdb_xcursor_init2+1> mov %rsp,%rbp │
│ 0x40e02a <mdb_xcursor_init2+4> push %rbx │
│ 0x40e02b <mdb_xcursor_init2+5> mov %rdi,-0x20(%rbp) │
│ 0x40e02f <mdb_xcursor_init2+9> mov %rsi,-0x28(%rbp) │
│ 0x40e033 <mdb_xcursor_init2+13> mov %edx,-0x2c(%rbp) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
native process 28572 In: mdb_xcursor_init1 L?? PC: 0x40de0b
(gdb) layout asm
... and there is no ud2 instruction after ret in the binary compiled with gcc! The same goes for clang.
So in conclusion, for some reason this ud2 instruction keeps being emitted right after the ret instructions of static functions when C code is compiled with Zig. I ran into the same issue with test code I wrote for libmdbx.
The same illegal-instruction issue also came up for sqlite3, which @nektro brought to my attention.
Could this be due to additional assertion checks that the Zig compiler emits when analyzing static functions? Or could it simply be the result of a C compiler flag that should have been set or cleared?
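One thing that makes me suspect inserted checks: in the zig cc disassembly above, mdb_xcursor_init2 starts with test $0x7,%dil / jne and test %rdi,%rdi / je, and neither appears in the gcc build. Those look like an 8-byte alignment check and a null check on the first argument. Here is a hand-written sketch of what such a guarded load might look like in C. The function names are hypothetical, and I haven't confirmed that this is what the compiler is actually doing.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is non-null and a multiple of `align` bytes, else 0. */
int pointer_passes_checks(const void *p, size_t align) {
    return p != NULL && ((uintptr_t)p % align) == 0;
}

/*
 * Hand-written analogue of a checked 8-byte load: trap instead of
 * reading through a null or misaligned pointer.
 */
uint64_t load_u64_checked(const uint64_t *p) {
    if (!pointer_passes_checks(p, _Alignof(uint64_t))) {
        __builtin_trap();   /* cold path; on x86-64 this becomes ud2 */
    }
    return *p;
}
```

If the generated code follows this shape, the failing branch has to jump somewhere, and a ud2 placed after the function's ret would be a plausible landing spot. That would also fit gdb reporting the PC on the ud2 itself.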
Would appreciate any assistance on this 🙏.