Illegal instructions emitted when compiling certain C libraries using 'zig cc'. #7636

Closed
@lithdew

Description

I've been trying to debug this issue for the last few days. The behavior is consistent when compiling and running test programs that use certain C libraries, such as LMDB, libmdbx, or sqlite3.

To make reproduction simpler, I'll focus on LMDB, as it has the least code (only one include path and two C source files to compile against).

These are the versions of clang and gcc I tested with:

$ clang --version
clang version 7.1.0 (tags/RELEASE_710/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /nix/store/3m76ry913ky4zb2frdbic3wa7gr69084-clang-7.1.0/bin

$ gcc --version
gcc (GCC) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The test program (test.c) is as follows:

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include "lmdb.h"

int main(int argc, char *argv[]) {
  MDB_env *env;
  MDB_dbi dbi;
  MDB_val key, data;
  MDB_txn *txn;
  MDB_cursor *cursor;
  char sval[32];

  /* Note: the LMDB calls sit inside assert(), so they are skipped if NDEBUG is defined. */
  assert(mdb_env_create(&env) == MDB_SUCCESS);
  assert(mdb_env_set_maxdbs(env, 2) == MDB_SUCCESS);
  assert(mdb_env_open(env, "./testdb", MDB_NOSUBDIR | MDB_WRITEMAP, 0664) == MDB_SUCCESS);
  assert(mdb_txn_begin(env, NULL, 0, &txn) == MDB_SUCCESS);
  assert(mdb_dbi_open(txn, "test", MDB_CREATE | MDB_DUPSORT | MDB_DUPFIXED, &dbi) == MDB_SUCCESS);

  for (uint64_t i = 0; i < 4096; i++) {
    key.mv_data = "index";
    /* mv_data is a void *, so this is the pointer size minus one (7 on x86_64), not strlen("index"). */
    key.mv_size = sizeof(key.mv_data) - 1;

    data.mv_data = (void *)(&i);
    data.mv_size = 8;

    assert(mdb_put(txn, dbi, &key, &data, 0) == MDB_SUCCESS);
  }

  mdb_close(env, dbi);
  mdb_env_close(env);
  return 0;
}

These are the commands I used for compiling the test program:

$ clang test.c libraries/liblmdb/mdb.c libraries/liblmdb/midl.c -pthread -I libraries/liblmdb -o test
$ gcc test.c libraries/liblmdb/mdb.c libraries/liblmdb/midl.c -pthread -I libraries/liblmdb -o test

When compiled with either clang or gcc, the program runs to completion and exits successfully.

Now, if I were to use zig cc:

$ zig cc test.c libraries/liblmdb/mdb.c libraries/liblmdb/midl.c -pthread -I libraries/liblmdb -o test

$ ./test
Illegal instruction (core dumped)

Weird. Let's open it up in gdb:

Program received signal SIGILL, Illegal instruction.
0x00000000002136d3 in mdb_xcursor_init1 ()
(gdb) bt
#0  0x00000000002136d3 in mdb_xcursor_init1 ()
#1  0x0000000000209578 in mdb_cursor_put ()
#2  0x0000000000215e56 in mdb_put ()
#3  0x0000000000205443 in main ()

No useful backtrace. I did some printf debugging: no assertions were hit, and I found no dangling or null pointers in mdb_xcursor_init1. So, let's look at what's going on at the assembly level.

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│   0x2136c5 <mdb_xcursor_init1+725>        lea    -0xdfbc(%rip),%rax        # 0x205710 <mdb_cmp_long>                               │
│   0x2136cc <mdb_xcursor_init1+732>        mov    %rax,(%rcx)                                                                       │
│   0x2136cf <mdb_xcursor_init1+735>        vzeroupper                                                                               │
│   0x2136d2 <mdb_xcursor_init1+738>        ret                                                                                      │
│  >0x2136d3 <mdb_xcursor_init1+739>        ud2                                                                                      │
│   0x2136d5                                    data16 nopw %cs:0x0(%rax,%rax,1)                                                     │
│   0x2136e0 <mdb_xcursor_init2>            test   $0x7,%dil                                                                         │
│   0x2136e4 <mdb_xcursor_init2+4>          jne    0x213830 <mdb_xcursor_init2+336>                                                  │
│   0x2136ea <mdb_xcursor_init2+10>         test   %rdi,%rdi                                                                         │
│   0x2136ed <mdb_xcursor_init2+13>         je     0x213830 <mdb_xcursor_init2+336>                                                  │
│   0x2136f3 <mdb_xcursor_init2+19>         add    $0x10,%rdi                                                                        │
│   0x2136f7 <mdb_xcursor_init2+23>         test   $0x7,%dil                                                                         │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7ffff7c8c7 In: mdb_xcursor_init1                                                               L??   PC: 0x2136d3 
(gdb) layout asm

... and for some reason, there is a ud2 (undefined) instruction right after ret.

Let's see what the assembly looks like around the same code region in the binary emitted by gcc:

(gdb) b mdb_xcursor_init1
Breakpoint 1 at 0x40de0b
(gdb) r
Starting program: /home/lith/Desktop/lmdb-zig/test 
warning: File "/nix/store/bpgdx6qqqzzi3szb0y3di3j3660f3wkj-glibc-2.31/lib/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/nix/store/isy60my0ijjzh49rscgdb1i2457nf7lp-gcc-9.3.0-lib".
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

Breakpoint 1, 0x000000000040de0b in mdb_xcursor_init1 ()
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│   0x40e012 <mdb_xcursor_init1+523>        jne    0x40e023 <mdb_xcursor_init1+540>                                                  │
│   0x40e014 <mdb_xcursor_init1+525>        mov    -0x8(%rbp),%rax                                                                   │
│   0x40e018 <mdb_xcursor_init1+529>        movq   $0x407511,0x1c8(%rax)                                                             │
│   0x40e023 <mdb_xcursor_init1+540>        nop                                                                                      │
│   0x40e024 <mdb_xcursor_init1+541>        leave                                                                                    │
│   0x40e025 <mdb_xcursor_init1+542>        ret                                                                                      │
│   0x40e026 <mdb_xcursor_init2>            push   %rbp                                                                              │
│   0x40e027 <mdb_xcursor_init2+1>          mov    %rsp,%rbp                                                                         │
│   0x40e02a <mdb_xcursor_init2+4>          push   %rbx                                                                              │
│   0x40e02b <mdb_xcursor_init2+5>          mov    %rdi,-0x20(%rbp)                                                                  │
│   0x40e02f <mdb_xcursor_init2+9>          mov    %rsi,-0x28(%rbp)                                                                  │
│   0x40e033 <mdb_xcursor_init2+13>         mov    %edx,-0x2c(%rbp)                                                                  │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
native process 28572 In: mdb_xcursor_init1                                                                         L??   PC: 0x40de0b 
(gdb) layout asm

... and there is no ud2 instruction in the binary compiled with gcc after ret! Same goes for clang as well.

So in conclusion, when C code is compiled with zig cc, a ud2 instruction is for some reason emitted right after the ret instruction of static functions. I ran into the same issue with test code I wrote for libmdbx.

The same illegal-instruction issue also came up with sqlite3, which @nektro brought to my attention.

Might this be due to additional assertion checks that the Zig compiler emits when analyzing static functions? Or might it just be the result of a C compiler flag that should have been set or cleared?

Would appreciate any assistance on this 🙏.
