
Mutex issues when two versions of libprotobuf are linked to two Python libraries separately #21686

@Inokinoki

Description

What version of protobuf and what language are you using?
Version: v3.25.1 (in tink 1.9) and v3.21.3 (in pyarrow 20.0.0)
Language: C++ and Python

What operating system (Linux, Windows, ...) and version?

macOS

What runtime / compiler are you using (e.g., python version or gcc version)

Python 3.10.17

What did you do?

  1. Install the latest tink (1.11.0) and pyarrow (20.0.0) in the same Python environment
  2. Start a Python interpreter
  3. Import one of tink or pyarrow
  4. Import the other one

If tink is imported first, importing pyarrow leads to a deadlock because the mutex is invalid.
In the other order, the program crashes immediately.
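
To make the two import orders easy to compare, here is a minimal sketch of a harness that runs each order in a fresh interpreter and classifies the outcome. The helper name try_import_order and the 30-second timeout are my own assumptions for illustration; nothing here is part of tink, pyarrow, or protobuf.

```python
import subprocess
import sys


def try_import_order(first, second, timeout=30.0):
    """Import `first` then `second` in a fresh interpreter and classify the result."""
    code = f"import {first}; import {second}"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child never finished; it is killed and treated as a hang.
        return "hang"
    if proc.returncode == 0:
        return "ok"
    # A negative return code means the child was terminated by a signal.
    return f"failed (exit code {proc.returncode})"
```

Calling try_import_order("tink", "pyarrow") and try_import_order("pyarrow", "tink") covers both orders; a fresh process per call keeps one failure from affecting the other.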

What did you expect to see

Importing both libraries should be safe ;)

What did you see instead?

Crashing:

libc++abi: terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument

or hanging:

[mutex.cc : 453] RAW: Lock blocking 0x156747c38   @

I did a first (and fairly deep) investigation. TL;DR: on macOS, the mutex lock seems to be handed the wrong google::protobuf::internal::ShutdownData::get()::data. The older protobuf uses its own internal mutex implementation (a wrapper around std::mutex), while the newer one uses absl::Mutex.

You should be able to find the stack trace in tink-crypto/tink-py#25 and apache/arrow#40088.

Here is my latest finding:

I set a breakpoint on google::protobuf::internal::OnShutdownRun and then imported pyarrow first. Its disassembly in libarrow is as follows:

* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.2
    frame #0: 0x000000010732a0f4 libarrow.2000.dylib`google::protobuf::internal::OnShutdownRun(void (*)(void const*), void const*)
libarrow.2000.dylib`google::protobuf::internal::OnShutdownRun:
    0x10732a0f4 <+0>:  stp    x26, x25, [sp, #-0x50]!
    0x10732a0f8 <+4>:  stp    x24, x23, [sp, #0x10]
    0x10732a0fc <+8>:  stp    x22, x21, [sp, #0x20]
    0x10732a100 <+12>: stp    x20, x19, [sp, #0x30]
    0x10732a104 <+16>: stp    x29, x30, [sp, #0x40]
    0x10732a108 <+20>: add    x29, sp, #0x40
    0x10732a10c <+24>: mov    x20, x1
    0x10732a110 <+28>: mov    x21, x0
    0x10732a114 <+32>: adrp   x8, 1794
->  0x10732a118 <+36>: ldr    x8, [x8, #0x340]
    0x10732a11c <+40>: ldaprb w8, [x8]
    0x10732a120 <+44>: adrp   x19, 1796
    0x10732a124 <+48>: ldr    x19, [x19, #0x50]
    0x10732a128 <+52>: tbz    w8, #0x0, 0x10732a238 ; <+324>
    0x10732a12c <+56>: ldr    x22, [x19]
    0x10732a130 <+60>: add    x19, x22, #0x18
    0x10732a134 <+64>: mov    x0, x19
    0x10732a138 <+68>: bl     0x107531dd0    ; symbol stub for: std::__1::mutex::lock()
    0x10732a13c <+72>: ldp    x23, x8, [x22, #0x8]
    0x10732a140 <+76>: cmp    x23, x8
    0x10732a144 <+80>: b.hs   0x10732a158    ; <+100>
    0x10732a148 <+84>: stp    x21, x20, [x23]
    0x10732a14c <+88>: add    x8, x23, #0x10 

The marked instruction is part of fetching the singleton data. Reading the registers gives:

x8 = 0x0000000107b2b8c8  guard variable for google::protobuf::internal::ShutdownData::get()::data

Note that 0x10732a138 <+68> is a direct call to std::mutex::lock() in the standard C++ library. That is fine here, because this library's singleton data contains a std::mutex member.

I let it continue running and then imported tink (which uses a newer protobuf version with absl::Mutex).

* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.3
    frame #0: 0x0000000103f5b104 tink_bindings.cpython-310-darwin.so`google::protobuf::internal::OnShutdownRun(void (*)(void const*), void const*)
tink_bindings.cpython-310-darwin.so`google::protobuf::internal::OnShutdownRun:
    0x103f5b104 <+0>:  stp    x28, x27, [sp, #-0x60]!
    0x103f5b108 <+4>:  stp    x26, x25, [sp, #0x10]
    0x103f5b10c <+8>:  stp    x24, x23, [sp, #0x20]
    0x103f5b110 <+12>: stp    x22, x21, [sp, #0x30]
    0x103f5b114 <+16>: stp    x20, x19, [sp, #0x40]
    0x103f5b118 <+20>: stp    x29, x30, [sp, #0x50]
    0x103f5b11c <+24>: add    x29, sp, #0x50
    0x103f5b120 <+28>: mov    x20, x1
    0x103f5b124 <+32>: mov    x21, x0
    0x103f5b128 <+36>: adrp   x8, 501
->  0x103f5b12c <+40>: ldr    x8, [x8, #0xe8]
    0x103f5b130 <+44>: ldaprb w8, [x8]
    0x103f5b134 <+48>: adrp   x19, 502
    0x103f5b138 <+52>: ldr    x19, [x19, #0xaa0]
    0x103f5b13c <+56>: tbz    w8, #0x0, 0x103f5b218 ; <+276>
    0x103f5b140 <+60>: ldr    x22, [x19]
    0x103f5b144 <+64>: add    x19, x22, #0x18
    0x103f5b148 <+68>: mov    x0, x19
->  0x103f5b14c <+72>: bl     0x104026a88    ; absl::lts_20240722::Mutex::Lock()
    0x103f5b150 <+76>: ldp    x9, x8, [x22, #0x8]
    0x103f5b154 <+80>: cmp    x9, x8
    0x103f5b158 <+84>: b.hs   0x103f5b16c    ; <+104>

Reading the register at the first arrow gives the same address:

x8 = 0x0000000107b2b8c8  guard variable for google::protobuf::internal::ShutdownData::get()::data
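
As a sanity check on the two listings, the adrp/ldr pairs can be resolved by hand. The sketch below assumes that lldb prints the adrp operand as a count of 4 KiB pages relative to the page containing the instruction; the helper name adrp_ldr_slot is hypothetical. All addresses and immediates are copied from the output above.

```python
PAGE_SIZE = 0x1000  # ADRP addresses memory in 4 KiB pages


def adrp_ldr_slot(adrp_pc, page_imm, ldr_offset):
    """Address read by an `adrp xN, imm` followed by `ldr xN, [xN, #off]`."""
    page_base = adrp_pc & ~(PAGE_SIZE - 1)
    return page_base + page_imm * PAGE_SIZE + ldr_offset


# libarrow: adrp x8, 1794 at <+32>, then ldr x8, [x8, #0x340]
arrow_slot = adrp_ldr_slot(0x10732A114, 1794, 0x340)
# tink_bindings: adrp x8, 501 at <+36>, then ldr x8, [x8, #0xe8]
tink_slot = adrp_ldr_slot(0x103F5B128, 501, 0xE8)
# Value reported in x8 for both libraries
guard = 0x107B2B8C8

print(hex(arrow_slot))  # 0x107a2c340
print(hex(tink_slot))   # 0x1041500e8
print(abs(guard - arrow_slot) < abs(guard - tink_slot))  # True
```

Under that assumption, each library reads the guard's address from its own pointer slot, yet both slots hold the same value, and that value sits much closer to libarrow's code than to tink's. This suggests, without proving it, that tink's slot was bound to the copy living in libarrow.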

The data was already created while importing pyarrow, and it has a std::mutex member.

But tink then calls absl::Mutex::Lock() on it (at the second arrow), which expects an absl::Mutex, and this can crash the program.


The open question is why the two libraries share the same address even though they are loaded separately (which should be RTLD_LOCAL by default).
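
One way to double-check the RTLD_LOCAL assumption is to ask the interpreter which flags it passes to dlopen() for extension modules. This is a small sketch using sys.getdlopenflags() and the os.RTLD_* constants (both Unix-only); the helper name describe_dlopen_flags is made up for illustration. It only covers how Python opens the extension modules themselves, not how the dynamic loader binds symbols in the libraries they depend on.

```python
import os
import sys


def describe_dlopen_flags():
    """Report the flags CPython passes to dlopen() for extension modules."""
    flags = sys.getdlopenflags()
    return {
        "raw": flags,
        # RTLD_GLOBAL set would mean symbols are shared with later loads.
        "global": bool(flags & os.RTLD_GLOBAL),
        "now": bool(flags & os.RTLD_NOW),
    }


print(describe_dlopen_flags())
```

If "global" comes back False, the sharing is not caused by the interpreter's dlopen flags, and the cause has to be somewhere else (for example in how the libraries were built or linked).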

Maybe something is wrong in the build configuration that prevents them from getting separate data segments.

Anything else we should know about your project / environment

My previous investigations:
