Description
What version of protobuf and what language are you using?
Version: v3.25.1 (in tink 1.9) and v3.21.3 (in pyarrow 20.0.0)
Language: C++ and Python
What operating system (Linux, Windows, ...) and version?
macOS
What runtime / compiler are you using (e.g., python version or gcc version)
Python 3.10.17
What did you do?
- Install latest tink (1.11.0) and pyarrow (20.0.0) in the Python version
- Run python terminal
- Import tink/pyarrow first
- Import pyarrow/tink
If tink is imported first, importing pyarrow leads to a deadlock because the mutex is invalid.
In the other order, the program crashes immediately.
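The steps above can be sketched as a small script that tries each import order in a fresh interpreter, so a hang or abort in one order does not take down the other. This is only a sketch: `try_import_order` is a hypothetical helper name, the 30-second timeout is an arbitrary choice, and a non-zero exit also covers the case where a package is simply not installed.

```python
import subprocess
import sys


def try_import_order(modules, timeout=30.0):
    """Import `modules` in order in a fresh interpreter and classify the outcome.

    Returns "ok", "hang" (no exit within `timeout` seconds), or
    "exit <code>" (a negative code means the child was killed by a signal).
    """
    code = "; ".join(f"import {name}" for name in modules)
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return "hang"
    if proc.returncode == 0:
        return "ok"
    return f"exit {proc.returncode}"
```

Calling `try_import_order(["tink", "pyarrow"])` and `try_import_order(["pyarrow", "tink"])` in an environment with both packages installed should show whether each order hangs or aborts as described.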
What did you expect to see
Importing both libraries should be safe ;)
What did you see instead?
Crashing:
libc++abi: terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
or hanging:
[mutex.cc : 453] RAW: Lock blocking 0x156747c38 @
I did a first (and fairly deep) investigation. TL;DR: on macOS, it seems the code picks up the wrong google::protobuf::internal::ShutdownData::get()::data when locking the mutex (the mutex used to be protobuf's internal implementation, a wrapper around std::mutex, and is now absl::Mutex).
You should be able to find the stack trace in tink-crypto/tink-py#25 and apache/arrow#40088.
Here is my latest finding:
I set a breakpoint on google::protobuf::internal::OnShutdownRun and then imported pyarrow first. Its disassembly in libarrow is as follows:
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.2
frame #0: 0x000000010732a0f4 libarrow.2000.dylib`google::protobuf::internal::OnShutdownRun(void (*)(void const*), void const*)
libarrow.2000.dylib`google::protobuf::internal::OnShutdownRun:
0x10732a0f4 <+0>: stp x26, x25, [sp, #-0x50]!
0x10732a0f8 <+4>: stp x24, x23, [sp, #0x10]
0x10732a0fc <+8>: stp x22, x21, [sp, #0x20]
0x10732a100 <+12>: stp x20, x19, [sp, #0x30]
0x10732a104 <+16>: stp x29, x30, [sp, #0x40]
0x10732a108 <+20>: add x29, sp, #0x40
0x10732a10c <+24>: mov x20, x1
0x10732a110 <+28>: mov x21, x0
0x10732a114 <+32>: adrp x8, 1794
-> 0x10732a118 <+36>: ldr x8, [x8, #0x340]
0x10732a11c <+40>: ldaprb w8, [x8]
0x10732a120 <+44>: adrp x19, 1796
0x10732a124 <+48>: ldr x19, [x19, #0x50]
0x10732a128 <+52>: tbz w8, #0x0, 0x10732a238 ; <+324>
0x10732a12c <+56>: ldr x22, [x19]
0x10732a130 <+60>: add x19, x22, #0x18
0x10732a134 <+64>: mov x0, x19
0x10732a138 <+68>: bl 0x107531dd0 ; symbol stub for: std::__1::mutex::lock()
0x10732a13c <+72>: ldp x23, x8, [x22, #0x8]
0x10732a140 <+76>: cmp x23, x8
0x10732a144 <+80>: b.hs 0x10732a158 ; <+100>
0x10732a148 <+84>: stp x21, x20, [x23]
0x10732a14c <+88>: add x8, x23, #0x10
where the marked instruction is where the singleton data is fetched. Reading the registers at that point gives:
x8 = 0x0000000107b2b8c8 guard variable for google::protobuf::internal::ShutdownData::get()::data
Note that 0x10732a138 <+68> is a direct call to the C++ standard library's std::mutex::lock(). That is fine here because the singleton data also contains a std::mutex member.
I let it continue running and then imported tink (a newer version that uses absl::Mutex).
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.3
frame #0: 0x0000000103f5b104 tink_bindings.cpython-310-darwin.so`google::protobuf::internal::OnShutdownRun(void (*)(void const*), void const*)
tink_bindings.cpython-310-darwin.so`google::protobuf::internal::OnShutdownRun:
0x103f5b104 <+0>: stp x28, x27, [sp, #-0x60]!
0x103f5b108 <+4>: stp x26, x25, [sp, #0x10]
0x103f5b10c <+8>: stp x24, x23, [sp, #0x20]
0x103f5b110 <+12>: stp x22, x21, [sp, #0x30]
0x103f5b114 <+16>: stp x20, x19, [sp, #0x40]
0x103f5b118 <+20>: stp x29, x30, [sp, #0x50]
0x103f5b11c <+24>: add x29, sp, #0x50
0x103f5b120 <+28>: mov x20, x1
0x103f5b124 <+32>: mov x21, x0
0x103f5b128 <+36>: adrp x8, 501
-> 0x103f5b12c <+40>: ldr x8, [x8, #0xe8]
0x103f5b130 <+44>: ldaprb w8, [x8]
0x103f5b134 <+48>: adrp x19, 502
0x103f5b138 <+52>: ldr x19, [x19, #0xaa0]
0x103f5b13c <+56>: tbz w8, #0x0, 0x103f5b218 ; <+276>
0x103f5b140 <+60>: ldr x22, [x19]
0x103f5b144 <+64>: add x19, x22, #0x18
0x103f5b148 <+68>: mov x0, x19
-> 0x103f5b14c <+72>: bl 0x104026a88 ; absl::lts_20240722::Mutex::Lock()
0x103f5b150 <+76>: ldp x9, x8, [x22, #0x8]
0x103f5b154 <+80>: cmp x9, x8
0x103f5b158 <+84>: b.hs 0x103f5b16c ; <+104>
Reading the register at the first arrow gives the same address:
x8 = 0x0000000107b2b8c8 guard variable for google::protobuf::internal::ShutdownData::get()::data
The data was already created while importing pyarrow, and it has a std::mutex member.
But the code then calls absl::Mutex::Lock() (at the second arrow), which expects an absl::Mutex, and this can crash the program.
The open question is why the two share the same address even though they come from two separately loaded libraries (which should be loaded with RTLD_LOCAL by default).
Maybe something is wrong in the build configuration that prevents each library from getting its own data segment.
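One way to check the RTLD_LOCAL assumption from the Python side is to print the flags the interpreter passes to dlopen() when it loads extension modules. The sketch below uses only `sys.getdlopenflags()` and the `os.RTLD_*` constants (Unix only); `dlopen_flag_names` is a hypothetical helper name. It only reports the interpreter's own setting. It says nothing about how libarrow is loaded as a dependent dylib, so it cannot explain the shared address on its own.

```python
import os
import sys


def dlopen_flag_names(flags):
    """Return the names of the os.RTLD_* constants whose bits are set in `flags`."""
    names = []
    for name in ("RTLD_LAZY", "RTLD_NOW", "RTLD_GLOBAL", "RTLD_LOCAL",
                 "RTLD_NODELETE", "RTLD_NOLOAD", "RTLD_DEEPBIND"):
        # Not every constant exists on every platform, and a constant whose
        # value is 0 (RTLD_LOCAL on Linux) cannot be detected by bit tests.
        value = getattr(os, name, None)
        if value and flags & value == value:
            names.append(name)
    return names


current = sys.getdlopenflags()
print(hex(current), dlopen_flag_names(current))
```

If RTLD_GLOBAL shows up in the output before either package is imported, something in the environment has changed the default with `sys.setdlopenflags()`, which would be worth ruling out first.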
Anything else we should know about your project / environment
My previous investigations: