Closed
Description
Proposed solution
Right now access to thread locals defined by thread_local!
aren't inlined across crates, causing performance problems that wouldn't otherwise be seen within one crate. This can probably be solved with a few new minor language features:
- First, the
#[inline]
annotation could be processed onstatic
variables. If the variable does not have any internal mutability, then the definition can be inlined into other LLVM modules and tagged withavailable_externally
. That means that the contents are available for optimization, but if you're taking the address it's available elsewhere. - Second, when we inline the contents of a static across modules, function pointers should also be chased and inlined if they're tagged with
#[inline]
.
Those two pieces I believe should provide enough inlining opportunities to ensure that accesses are as fast when done from external crates as they are done with internal crates.
Original description
This hurts performance for the locks in std::sync
, as they call std::rt::unwind::panicking()
(which just reads a thread-local). For uncontended locks the cost is quite significant.
There are two problems:
std::rt::unwind::panicking()
isn't marked inline. This is trivial to solve.- Accessing a
thread_local!
goes through function pointers, which LLVM fails to see through. These are the__getit
functions inlibstd/thread/local.rs
. Consider these two files:
// tls.rs
use std::cell::Cell;
thread_local! { static FOO: Cell<bool> = Cell::new(false) }
#[inline]
pub fn get_foo() -> bool {
FOO.with(|s| s.get())
}
// other.rs
extern crate tls;
#[inline(never)]
fn call_foo() -> bool {
tls::get_foo()
}
call_foo
gets the following IR with everything compiled with full optimization. Note the call through a function pointer:
define internal fastcc zeroext i1 @_ZN8call_foo20h55fb2c2ba5981f51PaaE() unnamed_addr #3 {
entry-block:
%0 = load %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* ()** getelementptr inbounds (%"1.std::thread::local::LocalKey<core::cell::Cell<bool>>"* @_ZN3FOO20hefc7cdadc988b7defaaE, i64 0, i32 0), align 8
%1 = tail call dereferenceable(4) %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %0()
%2 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 0
%3 = load i8* %2, align 1, !range !13
%cond.i.i = icmp eq i8 %3, 1
br i1 %cond.i.i, label %match_case.i.i, label %match_else.i.i
match_else.i.i: ; preds = %entry-block
%4 = load i8 ()** getelementptr inbounds (%"1.std::thread::local::LocalKey<core::cell::Cell<bool>>"* @_ZN3FOO20hefc7cdadc988b7defaaE, i64 0, i32 1), align 8
%5 = tail call i8 %4()
%6 = zext i8 %5 to i16
%7 = shl nuw i16 %6, 8
%8 = or i16 %7, 1
%9 = bitcast %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1 to i16*
store i16 %8, i16* %9, align 1
%10 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 2, i64 0
%11 = bitcast i8* %10 to %"2.core::cell::Cell<bool>"*
br label %_ZN7get_foo20h73631102cae2c5b4lbaE.exit
match_case.i.i: ; preds = %entry-block
%12 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 2
%13 = bitcast [1 x i8]* %12 to %"2.core::cell::Cell<bool>"*
br label %_ZN7get_foo20h73631102cae2c5b4lbaE.exit
_ZN7get_foo20h73631102cae2c5b4lbaE.exit: ; preds = %match_else.i.i, %match_case.i.i
%.0.i.i = phi %"2.core::cell::Cell<bool>"* [ %13, %match_case.i.i ], [ %11, %match_else.i.i ]
%.0.idx.i.i = getelementptr %"2.core::cell::Cell<bool>"* %.0.i.i, i64 0, i32 0, i32 0
%.0.idx.val.i.i = load i8* %.0.idx.i.i, align 1
%14 = icmp ne i8 %.0.idx.val.i.i, 0
ret i1 %14
}