Skip to content

Access to thread locals isn't inlined across crates #25088

Closed
@veddan

Description

@veddan

Proposed solution

Right now access to thread locals defined by thread_local! aren't inlined across crates, causing performance problems that wouldn't otherwise be seen within one crate. This can probably be solved with a few new minor language features:

  • First, the #[inline] annotation could be processed on static variables. If the variable does not have any internal mutability, then the definition can be inlined into other LLVM modules and tagged with available_externally. That means that the contents are available for optimization, but if you're taking the address it's available elsewhere.
  • Second, when we inline the contents of a static across modules, function pointers should also be chased and inlined if they're tagged with #[inline].

Those two pieces I believe should provide enough inlining opportunities to ensure that accesses are as fast when done from external crates as they are done with internal crates.

Original description

This hurts performance for the locks in std::sync, as they call std::rt::unwind::panicking() (which just reads a thread-local). For uncontended locks the cost is quite significant.

There are two problems:

  1. std::rt::unwind::panicking() isn't marked inline. This is trivial to solve.
  2. Accessing a thread_local! goes through function pointers, which LLVM fails to see through. These are the __getit functions in libstd/thread/local.rs. Consider these two files:
// tls.rs
use std::cell::Cell;
thread_local! { static FOO: Cell<bool> = Cell::new(false) }
#[inline]
pub fn get_foo() -> bool {
    FOO.with(|s| s.get())
}
// other.rs
extern crate tls;
#[inline(never)]
fn call_foo() -> bool {
    tls::get_foo()
}

call_foo gets the following IR with everything compiled with full optimization. Note the call through a function pointer:

define internal fastcc zeroext i1 @_ZN8call_foo20h55fb2c2ba5981f51PaaE() unnamed_addr #3 {
entry-block:
  %0 = load %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* ()** getelementptr inbounds (%"1.std::thread::local::LocalKey<core::cell::Cell<bool>>"* @_ZN3FOO20hefc7cdadc988b7defaaE, i64 0, i32 0), align 8
  %1 = tail call dereferenceable(4) %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %0()
  %2 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 0
  %3 = load i8* %2, align 1, !range !13
  %cond.i.i = icmp eq i8 %3, 1
  br i1 %cond.i.i, label %match_case.i.i, label %match_else.i.i

match_else.i.i:                                   ; preds = %entry-block
  %4 = load i8 ()** getelementptr inbounds (%"1.std::thread::local::LocalKey<core::cell::Cell<bool>>"* @_ZN3FOO20hefc7cdadc988b7defaaE, i64 0, i32 1), align 8
  %5 = tail call i8 %4()
  %6 = zext i8 %5 to i16
  %7 = shl nuw i16 %6, 8
  %8 = or i16 %7, 1
  %9 = bitcast %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1 to i16*
  store i16 %8, i16* %9, align 1
  %10 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 2, i64 0
  %11 = bitcast i8* %10 to %"2.core::cell::Cell<bool>"*
  br label %_ZN7get_foo20h73631102cae2c5b4lbaE.exit

match_case.i.i:                                   ; preds = %entry-block
  %12 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 2
  %13 = bitcast [1 x i8]* %12 to %"2.core::cell::Cell<bool>"*
  br label %_ZN7get_foo20h73631102cae2c5b4lbaE.exit

_ZN7get_foo20h73631102cae2c5b4lbaE.exit:          ; preds = %match_else.i.i, %match_case.i.i
  %.0.i.i = phi %"2.core::cell::Cell<bool>"* [ %13, %match_case.i.i ], [ %11, %match_else.i.i ]
  %.0.idx.i.i = getelementptr %"2.core::cell::Cell<bool>"* %.0.i.i, i64 0, i32 0, i32 0
  %.0.idx.val.i.i = load i8* %.0.idx.i.i, align 1
  %14 = icmp ne i8 %.0.idx.val.i.i, 0
  ret i1 %14
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-codegenArea: Code generationA-thread-localsArea: Thread local storage (TLS)C-enhancementCategory: An issue proposing an enhancement or a PR with one.I-slowIssue: Problems and improvements with respect to performance of generated code.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions