Use load+store instead of memcpy for small integer arrays #111999
Changes from all commits
```diff
@@ -380,7 +380,19 @@ pub fn memcpy_ty<'a, 'tcx, Bx: BuilderMethods<'a, 'tcx>>(
         return;
     }
 
-    bx.memcpy(dst, dst_align, src, src_align, bx.cx().const_usize(size), flags);
+    if flags == MemFlags::empty()
+        && let Some(bty) = bx.cx().scalar_copy_backend_type(layout)
+    {
+        // I look forward to only supporting opaque pointers
```

Review comment on the `// I look forward to only supporting opaque pointers` line:

Maybe a FIXME so someone may remove this when LLVM only has opaque ptrs? (Well, I guess removing this logic would also preclude other backends with typed ptrs, too. In that case, maybe no comment at all.)

Reply:

Yeah, it's a bigger question because GCC would need to move, so I'll leave it tracked by conversations like https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/llvm.20bitcasts.20in.20codegen/near/356591425 instead of something specific in this place.
```diff
+        let pty = bx.type_ptr_to(bty);
+        let src = bx.pointercast(src, pty);
+        let dst = bx.pointercast(dst, pty);
```
Review comment on lines +388 to +389:

When do these not match `pty`? Why not just return a more "accurate" type in `scalar_copy_backend_type`?

Reply:

(Well, the shallow answer is that they don't match because it's […].)

Because emitting these as the backend type doesn't necessarily do what we want. Notably, this PR is emitting the load/store as a vector type rather than the backend's array type.

I tried using the LLVM type first, but with arrays that results in exploding out the IR: https://llvm.godbolt.org/z/vjjsdea9e

Optimizing the version that loads/stores arrays

```llvm
define void @replace_short_array_using_arrays(ptr noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load [3 x i32], ptr %r, align 4
  store [3 x i32] %1, ptr %0, align 4
  %2 = load [3 x i32], ptr %v, align 4
  store [3 x i32] %2, ptr %r, align 4
  ret void
}
```

gives

```llvm
define void @replace_short_array_using_arrays(ptr noalias nocapture noundef writeonly sret([3 x i32]) dereferenceable(12) %0, ptr noalias nocapture noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
  %.unpack = load i32, ptr %r, align 4
  %.elt1 = getelementptr inbounds [3 x i32], ptr %r, i64 0, i64 1
  %.unpack2 = load i32, ptr %.elt1, align 4
  %.elt3 = getelementptr inbounds [3 x i32], ptr %r, i64 0, i64 2
  %.unpack4 = load i32, ptr %.elt3, align 4
  store i32 %.unpack, ptr %0, align 4
  %.repack5 = getelementptr inbounds [3 x i32], ptr %0, i64 0, i64 1
  store i32 %.unpack2, ptr %.repack5, align 4
  %.repack7 = getelementptr inbounds [3 x i32], ptr %0, i64 0, i64 2
  store i32 %.unpack4, ptr %.repack7, align 4
  %.unpack9 = load i32, ptr %v, align 4
  %.elt10 = getelementptr inbounds [3 x i32], ptr %v, i64 0, i64 1
  %.unpack11 = load i32, ptr %.elt10, align 4
  %.elt12 = getelementptr inbounds [3 x i32], ptr %v, i64 0, i64 2
  %.unpack13 = load i32, ptr %.elt12, align 4
  store i32 %.unpack9, ptr %r, align 4
  store i32 %.unpack11, ptr %.elt1, align 4
  store i32 %.unpack13, ptr %.elt3, align 4
  ret void
}
```

whereas optimizing the version with vectors leaves the operations together

```llvm
define void @replace_short_array_using_vectors(ptr noalias nocapture noundef writeonly sret([3 x i32]) dereferenceable(12) %0, ptr noalias nocapture noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
  %1 = load <3 x i32>, ptr %r, align 4
  store <3 x i32> %1, ptr %0, align 4
  %2 = load <3 x i32>, ptr %v, align 4
  store <3 x i32> %2, ptr %r, align 4
  ret void
}
```

for the backend to legalize instead.

Reply (continued):

Oh, and I originally expected to do this with LLVM's arbitrary-length integer support https://llvm.godbolt.org/z/hzG9aeqhM, like I did for comparisons back in #85828

```llvm
define void @replace_short_array_using_long_integer(ptr noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load i96, ptr %r, align 4
  store i96 %1, ptr %0, align 4
  %2 = load i96, ptr %v, align 4
  store i96 %2, ptr %r, align 4
  ret void
}
```

But that broke some of the autovectorization codegen tests for operations on arrays.

Review comment:

Yeah ok, so we explicitly want to be lowering these moves to vector types specifically because they can optimize better than arrays.
```diff
+
+        let temp = bx.load(bty, src, src_align);
+        bx.store(temp, dst, dst_align);
+    } else {
+        bx.memcpy(dst, dst_align, src, src_align, bx.cx().const_usize(size), flags);
+    }
 }
 
 pub fn codegen_instance<'a, 'tcx: 'a, Bx: BuilderMethods<'a, 'tcx>>(
```
```diff
@@ -126,6 +126,28 @@ pub trait LayoutTypeMethods<'tcx>: Backend<'tcx> {
         index: usize,
         immediate: bool,
     ) -> Self::Type;
+
+    /// A type that can be used in a [`super::BuilderMethods::load`] +
+    /// [`super::BuilderMethods::store`] pair to implement a *typed* copy,
+    /// such as a MIR `*_0 = *_1`.
+    ///
+    /// It's always legal to return `None` here, as the provided impl does,
+    /// in which case callers should use [`super::BuilderMethods::memcpy`]
+    /// instead of the `load`+`store` pair.
+    ///
+    /// This can be helpful for things like arrays, where the LLVM backend type
+    /// `[3 x i16]` optimizes to three separate loads and stores, but it can
+    /// instead be copied via an `i48` that stays as the single `load`+`store`.
+    /// (As of 2023-05 LLVM cannot necessarily optimize away a `memcpy` in these
+    /// cases, due to `poison` handling, but in codegen we have more information
+    /// about the type invariants, so can emit something better instead.)
+    ///
+    /// This *should* return `None` for particularly-large types, where leaving
+    /// the `memcpy` may well be important to avoid code size explosion.
+    fn scalar_copy_backend_type(&self, layout: TyAndLayout<'tcx>) -> Option<Self::Type> {
```
Review comment on the default implementation:

Having a default impl here may make it not obvious that other backends should emit typed load/store. Is it bad style to just add this to cranelift and codegen_gcc here too?

Reply:

Well, cg_clif doesn't use cg_ssa, so it's not impacted. I have no practical way to test cg_gcc, being on Windows, and I also don't actually have any information that doing something other than `memcpy` would be better there. Notably, GCC apparently doesn't have the […] (see rust/compiler/rustc_codegen_gcc/src/common.rs, lines 76 to 79 in f8447b9), so indeed it might just never need to do this.
```diff
+        let _ = layout;
+        None
+    }
 }
 
 // For backends that support CFI using type membership (i.e., testing whether a given pointer is
```
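At the Rust level, the doc comment's `[3 x i16]`-copied-via-`i48` scenario is simply a by-value copy of a small array. A minimal standalone illustration of the kind of code affected (a hypothetical example, not part of this PR's diff):

```rust
// A by-value copy of a small array. In MIR this is an aggregate copy of the
// form the doc comment mentions (`*_0 = *_1`); in cg_ssa it reaches
// `memcpy_ty`, and when the backend's `scalar_copy_backend_type` returns
// `Some(..)` for this layout (and the copy has no special MemFlags), it is
// emitted as one typed load + store instead of a `memcpy` call.
#[no_mangle]
pub fn copy_short_array(src: &[i16; 3]) -> [i16; 3] {
    *src
}
```

Which type is returned, if any (a wide integer such as `i48`, or a short vector), is left to each backend; the doc comment only asks that particularly-large types keep returning `None` so the `memcpy` remains.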
```diff
@@ -0,0 +1,35 @@
+// compile-flags: -O -C no-prepopulate-passes
+// min-llvm-version: 15.0 (for opaque pointers)
+
+#![crate_type = "lib"]
+
+// CHECK-LABEL: @array_load
+#[no_mangle]
+pub fn array_load(a: &[u8; 4]) -> [u8; 4] {
+    // CHECK: %0 = alloca [4 x i8], align 1
+    // CHECK: %[[TEMP1:.+]] = load <4 x i8>, ptr %a, align 1
+    // CHECK: store <4 x i8> %[[TEMP1]], ptr %0, align 1
+    // CHECK: %[[TEMP2:.+]] = load i32, ptr %0, align 1
+    // CHECK: ret i32 %[[TEMP2]]
+    *a
+}
+
+// CHECK-LABEL: @array_store
+#[no_mangle]
+pub fn array_store(a: [u8; 4], p: &mut [u8; 4]) {
+    // CHECK: %a = alloca [4 x i8]
+    // CHECK: %[[TEMP:.+]] = load <4 x i8>, ptr %a, align 1
+    // CHECK-NEXT: store <4 x i8> %[[TEMP]], ptr %p, align 1
+    *p = a;
+}
+
+// CHECK-LABEL: @array_copy
+#[no_mangle]
+pub fn array_copy(a: &[u8; 4], p: &mut [u8; 4]) {
+    // CHECK: %[[LOCAL:.+]] = alloca [4 x i8], align 1
+    // CHECK: %[[TEMP1:.+]] = load <4 x i8>, ptr %a, align 1
+    // CHECK: store <4 x i8> %[[TEMP1]], ptr %[[LOCAL]], align 1
+    // CHECK: %[[TEMP2:.+]] = load <4 x i8>, ptr %[[LOCAL]], align 1
+    // CHECK: store <4 x i8> %[[TEMP2]], ptr %p, align 1
+    *p = *a;
+}
```