Description
I'm working on a C++/Rust interop tool called Zngur, and I noticed this problem while benchmarking it. Here is a reduced version without any Zngur-related code.
This C++ code:
#include <cstdint>
#include <array>

extern "C" {
    void push_to_vec(void *v, uint64_t i);
    void new_vec_in_stack(void *v);
    void free_vec_in_stack(void *v);
}

struct MyVec {
    alignas(8) std::array<uint8_t, 24> data;
    MyVec() {
        new_vec_in_stack(reinterpret_cast<void*>(data.begin()));
    }
    // ~MyVec() {
    //     free_vec_in_stack(reinterpret_cast<void*>(data.begin()));
    // }
};

void build_vec(int n)
{
    MyVec v;
    void* vec = reinterpret_cast<void*>(v.data.begin());
    for (int i = 0; i < n; i++)
    {
        push_to_vec(vec, i);
    }
    free_vec_in_stack(vec);
}

extern "C" {
    void do_the_job()
    {
        for (int i = 0; i < 100000; i++)
        {
            build_vec(10000);
        }
    }
}
This code becomes significantly (2x) slower if I use the destructor (commented out) instead of manually calling free_vec_in_stack at the end of the build_vec function. Even adding an empty destructor makes it 2x slower. Marking the destructor as inline doesn't help.
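For reference, the destructor variant described above can be sketched roughly as follows. This is only an illustration: the three `extern "C"` functions are replaced by hypothetical counting stubs (and the `g_*` counters are made up here) so that the snippet builds without the Rust crate, which means it shows the shape of the code but will not reproduce the slowdown. It also uses `data.data()` instead of `data.begin()` to get the buffer pointer, and deletes the copy operations, both of which are changes from the reduced example.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-ins for the Rust functions, so this sketch links on its
// own. In the real setup these symbols come from the Rust side.
static int g_new_calls = 0;
static int g_free_calls = 0;
static std::uint64_t g_pushed = 0;

extern "C" {
void new_vec_in_stack(void *) { ++g_new_calls; }
void free_vec_in_stack(void *) { ++g_free_calls; }
void push_to_vec(void *, std::uint64_t) { ++g_pushed; }
}

struct MyVec {
    alignas(8) std::array<std::uint8_t, 24> data;

    MyVec() { new_vec_in_stack(data.data()); }

    // The destructor that is commented out in the reduced example.
    // Per the report, even an empty body `~MyVec() {}` shows the slowdown.
    ~MyVec() { free_vec_in_stack(data.data()); }

    // Assumption added for this sketch: forbid copies so the buffer is
    // never freed twice.
    MyVec(const MyVec &) = delete;
    MyVec &operator=(const MyVec &) = delete;
};

void build_vec(int n) {
    MyVec v;
    void *vec = v.data.data();
    for (int i = 0; i < n; i++) {
        push_to_vec(vec, static_cast<std::uint64_t>(i));
    }
    // No manual free_vec_in_stack(vec) here: ~MyVec() runs at scope exit.
}
```

The only difference from the fast version is where `free_vec_in_stack` is called: at scope exit through `~MyVec()` rather than explicitly as the last statement of `build_vec`.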
Here is the Rust driver code:
use std::ffi::c_void;
use std::time::Instant;

#[unsafe(no_mangle)]
pub extern "C" fn new_vec_in_stack(v: *mut c_void) {
    unsafe {
        std::ptr::write(v as *mut Vec<u64>, vec![]);
    }
}

#[unsafe(no_mangle)]
pub extern "C" fn free_vec_in_stack(v: *mut c_void) {
    unsafe {
        _ = std::ptr::read(v as *mut Vec<u64>);
    }
}

#[unsafe(no_mangle)]
pub extern "C" fn push_to_vec(v: *mut c_void, i: u64) {
    let v = unsafe { &mut *(v as *mut Vec<u64>) };
    v.push(i);
}

unsafe extern "C" {
    fn do_the_job();
}

fn build_vec(n: u64) -> Vec<u64> {
    let mut r = vec![];
    for i in 0..n {
        r.push(i);
    }
    r
}

fn main() {
    let start = Instant::now();
    for _ in 0..100_000 {
        std::hint::black_box(build_vec(10000));
    }
    println!("Pure rust = {:?}", start.elapsed());

    let start = Instant::now();
    unsafe {
        do_the_job();
    }
    println!("Cross language = {:?}", start.elapsed());
}
Here is the result of the version with the destructor:
Pure rust = 1.57105235s
Cross language = 3.138335498s
Here is the result of the version without the destructor:
Pure rust = 1.633161618s
Cross language = 1.655836619s
And here is the result of the version without the destructor, but with xlto (cross-language LTO) disabled:
Pure rust = 1.608407431s
Cross language = 3.019778757s
I enable xlto using this command:
cargo clean && CXX=clang++ RUSTFLAGS="-Clinker-plugin-lto -Clinker=clang -Clink-arg=-fuse-ld=lld" cargo run -r
And here is my build.rs file:
fn main() {
    cc::Build::new()
        .cpp(true)
        .file("job.cpp")
        .flag("-flto=thin")
        .compile("libjob.a");
}