broadcast_add is very slow #2499
I used modified code (with the loop count adjusted to 10 iterations), together with the cargo.toml below and the flamegraph tool, to obtain this flame graph. A significant portion of the time appears to be iterator overhead.

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let a = Tensor::rand(0f32, 1.0, (32, 630, 12, 32), &Device::Cpu)?;
    let b = Tensor::rand(0f32, 1.0, (32, 1, 1, 32), &Device::Cpu)?;
    let start = std::time::Instant::now();
    for _ in 0..10 {
        let _ = a.broadcast_add(&b);
    }
    println!("broadcast add : {:?}", std::time::Instant::now() - start);
    Ok(())
}
```

Output:

```
broadcast add : 441.14726ms
```
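One workaround worth measuring (my own sketch, not from the report, assuming candle's `broadcast_as`/`contiguous`/`add` APIs) is to materialize the broadcast once up front and then use a plain shape-matched add, so the hot loop no longer walks a strided broadcast iterator:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let a = Tensor::rand(0f32, 1.0, (32, 630, 12, 32), &Device::Cpu)?;
    let b = Tensor::rand(0f32, 1.0, (32, 1, 1, 32), &Device::Cpu)?;
    // Expand b to a's shape once and copy it into a contiguous buffer; this
    // trades extra memory for a plain element-wise add in the loop below.
    let b_full = b.broadcast_as(a.dims())?.contiguous()?;
    let start = std::time::Instant::now();
    for _ in 0..10 {
        let _ = a.add(&b_full)?;
    }
    println!("contiguous add : {:?}", std::time::Instant::now() - start);
    Ok(())
}
```

Whether this helps depends on how much of the 441 ms is iterator overhead versus memory bandwidth.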
I modified the loop count of broadcast_add in the code to 10 iterations and tested the performance of both optimized and unoptimized binaries, comparing the results with the performance of the Python code.

Experimental conditions (compiler settings):

```toml
[profile.release]
opt-level = 2
lto = false
debug = true
panic = 'abort'
```
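(Side note, not from the original report: `opt-level = 2` and `debug = true` deviate from Cargo's release defaults, so the numbers below may understate what an ordinary release build would do. A profile closer to a typical benchmarking setup would be the following, though whether it changes these particular results is untested.)

```toml
[profile.release]
opt-level = 3      # Cargo's default for release builds
lto = true         # allow cross-crate inlining of hot inner loops
codegen-units = 1  # trade compile time for better optimization
debug = false
```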
And this is my result: the Python code is clearly about 10x faster than candle.

I used strace to trace the system calls of both implementations and found that the Python version of broadcast appears to use multi-threading to distribute the tensor operations (48 threads in this case). I am not sure whether this mechanism is what improves computational efficiency. In contrast, the Rust code in this case used only one thread/logical core. I therefore tested with Rayon whether multi-threading could improve the efficiency of the code in this scenario.

```
$ strace --follow-forks --summary-only python3 broadcast_add.py
strace: Process 66105 attached
strace: Process 66106 attached
strace: Process 66107 attached
strace: Process 66108 attached
strace: Process 66109 attached
strace: Process 66110 attached
strace: Process 66111 attached
strace: Process 66112 attached
strace: Process 66113 attached
strace: Process 66114 attached
...
```

Modified code:
```rust
use std::sync::Arc;
use std::time::Instant;

use candle_core::{Device, Result, Tensor};
use rayon::prelude::*;

fn main() -> Result<()> {
    let a = Arc::new(Tensor::rand(0f32, 1.0, (32, 630, 12, 32), &Device::Cpu)?);
    let b = Arc::new(Tensor::rand(0f32, 1.0, (32, 1, 1, 32), &Device::Cpu)?);
    let start = Instant::now();
    (0..100).into_par_iter().for_each(|_| {
        let a_clone = a.clone();
        let b_clone = b.clone();
        let _ = a_clone.broadcast_add(&b_clone);
    });
    println!("broadcast add with Rayon: {:?}", Instant::now() - start);
    Ok(())
}
```

Rayon version: `broadcast add with Rayon (100x): 781.284023ms`

However, such a coarse level of parallelism doesn't make practical sense. Could you tell me whether there is an API or another method that could make a candle Tensor reach the same efficiency as Torch in this scenario, and how does Torch implement tensor.broadcast_add internally? Thank you very much. @LaurentMazare
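For what it's worth, a finer-grained alternative I can sketch (my own idea, not an official candle pattern, and untested here) is to parallelize inside a single broadcast_add: split both tensors along the batch dimension with `narrow`, add the chunks in parallel, and reassemble with `Tensor::cat`. The granularity (one slice per batch element) is an arbitrary choice:

```rust
use std::time::Instant;

use candle_core::{Device, Result, Tensor};
use rayon::prelude::*;

fn main() -> Result<()> {
    let a = Tensor::rand(0f32, 1.0, (32, 630, 12, 32), &Device::Cpu)?;
    let b = Tensor::rand(0f32, 1.0, (32, 1, 1, 32), &Device::Cpu)?;
    let start = Instant::now();
    // Split along dim 0, broadcast-add each (a chunk, b chunk) pair in
    // parallel, then stitch the partial results back together.
    let chunks = (0..32usize)
        .into_par_iter()
        .map(|i| a.narrow(0, i, 1)?.broadcast_add(&b.narrow(0, i, 1)?))
        .collect::<Result<Vec<_>>>()?;
    let _c = Tensor::cat(&chunks, 0)?;
    println!("chunked broadcast add: {:?}", start.elapsed());
    Ok(())
}
```

This still pays the cost of the final `cat`, so it only wins if the per-chunk adds dominate; the proper fix would presumably be multi-threading inside the CPU backend's broadcast kernel itself.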
[Attachments: Rust code, Python code]
The Python version is around 55 times faster for a simple addition function. Is there a path forward to solving this problem?