[on hold] Switch convert_2q_block_matrix.rs to use matmul from faer and an explicit version of kron #12193
Conversation
To test the code, I checked the tests in the transpiler folder (/test/python/transpiler), especially test_consolidate_blocks.py, which I saw uses the Rust function through its call to _block_to_matrix. My first idea was to write a new test, but perhaps this one is enough to verify the code behaves correctly. What do you think?
/// This function will substitute `change_basis` once
/// `two_qubit_decompose.rs` uses Mat<c64> instead of ArrayView2
#[inline]
pub fn change_basis_faer(matrix: Mat<c64>) -> Mat<c64> {
The idea is that this function ends up replacing change_basis once two_qubit_decompose.rs uses Mat<c64> instead of ArrayView2. I created another function just to avoid modifying the other file until the follow-up. After that, the function will be renamed to change_basis again.
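For context: reversing the order of the two qubits of a 4x4 unitary is conjugation by SWAP, which just exchanges row/column indices 1 and 2. Purely as an illustration of what a faer-flavoured version could look like, assuming that semantics (a sketch, not the PR's implementation):

use faer::Mat;
use faer::complex_native::c64;

#[inline]
pub fn change_basis_faer(matrix: Mat<c64>) -> Mat<c64> {
    // Conjugation by SWAP: permute row/column indices 1 <-> 2.
    let p = |k: usize| match k {
        1 => 2,
        2 => 1,
        k => k,
    };
    let m = matrix.as_ref();
    Mat::from_fn(4, 4, |i, j| m.read(p(i), p(j)))
}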
Arnau and I discussed this as the best approach, because otherwise two_qubit_decompose.rs would have needed non-trivial edits.
Thank you for opening a new pull request. Before your PR can be merged it will first need to pass continuous integration tests and be reviewed. Sometimes the review process can be slow, so please be patient. While you're waiting, please feel free to review other open PRs. While only a subset of people are authorized to approve pull requests for merging, everyone is encouraged to review open pull requests. Doing reviews helps reduce the burden on the core team and helps make the project's code better for everyone. One or more of the following people are requested to review this:
Pull Request Test Coverage Report for Build 8986784365

💛 - Coveralls
Overall this looks good; thanks for getting this started. I have a couple of inline suggestions and comments, but this is a great start! Do you have any benchmarks comparing the performance of a transpile or the ConsolidateBlocks pass with this change? I'm curious to see what the user-facing performance changes will look like with this PR.
matmul(
    matrix.as_mut(),
    result.as_ref(),
    aux.as_ref(),
    None,
    c64::new(1., 0.),
    Parallelism::None,
);
Looking at the faer docs, it looks like you may be able to just do:

let res = result * matrix;

https://docs.rs/faer/latest/faer/#basic-usage

It obviously doesn't give us the same level of control as calling faer::modules::core::mul::matmul, but I wonder if there is a performance difference with it, as I think it would be a little cleaner looking than calling matmul like this.
Agreed, I think it looks cleaner too.

> I wonder if there is a performance difference with it

To test it, I ran the same tests as in the issue, and it seems like we lose performance if we don't call faer::modules::core::mul::matmul:
| | % change avg (asterisk -> matmul) | asterisk (min) | asterisk (max) | asterisk (avg) | matmul (min) | matmul (max) | matmul (avg) |
|---|---|---|---|---|---|---|---|
| 2x2 | 92.80% | 524ns | 651ns | 547.67ns | 38ns | 70ns | 39.41ns |
| 4x4 | 78.19% | 659ns | 1.00µs | 689.28ns | 145ns | 217ns | 150.34ns |
Wow, that's quite the difference; I wasn't expecting that. Then yeah, let's leave it as is.
Oh, wait, what exactly were you timing here? Was it just the time to call matmul, or did you include the clone() time? The bare matmul call by itself is definitely going to be faster, but you still need to allocate the result space, which you are in effect doing with the clone() call above. The multiplication operator just does that implicitly, so it's not a manual operation.
Oh, yeah, I was only timing the matmul call in general, without using the same matrix as the destination either. I rewrote the test including the clone().

So for the matmul call, I'm timing:

let aux = dst.clone();
matmul(dst.as_mut(), a.as_ref(), aux.as_ref(), Some(ALPHA), BETA, Parallelism::None);

For the other case:

dst = &dst + &a * &dst;

ALPHA and BETA are 1 for the first case.
| | % change avg (asterisk -> matmul) | asterisk (min) | asterisk (max) | asterisk (avg) | matmul (min) | matmul (max) | matmul (avg) |
|---|---|---|---|---|---|---|---|
| 2x2 | 56.61% | 589ns | 868ns | 632.74ns | 263ns | 432ns | 274.56ns |
| 4x4 | 44.37% | 717ns | 938ns | 763.04ns | 397ns | 558ns | 424.51ns |
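For reference, a timing harness of roughly this shape could produce the min/max/avg columns above; this is only a sketch of the setup being described (matrix contents and iteration count made up), not the actual benchmark code:

use std::time::Instant;

use faer::{Mat, Parallelism};
use faer::complex_native::c64;
use faer::modules::core::mul::matmul;

fn main() {
    let a = Mat::<c64>::from_fn(4, 4, |i, j| c64::new((4 * i + j) as f64, 0.5));
    let mut dst = Mat::<c64>::identity(4, 4);
    let mut samples = Vec::with_capacity(1_000);
    for _ in 0..1_000 {
        let start = Instant::now();
        let aux = dst.clone(); // the allocation is part of the measurement
        matmul(
            dst.as_mut(),
            a.as_ref(),
            aux.as_ref(),
            None,
            c64::new(1., 0.),
            Parallelism::None,
        );
        samples.push(start.elapsed().as_nanos());
    }
    let min = samples.iter().min().unwrap();
    let max = samples.iter().max().unwrap();
    let avg = samples.iter().sum::<u128>() as f64 / samples.len() as f64;
    println!("min {min}ns, max {max}ns, avg {avg:.2}ns");
}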
Thanks for the review, Matthew! I've run the transpiler tests, but I don't see any notable improvement from only modifying this script. Similar results can be obtained by running the asv benchmarks. Based on the benchmarks in the issue, I suspect this PR does improve performance, but maybe this part of the code only makes up a small portion of the execution time in the asv benchmarks, making the impact inconsistent. Results obtained:
Yeah, in general the
[0] => identity.kron(op_matrix.as_ref()),
[1] => op_matrix.kron(identity.as_ref()),
[1, 0] => change_basis_faer(op_matrix),
[] => Mat::<c64>::identity(4, 4),
_ => op_matrix.to_owned(),
The op_matrix.to_owned() here is a bit weird to me. It feels like an unnecessary copy, as we don't actually need an owned copy of the array from Python here; it's only used in the matmul call below. The previous logic was a bit ugly but was done that way to avoid copying with to_owned(). The other option would be to maybe try something like:

[0] => identity.kron(op_matrix.as_ref()).as_ref(),
[1] => op_matrix.kron(identity.as_ref()).as_ref(),
[1, 0] => change_basis_faer(op_matrix).as_ref(),
[] => Mat::<c64>::identity(4, 4).as_ref(),
_ => op_matrix.to_owned(),

and change result to be a MatRef<c64> instead of a Mat<c64>. Although I'm not sure if the compiler would be happy with this (there might be a lifetime issue doing this). The other advantage is that if result becomes a MatRef, we can create the 4x4 identity matrix as a static and avoid the allocation (although in practice I'm not sure this ever gets used, so it probably doesn't matter).
I have tried the code you suggested, and as you said, the compiler is not happy about it. The error message is "temporary value is freed at the end of this statement", referring to each arm of the match where we use as_ref().

Eric and I have been trying to fix it, but the only alternative we saw to avoid copying with to_owned() is to restore the previous logic, making the result matrix an Option<Mat<c64>> and "duplicating" the matmul call using the unwrapped value of the Some, or op_matrix in case of None. It's not as readable as now, but performance-wise it should be better.

Here is the git diff of the change. What do you think about doing something like this again? I suspect we will see this case in two_qubit_decompose.rs too.
 let op_matrix = op_matrix.as_array().into_faer_complex();
 let result = match q_list.as_slice() {
-    [0] => identity.kron(op_matrix.as_ref()),
-    [1] => op_matrix.kron(identity.as_ref()),
-    [1, 0] => change_basis_faer(op_matrix),
-    [] => Mat::<c64>::identity(4, 4),
-    _ => op_matrix.to_owned(),
+    [0] => Some(identity.kron(op_matrix.as_ref())),
+    [1] => Some(op_matrix.kron(identity.as_ref())),
+    [1, 0] => Some(change_basis_faer(op_matrix)),
+    [] => Some(Mat::<c64>::identity(4, 4)),
+    _ => None,
 };
 let aux = matrix.clone();
-matmul(
-    matrix.as_mut(),
-    result.as_ref(),
-    aux.as_ref(),
-    None,
-    c64::new(1., 0.),
-    Parallelism::None,
-);
+match result {
+    Some(x) => {
+        matmul(
+            matrix.as_mut(),
+            x.as_ref(),
+            aux.as_ref(),
+            None,
+            c64::new(1., 0.),
+            Parallelism::None,
+        );
+    },
+    None => {
+        matmul(
+            matrix.as_mut(),
+            op_matrix,
+            aux.as_ref(),
+            None,
+            c64::new(1., 0.),
+            Parallelism::None,
+        );
+    }
+};
 }
FYI, this is what we were trying to do:

let result = match q_list.as_slice() {
    [0] => Some(identity.kron(op_matrix.as_ref())),
    [1] => Some(op_matrix.kron(identity.as_ref())),
    [1, 0] => Some(change_basis_faer(op_matrix)),
    [] => Some(Mat::<c64>::identity(4, 4)),
    _ => None,
};
let arg1 = match result {
    // Go from Mat to MatRef. This failed because of the lifetime of x
    Some(x) => x.as_ref(),
    // op_matrix is already a MatRef
    None => op_matrix,
};

Using result.as_ref().unwrap_or(op_matrix) does not work: result.as_ref() results in an Option<&Mat> rather than Option<MatRef>, and they are not the same thing.
Yeah, I think let's use this pattern; it's ugly, but it lets us avoid a copy.
> Using result.as_ref().unwrap_or(op_matrix) does not work.

Should this not be result.as_ref().map(|x| x.as_ref()).unwrap_or(op_matrix)?
Fwiw, match result is wrong because it's consuming the result, so there's nothing for the reference to refer to. The match needs to be on result.as_ref() to keep the underlying matrix alive, and then it's no problem if you get an &Mat: you can just call Mat::as_ref on one of those, no trouble; that's just the regular API.
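Concretely, the working shape being described is a sketch like:

// Borrow `result` instead of consuming it; the owned Mat inside stays alive,
// and Mat::as_ref converts the &Mat into the MatRef that matmul wants.
let arg1 = match result.as_ref() {
    Some(x) => x.as_ref(), // &Mat<c64> -> MatRef<c64>
    None => op_matrix,     // already a MatRef<c64>
};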
(see #12193 (comment))
> Should this not be result.as_ref().map(|x| x.as_ref()).unwrap_or(op_matrix)?

Yes, that works! Much better than duplicating the matmul call, thanks Jake!
This LGTM now; thanks for doing this and all the very prompt updates and investigation on this work! I just had one small nit inline but I think after that's resolved this should be good to merge.
-let op_matrix = op_matrix.as_array();
+let op_matrix = op_matrix.as_array().into_faer_complex();

 let result = match q_list.as_slice() {
-    [0] => Some(kron(&identity, &op_matrix)),
-    [1] => Some(kron(&op_matrix, &identity)),
-    [1, 0] => Some(change_basis(op_matrix)),
-    [] => Some(Array2::eye(4)),
+    [0] => Some(identity.kron(op_matrix)),
+    [1] => Some(op_matrix.kron(identity)),
+    [1, 0] => Some(change_basis_faer(op_matrix)),
+    [] => Some(Mat::<c64>::identity(4, 4)),
     _ => None,
 };
-matrix = match result {
-    Some(result) => result.dot(&matrix),
-    None => op_matrix.dot(&matrix),
-};
+let aux = matrix.clone();
+match result {
+    Some(x) => {
+        matmul(
+            matrix.as_mut(),
+            x.as_ref(),
+            aux.as_ref(),
+            None,
+            c64::new(1., 0.),
+            Parallelism::None,
+        );
+    }
+    None => {
+        matmul(
+            matrix.as_mut(),
+            op_matrix,
+            aux.as_ref(),
+            None,
+            c64::new(1., 0.),
+            Parallelism::None,
+        );
+    }
+}
I think all this is just equivalent to:
let op_matrix = op_matrix.as_array().into_faer_complex();
let expanded = match q_list.as_slice() {
    [0] => Some(identity.kron(op_matrix)),
    [1] => Some(op_matrix.kron(identity)),
    [1, 0] => Some(change_basis_faer(op_matrix)),
    [] => Some(Mat::<c64>::identity(4, 4)),
    _ => None,
};
matrix = expanded.as_ref().map(|x| x.as_ref()).unwrap_or(op_matrix) * matrix;
I could be a shade wrong about that if the parallelism effects are different, but otherwise I think this is what we're doing?
(see #12193 (comment))
Yeah, this should be equivalent. However, it seems like using the overloaded operator * is not as efficient as calling matmul. I created a simple example, and these were the results using cargo bench:
test tests::benchmark04_ndarray_dot ... bench: 1,190 ns/iter (+/- 113)
test tests::benchmark05_faer_overload_matmul ... bench: 1,454 ns/iter (+/- 118)
test tests::benchmark06_faer_matmul ... bench: 695 ns/iter (+/- 81)
The tests perform a matrix multiplication in a loop of 5 iterations, using dot(), the * operator between two Mat<Complex64>, and matmul(), respectively.
let mut aux = Mat::<c64>::with_capacity(4, 4);
// SAFETY: `aux` is a 4x4 matrix whose values are uninitialized and it's used only to store the
// result of the `matmul` call inside the for loop
unsafe { aux.set_dims(4, 4) };
This matrix is used to avoid having to clone the result of the matrix multiplication in each iteration of the loop. The result is stored initially in aux, and using a swap operation (line 81), it is moved to matrix: Mat<c64>, ready for the next iteration or to finish the loop and continue with the code.
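In other words, the loop has roughly this shape (a sketch of the pattern, with `ops` standing in for the per-gate matrices; not the exact PR code):

for op in &ops {
    // Write the product into the scratch buffer...
    matmul(
        aux.as_mut(),
        op.as_ref(),
        matrix.as_ref(),
        None,
        c64::new(1., 0.),
        Parallelism::None,
    );
    // ...then swap so `matrix` holds the running product for the next iteration.
    std::mem::swap(&mut aux, &mut matrix);
}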
Re-running the tests I wrote in the issue, I realized that using

To double-check the results, I wrote similar tests but using the

1. In this case, the tests are comparing the In this PR, because we are using
2. These 3 tests compare the functions
3. Finally, we have tests to combine both methods. The tests multiply a 4x4 matrix and another one obtained as the result of a Kronecker product in a for loop, storing the result in the first 4x4 matrix. Here, the four options are

All the tests were run in a loop of 100 iterations. Running the tests with few iterations, or even without a for loop, gave very inconsistent results on each run, with a big standard deviation compared to the mean. I suspect this could be normal given how small the individual operation times are. It could be worth checking the generated assembly to make sure the compiler is not applying any optimizations, but I wanted to give you all an update on the investigation, as I was getting very confusing results with the other method last week.
This version should be faster than using a library. These functions change two things with respect to the library functions:

A couple of features:

It uses, for example:

// Compute `kron(identity, mat)` for 2x2 matrix inputs
fn kron_id2_oneq(oneq_mat: ArrayView2<Complex64>) -> Array2<Complex64> {
    let _zero = Complex64::new(0., 0.);
    array![
        [oneq_mat[[0, 0]], oneq_mat[[0, 1]], _zero, _zero],
        [oneq_mat[[1, 0]], oneq_mat[[1, 1]], _zero, _zero],
        [_zero, _zero, oneq_mat[[0, 0]], oneq_mat[[0, 1]]],
        [_zero, _zero, oneq_mat[[1, 0]], oneq_mat[[1, 1]]],
    ]
}
I added two more functions that do the same, but that take pre-allocated arrays. I kind of doubt this will make a big performance difference. EDIT: Notice that there was a match arm that appeared to do essentially ANOTHER EDIT:
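The pre-allocated variants mentioned could look roughly like this (a sketch; the name and signature are guesses, not code from the PR):

use ndarray::{Array2, ArrayView2};
use num_complex::Complex64;

// Write kron(identity, mat) into a caller-provided 4x4 array.
fn kron_id2_oneq_into(oneq_mat: ArrayView2<Complex64>, out: &mut Array2<Complex64>) {
    out.fill(Complex64::new(0., 0.));
    for i in 0..2 {
        for j in 0..2 {
            out[[i, j]] = oneq_mat[[i, j]];
            out[[i + 2, j + 2]] = oneq_mat[[i, j]];
        }
    }
}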
I'll try to benchmark or characterize this soon. If someone else already has code they can drop this into, feel free, of course.
Thanks for the comment, @jlapeyre! That's a really good idea, but I have one question. I see that you still use

I have been testing with your code, and this could be an example of an equivalence with
Using
I thought that too, but looking at the documentation, I found the function general_mat_mul. I tested it against
The only way that I know of is to use the faer_ext crate, which has different methods to convert a
I only tested it together with a matrix multiplication simulating the scenario we have in this PR, but I'll make a test between all
Oh, interesting. My intention was to work with pointers here to avoid having to copy any matrix. I'll investigate it 🤔
Exactly! Sorry, I left
If I'm not wrong, the other use of
I didn't mention that one because I didn't see a way to skip the scalar and vector operations.
Huh. So
As long as we have more than one library and data type representing matrices, this kind of problem will come up repeatedly. We could use a macro to handle more than one kind of data, or use traits. Or actually, since the code is very small, just implement these methods for both kinds of matrices. We need to move these functions from the modules where they are applied to a common module. My best thought at the moment is something called common.rs. When the number of functions in that module starts to get large, we could split the functions out by category.
I like the idea! In the last commit, I moved the new functions there.

After testing the new functions, for this script it doesn't make a big impact to use the versions that need preallocated arrays, so I used the allocating versions to make the code more idiomatic.

I've tested that the swap only exchanges the pointers, without copying any data:

let old_ptr1 = aux.as_ptr();
let old_ptr2 = matrix.as_ptr();
swap(&mut aux, &mut matrix);
assert_eq!(old_ptr1, matrix.as_ptr());
assert_eq!(old_ptr2, aux.as_ptr());

Summary of the changes in this PR:
In a follow-up, I'll change the two_qubit_decompose.rs script as well.
convert_2q_block_matrix.rs to use matmul and kron from faer
convert_2q_block_matrix.rs to use matmul from faer and an explicit version of kron
After looking closely at

This is caused by the fact that we can transform a

An alternative to get a speedup is to follow @jlapeyre's suggestion and implement the matrix multiplication and kron by hand in size-fixed functions for the two cases. In this PR, you can find an example of this using a new module named

In the new module, we can find the following functions:
As you can see, in the

In the following tests, we can see the performance improvement of a single call of those functions compared with the

- Kronecker product 2x2 (identity ⊗ matrix)
- Matrix multiplication 2x2
- Matrix multiplication 4x4
- Determinant
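As an illustration of the size-fixed, hand-written style being benchmarked here, a 4x4 multiply could look like the sketch below (the actual names and signatures in the PR's module may differ). Because the bounds are compile-time constants, the compiler can fully unroll the loops:

use ndarray::{Array2, ArrayView2};
use num_complex::Complex64;

fn matmul_4x4(a: ArrayView2<Complex64>, b: ArrayView2<Complex64>) -> Array2<Complex64> {
    let mut out = Array2::from_elem((4, 4), Complex64::new(0., 0.));
    for i in 0..4 {
        for j in 0..4 {
            let mut acc = Complex64::new(0., 0.);
            for k in 0..4 {
                acc += a[[i, k]] * b[[k, j]];
            }
            out[[i, j]] = acc;
        }
    }
    out
}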
To test the performance, I have also run the

I was going to add tests to
Part of #12120

This PR switches over the convert_2q_block_matrix.rs script to use matmul from faer instead of dot from ndarray to perform matrix multiplications. It turns out that using faer for matrices of size 4x4 or smaller gives us better performance compared to ndarray, as we can find in this comment.

Summary of the changes in this PR:

- An explicit kron 2x2 that improves the performance of the kron calls
- matmul() from faer, improving the performance of the dot() calls
- common.rs as a module with different common functions shared between scripts

In a follow-up, the same change will be done for the two_qubit_decompose.rs script, which in the meantime is still using ndarray and working with the old change_basis function.