
[on hold] Switch convert_2q_block_matrix.rs to use matmul from faer and an explicit version of kron #12193

Closed
wants to merge 20 commits

Conversation

@arnaucasau (Contributor) commented Apr 16, 2024

Part of #12120

This PR switches convert_2q_block_matrix.rs over to matmul from faer instead of dot from ndarray to perform matrix multiplications. It turns out that for matrices of size 4x4 or smaller, faer gives better performance than ndarray, as shown in this comment.

Summary of the changes in this PR:

  • An explicit/unrolled version of kron for 2x2 matrices, which improves the performance of the kron calls
  • The use of matmul() from faer, which improves the performance of the dot() calls
  • The introduction of common.rs, a module for functions shared between different scripts

In a follow-up, the same change will be made to the two_qubit_decompose.rs script, which in the meantime still uses ndarray and the old change_basis function.
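For context, the core replacement looks roughly like the following sketch (mirroring the matmul call quoted later in this thread, not the exact diff):

// before: matrix = result.dot(&matrix);
// after:
let aux = matrix.clone(); // matmul needs a separate destination, so keep the old accumulator
matmul(
    matrix.as_mut(),   // destination, overwritten with beta * result * aux
    result.as_ref(),   // lhs
    aux.as_ref(),      // rhs (the previous accumulator)
    None,              // alpha = None: discard the destination's old contents
    c64::new(1., 0.),  // beta = 1
    Parallelism::None, // 4x4 matrices are too small to benefit from threads
);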

@qiskit-bot qiskit-bot added the Community PR PRs from contributors that are not 'members' of the Qiskit repo label Apr 16, 2024
@arnaucasau (Contributor, Author) left a comment

To test the code, I ran the tests in the transpiler folder (/test/python/transpiler), especially test_consolidate_blocks.py, which, as far as I can tell, exercises the Rust function through its call to _block_to_matrix. My first idea was to write a new test, but perhaps this one is enough to verify the correct behavior of the code. What do you think?

/// This function will substitute `change_basis` once
/// `two_qubit_decompose.rs` uses Mat<c64> instead of ArrayView2
#[inline]
pub fn change_basis_faer(matrix: Mat<c64>) -> Mat<c64> {
@arnaucasau (Contributor, Author):

The idea is that this function ends up replacing change_basis once two_qubit_decompose.rs uses Mat<c64> instead of ArrayView2. I created a separate function just to avoid modifying the other file until the follow-up. After that, the function will be renamed back to change_basis.
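For reference, swapping the qubit order of a two-qubit unitary amounts to conjugation by SWAP, i.e. permuting rows and columns 1 and 2 of the 4x4 matrix. A minimal faer sketch of that semantic (hypothetical code, not the PR's actual implementation):

// Swap qubit order by conjugating with SWAP: apply the index
// permutation [0, 2, 1, 3] to both rows and columns.
fn change_basis_faer(matrix: Mat<c64>) -> Mat<c64> {
    const PERM: [usize; 4] = [0, 2, 1, 3];
    Mat::from_fn(4, 4, |i, j| matrix.read(PERM[i], PERM[j]))
}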

Collaborator:

Arnau and I discussed this as the best approach because otherwise two_qubit_decompose.rs would have needed non-trivial edits.

@arnaucasau arnaucasau marked this pull request as ready for review April 16, 2024 17:56
@arnaucasau arnaucasau requested a review from a team as a code owner April 16, 2024 17:56
@qiskit-bot (Collaborator):

Thank you for opening a new pull request.

Before your PR can be merged it will first need to pass continuous integration tests and be reviewed. Sometimes the review process can be slow, so please be patient.

While you're waiting, please feel free to review other open PRs. While only a subset of people are authorized to approve pull requests for merging, everyone is encouraged to review open pull requests. Doing reviews helps reduce the burden on the core team and helps make the project's code better for everyone.

One or more of the following people are requested to review this:

@coveralls commented Apr 16, 2024

Pull Request Test Coverage Report for Build 8986784365

Details

  • 84 of 86 (97.67%) changed or added relevant lines in 3 files are covered.
  • 7 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.02%) to 89.648%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| crates/accelerate/src/convert_2q_block_matrix.rs | 31 | 33 | 93.94% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| crates/accelerate/src/convert_2q_block_matrix.rs | 1 | 94.0% |
| crates/qasm2/src/lex.rs | 6 | 92.11% |

Totals Coverage Status
  • Change from base Build 8983739322: 0.02%
  • Covered Lines: 62320
  • Relevant Lines: 69516

💛 - Coveralls

@mtreinish (Member) left a comment

Overall this looks good, thanks for getting this started. I have a couple of inline suggestions and comments, but this is a great start! Do you have any benchmarks comparing the performance of a transpile or the ConsolidateBlocks pass with this change? I'm curious to see what the user-facing performance changes will look like with this PR.

Comment on lines 66 to 73
matmul(
matrix.as_mut(),
result.as_ref(),
aux.as_ref(),
None,
c64::new(1., 0.),
Parallelism::None,
);
Member:

Looking at the faer docs, it looks like you may be able to just do:

let res = result * matrix;

https://docs.rs/faer/latest/faer/#basic-usage

It obviously doesn't give us the same level of control as calling faer::modules::core::mul::matmul, but I wonder if there is a performance difference with it, as I think it would look a little cleaner than calling matmul like this.

@arnaucasau (Contributor, Author):

Agreed, I think it looks cleaner too.

I wonder if there is a performance difference with it

To test it, I ran the same tests as in the issue, and it seems like we lose performance if we don't call faer::modules::core::mul::matmul:

| Size | % change avg (asterisk → matmul) | asterisk (min) | asterisk (max) | asterisk (avg) | matmul (min) | matmul (max) | matmul (avg) |
|---|---|---|---|---|---|---|---|
| 2x2 | 92.80% | 524ns | 651ns | 547.67ns | 38ns | 70ns | 39.41ns |
| 4x4 | 78.19% | 659ns | 1.00µs | 689.28ns | 145ns | 217ns | 150.34ns |

Member:

Wow, that's quite the difference; I wasn't expecting that. Then yeah, let's leave it as is.

Member:

Oh wait, what exactly were you timing here? Was it just the time to call matmul, or did you include the clone() time? The bare matmul call by itself is definitely gonna be faster, but you still need to allocate the result space, which you are in effect doing with the clone() call above. The multiplication operator just does that implicitly, so it's not a manual operation.

@arnaucasau (Contributor, Author) commented Apr 22, 2024:

Oh, yeah, I was only timing the matmul call in general, without using the same matrix as the destination either. I rewrote the test to include the clone().

So for the matmul call, I'm timing:

let aux = dst.clone();
matmul(dst.as_mut(), a.as_ref(), aux.as_ref(), Some(ALPHA), BETA, Parallelism::None);

For the other case:

dst = &dst + &a * &dst;

ALPHA and BETA are 1 for the first case.

| Size | % change avg (asterisk → matmul) | asterisk (min) | asterisk (max) | asterisk (avg) | matmul (min) | matmul (max) | matmul (avg) |
|---|---|---|---|---|---|---|---|
| 2x2 | 56.61% | 589ns | 868ns | 632.74ns | 263ns | 432ns | 274.56ns |
| 4x4 | 44.37% | 717ns | 938ns | 763.04ns | 397ns | 558ns | 424.51ns |

@mtreinish mtreinish added Changelog: None Do not include in changelog Rust This PR or issue is related to Rust code in the repository Intern PR PR submitted by IBM Quantum interns and removed Community PR PRs from contributors that are not 'members' of the Qiskit repo labels Apr 17, 2024
@mtreinish mtreinish added this to the 1.1.0 milestone Apr 17, 2024
@arnaucasau (Contributor, Author):

Thanks for the review, Matthew!

I've run the transpiler tests, but I don't see any notable improvement from modifying only this script. Similar results come from running the passes.py benchmark that calls ConsolidateBlocks. In this case, some tests are slightly faster than before, but it's not very consistent across executions.

Based on the benchmarks in the issue, I suspect this PR does improve performance, but maybe this part of the code only makes up a small portion of the execution time in the asv benchmarks, making the impact inconsistent?

Results obtained:

Benchmarks that have stayed the same:

| Change   | Before [de4b5a21] <main>   | After [a34927f3] <AC/2q_blocks_faer>   | Ratio   | Benchmark (Parameter)                                                                                         |
|----------|----------------------------|----------------------------------------|---------|---------------------------------------------------------------------------------------------------------------|
|          | 474±5ms                    | 473±4ms                                | 1.00    | transpiler_benchmarks.TranspilerBenchSuite.time_compile_from_large_qasm                                       |
|          | 3.43±0.03ms                | 3.45±0.1ms                             | 1.01    | transpiler_benchmarks.TranspilerBenchSuite.time_cx_compile                                                    |
|          | 4.52±0.1ms                 | 4.44±0.04ms                            | 0.98    | transpiler_benchmarks.TranspilerBenchSuite.time_single_gate_compile                                           |
|          | 1.25±0s                    | 1.24±0s                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(0)                          |
|          | 520±3ms                    | 521±5ms                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(1)                          |
|          | 1.97±0.02s                 | 2.02±0.01s                             | 1.03    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(2)                          |
|          | 2.53±0.01s                 | 2.56±0.02s                             | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(3)                          |
|          | 157±2ms                    | 161±7ms                                | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(0)                                       |
|          | 136±2ms                    | 135±2ms                                | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(1)                                       |
|          | 64.8±1ms                   | 64.6±0.4ms                             | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(0)                                 |
|          | 137±1ms                    | 135±3ms                                | 0.98    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(1)                                 |
|          | 276±3ms                    | 275±3ms                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(2)                                 |
|          | 154±1ms                    | 158±0.7ms                              | 1.03    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(3)                                 |
|          | 68.3±0.8ms                 | 68.0±0.8ms                             | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(0)               |
|          | 141±2ms                    | 142±1ms                                | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(1)               |
|          | 273±2ms                    | 278±2ms                                | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(2)               |
|          | 156±1ms                    | 156±2ms                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(3)               |
|          | 129±1ms                    | 128±0.8ms                              | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(0)                                      |
|          | 109±1ms                    | 111±7ms                                | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(1)                                      |
|          | 340±5ms                    | 346±1ms                                | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(2)                                      |
|          | 388±7ms                    | 389±7ms                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(3)                                      |
|          | 2565                       | 2565                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(0)                   |
|          | 1403                       | 1403                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(1)                   |
|          | 1403                       | 1403                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(2)                   |
|          | 1296                       | 1296                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(3)                   |
|          | 2705                       | 2705                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(0)                          |
|          | 2005                       | 2005                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(1)                          |
|          | 2005                       | 2005                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(2)                          |
|          | 7                          | 7                                      | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(3)                          |
|          | 2705                       | 2705                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(0)        |
|          | 2005                       | 2005                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(1)        |
|          | 2005                       | 2005                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(2)        |
|          | 7                          | 7                                      | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(3)        |
|          | 323                        | 323                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(0)                               |
|          | 336                        | 336                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(1)                               |
|          | 336                        | 336                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(2)                               |
|          | 272                        | 272                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(3)                               |
|          | 38.6±1ms                   | 38.8±2ms                               | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'dense')         |
|          | 37.1±0.3ms                 | 37.2±0.2ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'sabre')         |
|          | 85.2±0.3ms                 | 86.1±1ms                               | 1.01    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'stochastic', 'dense')    |
|          | 82.8±0.2ms                 | 83.0±0.3ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'stochastic', 'sabre')    |
|          | 43.8±1ms                   | 44.0±0.7ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'dense')         |
|          | 42.9±0.6ms                 | 42.2±0.2ms                             | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'sabre')         |
|          | 92.9±3ms                   | 90.0±0.8ms                             | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'stochastic', 'dense')    |
|          | 94.5±10ms                  | 88.6±0.7ms                             | 0.94    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'stochastic', 'sabre')    |
|          | 61.1±1ms                   | 60.1±0.5ms                             | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'dense')         |
|          | 58.9±0.3ms                 | 58.7±0.7ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'sabre')         |
|          | 116±9ms                    | 112±0.5ms                              | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'stochastic', 'dense')    |
|          | 114±6ms                    | 109±0.5ms                              | 0.95    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'stochastic', 'sabre')    |
|          | 100.0±2ms                  | 99.4±0.7ms                             | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'dense')         |
|          | 108±3ms                    | 105±0.5ms                              | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'sabre')         |
|          | 184±8ms                    | 175±1ms                                | 0.95    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'stochastic', 'dense')    |
|          | 177±3ms                    | 176±0.8ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'stochastic', 'sabre')    |
|          | 55.1±3ms                   | 52.5±0.2ms                             | 0.95    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'dense')         |
|          | 51.8±6ms                   | 50.9±0.5ms                             | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'sabre')         |
|          | 213±4ms                    | 204±2ms                                | 0.96    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'stochastic', 'dense')    |
|          | 209±1ms                    | 208±2ms                                | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'stochastic', 'sabre')    |
|          | 64.5±0.6ms                 | 64.2±0.4ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'dense')         |
|          | 65.3±0.3ms                 | 64.8±0.5ms                             | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'sabre')         |
|          | 224±6ms                    | 217±2ms                                | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'stochastic', 'dense')    |
|          | 228±3ms                    | 219±0.8ms                              | 0.96    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'stochastic', 'sabre')    |
|          | 115±8ms                    | 105±1ms                                | ~0.91   | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'dense')         |
|          | 109±1ms                    | 108±0.9ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'sabre')         |
|          | 298±30ms                   | 272±1ms                                | 0.91    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'stochastic', 'dense')    |
|          | 278±9ms                    | 271±2ms                                | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'stochastic', 'sabre')    |
|          | 213±3ms                    | 206±0.8ms                              | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'dense')         |
|          | 227±10ms                   | 220±0.5ms                              | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'sabre')         |
|          | 571±40ms                   | 554±4ms                                | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'stochastic', 'dense')    |
|          | 548±10ms                   | 535±2ms                                | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'stochastic', 'sabre')    |
|          | 51.1±0.8ms                 | 51.4±1ms                               | 1.01    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'dense')             |
|          | 47.9±0.4ms                 | 47.5±0.2ms                             | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'sabre')             |
|          | 156±5ms                    | 152±0.7ms                              | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'stochastic', 'dense')        |
|          | 191±8ms                    | 173±0.9ms                              | ~0.91   | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'stochastic', 'sabre')        |
|          | 74.1±10ms                  | 61.2±0.2ms                             | ~0.83   | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'dense')             |
|          | 60.4±0.7ms                 | 60.2±0.3ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'sabre')             |
|          | 167±2ms                    | 165±0.6ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'stochastic', 'dense')        |
|          | 173±2ms                    | 174±1ms                                | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'stochastic', 'sabre')        |
|          | 135±3ms                    | 134±1ms                                | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'dense')             |
|          | 135±1ms                    | 137±2ms                                | 1.01    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'sabre')             |
|          | 134±2ms                    | 135±2ms                                | 1.01    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'stochastic', 'dense')        |
|          | 134±0.5ms                  | 134±1ms                                | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'stochastic', 'sabre')        |
|          | 136±0.7ms                  | 135±0.7ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'dense')             |
|          | 137±0.7ms                  | 136±0.4ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'sabre')             |
|          | 140±2ms                    | 137±2ms                                | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'stochastic', 'dense')        |
|          | 136±1ms                    | 136±0.3ms                              | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'stochastic', 'sabre')        |
|          | 257                        | 257                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'dense')      |
|          | 241                        | 241                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'sabre')      |
|          | 245                        | 245                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'stochastic', 'dense') |
|          | 237                        | 237                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'stochastic', 'sabre') |
|          | 218                        | 218                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'dense')      |
|          | 212                        | 212                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'sabre')      |
|          | 221                        | 221                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'stochastic', 'dense') |
|          | 221                        | 221                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'stochastic', 'sabre') |
|          | 212                        | 212                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'dense')      |
|          | 208                        | 208                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'sabre')      |
|          | 207                        | 207                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'stochastic', 'dense') |
|          | 213                        | 213                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'stochastic', 'sabre') |
|          | 288                        | 283                                    | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'dense')      |
|          | 259                        | 263                                    | 1.02    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'sabre')      |
|          | 332                        | 329                                    | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'stochastic', 'dense') |
|          | 304                        | 312                                    | 1.03    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'stochastic', 'sabre') |
|          | 55                         | 55                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'dense')      |
|          | 58                         | 58                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'sabre')      |
|          | 58                         | 58                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'stochastic', 'dense') |
|          | 53                         | 53                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'stochastic', 'sabre') |
|          | 51                         | 51                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'dense')      |
|          | 48                         | 48                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'sabre')      |
|          | 48                         | 48                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'stochastic', 'dense') |
|          | 53                         | 53                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'stochastic', 'sabre') |
|          | 49                         | 49                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'dense')      |
|          | 48                         | 48                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'sabre')      |
|          | 44                         | 44                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'stochastic', 'dense') |
|          | 53                         | 53                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'stochastic', 'sabre') |
|          | 73                         | 73                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'dense')      |
|          | 55                         | 55                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'sabre')      |
|          | 85                         | 84                                     | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'stochastic', 'dense') |
|          | 77                         | 77                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'stochastic', 'sabre') |
|          | 610                        | 610                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'dense')      |
|          | 558                        | 558                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'sabre')      |
|          | 573                        | 573                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'stochastic', 'dense') |
|          | 545                        | 545                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'stochastic', 'sabre') |
|          | 507                        | 507                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'dense')      |
|          | 494                        | 494                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'sabre')      |
|          | 505                        | 505                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'stochastic', 'dense') |
|          | 498                        | 498                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'stochastic', 'sabre') |
|          | 488                        | 488                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'dense')      |
|          | 478                        | 478                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'sabre')      |
|          | 503                        | 503                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'stochastic', 'dense') |
|          | 457                        | 457                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'stochastic', 'sabre') |
|          | 647                        | 647                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'dense')      |
|          | 628                        | 624                                    | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'sabre')      |
|          | 769                        | 772                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'stochastic', 'dense') |
|          | 711                        | 705                                    | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'stochastic', 'sabre') |

@mtreinish (Member):

Based on the benchmarks in the #12120 (comment), I suspect this PR does improve performance, but maybe this part of the code only makes up a small portion of the execution time in the asv benchmarks, making the impact inconsistent?

Yeah, in general the ConsolidateBlocks pass hasn't been a bottleneck since we ported the linear algebra to Rust (you can see the old tracking issue for this here: #8779). So making it 50% faster is unlikely to be noticeable in a full transpile at optimization level 3 (and at level 2 after #12149) until other things are faster. This is more future-facing: once everything else in the transpiler is faster, this pass becomes more of a bottleneck. The impact for two_qubit_basis_decomposer.rs will hopefully be more noticeable, since while the Rust implementation of the linear algebra there was significantly faster, the pass is still more of a bottleneck (although not the top one anymore). I was mostly curious to confirm this though; thanks for running the benchmarks.

Comment on lines 55 to 59
[0] => identity.kron(op_matrix.as_ref()),
[1] => op_matrix.kron(identity.as_ref()),
[1, 0] => change_basis_faer(op_matrix),
[] => Mat::<c64>::identity(4, 4),
_ => op_matrix.to_owned(),
Member:

The op_matrix.to_owned() here is a bit weird to me. It feels like an unnecessary copy: we don't actually need an owned copy of the array from Python here, as it's only used in the matmul call below. The previous logic was a bit ugly but was done that way to avoid copying with to_owned(). The other option would be to maybe try something like:

            [0] => identity.kron(op_matrix.as_ref()).as_ref(),
            [1] => op_matrix.kron(identity.as_ref()).as_ref(),
            [1, 0] => change_basis_faer(op_matrix).as_ref(),
            [] => Mat::<c64>::identity(4, 4).as_ref(),
            _ => op_matrix,

and change result to be a MatRef<c64> instead of a Mat<c64>. Although I'm not sure if the compiler would be happy with this (there might be a lifetime issue doing this). The other advantage is that if result becomes a MatRef, we can create the 4x4 identity matrix as a static and avoid the allocation (although in practice I'm not sure that arm ever gets used, so it probably doesn't matter).

@arnaucasau (Contributor, Author):

I have tried the code you suggested, and as you said, the compiler is not happy about it. The error message is "temporary value is freed at the end of this statement", referring to each match arm where we use as_ref().

Eric and I have been trying to fix it, but the only alternative we found to avoid copying with to_owned() is to restore the previous logic: make the result matrix an Option<Mat<c64>> and "duplicate" the matmul call, using the unwrapped value of the Some, or op_matrix in the None case. It's not as readable as the current version, but it should be better performance-wise.

Here is the git diff of the change. What do you think about doing something like this again? I suspect we will see this case in two_qubit_decompose.rs too.

         let op_matrix = op_matrix.as_array().into_faer_complex();
 
         let result = match q_list.as_slice() {
-            [0] => identity.kron(op_matrix.as_ref()),
-            [1] => op_matrix.kron(identity.as_ref()),
-            [1, 0] => change_basis_faer(op_matrix),
-            [] => Mat::<c64>::identity(4, 4),
-            _ => op_matrix.to_owned(),
+            [0] => Some(identity.kron(op_matrix.as_ref())),
+            [1] => Some(op_matrix.kron(identity.as_ref())),
+            [1, 0] => Some(change_basis_faer(op_matrix)),
+            [] => Some(Mat::<c64>::identity(4, 4)),
+            _ => None,
         };
 
         let aux = matrix.clone();
-        matmul(
-            matrix.as_mut(),
-            result.as_ref(),
-            aux.as_ref(),
-            None,
-            c64::new(1., 0.),
-            Parallelism::None,
-        );
+        match result {
+            Some(x) => {
+                matmul(
+                    matrix.as_mut(),
+                    x.as_ref(),
+                    aux.as_ref(),
+                    None,
+                    c64::new(1., 0.),
+                    Parallelism::None,
+                );
+            },
+            None => {
+                matmul(
+                    matrix.as_mut(),
+                    op_matrix,
+                    aux.as_ref(),
+                    None,
+                    c64::new(1., 0.),
+                    Parallelism::None,
+                );
+            }
+        };
     }

Collaborator:

FYI this is what we were trying to do:

let result = match q_list.as_slice() {
    [0] => Some(identity.kron(op_matrix.as_ref())),
    [1] => Some(op_matrix.kron(identity.as_ref())),
    [1, 0] => Some(change_basis_faer(op_matrix)),
    [] => Some(Mat::<c64>::identity(4, 4)),
    _ => None,
};
let arg1 = match result {
    // Go from Mat to MatRef. This failed because the lifetime of `x`
    // ends with the match arm.
    Some(x) => x.as_ref(),
    // op_matrix is already a MatRef
    None => op_matrix,
};

Using result.as_ref().unwrap_or(op_matrix) does not work: result.as_ref() results in an Option<&Mat> rather than an Option<MatRef>, and they are not the same thing.

@mtreinish (Member) commented Apr 24, 2024:

Yeah, I think let's use this pattern; it's ugly, but it lets us avoid a copy.

@jakelishman (Member) commented Apr 25, 2024:

Using result.as_ref().unwrap_or(op_matrix) does not work.

Should this not be result.as_ref().map(|x| x.as_ref()).unwrap_or(op_matrix)?

Fwiw, match result is wrong because it's consuming the result, so there's nothing for the reference to refer to. The match needs to be on result.as_ref() to keep the underlying matrix alive, and then it's no problem if you get an &Mat - you can just call Mat::as_ref on one of those, no trouble - that's just the regular API.
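A small sketch of the working pattern (hypothetical helper, following the types in the snippets above): matching on result.as_ref() borrows the Option, so the owned Mat stays alive for as long as the returned MatRef:

// Hypothetical helper showing the borrow structure described above.
fn pick_lhs<'a>(result: &'a Option<Mat<c64>>, op_matrix: MatRef<'a, c64>) -> MatRef<'a, c64> {
    match result.as_ref() {
        // `x` is an `&'a Mat<c64>`; `Mat::as_ref` gives a `MatRef`
        // borrowing from it, so nothing is dropped early.
        Some(x) => x.as_ref(),
        // `op_matrix` is already a `MatRef`.
        None => op_matrix,
    }
}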

@arnaucasau (Contributor, Author):

Should this not be result.as_ref().map(|x| x.as_ref()).unwrap_or(op_matrix)?

Yes, that works! Much better than duplicating the matmul call, thanks Jake!

@mtreinish (Member) left a comment

This LGTM now; thanks for doing this and for all the very prompt updates and investigation on this work! I just had one small nit inline, but I think after that's resolved this should be good to merge.

Comment on lines 46 to 83
-        let op_matrix = op_matrix.as_array();
+        let op_matrix = op_matrix.as_array().into_faer_complex();

         let result = match q_list.as_slice() {
-            [0] => Some(kron(&identity, &op_matrix)),
-            [1] => Some(kron(&op_matrix, &identity)),
-            [1, 0] => Some(change_basis(op_matrix)),
-            [] => Some(Array2::eye(4)),
+            [0] => Some(identity.kron(op_matrix)),
+            [1] => Some(op_matrix.kron(identity)),
+            [1, 0] => Some(change_basis_faer(op_matrix)),
+            [] => Some(Mat::<c64>::identity(4, 4)),
             _ => None,
         };
-        matrix = match result {
-            Some(result) => result.dot(&matrix),
-            None => op_matrix.dot(&matrix),
-        };
+
+        let aux = matrix.clone();
+        match result {
+            Some(x) => {
+                matmul(
+                    matrix.as_mut(),
+                    x.as_ref(),
+                    aux.as_ref(),
+                    None,
+                    c64::new(1., 0.),
+                    Parallelism::None,
+                );
+            }
+            None => {
+                matmul(
+                    matrix.as_mut(),
+                    op_matrix,
+                    aux.as_ref(),
+                    None,
+                    c64::new(1., 0.),
+                    Parallelism::None,
+                );
+            }
+        }
Member:

I think all this is just equivalent to:

let op_matrix = op_matrix.as_array().into_faer_complex();
let expanded = match q_list.as_slice() {
    [0] => Some(identity.kron(op_matrix)),
    [1] => Some(op_matrix.kron(identity)),
    [1, 0] => Some(change_basis_faer(op_matrix)),
    [] => Some(Mat::<c64>::identity(4, 4)),
    _ => None,
};
matrix = expanded.as_ref().map(|x| x.as_ref()).unwrap_or(op_matrix) * matrix;

I could be a shade wrong about that if the parallelism effects are different, but otherwise I think this is what we're doing?

@arnaucasau (Contributor, Author) commented Apr 29, 2024:

Yeah, this should be equivalent. However, it seems like using the overloaded * operator is not as efficient as calling matmul. I created a simple example, and these were the results using cargo bench:

test tests::benchmark04_ndarray_dot          ... bench:    1,190 ns/iter (+/- 113)
test tests::benchmark05_faer_overload_matmul ... bench:    1,454 ns/iter (+/- 118)
test tests::benchmark06_faer_matmul          ... bench:     695 ns/iter (+/- 81)

The tests perform a matrix multiplication in a loop of 5 iterations, using ndarray's dot(), the * operator between two Mat<Complex64>, and matmul().

Comment on lines +54 to +57
let mut aux = Mat::<c64>::with_capacity(4, 4);
// SAFETY: `aux` is a 4x4 matrix whose values are uninitialized and it's used only to store the
// result of the `matmul` call inside the for loop
unsafe { aux.set_dims(4, 4) };
@arnaucasau (Contributor, Author):

This matrix is used to avoid having to clone the result of the matrix multiplication in each iteration of the loop. The result is initially stored in aux, and a swap operation (line 81) moves it into matrix: Mat<c64>, ready for the next iteration or for the code after the loop.
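In sketch form (a minimal sketch of the loop body, not the exact PR code; lhs here stands for the expanded gate matrix produced by the match), each iteration does:

// Write the product into the scratch matrix, then swap the two buffers so
// `matrix` holds the result for the next iteration without copying data.
matmul(
    aux.as_mut(),      // scratch destination
    lhs.as_ref(),      // the expanded gate matrix
    matrix.as_ref(),   // current accumulator
    None,
    c64::new(1., 0.),
    Parallelism::None,
);
swap(&mut aux, &mut matrix);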

@arnaucasau (Contributor, Author) commented Apr 29, 2024

Re-running the tests I wrote in the issue, I realized that using std::time::Instant to benchmark each operation might not be the best approach. It turns out there's large variability in the elapsed time when running multiple tests at once versus one by one. Moreover, the documentation says:

...instants are not guaranteed to be steady. In other words, each tick of the underlying clock might not be the same length (e.g. some seconds may be longer than others)

To double-check the results, I wrote similar tests using the cargo bench command (only available on the nightly toolchain). There, using test::black_box to prevent undesirable optimizations, I obtained the following results:

1 - In this case, the tests compare the kron() function from ndarray, the kron() method on a matrix object from faer, and the free kron() function from faer that writes into a pre-allocated matrix. The last one seems to be the best.

In this PR, because we wrap the results in an Option, we would need an auxiliary function, but I'm not sure it's worth changing again to use the kron with pre-allocation, given that there's almost no impact on performance in my tests with ASV (the kron appears in 2 out of 5 cases of the match). It could be interesting for two_qubit_decompose.rs, but let me know what you think.

test tests::benchmark01_ndarray_kron              ... bench:   12,480 ns/iter (+/- 3,439)
test tests::benchmark02_faer_overload_kron        ... bench:   25,293 ns/iter (+/- 3,565)
test tests::benchmark03_faer_kron                 ... bench:    4,399 ns/iter (+/- 671)

2 - These 3 tests compare dot(), the * operator from faer, and matmul(), simulating the case we have in the script modified by this PR, where we need to store the result in one of the matrices used in the product. In this case, using matmul() with the swap technique implemented in df0edac seems to perform much better than the alternatives.

test tests::benchmark04_ndarray_dot               ... bench:   26,438 ns/iter (+/- 5,070)
test tests::benchmark05_faer_overload_matmul      ... bench:   26,455 ns/iter (+/- 4,192)
test tests::benchmark06_faer_matmul               ... bench:    7,231 ns/iter (+/- 2,002)

3 - Finally, there are tests combining both methods. They multiply a 4x4 matrix by another one obtained as the result of a Kronecker product, in a for loop, storing the result in the first 4x4 matrix. The four options are: kron() and dot() from ndarray; kron() and matmul() from faer; kron() from ndarray with matmul() from faer; and the kron() method on a matrix object with * from faer. The best alternative is to always use faer with the extra allocation (uninitialized matrices, not using clone() or zeros()).

test tests::benchmark07_ndarray_kron_dot          ... bench:   70,301 ns/iter (+/- 5,617)
test tests::benchmark08_faer_kron_matmul          ... bench:   15,627 ns/iter (+/- 3,920)
test tests::benchmark09_ndarray_kron_matmul       ... bench:   22,966 ns/iter (+/- 1,890)
test tests::benchmark10_faer_overload_kron_matmul ... bench:   73,290 ns/iter (+/- 2,908)

All the tests were run in a loop of 100 iterations. Running the tests with few iterations, or even without a for loop, gave very inconsistent results on each run, with a large standard deviation relative to the mean. I suspect this is normal given how short each individual operation is.
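For reference, the harness pattern looks roughly like this (a minimal sketch on nightly Rust; the benchmark body, matrix contents, and import paths are illustrative and may differ from the actual test code):

#![feature(test)]
extern crate test;

#[cfg(test)]
mod tests {
    use faer::complex_native::c64;
    use faer::modules::core::mul::matmul;
    use faer::{Mat, Parallelism};
    use test::{black_box, Bencher};

    #[bench]
    fn benchmark_faer_matmul(b: &mut Bencher) {
        let a = Mat::<c64>::identity(4, 4);
        let mut matrix = Mat::<c64>::identity(4, 4);
        let mut aux = Mat::<c64>::zeros(4, 4);
        b.iter(|| {
            for _ in 0..100 {
                // black_box keeps the compiler from optimizing the product away
                matmul(
                    aux.as_mut(),
                    black_box(a.as_ref()),
                    black_box(matrix.as_ref()),
                    None,
                    c64::new(1., 0.),
                    Parallelism::None,
                );
                std::mem::swap(&mut aux, &mut matrix);
            }
        });
    }
}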

It could be worth checking the generated assembly to make sure the compiler is not optimizing anything away, but I wanted to give you all an update on the investigation, as I was getting very confusing results with the other method last week.

@jlapeyre (Contributor) commented Apr 30, 2024

This version should be faster than using a library for kron.

These functions change two things with respect to the library functions:

  • They use an explicit/unrolled version of kron for 2x2 matrices
  • They use the explicit expression that results when one of the matrices is the identity.

A couple of features:

  • There is no arithmetic at all, just copying values into a matrix.
  • It uses only ndarray. I think this package provides the standard implementation of heap-allocated multi-dimensional arrays for Rust, so we can't get more minimal than this if we want this data type.

It uses, for example,

// Compute `kron(identity, mat)` for 2x2 matrix inputs
fn kron_id2_oneq(oneq_mat: ArrayView2<Complex64>) -> Array2<Complex64> {
    let _zero = Complex64::new(0., 0.);
    array![
        [oneq_mat[[0,0]], oneq_mat[[0,1]], _zero, _zero],
        [oneq_mat[[1,0]], oneq_mat[[1,1]], _zero, _zero],
        [_zero, _zero, oneq_mat[[0,0]], oneq_mat[[0,1]]],
        [_zero, _zero, oneq_mat[[1,0]], oneq_mat[[1,1]]],
    ]
}

This could be optimized further by passing a pre-allocated array to the new kron_... functions. In fact, I am usually in favor of writing a non-allocating version and then implementing an allocating version that calls it.

I added two more functions that do the same but take pre-allocated arrays. I rather doubt this will make a big performance difference.
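A pre-allocated variant might look like the following sketch (hypothetical name; it relies on kron(I2, M) being block_diag(M, M), so it is again pure copying):

// Compute `kron(identity, mat)` into a pre-allocated 4x4 output.
fn kron_id2_oneq_into(oneq_mat: ArrayView2<Complex64>, out: &mut Array2<Complex64>) {
    out.fill(Complex64::new(0., 0.));
    for i in 0..2 {
        for j in 0..2 {
            out[[i, j]] = oneq_mat[[i, j]];         // upper-left block
            out[[i + 2, j + 2]] = oneq_mat[[i, j]]; // lower-right block
        }
    }
}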

EDIT: Notice that there was a match arm that appeared to do essentially matrix = identity * matrix. I replaced it with a no-op. Tests pass, but we should make triple-sure this is correct.

ANOTHER EDIT:

  • As mentioned above, AFAICT ndarray is the de facto standard for heap-allocated, multi-dimensional arrays. It makes sense to me to write routines for these types and convert types from other packages at the call sites, provided the conversions are free and ergonomic. Just pointing this out; the implementation I have here already does this.
  • I don't recall whether ndarray has a free, convenient way to construct its Arrays from Rust's standard fixed-size arrays. If so, it would be better to return a standard (fixed-size) Rust array from kron_id2_oneq. As it is, I return an ndarray type.

I'll try to benchmark or characterize this soon. If someone else already has code they can drop this into, feel free of course.

@mtreinish mtreinish removed this from the 1.1.0 milestone May 1, 2024
@arnaucasau (Contributor, Author):

Thanks for the comment, @jlapeyre! That's a really good idea, but I have one question. I see that you still use dot() in your implementation; are you getting better results with it compared to matmul()? I normally see a speedup when using faer.

I have been testing with your code, and this could be an example of a faer equivalent, which I think should be faster as well. However, in this case, as you said, I also think it won't make a big performance difference. What do you think?

@jlapeyre (Contributor) commented May 2, 2024

Using mat_mul_to_dst

Using dot was just an oversight. I forgot to pay attention to matmul. A routine written specifically for 4x4 matrices would probably be better. But that is for later.

It looks like ndarray has no documented way to do C <- A * B where C is preallocated, right? This is a big deficit. Is there a way to more-or-less cast the ndarray type to a faer type just for the matrix multiplication, rather than carrying the faer type around? I'm trying to think of a way to make the code more future-resistant. ndarray is a Rust standard for a linear-algebra data type (one that doesn't implement non-allocating matrix multiplication, but whatever). Or, another way to look at it: we will almost certainly depend on ndarray for the foreseeable future because of Python, but faer, cgmath, and nalgebra may come and go.

To answer your question, since ndarray has to allocate, I agree using faer for now is a good idea. It might be faster for other reasons as well.

Other things

  • Did you test the special-case kron implementations isolated from other factors? I really should do this before committing to this approach. It seems clear that it should be much faster, but I'd like to be sure. If you did not, I can do it over the next few days.

  • My implementation got rid of this [] => Array2::eye(4). I'm pretty sure multiplying by this identity matrix can be elided.

  • swap(&mut aux, &mut matrix): I think this swaps the data. It should be possible to swap some kind of pointer to the data instead, without copying it.

  • I used a verbose identifier in my implementation of kron..., which reduced readability. I prefer the lhs[(0, 0)] that you have. But what does lhs mean here? Is it like $\text{lhs} \otimes \text{id}$ and then $\text{id} \otimes \text{rhs}$? But you have lhs everywhere.

  • There are some other uses in this crate of kron with 2x2 matrices, one of which is the identity. I think it's a good idea to change all of these in this PR.

@arnaucasau (Contributor, Author):

It looks like ndarray has no documented way to do C <- A * B where C is preallocated, right?

I thought that too, but looking at the documentation, I found the function general_mat_mul. I tested it against matmul, but I still got worse results. I was using MaybeUninit to avoid initializing a new matrix, but maybe this is not the best way to do it.
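For reference, general_mat_mul computes c = alpha * (a · b) + beta * c into a pre-allocated c; a minimal sketch with illustrative operands:

use ndarray::linalg::general_mat_mul;
use ndarray::Array2;
use num_complex::Complex64;

let a = Array2::<Complex64>::eye(4);
let b = Array2::<Complex64>::eye(4);
let mut c = Array2::<Complex64>::zeros((4, 4));
// c = 1.0 * (a · b) + 0.0 * c, written into the existing buffer of `c`
general_mat_mul(Complex64::new(1., 0.), &a, &b, Complex64::new(0., 0.), &mut c);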

Is there a way to more-or-less cast the ndarray type to a faer type just for the matrix multiplication, rather than carrying the faer type around?

The only way I know is to use the faer_ext crate, which has different methods to convert an ndarray matrix into a faer one and vice versa. The downside is that it returns a view, or a MatRef, which is fine for performing the matrix multiplication, but it implies an extra copy of the matrix if we need ownership afterward.
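A sketch of those conversions (trait and method names from the faer_ext crate as used elsewhere in this PR; nd_matrix and faer_matrix are illustrative bindings):

use faer_ext::{IntoFaerComplex, IntoNdarrayComplex};

// ndarray -> faer: borrows as a MatRef<'_, c64>, no copy
let faer_view = nd_matrix.view().into_faer_complex();
// faer -> ndarray: borrows as an ArrayView2<'_, Complex64>, no copy
let nd_view = faer_matrix.as_ref().into_ndarray_complex();
// taking ownership afterward still costs a copy, e.g. faer_view.to_owned()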

Did you test the special-case kron implementations isolated from other factors?

I only tested it together with a matrix multiplication, simulating the scenario we have in this PR, but I'll benchmark all the kron functions against the special-cased versions (returning both ndarray arrays and faer matrices) 👍

swap(&mut aux, &mut matrix): I think this swaps the data.

Oh, interesting. My intention was to work with pointers here to avoid copying any matrix. I'll investigate it 🤔

is this like: lhs ⊗ id and then id ⊗ rhs? But you have lhs everywhere

Exactly! Sorry, I left lhs in the second function by mistake when copy/pasting. It is now fixed in the code.

There are some other uses in this crate of kron with 2x2 matrices, one of which is the identity. I think it's a good idea to change all of these in this PR

If I'm not mistaken, the other use of kron is in two_qubit_decompose.rs. It's a good idea to change it in this PR as well, but it might depend on the library we choose for the explicit/unrolled versions of kron. If we end up using faer, we'll need an extra copy of the matrices after the conversion, so maybe it's better (performance-wise) to switch it in a follow-up that also changes the dot calls there to use faer.

@jlapeyre
Copy link
Contributor

jlapeyre commented May 3, 2024

I found the function general_mat_mul

I didn't mention that one because I didn't see a way to skip the scalar and vector operations.

but it will imply needing an extra copy of the matrix if we need the ownership afterward.

Huh. So faer does not provide a way to transfer ownership of the underlying data as well? If so, in the medium term, we could bring it up with the author or make a PR.

it might depend on the library we choose to implement the explicit/unrolled versions of kron

As long as we have more than one library and data type representing matrices, this kind of problem will come up repeatedly. We could use a macro to handle more than one kind of data. Or use traits. Or actually, since the code is very small, just implement these methods for both kinds of matrices.
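
To illustrate the traits option (purely hypothetical code, not from this PR; the trait and method names are invented here), a small local trait over element access would let one unrolled kernel accept both matrix types:

use faer::complex_native::c64;
use faer::MatRef;
use ndarray::ArrayView2;
use num_complex::Complex64;

// Hypothetical: read-only element access, implementable for both libraries.
trait ReadMatrix {
    fn at(&self, i: usize, j: usize) -> Complex64;
}

impl ReadMatrix for ArrayView2<'_, Complex64> {
    fn at(&self, i: usize, j: usize) -> Complex64 {
        self[(i, j)]
    }
}

impl ReadMatrix for MatRef<'_, c64> {
    fn at(&self, i: usize, j: usize) -> Complex64 {
        let v = self.read(i, j);
        Complex64::new(v.re, v.im)
    }
}

// A single generic kernel then serves both data types.
fn trace_2x2<M: ReadMatrix>(m: &M) -> Complex64 {
    m.at(0, 0) + m.at(1, 1)
}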

We need to move these functions from the modules where they are applied to a common module. My best thought at the moment is something called common.rs. When the number of functions in that module starts to get large, we could split the functions out by category.

@arnaucasau
Copy link
Contributor Author

arnaucasau commented May 7, 2024

We need to move these functions from the modules where they are applied to a common module. My best thought at the moment is something called common.rs.

I like the idea! In the last commit, I moved the new kron functions there, along with the change_basis we were also using in other modules 👍

After testing the new kron functions, I see that using ndarray is always faster than faer, so I changed the code to use ndarray except where we need an owned Mat. In that case, we can use the faer version to avoid the to_owned().

For this script, the versions that need preallocated arrays don't make a big impact, so I used the allocating versions to keep the match idiomatic, and I also changed the other uses of kron in this crate (two_qubit_decompose.rs).

I've checked with the following assertions that the swap exchanges the pointers rather than copying the data, and that seems to be the case:

use std::mem::swap;

let old_ptr1 = aux.as_ptr();
let old_ptr2 = matrix.as_ptr();
// Swapping the two array values exchanges their heap pointers;
// no element data is copied.
swap(&mut aux, &mut matrix);
assert_eq!(old_ptr1, matrix.as_ptr());
assert_eq!(old_ptr2, aux.as_ptr());

Summary of the changes in this PR:

  • The use of an explicit/unrolled version of kron 2x2 that improves the performance of the kron calls
  • The use of matmul() from faer, improving the performance of the dot() calls
  • The introduction of common.rs, a module with common functions shared between different scripts

In a follow-up, I'll change the two_qubit_decompose.rs script to use matmul() and maybe add more optimizations, like the new kron functions in the common.rs module.

@arnaucasau arnaucasau changed the title Switching over convert_2q_block_matrix.rs to use matmul and kron from faer Switch convert_2q_block_matrix.rs to use matmul from faer and an explicit version of kron May 7, 2024
@arnaucasau arnaucasau changed the title Switch convert_2q_block_matrix.rs to use matmul from faer and an explicit version of kron [on hold] Switch convert_2q_block_matrix.rs to use matmul from faer and an explicit version of kron May 28, 2024
@arnaucasau arnaucasau marked this pull request as draft May 28, 2024 14:56
@arnaucasau
Copy link
Contributor Author

@mtreinish

After looking closely at two_qubit_decompose.rs, I think we should keep ndarray for matrix multiplication in the accelerate crate. Even though faer improves the performance of the multiplications for the 2x2 and 4x4 cases, we end up needing to perform extra copies of the matrices when we want to convert them into PyArray using into_pyarray_bound().

This is because we can transform a faer Mat into an ndarray array view, but we need to call to_owned() to take ownership.

An alternative to get a speedup is to follow @jlapeyre's suggestion and implement the matrix multiplication and kron by hand in fixed-size functions for the two cases. In this PR, you can find an example of this in a new module named common.rs.

In the new module we can find the following functions:

  • kron_matrix_x_identity(): A kron function that computes the Kronecker product of a given 2x2 matrix and the identity.
  • kron_identity_x_matrix(): A kron function that computes the Kronecker product of the identity and a given 2x2 matrix.
  • matrix_multiply_2x2(): A 2x2 matmul, completely unrolled.
  • matrix_multiply_4x4(): A 4x4 matmul using three for loops that I think the compiler can easily vectorize/unroll (sketched after this list).
  • determinant_4x4(): A function to calculate the determinant of a 4x4 matrix.
  • change_basis(): The same function that previously lived in convert_2q_block_matrix.rs, moved here so it can be reused by more scripts.
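
As a concrete reference for the 4x4 shape described in the list above, here is a minimal sketch (not necessarily this PR's exact code):

use ndarray::{Array2, ArrayView2};
use num_complex::Complex64;

// 4x4 matmul with fixed trip counts so the compiler can unroll/vectorize.
fn matrix_multiply_4x4(
    a: ArrayView2<Complex64>,
    b: ArrayView2<Complex64>,
) -> Array2<Complex64> {
    let mut out = Array2::<Complex64>::zeros((4, 4));
    for i in 0..4 {
        for j in 0..4 {
            let mut acc = Complex64::new(0.0, 0.0);
            for k in 0..4 {
                acc += a[(i, k)] * b[(k, j)];
            }
            out[(i, j)] = acc;
        }
    }
    out
}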

As you can see, the common.rs module has more functions than just the matrix multiplications. All of them are more efficient than the generic equivalents and help improve the performance of the crate. The determinant_4x4 function (sketched below) also lets us stop using faer for that purpose and avoid an extra copy of a matrix in one part of the code.
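
The determinant can be fully unrolled in the same spirit; a minimal sketch via cofactor expansion along the first row (again, not necessarily this PR's exact code):

use ndarray::ArrayView2;
use num_complex::Complex64;

fn determinant_4x4(m: ArrayView2<Complex64>) -> Complex64 {
    // 3x3 determinant of the minor obtained by deleting row 0 and column col.
    let minor = |col: usize| -> Complex64 {
        let keep: Vec<usize> = (0..4).filter(|&j| j != col).collect();
        let (a, b, c) = (keep[0], keep[1], keep[2]);
        m[(1, a)] * (m[(2, b)] * m[(3, c)] - m[(2, c)] * m[(3, b)])
            - m[(1, b)] * (m[(2, a)] * m[(3, c)] - m[(2, c)] * m[(3, a)])
            + m[(1, c)] * (m[(2, a)] * m[(3, b)] - m[(2, b)] * m[(3, a)])
    };
    // Alternating signs of the first-row cofactors.
    m[(0, 0)] * minor(0) - m[(0, 1)] * minor(1) + m[(0, 2)] * minor(2)
        - m[(0, 3)] * minor(3)
}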

The following benchmarks show the performance improvement of a single call to these functions compared with the ndarray and faer equivalents:

Kronecker product 2x2 (identity ⊗ matrix)

test tests::benchmark11_faer_kron_2x2              ... bench:         143 ns/iter (+/- 9)
test tests::benchmark12_ndarray_kron_2x2           ... bench:         144 ns/iter (+/- 3)
test tests::benchmark13_common_kron_2x2            ... bench:          42 ns/iter (+/- 4)

Matrix multiplication 2x2

test tests::benchmark_faer_matmul_2x2              ... bench:         150 ns/iter (+/- 2)
test tests::benchmark_ndarray_dot_2x2              ... bench:         230 ns/iter (+/- 10)
test tests::benchmark_common_matrix_multiply_2x2   ... bench:          63 ns/iter (+/- 0)

Matrix multiplication 4x4

test tests::benchmark_faer_matmul_4x4              ... bench:         285 ns/iter (+/- 6)
test tests::benchmark_ndarray_dot_4x4              ... bench:         313 ns/iter (+/- 15)
test tests::benchmark_common_matrix_multiply_4x4   ... bench:         151 ns/iter (+/- 4)

Determinant

test tests::benchmark_faer_determinant_4x4         ... bench:         521 ns/iter (+/- 19)
test tests::benchmark_common_determinant_4x4       ... bench:         124 ns/iter (+/- 9)

To test the overall performance, I have also run the transpiler benchmarks using ASV. The result was a performance increase; these are the benchmarks that improved:

| Change   | Before [9d03b4bf] <main>   | After [a650f588] <AC/custom-linalg>   |   Ratio | Benchmark (Parameter)                                                                                      |
|----------|----------------------------|---------------------------------------|---------|------------------------------------------------------------------------------------------------------------|
| -        | 564±2ms                    | 512±0.8ms                             |    0.91 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'stochastic', 'sabre') |
| -        | 302±0.2ms                  | 264±0.5ms                             |    0.87 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'sabre')      |
| -        | 126±0.8ms                  | 109±0.7ms                             |    0.86 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'dense')      |
| -        | 249±0.4ms                  | 212±0.9ms                             |    0.85 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'stochastic', 'sabre') |
| -        | 158±0.4ms                  | 133±0.3ms                             |    0.84 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'sabre')      |
| -        | 264±0.7ms                  | 222±3ms                               |    0.84 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'dense')      |
| -        | 719±0.9ms                  | 599±0.2ms                             |    0.83 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'stochastic', 'sabre') |
| -        | 161±1ms                    | 125±0.4ms                             |    0.78 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'dense')      |
| -        | 391±0.5ms                  | 306±0.5ms                             |    0.78 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'sabre')      |
| -        | 347±0.7ms                  | 259±0.4ms                             |    0.75 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'dense')      |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

I was going to add tests for common.rs, but I wanted to show all of this first to get your opinion on the change of approach.

@arnaucasau arnaucasau closed this Jan 23, 2025
Labels
Changelog: None (Do not include in changelog), Intern PR (PR submitted by IBM Quantum interns), performance, Rust (This PR or issue is related to Rust code in the repository)