
[on hold] Switch convert_2q_block_matrix.rs to use matmul from faer and an explicit version of kron #12193

Closed
wants to merge 20 commits

Conversation

@arnaucasau (Contributor) commented Apr 16, 2024

Part of #12120

This PR switches convert_2q_block_matrix.rs over to matmul from faer instead of dot from ndarray to perform matrix multiplications. It turns out that for matrices of size 4x4 or smaller, faer gives better performance than ndarray, as shown in this comment.

Summary of the changes in this PR:

  • An explicit/unrolled version of kron for 2x2 matrices, which improves the performance of the kron calls
  • The use of matmul() from faer, which improves the performance of the dot() calls
  • The introduction of common.rs, a module for functions shared between different scripts

In a follow-up, the same change will be made to the two_qubit_decompose.rs script, which in the meantime still uses ndarray and the old change_basis function.
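For context, the core replacement looks roughly like the following sketch (mirroring the matmul call quoted later in this thread, not the exact diff):

// before: matrix = result.dot(&matrix);
// after:
let aux = matrix.clone(); // matmul needs a separate destination, so keep the old accumulator
matmul(
    matrix.as_mut(),   // destination, overwritten with beta * result * aux
    result.as_ref(),   // lhs
    aux.as_ref(),      // rhs (the previous accumulator)
    None,              // alpha = None: discard the destination's old contents
    c64::new(1., 0.),  // beta = 1
    Parallelism::None, // 4x4 matrices are too small to benefit from threads
);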

@qiskit-bot qiskit-bot added the Community PR PRs from contributors that are not 'members' of the Qiskit repo label Apr 16, 2024
@arnaucasau (Contributor, Author) left a comment

To test the code, I ran the tests in the transpiler folder (/test/python/transpiler), especially test_consolidate_blocks.py, which, as far as I can tell, exercises the Rust function through its call to _block_to_matrix. My first idea was to write a new test, but perhaps this one is enough to verify the correct behavior of the code. What do you think?

/// This function will substitute `change_basis` once
/// `two_qubit_decompose.rs` uses Mat<c64> instead of ArrayView2
#[inline]
pub fn change_basis_faer(matrix: Mat<c64>) -> Mat<c64> {
@arnaucasau (Contributor, Author):

The idea is that this function ends up replacing change_basis once two_qubit_decompose.rs uses Mat<c64> instead of ArrayView2. I created a separate function just to avoid modifying the other file until the follow-up. After that, the function will be renamed back to change_basis.
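For reference, swapping the qubit order of a two-qubit unitary amounts to conjugation by SWAP, i.e. permuting rows and columns 1 and 2 of the 4x4 matrix. A minimal faer sketch of that semantic (hypothetical code, not the PR's actual implementation):

// Swap qubit order by conjugating with SWAP: apply the index
// permutation [0, 2, 1, 3] to both rows and columns.
fn change_basis_faer(matrix: Mat<c64>) -> Mat<c64> {
    const PERM: [usize; 4] = [0, 2, 1, 3];
    Mat::from_fn(4, 4, |i, j| matrix.read(PERM[i], PERM[j]))
}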

Collaborator:

Arnau and I discussed this as the best approach because otherwise two_qubit_decompose.rs would have needed non-trivial edits.

@arnaucasau arnaucasau marked this pull request as ready for review April 16, 2024 17:56
@arnaucasau arnaucasau requested a review from a team as a code owner April 16, 2024 17:56
@qiskit-bot (Collaborator):

Thank you for opening a new pull request.

Before your PR can be merged it will first need to pass continuous integration tests and be reviewed. Sometimes the review process can be slow, so please be patient.

While you're waiting, please feel free to review other open PRs. While only a subset of people are authorized to approve pull requests for merging, everyone is encouraged to review open pull requests. Doing reviews helps reduce the burden on the core team and helps make the project's code better for everyone.

One or more of the following people are requested to review this:

@coveralls commented Apr 16, 2024

Pull Request Test Coverage Report for Build 8986784365

Details

  • 84 of 86 (97.67%) changed or added relevant lines in 3 files are covered.
  • 7 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.02%) to 89.648%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| crates/accelerate/src/convert_2q_block_matrix.rs | 31 | 33 | 93.94% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| crates/accelerate/src/convert_2q_block_matrix.rs | 1 | 94.0% |
| crates/qasm2/src/lex.rs | 6 | 92.11% |

Totals Coverage Status
  • Change from base Build 8983739322: 0.02%
  • Covered Lines: 62320
  • Relevant Lines: 69516

💛 - Coveralls

@mtreinish (Member) left a comment

Overall this looks good, thanks for getting this started. I have a couple of inline suggestions and comments, but this is a great start! Do you have any benchmarks comparing the performance of a transpile or the ConsolidateBlocks pass with this change? I'm curious to see what the user-facing performance changes will look like with this PR.

Comment on lines 66 to 73
matmul(
matrix.as_mut(),
result.as_ref(),
aux.as_ref(),
None,
c64::new(1., 0.),
Parallelism::None,
);
Member:

Looking at the faer docs, it looks like you may be able to just do:

let res = result * matrix;

https://docs.rs/faer/latest/faer/#basic-usage

It obviously doesn't give us the same level of control as calling faer::modules::core::mul::matmul, but I wonder if there is a performance difference with it, as I think it would look a little cleaner than calling matmul like this.

@arnaucasau (Contributor, Author):

Agreed, I think it looks cleaner too.

I wonder if there is a performance difference with it

To test it, I ran the same tests as in the issue, and it seems like we lose performance if we don't call faer::modules::core::mul::matmul:

| Size | % change avg (asterisk → matmul) | asterisk (min) | asterisk (max) | asterisk (avg) | matmul (min) | matmul (max) | matmul (avg) |
|---|---|---|---|---|---|---|---|
| 2x2 | 92.80% | 524ns | 651ns | 547.67ns | 38ns | 70ns | 39.41ns |
| 4x4 | 78.19% | 659ns | 1.00µs | 689.28ns | 145ns | 217ns | 150.34ns |

Member:

Wow, that's quite the difference; I wasn't expecting that. Then yeah, let's leave it as is.

Member:

Oh wait, what exactly were you timing here? Was it just the time to call matmul, or did you include the clone() time? The bare matmul call by itself is definitely gonna be faster, but you still need to allocate the result space, which you are in effect doing with the clone() call above. The multiplication operator just does that implicitly, so it's not a manual operation.

@arnaucasau (Contributor, Author) commented Apr 22, 2024:

Oh, yeah, I was only timing the matmul call in general, without using the same matrix as the destination either. I rewrote the test to include the clone().

So for the matmul call, I'm timing:

let aux = dst.clone();
matmul(dst.as_mut(), a.as_ref(), aux.as_ref(), Some(ALPHA), BETA, Parallelism::None);

For the other case:

dst = &dst + &a * &dst;

ALPHA and BETA are 1 for the first case.

| Size | % change avg (asterisk → matmul) | asterisk (min) | asterisk (max) | asterisk (avg) | matmul (min) | matmul (max) | matmul (avg) |
|---|---|---|---|---|---|---|---|
| 2x2 | 56.61% | 589ns | 868ns | 632.74ns | 263ns | 432ns | 274.56ns |
| 4x4 | 44.37% | 717ns | 938ns | 763.04ns | 397ns | 558ns | 424.51ns |

@mtreinish mtreinish added Changelog: None Do not include in changelog Rust This PR or issue is related to Rust code in the repository Intern PR PR submitted by IBM Quantum interns and removed Community PR PRs from contributors that are not 'members' of the Qiskit repo labels Apr 17, 2024
@mtreinish mtreinish added this to the 1.1.0 milestone Apr 17, 2024
@arnaucasau (Contributor, Author):

Thanks for the review, Matthew!

I've run the transpiler tests, but I don't see any notable improvement from modifying only this script. Similar results come from running the passes.py benchmark that calls ConsolidateBlocks. In this case, some tests are slightly faster than before, but it's not very consistent across executions.

Based on the benchmarks in the issue, I suspect this PR does improve performance, but maybe this part of the code only makes up a small portion of the execution time in the asv benchmarks, making the impact inconsistent?

Results obtained:

Benchmarks that have stayed the same:

| Change   | Before [de4b5a21] <main>   | After [a34927f3] <AC/2q_blocks_faer>   | Ratio   | Benchmark (Parameter)                                                                                         |
|----------|----------------------------|----------------------------------------|---------|---------------------------------------------------------------------------------------------------------------|
|          | 474±5ms                    | 473±4ms                                | 1.00    | transpiler_benchmarks.TranspilerBenchSuite.time_compile_from_large_qasm                                       |
|          | 3.43±0.03ms                | 3.45±0.1ms                             | 1.01    | transpiler_benchmarks.TranspilerBenchSuite.time_cx_compile                                                    |
|          | 4.52±0.1ms                 | 4.44±0.04ms                            | 0.98    | transpiler_benchmarks.TranspilerBenchSuite.time_single_gate_compile                                           |
|          | 1.25±0s                    | 1.24±0s                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(0)                          |
|          | 520±3ms                    | 521±5ms                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(1)                          |
|          | 1.97±0.02s                 | 2.02±0.01s                             | 1.03    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(2)                          |
|          | 2.53±0.01s                 | 2.56±0.02s                             | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(3)                          |
|          | 157±2ms                    | 161±7ms                                | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(0)                                       |
|          | 136±2ms                    | 135±2ms                                | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(1)                                       |
|          | 64.8±1ms                   | 64.6±0.4ms                             | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(0)                                 |
|          | 137±1ms                    | 135±3ms                                | 0.98    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(1)                                 |
|          | 276±3ms                    | 275±3ms                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(2)                                 |
|          | 154±1ms                    | 158±0.7ms                              | 1.03    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(3)                                 |
|          | 68.3±0.8ms                 | 68.0±0.8ms                             | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(0)               |
|          | 141±2ms                    | 142±1ms                                | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(1)               |
|          | 273±2ms                    | 278±2ms                                | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(2)               |
|          | 156±1ms                    | 156±2ms                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(3)               |
|          | 129±1ms                    | 128±0.8ms                              | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(0)                                      |
|          | 109±1ms                    | 111±7ms                                | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(1)                                      |
|          | 340±5ms                    | 346±1ms                                | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(2)                                      |
|          | 388±7ms                    | 389±7ms                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(3)                                      |
|          | 2565                       | 2565                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(0)                   |
|          | 1403                       | 1403                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(1)                   |
|          | 1403                       | 1403                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(2)                   |
|          | 1296                       | 1296                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(3)                   |
|          | 2705                       | 2705                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(0)                          |
|          | 2005                       | 2005                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(1)                          |
|          | 2005                       | 2005                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(2)                          |
|          | 7                          | 7                                      | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(3)                          |
|          | 2705                       | 2705                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(0)        |
|          | 2005                       | 2005                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(1)        |
|          | 2005                       | 2005                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(2)        |
|          | 7                          | 7                                      | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(3)        |
|          | 323                        | 323                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(0)                               |
|          | 336                        | 336                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(1)                               |
|          | 336                        | 336                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(2)                               |
|          | 272                        | 272                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(3)                               |
|          | 38.6±1ms                   | 38.8±2ms                               | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'dense')         |
|          | 37.1±0.3ms                 | 37.2±0.2ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'sabre')         |
|          | 85.2±0.3ms                 | 86.1±1ms                               | 1.01    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'stochastic', 'dense')    |
|          | 82.8±0.2ms                 | 83.0±0.3ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'stochastic', 'sabre')    |
|          | 43.8±1ms                   | 44.0±0.7ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'dense')         |
|          | 42.9±0.6ms                 | 42.2±0.2ms                             | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'sabre')         |
|          | 92.9±3ms                   | 90.0±0.8ms                             | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'stochastic', 'dense')    |
|          | 94.5±10ms                  | 88.6±0.7ms                             | 0.94    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'stochastic', 'sabre')    |
|          | 61.1±1ms                   | 60.1±0.5ms                             | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'dense')         |
|          | 58.9±0.3ms                 | 58.7±0.7ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'sabre')         |
|          | 116±9ms                    | 112±0.5ms                              | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'stochastic', 'dense')    |
|          | 114±6ms                    | 109±0.5ms                              | 0.95    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'stochastic', 'sabre')    |
|          | 100.0±2ms                  | 99.4±0.7ms                             | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'dense')         |
|          | 108±3ms                    | 105±0.5ms                              | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'sabre')         |
|          | 184±8ms                    | 175±1ms                                | 0.95    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'stochastic', 'dense')    |
|          | 177±3ms                    | 176±0.8ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'stochastic', 'sabre')    |
|          | 55.1±3ms                   | 52.5±0.2ms                             | 0.95    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'dense')         |
|          | 51.8±6ms                   | 50.9±0.5ms                             | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'sabre')         |
|          | 213±4ms                    | 204±2ms                                | 0.96    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'stochastic', 'dense')    |
|          | 209±1ms                    | 208±2ms                                | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'stochastic', 'sabre')    |
|          | 64.5±0.6ms                 | 64.2±0.4ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'dense')         |
|          | 65.3±0.3ms                 | 64.8±0.5ms                             | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'sabre')         |
|          | 224±6ms                    | 217±2ms                                | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'stochastic', 'dense')    |
|          | 228±3ms                    | 219±0.8ms                              | 0.96    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'stochastic', 'sabre')    |
|          | 115±8ms                    | 105±1ms                                | ~0.91   | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'dense')         |
|          | 109±1ms                    | 108±0.9ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'sabre')         |
|          | 298±30ms                   | 272±1ms                                | 0.91    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'stochastic', 'dense')    |
|          | 278±9ms                    | 271±2ms                                | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'stochastic', 'sabre')    |
|          | 213±3ms                    | 206±0.8ms                              | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'dense')         |
|          | 227±10ms                   | 220±0.5ms                              | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'sabre')         |
|          | 571±40ms                   | 554±4ms                                | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'stochastic', 'dense')    |
|          | 548±10ms                   | 535±2ms                                | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'stochastic', 'sabre')    |
|          | 51.1±0.8ms                 | 51.4±1ms                               | 1.01    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'dense')             |
|          | 47.9±0.4ms                 | 47.5±0.2ms                             | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'sabre')             |
|          | 156±5ms                    | 152±0.7ms                              | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'stochastic', 'dense')        |
|          | 191±8ms                    | 173±0.9ms                              | ~0.91   | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'stochastic', 'sabre')        |
|          | 74.1±10ms                  | 61.2±0.2ms                             | ~0.83   | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'dense')             |
|          | 60.4±0.7ms                 | 60.2±0.3ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'sabre')             |
|          | 167±2ms                    | 165±0.6ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'stochastic', 'dense')        |
|          | 173±2ms                    | 174±1ms                                | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'stochastic', 'sabre')        |
|          | 135±3ms                    | 134±1ms                                | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'dense')             |
|          | 135±1ms                    | 137±2ms                                | 1.01    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'sabre')             |
|          | 134±2ms                    | 135±2ms                                | 1.01    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'stochastic', 'dense')        |
|          | 134±0.5ms                  | 134±1ms                                | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'stochastic', 'sabre')        |
|          | 136±0.7ms                  | 135±0.7ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'dense')             |
|          | 137±0.7ms                  | 136±0.4ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'sabre')             |
|          | 140±2ms                    | 137±2ms                                | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'stochastic', 'dense')        |
|          | 136±1ms                    | 136±0.3ms                              | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'stochastic', 'sabre')        |
|          | 257                        | 257                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'dense')      |
|          | 241                        | 241                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'sabre')      |
|          | 245                        | 245                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'stochastic', 'dense') |
|          | 237                        | 237                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'stochastic', 'sabre') |
|          | 218                        | 218                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'dense')      |
|          | 212                        | 212                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'sabre')      |
|          | 221                        | 221                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'stochastic', 'dense') |
|          | 221                        | 221                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'stochastic', 'sabre') |
|          | 212                        | 212                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'dense')      |
|          | 208                        | 208                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'sabre')      |
|          | 207                        | 207                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'stochastic', 'dense') |
|          | 213                        | 213                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'stochastic', 'sabre') |
|          | 288                        | 283                                    | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'dense')      |
|          | 259                        | 263                                    | 1.02    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'sabre')      |
|          | 332                        | 329                                    | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'stochastic', 'dense') |
|          | 304                        | 312                                    | 1.03    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'stochastic', 'sabre') |
|          | 55                         | 55                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'dense')      |
|          | 58                         | 58                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'sabre')      |
|          | 58                         | 58                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'stochastic', 'dense') |
|          | 53                         | 53                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'stochastic', 'sabre') |
|          | 51                         | 51                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'dense')      |
|          | 48                         | 48                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'sabre')      |
|          | 48                         | 48                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'stochastic', 'dense') |
|          | 53                         | 53                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'stochastic', 'sabre') |
|          | 49                         | 49                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'dense')      |
|          | 48                         | 48                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'sabre')      |
|          | 44                         | 44                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'stochastic', 'dense') |
|          | 53                         | 53                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'stochastic', 'sabre') |
|          | 73                         | 73                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'dense')      |
|          | 55                         | 55                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'sabre')      |
|          | 85                         | 84                                     | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'stochastic', 'dense') |
|          | 77                         | 77                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'stochastic', 'sabre') |
|          | 610                        | 610                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'dense')      |
|          | 558                        | 558                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'sabre')      |
|          | 573                        | 573                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'stochastic', 'dense') |
|          | 545                        | 545                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'stochastic', 'sabre') |
|          | 507                        | 507                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'dense')      |
|          | 494                        | 494                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'sabre')      |
|          | 505                        | 505                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'stochastic', 'dense') |
|          | 498                        | 498                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'stochastic', 'sabre') |
|          | 488                        | 488                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'dense')      |
|          | 478                        | 478                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'sabre')      |
|          | 503                        | 503                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'stochastic', 'dense') |
|          | 457                        | 457                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'stochastic', 'sabre') |
|          | 647                        | 647                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'dense')      |
|          | 628                        | 624                                    | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'sabre')      |
|          | 769                        | 772                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'stochastic', 'dense') |
|          | 711                        | 705                                    | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'stochastic', 'sabre') |

@mtreinish (Member):

Based on the benchmarks in the #12120 (comment), I suspect this PR does improve performance, but maybe this part of the code only makes up a small portion of the execution time in the asv benchmarks, making the impact inconsistent?

Yeah, in general the ConsolidateBlocks pass hasn't been a bottleneck since we ported the linear algebra to Rust (you can see the old tracking issue for this here: #8779). So making it 50% faster is unlikely to be noticeable in a full transpile at optimization level 3 (and at level 2 after #12149) until other things are faster. This is more future-facing: once everything else in the transpiler is faster, this pass becomes more of a bottleneck. The impact for two_qubit_basis_decomposer.rs will hopefully be more noticeable, since while the Rust implementation of the linear algebra there was significantly faster, the pass is still more of a bottleneck (although not the top one anymore). I was mostly curious to confirm this though; thanks for running the benchmarks.

Comment on lines 55 to 59
[0] => identity.kron(op_matrix.as_ref()),
[1] => op_matrix.kron(identity.as_ref()),
[1, 0] => change_basis_faer(op_matrix),
[] => Mat::<c64>::identity(4, 4),
_ => op_matrix.to_owned(),
Member:

The op_matrix.to_owned() here is a bit weird to me. It feels like an unnecessary copy: we don't actually need an owned copy of the array from Python here, as it's only used in the matmul call below. The previous logic was a bit ugly but was done that way to avoid copying with to_owned(). The other option would be to maybe try something like:

            [0] => identity.kron(op_matrix.as_ref()).as_ref(),
            [1] => op_matrix.kron(identity.as_ref()).as_ref(),
            [1, 0] => change_basis_faer(op_matrix).as_ref(),
            [] => Mat::<c64>::identity(4, 4).as_ref(),
            _ => op_matrix,

and change result to be a MatRef<c64> instead of a Mat<c64>. Although I'm not sure if the compiler would be happy with this (there might be a lifetime issue doing this). The other advantage is that if result becomes a MatRef, we can create the 4x4 identity matrix as a static and avoid the allocation (although in practice I'm not sure that arm ever gets used, so it probably doesn't matter).

@arnaucasau (Contributor, Author):

I have tried the code you suggested, and as you said, the compiler is not happy about it. The error message is "temporary value is freed at the end of this statement", referring to each match arm where we use as_ref().

Eric and I have been trying to fix it, but the only alternative we found to avoid copying with to_owned() is to restore the previous logic: make the result matrix an Option<Mat<c64>> and "duplicate" the matmul call, using the unwrapped value of the Some, or op_matrix in the None case. It's not as readable as the current version, but it should be better performance-wise.

Here is the git diff of the change. What do you think about doing something like this again? I suspect we will see this case in two_qubit_decompose.rs too.

         let op_matrix = op_matrix.as_array().into_faer_complex();
 
         let result = match q_list.as_slice() {
-            [0] => identity.kron(op_matrix.as_ref()),
-            [1] => op_matrix.kron(identity.as_ref()),
-            [1, 0] => change_basis_faer(op_matrix),
-            [] => Mat::<c64>::identity(4, 4),
-            _ => op_matrix.to_owned(),
+            [0] => Some(identity.kron(op_matrix.as_ref())),
+            [1] => Some(op_matrix.kron(identity.as_ref())),
+            [1, 0] => Some(change_basis_faer(op_matrix)),
+            [] => Some(Mat::<c64>::identity(4, 4)),
+            _ => None,
         };
 
         let aux = matrix.clone();
-        matmul(
-            matrix.as_mut(),
-            result.as_ref(),
-            aux.as_ref(),
-            None,
-            c64::new(1., 0.),
-            Parallelism::None,
-        );
+        match result {
+            Some(x) => {
+                matmul(
+                    matrix.as_mut(),
+                    x.as_ref(),
+                    aux.as_ref(),
+                    None,
+                    c64::new(1., 0.),
+                    Parallelism::None,
+                );
+            },
+            None => {
+                matmul(
+                    matrix.as_mut(),
+                    op_matrix,
+                    aux.as_ref(),
+                    None,
+                    c64::new(1., 0.),
+                    Parallelism::None,
+                );
+            }
+        };
     }

Collaborator:

FYI this is what we were trying to do:

let result = match q_list.as_slice() {
    [0] => Some(identity.kron(op_matrix.as_ref())),
    [1] => Some(op_matrix.kron(identity.as_ref())),
    [1, 0] => Some(change_basis_faer(op_matrix)),
    [] => Some(Mat::<c64>::identity(4, 4)),
    _ => None,
};
let arg1 = match result {
    // Go from Mat to MatRef. This failed because the lifetime of `x`
    // ends with the match arm.
    Some(x) => x.as_ref(),
    // op_matrix is already a MatRef
    None => op_matrix,
};

Using result.as_ref().unwrap_or(op_matrix) does not work: result.as_ref() results in an Option<&Mat> rather than an Option<MatRef>, and they are not the same thing.

@mtreinish (Member) commented Apr 24, 2024:

Yeah, I think let's use this pattern; it's ugly, but it lets us avoid a copy.

@jakelishman (Member) commented Apr 25, 2024:

Using result.as_ref().unwrap_or(op_matrix) does not work.

Should this not be result.as_ref().map(|x| x.as_ref()).unwrap_or(op_matrix)?

Fwiw, match result is wrong because it's consuming the result, so there's nothing for the reference to refer to. The match needs to be on result.as_ref() to keep the underlying matrix alive, and then it's no problem if you get an &Mat - you can just call Mat::as_ref on one of those, no trouble - that's just the regular API.
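A small sketch of the working pattern (hypothetical helper, following the types in the snippets above): matching on result.as_ref() borrows the Option, so the owned Mat stays alive for as long as the returned MatRef:

// Hypothetical helper showing the borrow structure described above.
fn pick_lhs<'a>(result: &'a Option<Mat<c64>>, op_matrix: MatRef<'a, c64>) -> MatRef<'a, c64> {
    match result.as_ref() {
        // `x` is an `&'a Mat<c64>`; `Mat::as_ref` gives a `MatRef`
        // borrowing from it, so nothing is dropped early.
        Some(x) => x.as_ref(),
        // `op_matrix` is already a `MatRef`.
        None => op_matrix,
    }
}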

@arnaucasau (Contributor, Author):

Should this not be result.as_ref().map(|x| x.as_ref()).unwrap_or(op_matrix)?

Yes, that works! Much better than duplicating the matmul call, thanks Jake!

@mtreinish (Member) left a comment

This LGTM now; thanks for doing this and for all the very prompt updates and investigation on this work! I just had one small nit inline, but I think after that's resolved this should be good to merge.

Comment on lines 46 to 83
-        let op_matrix = op_matrix.as_array();
+        let op_matrix = op_matrix.as_array().into_faer_complex();

         let result = match q_list.as_slice() {
-            [0] => Some(kron(&identity, &op_matrix)),
-            [1] => Some(kron(&op_matrix, &identity)),
-            [1, 0] => Some(change_basis(op_matrix)),
-            [] => Some(Array2::eye(4)),
+            [0] => Some(identity.kron(op_matrix)),
+            [1] => Some(op_matrix.kron(identity)),
+            [1, 0] => Some(change_basis_faer(op_matrix)),
+            [] => Some(Mat::<c64>::identity(4, 4)),
             _ => None,
         };
-        matrix = match result {
-            Some(result) => result.dot(&matrix),
-            None => op_matrix.dot(&matrix),
-        };
+
+        let aux = matrix.clone();
+        match result {
+            Some(x) => {
+                matmul(
+                    matrix.as_mut(),
+                    x.as_ref(),
+                    aux.as_ref(),
+                    None,
+                    c64::new(1., 0.),
+                    Parallelism::None,
+                );
+            }
+            None => {
+                matmul(
+                    matrix.as_mut(),
+                    op_matrix,
+                    aux.as_ref(),
+                    None,
+                    c64::new(1., 0.),
+                    Parallelism::None,
+                );
+            }
+        }
Member:

I think all this is just equivalent to:

let op_matrix = op_matrix.as_array().into_faer_complex();
let expanded = match q_list.as_slice() {
    [0] => Some(identity.kron(op_matrix)),
    [1] => Some(op_matrix.kron(identity)),
    [1, 0] => Some(change_basis_faer(op_matrix)),
    [] => Some(Mat::<c64>::identity(4, 4)),
    _ => None,
};
matrix = expanded.as_ref().map(|x| x.as_ref()).unwrap_or(op_matrix) * matrix;

I could be a shade wrong about that if the parallelism effects are different, but otherwise I think this is what we're doing?

@arnaucasau (Contributor, Author) commented Apr 29, 2024:

Yeah, this should be equivalent. However, it seems like using the overloaded * operator is not as efficient as calling matmul. I created a simple example, and these were the results using cargo bench:

test tests::benchmark04_ndarray_dot          ... bench:    1,190 ns/iter (+/- 113)
test tests::benchmark05_faer_overload_matmul ... bench:    1,454 ns/iter (+/- 118)
test tests::benchmark06_faer_matmul          ... bench:     695 ns/iter (+/- 81)

The tests perform a matrix multiplication in a loop of 5 iterations, using ndarray's dot(), the * operator between two Mat<Complex64>, and matmul().

Comment on lines +54 to +57
let mut aux = Mat::<c64>::with_capacity(4, 4);
// SAFETY: `aux` is a 4x4 matrix whose values are uninitialized and it's used only to store the
// result of the `matmul` call inside the for loop
unsafe { aux.set_dims(4, 4) };
@arnaucasau (Contributor, Author):

This matrix is used to avoid having to clone the result of the matrix multiplication in each iteration of the loop. The result is initially stored in aux, and a swap operation (line 81) moves it into matrix: Mat<c64>, ready for the next iteration or for the code after the loop.
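In sketch form (a minimal sketch of the loop body, not the exact PR code; lhs here stands for the expanded gate matrix produced by the match), each iteration does:

// Write the product into the scratch matrix, then swap the two buffers so
// `matrix` holds the result for the next iteration without copying data.
matmul(
    aux.as_mut(),      // scratch destination
    lhs.as_ref(),      // the expanded gate matrix
    matrix.as_ref(),   // current accumulator
    None,
    c64::new(1., 0.),
    Parallelism::None,
);
swap(&mut aux, &mut matrix);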

@arnaucasau (Contributor, Author) commented Apr 29, 2024

Re-running the tests I wrote in the issue, I realized that using std::time::Instant to benchmark each operation might not be the best approach. It turns out there's large variability in the elapsed time when running multiple tests at once versus one by one. Moreover, the documentation says:

...instants are not guaranteed to be steady. In other words, each tick of the underlying clock might not be the same length (e.g. some seconds may be longer than others)

To double-check the results, I wrote similar tests using the cargo bench command (only available on the nightly toolchain). There, using test::black_box to prevent undesirable optimizations, I obtained the following results:

1 - In this case, the tests compare the kron() function from ndarray, the kron() method on a matrix object from faer, and the free kron() function from faer that writes into a pre-allocated matrix. The last one seems to be the best.

In this PR, because we wrap the results in an Option, we would need an auxiliary function, but I'm not sure it's worth changing again to use the kron with pre-allocation, given that there's almost no impact on performance in my tests with ASV (the kron appears in 2 out of 5 cases of the match). It could be interesting for two_qubit_decompose.rs, but let me know what you think.

test tests::benchmark01_ndarray_kron              ... bench:   12,480 ns/iter (+/- 3,439)
test tests::benchmark02_faer_overload_kron        ... bench:   25,293 ns/iter (+/- 3,565)
test tests::benchmark03_faer_kron                 ... bench:    4,399 ns/iter (+/- 671)

2 - These 3 tests compare dot(), the * operator from faer, and matmul(), simulating the case we have in the script modified by this PR, where we need to store the result in one of the matrices used in the product. In this case, using matmul() with the swap technique implemented in df0edac seems to perform much better than the alternatives.

test tests::benchmark04_ndarray_dot               ... bench:   26,438 ns/iter (+/- 5,070)
test tests::benchmark05_faer_overload_matmul      ... bench:   26,455 ns/iter (+/- 4,192)
test tests::benchmark06_faer_matmul               ... bench:    7,231 ns/iter (+/- 2,002)

3 - Finally, there are tests combining both methods. They multiply a 4x4 matrix by another one obtained as the result of a Kronecker product, in a for loop, storing the result in the first 4x4 matrix. The four options are: kron() and dot() from ndarray; kron() and matmul() from faer; kron() from ndarray with matmul() from faer; and the kron() method on a matrix object with * from faer. The best alternative is to always use faer with the extra allocation (uninitialized matrices, not using clone() or zeros()).

test tests::benchmark07_ndarray_kron_dot          ... bench:   70,301 ns/iter (+/- 5,617)
test tests::benchmark08_faer_kron_matmul          ... bench:   15,627 ns/iter (+/- 3,920)
test tests::benchmark09_ndarray_kron_matmul       ... bench:   22,966 ns/iter (+/- 1,890)
test tests::benchmark10_faer_overload_kron_matmul ... bench:   73,290 ns/iter (+/- 2,908)

All the tests were run in a loop of 100 iterations. Running the tests with few iterations, or even without a for loop, gave very inconsistent results on each run, with a large standard deviation relative to the mean. I suspect this is normal given how short each individual operation is.
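For reference, the harness pattern looks roughly like this (a minimal sketch on nightly Rust; the benchmark body, matrix contents, and import paths are illustrative and may differ from the actual test code):

#![feature(test)]
extern crate test;

#[cfg(test)]
mod tests {
    use faer::complex_native::c64;
    use faer::modules::core::mul::matmul;
    use faer::{Mat, Parallelism};
    use test::{black_box, Bencher};

    #[bench]
    fn benchmark_faer_matmul(b: &mut Bencher) {
        let a = Mat::<c64>::identity(4, 4);
        let mut matrix = Mat::<c64>::identity(4, 4);
        let mut aux = Mat::<c64>::zeros(4, 4);
        b.iter(|| {
            for _ in 0..100 {
                // black_box keeps the compiler from optimizing the product away
                matmul(
                    aux.as_mut(),
                    black_box(a.as_ref()),
                    black_box(matrix.as_ref()),
                    None,
                    c64::new(1., 0.),
                    Parallelism::None,
                );
                std::mem::swap(&mut aux, &mut matrix);
            }
        });
    }
}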

It could be worth checking the generated assembly to make sure the compiler is not optimizing anything away, but I wanted to give you all an update on the investigation, as I was getting very confusing results with the other method last week.

@jlapeyre (Contributor) commented Apr 30, 2024

This version should be faster than using a library for kron.

These functions change two things with respect to the library functions:

  • They use an explicit/unrolled version of kron for 2x2 matrices
  • They use the explicit expression that results when one of the matrices is the identity.

A couple of features:

  • There is no arithmetic at all, just copying values into a matrix.
  • It uses only ndarray. I think this package provides the standard implementation of heap-allocated multi-dimensional arrays for Rust, so we can't get more minimal than this if we want this data type.

It uses, for example,

// Compute `kron(identity, mat)` for 2x2 matrix inputs
fn kron_id2_oneq(oneq_mat: ArrayView2<Complex64>) -> Array2<Complex64> {
    let _zero = Complex64::new(0., 0.);
    array![
        [oneq_mat[[0,0]], oneq_mat[[0,1]], _zero, _zero],
        [oneq_mat[[1,0]], oneq_mat[[1,1]], _zero, _zero],
        [_zero, _zero, oneq_mat[[0,0]], oneq_mat[[0,1]]],
        [_zero, _zero, oneq_mat[[1,0]], oneq_mat[[1,1]]],
    ]
}

This could be optimized further by passing a pre-allocated array to the new kron_... functions. In fact, I am usually in favor of writing a non-allocating version and then implementing an allocating version that calls it.

I added two more functions that do the same but take pre-allocated arrays. I rather doubt this will make a big performance difference.
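A pre-allocated variant might look like the following sketch (hypothetical name; it relies on kron(I2, M) being block_diag(M, M), so it is again pure copying):

// Compute `kron(identity, mat)` into a pre-allocated 4x4 output.
fn kron_id2_oneq_into(oneq_mat: ArrayView2<Complex64>, out: &mut Array2<Complex64>) {
    out.fill(Complex64::new(0., 0.));
    for i in 0..2 {
        for j in 0..2 {
            out[[i, j]] = oneq_mat[[i, j]];         // upper-left block
            out[[i + 2, j + 2]] = oneq_mat[[i, j]]; // lower-right block
        }
    }
}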

EDIT: Notice that there was a match arm that appeared to do essentially matrix = identity * matrix. I replaced it with a no-op. Tests pass, but we should make triple-sure this is correct.

ANOTHER EDIT:

  • As mentioned above, AFAICT ndarray is the de facto standard for heap-allocated, multi-dimensional arrays. It makes sense to me to write routines for these types and convert types from other packages at the call sites, provided the conversions are free and ergonomic. Just pointing this out; the implementation I have here already does this.
  • I don't recall whether ndarray has a free, convenient way to construct its Arrays from Rust's standard fixed-size arrays. If so, it would be better to return a standard (fixed-size) Rust array from kron_id2_oneq. As it is, I return an ndarray type.

I'll try to benchmark or characterize this soon. If someone else already has code they can drop this into, feel free of course.

@mtreinish mtreinish removed this from the 1.1.0 milestone May 1, 2024
@arnaucasau (Contributor, Author):

Thanks for the comment, @jlapeyre! That's a really good idea, but I have one question. I see that you still use dot() in your implementation; are you getting better results with it compared to matmul()? I normally see a speedup when using faer.

I have been testing with your code, and this could be an example of a faer equivalent, which I think should be faster as well. However, in this case, as you said, I also think it won't make a big performance difference. What do you think?

@jlapeyre (Contributor) commented May 2, 2024

Using mat_mul_to_dst

Using dot was just an oversight. I forgot to pay attention to matmul. A routine written specifically for 4x4 matrices would probably be better. But that is for later.

It looks like ndarray has no documented way to do C <- A * B where C is preallocated, right? This is a big deficit. Is there a way to more-or-less cast the ndarray type to a faer type just for the matrix multiplication, rather than carrying the faer type around? I'm trying to think of a way to make the code more future-resistant. ndarray is a Rust standard for a linear-algebra data type (one that doesn't implement non-allocating matrix multiplication, but whatever). Or, another way to look at it: we will almost certainly depend on ndarray for the foreseeable future because of Python, but faer, cgmath, and nalgebra may come and go.

To answer your question, since ndarray has to allocate, I agree using faer for now is a good idea. It might be faster for other reasons as well.

Other things

  • Did you test the special-case kron implementations isolated from other factors? I really should do this before committing to this approach. It seems clear that it should be much faster, but I'd like to be sure. If you did not, I can do it over the next few days.

  • My implementation got rid of this [] => Array2::eye(4). I'm pretty sure multiplying by this identity matrix can be elided.

  • swap(&mut aux, &mut matrix): I think this swaps the data. It should be possible to swap some kind of pointer to the data instead, without copying it.

  • I used a verbose identifier in my implementation of kron..., which reduced readability. I prefer the lhs[(0, 0)] that you have. But what does lhs mean here? Is it like $\text{lhs} \otimes \text{id}$ and then $\text{id} \otimes \text{rhs}$? But you have lhs everywhere.

  • There are some other uses in this crate of kron with 2x2 matrices, one of which is the identity. I think it's a good idea to change all of these in this PR.

@arnaucasau (Contributor, Author):

It looks like ndarray has no documented way to do C <- A * B where C is preallocated, right?

I thought that too, but looking at the documentation, I found the function general_mat_mul. I tested it against matmul, but I still got worse results. I was using MaybeUninit to avoid initializing a new matrix, but maybe this is not the best way to do it.
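For reference, general_mat_mul computes c = alpha * (a · b) + beta * c into a pre-allocated c; a minimal sketch with illustrative operands:

use ndarray::linalg::general_mat_mul;
use ndarray::Array2;
use num_complex::Complex64;

let a = Array2::<Complex64>::eye(4);
let b = Array2::<Complex64>::eye(4);
let mut c = Array2::<Complex64>::zeros((4, 4));
// c = 1.0 * (a · b) + 0.0 * c, written into the existing buffer of `c`
general_mat_mul(Complex64::new(1., 0.), &a, &b, Complex64::new(0., 0.), &mut c);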

Is there a way to more-or-less cast the ndarray type to a faer type just for the matrix multiplication, rather than carrying the faer type around?

The only way I know is to use the faer_ext crate, which has different methods to convert an ndarray matrix into a faer one and vice versa. The downside is that it returns a view, or a MatRef, which is fine for performing the matrix multiplication, but it implies an extra copy of the matrix if we need ownership afterward.
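A sketch of those conversions (trait and method names from the faer_ext crate as used elsewhere in this PR; nd_matrix and faer_matrix are illustrative bindings):

use faer_ext::{IntoFaerComplex, IntoNdarrayComplex};

// ndarray -> faer: borrows as a MatRef<'_, c64>, no copy
let faer_view = nd_matrix.view().into_faer_complex();
// faer -> ndarray: borrows as an ArrayView2<'_, Complex64>, no copy
let nd_view = faer_matrix.as_ref().into_ndarray_complex();
// taking ownership afterward still costs a copy, e.g. faer_view.to_owned()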

Did you test the special-case kron implementations isolated from other factors?

I only tested it together with a matrix multiplication, simulating the scenario we have in this PR, but I'll benchmark all the kron functions against the special-cased versions (returning both ndarray arrays and faer matrices) 👍

swap(&mut aux, &mut matrix): I think this swaps the data.

Oh, interesting. My intention was to work with pointers here to avoid copying any matrix. I'll investigate it 🤔

is this like: lhs ⊗ id and then id ⊗ rhs? But you have lhs everywhere

Exactly! Sorry, I left lhs in the second function by mistake when copy/pasting. It is now fixed in the code.

There are some other uses in this crate of kron with 2x2 matrices, one of which is the identity. I think it's a good idea to change all of these in this PR

If I'm not mistaken, the other use of kron is in two_qubit_decompose.rs. It's a good idea to change it in this PR as well, but it might depend on the library we choose for the explicit/unrolled versions of kron. If we end up using faer, we'll need an extra copy of the matrices after the conversion, so maybe it's better (performance-wise) to switch it in a follow-up that also changes the dot calls there to use faer.

@jlapeyre
Copy link
Contributor

jlapeyre commented May 3, 2024

I found the function general_mat_mul

I didn't mention that one because I didn't see a way to skip the scalar and vector operations.

but it will imply needing an extra copy of the matrix if we need the ownership afterward.

Huh. So faer does not provide a way to transfer ownership of the underlying data as well? If so, in the medium term, we could bring it up with the author or make a PR.

it might depend on the library we choose to implement the explicit/unrolled versions of kron

As long as we have more than one library and data type representing matrices, this kind of problem will come up repeatedly. We could use a macro to handle more than one kind of data. Or use traits. Or actually, since the code is very small, just implement these methods for both kinds of matrices.
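
To illustrate the traits option (purely hypothetical code, not from this PR; the trait and method names are invented here), a small local trait over element access would let one unrolled kernel accept both matrix types:

use faer::complex_native::c64;
use faer::MatRef;
use ndarray::ArrayView2;
use num_complex::Complex64;

// Hypothetical: read-only element access, implementable for both libraries.
trait ReadMatrix {
    fn at(&self, i: usize, j: usize) -> Complex64;
}

impl ReadMatrix for ArrayView2<'_, Complex64> {
    fn at(&self, i: usize, j: usize) -> Complex64 {
        self[(i, j)]
    }
}

impl ReadMatrix for MatRef<'_, c64> {
    fn at(&self, i: usize, j: usize) -> Complex64 {
        let v = self.read(i, j);
        Complex64::new(v.re, v.im)
    }
}

// A single generic kernel then serves both data types.
fn trace_2x2<M: ReadMatrix>(m: &M) -> Complex64 {
    m.at(0, 0) + m.at(1, 1)
}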

We need to move these functions from the modules where they are applied to a common module. My best thought at the moment is something called common.rs. When the number of functions in that module starts to get large, we could split the functions out by category.

@arnaucasau
Copy link
Contributor Author

arnaucasau commented May 7, 2024

We need to move these functions from the modules where they are applied to a common module. My best thought at the moment is something called common.rs.

I like the idea! In the last commit, I moved the new kron functions there, along with the change_basis we were also using in other modules 👍

After testing the new kron functions, I see that using ndarray is always faster than faer, so I changed the code to use ndarray except where we need an owned Mat. In that case, we can use the faer version to avoid the to_owned().

For this script, the versions that need preallocated arrays don't make a big impact, so I used the allocating versions to keep the match idiomatic, and I also changed the other uses of kron in this crate (two_qubit_decompose.rs).

I've checked with the following assertions that the swap exchanges the pointers rather than copying the data, and that seems to be the case:

use std::mem::swap;

let old_ptr1 = aux.as_ptr();
let old_ptr2 = matrix.as_ptr();
// Swapping the two array values exchanges their heap pointers;
// no element data is copied.
swap(&mut aux, &mut matrix);
assert_eq!(old_ptr1, matrix.as_ptr());
assert_eq!(old_ptr2, aux.as_ptr());

Summary of the changes in this PR:

  • The use of an explicit/unrolled version of kron 2x2 that improves the performance of the kron calls
  • The use of matmul() from faer, improving the performance of the dot() calls
  • The introduction of common.rs, a module with common functions shared between different scripts

In a follow-up, I'll change the two_qubit_decompose.rs script to use matmul() and maybe add more optimizations, like the new kron functions in the common.rs module.

@arnaucasau arnaucasau changed the title Switching over convert_2q_block_matrix.rs to use matmul and kron from faer Switch convert_2q_block_matrix.rs to use matmul from faer and an explicit version of kron May 7, 2024
@arnaucasau arnaucasau changed the title Switch convert_2q_block_matrix.rs to use matmul from faer and an explicit version of kron [on hold] Switch convert_2q_block_matrix.rs to use matmul from faer and an explicit version of kron May 28, 2024
@arnaucasau arnaucasau marked this pull request as draft May 28, 2024 14:56
@arnaucasau
Copy link
Contributor Author

@mtreinish

After looking closely at two_qubit_decompose.rs, I think we should keep ndarray for matrix multiplication in the accelerate crate. Even though faer improves the performance of the multiplications for the 2x2 and 4x4 cases, we end up needing to perform extra copies of the matrices when we want to convert them into PyArray using into_pyarray_bound().

This is because we can transform a faer Mat into an ndarray array view, but we need to call to_owned() to take ownership.

An alternative to get a speedup is to follow @jlapeyre's suggestion and implement the matrix multiplication and kron by hand in fixed-size functions for the two cases. In this PR, you can find an example of this in a new module named common.rs.

In the new module we can find the following functions:

  • kron_matrix_x_identity(): A kron function that computes the Kronecker product of a given 2x2 matrix and the identity.
  • kron_identity_x_matrix(): A kron function that computes the Kronecker product of the identity and a given 2x2 matrix.
  • matrix_multiply_2x2(): A 2x2 matmul, completely unrolled.
  • matrix_multiply_4x4(): A 4x4 matmul using three for loops that I think the compiler can easily vectorize/unroll (sketched after this list).
  • determinant_4x4(): A function to calculate the determinant of a 4x4 matrix.
  • change_basis(): The same function that previously lived in convert_2q_block_matrix.rs, moved here so it can be reused by more scripts.
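
As a concrete reference for the 4x4 shape described in the list above, here is a minimal sketch (not necessarily this PR's exact code):

use ndarray::{Array2, ArrayView2};
use num_complex::Complex64;

// 4x4 matmul with fixed trip counts so the compiler can unroll/vectorize.
fn matrix_multiply_4x4(
    a: ArrayView2<Complex64>,
    b: ArrayView2<Complex64>,
) -> Array2<Complex64> {
    let mut out = Array2::<Complex64>::zeros((4, 4));
    for i in 0..4 {
        for j in 0..4 {
            let mut acc = Complex64::new(0.0, 0.0);
            for k in 0..4 {
                acc += a[(i, k)] * b[(k, j)];
            }
            out[(i, j)] = acc;
        }
    }
    out
}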

As you can see, the common.rs module has more functions than just the matrix multiplications. All of them are more efficient than the generic equivalents and help improve the performance of the crate. The determinant_4x4 function (sketched below) also lets us stop using faer for that purpose and avoid an extra copy of a matrix in one part of the code.
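
The determinant can be fully unrolled in the same spirit; a minimal sketch via cofactor expansion along the first row (again, not necessarily this PR's exact code):

use ndarray::ArrayView2;
use num_complex::Complex64;

fn determinant_4x4(m: ArrayView2<Complex64>) -> Complex64 {
    // 3x3 determinant of the minor obtained by deleting row 0 and column col.
    let minor = |col: usize| -> Complex64 {
        let keep: Vec<usize> = (0..4).filter(|&j| j != col).collect();
        let (a, b, c) = (keep[0], keep[1], keep[2]);
        m[(1, a)] * (m[(2, b)] * m[(3, c)] - m[(2, c)] * m[(3, b)])
            - m[(1, b)] * (m[(2, a)] * m[(3, c)] - m[(2, c)] * m[(3, a)])
            + m[(1, c)] * (m[(2, a)] * m[(3, b)] - m[(2, b)] * m[(3, a)])
    };
    // Alternating signs of the first-row cofactors.
    m[(0, 0)] * minor(0) - m[(0, 1)] * minor(1) + m[(0, 2)] * minor(2)
        - m[(0, 3)] * minor(3)
}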

The following benchmarks show the performance improvement of a single call to these functions compared with the ndarray and faer equivalents:

Kronecker product 2x2 (identity ⊗ matrix)

test tests::benchmark11_faer_kron_2x2              ... bench:         143 ns/iter (+/- 9)
test tests::benchmark12_ndarray_kron_2x2           ... bench:         144 ns/iter (+/- 3)
test tests::benchmark13_common_kron_2x2            ... bench:          42 ns/iter (+/- 4)

Matrix multiplication 2x2

test tests::benchmark_faer_matmul_2x2              ... bench:         150 ns/iter (+/- 2)
test tests::benchmark_ndarray_dot_2x2              ... bench:         230 ns/iter (+/- 10)
test tests::benchmark_common_matrix_multiply_2x2   ... bench:          63 ns/iter (+/- 0)

Matrix multiplication 4x4

test tests::benchmark_faer_matmul_4x4              ... bench:         285 ns/iter (+/- 6)
test tests::benchmark_ndarray_dot_4x4              ... bench:         313 ns/iter (+/- 15)
test tests::benchmark_common_matrix_multiply_4x4   ... bench:         151 ns/iter (+/- 4)

Determinant

test tests::benchmark_faer_determinant_4x4         ... bench:         521 ns/iter (+/- 19)
test tests::benchmark_common_determinant_4x4       ... bench:         124 ns/iter (+/- 9)

To test the overall performance, I have also run the transpiler benchmarks using ASV. The result was a performance increase; these are the benchmarks that improved:

| Change   | Before [9d03b4bf] <main>   | After [a650f588] <AC/custom-linalg>   |   Ratio | Benchmark (Parameter)                                                                                      |
|----------|----------------------------|---------------------------------------|---------|------------------------------------------------------------------------------------------------------------|
| -        | 564±2ms                    | 512±0.8ms                             |    0.91 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'stochastic', 'sabre') |
| -        | 302±0.2ms                  | 264±0.5ms                             |    0.87 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'sabre')      |
| -        | 126±0.8ms                  | 109±0.7ms                             |    0.86 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'dense')      |
| -        | 249±0.4ms                  | 212±0.9ms                             |    0.85 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'stochastic', 'sabre') |
| -        | 158±0.4ms                  | 133±0.3ms                             |    0.84 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'sabre')      |
| -        | 264±0.7ms                  | 222±3ms                               |    0.84 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'dense')      |
| -        | 719±0.9ms                  | 599±0.2ms                             |    0.83 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'stochastic', 'sabre') |
| -        | 161±1ms                    | 125±0.4ms                             |    0.78 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'dense')      |
| -        | 391±0.5ms                  | 306±0.5ms                             |    0.78 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'sabre')      |
| -        | 347±0.7ms                  | 259±0.4ms                             |    0.75 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'dense')      |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

I was going to add tests for common.rs, but I wanted to show all of this first to get your opinion on the change of approach.

@arnaucasau arnaucasau closed this Jan 23, 2025
Labels
Changelog: None (Do not include in changelog), Intern PR (PR submitted by IBM Quantum interns), performance, Rust (This PR or issue is related to Rust code in the repository)