Parallelize probs with OpenMP (#800)
### Before submitting

Please complete the following checklist when submitting a PR:

- [x] All new features must include a unit test.
      If you've fixed a bug or added code that should be tested, add a test to the
      [`tests`](../tests) directory!

- [x] All new functions and code must be clearly commented and documented.
      If you do make documentation changes, make sure that the docs build and
      render correctly by running `make docs`.

- [x] Ensure that the test suite passes, by running `make test`.

- [x] Add a new entry to the `.github/CHANGELOG.md` file, summarizing the
      change, and including a link back to the PR.

- [x] Ensure that code is properly formatted by running `make format`. 

When all the above are checked, delete everything above the dashed
line and fill in the pull request template.


------------------------------------------------------------------------------------------------------------

**Context:**
`probs` is central in circuit simulation measurements.

**Description of the Change:**
Parallelize `probs` loops using OpenMP.

**Benefits:**
Faster execution with several threads.
The following benchmarks were run on ISAIC's AMD EPYC-Milan processor
using several cores/threads. The timings are obtained by averaging 5
evaluations of `probs(target)` for various numbers of targets. The last
release implementation is used as the reference. Because #795 already
brings some speed-ups even for a single thread, the observed speed-ups
can exceed the number of threads.


![speedup_vs_nthreads](https://github.com/user-attachments/assets/fdb762bb-8e50-4337-b2c0-7b16a42e8dad)

Another view of the data is the strong scaling efficiency. It is almost
perfect for 2-4 threads, fairly good for 8 threads, and drops
significantly for 16 threads.


![efficiency_vs_nthreads](https://github.com/user-attachments/assets/4feca17e-a461-407c-947c-a0bc54c21b2a)
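
As a reading aid for the two plots, and assuming the conventional definitions (the commit message does not spell them out), the speed-up and the strong scaling efficiency on $p$ threads can be written as:

$$
S(p) = \frac{T_{\mathrm{ref}}}{T(p)}, \qquad E(p) = \frac{T(1)}{p \, T(p)}
$$

Here $T_{\mathrm{ref}}$ is taken to be the time of the last release implementation and $T(p)$ the time of the new implementation on $p$ threads; both symbols are notation introduced here, not taken from the PR. With purely hypothetical timings $T(1) = 8$ s and $T(4) = 2.2$ s, this gives $E(4) = 8 / (4 \times 2.2) \approx 0.91$.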

**Possible Drawbacks:**

**Related GitHub Issues:**

---------

Co-authored-by: ringo-but-quantum <github-ringo-but-quantum@xanadu.ai>
Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com>
3 people authored Jul 16, 2024
1 parent c164fe5 commit 4ec49b8
Showing 5 changed files with 35 additions and 13 deletions.
7 changes: 5 additions & 2 deletions .github/CHANGELOG.md
@@ -15,7 +15,10 @@

### Improvements

- * Implement probs(wires) using a bit-shift implementation akin to the gate kernels in Lightning-Qubit.
+ * Parallelize Lightning-Qubit `probs` with OpenMP when using the `-DLQ_ENABLE_KERNEL_OMP=1` CMake argument.
+ [(#800)](https://github.com/PennyLaneAI/pennylane-lightning/pull/800)
+
+ * Implement `probs(wires)` using a bit-shift implementation akin to the gate kernels in Lightning-Qubit.
[(#795)](https://github.com/PennyLaneAI/pennylane-lightning/pull/795)

* Enable setting the PennyLane version when invoking, for example, `make docker-build version=master pl_version=master`.
@@ -449,7 +452,7 @@ Vincent Michaud-Rioux
* The `BlockEncode` operation from PennyLane is now supported on all Lightning devices.
[(#599)](https://github.com/PennyLaneAI/pennylane-lightning/pull/599)

- * OpenMP acceleration can now be enabled at compile time for all `lightning.qubit` gate kernels using the "-DLQ_ENABLE_KERNEL_OMP=1" CMake argument.
+ * OpenMP acceleration can now be enabled at compile time for all `lightning.qubit` gate kernels using the `-DLQ_ENABLE_KERNEL_OMP=1` CMake argument.
[(#510)](https://github.com/PennyLaneAI/pennylane-lightning/pull/510)

* Enable building Docker images for any branch or tag. Set the Docker build cron job to build images for the latest release and `master`.
2 changes: 1 addition & 1 deletion pennylane_lightning/core/_version.py
@@ -16,4 +16,4 @@
Version number (major.minor.patch[-label])
"""

- __version__ = "0.38.0-dev7"
+ __version__ = "0.38.0-dev8"
@@ -49,14 +49,21 @@ else()
endif()

if(LQ_ENABLE_KERNEL_OMP)
message(STATUS "OpenMP-parallelized kernels: ON.")
add_definitions("-DPL_LQ_KERNEL_OMP")
target_compile_definitions(lightning_qubit PUBLIC -DPL_LQ_KERNEL_OMP)
else()
message(STATUS "OpenMP-parallelized kernels: OFF.")
endif()

if(LQ_ENABLE_KERNEL_AVX_STREAMING)
if(NOT LQ_ENABLE_KERNEL_OMP)
message(WARNING "AVX streaming operations require `LQ_ENABLE_KERNEL_OMP` to be enabled.")
endif()
message(STATUS "AVX streaming operations: ON.")
add_definitions("-DPL_LQ_KERNEL_AVX_STREAMING")
else()
message(STATUS "AVX streaming operations: OFF.")
endif()

target_link_libraries(lightning_qubit PUBLIC lightning_compile_options
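
The CMake option above only defines the `PL_LQ_KERNEL_OMP` macro. The C++ changes in this commit additionally check `_OPENMP`, which OpenMP-aware compilers define when OpenMP support is switched on. A minimal sketch of that double guard follows; the function name `kernel_omp_enabled` is hypothetical and does not exist in Lightning.

```cpp
// Hypothetical helper (not part of Lightning): reports at compile time
// whether the OpenMP code paths guarded as in this commit would be active.
constexpr bool kernel_omp_enabled() {
#if defined PL_LQ_KERNEL_OMP && defined _OPENMP
    // Both the project-level switch and the compiler's OpenMP support are on.
    return true;
#else
    // Either the CMake option is off or the compiler was not run with OpenMP.
    return false;
#endif
}
```

With this kind of guard, a build without OpenMP support falls back to the serial loops even if `PL_LQ_KERNEL_OMP` is defined.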
@@ -382,7 +382,12 @@ auto probs_bitshift(const std::complex<PrecisionT> *arr,
PROBS_CORE_DECLARE_P(6)
PROBS_CORE_DECLARE_P(7)
PROBS_CORE_DECLARE_P(8)
- std::vector<PrecisionT> probs(PUtil::exp2(n_wires), 0);
+ constexpr std::size_t n_probs = one << n_wires;
+ std::vector<PrecisionT> probabilities(n_probs, 0);
+ auto *probs = probabilities.data();
+ #if defined PL_LQ_KERNEL_OMP && defined _OPENMP
+ #pragma omp parallel for reduction(+ : probs[ : n_probs])
+ #endif
for (std::size_t k = 0; k < exp2(num_qubits - n_wires); k++) {
std::size_t i0;
PROBS_CORE_SUM_1
@@ -394,7 +399,7 @@ auto probs_bitshift(const std::complex<PrecisionT> *arr,
PROBS_CORE_SUM_7
PROBS_CORE_SUM_8
}
- return probs;
+ return probabilities;
}
// NOLINTEND(hicpp-function-size,readability-function-size)
} // namespace Pennylane::LightningQubit::Measures
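
The `reduction(+ : probs[ : n_probs])` clause added above is an OpenMP array-section reduction: each thread accumulates into a private copy of the `n_probs` entries, and the copies are summed when the loop ends. The following is a minimal, self-contained sketch of that pattern, not the Lightning kernel itself. The function name `marginal_probs_low_bits` and the choice of the lowest index bits as targets are assumptions made for illustration, and the guard here checks only `_OPENMP`.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Lightning): marginal probabilities of the
// lowest `n_wires` bits of the basis-state index.
template <class PrecisionT>
std::vector<PrecisionT>
marginal_probs_low_bits(const std::vector<std::complex<PrecisionT>> &state,
                        std::size_t n_wires) {
    const std::size_t n_probs = std::size_t{1} << n_wires;
    const std::size_t n_state = state.size();
    const auto *arr = state.data();
    std::vector<PrecisionT> probabilities(n_probs, 0);
    // An array section is written against a pointer or array, so the
    // vector's storage is exposed through a raw pointer.
    auto *probs = probabilities.data();
#if defined(_OPENMP)
#pragma omp parallel for reduction(+ : probs[ : n_probs])
#endif
    for (std::size_t k = 0; k < n_state; k++) {
        // Several k map to the same output entry, hence the reduction.
        probs[k & (n_probs - 1)] += std::norm(arr[k]);
    }
    return probabilities;
}
```

Array sections in `reduction` clauses require OpenMP 4.5 or later. Compiled without OpenMP support, the loop simply runs serially.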
@@ -80,13 +80,16 @@ class Measurements final
*/
auto probs() -> std::vector<PrecisionT> {
const ComplexT *arr_data = this->_statevector.getData();
- std::vector<PrecisionT> basis_probs(this->_statevector.getLength(), 0);
-
- std::transform(
- arr_data, arr_data + this->_statevector.getLength(),
- basis_probs.begin(),
- [](const ComplexT &z) -> PrecisionT { return std::norm(z); });
- return basis_probs;
+ const std::size_t n_probs = this->_statevector.getLength();
+ std::vector<PrecisionT> probabilities(n_probs, 0);
+ auto *probs = probabilities.data();
+ #if defined PL_LQ_KERNEL_OMP && defined _OPENMP
+ #pragma omp parallel for
+ #endif
+ for (std::size_t k = 0; k < n_probs; k++) {
+ probs[k] = std::norm(arr_data[k]);
+ }
+ return probabilities;
};

/**
@@ -128,10 +131,14 @@ class Measurements final
Gates::getIndicesAfterExclusion(wires, num_qubits), num_qubits);
const std::size_t n_probs = PUtil::exp2(n_wires);
std::vector<PrecisionT> probabilities(n_probs, 0);
+ auto *probs = probabilities.data();
std::size_t ind_probs = 0;
for (auto index : all_indices) {
+ #if defined PL_LQ_KERNEL_OMP && defined _OPENMP
+ #pragma omp parallel for reduction(+ : probs[ : n_probs])
+ #endif
for (auto offset : all_offsets) {
- probabilities[ind_probs] += std::norm(arr_data[index + offset]);
+ probs[ind_probs] += std::norm(arr_data[index + offset]);
}
ind_probs++;
}
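
Of the loops touched in this file, the full-state `probs()` loop is the simplest to parallelize: every iteration writes a distinct output element, so a plain `parallel for` with no `reduction` clause is enough. A minimal standalone sketch of that case follows; the free function `all_probs` is a hypothetical stand-in for the class method, and the guard checks only `_OPENMP`.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical free function (not part of Lightning) mirroring the shape of
// the full-state probability loop: probs[k] = |state[k]|^2.
template <class PrecisionT>
std::vector<PrecisionT>
all_probs(const std::vector<std::complex<PrecisionT>> &state) {
    const std::size_t n_probs = state.size();
    const auto *arr_data = state.data();
    std::vector<PrecisionT> probabilities(n_probs, 0);
    auto *probs = probabilities.data();
#if defined(_OPENMP)
#pragma omp parallel for
#endif
    for (std::size_t k = 0; k < n_probs; k++) {
        // Each iteration owns probs[k], so threads never write the same entry.
        probs[k] = std::norm(arr_data[k]);
    }
    return probabilities;
}
```

The marginal-probability loops differ because many state-vector entries contribute to the same output entry, which is why they carry a `reduction` clause.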