Optimize Lightning-Kokkos' probs(wires) using bitshift implementation (#802)

**Context:**
`probs` is central to measurements in circuit simulation.

**Description of the Change:**
Implement `probs(wires)` using a bitshift implementation akin to the
gate kernels in Lightning-Qubit.
Enable `probs(unsorted_wires)` tests.
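The bitshift idea can be sketched in plain, serial C++ without Kokkos. The outer loop enumerates the non-target bits and inserts a zero bit at each target position, as the gate kernels do; the inner loop fills in the target bits in the order the wires were given, so unsorted wires are handled too. The name `probs_bitshift`, its signature, and the convention that wire 0 is the most significant bit are assumptions made for this illustration; this is not the set of kernels added by this PR.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the Lightning-Kokkos API): marginal probabilities
// of `wires` for an n-qubit state. Assumes wire 0 is the most significant bit.
std::vector<double>
probs_bitshift(const std::vector<std::complex<double>> &state,
               std::size_t num_qubits, const std::vector<std::size_t> &wires) {
    const std::size_t one = 1;
    const std::size_t k = wires.size();

    // Bit position of each target wire inside a state-vector index.
    std::vector<std::size_t> rev_wires(k);
    for (std::size_t j = 0; j < k; ++j) {
        rev_wires[j] = num_qubits - 1 - wires[j];
    }
    // Sorted ascending so zero bits can be inserted from low to high.
    std::vector<std::size_t> sorted = rev_wires;
    std::sort(sorted.begin(), sorted.end());

    std::vector<double> probs(one << k, 0.0);
    // Outer loop: every assignment of the non-target bits.
    for (std::size_t i = 0; i < (one << (num_qubits - k)); ++i) {
        // Insert a zero bit at each target position.
        std::size_t base = i;
        for (std::size_t pos : sorted) {
            const std::size_t low = base & ((one << pos) - 1);
            base = ((base >> pos) << (pos + 1)) | low;
        }
        // Inner loop: every assignment of the target bits.
        for (std::size_t t = 0; t < (one << k); ++t) {
            std::size_t idx = base;
            for (std::size_t j = 0; j < k; ++j) {
                // Output bit j (most significant first) belongs to wires[j].
                const std::size_t bit = (t >> (k - 1 - j)) & one;
                idx |= bit << rev_wires[j];
            }
            probs[t] += std::norm(state[idx]);
        }
    }
    return probs;
}
```

For the two-qubit state with amplitudes 0.8 on |00> and 0.6 on |01>, `probs_bitshift(state, 2, {1})` gives 0.64 and 0.36, while `probs_bitshift(state, 2, {1, 0})` places the 0.36 at output index 2 because wire 1 becomes the leading output bit.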

**Benefits:**
Faster execution.
The following benchmarks were run on ISAIC's AMD EPYC-Milan processor
with a varying number of OpenMP threads, from 1 to 32. The 32-thread
data is not shown, both for clarity and because there is no guarantee
that the benchmark application was the only intensive process running on
the machine, so oversubscription was a real possibility. Each time is
the average over 5 computations of `probs(targets)`, where `targets`
includes one or several wires. The speed-ups vary considerably with the
number of targets, but they are greater than 1 in every case.


![speedup_vs_nthreads](https://github.com/user-attachments/assets/54797c41-8184-4c6a-a096-d2fcf1652e5b)

We also compute the parallelization efficiency, which is displayed in
the following figure.


![efficiency_vs_nthreads](https://github.com/user-attachments/assets/ce048c4c-d24d-4a5d-bba1-2632b9bf9a98)
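For reference, the two quantities plotted above can be computed as in the short sketch below. The message does not spell out the formulas, so the usual definitions are assumed: speed-up is the old time divided by the new time at the same thread count, and efficiency is the strong-scaling ratio T(1) / (n T(n)). The helper names and the timing harness are hypothetical and are not the benchmark code used for the figures.

```cpp
#include <chrono>
#include <cstddef>

// Average wall-clock time of `f` over `repeats` calls (5, as in the text).
template <class F> double mean_seconds(F &&f, std::size_t repeats = 5) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t r = 0; r < repeats; ++r) {
        f();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / static_cast<double>(repeats);
}

// Speed-up of the new kernel over the old one at a fixed thread count.
double speedup(double t_old, double t_new) { return t_old / t_new; }

// Assumed strong-scaling efficiency: T(1) / (n * T(n)).
double efficiency(double t_one_thread, double t_n_threads,
                  std::size_t n_threads) {
    return t_one_thread / (static_cast<double>(n_threads) * t_n_threads);
}
```

An efficiency of 1 under this definition would mean perfect scaling with the number of threads.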

It is also important to validate that the CUDA backend performs equally
well. We therefore repeated the exercise and found that the new kernels
accelerate `probs` for any number of targets.


![speedup_cuda](https://github.com/user-attachments/assets/e9b2841a-139a-47ac-b45f-0ca0ba4a074f)

**Possible Drawbacks:**
Having many implementations decreases maintainability.

**Related GitHub Issues:**
[sc-65198]

---------

Co-authored-by: ringo-but-quantum <github-ringo-but-quantum@xanadu.ai>
Co-authored-by: Ali Asadi <10773383+maliasadi@users.noreply.github.com>
3 people authored Jul 24, 2024
1 parent 11bf9dc commit d1baa8f
Showing 11 changed files with 749 additions and 253 deletions.
3 changes: 3 additions & 0 deletions .github/CHANGELOG.md
@@ -24,6 +24,9 @@
* Optimize the OpenMP parallelization of Lightning-Qubit's `probs` for all number of targets.
[(#807)](https://github.com/PennyLaneAI/pennylane-lightning/pull/807)

+* Optimize `probs(wires)` of Lightning-Kokkos using various kernels. Which kernel is to be used depends on the device, number of qubits and number of target wires.
+[(#802)](https://github.com/PennyLaneAI/pennylane-lightning/pull/802)

* Add GPU device compute capability check for Lightning-Tensor.
[(#803)](https://github.com/PennyLaneAI/pennylane-lightning/pull/803)

2 changes: 1 addition & 1 deletion pennylane_lightning/core/_version.py
@@ -16,4 +16,4 @@
Version number (major.minor.patch[-label])
"""

-__version__ = "0.38.0-dev14"
+__version__ = "0.38.0-dev15"
@@ -93,7 +93,6 @@ template <typename TypeList> void testProbabilities() {
{0.67078706, 0.03062806, 0.0870997, 0.00397696, 0.17564072,
0.00801973, 0.02280642, 0.00104134}}
#else
-#if defined(_ENABLE_PLQUBIT)
// LightningQubit currently supports arbitrary wire index
// ordering.
{{0, 2, 1},
@@ -112,8 +111,6 @@ template <typename TypeList> void testProbabilities() {
{0.67078706, 0.17564072, 0.0870997, 0.02280642, 0.03062806,
0.00801973, 0.00397696, 0.00104134}},
{{2, 1}, {0.84642778, 0.10990612, 0.0386478, 0.0050183}},
-
-#endif
{{0, 1, 2},
{0.67078706, 0.03062806, 0.0870997, 0.00397696, 0.17564072,
0.00801973, 0.02280642, 0.00104134}},
@@ -36,13 +36,15 @@
#include "GateOperation.hpp"
#include "StateVectorBase.hpp"
#include "Util.hpp"
+#include "UtilKokkos.hpp"

#include "CPUMemoryModel.hpp"

/// @cond DEV
namespace {
using namespace Pennylane::Gates::Constant;
using namespace Pennylane::LightningKokkos::Functors;
+using namespace Pennylane::LightningKokkos::Util;
using Pennylane::Gates::GateOperation;
using Pennylane::Gates::GeneratorOperation;
using Pennylane::Util::array_contains;
@@ -151,12 +153,8 @@ class StateVectorKokkos final
void setStateVector(const std::vector<std::size_t> &indices,
const std::vector<ComplexT> &values) {
initZeros();
-KokkosSizeTVector d_indices("d_indices", indices.size());
-KokkosVector d_values("d_values", values.size());
-Kokkos::deep_copy(d_indices, UnmanagedConstSizeTHostView(
-indices.data(), indices.size()));
-Kokkos::deep_copy(d_values, UnmanagedConstComplexHostView(
-values.data(), values.size()));
+auto d_indices = vector2view(indices);
+auto d_values = vector2view(values);
KokkosVector sv_view =
getView(); // circumvent error capturing this with KOKKOS_LAMBDA
Kokkos::parallel_for(
@@ -283,19 +281,13 @@ class StateVectorKokkos final
PL_ABORT_IF(gate_matrix.empty(),
std::string("Operation does not exist for ") + opName +
std::string(" and no matrix provided."));
-KokkosVector matrix("gate_matrix", gate_matrix.size());
-Kokkos::deep_copy(
-matrix, UnmanagedConstComplexHostView(gate_matrix.data(),
-gate_matrix.size()));
-return applyMultiQubitOp(matrix, wires, inverse);
+return applyMultiQubitOp(vector2view(gate_matrix), wires, inverse);
}
}

template <bool inverse = false>
void applyControlledGlobalPhase(const std::vector<ComplexT> &diagonal) {
-KokkosVector diagonal_("diagonal_", diagonal.size());
-Kokkos::deep_copy(diagonal_, UnmanagedConstComplexHostView(
-diagonal.data(), diagonal.size()));
+auto diagonal_ = vector2view(diagonal);
auto two2N = BaseType::getLength();
auto dataview = getView();
Kokkos::parallel_for(
@@ -587,15 +579,11 @@ class StateVectorKokkos final
* @brief Get underlying data vector
*/
[[nodiscard]] auto getDataVector() -> std::vector<ComplexT> {
-std::vector<ComplexT> data_(this->getLength());
-DeviceToHost(data_.data(), data_.size());
-return data_;
+return view2vector(getView());
}

[[nodiscard]] auto getDataVector() const -> const std::vector<ComplexT> {
-std::vector<ComplexT> data_(this->getLength());
-DeviceToHost(data_.data(), data_.size());
-return data_;
+return view2vector(getView());
}

/**
@@ -17,13 +17,14 @@
#include <Kokkos_StdAlgorithms.hpp>

#include "BitUtil.hpp"
-#include "BitUtilKokkos.hpp"
+#include "UtilKokkos.hpp"

/// @cond DEV
namespace {
using namespace Pennylane::Util;
using Kokkos::Experimental::swap;
using Pennylane::LightningKokkos::Util::one;
+using Pennylane::LightningKokkos::Util::vector2view;
using Pennylane::LightningKokkos::Util::wires2Parity;
using std::size_t;
} // namespace
@@ -55,11 +56,7 @@ template <class Precision> struct multiQubitOpFunctor {
multiQubitOpFunctor(KokkosComplexVector arr_, std::size_t num_qubits_,
const KokkosComplexVector &matrix_,
const std::vector<std::size_t> &wires_) {
-Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-wires_host(wires_.data(), wires_.size());
-Kokkos::resize(wires, wires_host.size());
-Kokkos::deep_copy(wires, wires_host);
+wires = vector2view(wires_);
dim = one << wires_.size();
num_qubits = num_qubits_;
arr = arr_;
@@ -122,10 +119,9 @@ template <class PrecisionT> struct apply1QubitOpFunctor {
std::size_t wire_parity;
std::size_t wire_parity_inv;

-apply1QubitOpFunctor(
-KokkosComplexVector arr_, std::size_t num_qubits_,
-const KokkosComplexVector &matrix_,
-[[maybe_unused]] const std::vector<std::size_t> &wires_) {
+apply1QubitOpFunctor(KokkosComplexVector arr_, std::size_t num_qubits_,
+const KokkosComplexVector &matrix_,
+const std::vector<std::size_t> &wires_) {
arr = arr_;
matrix = matrix_;
num_qubits = num_qubits_;
@@ -169,10 +165,9 @@ template <class PrecisionT> struct apply2QubitOpFunctor {
std::size_t parity_high;
std::size_t parity_middle;

-apply2QubitOpFunctor(
-KokkosComplexVector arr_, std::size_t num_qubits_,
-const KokkosComplexVector &matrix_,
-[[maybe_unused]] const std::vector<std::size_t> &wires_) {
+apply2QubitOpFunctor(KokkosComplexVector arr_, std::size_t num_qubits_,
+const KokkosComplexVector &matrix_,
+const std::vector<std::size_t> &wires_) {
arr = arr_;
matrix = matrix_;
num_qubits = num_qubits_;
@@ -238,11 +233,7 @@ template <class PrecisionT> struct apply3QubitOpFunctor {
apply3QubitOpFunctor(KokkosComplexVector arr_, std::size_t num_qubits_,
const KokkosComplexVector &matrix_,
const std::vector<std::size_t> &wires_) {
-Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-wires_host(wires_.data(), wires_.size());
-Kokkos::resize(wires, wires_host.size());
-Kokkos::deep_copy(wires, wires_host);
+wires = vector2view(wires_);
arr = arr_;
matrix = matrix_;
num_qubits = num_qubits_;
@@ -312,11 +303,7 @@ template <class PrecisionT> struct apply4QubitOpFunctor {
apply4QubitOpFunctor(KokkosComplexVector arr_, std::size_t num_qubits_,
const KokkosComplexVector &matrix_,
const std::vector<std::size_t> &wires_) {
-Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-wires_host(wires_.data(), wires_.size());
-Kokkos::resize(wires, wires_host.size());
-Kokkos::deep_copy(wires, wires_host);
+wires = vector2view(wires_);
arr = arr_;
matrix = matrix_;
num_qubits = num_qubits_;
@@ -422,11 +409,7 @@ template <class PrecisionT> struct apply5QubitOpFunctor {
apply5QubitOpFunctor(KokkosComplexVector arr_, std::size_t num_qubits_,
const KokkosComplexVector &matrix_,
const std::vector<std::size_t> &wires_) {
-Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-wires_host(wires_.data(), wires_.size());
-Kokkos::resize(wires, wires_host.size());
-Kokkos::deep_copy(wires, wires_host);
+wires = vector2view(wires_);
arr = arr_;
matrix = matrix_;
num_qubits = num_qubits_;
@@ -15,12 +15,14 @@
#include <Kokkos_Core.hpp>

#include "BitUtil.hpp"
-#include "BitUtilKokkos.hpp"
+#include "Error.hpp"
+#include "UtilKokkos.hpp"

/// @cond DEV
namespace {
using namespace Pennylane::Util;
using Pennylane::LightningKokkos::Util::one;
+using Pennylane::LightningKokkos::Util::vector2view;
using Pennylane::LightningKokkos::Util::wires2Parity;
} // namespace
/// @endcond
@@ -186,12 +188,7 @@ template <class PrecisionT> struct getExpValMultiQubitOpFunctor {
std::size_t num_qubits_,
const KokkosComplexVector &matrix_,
const std::vector<std::size_t> &wires_) {
-Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-wires_host(wires_.data(), wires_.size());
-Kokkos::resize(wires, wires_.size());
-Kokkos::deep_copy(wires, wires_host);
-
+wires = vector2view(wires_);
dim = one << wires_.size();
num_qubits = num_qubits_;
arr = arr_;
@@ -289,10 +286,10 @@ template <class PrecisionT> struct getExpVal1QubitOpFunctor {
std::size_t wire_parity;
std::size_t wire_parity_inv;

-getExpVal1QubitOpFunctor(
-const KokkosComplexVector &arr_, const std::size_t num_qubits_,
-const KokkosComplexVector &matrix_,
-[[maybe_unused]] const std::vector<std::size_t> &wires_) {
+getExpVal1QubitOpFunctor(const KokkosComplexVector &arr_,
+const std::size_t num_qubits_,
+const KokkosComplexVector &matrix_,
+const std::vector<std::size_t> &wires_) {
arr = arr_;
matrix = matrix_;
num_qubits = num_qubits_;
@@ -343,10 +340,10 @@ template <class PrecisionT> struct getExpVal2QubitOpFunctor {
std::size_t parity_high;
std::size_t parity_middle;

-getExpVal2QubitOpFunctor(
-const KokkosComplexVector &arr_, const std::size_t num_qubits_,
-const KokkosComplexVector &matrix_,
-[[maybe_unused]] const std::vector<std::size_t> &wires_) {
+getExpVal2QubitOpFunctor(const KokkosComplexVector &arr_,
+const std::size_t num_qubits_,
+const KokkosComplexVector &matrix_,
+const std::vector<std::size_t> &wires_) {
arr = arr_;
matrix = matrix_;
num_qubits = num_qubits_;
@@ -407,12 +404,7 @@ template <class PrecisionT> struct getExpVal3QubitOpFunctor {
const std::size_t num_qubits_,
const KokkosComplexVector &matrix_,
const std::vector<std::size_t> &wires_) {
-Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-wires_host(wires_.data(), wires_.size());
-Kokkos::resize(wires, wires_.size());
-Kokkos::deep_copy(wires, wires_host);
-
+wires = vector2view(wires_);
arr = arr_;
matrix = matrix_;
num_qubits = num_qubits_;
@@ -478,11 +470,7 @@ template <class PrecisionT> struct getExpVal4QubitOpFunctor {
const std::size_t num_qubits_,
const KokkosComplexVector &matrix_,
const std::vector<std::size_t> &wires_) {
-Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-wires_host(wires_.data(), wires_.size());
-Kokkos::resize(wires, wires_.size());
-Kokkos::deep_copy(wires, wires_host);
+wires = vector2view(wires_);
arr = arr_;
matrix = matrix_;
num_qubits = num_qubits_;
@@ -577,11 +565,7 @@ template <class PrecisionT> struct getExpVal5QubitOpFunctor {
const std::size_t num_qubits_,
const KokkosComplexVector &matrix_,
const std::vector<std::size_t> &wires_) {
-Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-wires_host(wires_.data(), wires_.size());
-Kokkos::resize(wires, wires_.size());
-Kokkos::deep_copy(wires, wires_host);
+wires = vector2view(wires_);
arr = arr_;
matrix = matrix_;
num_qubits = num_qubits_;