Optimize Lightning-Kokkos' probs(wires) using bitshift implementation (#802)

### Before submitting

Please complete the following checklist when submitting a PR:

- [x] All new features must include a unit test. If you've fixed a bug or added code that should be tested, add a test to the [`tests`](../tests) directory!
- [x] All new functions and code must be clearly commented and documented. If you do make documentation changes, make sure that the docs build and render correctly by running `make docs`.
- [x] Ensure that the test suite passes, by running `make test`.
- [x] Add a new entry to the `.github/CHANGELOG.md` file, summarizing the change, and including a link back to the PR.
- [x] Ensure that code is properly formatted by running `make format`.

When all the above are checked, delete everything above the dashed line and fill in the pull request template.

------------------------------------------------------------------------------------------------------------

**Context:** `probs` is central in circuit simulation measurements.

**Description of the Change:** Implement `probs(wires)` using a bitshift implementation akin to the gate kernels in Lightning-Qubit. Enable the `probs(unsorted_wires)` tests.

**Benefits:** Faster execution. The following benchmarks were performed on ISAIC's AMD EPYC-Milan processor using a varying number of OpenMP threads, ranging from 1 to 32. The 32-thread data is not shown, both for clarity and because there is no guarantee that the benchmark application was the sole intensive process running on the machine, so oversubscription is a real possibility. The times are obtained by averaging the computation of `probs(targets)` over 5 runs, where `targets` includes one or several wires. The speed-ups vary considerably with the number of targets, but they are greater than 1 in every case.

![speedup_vs_nthreads](https://github.com/user-attachments/assets/54797c41-8184-4c6a-a096-d2fcf1652e5b)

We also compute the parallelization efficiency, which is displayed in the following figure.

![efficiency_vs_nthreads](https://github.com/user-attachments/assets/ce048c4c-d24d-4a5d-bba1-2632b9bf9a98)

It is also important to validate that the CUDA backend performs equally well. We therefore repeated the exercise and found that the new kernels accelerate `probs` for any number of targets.

![speedup_cuda](https://github.com/user-attachments/assets/e9b2841a-139a-47ac-b45f-0ca0ba4a074f)

**Possible Drawbacks:** There are many implementations, which decreases maintainability.

**Related GitHub Issues:** [sc-65198]

---------

Co-authored-by: ringo-but-quantum <github-ringo-but-quantum@xanadu.ai>
Co-authored-by: Ali Asadi <10773383+maliasadi@users.noreply.github.com>
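To make the "bitshift implementation akin to the gate kernels" idea concrete, here is a minimal serial sketch for a single target wire. It is not the Lightning-Kokkos kernel: the function name `probs_one_wire` and the mask names are hypothetical, and it assumes that wire 0 maps to the most significant bit of the basis-state index. Each loop counter `k` enumerates the non-target bits, and the masks open a zero at the target position so that both amplitudes of the pair are read directly.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Illustrative sketch only (hypothetical helper, not the PR's kernel).
// Assumes wire 0 is the most significant bit of the basis-state index.
std::vector<double>
probs_one_wire(const std::vector<std::complex<double>> &state,
               std::size_t num_qubits, std::size_t wire) {
    const std::size_t rev_wire = num_qubits - 1 - wire; // target bit position
    const std::size_t shift = std::size_t{1} << rev_wire;
    const std::size_t parity_low = shift - 1;            // bits below the target
    const std::size_t parity_high = ~((shift << 1) - 1); // bits above the target

    std::vector<double> probs(2, 0.0);
    for (std::size_t k = 0; k < state.size() / 2; ++k) {
        // Insert a zero at the target position of k.
        const std::size_t i0 = ((k << 1) & parity_high) | (k & parity_low);
        const std::size_t i1 = i0 | shift; // same index with the target bit set
        probs[0] += std::norm(state[i0]);
        probs[1] += std::norm(state[i1]);
    }
    return probs;
}
```

In a parallel backend the loop over `k` would presumably become a reduction. The sketch only shows the index arithmetic, which avoids any per-element search for the target bit.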
PennyLaneAI · Jul 24, 2024 · d1baa8f
1 parent 11bf9dc
commit d1baa8f
Showing 11 changed files with 749 additions and 253 deletions.
diff --git a/.github/CHANGELOG.md b/.github/CHANGELOG.md
@@ -24,6 +24,9 @@
 * Optimize the OpenMP parallelization of Lightning-Qubit's `probs` for all number of targets.
   [(#807)](https://github.com/PennyLaneAI/pennylane-lightning/pull/807)
 
+* Optimize `probs(wires)` of Lightning-Kokkos using various kernels. Which kernel is to be used depends on the device, number of qubits and number of target wires.
+  [(#802)](https://github.com/PennyLaneAI/pennylane-lightning/pull/802)
+
 * Add GPU device compute capability check for Lightning-Tensor.
   [(#803)](https://github.com/PennyLaneAI/pennylane-lightning/pull/803)
 

diff --git a/pennylane_lightning/core/_version.py b/pennylane_lightning/core/_version.py
@@ -16,4 +16,4 @@
    Version number (major.minor.patch[-label])
 """
 
-__version__ = "0.38.0-dev14"
+__version__ = "0.38.0-dev15"
diff --git a/pennylane_lightning/core/src/measurements/tests/Test_MeasurementsBase.cpp b/pennylane_lightning/core/src/measurements/tests/Test_MeasurementsBase.cpp
@@ -93,7 +93,6 @@ template <typename TypeList> void testProbabilities() {
                  {0.67078706, 0.03062806, 0.0870997, 0.00397696, 0.17564072,
                   0.00801973, 0.02280642, 0.00104134}}
 #else
-#if defined(_ENABLE_PLQUBIT)
                 // LightningQubit currently supports arbitrary wire index
                 // ordering.
                 {{0, 2, 1},
@@ -112,8 +111,6 @@ template <typename TypeList> void testProbabilities() {
                  {0.67078706, 0.17564072, 0.0870997, 0.02280642, 0.03062806,
                   0.00801973, 0.00397696, 0.00104134}},
                 {{2, 1}, {0.84642778, 0.10990612, 0.0386478, 0.0050183}},
-
-#endif
                 {{0, 1, 2},
                  {0.67078706, 0.03062806, 0.0870997, 0.00397696, 0.17564072,
                   0.00801973, 0.02280642, 0.00104134}},

diff --git a/pennylane_lightning/core/src/simulators/lightning_kokkos/StateVectorKokkos.hpp b/pennylane_lightning/core/src/simulators/lightning_kokkos/StateVectorKokkos.hpp
@@ -36,13 +36,15 @@
 #include "GateOperation.hpp"
 #include "StateVectorBase.hpp"
 #include "Util.hpp"
+#include "UtilKokkos.hpp"
 
 #include "CPUMemoryModel.hpp"
 
 /// @cond DEV
 namespace {
 using namespace Pennylane::Gates::Constant;
 using namespace Pennylane::LightningKokkos::Functors;
+using namespace Pennylane::LightningKokkos::Util;
 using Pennylane::Gates::GateOperation;
 using Pennylane::Gates::GeneratorOperation;
 using Pennylane::Util::array_contains;
@@ -151,12 +153,8 @@ class StateVectorKokkos final
     void setStateVector(const std::vector<std::size_t> &indices,
                         const std::vector<ComplexT> &values) {
         initZeros();
-        KokkosSizeTVector d_indices("d_indices", indices.size());
-        KokkosVector d_values("d_values", values.size());
-        Kokkos::deep_copy(d_indices, UnmanagedConstSizeTHostView(
-                                         indices.data(), indices.size()));
-        Kokkos::deep_copy(d_values, UnmanagedConstComplexHostView(
-                                        values.data(), values.size()));
+        auto d_indices = vector2view(indices);
+        auto d_values = vector2view(values);
         KokkosVector sv_view =
             getView(); // circumvent error capturing this with KOKKOS_LAMBDA
         Kokkos::parallel_for(
@@ -283,19 +281,13 @@ class StateVectorKokkos final
             PL_ABORT_IF(gate_matrix.empty(),
                         std::string("Operation does not exist for ") + opName +
                             std::string(" and no matrix provided."));
-            KokkosVector matrix("gate_matrix", gate_matrix.size());
-            Kokkos::deep_copy(
-                matrix, UnmanagedConstComplexHostView(gate_matrix.data(),
-                                                      gate_matrix.size()));
-            return applyMultiQubitOp(matrix, wires, inverse);
+            return applyMultiQubitOp(vector2view(gate_matrix), wires, inverse);
         }
     }
 
     template <bool inverse = false>
     void applyControlledGlobalPhase(const std::vector<ComplexT> &diagonal) {
-        KokkosVector diagonal_("diagonal_", diagonal.size());
-        Kokkos::deep_copy(diagonal_, UnmanagedConstComplexHostView(
-                                         diagonal.data(), diagonal.size()));
+        auto diagonal_ = vector2view(diagonal);
         auto two2N = BaseType::getLength();
         auto dataview = getView();
         Kokkos::parallel_for(
@@ -587,15 +579,11 @@ class StateVectorKokkos final
      * @brief Get underlying data vector
      */
     [[nodiscard]] auto getDataVector() -> std::vector<ComplexT> {
-        std::vector<ComplexT> data_(this->getLength());
-        DeviceToHost(data_.data(), data_.size());
-        return data_;
+        return view2vector(getView());
     }
 
     [[nodiscard]] auto getDataVector() const -> const std::vector<ComplexT> {
-        std::vector<ComplexT> data_(this->getLength());
-        DeviceToHost(data_.data(), data_.size());
-        return data_;
+        return view2vector(getView());
     }
 
     /**

diff --git a/pennylane_lightning/core/src/simulators/lightning_kokkos/gates/MatrixGateFunctors.hpp b/pennylane_lightning/core/src/simulators/lightning_kokkos/gates/MatrixGateFunctors.hpp
@@ -17,13 +17,14 @@
 #include <Kokkos_StdAlgorithms.hpp>
 
 #include "BitUtil.hpp"
-#include "BitUtilKokkos.hpp"
+#include "UtilKokkos.hpp"
 
 /// @cond DEV
 namespace {
 using namespace Pennylane::Util;
 using Kokkos::Experimental::swap;
 using Pennylane::LightningKokkos::Util::one;
+using Pennylane::LightningKokkos::Util::vector2view;
 using Pennylane::LightningKokkos::Util::wires2Parity;
 using std::size_t;
 } // namespace
@@ -55,11 +56,7 @@ template <class Precision> struct multiQubitOpFunctor {
     multiQubitOpFunctor(KokkosComplexVector arr_, std::size_t num_qubits_,
                         const KokkosComplexVector &matrix_,
                         const std::vector<std::size_t> &wires_) {
-        Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-                     Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-            wires_host(wires_.data(), wires_.size());
-        Kokkos::resize(wires, wires_host.size());
-        Kokkos::deep_copy(wires, wires_host);
+        wires = vector2view(wires_);
         dim = one << wires_.size();
         num_qubits = num_qubits_;
         arr = arr_;
@@ -122,10 +119,9 @@ template <class PrecisionT> struct apply1QubitOpFunctor {
     std::size_t wire_parity;
     std::size_t wire_parity_inv;
 
-    apply1QubitOpFunctor(
-        KokkosComplexVector arr_, std::size_t num_qubits_,
-        const KokkosComplexVector &matrix_,
-        [[maybe_unused]] const std::vector<std::size_t> &wires_) {
+    apply1QubitOpFunctor(KokkosComplexVector arr_, std::size_t num_qubits_,
+                         const KokkosComplexVector &matrix_,
+                         const std::vector<std::size_t> &wires_) {
         arr = arr_;
         matrix = matrix_;
         num_qubits = num_qubits_;
@@ -169,10 +165,9 @@ template <class PrecisionT> struct apply2QubitOpFunctor {
     std::size_t parity_high;
     std::size_t parity_middle;
 
-    apply2QubitOpFunctor(
-        KokkosComplexVector arr_, std::size_t num_qubits_,
-        const KokkosComplexVector &matrix_,
-        [[maybe_unused]] const std::vector<std::size_t> &wires_) {
+    apply2QubitOpFunctor(KokkosComplexVector arr_, std::size_t num_qubits_,
+                         const KokkosComplexVector &matrix_,
+                         const std::vector<std::size_t> &wires_) {
         arr = arr_;
         matrix = matrix_;
         num_qubits = num_qubits_;
@@ -238,11 +233,7 @@ template <class PrecisionT> struct apply3QubitOpFunctor {
     apply3QubitOpFunctor(KokkosComplexVector arr_, std::size_t num_qubits_,
                          const KokkosComplexVector &matrix_,
                          const std::vector<std::size_t> &wires_) {
-        Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-                     Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-            wires_host(wires_.data(), wires_.size());
-        Kokkos::resize(wires, wires_host.size());
-        Kokkos::deep_copy(wires, wires_host);
+        wires = vector2view(wires_);
         arr = arr_;
         matrix = matrix_;
         num_qubits = num_qubits_;
@@ -312,11 +303,7 @@ template <class PrecisionT> struct apply4QubitOpFunctor {
     apply4QubitOpFunctor(KokkosComplexVector arr_, std::size_t num_qubits_,
                          const KokkosComplexVector &matrix_,
                          const std::vector<std::size_t> &wires_) {
-        Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-                     Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-            wires_host(wires_.data(), wires_.size());
-        Kokkos::resize(wires, wires_host.size());
-        Kokkos::deep_copy(wires, wires_host);
+        wires = vector2view(wires_);
         arr = arr_;
         matrix = matrix_;
         num_qubits = num_qubits_;
@@ -422,11 +409,7 @@ template <class PrecisionT> struct apply5QubitOpFunctor {
     apply5QubitOpFunctor(KokkosComplexVector arr_, std::size_t num_qubits_,
                          const KokkosComplexVector &matrix_,
                          const std::vector<std::size_t> &wires_) {
-        Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-                     Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-            wires_host(wires_.data(), wires_.size());
-        Kokkos::resize(wires, wires_host.size());
-        Kokkos::deep_copy(wires, wires_host);
+        wires = vector2view(wires_);
         arr = arr_;
         matrix = matrix_;
         num_qubits = num_qubits_;
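The functors in this file carry fields such as `parity_high` and `parity_middle`. As a rough illustration of how masks of that kind are typically used, the sketch below computes the four basis-state indices a two-wire kernel would touch for a loop counter `k`. The function `two_wire_indices` and the helper `low_mask` are hypothetical, the layout (wire 0 as most significant bit) is an assumption, and the code is not copied from the functors.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Hypothetical helper: mask with the n lowest bits set (assumes n < 64).
inline std::size_t low_mask(std::size_t n) {
    return (std::size_t{1} << n) - 1;
}

// Illustrative sketch: indices {i00, i01, i10, i11} for two distinct wires,
// assuming wire 0 is the most significant bit of the basis-state index.
// In iXY, X is the bit of wire0 and Y is the bit of wire1.
std::array<std::size_t, 4> two_wire_indices(std::size_t k,
                                            std::size_t num_qubits,
                                            std::size_t wire0,
                                            std::size_t wire1) {
    const std::size_t rev0 = num_qubits - 1 - wire1; // bit position of wire1
    const std::size_t rev1 = num_qubits - 1 - wire0; // bit position of wire0
    const std::size_t rev_min = std::min(rev0, rev1);
    const std::size_t rev_max = std::max(rev0, rev1);

    const std::size_t parity_low = low_mask(rev_min);
    const std::size_t parity_high = ~low_mask(rev_max + 1);
    const std::size_t parity_middle = ~low_mask(rev_min + 1) & low_mask(rev_max);

    // Spread the bits of k around the two target positions (both left at 0).
    const std::size_t i00 = ((k << 2) & parity_high) |
                            ((k << 1) & parity_middle) | (k & parity_low);
    const std::size_t s0 = std::size_t{1} << rev0;
    const std::size_t s1 = std::size_t{1} << rev1;
    return {i00, i00 | s0, i00 | s1, i00 | s0 | s1};
}
```

For example, with three qubits and wires 0 and 2, `k = 1` sets only the bit of wire 1, so the four indices are 2, 3, 6 and 7.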

diff --git a/pennylane_lightning/core/src/simulators/lightning_kokkos/measurements/ExpValFunctors.hpp b/pennylane_lightning/core/src/simulators/lightning_kokkos/measurements/ExpValFunctors.hpp
@@ -15,12 +15,14 @@
 #include <Kokkos_Core.hpp>
 
 #include "BitUtil.hpp"
-#include "BitUtilKokkos.hpp"
+#include "Error.hpp"
+#include "UtilKokkos.hpp"
 
 /// @cond DEV
 namespace {
 using namespace Pennylane::Util;
 using Pennylane::LightningKokkos::Util::one;
+using Pennylane::LightningKokkos::Util::vector2view;
 using Pennylane::LightningKokkos::Util::wires2Parity;
 } // namespace
 /// @endcond
@@ -186,12 +188,7 @@ template <class PrecisionT> struct getExpValMultiQubitOpFunctor {
                                  std::size_t num_qubits_,
                                  const KokkosComplexVector &matrix_,
                                  const std::vector<std::size_t> &wires_) {
-        Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-                     Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-            wires_host(wires_.data(), wires_.size());
-        Kokkos::resize(wires, wires_.size());
-        Kokkos::deep_copy(wires, wires_host);
-
+        wires = vector2view(wires_);
         dim = one << wires_.size();
         num_qubits = num_qubits_;
         arr = arr_;
@@ -289,10 +286,10 @@ template <class PrecisionT> struct getExpVal1QubitOpFunctor {
     std::size_t wire_parity;
     std::size_t wire_parity_inv;
 
-    getExpVal1QubitOpFunctor(
-        const KokkosComplexVector &arr_, const std::size_t num_qubits_,
-        const KokkosComplexVector &matrix_,
-        [[maybe_unused]] const std::vector<std::size_t> &wires_) {
+    getExpVal1QubitOpFunctor(const KokkosComplexVector &arr_,
+                             const std::size_t num_qubits_,
+                             const KokkosComplexVector &matrix_,
+                             const std::vector<std::size_t> &wires_) {
         arr = arr_;
         matrix = matrix_;
         num_qubits = num_qubits_;
@@ -343,10 +340,10 @@ template <class PrecisionT> struct getExpVal2QubitOpFunctor {
     std::size_t parity_high;
     std::size_t parity_middle;
 
-    getExpVal2QubitOpFunctor(
-        const KokkosComplexVector &arr_, const std::size_t num_qubits_,
-        const KokkosComplexVector &matrix_,
-        [[maybe_unused]] const std::vector<std::size_t> &wires_) {
+    getExpVal2QubitOpFunctor(const KokkosComplexVector &arr_,
+                             const std::size_t num_qubits_,
+                             const KokkosComplexVector &matrix_,
+                             const std::vector<std::size_t> &wires_) {
         arr = arr_;
         matrix = matrix_;
         num_qubits = num_qubits_;
@@ -407,12 +404,7 @@ template <class PrecisionT> struct getExpVal3QubitOpFunctor {
                              const std::size_t num_qubits_,
                              const KokkosComplexVector &matrix_,
                              const std::vector<std::size_t> &wires_) {
-        Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-                     Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-            wires_host(wires_.data(), wires_.size());
-        Kokkos::resize(wires, wires_.size());
-        Kokkos::deep_copy(wires, wires_host);
-
+        wires = vector2view(wires_);
         arr = arr_;
         matrix = matrix_;
         num_qubits = num_qubits_;
@@ -478,11 +470,7 @@ template <class PrecisionT> struct getExpVal4QubitOpFunctor {
                              const std::size_t num_qubits_,
                              const KokkosComplexVector &matrix_,
                              const std::vector<std::size_t> &wires_) {
-        Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-                     Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-            wires_host(wires_.data(), wires_.size());
-        Kokkos::resize(wires, wires_.size());
-        Kokkos::deep_copy(wires, wires_host);
+        wires = vector2view(wires_);
         arr = arr_;
         matrix = matrix_;
         num_qubits = num_qubits_;
@@ -577,11 +565,7 @@ template <class PrecisionT> struct getExpVal5QubitOpFunctor {
                              const std::size_t num_qubits_,
                              const KokkosComplexVector &matrix_,
                              const std::vector<std::size_t> &wires_) {
-        Kokkos::View<const std::size_t *, Kokkos::HostSpace,
-                     Kokkos::MemoryTraits<Kokkos::Unmanaged>>
-            wires_host(wires_.data(), wires_.size());
-        Kokkos::resize(wires, wires_.size());
-        Kokkos::deep_copy(wires, wires_host);
+        wires = vector2view(wires_);
         arr = arr_;
         matrix = matrix_;
         num_qubits = num_qubits_;