-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Lightning-Qubit's probs(wires)
using bitshift implementation
#795
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #795 +/- ##
==========================================
+ Coverage 94.95% 98.64% +3.68%
==========================================
Files 132 114 -18
Lines 19542 17653 -1889
==========================================
- Hits 18557 17414 -1143
+ Misses 985 239 -746 ☔ View full report in Codecov by Sentry. |
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementKernels.hpp
Show resolved
Hide resolved
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementKernels.hpp
Show resolved
Hide resolved
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementKernels.hpp
Show resolved
Hide resolved
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementsLQubit.hpp
Outdated
Show resolved
Hide resolved
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementsLQubit.hpp
Show resolved
Hide resolved
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementsLQubit.hpp
Show resolved
Hide resolved
...lightning/core/src/simulators/lightning_qubit/measurements/tests/Test_MeasurementsLQubit.cpp
Show resolved
Hide resolved
...lightning/core/src/simulators/lightning_qubit/measurements/tests/Test_MeasurementsLQubit.cpp
Show resolved
Hide resolved
...lightning/core/src/simulators/lightning_qubit/measurements/tests/Test_MeasurementsLQubit.cpp
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @vincentmr ! Just a few thoughts and comments.
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementKernels.hpp
Show resolved
Hide resolved
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementKernels.hpp
Show resolved
Hide resolved
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementKernels.hpp
Show resolved
Hide resolved
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementsLQubit.hpp
Outdated
Show resolved
Hide resolved
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementsLQubit.hpp
Outdated
Show resolved
Hide resolved
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementsLQubit.hpp
Show resolved
Hide resolved
pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementKernels.hpp
Show resolved
Hide resolved
…ements/MeasurementKernels.hpp Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks good for now.
We could maybe come back to this in the future and try to remove the usage of macros.
Algorithm-wise it is a super cool implementation!
100% agree, I wish the compiler would be smarter optimizing the generic implementation. Or maybe there is a smarter way to write it to help the compiler ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Thanks @vincentmr !
### Before submitting Please complete the following checklist when submitting a PR: - [x] All new features must include a unit test. If you've fixed a bug or added code that should be tested, add a test to the [`tests`](../tests) directory! - [x] All new functions and code must be clearly commented and documented. If you do make documentation changes, make sure that the docs build and render correctly by running `make docs`. - [x] Ensure that the test suite passes, by running `make test`. - [x] Add a new entry to the `.github/CHANGELOG.md` file, summarizing the change, and including a link back to the PR. - [x] Ensure that code is properly formatted by running `make format`. When all the above are checked, delete everything above the dashed line and fill in the pull request template. ------------------------------------------------------------------------------------------------------------ **Context:** `probs` is central in circuit simulation measurements. **Description of the Change:** Parallelize `probs` loops using OpenMP. **Benefits:** Faster execution with several threads. The following benchmarks are performed on ISAIC's AMD EPYC-Milan Processor using a several core/threads. The times are obtained averaging the computation of `probs(target)` 5 times for various number of targets. We use the last release implementation as a reference. Since #795 brings some speed-ups even for a single thread, this is why we observe speed-ups > number of threads. ![speedup_vs_nthreads](https://github.com/user-attachments/assets/fdb762bb-8e50-4337-b2c0-7b16a42e8dad) Another view on the data is the strong scaling efficiency. It is almost perfect for 2-4 threads, fairly good for 8 threads and diminishes significantly for 16 threads. ![efficiency_vs_nthreads](https://github.com/user-attachments/assets/4feca17e-a461-407c-947c-a0bc54c21b2a) **Possible Drawbacks:** **Related GitHub Issues:** --------- Co-authored-by: ringo-but-quantum <github-ringo-but-quantum@xanadu.ai> Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com>
### Before submitting Please complete the following checklist when submitting a PR: - [x] All new features must include a unit test. If you've fixed a bug or added code that should be tested, add a test to the [`tests`](../tests) directory! - [x] All new functions and code must be clearly commented and documented. If you do make documentation changes, make sure that the docs build and render correctly by running `make docs`. - [x] Ensure that the test suite passes, by running `make test`. - [x] Add a new entry to the `.github/CHANGELOG.md` file, summarizing the change, and including a link back to the PR. - [x] Ensure that code is properly formatted by running `make format`. When all the above are checked, delete everything above the dashed line and fill in the pull request template. ------------------------------------------------------------------------------------------------------------ **Context:** `sample` call `generate_samples` which computes the full probabilities and uses the alias method to generate samples for all wires. This is wasteful whenever samples are required only for a subset of all wires. **Description of the Change:** Move alias method logic to the `discrete_random_variable` class. **Benefits:** Compute minimal probs and samples. We benchmark the current changes against `master`, which already benefits from some good speed-ups introduce in #795 and #800 . We use ISAIC's AMD EPYC-Milan Processor using a single core/thread. The times are obtained using at least 5 experiments and running for at least 250 milliseconds. We begin comparing `master`'s `generate_samples(num_samples)` with ours `generate_samples({0}, num_samples)`. For 4-12 qubits, overheads dominate the calculation (the absolute times range from 6 microseconds to 18 milliseconds, which is not a lot. Already at 12 qubits however, a trend appears where our implementation is significantly faster. This is to be expected for two reason: `probs(wires)` itself is faster than `probs()` (for enough qubits) and `sample(wires)` starts also requiring significantly less work than `sample()`. ![speedup_vs_nthreads_1w](https://github.com/user-attachments/assets/472748e9-d812-489c-a00f-2b2b74c7e682) Next we turn to comparing `master`'s `generate_samples(num_samples)` with ours `generate_samples({0..num_qubits/2}, num_samples)`. The situation there is similar, with speed-ups close to 1 for the smaller qubit counts and (sometimes) beyond 20x for qubit counts above 20. ![speedup_vs_nthreads_hfw](https://github.com/user-attachments/assets/f39e3ccd-8051-4a57-a857-9cd13f547865) Finally we compare `master`'s `generate_samples(num_samples)` with ours `generate_samples({0..num_qubits-1}, num_samples)` (i.e. computing samples on all wires). We expect similar performance since the main difference comes from the caching mechanism in `master`'s discrete random variable generator. The data suggests caching samples is counter-productive compared with calculating the sample values on the fly. ![speedup_vs_nthreads_fullw](https://github.com/user-attachments/assets/2c70ed21-2236-479e-be3d-6017b42fdc5e) Turning OMP ON, using 16 threads and comparing `master`'s `generate_samples(num_samples)` with ours `generate_samples({0}, num_samples)` we get good speed-ups above 12 qubits. Below that the overhead of spawning threads isn't repaid, but absolute times remain low. ![speedup_vs_omp16_1w](https://github.com/user-attachments/assets/e3e90a55-399f-4a5b-b90e-7059a0486228) **Possible Drawbacks:** **Related GitHub Issues:** [sc-65127] --------- Co-authored-by: ringo-but-quantum <github-ringo-but-quantum@xanadu.ai> Co-authored-by: Ali Asadi <10773383+maliasadi@users.noreply.github.com>
Before submitting
Please complete the following checklist when submitting a PR:
All new features must include a unit test.
If you've fixed a bug or added code that should be tested, add a test to the
tests
directory!All new functions and code must be clearly commented and documented.
If you do make documentation changes, make sure that the docs build and
render correctly by running
make docs
.Ensure that the test suite passes, by running
make test
.Add a new entry to the
.github/CHANGELOG.md
file, summarizing thechange, and including a link back to the PR.
Ensure that code is properly formatted by running
make format
.When all the above are checked, delete everything above the dashed
line and fill in the pull request template.
Context:
probs
is central in circuit simulation measurements.Description of the Change:
Implement
probs(wires)
using bitshift implementation akin to the gate kernels in Lightning-Qubit. This does away with the NDpermuter dependency, which is removed.Benefits:
![speedup_target0](https://private-user-images.githubusercontent.com/8711156/348345223-798a3385-d048-44e2-a0ce-4b34a8d7321b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMzA4NTEsIm5iZiI6MTczOTIzMDU1MSwicGF0aCI6Ii84NzExMTU2LzM0ODM0NTIyMy03OThhMzM4NS1kMDQ4LTQ0ZTItYTBjZS00YjM0YThkNzMyMWIucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTBUMjMzNTUxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9Mzc2ZWVjOTVhYmVhZmU1NTFjMTAxYjE0ODI1YmQ2MTc1YmNmNzAyODdjYmFiYTJhN2RiMmQwZTJhY2FlMjcyYyZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ._3h9t7eCIWWL5736SOxqSag57Ik-2PcmQkIkfy-0Bmw)
![speedup_targetavg](https://private-user-images.githubusercontent.com/8711156/348345227-afa26519-c1ca-4e04-92b5-508304ea0020.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMzA4NTEsIm5iZiI6MTczOTIzMDU1MSwicGF0aCI6Ii84NzExMTU2LzM0ODM0NTIyNy1hZmEyNjUxOS1jMWNhLTRlMDQtOTJiNS01MDgzMDRlYTAwMjAucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTBUMjMzNTUxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9Nzg5MzYyZDJlNmY0YTNiOGQyZTZkNzY0MzUzOGViZjVkYWVlM2I3ODJmMDQwNWI4NGE0YjEzM2Q5NTZiMDE4MiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.1lDpMh0I28W9FN5SMUcivf-Tyhd1V8Fox1BoSZU9Jhg)
Enable on-the-fly diagonalization.
Faster execution.
The following benchmarks are performed on ISAIC's AMD EPYC-Milan Processor using a single core/thread. The times are obtained averaging the computation of
probs(target)
5 times, wheretarget
includes a single wire. The speed-ups are uniform across all possible targets, and hence we showtarget={0}
and the average speed-up across all targets below.The situation is more ambiguous when comparing w.r.t. the number of targets. The following shows timings for
targets = {0..ntargets-1}
and the conclusions are:probs(Hermitian)
would be requested (since at some pointHermitian
becomes too large to represent as a matrix).Possible Drawbacks:
Related GitHub Issues:
[sc-65198]