Context
It may be worth a quick, prior read about QuEST's software architecture.
SWAP
The SWAP gate is ubiquitous in quantum computing and effects a change upon the quantum state as if the targeted qubits were physically swapped or relabelled.
Let targ1 and targ2 denote the two targeted qubits.
When SWAP is applied, every pair of basis states whose bits at targ1 and targ2 are 01 and 10 (with all other bits identical) have their amplitudes exchanged, while basis states whose bits at targ1 and targ2 agree are left untouched.
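Explicitly, writing $t_1$ and $t_2$ for the targeted qubits, the textbook action and matrix of SWAP (standard facts, restated here only for reference) are:

```math
\mathrm{SWAP}\,|x\rangle_{t_1}|y\rangle_{t_2} = |y\rangle_{t_1}|x\rangle_{t_2},
\qquad
\mathrm{SWAP} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.
```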
QuEST's code to simulate a SWAP gate in non-distributed settings resembles:
QuEST/quest/src/cpu/cpu_subroutines.cpp
Lines 258 to 282 in 53f8f3a
```cpp
template <int NumCtrls>
void cpu_statevec_anyCtrlSwap_subA(Qureg qureg, vector<int> ctrls, vector<int> ctrlStates, int targ1, int targ2) {

    assert_numCtrlsMatchesNumCtrlStatesAndTemplateParam(ctrls.size(), ctrlStates.size(), NumCtrls);

    // each control qubit halves the number of iterations, each of which modifies 2 amplitudes, and skips 2
    qindex numIts = qureg.numAmpsPerNode / powerOf2(2 + ctrls.size());

    auto sortedQubits   = util_getSorted(ctrls, {targ2, targ1});
    auto qubitStateMask = util_getBitMask(ctrls, ctrlStates, {targ2, targ1}, {0, 1});

    // use template param to compile-time unroll loop in insertBits()
    SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrls.size());
    int numQubitBits = numCtrlBits + 2;

    #pragma omp parallel for if(qureg.isMultithreaded)
    for (qindex n=0; n<numIts; n++) {

        // i01 = nth local index where ctrls are active, targ2=0 and targ1=1
        qindex i01 = insertBitsWithMaskedValues(n, sortedQubits.data(), numQubitBits, qubitStateMask);
        qindex i10 = flipTwoBits(i01, targ2, targ1);

        std::swap(qureg.cpuAmps[i01], qureg.cpuAmps[i10]);
    }
}
```
Read how to simulate this operation in distributed settings here (Sec IV C).
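For orientation, here is a rough sketch of that distributed scheme (not QuEST's actual code), assuming the usual convention that a global amplitude index is the node rank ("prefix" bits) followed by the local index ("suffix" bits); the types and MPI usage are illustrative only:

```cpp
#include <mpi.h>
#include <complex>
#include <vector>

using qcomp  = std::complex<double>;   // illustrative; QuEST's qcomp is precision-configurable
using qindex = long long int;

// Sketch of one distributed SWAP between a suffix qubit (suffixTarg, addressed by the
// local amplitude index) and a prefix qubit (prefixTarg, addressed by the node rank).
// Each node exchanges exactly half of its amplitudes with a single partner node.
void sketchDistributedSwap(std::vector<qcomp>& amps, int rank, int logNumAmpsPerNode,
                           int suffixTarg, int prefixTarg) {
    qindex numAmps = qindex(1) << logNumAmpsPerNode;

    // the bit of this node's rank which encodes the prefix target qubit
    int rankBit = (rank >> (prefixTarg - logNumAmpsPerNode)) & 1;

    // the unique partner node differs from this node only at that rank bit
    int pairRank = rank ^ (1 << (prefixTarg - logNumAmpsPerNode));

    // only amplitudes whose suffix-target bit is the complement of rankBit change node
    std::vector<qcomp> buffer;
    buffer.reserve(numAmps / 2);
    for (qindex i = 0; i < numAmps; i++)
        if (((i >> suffixTarg) & 1) != rankBit)
            buffer.push_back(amps[i]);

    // swap the packed halves between the two partner nodes
    MPI_Sendrecv_replace(buffer.data(), (int) buffer.size(), MPI_CXX_DOUBLE_COMPLEX,
                         pairRank, 0, pairRank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // the received amplitudes land back in the very slots they were packed from
    qindex j = 0;
    for (qindex i = 0; i < numAmps; i++)
        if (((i >> suffixTarg) & 1) != rankBit)
            amps[i] = buffer[j++];
}
```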
multi-SWAP
Many quantum algorithms feature circuits containing a contiguous sequence of non-overlapping SWAP gates.
Some unrelated QuEST routines even use a sequence of SWAPs to reorder qubits, in order to change the communication pattern during distributed simulation.
This is called "cache blocking" and is used by distributed simulators like cuQuantum, Qiskit and mpiQulacs, often as their primary means of distribution. In QuEST, this technique is necessary when effecting arbitrary matrices targeting two or more qubits, such as via applyCompMatr2. See Sec. IV D (pg 21) here for an explanation.
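To make the reordering trick concrete, here is a minimal sketch (a hypothetical helper, not QuEST's actual selection logic) of how the SWAP pairs might be chosen: each prefix target is paired with a free suffix qubit, the pairs are SWAPped (incurring communication), the matrix is then applied purely locally, and the same SWAPs are finally undone.

```cpp
#include <algorithm>
#include <vector>

// Returns {suffix destination, prefix target} pairs which, when SWAPped, move every
// prefix target of a many-target operation into the suffix so it can be applied locally.
// Assumes there are enough free suffix qubits, as QuEST requires for local application.
std::vector<std::pair<int,int>> chooseReorderingSwaps(std::vector<int> targs, int numSuffixQubits) {

    std::vector<std::pair<int,int>> swaps;

    // suffix qubits which are not already targeted are free destinations
    std::vector<int> freeSuffix;
    for (int q = 0; q < numSuffixQubits; q++)
        if (std::find(targs.begin(), targs.end(), q) == targs.end())
            freeSuffix.push_back(q);

    // each prefix target gets moved to a distinct free suffix qubit
    size_t next = 0;
    for (int t : targs)
        if (t >= numSuffixQubits)
            swaps.push_back({freeSuffix[next++], t});

    return swaps;
}
```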
It is prudent to simulate a multi-SWAP (i.e. a sequence of contiguous SWAP gates) as fast as possible, and to reduce its communication costs when simulated in distributed settings.
Problem
Currently, when QuEST needs to perform a multi-SWAP, it simply performs each individual SWAP in turn:
QuEST/quest/src/core/localiser.cpp
Lines 906 to 932 in 53f8f3a
```cpp
void anyCtrlMultiSwapBetweenPrefixAndSuffix(Qureg qureg, vector<int> ctrls, vector<int> ctrlStates, vector<int> targsA, vector<int> targsB) {

    // this is an internal function called by the below routines which require
    // performing a sequence of SWAPs to reorder qubits, or move them into suffix.
    // the SWAPs act on unique qubit pairs and so commute.

    /// @todo
    /// - the sequence of pair-wise full-swaps should be more efficient as a
    ///   "single" sequence of smaller messages sending amps directly to their
    ///   final destination node. This could use a new "multiSwap" function.
    /// - if the user has compiled cuQuantum, and Qureg is GPU-accelerated, the
    ///   multiSwap function should use custatevecSwapIndexBits() if local,
    ///   or custatevecDistIndexBitSwapSchedulerSetIndexBitSwaps() if distributed,
    ///   although the latter requires substantially more work like setting up
    ///   a communicator which may be inelegant alongside our own distribution scheme.

    // perform necessary swaps to move all targets into suffix, each of which invokes communication
    for (size_t i=0; i<targsA.size(); i++) {

        if (targsA[i] == targsB[i])
            continue;

        int suffixTarg = std::min(targsA[i], targsB[i]);
        int prefixTarg = std::max(targsA[i], targsB[i]);
        anyCtrlSwapBetweenPrefixAndSuffix(qureg, ctrls, ctrlStates, suffixTarg, prefixTarg);
    }
}
```
This is inefficient! Each successive SWAP does not modify any amplitudes, but merely moves them. It should be much faster to move each amplitude straight to its final location, performing all of the swaps "at once". This is called "SWAP fusion", first reported here, and it requires more complicated communication logic than the single-SWAP case. SWAP fusion is implemented by e.g. cuQuantum and mpiQulacs.
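The key observation behind SWAP fusion is that a sequence of disjoint SWAPs is simply a permutation of the bits of each amplitude's index, so every amplitude's final global index (and hence its final node) can be computed directly. A minimal sketch of that index arithmetic (illustrative only, not a proposed interface):

```cpp
#include <utility>
#include <vector>

using qindex = long long int;

// Returns the global index to which the amplitude at globalInd moves under a sequence
// of SWAPs acting on disjoint qubit pairs; since the pairs are disjoint, the SWAPs
// commute and the order of the pairs is irrelevant.
qindex fusedSwapIndex(qindex globalInd, const std::vector<std::pair<int,int>>& swapPairs) {

    for (auto [qa, qb] : swapPairs) {
        // exchange bits qa and qb of the index
        int ba = (globalInd >> qa) & 1;
        int bb = (globalInd >> qb) & 1;
        if (ba != bb)
            globalInd ^= (qindex(1) << qa) | (qindex(1) << qb);
    }

    // the destination node is the prefix (upper bits) of the returned index,
    // and the destination local index is its suffix (lower bits)
    return globalInd;
}
```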
Solution
Optimising multi-SWAP by implementing SWAP fusion would greatly benefit QuEST's distributed simulation of many-target matrix gates and Kraus maps (which use the aforementioned reordering trick). Implementing SWAP fusion will involve replacing the one-by-one swaps in localiser.cpp's anyCtrlMultiSwapBetweenPrefixAndSuffix with a new, custom, distributed routine in comm_routines.cpp. It may also require locally-parallelised (OpenMP and CUDA) routines to pack communication buffers, implemented in cpu_subroutines.cpp and gpu_subroutines.cpp.
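As a rough starting point (not a proposal for the final comm_routines.cpp signature), the fused communication could be structured as below. It reuses fusedSwapIndex() from the sketch above and, for clarity, writes into a full-size scratch array rather than QuEST's existing communication buffer, making no attempt at the parallelised buffer packing mentioned above.

```cpp
#include <mpi.h>
#include <complex>
#include <utility>
#include <vector>

using qcomp  = std::complex<double>;
using qindex = long long int;

qindex fusedSwapIndex(qindex globalInd, const std::vector<std::pair<int,int>>& swapPairs); // from the earlier sketch

// Sketch of a fused multi-SWAP: every amplitude is sent directly to its final node,
// crossing the network at most once rather than potentially once per constituent SWAP.
void sketchFusedMultiSwap(std::vector<qcomp>& amps, int rank, int logNumAmpsPerNode,
                          const std::vector<std::pair<int,int>>& swapPairs) {
    qindex numAmps = qindex(1) << logNumAmpsPerNode;
    int numNodes;  MPI_Comm_size(MPI_COMM_WORLD, &numNodes);

    std::vector<qcomp> out(numAmps), sendBuf, recvBuf(numAmps);
    auto destOf = [&](int srcRank, qindex i) {
        return fusedSwapIndex((qindex(srcRank) << logNumAmpsPerNode) | i, swapPairs); };

    // amplitudes whose fused destination stays on this node are permuted locally
    for (qindex i = 0; i < numAmps; i++) {
        qindex dest = destOf(rank, i);
        if ((dest >> logNumAmpsPerNode) == rank)
            out[dest & (numAmps - 1)] = amps[i];
    }

    // every other amplitude is exchanged once with the node it finally belongs to
    for (int pairRank = 0; pairRank < numNodes; pairRank++) {
        if (pairRank == rank) continue;

        // pack the amplitudes whose fused destination lies on pairRank
        sendBuf.clear();
        for (qindex i = 0; i < numAmps; i++)
            if ((destOf(rank, i) >> logNumAmpsPerNode) == pairRank)
                sendBuf.push_back(amps[i]);
        if (sendBuf.empty()) continue;   // by symmetry, pairRank also sends us nothing

        MPI_Sendrecv(sendBuf.data(), (int) sendBuf.size(), MPI_CXX_DOUBLE_COMPLEX, pairRank, 0,
                     recvBuf.data(), (int) sendBuf.size(), MPI_CXX_DOUBLE_COMPLEX, pairRank, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // the j-th received amplitude is pairRank's j-th amplitude destined for this node
        qindex j = 0;
        for (qindex i = 0; i < numAmps; i++) {
            qindex dest = destOf(pairRank, i);
            if ((dest >> logNumAmpsPerNode) == rank)
                out[dest & (numAmps - 1)] = recvBuf[j++];
        }
    }
    amps = out;
}
```

Because the fused permutation is an involution, the number of amplitudes node A sends to node B equals the number B sends back, so each pairwise MPI_Sendrecv above is balanced; in total each amplitude crosses the network at most once, rather than potentially once per constituent SWAP as in the current scheme.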
An external contribution need not attempt implementing the GPU logic; the OpenMP logic alone is fine.
Implementing this will also require benchmarking to confirm the fused SWAPs are faster, including when each distributed node is not parallelised (with multithreading or GPU-acceleration), to ensure the communication benefits are not outweighed by additional local processing/branching (which is of course very unlikely).
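For the benchmarking step, a timing harness along these lines may suffice. This is a sketch only: it assumes it lives inside QuEST's own source or test tree (so that Qureg and the internal localiser routine shown above are visible), uses the current unfused routine as the baseline, and would time the fused routine identically.

```cpp
#include <mpi.h>
#include <vector>

// returns the mean wall-clock seconds per multi-SWAP across `reps` repetitions,
// synchronising all nodes before and after so the slowest node is what gets measured
double timeMultiSwap(Qureg qureg, std::vector<int> targsA, std::vector<int> targsB, int reps) {

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int r = 0; r < reps; r++)
        anyCtrlMultiSwapBetweenPrefixAndSuffix(qureg, {}, {}, targsA, targsB);

    MPI_Barrier(MPI_COMM_WORLD);
    return (MPI_Wtime() - start) / reps;
}
```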