Skip to content

Implement SWAP fusion #595

@TysonRayJones

Description

@TysonRayJones

Context

It may be worth a quick, prior read about QuEST's software architecture.

SWAP

The SWAP gate is ubiquitous in quantum computing and effects a change upon the quantum state as if the targeted qubits were physically swapped or relabelled.

Image

Let $i_{[t]}$ notate the $t$-th bit of an unsigned integer $i$, and let $i_{\neg {a,b}}$ notate the unsigned integer resulting from flipping two bits of $i$; those at indices $a$ and $b$. As outlined here (Sec IV C), the SWAP gate targeting qubits $t_1$ and $t_2$ modifies the $i$-th basis state $\ket{i}$ as

$$ SWAP_{t_1,t_2} \ket{i} = \begin{cases} \ket{i} \text{ when } i_{[t_1]} = i_{[t_2]}, \text{ else } \ket{i_{\neg {t_1,t_2}}} \end{cases}$$

When $SWAP_{t_1,t_2}$ operates upon a statevector $\ket{\psi} = \sum\limits_i \alpha_i \ket{i}$, only amplitudes $\alpha_i$ which satisfy $i_{[t_1]} \ne i_{[t_2]}$ are changed; they are swapped with a pair amplitude at index $i\neg{t_1,t_2}$.

$$ \alpha_i \rightarrow \alpha_{i\neg{t_1,t_2}} $$

QuEST's code to simulate a SWAP gate in non-distributed settings resembles:

template <int NumCtrls>
void cpu_statevec_anyCtrlSwap_subA(Qureg qureg, vector<int> ctrls, vector<int> ctrlStates, int targ1, int targ2) {
assert_numCtrlsMatchesNumCtrlStatesAndTemplateParam(ctrls.size(), ctrlStates.size(), NumCtrls);
// each control qubit halves the number of iterations, each of which modifies 2 amplitudes, and skips 2
qindex numIts = qureg.numAmpsPerNode / powerOf2(2 + ctrls.size());
auto sortedQubits = util_getSorted(ctrls, {targ2, targ1});
auto qubitStateMask = util_getBitMask(ctrls, ctrlStates, {targ2, targ1}, {0, 1});
// use template param to compile-time unroll loop in insertBits()
SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrls.size());
int numQubitBits = numCtrlBits + 2;
#pragma omp parallel for if(qureg.isMultithreaded)
for (qindex n=0; n<numIts; n++) {
// i01 = nth local index where ctrls are active, targ2=0 and targ1=1
qindex i01 = insertBitsWithMaskedValues(n, sortedQubits.data(), numQubitBits, qubitStateMask);
qindex i10 = flipTwoBits(i01, targ2, targ1);
std::swap(qureg.cpuAmps[i01], qureg.cpuAmps[i10]);
}
}

Read how to simulate this operation in distributed settings here (Sec IV C).

multi-SWAP

Many quantum algorithms feature circuits containing a contiguous sequence of non-overlapping SWAP gates.

Image

Some unrelated QuEST routines even use a sequence of SWAPs to reorder qubits, in order to change the communication pattern during distributed simulation.

Image

This is called "cache blocking" and is used by distributed simulators like cuQuantum, Qiskit, mpiQulacs, often as the primary means of distribution. In QuEST, this technique is necessary when effecting arbitrary matrices targeting two or more qubits, like via applyCompMatr2. See Sec. IV D (pg 21) here for an explanation.

It is prudent to simulate a multi-SWAP (i.e. a sequence of contiguous SWAP gates) as fast as possible, and to reduce its communication costs when simulated in distributed settings.

Problem

Currently, when QuEST needs to perform a multi-SWAP, it simply performs each individual SWAP in-turn:

void anyCtrlMultiSwapBetweenPrefixAndSuffix(Qureg qureg, vector<int> ctrls, vector<int> ctrlStates, vector<int> targsA, vector<int> targsB) {
// this is an internal function called by the below routines which require
// performing a sequence of SWAPs to reorder qubits, or move them into suffix.
// the SWAPs act on unique qubit pairs and so commute.
/// @todo
/// - the sequence of pair-wise full-swaps should be more efficient as a
/// "single" sequence of smaller messages sending amps directly to their
/// final destination node. This could use a new "multiSwap" function.
/// - if the user has compiled cuQuantum, and Qureg is GPU-accelerated, the
/// multiSwap function should use custatevecSwapIndexBits() if local,
/// or custatevecDistIndexBitSwapSchedulerSetIndexBitSwaps() if distributed,
/// although the latter requires substantially more work like setting up
/// a communicator which may be inelegant alongside our own distribution scheme.
// perform necessary swaps to move all targets into suffix, each of which invokes communication
for (size_t i=0; i<targsA.size(); i++) {
if (targsA[i] == targsB[i])
continue;
int suffixTarg = std::min(targsA[i], targsB[i]);
int prefixTarg = std::max(targsA[i], targsB[i]);
anyCtrlSwapBetweenPrefixAndSuffix(qureg, ctrls, ctrlStates, suffixTarg, prefixTarg);
}
}

This is inefficient! Each successive SWAP is not modifying any amplitudes, but merely moving them. It should be much faster to move amplitude each straight to their final locations, performing all swaps "at once". This is called "SWAP fusion", first reported here, and requires more complicated communication logic than the single-SWAP case. SWAP fusion is implemented by e.g. cuQuantum and mpiQulacs.

Solution

Optimising multi-SWAP via implementing SWAP fusion would greatly benefit QuEST's distributed simulation of many-target matrix gates and Kraus maps (which use the aforementioned reordering trick). Implementing SWAP fusion will involve replacing the one-by-one swaps in localiser.cpp's anyCtrlMultiSwapBetweenPrefixAndSuffix with a new, custom, distributed routine in comm_routines.cpp. It may also require locally-parallelised (OpenMP and CUDA) routines to pack communication buffers, implemented in cpu_subroutines.cpp and gpu_subroutines.cpp.

An external contribution need not attempt implementing the GPU logic; the OpenMP logic alone is fine.

Implementing this will also require benchmarking to confirm the fused SWAPs are faster, including when each distributed node is not parallelised (with multithreading or GPU-acceleration), to ensure the communication benefits are not outweighed by additional local processing/branching (which is of course very unlikely).

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementunitaryhack2025Issues associated with challenges in unitaryHACK 2025

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions