cuFINUFFT vs. FINUFFT: Slowdown in cuFINUFFT for 3D Stacked Type 1 Transform #649

remy-abergel · 2025-04-01T12:55:01Z


Hi,

My apologies for posting multiple times these days. I am currently working with 3D stacked Type 1 transforms (through the Python interface), and for the first time since using (cu)FINUFFT, I have observed a slower execution time with cuFINUFFT compared to FINUFFT. The settings are as follows:

M = 595350
N1 = N2 = N3 = 32
n_trans = 684
dtype = 'complex64' (single precision)
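As a rough sanity check on these sizes, the following sketch estimates the memory footprint of the input and output arrays of one stacked transform. It assumes 8 bytes per complex64 value and counts only the two user-facing arrays; internal work arrays of (cu)FINUFFT are not included, and the helper name `stacked_type1_footprint` is hypothetical.

```python
def stacked_type1_footprint(M, n_modes, n_trans, bytes_per_value=8):
    """Return (input_bytes, output_bytes) for a stacked type 1 transform.

    Only the strengths array (n_trans x M) and the output modes array
    (n_trans x prod(n_modes)) are counted; library work arrays are ignored.
    """
    n_out = 1
    for n in n_modes:
        n_out *= n
    input_bytes = n_trans * M * bytes_per_value
    output_bytes = n_trans * n_out * bytes_per_value
    return input_bytes, output_bytes


inp, out = stacked_type1_footprint(M=595350, n_modes=(32, 32, 32), n_trans=684)
print(f"input strengths: {inp / 1e9:.2f} GB")
print(f"output modes   : {out / 1e9:.2f} GB")
```

With the settings listed above, most of the data involved is the nonuniform input (a few GB) rather than the uniform output grid (well under 1 GB).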

I am unsure whether this should be reported as an issue. Here is the code (sorry, it is a bit messy, but my attempts to simplify it bring me back to a situation similar to issue 648, which I have not managed to solve so far). To reproduce the experiment, you can run the following (changing the setting at line 14: lib = cp for GPU or lib = np for CPU).

@DiamonDinoia edits to run both in one go and synchronize the GPU:

# -------------- #
# import modules #
# -------------- #
import math
import numpy as np
import cupy as cp
import finufft
import cufinufft
import time

# --------------------------------------- #
# Configure execution ('cupy' or 'numpy') #
# --------------------------------------- #
for lib in [cp, np]:
    libnufft = cufinufft if lib == cp else finufft

    # -------------------------------------------------- #
    # Prepare 3D type 1 NUFFT (nothing interesting here) #
    # -------------------------------------------------- #

    # set up dimensions & precision
    Nb = 683
    Nx = Ny = Nz = 32
    Nproj = 3375
    dtype = 'float32'
    eps = 1e-6

    # compute frequency nodes (x, y, z)
    delta = .12
    dB = 0.01320076
    grad = lib.linspace(-.3, .3, 15, dtype=dtype)
    gx, gy, gz = lib.meshgrid(grad, grad, grad, indexing='xy')
    gx = gx.reshape((-1,))
    gy = gy.reshape((-1,))
    gz = gz.reshape((-1,))
    mu = lib.sqrt(gx**2 + gy**2 + gz**2).reshape((-1, 1))
    alf = lib.arange(1 + Nb//2, dtype=dtype)
    T = (mu * alf < .5 * Nb * dB / delta) & (alf < .5 * Nb)
    indexes = lib.argwhere(T.reshape((-1,))).reshape((-1,))
    xi = ((2. * math.pi * alf) / (Nb * dB)).reshape((1,-1))
    x = -((delta * gx).reshape((-1,1)) * xi).reshape((-1,))[indexes]
    y = -((delta * gy).reshape((-1,1)) * xi).reshape((-1,))[indexes]
    z = -((delta * gz).reshape((-1,1)) * xi).reshape((-1,))[indexes]

    # compute some Fourier coefficients
    t = -((2 * math.pi * alf / Nb).reshape((1, -1)) * lib.ones((Nproj, 1), dtype=dtype)).reshape((-1,))[indexes]
    t, idt = lib.unique(t, return_inverse=True)
    lt = lib.arange(Nb, dtype=dtype).reshape((-1, 1)) * t.reshape((1, -1))
    c = (delta**3 / float(Nb)) * lib.exp(-1j * lt)
    c = c[:, idt].ravel().reshape((Nb, len(idt)))

    # ------------------------------------------------------- #
    # Perform stacked type 1 transform (with time monitoring) #
    # ------------------------------------------------------- #
    plan = libnufft.Plan(1, (Nx, Ny, Nz), n_trans=Nb, dtype=c.dtype, eps=eps)
    plan.setpts(x, y, z)
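    if lib == cp:
        # synchronize the GPU so that pending work (e.g. from setpts) is not timed;
        # synchronizing an event that was never recorded returns immediately
        cp.cuda.Device().synchronize()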
    if lib == np:
        start_cpu = time.perf_counter()
        f = plan.execute(c) # 3D STACKED TYPE 1 TRANSFORMATION HERE
        end_cpu = time.perf_counter()
        elapsed_time_ms = (end_cpu - start_cpu) * 1e3
    else:
        start_gpu = cp.cuda.Event()
        end_gpu = cp.cuda.Event()
        start_gpu.record()
        f = plan.execute(c) # 3D STACKED TYPE 1 TRANSFORMATION HERE
        end_gpu.record()
        end_gpu.synchronize()
        elapsed_time_ms = cp.cuda.get_elapsed_time(start_gpu, end_gpu)

    print("lib = %s : elapsed time = %.3g s" % (lib.__name__, elapsed_time_ms * 1e-3))
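A single timed call can include one-time costs on either backend. The sketch below is one possible way to time a call with an untimed warm-up run and explicit synchronization; the helper `time_call` and its parameters are hypothetical and not part of finufft, cufinufft, or CuPy.

```python
import time


def time_call(fn, sync=None, warmup=1, repeats=3):
    """Return the best wall-clock time (seconds) of fn() over `repeats` runs.

    sync: optional zero-argument callable that blocks until pending device
    work has finished (a no-op is used when it is None, e.g. on CPU).
    """
    sync = sync or (lambda: None)
    for _ in range(warmup):      # untimed runs absorb one-time setup costs
        fn()
    best = float("inf")
    for _ in range(repeats):
        sync()                   # make sure nothing is still in flight
        t0 = time.perf_counter()
        fn()
        sync()                   # wait for asynchronous work to complete
        best = min(best, time.perf_counter() - t0)
    return best


# CPU-only demonstration with a stand-in workload
elapsed = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"best of 3: {elapsed:.4f} s")
```

With CuPy one could pass `sync=cp.cuda.Device().synchronize` and `fn=lambda: plan.execute(c)`. Whether a warm-up run changes the numbers reported in this thread has not been checked.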

I checked that the computed f values are the same (up to machine epsilon) for lib=np and lib=cp, so I believe that I am not in the situation reported in issue 648. Here are the measured times on my laptop:

| lib | elapsed time |
|---|---|
| `cp` | 5.9 sec |
| `np` | 2.4 sec |

I usually achieve a nice ~10x speedup when using cuFINUFFT instead of FINUFFT. For instance, the Type 2 transformation applied to f is faster with cuFINUFFT (~0.6 sec) compared to FINUFFT (~4 sec), which seems more typical to me.

I would be glad to hear your comments if you have any.

Many thanks,
Rémy

Environment

OS: Ubuntu 24.04
Python: 3.12.3
finufft version: 2.4.0
cufinufft version: 2.3.1 (installed using pip install cufinufft, could not be installed using --no-binary cufinufft option yet)
numpy version: 2.2.4
cupy-cuda12x version: 13.3.0
RAM: 64 GB
CPU: Intel Core i9-13950HX (24 cores HT)
GPU: NVIDIA RTX 4000 Ada (12 GB)

DiamonDinoia · 2025-04-01T18:24:07Z

Maintainer

Hi @remy-abergel,

does pip install cufinufft --no-binary cufinufft fail?
Also, I would try to build from master. I made some changes to cufinufft so I would like to see if the issue persists in master.

git clone https://github.com/flatironinstitute/finufft.git
cd finufft
pip install python/cufinufft

Edit:

If I do pip install finufft cufinufft I get:

lib = cupy : elapsed time = 7.01 s
lib = numpy : elapsed time = 8.12 s

If I do pip install --no-binary finufft --no-binary cufinufft finufft cufinufft which uses march=native for CPU I get:

lib = cupy : elapsed time = 7.14 s
lib = numpy : elapsed time = 3.79 s

4 replies

remy-abergel Apr 1, 2025
Author

Hi @DiamonDinoia,

Thank you for your reply.

> does pip install cufinufft --no-binary cufinufft fail?

Yes it does (and also from master), I am going to try again on a fresh conda environment with cudatoolkit installed.

> If I do pip install finufft cufinufft I get:
> lib = cupy : elapsed time = 7.01 s
> lib = numpy : elapsed time = 8.12 s
> If I do pip install --no-binary finufft --no-binary cufinufft finufft cufinufft which uses march=native for CPU I get:
> lib = cupy : elapsed time = 7.14 s
> lib = numpy : elapsed time = 3.79 s

In your second experiment, the computation is roughly twice as fast on the CPU as on the GPU (which matches what I get on my laptop with a simple pip install finufft cufinufft).

Could there be an issue with cuFINUFFT, or is this kind of behavior somehow expected? I think this is the first time I've seen cuFINUFFT being slower than FINUFFT on my system.

DiamonDinoia Apr 1, 2025
Maintainer

Installing finufft from source enables AVX/AVX2/AVX512 (if supported by the CPU), which makes finufft 2-4x faster. We cannot (yet) ship binaries with those enabled, as they are not portable instruction sets.

Nonetheless, cufinufft is not expected to be slower. I believe that is a problem.
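Regarding the AVX remark, a quick way to see which of these instruction sets a Linux machine advertises is to look at the flags line of /proc/cpuinfo. The helper below is a hypothetical sketch, not a finufft utility, and it says nothing about which instructions a given finufft build actually uses.

```python
def simd_flags(cpuinfo_text, wanted=("avx", "avx2", "avx512f")):
    """Return the subset of `wanted` found on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            present = set(line.split(":", 1)[1].split())
            return [f for f in wanted if f in present]
    return []


try:
    with open("/proc/cpuinfo") as fh:      # Linux only
        print(simd_flags(fh.read()))
except OSError:
    print("no /proc/cpuinfo on this platform")
```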

remy-abergel Apr 1, 2025
Author

> Nonetheless, it is not expected cufinufft to be slower. I believe that is a problem.

Same thought here, I was expecting something more like 0.5 sec on this example with cufinufft 😉

mreineck Apr 1, 2025
Maintainer

Could it be an issue with lock contention/memory write conflicts, since the uniform grid is really small?
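To put a rough number on this suggestion, the sketch below counts how many spreading writes land on each fine-grid cell on average. It assumes an upsampling factor of 2 and a spreading kernel about 7 grid points wide per dimension for eps = 1e-6; both values are assumptions here, not read from the plan, and the real access pattern depends on the spreading method and on how the points cluster.

```python
def writes_per_cell(M, n_modes, upsampfac=2.0, kernel_width=7):
    """Average number of spreading writes per fine-grid cell, per transform."""
    fine_cells = 1
    for n in n_modes:
        fine_cells *= int(upsampfac * n)
    writes = M * kernel_width ** len(n_modes)   # each point touches w^d cells
    return writes / fine_cells, fine_cells


avg, cells = writes_per_cell(M=595350, n_modes=(32, 32, 32))
print(f"fine grid cells: {cells}")
print(f"average writes per cell: {avg:.0f}")
```

Under these assumptions each fine-grid cell receives several hundred updates per transform, which is compatible with the idea that write conflicts could matter on such a small grid, though it does not prove it.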

remy-abergel · 2025-04-02T08:49:32Z

Author

Just to give you a bit of context, I am developing forward and backward operators for 4D spectral-spatial image reconstruction in Electron Paramagnetic Resonance (to be included in the next release of PyEPRI). The images to be reconstructed from the EPR measurements (a.k.a. projections) are four-dimensional (they can be viewed as 3D images in which each "voxel" contains a 1D EPR spectrum).

Forward operator (spectral-spatial projection operator)

Given a 4D image $u$, I need to compute a bunch of projections $A(u) := p = (p_1, p_2, ..., p_N)$ such that each projection $p_n$ is a 1D signal with size $N_B$ and whose discrete Fourier transform coefficients $\widehat{p_n}(\alpha)$ are given by

$$\widehat{p_n}(\alpha) = \sum\limits_{\substack{k \in \Omega \\ 0 \leq \ell < N_B}} u(k,\ell) \cdot e^{-2 i \pi \frac{\alpha}{N_B} \left( \ell - \langle k , \gamma_n \rangle \right)} \quad \text{for}\quad |\alpha| < \frac{M_n}{2}$$

where

$\gamma_n \in \mathbb{R}^3$ is a field gradient vector associated to the $n$-th EPR projection $p_n$ (the same field gradient as in MRI),
$k = (k_1, k_2, k_3) \in \Omega$ denotes a spatial (voxel) location and $\ell \in \{0, 1, \dots, N_B-1\}$ is a spectral index.

Since 4D NUFFT is not available, I compute $\widehat{p_n}$ using 3D stacked type 2 transforms of the stack of 3D images $u_\ell = k\mapsto u(k,\ell)$ and compute the weighted sum along the $\ell$ index afterwards:

$$\widehat{p_n}(\alpha) = \sum_{\ell = 0}^{N_B - 1} \mathrm{NUFFT}(u_\ell)\left(-2 \pi \frac{\alpha}{N_B} \gamma_n\right) \cdot e^{-2 i \pi \frac{\alpha \ell}{N_B}}$$

where $\mathrm{NUFFT}(u_\ell)(\xi)$ denotes here the 3D Type 2 transformation of $u_\ell$ at the frequency node $\xi$.
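The decomposition can be checked numerically on a toy problem. The sketch below evaluates the 4D sum directly and compares it with the stacked 3D evaluation followed by the weighted sum over $\ell$. A plain direct sum stands in for the NUFFT, the type 2 sign convention $\sum_k u_\ell(k)\, e^{-i \langle k, \xi \rangle}$ is assumed, and all sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
NB, N, Nproj = 8, 4, 3                        # toy sizes, far smaller than in the post
u = rng.standard_normal((NB, N, N, N))        # u[l, k1, k2, k3]
gamma = rng.uniform(-0.3, 0.3, (Nproj, 3))    # field gradients gamma_n
alpha = np.arange(NB // 2)                    # frequencies 0 <= alpha < NB/2
ell = np.arange(NB)

# voxel positions k in Omega (centred integer grid), one row per voxel
ax = np.arange(-(N // 2), N - N // 2)
k = np.stack(np.meshgrid(ax, ax, ax, indexing="ij"), axis=-1).reshape(-1, 3)

# direct evaluation of the 4D sum
p_direct = np.zeros((Nproj, alpha.size), dtype=complex)
for n in range(Nproj):
    for a in alpha:
        phase = ell[:, None] - (k @ gamma[n])[None, :]       # l - <k, gamma_n>
        p_direct[n, a] = np.sum(u.reshape(NB, -1) * np.exp(-2j * np.pi * a / NB * phase))

# stacked 3D type 2 (direct sum standing in for the NUFFT), then weighted sum over l
xi = (-2 * np.pi * alpha[None, :, None] / NB) * gamma[:, None, :]   # nodes, (Nproj, A, 3)
E = np.exp(-1j * (xi.reshape(-1, 3) @ k.T))                         # (Nproj*A, N^3)
stack = (E @ u.reshape(NB, -1).T).reshape(Nproj, alpha.size, NB)    # NUFFT(u_l)(xi)
weights = np.exp(-2j * np.pi * alpha[:, None] * ell[None, :] / NB)  # (A, NB)
p_stacked = np.sum(stack * weights[None], axis=-1)

print(np.max(np.abs(p_direct - p_stacked)))
```

In a real implementation the matrix product with `E` would be replaced by one stacked type 2 call with `n_trans = NB`.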

Adjoint operator (spectral-spatial backprojection operator)

I use a similar strategy for evaluating $A^*$, the adjoint of $A$, involving 3D stacked Type 1 transforms of reweighted Fourier coefficients.
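One common way to validate such a forward/adjoint pair is a dot-product test, $\langle A u, q \rangle = \langle u, A^* q \rangle$. The sketch below does this on a toy problem, with dense matrices standing in for the stacked type 2 and type 1 NUFFTs. The reweighting shown is one reading of the strategy described here, not the PyEPRI implementation, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
NB, N, Nproj = 6, 4, 5                        # toy sizes
alpha = np.arange(NB // 2)
ell = np.arange(NB)
gamma = rng.uniform(-0.3, 0.3, (Nproj, 3))
ax = np.arange(-(N // 2), N - N // 2)
k = np.stack(np.meshgrid(ax, ax, ax, indexing="ij"), axis=-1).reshape(-1, 3)

# frequency nodes x_j = -2*pi*alpha/NB * gamma_n, one row per (n, alpha) pair
x = ((-2 * np.pi * alpha[None, :, None] / NB) * gamma[:, None, :]).reshape(-1, 3)
E = np.exp(-1j * (x @ k.T))                                   # type 2 matrix, (J, N^3)
w = np.exp(-2j * np.pi * alpha[:, None] * ell[None, :] / NB)  # (A, NB)
w = np.tile(w, (Nproj, 1))                                    # (J, NB), row j <-> (n, alpha)


def forward(u):
    """A(u): stacked type 2 of each u_l, then weighted sum over l. u: (NB, N^3)."""
    return np.sum((E @ u.T) * w, axis=1)


def adjoint(q):
    """A*(q): reweight coefficients for each l, then stacked type 1. q: (J,)."""
    c = np.conj(w).T * q[None, :]          # (NB, J) reweighted coefficients
    return c @ np.conj(E)                  # sum_j c_j exp(+i <k, x_j>)


u = rng.standard_normal((NB, N**3)) + 1j * rng.standard_normal((NB, N**3))
q = rng.standard_normal(x.shape[0]) + 1j * rng.standard_normal(x.shape[0])
lhs = np.vdot(q, forward(u))               # <A u, q>
rhs = np.vdot(adjoint(q), u)               # <u, A* q>
print(abs(lhs - rhs))
```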

Toeplitz kernel

On top of that, I need to compute a Toeplitz convolution kernel enabling the evaluation of $A^* (A(u))$ as a circular convolution (the kernel is 4D, with a domain twice as large as that of $u$ along each direction). This makes it possible to address the image reconstruction using variational models and efficient optimization algorithms.
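The Toeplitz idea can be illustrated in one dimension. In the sketch below, $A^* A$ for a 1D nonuniform Fourier operator is applied through a circular convolution on a domain of size $2N$ and compared with the dense computation. This is a toy analogue of the 4D kernel described here, with hypothetical sizes and names, and the kernel is built by a direct sum rather than by a type 1 NUFFT.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 50                                # toy sizes
x = rng.uniform(-np.pi, np.pi, M)           # nonuniform frequency nodes
u = rng.standard_normal(N)
k = np.arange(N)

# dense reference: A[j, k] = exp(-i k x_j), so (A* A)[k, k'] = phi(k - k')
A = np.exp(-1j * np.outer(x, k))
ref = A.conj().T @ (A @ u)

# Toeplitz kernel phi(d) = sum_j exp(i d x_j) for d = -(N-1), ..., N-1
d = np.arange(-(N - 1), N)
phi = np.exp(1j * np.outer(d, x)).sum(axis=1)

# embed the kernel and the zero-padded signal in a periodic domain of size 2N
h = np.zeros(2 * N, dtype=complex)
h[d % (2 * N)] = phi                        # negative lags wrap to the end
u_pad = np.zeros(2 * N, dtype=complex)
u_pad[:N] = u

# circular convolution via FFT, then crop back to the original domain
out = np.fft.ifft(np.fft.fft(h) * np.fft.fft(u_pad))[:N]
print(np.max(np.abs(out - ref)))
```

Once the kernel spectrum `np.fft.fft(h)` is precomputed, each application of $A^* A$ costs only two FFTs of the padded size.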

Typical sizes

3D spatial domain : $|\Omega| \approx 32 \times 32 \times 32$ (roughly 1 mm / pixel along each axis)
spectral domain (4th axis): $N_B \approx 500$ or $1000$
number of projections: $N \approx 4000$

EPR imaging does not allow high spatial resolution (contrary to MRI) so increasing the spatial domain is not a priority. Users are not equipped with HPC workstations so the memory budget is only several GB.

By the way, I use Hermitian symmetry to reduce memory usage: $p_n$ and $u$ are real-valued, so the number of frequency nodes to be computed in the forward operator can be halved. Similarly, for the adjoint, the sum over half of the coefficients can easily be reweighted to obtain the sum over all coefficients. However, I am forced to cast $u$ to a complex datatype (this is not a big deal, but it could be discussed here).
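The reweighting trick can be seen on a 1D DFT. For a real signal the coefficients at $-\alpha$ are the complex conjugates of those at $\alpha$, so a sum over all frequencies equals the real part of a sum over the non-negative ones with weight 1 at $\alpha = 0$ and weight 2 elsewhere. The sketch below uses an odd length to avoid a separate Nyquist term; it is a toy illustration with hypothetical names, not the operator code.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 9                                        # odd length: no Nyquist coefficient
p = rng.standard_normal(N)                   # real-valued signal
P = np.fft.fft(p)                            # all N Fourier coefficients

half = np.arange(N // 2 + 1)                 # keep frequencies 0, 1, ..., (N-1)/2
w = np.where(half == 0, 1.0, 2.0)            # weight 2 accounts for the dropped -alpha
n = np.arange(N)

# inverse DFT using only the retained half of the coefficients
basis = np.exp(2j * np.pi * np.outer(half, n) / N)
rec = np.real((w * P[half]) @ basis) / N

print(np.max(np.abs(rec - p)))
```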

8 replies

remy-abergel Apr 2, 2025
Author

Sorry I didn't see the additional code that you put at the end of your script. Thank you for the tip! I will keep it in mind for my unit tests.

remy-abergel Apr 2, 2025
Author

Here is what I get by running your script:

lib = numpy : elapsed time = 2.39 s
L2 error:  6.446270648789783e-06
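The script that printed this value is not reproduced in this thread. A relative L2 error of this kind is commonly computed as the norm of the difference divided by the norm of a reference; the helper below is a hypothetical sketch of that definition and may differ from what the script actually did.

```python
import numpy as np


def rel_l2_error(approx, ref):
    """Relative L2 error ||approx - ref|| / ||ref|| over all entries."""
    approx = np.asarray(approx).ravel()
    ref = np.asarray(ref).ravel()
    return np.linalg.norm(approx - ref) / np.linalg.norm(ref)


# stand-in data: a double precision reference and its single precision copy
ref = np.exp(1j * np.linspace(0.0, 1.0, 16))
approx = ref.astype(np.complex64)
err = rel_l2_error(approx, ref)
print(f"L2 error: {err:.3e}")
```

For a transform requested with eps = 1e-6, an error of a few times 1e-6 against a more accurate reference is in line with the requested tolerance.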

mreineck Apr 2, 2025
Maintainer

Thank you so much for double-checking that!
It seems that I have something fishy going on with my own finufft installation ... I'll report back as soon as I know more!
Sorry for the noise!

remy-abergel Apr 2, 2025
Author

Thank you so much 😊

mreineck Apr 3, 2025
Maintainer

I can confirm that the issue with the large L2 errors was a problem on yesterday's master branch, and it's not present in any official release. It's now also fixed on master.
Sorry again for the confusion!

remy-abergel · 2025-04-18T13:15:02Z

Author

Here is some fresh feedback on this issue: installation without binaries still fails on my own machine (with cudatoolkit installed with conda), but I could make it work on another machine equipped with two NVIDIA A40 GPUs (also managing the CUDA installation with conda). I reran the reported code with the two different kinds of installation.

cufinufft v. 2.3.1 installed with binaries

(installation: pip install cufinufft)

lib = cupy : elapsed time = 4.98 s
lib = numpy : elapsed time = 2.64 s

cufinufft v. 2.3.1 installed without binaries

(installation: pip install --no-binary cufinufft cufinufft)

lib = cupy : elapsed time = 4.92 s
lib = numpy : elapsed time = 2.66 s

Installation steps with conda

In case this can be helpful to someone else, here are my installation notes.

####################################
# create a fresh conda environment #
####################################
conda update -y -n base -c defaults conda
conda create -y -n finufft-conda-no-binaries pip
conda activate finufft-conda-no-binaries

#######################
# install dependencies #
#######################
pip install setuptools # needed for cudatoolkit-dev install at the next step
conda install -c conda-forge cudatoolkit-dev # cudatoolkit is not enough (nvcc is missing)
conda install -c conda-forge cxx-compiler gcc=11 # cufinufft install fails with gcc > 11
conda install cupy # install of cupy-cuda12x with pip causes issues on this machine
pip install finufft # fails with option --no-binary finufft
pip install packaging # to avoid error on finufft import (version 2.3.1)
pip install --no-binary cufinufft cufinufft # works

########################################
# to install from master (still fails) #
########################################
conda install -c conda-forge cuda-runtime # to get the crt/host_config.h file

On master

git clone https://github.com/flatironinstitute/finufft.git
cd finufft
pip install python/cufinufft # fails

This install attempt fails with fatal error: crt/host_config.h: No such file or directory, although the file is present in the conda virtual environment.

2 replies

DiamonDinoia Apr 19, 2025
Maintainer

Thank you for the report!

Note that we do not officially support conda. I have some ideas on how to improve the GPU performance for this case. However, it will take some time before I get to actually implement the improvements.

remy-abergel Apr 28, 2025
Author

Hi @DiamonDinoia,

Thank you for your feedback, and best of luck with the upcoming improvements.

I finally managed to complete the installation without binaries in a Conda environment on my Dell Precision 7680 (with Ubuntu 24.04 and NVIDIA RTX 4000 GPU). I'm copying the steps here in case they might be useful to someone.

##################
# Clone the code #
##################
git clone https://github.com/flatironinstitute/finufft.git
cd finufft

####################################
# create a fresh conda environment #
####################################
conda update -y -n base -c defaults conda
conda create -y -n finufft-conda-no-binaries pip
conda activate finufft-conda-no-binaries

#######################
# Install (cu)FINUFFT #
#######################
conda install nvidia/label/cuda-12.4.0::cuda-toolkit
conda install conda-forge::fftw # needed for next step (finufft installation with pip), otherwise it fails
pip install --no-binary finufft python/finufft # success!
pip install --no-binary cufinufft python/cufinufft # success!

These installation instructions also worked on the other machine that I mentioned in my previous post (with Ubuntu 22.04 and two NVIDIA A40 GPUs). I think that using the NVIDIA repository (instead of conda-forge) for installing cudatoolkit is what solved the installation issues I was encountering before.
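After an installation like this, it can help to confirm which package versions actually ended up in the active environment. The sketch below uses only the standard library; the helper name and the default package list are hypothetical, and a conda-installed CuPy may be registered under the distribution name `cupy` rather than `cupy-cuda12x`.

```python
from importlib import metadata


def installed_versions(packages=("finufft", "cufinufft", "numpy", "cupy", "cupy-cuda12x")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, ver in installed_versions().items():
    print(f"{name:>14}: {ver or 'not installed'}")
```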
