This repository contains the implementation of the Distributed Scaled Block Vecchia (SBV) method for large-scale Gaussian Process approximation. The code provides an efficient, GPU-accelerated, distributed implementation that can handle massive datasets.
SBV-GP is a scalable method for approximating Gaussian Processes that:
- Uses a block-based approach to partition data
- Leverages GPU acceleration for covariance computations
- Distributes computation across multiple nodes with MPI
The implementation is optimized for modern high-performance computing environments with multi-GPU support.
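To make the underlying idea concrete, the following is a minimal NumPy/SciPy sketch of a block Vecchia approximation: the joint likelihood is replaced by a product of per-block Gaussian densities, each conditioned only on a small set of nearest previously ordered points. This is an illustration only, not this repository's distributed GPU implementation; the Matérn 1/2 kernel, the ordering, and the toy block/neighbor sizes are assumptions made for brevity.

```python
# Conceptual sketch of a block Vecchia approximation (NOT the distributed
# GPU implementation in this repository). A simple Matern 1/2 kernel and
# toy sizes are used purely for illustration.
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.random((500, 2))                       # toy 2-D inputs
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(500)
jitter = 1e-4                                  # illustrative nugget/jitter

def kernel(A, B, variance=1.0):
    """Matern 1/2 (exponential) covariance between two point sets."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return variance * np.exp(-d)

num_blocks, m = 25, 30                         # number of blocks, neighbors per block
order = np.argsort(X.sum(axis=1))              # a simple ordering of the points
blocks = np.array_split(order, num_blocks)

loglik = 0.0
for b, idx in enumerate(blocks):
    prev = np.concatenate(blocks[:b]) if b else np.array([], dtype=int)
    if prev.size:
        # Condition each block only on its m nearest previously ordered points.
        _, nn = cKDTree(X[prev]).query(X[idx].mean(axis=0), k=min(m, prev.size))
        cond = prev[np.atleast_1d(nn)]
        K_cc = kernel(X[cond], X[cond]) + jitter * np.eye(len(cond))
        K_bc = kernel(X[idx], X[cond])
        W = np.linalg.solve(K_cc, K_bc.T)
        mean = W.T @ y[cond]
        cov = kernel(X[idx], X[idx]) + jitter * np.eye(len(idx)) - K_bc @ W
    else:
        mean = np.zeros(len(idx))
        cov = kernel(X[idx], X[idx]) + jitter * np.eye(len(idx))
    cov = 0.5 * (cov + cov.T)                  # symmetrize against round-off
    loglik += multivariate_normal.logpdf(y[idx], mean=mean, cov=cov)

print("block-Vecchia log-likelihood:", loglik)
```

The SBV code applies the same decomposition at scale, with blocks distributed across MPI ranks and the dense per-block linear algebra executed on GPUs.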
- Scalable: Handles millions of data points with efficient block-based partitioning
- GPU-Accelerated: Optimized for NVIDIA GPUs with MAGMA linear algebra acceleration
- Distributed: MPI-based implementation for multi-node execution
- Parameter Estimation: Implements efficient likelihood-based parameter optimization
- Prediction: Fast prediction at new locations by leveraging the approximation structure
The following libraries and tools are required:
- GCC 13.3.0 - C++ compiler
- CUDA 11.8/12.9 - CUDA toolkit and runtime
- MAGMA 2.7.2 - GPU-accelerated linear algebra library
- NLopt 2.7.1 - Nonlinear optimization library
- OpenMPI 5.0.5 - MPI implementation
- Clone the repository:
  cd ParallelScaledBlockVecchiaGP
- Build the code:
  # For SLURM systems
  make -j 8
The build system will automatically detect available GPUs and configure the compilation accordingly.
See experiments/
To run a simple example using a single GPU:
# For systems without SLURM (direct MPI execution)
mpirun -np 4 ./bin/dbv \
--num_total_points 8000000 \
--num_total_blocks 80000 \
-m 400 \
--dim 10 \
--mode estimation \
--maxeval 100 \
--theta_init 1.5,0.0 \
--distance_scale 1,1,1,1,1,1,1,1,1,1 \
--kernel_type Matern72 \
--nn_multiplier 10 \
--seed 8

SBV-GP provides many configuration options:
- --num_total_points <N>: Total number of points in the dataset (default: 20000)
- --num_total_blocks <B>: Total number of blocks for partitioning (default: 1000)
- --num_total_points_test <N>: Total number of points for testing (default: 2000)
- --num_total_blocks_test <B>: Total number of blocks for testing (default: 100)
- --dim <D>: Dimension of the input space (default: 8)
- --train_metadata_path <PATH>: Path to the training metadata file (default: "")
- --test_metadata_path <PATH>: Path to the testing metadata file (default: "")
- -m <M>: Number of nearest neighbors to consider (default: 200)
- --m_test <M>: Number of nearest neighbors for testing (default: 120)
- --nn_multiplier <M>: Nearest-neighbor multiplier ($\alpha$) (default: 400)
- --distance_scale <S>: Distance scales for each dimension (default: all 1.0; see the sketch after this list)
- --distance_scale_init <S>: Initial distance scales for optimization (default: same as distance_scale)
- --kernel_type <TYPE>: Covariance kernel type (default: "Matern72")
  - Options: "Matern12", "Matern32", "Matern52", "Matern72"
- --theta_init <θ>: Initial parameter values (comma-separated, defaults depend on kernel)
  - For the PowerExponential kernel: variance, smoothness, nugget (default: 1.0, 0.5, 0.0)
  - For Matérn kernels: variance, nugget (default: 1.0, 0.0)
- --mode <MODE>: Operation mode (default: "estimation")
  - Options: "estimation", "prediction", "full"
- --maxeval <N>: Maximum number of function evaluations (default: 500)
- --xtol_rel <T>: Relative parameter tolerance for optimization (default: 1e-5)
- --ftol_rel <T>: Relative function-value tolerance for optimization (default: 1e-5)
- --lower_bounds <B>: Lower bounds for optimization (default: theta_init * 0.001)
- --upper_bounds <B>: Upper bounds for optimization (default: theta_init * 10)
- --omp_num_threads <T>: Number of OpenMP threads per MPI process (default: 20)
- --seed <S>: Seed for the random number generator (default: 0)
- --print: Print additional information (default: true)
- --log_append <S>: String appended to log file names (default: "")
- --partition <TYPE>: Partition type (default: "linear")
  - Options: "linear", "none"
- --clustering <TYPE>: Clustering method for large datasets (default: "random")
  - Options: "random", "kmeans++"
- --kmeans_max_iter <N>: Maximum iterations for k-means++ (default: 10)
- --num_simulations <N>: Number of simulations for evaluation (default: 1000)
Run ./bin/dbv --help to see all available options.
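To see what --distance_scale does in practice, the short sketch below compares nearest-neighbor selection with and without per-dimension scaling. It assumes the scales act multiplicatively on the coordinates before the Euclidean neighbor search; the scale values and sizes are arbitrary examples, not defaults of the code.

```python
# Effect of per-dimension distance scales on nearest-neighbor selection
# (a sketch; assumes the scales multiply the coordinates before the
# Euclidean neighbor search, with arbitrary example values).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
X = rng.random((10000, 8))                                  # 8-D inputs
scales = np.array([0.05, 0.05, 0.1, 1.0, 1.0, 1.0, 1.0, 1.0])

# Dimensions with small scales contribute little to the distance, so
# neighbors are picked mostly along the dimensions that matter.
_, plain_nn = cKDTree(X).query(X[0], k=10)
_, scaled_nn = cKDTree(X * scales).query(X[0] * scales, k=10)

print("neighbors without scaling:", np.sort(plain_nn))
print("neighbors with scaling:   ", np.sort(scaled_nn))
```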
Parameter estimation finds optimal hyperparameters for the Gaussian process model:
mpirun -n 4 ./bin/dbv \
--num_total_points 1000000 \
--num_total_blocks 10000 \
--distance_scale 0.05,0.05,0.1,1.0,1.0,1.0,1.0,1.0 \
--theta_init 1.0,0.001 \
-m 200 \
--dim 8 \
--mode estimation \
--maxeval 500 \
--xtol_rel 1e-6 \
--ftol_rel 1e-6 \
--kernel_type Matern72 \
--seed 42 \
--nn_multiplier 500 \
--omp_num_threads 20 \
--log_append "estimation_run1"
This performs maximum likelihood estimation for the Matérn 7/2 kernel parameters. The results are saved to log files with the suffix "estimation_run1".
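Conceptually, estimation maximizes the Gaussian log-likelihood over the kernel parameters (here variance and nugget). The sketch below illustrates this on a tiny exact GP with a Matérn 7/2 kernel and SciPy's optimizer; the synthetic data, the assumption that the length scale is absorbed into the distance scales, and the optimizer choice are illustrative only, whereas the actual code optimizes the distributed block-Vecchia likelihood with NLopt.

```python
# Conceptual illustration of --mode estimation: maximize the Gaussian
# log-likelihood over (variance, nugget) for a Matern 7/2 kernel. Uses a
# tiny *exact* GP, not the distributed block-Vecchia code, and assumes
# the length scale is absorbed into the per-dimension distance scales.
import numpy as np
from scipy.optimize import minimize

def matern72(d, variance):
    u = np.sqrt(7.0) * d
    return variance * (1.0 + u + 0.4 * u**2 + u**3 / 15.0) * np.exp(-u)

def neg_log_lik(params, d, y):
    variance, nugget = params
    K = matern72(d, variance) + nugget * np.eye(len(y))
    L = np.linalg.cholesky(K + 1e-10 * np.eye(len(y)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

rng = np.random.default_rng(42)
scales = np.array([0.05, 0.05, 0.1, 1.0, 1.0, 1.0, 1.0, 1.0])
X = rng.random((300, 8)) * scales               # inputs in the scaled space
y = np.sin(4 * X[:, 3]) + 0.05 * rng.standard_normal(300)
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

theta0 = np.array([1.0, 0.001])                 # variance, nugget (cf. --theta_init)
res = minimize(neg_log_lik, theta0, args=(d, y), method="L-BFGS-B",
               bounds=[(1e-3, 10.0), (1e-6, 1.0)])
print("estimated variance, nugget:", res.x)
```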
After parameter estimation, you can make predictions at new test locations:
mpirun -n 4 ./bin/dbv \
--num_total_points 1000000 \
--num_total_blocks 10000 \
--num_total_points_test 10000 \
--num_total_blocks_test 1000 \
--distance_scale 0.05,0.05,0.1,1.0,1.0,1.0,1.0,1.0 \
--theta_init 1.0,0.001 \
-m 200 \
--m_test 60 \
--dim 8 \
--mode prediction \
--kernel_type Matern72 \
--nn_multiplier 500 \
--omp_num_threads 20 \
--log_append "prediction_run1"
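Conceptually, prediction conditions each test point on its m_test nearest training points in the (scaled) input space. The sketch below illustrates this with plain NumPy nearest-neighbor kriging; the kernel parameters, dataset sizes, and the exact conditioning scheme are illustrative assumptions rather than the library's block-structured implementation.

```python
# Sketch of nearest-neighbor GP prediction (--mode prediction): condition
# each test point on its m_test nearest training points. Illustrative
# values only; not the library's block-structured implementation.
import numpy as np
from scipy.spatial import cKDTree

def matern72(d, variance=1.0):
    u = np.sqrt(7.0) * d
    return variance * (1.0 + u + 0.4 * u**2 + u**3 / 15.0) * np.exp(-u)

rng = np.random.default_rng(0)
X_train = rng.random((2000, 8))
y_train = np.sin(4 * X_train[:, 0]) + 0.05 * rng.standard_normal(2000)
X_test = rng.random((5, 8))
nugget, m_test = 1e-3, 60

tree = cKDTree(X_train)
for x in X_test:
    _, nn = tree.query(x, k=m_test)             # indices of nearest training points
    Xn, yn = X_train[nn], y_train[nn]
    K = matern72(np.linalg.norm(Xn[:, None, :] - Xn[None, :, :], axis=-1))
    K += nugget * np.eye(m_test)
    kx = matern72(np.linalg.norm(Xn - x, axis=1))
    w = np.linalg.solve(K, kx)
    mean, var = w @ yn, matern72(0.0) - w @ kx  # predictive mean and variance
    print(f"mean={mean:+.3f}  var={var:.4f}")
```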
You can perform both estimation and prediction in a single run:
mpirun -n 4 ./bin/dbv \
--num_total_points 1000000 \
--num_total_blocks 10000 \
--num_total_points_test 10000 \
--num_total_blocks_test 1000 \
--distance_scale 0.05,0.05,0.1,1.0,1.0,1.0,1.0,1.0 \
--theta_init 1.0,0.001 \
-m 200 \
--m_test 120 \
--dim 8 \
--mode full \
--maxeval 10 \
--kernel_type Matern72 \
--log_append "full_run1"
For real datasets, data should be formatted with one point per line:
x1,x2,...,xD,y
where x1, x2, ..., xD are the D-dimensional input coordinates and y is the observation value.
Use the --train_metadata_path and --test_metadata_path options to specify the paths to your data files.
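As a concrete illustration of this format, the short script below writes a synthetic file with one point per line; the file name, sizes, and test function are arbitrary examples, and how the file is referenced from the metadata files depends on the repository's data-loading setup (see experiments/).

```python
# Write a small synthetic dataset in the x1,...,xD,y format described
# above (one point per line). File name, sizes, and the test function
# are arbitrary examples.
import numpy as np

rng = np.random.default_rng(0)
D, N = 8, 10_000
X = rng.random((N, D))                          # D-dimensional inputs
y = np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(N)

np.savetxt("train_data.csv", np.column_stack([X, y]),
           delimiter=",", fmt="%.6f")
```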
For optimal performance:
- Choose the number of blocks based on your GPU memory
- Set the number of nearest neighbors (-m) according to desired accuracy and computational budget
- Adjust --omp_num_threads based on your CPU resources
- Use --nn_multiplier to control the coarseness of the nearest-neighbor search
- For higher-dimensional problems, tune the --distance_scale parameters to give appropriate weights to different dimensions
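The first tip above (blocks vs. GPU memory) can be made concrete with a rough back-of-envelope that treats the per-block working set as a dense double-precision covariance over the block's points plus its m neighbors. This is an order-of-magnitude guide under those assumptions, not a measurement of the code's actual memory footprint.

```python
# Rough back-of-envelope for choosing num_total_blocks: average block
# size and approximate memory for one block's dense covariance (float64).
# An order-of-magnitude guide only, not a measurement of this code.
num_total_points = 8_000_000
num_total_blocks = 80_000
m = 400                                         # nearest neighbors per block

block_size = num_total_points / num_total_blocks
n = block_size + m                              # points involved in one block
bytes_per_block = n * n * 8                     # one dense covariance in float64

print(f"average block size: {block_size:.0f} points")
print(f"~{bytes_per_block / 1e6:.1f} MB per block covariance")
print(f"batching 1000 blocks would need ~{1000 * bytes_per_block / 1e9:.1f} GB")
```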
If you use this code in your research, please cite:
@article{...},
[License information]