Roughly speaking, cuHAL is a CUDA-accelerated implementation of the nonparametric regression method known as the Highly Adaptive Lasso (HAL).
But what exactly is HAL? And why do we need yet another nonparametric method (or machine learning algorithm, if you prefer) when there’s already an abundance of options like Random Forest, XGBoost, LightGBM, and more?
HAL is primarily used in Targeted Learning—a fascinating branch of causal machine learning—to estimate nuisance parameters (a regression task) of causal estimands, such as the average treatment effect, under realistic assumptions. To achieve this, HAL offers the following advantages:
- Assumption-Lean: HAL assumes only that the underlying regression function is right-continuous with left-hand limits (càdlàg) and has a finite variation norm. It is hard to imagine a non-pathological scenario where these conditions would be violated.
- Theoretically Guaranteed Fast Convergence Rate: HAL guarantees convergence to the true regression function at a rate of at least $o_p(n^{-\frac{1}{4}})$, which is dimension-free—a significant achievement in nonparametric regression. While this convergence result is asymptotic, HAL has demonstrated strong finite-sample performance and has been shown to be competitive with state-of-the-art machine learning algorithms across various datasets.
Even if you're only concerned with prediction tasks rather than causal inference, HAL provides distinct advantages:
- Interpretability: The HAL estimator is simply a sparse linear combination of products of indicator functions, making it more interpretable compared to ensemble methods like XGBoost. Additionally, HAL can be converted into an equivalent decision tree (though this feature is not implemented in this project), further enhancing its interpretability.
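To make this concrete, a zero-order HAL fit takes the form of a lasso over indicator basis functions, where the knots $u_{j,k}$ and interaction sets $S_j$ are derived from the observed data:

$$
\hat{f}(x) = \hat{\beta}_0 + \sum_{j} \hat{\beta}_j \prod_{k \in S_j} \mathbb{1}\{x_k \geq u_{j,k}\},
$$

and the lasso ($L_1$) penalty forces most coefficients $\hat{\beta}_j$ to zero, which is what yields the sparse, readable form described above.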
While HAL enjoys many great properties, these advantages do not come without a cost. Without some form of approximation, training HAL can quickly become computationally intractable as the data size increases. This is because HAL generates a design matrix with $n$ rows and up to $n(2^d - 1)$ columns, where $n$ is the number of observations and $d$ the number of features; even $n = 1000$ and $d = 10$ already yield over a million candidate columns.
To address this issue, the R package [hal9001](https://github.com/tlverse/hal9001) allows users to customize the number of knot points and the maximum order of interactions between variables. By doing so, users can reduce the number of basis functions to better suit their computational and practical needs.
Although hal9001 is already an excellent package, it would be beneficial to harness the power of GPUs to further boost performance, especially when working with datasets at the scale of those commonly found on Kaggle.
While cuHAL aims to mimic the behavior of hal9001 (e.g., strategies for reducing basis functions), it introduces its own design choices to ensure high GPU utilization.
For instance, during the initialization of the DesignMatrix object, cuHAL allocates memory on the GPU to store the dataframe and all information needed to construct the
design matrix. This approach minimizes the need to frequently transfer data between the host and the device.
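A minimal sketch of this pattern (illustrative only; the names below are hypothetical, not cuHAL's actual types): the data is copied to the device once at construction, and the device pointer is reused for every subsequent kernel launch.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Illustrative sketch: upload the data once, keep it resident on the
// GPU, and reuse the device pointer across all training iterations,
// avoiding repeated host-to-device transfers.
struct DeviceData {
    float* d_X = nullptr;  // n x d data matrix on the device, row-major
    int n, d;

    DeviceData(const std::vector<float>& h_X, int n_, int d_) : n(n_), d(d_) {
        cudaMalloc(&d_X, sizeof(float) * n * d);
        cudaMemcpy(d_X, h_X.data(), sizeof(float) * n * d,
                   cudaMemcpyHostToDevice);  // single host-to-device copy
    }
    ~DeviceData() { cudaFree(d_X); }
};
```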
Additionally, instead of precomputing the design matrix, cuHAL employs a custom CUDA kernel that fuses the construction of the design matrix with matrix-vector multiplication, ensuring that the design matrix is never explicitly constructed. This design is motivated by two key considerations:
- Minimized Memory Usage: Given the sheer size of the design matrix, explicitly generating it for large datasets is infeasible.
- Reduced Memory Access Overhead: Even if the design matrix were precomputed, performing matrix operations on it would incur significant memory access overhead, which could dominate computational time.
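The following is a minimal sketch of such a fused kernel under simplifying assumptions (one thread per row, zero-order bases, row-major data); it is illustrative only, and every name in it is hypothetical rather than taken from cuHAL's source:

```cuda
// Fuses on-the-fly evaluation of zero-order HAL basis functions with a
// matrix-vector product, so no column of the design matrix is stored.
// Basis j is a product of indicators: phi_j(x) = prod_{k in S_j} 1{x[k] >= knot}.
__global__ void fused_design_matvec(
    const float* __restrict__ X,     // n x d data, row-major
    const int*   __restrict__ feat,  // m x max_order feature indices (-1 = unused)
    const float* __restrict__ knot,  // m x max_order knot values
    const float* __restrict__ beta,  // m coefficients
    float* __restrict__ out,         // n outputs
    int n, int d, int m, int max_order)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per data row
    if (i >= n) return;

    float acc = 0.0f;
    for (int j = 0; j < m; ++j) {    // loop over basis functions
        float b = beta[j];
        if (b == 0.0f) continue;     // lasso keeps most coefficients at zero
        bool active = true;
        for (int k = 0; k < max_order; ++k) {
            int f = feat[j * max_order + k];
            if (f < 0) break;        // basis uses fewer than max_order features
            if (X[i * d + f] < knot[j * max_order + k]) { active = false; break; }
        }
        if (active) acc += b;        // phi_j(x_i) is 0 or 1, so just add beta_j
    }
    out[i] = acc;
}
```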
For optimization, cuHAL introduces SRTrainer, which implements the strong rule
used in glmnet
but incorporates an Adam-like update rule. The design of this optimizer is, admittedly, the product of trial and error, and it still converges slowly on some large-scale datasets. There is potential for exploring better alternatives in future iterations.
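For context, the sequential strong rule (Tibshirani et al., 2012) discards predictors whose gradient is too small for them to plausibly become active at the next penalty level, then re-checks the KKT conditions after fitting. Below is a generic host-side sketch of the screening step (not SRTrainer's actual code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sequential strong rule: when the penalty decreases from lambda_prev
// to lambda, predictor j is kept in the active set only if
// |grad_j| >= 2*lambda - lambda_prev. Discarded predictors must be
// re-checked against the KKT conditions afterward, since the rule can
// occasionally discard a truly active predictor.
std::vector<bool> strong_rule_screen(const std::vector<double>& grad,
                                     double lambda, double lambda_prev) {
    const double threshold = 2.0 * lambda - lambda_prev;
    std::vector<bool> keep(grad.size());
    for (std::size_t j = 0; j < grad.size(); ++j)
        keep[j] = std::abs(grad[j]) >= threshold;
    return keep;
}
```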
Another notable feature of cuHAL is its support for user-defined loss functions with minimal effort. To add a custom loss function, users need only implement it in Loss.hpp
and register it in LossRegister.hpp. Once registered, the custom loss can be specified in the configuration.
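As a sketch of what such a loss might look like (hypothetical; the actual interface expected by Loss.hpp may differ), here is a pseudo-Huber loss exposing the two pieces a trainer generally needs, the loss value and its gradient with respect to the prediction:

```cpp
#include <cmath>

// Hypothetical example loss -- the interface required by Loss.hpp may
// differ. Pseudo-Huber is quadratic near zero and linear in the tails.
struct PseudoHuberLoss {
    double delta = 1.0;  // transition scale between quadratic and linear

    double value(double y, double y_hat) const {
        const double r = (y - y_hat) / delta;
        return delta * delta * (std::sqrt(1.0 + r * r) - 1.0);
    }
    double grad(double y, double y_hat) const {  // d loss / d y_hat
        const double r = (y - y_hat) / delta;
        return -delta * r / std::sqrt(1.0 + r * r);
    }
};
```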
This library has been tested and is primarily supported on the following system configuration:
- Operating System: Ubuntu 24.04.1 LTS (including support for WSL2)
- Compiler: GCC 13.3.0
- Hardware: GeForce RTX 4060, GeForce RTX 2080
Note: This library has been tested exclusively on the specified system configuration. It may work in other environments or setups, but functionality and performance cannot be guaranteed or formally supported.
Install nvcc, the CUDA compiler driver required to compile CUDA programs:

```sh
sudo apt install -y nvidia-cuda-toolkit
```
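You can confirm the toolkit is available afterwards with `nvcc --version`.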
Install NumCpp, a C++ library that provides functionality similar to NumPy:

```sh
sudo apt-get install libboost-all-dev
git clone https://github.com/dpilger26/NumCpp.git
cd NumCpp
mkdir build
cd build
cmake ..
sudo cmake --build . --target install
```
Install xmake, a lightweight build system used to compile cuHAL:

```sh
curl -fsSL https://xmake.io/shget.text | bash
```
With all dependencies installed, clone the cuHAL repository and build it:

```sh
git clone https://github.com/ChengYuHan0406/cuHAL.git
cd cuHAL
xmake
```
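After the build completes, the `cuHAL` executable and the shared libraries it relies on should sit under `build/` and `build/lib`, respectively; both are referenced in the examples below.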
The `examples` directory contains examples demonstrating three common loss types: `mse`, `wmae`, and `coxloss`.
- Set Library Path: Before running examples, ensure the library path is set correctly, as the executable `build/cuHAL` relies on `build/lib`:

  ```sh
  source set_lib_path.sh
  ```
- Training: To train a model, `cuHAL` uses a JSON file to specify configurations such as data paths, losses, hyperparameters, and more:

  ```sh
  cd build
  ./cuHAL ../examples/mse/config.json
  ```
- Configuration: Below is an example configuration file (`examples/mse/config.json`):

  ```json
  {
    "num_features": 10,
    "train_size": 1000,
    "val_size": 500,
    "loss": "mse",
    "path_X_train": "../examples/mse/X_train.csv",
    "path_y_train": "../examples/mse/y_train.csv",
    "path_X_val": "../examples/mse/X_val.csv",
    "path_y_val": "../examples/mse/y_val.csv",
    "max_order": 2,
    "sample_ratio": 0.5,
    "reduce_epsilon": 0.1,
    "step_size": 0.001,
    "max_iter": 500
  }
  ```

  | Param | Meaning |
  | --- | --- |
  | `max_order` | Upper limit on the order of interactions between variables. |
  | `sample_ratio` | Ratio of samples used as knot points. For example, if `train_size=1000` and `sample_ratio=0.5`, then 500 samples will be randomly selected as knot points. |
  | `reduce_epsilon` | Columns of the `DesignMatrix` are filtered based on their proportion of ones. Columns with proportions below `(1 + reduce_epsilon) × min_prop_ones`, as well as columns that are entirely 0 or entirely 1, are removed. |
  | `step_size` | Step size for the `SRTrainer`. |
  | `max_iter` | Maximum number of iterations for each `lambda` (regularization rate). |
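  For a rough sense of scale under this configuration (assuming one basis function per knot point per interaction subset, as in the counting above): with 500 knot points, `num_features = 10`, and `max_order = 2`, there are on the order of $500 \times \left(\binom{10}{1} + \binom{10}{2}\right) = 27{,}500$ candidate columns before `reduce_epsilon` filtering prunes them.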
- Inference: To make predictions using a trained model, run the `inference.py` script:

  ```sh
  cd scripts
  python inference.py ../build/best_model.json ../examples/mse/X_test.csv ./y_hat.csv
  ```
- Feature Importance: To compute and display feature importance, use the `feature_importance.py` script:

  ```sh
  cd scripts
  python feature_importance.py ../build/best_model.json ../examples/mse/col_names.csv 20
  ```
This project is licensed under the MIT License. See the LICENSE file for details.
This project uses the following third-party libraries:
- nlohmann/json, licensed under the MIT License.
- dpilger26/NumCpp, licensed under the MIT License.
- google/googletest, licensed under the BSD 3-Clause License.

