The likelihood, i.e., the distribution of the response variable (=label) conditional on fixed and random effects, is set through the likelihood
parameter of the random effects / Gaussian process model (= 'GPModel' in R / Python) for both the GPBoost algorithm and (generalized) linear mixed effects and Gaussian process models. See the likelihood documentation below for a list of currently supported likelihoods.
It is important that tuning parameters (= hyperparameters) for the tree-boosting part are chosen appropriately. There are no universally good "default" values for different data sets. See below for a list of important tuning parameters. Selecting tuning parameters can be done conveniently via the gpb.grid.search.tune.parameters function in the Python and R packages.
The metric parameter (e.g., for the gpb.train, gpboost, and gpb.grid.search.tune.parameters functions in R and Python) specifies how prediction accuracy is measured on validation data.
- For the GPBoost algorithm, i.e., if there is a gp_model, test_neg_log_likelihood is the default metric.
- Other supported metrics include: mse, mae, binary_logloss, binary_error, and auc.
- If another metric besides test_neg_log_likelihood is used for the GPBoost algorithm, it is calculated as follows. First, the predictive mean of the response variable is calculated. Second, the corresponding metric is evaluated using this predictive mean as point prediction. See here for a list of all supported metrics.
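The two-step calculation described above can be sketched in plain Python. The function names are illustrative and not part of the GPBoost API, the predictive means are assumed to have been computed already, and the 0.5 threshold used for binary_error is an assumption of this sketch.

```python
def mse(y_true, pred_mean):
    """Mean squared error with the predictive mean as point prediction."""
    return sum((y - m) ** 2 for y, m in zip(y_true, pred_mean)) / len(y_true)


def binary_error(y_true, pred_mean, threshold=0.5):
    """Misclassification rate; for binary data the predictive mean is a probability."""
    # threshold=0.5 is an assumption of this sketch
    wrong = sum(int(m > threshold) != y for y, m in zip(y_true, pred_mean))
    return wrong / len(y_true)
```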
Below is a list of important parameters for the tree-boosting part. A comprehensive list of all tree-boosting related parameters can be found here.
num_iterations, default = 100, type = int, aliases: num_iteration, n_iter, num_tree, num_trees, num_round, num_rounds, num_boost_round, n_estimators, constraints: num_iterations >= 0
- number of boosting iterations
- this is arguably the most important tuning parameter, in particular for regression settings
learning_rate, default = 0.1, type = double, aliases: shrinkage_rate, eta, constraints: learning_rate > 0.0
- shrinkage rate or damping parameter
- smaller values lead to higher predictive accuracy but require more computational time since more boosting iterations are needed
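The link between a smaller learning_rate and a larger number of boosting iterations can be illustrated with a toy calculation. This is only a sketch under strong assumptions (squared loss, a single observation, and a base learner that fits the current residual exactly); it says nothing about predictive accuracy and is not how the library trains.

```python
def iterations_until_fit(learning_rate, tol=0.01, max_iter=100_000):
    """Count boosting updates until the toy residual drops to tol or below.

    Toy assumption: every update removes the fraction learning_rate
    of the current residual, which starts at 1.0.
    """
    residual = 1.0
    for iteration in range(1, max_iter + 1):
        residual -= learning_rate * residual
        if abs(residual) <= tol:
            return iteration
    return max_iter
```

Comparing, e.g., iterations_until_fit(0.1) with iterations_until_fit(0.01) shows roughly a tenfold increase in the number of updates.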
max_depth, default = -1, type = int
- maximal depth of a tree
- <= 0 means no limit
num_leaves, default = 31, type = int, aliases: num_leaf, max_leaves, max_leaf, constraints: 1 < num_leaves <= 131072
- maximal number of leaves of a tree
- Note on ``max_depth`` and ``num_leaves`` parameters: The GPBoost library uses the LightGBM tree growing algorithm, which grows trees using a leaf-wise strategy. I.e., trees are grown by first splitting the leaf nodes that maximize the information gain until the maximal number of leaves num_leaves or the maximal depth of a tree max_depth is attained, even when this leads to unbalanced trees. This is in contrast to the depth-wise growth strategy of other boosting implementations, which builds "balanced" trees. For shallow trees (= small max_depth), there is likely no difference between these two tree growing strategies. If you only want to tune the maximal depth of a tree (max_depth) and not the num_leaves parameter, it is recommended that you set the num_leaves parameter to a large value.
min_data_in_leaf, default = 20, type = int, aliases: min_data_per_leaf, min_data, min_child_samples, constraints: min_data_in_leaf >= 0
- minimal number of samples in a leaf
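Regarding the note on max_depth and num_leaves: one way to pick a "large value" for num_leaves when only max_depth is tuned is to use the number of leaves of a complete binary tree of that depth, capped at the documented upper bound. The helper name is hypothetical and the rule is a suggestion of this sketch, not a library default.

```python
MAX_NUM_LEAVES = 131072  # upper bound from the num_leaves constraint


def num_leaves_for_depth(max_depth):
    """Return a num_leaves value that does not restrict trees of depth max_depth."""
    if max_depth <= 0:
        # max_depth <= 0 means no depth limit, so only the hard cap applies
        return MAX_NUM_LEAVES
    # a binary tree of depth d has at most 2**d leaves; num_leaves must exceed 1
    return min(max(2 ** max_depth, 2), MAX_NUM_LEAVES)


# parameter dictionary in which only max_depth effectively limits tree size
params = {"max_depth": 5, "num_leaves": num_leaves_for_depth(5)}
```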
lambda_l2, default = 0.0, type = double, aliases: reg_lambda, lambda, constraints: lambda_l2 >= 0.0
- L2 regularization
lambda_l1, default = 0.0, type = double, aliases: reg_alpha, constraints: lambda_l1 >= 0.0
- L1 regularization
max_bin, default = 255, type = int, constraints: max_bin > 1
- maximal number of bins that feature values will be bucketed in
- GPBoost uses histogram-based algorithms [1, 2, 3], which bucket continuous feature (covariate) values into discrete bins. A small number speeds up training and reduces memory usage but may reduce the accuracy of the model
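To make the idea of bucketing concrete, the following sketch maps continuous feature values to at most max_bin integer bins. Equal-width bins are an assumption made here for simplicity; the library chooses its bin boundaries differently, so this only illustrates the concept.

```python
def bin_values(values, max_bin=255):
    """Map continuous values to integer bin indices in [0, max_bin - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # a constant feature falls into a single bin
        return [0 for _ in values]
    width = (hi - lo) / max_bin
    # the maximum would land in bin max_bin, so clamp it into the last bin
    return [min(int((v - lo) / width), max_bin - 1) for v in values]
```

With fewer bins, more distinct feature values share a bin, which is why a small max_bin is faster but can cost accuracy.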
min_gain_to_split, default = 0.0, type = double, aliases: min_split_gain, constraints: min_gain_to_split >= 0.0
- the minimal gain to perform a split
line_search_step_length, default = false, type = bool
- if true, a line search is done to find the optimal step length for every boosting update (see, e.g., Friedman 2001). This is then multiplied by the learning_rate
- applies only to the GPBoost algorithm
reuse_learning_rates_gp_model, default = true, type = bool
- if true, the learning rates for the covariance and potential auxiliary parameters are kept at the values from the previous boosting iteration and not re-initialized when optimizing them
- this option can only be used if optimizer_cov = gradient_descent or optimizer_cov = lbfgs (for the latter, the approximate Hessian is reused)
train_gp_model_cov_pars, default = true, type = bool
- if true, the covariance parameters of the Gaussian process / random effects model are trained (estimated) in every boosting iteration of the GPBoost algorithm, otherwise not
use_gp_model_for_validation, default = true, type = bool
- set this to true to also use the Gaussian process / random effects model (in addition to the tree model) for calculating predictions on the validation data when using the GPBoost algorithm
leaves_newton_update, default = false, type = bool
- if true, a Newton update step is done for the tree leaves after the gradient step
- applies only to the GPBoost algorithm for Gaussian data and cannot be used for non-Gaussian data
Below is a list of parameters for specifying GPModel objects for modeling Gaussian processes and grouped random effects and for specifying how these models are trained. These parameters are documented in a generic manner in the form in which they are used in the R and Python packages. The C API works slightly differently.
likelihood: string, (default = gaussian)
- Likelihood function, i.e., conditional distribution of the response variable (=label)
- Currently supported likelihoods:
  - gaussian: Gaussian likelihood
  - bernoulli_probit: Bernoulli likelihood with a probit link function. Aliases: binary, binary_probit
  - bernoulli_logit: Bernoulli likelihood with a logit link function. Aliases: binary, binary_logit
  - poisson: Poisson likelihood with log link function
  - gamma: Gamma likelihood with log link function
  - negative_binomial: Negative binomial likelihood with log link function
- Note: other likelihoods could be implemented upon request
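The link functions named in the list above determine how a latent value (fixed plus random effects) is mapped to the mean of the response. The sketch below writes out the corresponding inverse links using only the standard library. The function name is hypothetical, and the identity link for the Gaussian likelihood is the usual convention rather than something stated in the list.

```python
import math


def inverse_link(likelihood, latent):
    """Map a latent value to the mean of the response variable."""
    if likelihood == "gaussian":
        return latent  # identity link (usual convention, assumed here)
    if likelihood == "bernoulli_probit":
        # standard normal CDF written in terms of the error function
        return 0.5 * (1.0 + math.erf(latent / math.sqrt(2.0)))
    if likelihood == "bernoulli_logit":
        return 1.0 / (1.0 + math.exp(-latent))
    if likelihood in ("poisson", "gamma", "negative_binomial"):
        return math.exp(latent)  # log link
    raise ValueError(f"unsupported likelihood: {likelihood}")
```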
group_data: two dimensional array / matrix of doubles or strings, optional (default = None)
- Labels of group levels for grouped random effects
group_rand_coef_data: two dimensional array / matrix of doubles or None, optional (default = None)
- Covariate data for grouped random coefficients
ind_effect_group_rand_coef: integer vector / array of integers or None, optional (default = None)
- Indices that relate every random coefficient to a "base" intercept grouped random effect. Counting starts at 1.
gp_coords: two dimensional array / matrix of doubles or None, optional (default = None)
- Coordinates (input features) for Gaussian process
gp_rand_coef_data: two dimensional array / matrix of doubles or None, optional (default = None)
- Covariate data for Gaussian process random coefficients
cov_function: string, (default = exponential)
- Covariance function for the Gaussian process. Available options:
  - exponential: Exponential covariance function (using the parametrization of Diggle and Ribeiro, 2007)
  - gaussian: Gaussian, aka squared exponential, covariance function (using the parametrization of Diggle and Ribeiro, 2007)
  - matern: Matern covariance function with the smoothness specified by the cov_fct_shape parameter (using the parametrization of Rasmussen and Williams, 2006)
  - powered_exponential: Powered exponential covariance function with the exponent specified by the cov_fct_shape parameter (using the parametrization of Diggle and Ribeiro, 2007)
  - wendland: Compactly supported Wendland covariance function (using the parametrization of Bevilacqua et al., 2019, AOS)
  - matern_space_time: Spatio-temporal Matern covariance function with different range parameters for space and time
    - Note that the first column in gp_coords must correspond to the time dimension
  - matern_ard: Anisotropic Matern covariance function with Automatic Relevance Determination (ARD), i.e., with a different range parameter for every coordinate dimension / column of gp_coords
  - gaussian_ard: Anisotropic Gaussian, aka squared exponential, covariance function with Automatic Relevance Determination (ARD), i.e., with a different range parameter for every coordinate dimension / column of gp_coords
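For orientation, three of the covariance functions listed above can be written as functions of the distance between two points. The formulas follow the cited parametrizations as they are commonly stated (sigma2 is the marginal variance, rho the range); they are written from those references and are not taken from the library source, so treat them as a sketch.

```python
import math


def exponential_cov(dist, sigma2=1.0, rho=1.0):
    """Exponential covariance: sigma2 * exp(-dist / rho)."""
    return sigma2 * math.exp(-dist / rho)


def gaussian_cov(dist, sigma2=1.0, rho=1.0):
    """Gaussian (squared exponential) covariance: sigma2 * exp(-(dist / rho)^2)."""
    return sigma2 * math.exp(-((dist / rho) ** 2))


def matern_cov_1_5(dist, sigma2=1.0, rho=1.0):
    """Matern covariance with smoothness 1.5 (the default cov_fct_shape)."""
    scaled = math.sqrt(3.0) * dist / rho
    return sigma2 * (1.0 + scaled) * math.exp(-scaled)
```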
cov_fct_shape: double, (default = 1.5)
- Shape parameter of the covariance function (= smoothness parameter for the Matern covariance). This parameter is irrelevant for some covariance functions such as the exponential or Gaussian.
gp_approx: string, (default = none)
- Specifies the use of a large data approximation for Gaussian processes. Available options:
  - none: No approximation
  - vecchia: A Vecchia approximation; see Sigrist (2022, JMLR) for more details
  - tapering: The covariance function is multiplied by a compactly supported Wendland correlation function
  - fitc: Fully Independent Training Conditional approximation aka modified predictive process approximation; see Gyger, Furrer, and Sigrist (2024) for more details
  - full_scale_tapering: A full scale approximation combining an inducing point / predictive process approximation with tapering on the residual process; see Gyger, Furrer, and Sigrist (2024) for more details
cov_fct_taper_range: double, (default = 1.)
- Range parameter of the Wendland covariance function and Wendland correlation taper function. We follow the notation of Bevilacqua et al. (2019, AOS)
cov_fct_taper_shape: double, (default = 1.)
- Shape parameter of the Wendland covariance function and Wendland correlation taper function. We follow the notation of Bevilacqua et al. (2019, AOS)
num_neighbors: integer, (default = 20)
- Number of neighbors for the Vecchia approximation
vecchia_ordering: string, (default = random)
- Ordering used in the Vecchia approximation. Available options:
  - none: the default ordering in the data is used
  - random: a random ordering
  - time: ordering according to time (only for space-time models)
  - time_random_space: ordering according to time and randomly for all spatial points with the same time points (only for space-time models)
vecchia_pred_type: string, (default = Null)
- Type of Vecchia approximation used for making predictions
- Default value if vecchia_pred_type = Null: order_obs_first_cond_obs_only
- Available options:
  - order_obs_first_cond_obs_only: observed data is ordered first and the neighbors are only observed points
  - order_obs_first_cond_all: observed data is ordered first and the neighbors are selected among all points (observed + predicted)
  - latent_order_obs_first_cond_obs_only: Vecchia approximation for the latent process and observed data is ordered first and neighbors are only observed points
  - latent_order_obs_first_cond_all: Vecchia approximation for the latent process and observed data is ordered first and neighbors are selected among all points
  - order_pred_first: predicted data is ordered first for making predictions. This option is only available for Gaussian likelihoods
num_neighbors_pred: integer, (default = Null)
- Number of neighbors for the Vecchia approximation for making predictions.
- Default value if num_neighbors_pred = Null: num_neighbors_pred = 2 * num_neighbors
num_ind_points: integer, (default = 500)
- Number of inducing points / knots for, e.g., a predictive process approximation
matrix_inversion_method: string, (default = cholesky)
- Method used for inverting covariance matrices. Available options:
  - cholesky: Cholesky factorization
  - iterative: iterative methods. A combination of conjugate gradient, Lanczos algorithm, and other methods. This is currently only supported for the following cases:
    - likelihood != gaussian and gp_approx == vecchia (non-Gaussian likelihoods with a Vecchia-Laplace approximation)
    - likelihood == gaussian and gp_approx == full_scale_tapering (Gaussian likelihood with a full-scale tapering approximation)
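The two supported cases for the iterative option can be expressed as a small check, for example to validate a configuration before building a model. The function name is hypothetical; the conditions are copied from the list above.

```python
def iterative_inversion_supported(likelihood, gp_approx):
    """Return True if matrix_inversion_method = iterative is listed as supported."""
    # non-Gaussian likelihood with a Vecchia-Laplace approximation
    non_gaussian_vecchia = likelihood != "gaussian" and gp_approx == "vecchia"
    # Gaussian likelihood with a full-scale tapering approximation
    gaussian_full_scale = likelihood == "gaussian" and gp_approx == "full_scale_tapering"
    return non_gaussian_vecchia or gaussian_full_scale
```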
seed: integer, (default = 0)
- The seed used for model creation (e.g., random ordering in Vecchia approximation)
cluster_ids: one dimensional numpy array (vector) with integer data or Null, (default = Null)
- IDs / labels indicating independent realizations of random effects / Gaussian processes (same values = same process realization)
The following list shows options for the optimization of the variance and covariance parameters of gp_model objects which contain Gaussian process and/or grouped random effects models. These parameters are passed to either the fit function of a gp_model object in Python and R or to the set_optim_params function prior to running the GPBoost algorithm.
optimizer_cov: string, optional (default = "lbfgs" for linear mixed effects models and "gradient_descent" for the GPBoost algorithm)
- Optimizer used for estimating covariance parameters
- Options: gradient_descent, lbfgs, fisher_scoring, nelder_mead
optimizer_coef: string, optional (default = wls for Gaussian data and gradient_descent for other likelihoods)
- Optimizer used for estimating linear regression coefficients, if there are any (for the GPBoost algorithm there are usually none)
- Options: gradient_descent, lbfgs, wls, nelder_mead. Gradient descent steps are done simultaneously with gradient descent steps for the covariance parameters. wls refers to doing coordinate descent for the regression coefficients using weighted least squares
- If optimizer_cov is set to nelder_mead or lbfgs, optimizer_coef is automatically also set to the same value
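The interplay between the optimizer_cov and optimizer_coef defaults can be spelled out as follows. This is a literal reading of the two entries above under a hypothetical function name; the library's internal logic may differ in details.

```python
def resolve_optimizers(likelihood="gaussian", gpboost_algorithm=False,
                       optimizer_cov=None, optimizer_coef=None):
    """Return (optimizer_cov, optimizer_coef) after applying the documented defaults."""
    if optimizer_cov is None:
        optimizer_cov = "gradient_descent" if gpboost_algorithm else "lbfgs"
    if optimizer_coef is None:
        optimizer_coef = "wls" if likelihood == "gaussian" else "gradient_descent"
    # nelder_mead and lbfgs optimize all parameters jointly
    if optimizer_cov in ("nelder_mead", "lbfgs"):
        optimizer_coef = optimizer_cov
    return optimizer_cov, optimizer_coef
```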
maxit: integer, optional (default = 1000)
- Maximal number of iterations for optimization algorithm
delta_rel_conv: double, optional (default = 1e-6 except for nelder_mead for which the default is 1e-8)
- Convergence tolerance. The algorithm stops if the relative change in either the (approximate) log-likelihood or the parameters is below this value.
- If < 0, internal default values are used (= 1e-6 except for nelder_mead for which the default is 1e-8)
convergence_criterion: string, optional (default = relative_change_in_log_likelihood)
- The convergence criterion used for terminating the optimization algorithm. Options: relative_change_in_log_likelihood or relative_change_in_parameters
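A relative-change stopping rule of the kind described by delta_rel_conv and convergence_criterion can be sketched as below. The function names are hypothetical, and the exact formula used by the library (for instance how the denominator is chosen) is an assumption of this sketch.

```python
def resolve_delta_rel_conv(delta_rel_conv, optimizer):
    """Negative values fall back to the documented internal defaults."""
    if delta_rel_conv < 0:
        return 1e-8 if optimizer == "nelder_mead" else 1e-6
    return delta_rel_conv


def has_converged(previous, current, delta_rel_conv):
    """Check the relative change of a scalar such as the (approximate) log-likelihood."""
    denom = max(abs(previous), 1e-300)  # guard against division by zero
    return abs(current - previous) / denom < delta_rel_conv
```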
init_cov_pars: numeric vector / array of doubles, optional (default = Null)
- Initial values for covariance parameters of Gaussian process and random effects (can be Null). The order is the same as the order of the parameters in the summary function: first is the error variance (only for "gaussian" likelihood), next follow the variances of the grouped random effects (if there are any, in the order provided in 'group_data'), and then follow the marginal variance and the range of the Gaussian process. If there are multiple Gaussian processes, then the variances and ranges follow alternatingly. If 'init_cov_pars = Null', an internal choice is used that depends on the likelihood and the random effects type and covariance function. If you select the option 'trace = true' in the 'params' argument, you will see the first initial covariance parameters in iteration 0.
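The ordering of init_cov_pars described above can be made explicit with a small helper that lists the parameters in the expected order. The helper and the labels it returns are purely illustrative and do not correspond to names used by the library.

```python
def cov_par_order(likelihood, group_names=(), num_gps=0):
    """List covariance parameters in the order expected by init_cov_pars."""
    names = []
    if likelihood == "gaussian":
        names.append("error_variance")  # only present for the gaussian likelihood
    # variances of grouped random effects, in the order of the group_data columns
    names += [f"variance_{g}" for g in group_names]
    # marginal variance and range of each Gaussian process, alternating
    for i in range(1, num_gps + 1):
        names += [f"gp{i}_variance", f"gp{i}_range"]
    return names
```

For a Gaussian likelihood with one grouping variable and one Gaussian process, init_cov_pars would thus hold four values: error variance, group variance, GP variance, GP range.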
init_coef: numeric vector / array of doubles, optional (default = Null)
- Initial values for the regression coefficients (if there are any, can be Null)
lr_cov: double, optional (default = 0.1 for gradient_descent and 1. otherwise)
- Initial learning rate for covariance parameters if a gradient-based optimization method is used
- If < 0, internal default values are used (0.1 for gradient_descent and 1. otherwise)
- If there are additional auxiliary parameters for non-Gaussian likelihoods, lr_cov is also used for those
- For lbfgs, this is divided by the norm of the gradient in the first iteration
lr_coef: double, optional (default = 0.1)
- Learning rate for fixed effect regression coefficients
use_nesterov_acc: bool, optional (default = True)
- If True, Nesterov acceleration is used (only for gradient descent)
acc_rate_cov: double, optional (default = 0.5)
- Acceleration rate for covariance parameters for Nesterov acceleration
acc_rate_coef: double, optional (default = 0.5)
- Acceleration rate for coefficients for Nesterov acceleration
momentum_offset: integer, optional (default = 2)
- Number of iterations for which no momentum is applied in the beginning
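How a learning rate, an acceleration rate, and momentum_offset interact can be illustrated with a generic Nesterov-accelerated gradient descent on a one-dimensional function. This is a textbook-style sketch with a constant acceleration rate, which is an assumption; the momentum schedule actually used by the library may differ.

```python
def nesterov_gradient_descent(grad, x0, lr=0.1, acc_rate=0.5,
                              momentum_offset=2, num_iter=100):
    """Minimize a function with gradient grad using Nesterov acceleration."""
    x, x_prev = x0, x0
    for it in range(num_iter):
        # no momentum is applied during the first momentum_offset iterations
        mu = 0.0 if it < momentum_offset else acc_rate
        lookahead = x + mu * (x - x_prev)
        # gradient step taken from the look-ahead point
        x_prev, x = x, lookahead - lr * grad(lookahead)
    return x


# minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
minimum = nesterov_gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```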
trace: bool, optional (default = False)
- If True, information on the progress of the parameter optimization is printed.
std_dev: bool, optional (default = False)
- If True, (asymptotic) standard deviations are calculated for the covariance parameters