Main parameters for GPBoost

Likelihood / response variable distribution

The likelihood, i.e., the distribution of the response variable (=label) conditional on fixed and random effects, is set through the likelihood parameter of the random effects / Gaussian process model (= 'GPModel' in R / Python) for both the GPBoost algorithm and (generalized) linear mixed effects and Gaussian process models. See the likelihood documentation below for a list of currently supported likelihoods.
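
For concreteness, the following is a minimal sketch of setting the likelihood when creating a GPModel in the Python package; the group labels are placeholder data::

    import numpy as np
    import gpboost as gpb

    # Placeholder group labels for a grouped random effects model
    group = np.random.default_rng(0).integers(0, 10, size=100)

    # 'likelihood' sets the distribution of the response variable
    # conditional on the fixed and random effects
    gp_model = gpb.GPModel(group_data=group, likelihood="bernoulli_probit")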

Metrics for parameter tuning

It is important that tuning parameters (= hyperparameters) for the tree-boosting part are chosen appropriately. There are no universally good "default" values across data sets. See below for a list of important tuning parameters. Selecting tuning parameters can be done conveniently via the gpb.grid.search.tune.parameters function in R and the gpb.grid_search_tune_parameters function in Python; a sketch is shown after the following list.

The metric parameter (e.g., for the gpb.train, gpboost, and gpb.grid.search.tune.parameters functions in R and Python) specifies how prediction accuracy is measured on validation data.

  • For the GPBoost algorithm, i.e., if there is a gp_model, test_neg_log_likelihood is the default metric.
  • Other supported metrics include: mse, mae, binary_logloss, binary_error, and auc.
  • If another metric besides test_neg_log_likelihood is used for the GPBoost algorithm, it is calculated as follows. First, the predictive mean of the response variable is calculated. Second, the corresponding metric is evaluated using this predictive mean as point prediction. See here for a list of all supported metrics.
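
To illustrate, a sketch of tuning with a chosen validation metric, assuming the Python function gpb.grid_search_tune_parameters (the analogue of R's gpb.grid.search.tune.parameters); the data and grid values are placeholders, not recommendations::

    import numpy as np
    import gpboost as gpb

    rng = np.random.default_rng(1)
    X = rng.standard_normal((500, 5))          # placeholder covariates
    group = rng.integers(0, 50, size=500)      # placeholder group labels
    y = X[:, 0] + rng.standard_normal(500)     # placeholder response

    gp_model = gpb.GPModel(group_data=group, likelihood="gaussian")
    data_train = gpb.Dataset(X, y)

    param_grid = {"learning_rate": [0.1, 0.01],
                  "max_depth": [3, 6],
                  "min_data_in_leaf": [10, 100]}

    # With a gp_model present, the default metric is test_neg_log_likelihood;
    # 'metric' overrides it, here with the mean squared error
    opt_params = gpb.grid_search_tune_parameters(
        param_grid=param_grid, train_set=data_train, gp_model=gp_model,
        nfold=4, metric="mse", num_boost_round=200)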

Tuning parameters (= hyperparameters) for the tree-boosting part

Below is a list of important parameters for the tree-boosting part. A comprehensive list of all tree-boosting related parameters can be found here.

  • num_iterations, default = 100, type = int, aliases: num_iteration, n_iter, num_tree, num_trees, num_round, num_rounds, num_boost_round, n_estimators, constraints: num_iterations >= 0
    • number of boosting iterations
    • this is arguably the most important tuning parameter, in particular for regression settings
  • learning_rate, default = 0.1, type = double, aliases: shrinkage_rate, eta, constraints: learning_rate > 0.0
    • shrinkage rate or damping parameter
    • smaller values lead to higher predictive accuracy but require more computational time since more boosting iterations are needed
  • max_depth, default = -1, type = int
    • maximal depth of a tree
    • <= 0 means no limit
  • num_leaves, default = 31, type = int, aliases: num_leaf, max_leaves, max_leaf, constraints: 1 < num_leaves <= 131072
    • maximal number of leaves of a tree
  • Note on ``max_depth`` and ``num_leaves`` parameters: The GPBoost library uses the LightGBM tree growing algorithm, which grows trees using a leaf-wise strategy: trees are grown by repeatedly splitting the leaf node that maximizes the information gain until the maximal number of leaves num_leaves or the maximal tree depth max_depth is attained, even when this leads to unbalanced trees. This is in contrast to the depth-wise growth strategy of other boosting implementations, which builds "balanced" trees. For shallow trees (= small max_depth), there is likely no difference between these two tree growing strategies. If you only want to tune the maximal tree depth max_depth and not the num_leaves parameter, it is recommended that you set num_leaves to a large value
  • min_data_in_leaf, default = 20, type = int, aliases: min_data_per_leaf, min_data, min_child_samples, constraints: min_data_in_leaf >= 0
    • minimal number of samples in a leaf
  • lambda_l2, default = 0.0, type = double, aliases: reg_lambda, lambda, constraints: lambda_l2 >= 0.0
    • L2 regularization
  • lambda_l1, default = 0.0, type = double, aliases: reg_alpha, constraints: lambda_l1 >= 0.0
    • L1 regularization
  • max_bin, default = 255, type = int, constraints: max_bin > 1
    • Maximal number of bins that feature values will be bucketed in
    • GPBoost uses histogram-based algorithms [1, 2, 3], which bucket continuous feature (covariate) values into discrete bins. A small number speeds up training and reduces memory usage but may reduce the accuracy of the model
  • min_gain_to_split, default = 0.0, type = double, aliases: min_split_gain, constraints: min_gain_to_split >= 0.0
    • the minimal gain to perform a split
  • line_search_step_length, default = false, type = bool
    • if true, a line search is done to find the optimal step length for every boosting update (see, e.g., Friedman 2001). This is then multiplied by the learning_rate
    • applies only to the GPBoost algorithm
  • reuse_learning_rates_gp_model, default = true, type = bool
    • if true, the learning rates for the covariance and potential auxiliary parameters are kept at the values from the previous boosting iteration and not re-initialized when optimizing them
    • this option can only be used if optimizer_cov = gradient_descent or optimizer_cov = lbfgs (for the latter, the approximate Hessian is reused)
  • train_gp_model_cov_pars, default = true, type = bool
    • if true, the covariance parameters of the Gaussian process / random effects model are trained (estimated) in every boosting iteration of the GPBoost algorithm, otherwise not
  • use_gp_model_for_validation, default = true, type = bool
    • set this to true to also use the Gaussian process / random effects model (in addition to the tree model) for calculating predictions on the validation data when using the GPBoost algorithm
  • leaves_newton_update, default = false, type = bool
    • if true, a Newton update step is done for the tree leaves after the gradient step
    • applies only to the GPBoost algorithm for Gaussian data and cannot be used for non-Gaussian data
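
As a concrete usage pattern, a minimal sketch of passing some of these parameters to gpb.train in Python; the data is a placeholder and the values are illustrative, not recommendations::

    import numpy as np
    import gpboost as gpb

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 3))        # placeholder covariates
    group = rng.integers(0, 20, size=200)    # placeholder group labels
    y = X[:, 0] + rng.standard_normal(200)   # placeholder response

    gp_model = gpb.GPModel(group_data=group, likelihood="gaussian")
    data_train = gpb.Dataset(X, y)

    params = {"learning_rate": 0.05,
              "max_depth": 6,
              "num_leaves": 2 ** 10,   # large, so that only max_depth is binding
              "min_data_in_leaf": 20,
              "lambda_l2": 1.0}

    bst = gpb.train(params=params, train_set=data_train,
                    gp_model=gp_model, num_boost_round=100)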

Gaussian process and random effects parameters

Below is a list of parameters for specifying GPModel objects for modeling Gaussian processes and grouped random effects and for specifying how these models are trained. These parameters are documented in a generic manner in the form in which they are used in the R and Python packages. The C API works slightly differently.

Model specification parameters

  • likelihood : string, (default = gaussian)

    • Likelihood function, i.e., conditional distribution of the response variable (=label)
    • Currently supported likelihoods:
      • gaussian : Gaussian likelihood
      • bernoulli_probit : Bernoulli likelihood with a probit link function. Aliases: binary, binary_probit
      • bernoulli_logit : Bernoulli likelihood with a logit link function. Alias: binary_logit
      • poisson : Poisson likelihood with log link function
      • gamma : Gamma likelihood with log link function
      • negative_binomial : Negative binomial likelihood with log link function
      • Note: other likelihoods could be implemented upon request
  • group_data : two dimensional array / matrix of doubles or strings, optional (default = None)

    • Labels of group levels for grouped random effects
  • group_rand_coef_data : two dimensional array / matrix of doubles or None, optional (default = None)

    • Covariate data for grouped random coefficients
  • ind_effect_group_rand_coef : integer vector / array of integers or None, optional (default = None)

    • Indices that relate every random coefficient to a "base" intercept grouped random effect. Counting starts at 1.
  • gp_coords : two dimensional array / matrix of doubles or None, optional (default = None)

    • Coordinates (input features) for Gaussian process
  • gp_rand_coef_data : two dimensional array / matrix of doubles or None, optional (default = None)

    • Covariate data for Gaussian process random coefficients
  • cov_function : string, (default = exponential)

    • Covariance function for the Gaussian process. Available options:

      • exponential : Exponential covariance function (using the parametrization of Diggle and Ribeiro, 2007)

      • gaussian : Gaussian, aka squared exponential, covariance function (using the parametrization of Diggle and Ribeiro, 2007)

      • matern : Matern covariance function with the smoothness specified by the cov_fct_shape parameter (using the parametrization of Rasmussen and Williams, 2006)

      • powered_exponential : Powered exponential covariance function with the exponent specified by cov_fct_shape parameter (using the parametrization of Diggle and Ribeiro, 2007)

      • wendland : Compactly supported Wendland covariance function (using the parametrization of Bevilacqua et al., 2019, AOS)

      • matern_space_time : Spatio-temporal Matern covariance function with different range parameters for space and time

        • Note that the first column in gp_coords must correspond to the time dimension
      • matern_ard : Anisotropic Matern covariance function with Automatic Relevance Determination (ARD), i.e., with a different range parameter for every coordinate dimension / column of gp_coords

      • gaussian_ard : Anisotropic Gaussian, aka squared exponential, covariance function with Automatic Relevance Determination (ARD), i.e., with a different range parameter for every coordinate dimension / column of gp_coords

  • cov_fct_shape : double, (default = 1.5)

    • Shape parameter of the covariance function (=smoothness parameter for Matern covariance). This parameter is irrelevant for some covariance functions such as the exponential or Gaussian.
  • gp_approx : string, (default = none)

    • Specifies the use of a large data approximation for Gaussian processes. Available options:
      • none : No approximation
      • vecchia : A Vecchia approximation; see Sigrist (2022, JMLR) for more details
      • tapering : The covariance function is multiplied by a compactly supported Wendland correlation function
      • fitc : Fully Independent Training Conditional approximation, aka modified predictive process approximation; see Gyger, Furrer, and Sigrist (2024) for more details
      • full_scale_tapering : A full-scale approximation combining an inducing point / predictive process approximation with tapering applied to the residual process; see Gyger, Furrer, and Sigrist (2024) for more details
  • cov_fct_taper_range : double, (default = 1.)

    • Range parameter of the Wendland covariance function and Wendland correlation taper function. We follow the notation of Bevilacqua et al. (2019, AOS)
  • cov_fct_taper_shape : double, (default = 1.)

    • Shape parameter of the Wendland covariance function and Wendland correlation taper function. We follow the notation of Bevilacqua et al. (2019, AOS)
  • num_neighbors : integer, (default = 20)

    • Number of neighbors for the Vecchia approximation
  • vecchia_ordering : string, (default = random)

    • Ordering used in the Vecchia approximation. Available options:
      • none : the default ordering in the data is used
      • random : a random ordering
      • time : ordering according to time (only for space-time models)
      • time_random_space : ordering according to time and randomly for all spatial points with the same time points (only for space-time models)
  • vecchia_pred_type : string, (default = Null)

    • Type of Vecchia approximation used for making predictions

    • Default value if vecchia_pred_type = Null : order_obs_first_cond_obs_only

    • Available options:

      • order_obs_first_cond_obs_only : observed data is ordered first and the neighbors are only observed points
      • order_obs_first_cond_all : observed data is ordered first and the neighbors are selected among all points (observed + predicted)
      • latent_order_obs_first_cond_obs_only : Vecchia approximation for the latent process and observed data is ordered first and neighbors are only observed points
      • latent_order_obs_first_cond_all : Vecchia approximation for the latent process and observed data is ordered first and neighbors are selected among all points
      • order_pred_first : predicted data is ordered first for making predictions. This option is only available for Gaussian likelihoods
  • num_neighbors_pred : integer, (default = Null)

    • Number of neighbors for the Vecchia approximation for making predictions.
    • Default value if num_neighbors_pred = Null: num_neighbors_pred = 2 * num_neighbors
  • num_ind_points : integer, (default = 500)

    • Number of inducing points / knots for, e.g., a predictive process approximation
  • matrix_inversion_method : string, (default = cholesky)

    • Method used for inverting covariance matrices. Available options:

      • cholesky : Cholesky factorization

      • iterative : iterative methods. A combination of conjugate gradient, Lanczos algorithm, and other methods.

        This is currently only supported for the following cases:

        • likelihood != gaussian and gp_approx == vecchia (non-Gaussian likelihoods with a Vecchia-Laplace approximation)
        • likelihood == gaussian and gp_approx == full_scale_tapering (Gaussian likelihood with a full-scale tapering approximation)
  • seed : integer, (default = 0)

    • The seed used for model creation (e.g., random ordering in Vecchia approximation)
  • cluster_ids : one dimensional numpy array (vector) with integer data or Null, (default = Null)

    • IDs / labels indicating independent realizations of random effects / Gaussian processes (same values = same process realization)
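
Putting several of these options together, a hedged sketch of specifying a Gaussian process model in Python; the coordinates and settings are illustrative only::

    import numpy as np
    import gpboost as gpb

    rng = np.random.default_rng(0)
    coords = rng.uniform(0.0, 1.0, size=(1000, 2))   # placeholder spatial coordinates

    # Matern covariance with smoothness 1.5 and a Vecchia approximation
    # with 20 neighbors and random ordering
    gp_model = gpb.GPModel(gp_coords=coords,
                           cov_function="matern",
                           cov_fct_shape=1.5,
                           gp_approx="vecchia",
                           num_neighbors=20,
                           vecchia_ordering="random",
                           likelihood="gaussian")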

Optimization parameters

The following list shows options for the optimization of the variance and covariance parameters of gp_model objects which contain Gaussian process and/or grouped random effects models. These parameters are passed to either the fit function of a gp_model object in Python and R or to the set_optim_params function prior to running the GPBoost algorithm.

  • optimizer_cov : string, optional (default = "lbfgs" for linear mixed effects models and "gradient_descent" for the GPBoost algorithm)
    • Optimizer used for estimating covariance parameters
    • Options: gradient_descent, lbfgs, fisher_scoring, nelder_mead
  • optimizer_coef : string, optional (default = wls for Gaussian data and gradient_descent for other likelihoods)
    • Optimizer used for estimating linear regression coefficients, if there are any (for the GPBoost algorithm there are usually none)
    • Options: gradient_descent, lbfgs, wls, nelder_mead. Gradient descent steps for the coefficients are done simultaneously with gradient descent steps for the covariance parameters. wls refers to coordinate descent for the regression coefficients using weighted least squares
    • If optimizer_cov is set to nelder_mead or lbfgs, optimizer_coef is automatically also set to the same value
  • maxit : integer, optional (default = 1000)
    • Maximal number of iterations for optimization algorithm
  • delta_rel_conv : double, optional (default = 1e-6 except for nelder_mead for which the default is 1e-8)
    • Convergence tolerance. The algorithm stops if the relative change in either the (approximate) log-likelihood or the parameters is below this value.
    • If < 0, internal default values are used (= 1e-6 except for nelder_mead for which the default is 1e-8)
  • convergence_criterion : string, optional (default = relative_change_in_log_likelihood)
    • The convergence criterion used for terminating the optimization algorithm. Options: relative_change_in_log_likelihood or relative_change_in_parameters
  • init_cov_pars : numeric vector / array of doubles, optional (default = Null)
    • Initial values for covariance parameters of Gaussian process and random effects (can be Null). The order is the same as the order of the parameters in the summary function: first the error variance (only for "gaussian" likelihood), then the variances of the grouped random effects (if there are any, in the order provided in 'group_data'), and then the marginal variance and the range of the Gaussian process. If there are multiple Gaussian processes, their variances and ranges alternate. If 'init_cov_pars = Null', an internal choice is used that depends on the likelihood, the random effects type, and the covariance function. If you select the option 'trace = true' in the 'params' argument, you will see the initial covariance parameters in iteration 0.
  • init_coef : numeric vector / array of doubles, optional (default = Null)
    • Initial values for the regression coefficients (if there are any, can be Null)
  • lr_cov : double, optional (default = 0.1 for gradient_descent and 1. otherwise)
    • Initial learning rate for covariance parameters if a gradient-based optimization method is used
    • If < 0, internal default values are used (0.1 for gradient_descent and 1. otherwise)
    • If there are additional auxiliary parameters for non-Gaussian likelihoods, lr_cov is also used for those
    • For lbfgs, this is divided by the norm of the gradient in the first iteration
  • lr_coef : double, optional (default = 0.1)
    • Learning rate for fixed effect regression coefficients
  • use_nesterov_acc : bool, optional (default = True)
    • If True, Nesterov acceleration is used (only for gradient descent)
  • acc_rate_cov : double, optional (default = 0.5)
    • Acceleration rate for covariance parameters for Nesterov acceleration
  • acc_rate_coef : double, optional (default = 0.5)
    • Acceleration rate for coefficients for Nesterov acceleration
  • momentum_offset : integer, optional (default = 2)
    • Number of iterations for which no momentum is applied in the beginning
  • trace : bool, optional (default = False)
    • If True, information on the progress of the parameter optimization is printed.
  • std_dev : bool, optional (default = False)
    • If True, (asymptotic) standard deviations are calculated for the covariance parameters
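
Finally, a minimal sketch of passing these optimization parameters in Python, either directly to the fit function or via set_optim_params before running the GPBoost algorithm; the data and values are placeholders::

    import numpy as np
    import gpboost as gpb

    rng = np.random.default_rng(0)
    group = rng.integers(0, 50, size=500)   # placeholder group labels
    y = rng.standard_normal(500)            # placeholder response

    gp_model = gpb.GPModel(group_data=group, likelihood="gaussian")

    # Option 1: pass optimization parameters directly when fitting a (G)LMM
    gp_model.fit(y=y, params={"optimizer_cov": "lbfgs",
                              "maxit": 1000,
                              "trace": True,
                              "std_dev": True})

    # Option 2: set them before running the GPBoost algorithm
    gp_model.set_optim_params(params={"optimizer_cov": "gradient_descent",
                                      "lr_cov": 0.1,
                                      "use_nesterov_acc": True})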