Commit

Update fastcpd 0.7.0
*   Remove the C++ unit tests that use Catch and comment out the related code,
    since the new development version of Rcpp is not yet available on CRAN.
    Related pull request: RcppCore/Rcpp#1274.
*   Add more documentation for `fastcpd` method.
doccstat authored Sep 20, 2023
1 parent ee48dd2 commit b6e2ec4
Showing 21 changed files with 381 additions and 195 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Type: Package
Package: fastcpd
Title: Fast Change Point Detection via Sequential Gradient Descent
Version: 0.6.5
Version: 0.7.0
Authors@R: c(
person("Xingchi", "Li", , "anthony.li@stat.tamu.edu", role = c("aut", "cre", "cph"),
comment = c(ORCID = "0009-0006-2493-0853")),
2 changes: 1 addition & 1 deletion LICENSE.md
@@ -1,7 +1,7 @@
GNU General Public License
==========================

_Version 3, 29 June 2007_
_Version 3, 29 June 2007_
_Copyright © 2007 Free Software Foundation, Inc. &lt;<http://fsf.org/>&gt;_

Everyone is permitted to copy and distribute verbatim copies of this license
7 changes: 7 additions & 0 deletions NEWS.md
@@ -1,3 +1,10 @@
# fastcpd 0.7.0

* Remove the C++ unit tests that use Catch and comment out the related code,
since the new development version of Rcpp is not yet available on CRAN.
Related pull request: https://github.com/RcppCore/Rcpp/pull/1274.
* Add more documentation for `fastcpd` method.

# fastcpd 0.6.5

* Add more experiments.
28 changes: 14 additions & 14 deletions R/RcppExports.R
@@ -187,30 +187,30 @@ cost_optim <- function(family, p, data_segment, cost, lambda, cv) {
#' @param momentum_coef Momentum coefficient to be applied to each update.
#' @param k Function on number of epochs in SGD.
#' @param family Family of the models. Can be "binomial", "poisson", "lasso" or
#' "gaussian". If not provided, the user must specify the cost function and
#' its gradient (and Hessian).
#' "gaussian". If not provided, the user must specify the cost function and
#' its gradient (and Hessian).
#' @param epsilon Epsilon to avoid numerical issues. Only used for binomial and
#' poisson.
#' poisson.
#' @param min_prob Minimum probability to avoid numerical issues. Only used for
#' poisson.
#' poisson.
#' @param winsorise_minval Minimum value to be winsorised. Only used for
#' poisson.
#' poisson.
#' @param winsorise_maxval Maximum value to be winsorised. Only used for
#' poisson.
#' poisson.
#' @param p Number of parameters to be estimated.
#' @param cost Cost function to be used. If not specified, the default is
#' the negative log-likelihood for the corresponding family.
#' the negative log-likelihood for the corresponding family.
#' @param cost_gradient Gradient for custom cost function.
#' @param cost_hessian Hessian for custom cost function.
#' @param cp_only Whether to return only the change points or with the cost
#' values for each segment. If family is not provided or set to be
#' "custom", this parameter will be set to be true.
#' values for each segment. If family is not provided or set to be
#' "custom", this parameter will be set to be true.
#' @param vanilla_percentage What fraction of the data should be processed
#' through vanilla PELT. Range should be between 0 and 1. If set to be 0, all
#' data will be processed through sequential gradient descent. If set to be 1,
#' all data will be processed through vanilla PELT. If the cost function
#' has an explicit solution, i.e. does not depend on coefficients like
#' the mean change case, this parameter will be set to be 1.
#' through vanilla PELT. Range should be between 0 and 1. If set to be 0, all
#' data will be processed through sequential gradient descent. If set to be 1,
#' all data will be processed through vanilla PELT. If the cost function
#' has an explicit solution, i.e. does not depend on coefficients like
#' the mean change case, this parameter will be set to be 1.
#' @keywords internal
#'
#' @noRd
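Reviewer note: the `vanilla_percentage` parameter documented above splits the data between vanilla PELT and sequential gradient descent. The updated `fastcpd` documentation in this commit says the first `vanilla_percentage * nrow(data)` points go through vanilla PELT. A minimal sketch of that split in base R follows; the variable names and the rounding rule are assumptions for illustration, not package code.

```r
# Hypothetical illustration of the vanilla_percentage split; not package code.
n <- 200                    # number of rows in the data (assumed)
vanilla_percentage <- 0.25  # must lie in [0, 1]
stopifnot(vanilla_percentage >= 0, vanilla_percentage <= 1)

# Rounding down is an assumption; the package may round differently.
n_vanilla <- floor(vanilla_percentage * n)

# Rows handled by vanilla PELT, then rows handled by sequential gradient descent.
pelt_rows <- seq_len(n_vanilla)
sgd_rows <- setdiff(seq_len(n), pelt_rows)
```

With `vanilla_percentage = 0` the first index set is empty, and with `vanilla_percentage = 1` the second one is, which matches the two boundary cases described in the parameter documentation.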
7 changes: 4 additions & 3 deletions R/catch-routine-registration.R
@@ -1,6 +1,7 @@
# This dummy function definition is included with the package to ensure that
# 'tools::package_native_routine_registration_skeleton()' generates the required
# registration info for the 'run_testthat_tests' symbol.
(function() {
.Call("run_testthat_tests", FALSE, PACKAGE = "fastcpd")
})
# Commented out due to the LTO on CRAN.
# (function() {
# .Call("run_testthat_tests", FALSE, PACKAGE = "fastcpd")
# })
28 changes: 15 additions & 13 deletions R/fastcpd-class.R
@@ -7,16 +7,18 @@
#' with only change points or with change points and parameters, which you can
#' select using \code{@}.
#'
#' @slot call The call to \link{fastcpd}.
#' @slot data The data used.
#' @slot call The call of the function `fastcpd`.
#' @slot data The data passed to the `fastcpd` function.
#' @slot family The family of the model.
#' @slot cp_set The change points.
#' @slot cost_values The cost values for each segment.
#' @slot residuals The residuals for each segment.
#' @slot thetas The estimated parameters for each segment.
#' @slot cp_only A boolean indicating whether \link{fastcpd} was run to return
#' only the change points or the change points with the estimated parameters
#' and cost values for each segment.
#' @slot cp_set The set of change points.
#' @slot cost_values The cost function values for each segment.
#' @slot residuals The residuals for each segment. Used only for built-in
#' families.
#' @slot thetas The estimated parameters for each segment. Used only for
#' built-in families.
#' @slot cp_only A boolean indicating whether `fastcpd` was run to return
#' only the change points or the change points with the estimated parameters
#' and cost values for each segment.
#' @export
setClass(
"fastcpd",
@@ -110,7 +112,7 @@ print.fastcpd <- function(x, ...) {
#' @param ... Ignored.
#'
#' @return Return a (temporarily) invisible copy of the \code{fastcpd} object.
#' Called primarily for printing the change points in the model.
#' Called primarily for printing the change points in the model.
#' @rdname print
#' @export
setMethod("print", "fastcpd", print.fastcpd)
@@ -132,7 +134,7 @@ show.fastcpd <- function(object) {
#' @param object \code{fastcpd} object.
#'
#' @return No return value, called for showing a list of available methods
#' for a \code{fastcpd} object.
#' for a \code{fastcpd} object.
#' @rdname show
#' @export
setMethod("show", "fastcpd", show.fastcpd)
@@ -172,8 +174,8 @@ summary.fastcpd <- function(object, ...) {
#' @param ... Ignored.
#'
#' @return Return a (temporarily) invisible copy of the \code{fastcpd} object.
#' Called primarily for printing the summary of the model including the
#' call, the change points, the cost values and the estimated parameters.
#' Called primarily for printing the summary of the model including the
#' call, the change points, the cost values and the estimated parameters.
#' @rdname summary
#' @export
setMethod("summary", "fastcpd", summary.fastcpd)
184 changes: 138 additions & 46 deletions R/fastcpd.R
@@ -22,56 +22,148 @@ NULL
#' Sequential Gradient Descent and Quasi-Newton's Method for Change-Point
#' Analysis
#'
#' @param formula A symbolic description of the model to be fitted. A response
#' variable is not necessary in the case of mean change or variance change.
#' Please refer to the examples for more details.
#' @param data A data frame containing the data to be segmented. The data frame
#' should contain the response variable as the first column and the
#' covariates as the rest of the columns if the dataset is a regression
#' problem. The response is not necessary in the case of mean change or
#' variance change, in which case the formula will need to be adjusted
#' as well. Please refer to the examples for more details.
#' @param beta Initial cost value. For the choice of `beta`, please refer to
#' the paper.
#' @param segment_count Number of segments for initial guess.
#' @param trim Trimming for the boundary change points or changes points that
#' are too close.
#' @param momentum_coef Momentum coefficient to be applied to each update.
#' @param k Function on number of epochs in SGD. If k is a function returning
#' values larger than 0, the algorithm will run for k more epochs. By
#' default k returns 0, meaning no multiple epochs will be performed.
#' @param family Family of the models. Can be "binomial", "poisson", "lasso",
#' "gaussian" or "custom". If not provided, the user must specify the cost
#' function (and its gradient and Hessian) if the cost function does not
#' have explicit solution.
#' @param epsilon Epsilon to avoid numerical issues. Only used for binomial and
#' poisson.
#' @param min_prob Minimum probability to avoid numerical issues. Only used for
#' poisson.
#' @param winsorise_minval Minimum value to be winsorised. Only used for
#' poisson.
#' @param winsorise_maxval Maximum value to be winsorised. Only used for
#' poisson.
#' @param p Number of parameters to be estimated. If not provided will be set
#' to be the number of columns in the data minus 1.
#' @param cost Cost function to be used. If not specified, the default is
#' the negative log-likelihood for the corresponding family. The custom
#' cost function should only contain a `data` parameter (and a `theta`
#' parameter if there are no explicit solutions).
#' @param cost_gradient Gradient for custom cost function.
#' @param cost_hessian Hessian for custom cost function.
#' @param cp_only Whether to return only the change points or with the cost
#' values for each segment. If family is not provided or set to be "custom",
#' this parameter will be set to be true.
#' @param formula A formula object specifying the model to be fitted. The
#' optional response variable should be on the left hand side of the formula
#' while the covariates should be on the right hand side. The intercept term
#' should be removed from the formula. The response variable is not
#' necessary if the data considered is not of regression type. For example,
#' a mean or variance change model does not necessarily have response
#' variables. By default an intercept column will be added to the data
#' similar to the \code{lm} function in \proglang{R}. Thus it is suggested
#' that users remove the intercept term from the formula by appending
#' \code{- 1} to the formula. The default formula is suitable for regression
#' data sets with one-dimensional response variable and the rest being
#' covariates without intercept. The naming of variables used in the formula
#' should be consistent with the column names in the data frame provided in
#' \code{data}.
#' @param data A data frame containing the data to be segmented where each row
#' denotes each data point. In one-dimensional response variable regression
#' settings, the first column is the response variable while the rest are
#' covariates. The response is not necessary in the case of mean change or
#' variance change, in which case the formula will need to be adjusted
#' accordingly.
#' @param beta Initial cost value specified in the algorithm in the paper.
#' For the proper choice of a value, please refer to the paper. If not
#' specified, BIC criterion is used to obtain a proper value, i.e.,
#' \code{beta = (p + 1) * log(nrow(data)) / 2}.
#' @param segment_count Number of segments for initial guess. If not specified,
#' the initial guess on the number of segments is 10.
#' @param trim Trimming for the boundary change points so that a change point
#' close to the boundary will not be counted as a change point. This
#' parameter also specifies the minimum distance between two change points.
#' If several change points have mutual distances smaller than
#' \code{trim * nrow(data)}, those change points will be merged into one
#' single change point. The value of this parameter should be between
#' 0 and 1.
#' @param momentum_coef Momentum coefficient to be applied to each update. This
#' parameter is used when the loss function is bad-shaped so that
#' maintaining a momentum from previous update is desired. Default value is
#' 0, meaning the algorithm doesn't maintain a momentum by default.
#' @param k Function on number of epochs in SGD. \code{k} should be a function
#' taking a single parameter \code{x}, the current number of data
#' points considered since the last segmentation. The return value of the
#' function should be an integer indicating how many epochs should be
#' performed apart from the default update. By default the function returns
#' 0, meaning no multiple epochs will be used to update the parameters.
#' Example usage:
#' ```r
#' k = function(x) {
#'   if (x < n / segment_count / 4 * 1) 3
#'   else if (x < n / segment_count / 4 * 2) 2
#'   else if (x < n / segment_count / 4 * 3) 1
#'   else 0
#' }
#' ```
#' This function will perform 3 extra epochs while \code{x} is below one
#' quarter of \code{n / segment_count}, 2 extra epochs up to two quarters,
#' 1 extra epoch up to three quarters and no extra epochs afterwards.
#' Experiments show that performing multiple epochs will significantly
#' affect the performance of the algorithm. This parameter is left for the
#' users to tune the performance of the algorithm if the result is not
#' ideal. Details are discussed in the paper.
#' @param family Family of the model. Can be \code{"gaussian"},
#' \code{"binomial"}, \code{"poisson"}, \code{"lasso"}, \code{"custom"} or
#' \code{NULL}. For simplicity, users can also omit this parameter,
#' indicating that they will be using their own cost functions. Omitting the
#' parameter is the same as specifying the parameter to be \code{"custom"}
#' or \code{NULL}, in which case, users must specify the cost function, with
#' optional gradient and corresponding Hessian matrix functions.
#' @param epsilon Epsilon to avoid numerical issues. Only used for the Hessian
#' computation in Logistic Regression and Poisson Regression.
#' @param min_prob Minimum probability to avoid numerical issues. Only used
#' for Poisson Regression.
#' @param winsorise_minval Minimum value for the parameter in Poisson Regression
#' to be winsorised.
#' @param winsorise_maxval Maximum value for the parameter in Poisson Regression
#' to be winsorised.
#' @param p Number of covariates in the model. If not specified, the number of
#' covariates will be inferred from the data, i.e.,
#' \code{p = ncol(data) - 1}.
#' @param cost Cost function to be used. This and the following two parameters
#' should not be specified at the same time with \code{family}. If not
#' specified, the default is the negative log-likelihood for the
#' corresponding family. Custom cost functions can be provided in the
#' following two formats:
#'
#' - \code{cost = function(data) \{...\}}
#' - \code{cost = function(data, theta) \{...\}}
#'
#' In both methods, users should implement the cost value calculation based
#' on the data provided, where the data parameter can be considered as a
#' segment of the original data frame in the form of a matrix. The first
#' method is used when the cost function has an explicit solution, in which
#' case the cost function value can be calculated directly from the data.
#' The second method is used when the cost function does not have an
#' explicit solution, in which case the cost function value can be
#' calculated from the data and the estimated parameters. When only the
#' \code{data} argument is provided, `fastcpd` performs the
#' vanilla PELT algorithm since no parameter updating is performed.
#' @param cost_gradient Gradient function for the custom cost function.
#' Example usage:
#' ```r
#' cost_gradient = function(data, theta) {
#' ...
#' return(gradient)
#' }
#' ```
#' The gradient function should take two parameters, the first one being a
#' segment of the data in the format of a matrix, the second one being the
#' estimated parameters. The gradient function should return the gradient of
#' the cost function with respect to the parameters, evaluated on the data.
#' @param cost_hessian Hessian function for the custom cost function. Similar to
#' the gradient function, the Hessian function should take two parameters,
#' the first one being a segment of the data in the format of a matrix, the
#' second one being the estimated parameters. The Hessian function should
#' return the Hessian matrix of the cost function with respect to the
#' parameters, evaluated on the data.
#' @param cp_only If \code{TRUE}, only the change points are returned.
#' Otherwise, the cost function values together with the estimated
#' parameters for each segment are also returned. By default the value is
#' set to be \code{FALSE} so that `plot` can be used to visualize the
#' results for a built-in model. If \code{family} is not provided or
#' specified as \code{NULL} or \code{"custom"}, \code{cp_only} is set to be
#' \code{TRUE} by default. \code{cp_only} has some performance impact on the
#' algorithm, since the cost values and estimated parameters for each
#' segment need to be calculated and stored. If the users are only
#' interested in the change points, setting \code{cp_only} to be \code{TRUE}
#' will help with the computational cost.
#' @param vanilla_percentage What fraction of the data should be processed
#' through vanilla PELT. Range should be between 0 and 1. If set to be 0, all
#' data will be processed through sequential gradient descent. If set to be 1,
#' all data will be processed through vanilla PELT. If the cost function
#' has an explicit solution, i.e. does not depend on coefficients like
#' the mean change case, this parameter will be set to be 1.
#' through vanilla PELT. Range should be between 0 and 1. The `fastcpd`
#' algorithm is based on gradient descent and thus a starting estimate can
#' be crucial. At the beginning of the algorithm, vanilla PELT can be
#' performed to obtain a relatively accurate estimate of the parameters
#' despite the small amount of the data being used. If set to be 0, all data
#' will be processed through sequential gradient descent. If set to be 1,
#' all data will be processed through vanilla PELT. If the cost function
#' has an explicit solution, i.e. does not depend on coefficients like the
#' mean change case, this parameter will be set to be 1. If the value is set
#' to be between 0 and 1, the first \code{vanilla_percentage * nrow(data)}
#' data points will be processed through vanilla PELT and the rest will be
#' processed through sequential gradient descent.
#'
#' @return A class \code{fastcpd} object.
#' @export
#' @md
#' @examples
#' \donttest{
#' ### linear regression
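Reviewer note: the new `cost`, `cost_gradient` and `cost_hessian` documentation describes two cost-function formats but gives no concrete body. The sketch below shows what such functions might look like in base R. Everything here is illustrative: the function names are hypothetical, the mean-change cost assumes unit variance, and the logistic functions assume the first column of `data` is a 0/1 response with the remaining columns as covariates. Whether the package expects the gradient and Hessian over the whole segment or only at the latest observation is not stated in this diff; the sketch sums over all rows.

```r
# Format 1 (hypothetical): explicit solution, only `data` is needed.
# Mean-change cost under an assumed unit variance: half the sum of squared
# deviations from the column means of the segment.
cost_mean <- function(data) {
  data <- as.matrix(data)
  centered <- sweep(data, 2, colMeans(data))
  sum(centered^2) / 2
}

# Format 2 (hypothetical): no explicit solution, `theta` is required.
# Logistic regression negative log-likelihood; column 1 is assumed to be
# the 0/1 response and the remaining columns the covariates.
cost_logistic <- function(data, theta) {
  y <- data[, 1]
  x <- data[, -1, drop = FALSE]
  eta <- as.vector(x %*% theta)
  sum(log1p(exp(eta)) - y * eta)
}

# Gradient of cost_logistic with respect to theta, summed over all rows.
cost_logistic_gradient <- function(data, theta) {
  y <- data[, 1]
  x <- data[, -1, drop = FALSE]
  prob <- as.vector(plogis(x %*% theta))
  as.vector(crossprod(x, prob - y))
}

# Hessian of cost_logistic with respect to theta: t(x) %*% W %*% x with
# W = diag(prob * (1 - prob)), summed over all rows.
cost_logistic_hessian <- function(data, theta) {
  x <- data[, -1, drop = FALSE]
  prob <- as.vector(plogis(x %*% theta))
  crossprod(x * (prob * (1 - prob)), x)
}
```

A finite-difference comparison against `cost_logistic` is a cheap way to check a hand-written gradient before passing it to `fastcpd`.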
16 changes: 2 additions & 14 deletions cran-comments.md
@@ -1,20 +1,8 @@
## Resubmission

This is a resubmission. In this version I have:

* Reduced the length of the title to less than 65 characters.
* Expanded all acronyms in the description text.
* Added value fields to .Rd files / return fields to roxygen documentation
regarding exported methods `plot`, `print`, `show`, `summary`.

NOTE: There might be an extra note about possibly misspelled words in
DESCRIPTION. This comes from the first name and last name of the authors.

## R CMD check results

❯ checking CRAN incoming feasibility ... [4s/35s] NOTE
❯ checking CRAN incoming feasibility ... [4s/67s] NOTE
Maintainer: ‘Xingchi Li <anthony.li@stat.tamu.edu>’

New submission
Days since last update: 6

0 errors ✔ | 0 warnings ✔ | 1 notes ✖