Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature request] add support for xgboost objective = 'count:poisson' #62

Open
SimonCoulombe opened this issue Aug 27, 2019 · 3 comments
Labels
feature a feature request or enhancement

Comments

@SimonCoulombe
Copy link

Hi,

It would be great if this can be added. As far as I can see, it would require adding the following to 'build_fit_formula_xgb'

 else if (objective %in% c("count:poisson")) {
    assigned <- 1
    f <- expr(!!f)
  }
@edgararuiz-zz
Copy link
Contributor

Thank you @SimonCoulombe , may also trouble you with a minimal example to test the change?

@SimonCoulombe
Copy link
Author

Here is an example of a poisson model applied to insurance data. Is this what you need?

library(xgboost)
library(insuranceData) # example dataset https://cran.r-project.org/web/packages/insuranceData/insuranceData.pdf
library(tidyverse) 
library(tidypredict)

# load insurance data (1 row = 1 policy)
data(dataCar)

label_var <- "numclaims" # number of claims during policy
offset_var <- "exposure" # length of exposure (anywhere between 0.001 year and 1.0 year)
feature_vars <- c("veh_value", "veh_body", "veh_age", "gender", "area", "agecat")

mydb <- dataCar %>% select(one_of(c(label_var, offset_var, feature_vars)))

# create dummies from factor variables
myformula <- paste0( "~", paste0( feature_vars, collapse = " + ") ) %>% 
  as.formula()

dummyFier <- caret::dummyVars(myformula, data=mydb, fullRank = TRUE)
dummyVars.df <- predict(dummyFier,newdata = mydb)
mydb_dummy <- cbind(mydb %>% select(one_of(c(label_var, offset_var))), 
                    dummyVars.df)
rm(myformula, dummyFier, dummyVars.df)

# get  list the column names of the db with the dummy variables
feature_vars_dummy <-  mydb_dummy  %>% 
  select(-one_of(c(label_var, offset_var))) %>% 
  colnames()

# create xgb.matrix for xgboost consumption
mydb_xgbmatrix <- xgb.DMatrix(
  data = mydb_dummy %>% select(feature_vars_dummy) %>% as.matrix, 
  label = mydb_dummy %>% pull(label_var),
  missing = "NAN")

#base_margin: apply exposure offset  (more claims are expected if you are insured 1.0 year rather than 0.20 year )
setinfo(mydb_xgbmatrix,"base_margin", 
        mydb %>% pull(offset_var) %>% log() )

params <- data.frame(max_depth = 6,
                           colsample_bytree= 0.8,
                           subsample = 0.8,
                           min_child_weight = 3,
                           eta  = 0.1,
                           gamma = 0,
                     objective = 'count:poisson', 
                     eval_metric = "poisson-nloglik"
                     ) 

model <- xgb.train(mydb_xgbmatrix, params = params, nrounds=100)

mean(mydb_dummy$numclaims) # 0.07275701
mean(predict(model, newdata=mydb_xgbmatrix)) # 0.07525851

@topepo topepo added the feature a feature request or enhancement label Apr 3, 2020
@topepo
Copy link
Member

topepo commented Apr 3, 2020

I think that this should be straightforward (as you point out). Would you like to do a PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants