predict.caretEnsemble returns the same type of data structure, no matter the call #158

zachmayer · 2015-07-24T17:11:59Z

It's really hard to test predict.caretEnsemble, because it sometimes returns a list, sometimes a data.frame, and sometimes the structure of the list is different, depending on the values of keepNA, se, and return_weights. Also, sometimes the weights have one row, and sometimes there's a weight for every row.

Let's make it always return a data.frame.

jknowles · 2015-08-04T12:54:37Z

I have recently been digging back into caretEnsemble and hope to make some changes and improvements -- particularly here. I have been thinking more about the data structure and the need for it to be consistent.

Currently, most standard R functions with a predict method like glm or lm return either a vector of numeric predictions, or a data.frame with values fit, upr, and lwr if the user requests standard errors. However, figuring out how to calculate a prediction interval is tricky, and in the caretEnsemble case it's more tricky because the predictions may be weighted differently if we allow the user to include cases where not all models produced a prediction.

What I've been thinking is that we could mimic the behavior of predict.lm and return a numeric vector (or store it as a data.frame with one column) of fit if no se is requested, and we could return the three column data.frame with fit, upr, and lwr if the se is requested. We could then attach the weights to this data.frame as an attribute, mimicking the behavior (somewhat) of some of the objects within a merMod in the lme4 package.
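A minimal sketch of that return shape, for discussion only. The helper name `format_ensemble_preds`, its arguments, and the normal-approximation interval are assumptions made for illustration; none of them is existing caretEnsemble code, and the real interval calculation may well need to differ.

```r
# Hypothetical helper (not part of caretEnsemble): returns a numeric vector
# when no standard errors are supplied, and a fit/lwr/upr data.frame when
# they are. The model weights ride along as an attribute in both cases.
format_ensemble_preds <- function(fit, se_fit = NULL, weights = NULL, level = 0.95) {
  if (is.null(se_fit)) {
    out <- as.numeric(fit)                      # default: plain numeric vector
  } else {
    z <- qnorm(1 - (1 - level) / 2)             # assumed normal-approximation multiplier
    out <- data.frame(fit = fit,
                      lwr = fit - z * se_fit,
                      upr = fit + z * se_fit)
  }
  attr(out, "weights") <- weights               # retrieve with attr(x, "weights")
  out
}

fit <- c(1.2, 3.4, 5.6)
w <- c(glm = 0.7, rpart = 0.3)                  # made-up weights for the example
p_vec <- format_ensemble_preds(fit, weights = w)
p_df  <- format_ensemble_preds(fit, se_fit = c(0.1, 0.2, 0.3), weights = w)
```

With this shape a test only has to check two cases: `is.numeric()` on the default return and the three column names on the se return, with `attr(x, "weights")` available either way.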

How does this sound?

zachmayer · 2015-08-04T13:46:02Z

I think that works. The default return is a vector, and if anything extra is requested a data.frame is returned. I like adding the weights as an attribute. That seems pretty clean to me.

jknowles · 2015-08-05T13:32:47Z

@zachmayer What do you think about making predict.caretStack and predict.caretEnsemble behave the same way? I think it's pretty easy to do this. I noticed that you need different options to get the same output currently from these functions. Would you agree it will make more sense to the user to have them be the same?

If so -- I was thinking by default both should return a factor for classification models and the user should be allowed to specify if they want probabilities and if they want standard errors, but this should not be defaulted.

zachmayer · 2015-08-05T13:37:16Z

Could you give me an example of the different options to get the same predictions?

I was trying to keep predict.caretStack as similar to predict.train as possible.

jknowles · 2015-08-05T14:32:42Z

Well, what I came across today is that by default:

# predicting caretStack
yhat <- predict(myStack)
# predicting caretEnsemble
yhat2 <- predict(myEnsemble)

For a classification model, the first one will be a factor vector of predicted classes, and the second one will be a numeric vector of class probabilities. I can straighten this out pretty easily by rewriting `predict.caretEnsemble` to match the default behavior of `predict.caretStack`. Just want to confirm that's a good idea.

zachmayer · 2015-08-05T14:48:03Z

Ahah, I see what you mean. Yes, that's a good idea. I think predict.train, predict.caretList, predict.caretEnsemble, and predict.caretStack should all have similar returns (as much as is possible).

jknowles · 2015-08-05T15:30:03Z

Makes sense to me -- I'll do that all together and rewrite the unit tests accordingly.

zachmayer · 2015-08-05T15:31:50Z

Awesome, thank you!

jknowles · 2015-09-08T17:49:56Z

@zachmayer -- an update on fixing the consistency of the predict method:

As I'm working on this I'm uncovering some flaws with the way NA values are handled by predict.caretEnsemble and predict.caretList. This relates back to the way predict.train handles missing values in new data that is passed to an object. Consider the following case:

load(system.file("testdata/models.class.rda",
                 package="caretEnsemble", mustWork=TRUE))
load(system.file("testdata/models.reg.rda",
                 package="caretEnsemble", mustWork=TRUE))
load(system.file("testdata/X.class.rda",
                 package="caretEnsemble", mustWork=TRUE))
load(system.file("testdata/Y.class.rda",
                 package="caretEnsemble", mustWork=TRUE))
ens.reg <- caretEnsemble(models.reg, iter=1000)
models.class2 <- models.class[c(2:5)] # hack to cut out randomForest
class(models.class2) <- "caretList"
ens.class <- caretEnsemble(models.class2, iter=1000)
newDat <- ens.class$models[[4]]$trainingData
newDat[2, 2] <- NA
newDat[3, 3] <- NA
newDat[4, 4] <- NA
newDat <- newDat[1:10, ]
p1 <- predict(ens.class, newdata = newDat)
p2 <- predict(ens.class, newdata = newDat[1, ])

In a normal predict method, like for glm, it would work:

gmData <- cbind(Y.class, X.class)
gmData <- gmData[, -2]
gmData <- as.data.frame(gmData)
gmData$Y.class <- factor(gmData$Y.class)
gm1 <- glm(Y.class ~ ., data = gmData, family = "binomial")
newDat <- gmData[1:10,]
newDat[2, 4] <- NA
preds <- predict(gm1, newdata = newDat, type = "response")

Results in:

preds
           1            2            3            4            5 
1.825548e-09           NA 7.789075e-10 1.636546e-09 2.094198e-09 
           6            7            8            9           10 
1.226207e-08 1.620147e-09 2.470597e-09 8.010239e-10 1.607467e-09

Currently, we can't replicate this behavior because of the inconsistent way the methods in train may or may not handle NA values. This limitation should probably be either documented or mitigated with some optional pre-processing of the newdata object.
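One way the optional pre-processing could look, as a rough sketch in base R. The wrapper name `predict_pad_na` is hypothetical, and the example uses a plain `glm` on `mtcars` rather than a caret model, so it only illustrates the idea of predicting on complete rows and padding the rest with NA.

```r
# Hypothetical wrapper (not caret or caretEnsemble code): predict only on rows
# with no missing values and return NA for the others, so the output always
# has one element per row of newdata.
# Note: complete.cases() checks every column of newdata, including columns
# the model does not use.
predict_pad_na <- function(model, newdata, ...) {
  ok <- complete.cases(newdata)
  out <- rep(NA_real_, nrow(newdata))
  if (any(ok)) {
    out[ok] <- as.numeric(predict(model, newdata = newdata[ok, , drop = FALSE], ...))
  }
  names(out) <- rownames(newdata)
  out
}

m <- glm(am ~ wt + hp, data = mtcars, family = "binomial")
nd <- mtcars[1:5, c("wt", "hp")]
nd[2, "wt"] <- NA
p <- predict_pad_na(m, nd, type = "response")
```

This assumes the wrapped predict method returns one numeric value per row; a method returning a probability matrix would need the padding done per column.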

zachmayer · 2015-09-09T15:00:30Z

Hmm, on current master, I get the following:

> p1 <- predict(ens.class, newdata = newDat)
 Error in colnames(tempUnkProb)[tempUnkPred] : 
  invalid subscript type 'list' 
14 extractProb(list(object), unkX = newdata, unkOnly = TRUE, ...) 
13 predict.train(x, type = "prob", ...) 
12 predict(x, type = "prob", ...) 
11 predict(x, type = "prob", ...) 
10 FUN(X[[i]], ...) 
9 lapply(X, FUN, ...) 
8 pblapply(X, FUN, ...) 
7 pbsapply(object, function(x) {
    type <- x$modelType
    if (type == "Classification") {
        if (x$control$classProbs) { ... 
6 predict.caretList(object$models, ...) 
5 predict(object$models, ...) 
4 predict(object$models, ...) 
3 predict.caretEnsemble(ens.class, newdata = newDat) 
2 predict(ens.class, newdata = newDat) 
1 predict(ens.class, newdata = newDat)

And then:

> p2 <- predict(ens.class, newdata = newDat[1, ])
 Error in `colnames<-`(`*tmp*`, value = c("glm", "svmRadial", "nnet", "treebag" : 
  attempt to set 'colnames' on an object with less than two dimensions 
8 stop("attempt to set 'colnames' on an object with less than two dimensions") 
7 `colnames<-`(`*tmp*`, value = c("glm", "svmRadial", "nnet", "treebag"
)) 
6 predict.caretList(object$models, ...) 
5 predict(object$models, ...) 
4 predict(object$models, ...) 
3 predict.caretEnsemble(ens.class, newdata = newDat[1, ]) 
2 predict(ens.class, newdata = newDat[1, ]) 
1 predict(ens.class, newdata = newDat[1, ])

But the glm code does work. So yeah, we should add tests for this and fix it.

Also, it looks like our builds are failing on master, so I need to dig into that and fix it too.

jknowles · 2015-09-09T15:10:43Z

Oh yeah, sorry I wasn't clear -- the error is the problem. It is not consistent though -- glm handles prediction with NA values while in a caretList. This is why I missed the inconsistent behavior before: I only tested it with a list of different glm models.

I can see a few different options --

1. We na.omit newdata in predict and document that.
2. We wrap the individual model predictions in try, use only the predictions from models that return one, and issue a warning (this would mean rf would not be used in a prediction if even one observation had missing values).
3. We predict row-wise for each model using try and then put an NA in for each observation that fails.
4. We wait for a solution to be proposed and accepted in caret.
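A rough sketch of what the row-wise option could look like, in base R only. The helper `predict_rowwise` and the toy `strict_model` class are both hypothetical: the toy class just stands in for a method (like rf in the example above) whose predict errors when newdata contains NA.

```r
# Hypothetical helper: predict one row at a time and record NA for any row
# where the underlying predict method errors or does not return one value.
predict_rowwise <- function(model, newdata, ...) {
  vapply(seq_len(nrow(newdata)), function(i) {
    tryCatch({
      p <- as.numeric(predict(model, newdata = newdata[i, , drop = FALSE], ...))
      if (length(p) == 1L) p else NA_real_
    }, error = function(e) NA_real_)
  }, numeric(1))
}

# Toy stand-in for a model whose predict method stops on missing values.
strict <- structure(list(fit = lm(mpg ~ wt, data = mtcars)), class = "strict_model")
predict.strict_model <- function(object, newdata, ...) {
  if (anyNA(newdata)) stop("missing values in newdata")
  predict(object$fit, newdata = newdata, ...)
}

nd <- mtcars[1:4, "wt", drop = FALSE]
nd[3, "wt"] <- NA
p <- predict_rowwise(strict, nd)   # row 3 becomes NA instead of failing the whole call
```

The obvious cost is one predict call per row per model, which could be slow on large newdata; falling back to row-wise prediction only after a whole-frame predict fails might be a reasonable compromise.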

What do you think?

zachmayer · 2015-09-09T15:13:14Z

Ok, I suspected that was the case, but wasn't sure. I don't really like option 2. Option 3 is worth investigating, and I think option 4 is the best bet, but who knows how long that will take =D

jknowles · 2015-09-09T15:18:39Z

Good, I'll investigate option 3 -- it's my favorite too if we can do it reasonably quickly. In the meantime, maybe we can ask @topepo if caret is going to take this up in the future.

zachmayer · 2015-09-09T15:22:39Z

Sounds good. Can you open an issue on github?

jknowles · 2015-09-09T15:24:50Z

On it.

topepo · 2015-09-09T16:25:43Z

Does this cover it?

jknowles · 2015-09-09T16:37:46Z

@topepo I think so, but in the example I was working with rf just crashes if newdata has missing values for the predictors. I didn't check a lot of other methods, but I know this problem is not unique to just rf. At any rate, that would be extremely ideal -- just return a vector with predictions and missing values for observations with incomplete data.

topepo · 2015-09-09T16:45:49Z

Got it. That won't be in the next version (hopefully sent to CRAN early next week) but since this is a common issue, I'll work on it next.

jknowles · 2015-10-15T18:31:43Z

This can be closed now by #167 and others.

jknowles mentioned this issue Sep 10, 2015

Fixing prediction methods to be more standardized #167

Merged

zachmayer closed this as completed Oct 15, 2015

blueskypie mentioned this issue Dec 11, 2015

Error with nb from klaR package topepo/caret#339

Open
