-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
predict.caretEnsemble returns the same type of data structure, no matter the call #158
Comments
I have recently been digging back into Currently, most standard R functions with a What I've been thinking is that we could mimic the behavior of How does this sound? |
I think that works. The default return is a vector, and if anything extra is requested a data.frame is returned. I like adding the weights as an attribute. That seems pretty clean to me. |
@zachmayer What do you think about making If so -- I was thinking by default both should return a factor for classification models and the user should be allowed to specify if they want probabilities and if they want standard errors, but this should not be defaulted. |
Could you give me an example of the different options to get the same predictions? I was trying to keep |
Well, what I came across today is that by default:
For a classification model type the first one will be a vector, class factor, and the second one will be a numeric vector of class probabilities. I can straighten this out pretty easily by rewriting |
Ahah, I see what you me. Yes, that's a good idea. I think |
Makes sense to me -- I'll do that all together and rewrite the unit tests accordingly. |
Awesome, thank you! |
@zachmayer -- an update on fixing the consistency of the predict method: As I'm working on this I'm uncovering some flaws with the way NA values are handled by load(system.file("testdata/models.class.rda",
package="caretEnsemble", mustWork=TRUE))
load(system.file("testdata/models.reg.rda",
package="caretEnsemble", mustWork=TRUE))
load(system.file("testdata/X.class.rda",
package="caretEnsemble", mustWork=TRUE))
load(system.file("testdata/Y.class.rda",
package="caretEnsemble", mustWork=TRUE))
ens.reg <- caretEnsemble(models.reg, iter=1000)
models.class2 <- models.class[c(2:5)] # hack to cut out randomForest
class(models.class2) <- "caretList"
ens.class <- caretEnsemble(models.class2, iter=1000)
newDat <- ens.class$models[[4]]$trainingData
newDat[2, 2] <- NA
newDat[3, 3] <- NA
newDat[4, 4] <- NA
newDat <- newDat[1:10, ]
p1 <- predict(ens.class, newdata = newDat)
p2 <- predict(ens.class, newdata = newDat[1, ]) In a normal predict method, like for glm, it would work: gmData <- cbind(Y.class, X.class
gmData <- gmData[, -2]
gmData <- as.data.frame(gmData)
gmData$Y.class <- factor(gmData$Y.class)
gm1 <- glm(Y.class ~ ., data = gmData, family = "binomial")
newDat <- gmData[1:10,]
newDat[2, 4] <- NA
preds <- predict(gm1, newdata = newDat, type = "response") Results in:
Currently, we can't replicate this behavior because of the inconsistent way the methods in |
Hmm, on current master, I get the following: > p1 <- predict(ens.class, newdata = newDat)
Hide Traceback
Rerun with Debug
Error in colnames(tempUnkProb)[tempUnkPred] :
invalid subscript type 'list'
14 extractProb(list(object), unkX = newdata, unkOnly = TRUE, ...)
13 predict.train(x, type = "prob", ...)
12 predict(x, type = "prob", ...)
11 predict(x, type = "prob", ...)
10 FUN(X[[i]], ...)
9 lapply(X, FUN, ...)
8 pblapply(X, FUN, ...)
7 pbsapply(object, function(x) {
type <- x$modelType
if (type == "Classification") {
if (x$control$classProbs) { ...
6 predict.caretList(object$models, ...)
5 predict(object$models, ...)
4 predict(object$models, ...)
3 predict.caretEnsemble(ens.class, newdata = newDat)
2 predict(ens.class, newdata = newDat)
1 predict(ens.class, newdata = newDat) And then: > p2 <- predict(ens.class, newdata = newDat[1, ])
Hide Traceback
Rerun with Debug
Error in `colnames<-`(`*tmp*`, value = c("glm", "svmRadial", "nnet", "treebag" :
attempt to set 'colnames' on an object with less than two dimensions
8 stop("attempt to set 'colnames' on an object with less than two dimensions")
7 `colnames<-`(`*tmp*`, value = c("glm", "svmRadial", "nnet", "treebag"
))
6 predict.caretList(object$models, ...)
5 predict(object$models, ...)
4 predict(object$models, ...)
3 predict.caretEnsemble(ens.class, newdata = newDat[1, ])
2 predict(ens.class, newdata = newDat[1, ])
1 predict(ens.class, newdata = newDat[1, ]) But the GBM code does work. So yeah, we should add tests for this and fix it. Also, it looks like our builds are failing on master, so I need to dig into and fix that too. |
Oh yeah, sorry I wasn't clear -- the error is the problem. It is not consistent though -- I can see a few different options --
What do you think? |
Ok, I suspected that was the case, but wasn't sure. I don't really like option 2. Option 3 is worth investigating, and I think option 4 is the best bet, but who knows how long that will take =D |
Good, I'll investigate option 3 -- it's my favorite too if we can do it reasonably quickly. In the meantime, maybe we can ask @topepo if |
Sounds good. Can you open an issue on github? |
On it. |
Does this cover it? |
@topepo I think so, but in the example I was working with |
Got it. That won't be in the next version (hopefully sent to CRAN early next week) but since this is a common issue, I'll work on it next. |
This can be closed now by #167 and others. |
It's really hard to test
predict.caretEnsemble
, because it sometimes returns a list, sometimes a data.frame, and sometimes the structure of the list is different, depending on the values ofkeepNA
,se
, andreturn_weights
. Also, sometimes the weights have one row, and sometimes there's a weight for every row.Let's make it always return a
data.frame
.The text was updated successfully, but these errors were encountered: