Error with nb from klaR package #339
What happens if:
Thanks for the help! There is no error if I use the `finalModel` directly: `testPredProb <- predict(mFit$finalModel, newdata = testing[, -classIndex], type = "prob")`. I'm also attaching the training and testing files; `classIndex = 1`.
Also, there are no missing values, so why the imputation?
Can anyone help? Please!
So let's walk through this. The error that you are getting is "Numerical 0 probability for all classes with observation X" for almost every test set value. Naive Bayes works from Bayes' theorem:

    P(Y = y | X) = p(X | Y = y) P(Y = y) / p(X)

Your data have even odds for the prior P(Y = y), so we can ignore that. The naive bit treats every predictor independently, so the likelihood is the product of the per-predictor densities:

    p(X | Y = y) = p(x_1 | Y = y) × p(x_2 | Y = y) × ... × p(x_834 | Y = y)

but the denominator p(X) is not a function of the outcome, so it cancels when comparing the two classes.
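As a toy illustration of those two ingredients, here is a hypothetical two-predictor example with made-up density values and an even prior (none of these numbers come from your data):

```r
# Even prior: P(tumor) = P(normal) = 0.5, so the prior cancels.
# Class-conditional densities evaluated at one test point (made-up values):
p_x_tumor  <- c(0.20, 0.50)   # p(x1 | tumor),  p(x2 | tumor)
p_x_normal <- c(0.40, 0.10)   # p(x1 | normal), p(x2 | normal)

# Naive likelihoods: the product over predictors
lik_tumor  <- prod(p_x_tumor)    # 0.10
lik_normal <- prod(p_x_normal)   # 0.04

# p(X) is the same for both classes, so the posterior reduces to the
# normalized likelihoods:
lik_tumor / (lik_tumor + lik_normal)   # about 0.714 in favor of tumor
```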
Here is some code to do the computations manually for the nonparametric density estimate (it is easy to substitute the Gaussian density functions, but your predictors are clearly not normally distributed):

```r
all_tumor <- all_normal <- unconditional <-
  matrix(NA, nrow = nrow(testing), ncol = ncol(testing) - 1)
colnames(all_tumor) <- colnames(all_normal) <- colnames(unconditional) <-
  names(training)[-1]

for (i in 2:ncol(training)) {
  # Unconditional and class-conditional kernel density estimates
  dx    <- density(training[, i])
  dx_yn <- density(training[training$sampleType == "normal", i])
  dx_yt <- density(training[training$sampleType == "tumor", i])

  # Evaluate each density at the test set values (dkernel is from klaR)
  px    <- dkernel(testing[, i], kernel = dx)
  px_yn <- dkernel(testing[, i], kernel = dx_yn)
  px_yt <- dkernel(testing[, i], kernel = dx_yt)

  all_tumor[, i - 1]     <- px_yt
  all_normal[, i - 1]    <- px_yn
  unconditional[, i - 1] <- px

  # Plot the two class-conditional densities with the test values overlaid
  yl <- extendrange(c(dx_yn$y, dx_yt$y, px_yn, px_yt))
  xl <- extendrange(c(dx_yn$x, dx_yt$x, testing[, i]))
  plot(dx_yn$x, dx_yn$y, col = rgb(1, 0, 0, .5),
       xlim = xl, ylim = yl, type = "l", lwd = 2,
       ylab = "", xlab = names(testing)[i])
  points(dx_yt$x, dx_yt$y, col = rgb(0, 0, 1, .5), type = "l", lwd = 2)
  points(testing[, i], px_yn, col = rgb(1, 0, 0, .5), pch = 16)
  points(testing[, i], px_yt, col = rgb(0, 0, 1, .5), pch = 16)
}

prob_sum    <- all_tumor + all_normal
tumor_frac  <- all_tumor / prob_sum
normal_frac <- all_normal / prob_sum
```

So, we can approximate the posterior probability (under a balanced prior) as:

```r
> apply(tumor_frac, 1, prod)
[1] 0.000000e+00 2.676536e-256 2.255684e-242 1.094915e-281 0.000000e+00
[6] 0.000000e+00 2.462139e-230 2.704752e-228 0.000000e+00 3.988369e-225
[11] 6.340585e-250 0.000000e+00 1.595304e-241 0.000000e+00 0.000000e+00
[16] 0.000000e+00 4.797126e-236 0.000000e+00 0.000000e+00 0.000000e+00
[21] 5.724403e-274 4.681493e-243 1.667817e-271 0.000000e+00 0.000000e+00
[26] 3.505219e-257 0.000000e+00 1.080127e-260 9.364458e-269 0.000000e+00
[31] 1.678362e-268 0.000000e+00 0.000000e+00 3.639862e-293 0.000000e+00
[36] 0.000000e+00 1.870002e-250 0.000000e+00 1.775739e-253 8.250657e-236
[41] 1.578156e-233 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
[46] 1.030652e-254 1.221540e-231 0.000000e+00 3.158808e-261 4.535165e-244
[51] 4.304656e-222 0.000000e+00 0.000000e+00 0.000000e+00 1.315211e-233
[56] 6.908419e-237
```
```r
> apply(normal_frac, 1, prod)
[1] 2.297612e-212 0.000000e+00 0.000000e+00 1.467317e-281 1.972531e-208
[6] 1.145059e-200 0.000000e+00 0.000000e+00 5.526592e-207 0.000000e+00
[11] 0.000000e+00 3.260158e-204 0.000000e+00 1.811132e-220 4.347907e-187
[16] 1.867480e-244 0.000000e+00 1.626627e-213 6.456660e-201 3.014360e-211
[21] 8.399097e-307 0.000000e+00 3.312943e-301 2.126332e-208 2.566658e-215
[26] 8.641432e-309 1.693245e-202 0.000000e+00 0.000000e+00 3.037738e-196
[31] 2.567129e-307 1.874569e-223 2.683573e-220 8.547862e-273 2.856175e-217
[36] 6.860483e-190 1.998496e-320 7.515750e-232 4.940656e-324 0.000000e+00
[41] 0.000000e+00 1.458379e-220 3.035225e-236 1.186740e-194 2.773529e-218
[46] 1.483024e-315 0.000000e+00 1.282923e-204 2.241090e-306 0.000000e+00
[51] 0.000000e+00 2.070366e-218 1.359433e-192 1.084135e-207 0.000000e+00
[56] 0.000000e+00
```

So, my guess as to the problem is that Naive Bayes isn't really the right tool for the job here (e.g., large numbers of predictors). Numerically, the probabilities are all going towards zero. You might be able to get around it by summing the logs and choosing the class to be the one with the largest number:

```r
> apply(normal_frac, 1, function(x) sum(log10(x)))
[1] -211.6387 -324.8646 -350.7398 -280.8335 -207.7050 -199.9412 -378.7986
[8] -406.5969 -206.2575 -423.2121 -346.4341 -203.4868 -359.1892 -219.7420
[15] -186.3617 -243.7287 -414.1802 -212.7887 -200.1900 -210.5208 -306.0758
[22] -336.3935 -300.4798 -207.6724 -214.5906 -308.0634 -201.7713 -339.9645
[29] -345.0583 -195.5174 -306.5906 -222.7271 -219.5713 -272.0681 -216.5442
[36] -189.1636 -319.6993 -231.1240 -323.1415 -376.3364 -365.2225 -219.8361
[43] -235.5178 -193.9256 -217.5570 -314.8289 -374.2555 -203.8918 -305.6495
[50] -349.1757 -392.5578 -217.6840 -191.8666 -206.9649 -359.8696 -368.2657
> apply(tumor_frac, 1, function(x) sum(log10(x)))
[1] -353.6756 -255.5724 -241.6467 -280.9606 -358.4289 -375.7746 -229.6087
[8] -227.5679 -363.3898 -224.3992 -249.1979 -374.2010 -240.7972 -340.0716
[15] -392.4513 -337.2000 -235.3190 -352.9489 -362.4628 -362.8660 -273.2423
[22] -242.3296 -270.7779 -376.8879 -348.4819 -256.4553 -363.7962 -259.9665
[29] -268.0285 -389.4792 -267.7751 -343.3558 -342.0412 -292.4389 -353.5113
[36] -393.9689 -249.7282 -362.4821 -252.7506 -235.0835 -232.8019 -347.1579
[43] -362.1826 -387.0755 -359.5564 -253.9869 -230.9131 -367.7321 -260.5005
[50] -243.3434 -221.3661 -345.3825 -383.3747 -361.1584 -232.8810 -236.1606
```

I don't know that it is reasonable to compare these log sums directly, though.
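The underflow itself is easy to demonstrate in isolation: with 834 predictors, a product of 834 moderate fractions drops below the smallest representable double, while the sum of their logs stays finite. A minimal sketch (0.4 is just an illustrative stand-in for a typical per-predictor fraction, not a value from your data):

```r
# 0.4^834 is roughly 1e-332, below even the smallest subnormal double
# (~4.9e-324), so the running product underflows to exactly zero:
p <- rep(0.4, 834)
prod(p)        # 0
sum(log10(p))  # 834 * log10(0.4), about -331.9: finite and comparable

# So classification can compare summed log-densities per class and pick
# the larger one, rather than multiplying raw densities.
```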
It is also worth pointing out this, from
Thanks so much for the detailed explanation! Actually, I dug up one of your old communications from ten years ago. Both are very, very helpful! I really appreciate it! Now I understand the warning message and the NA predictions on some samples. But I think the error in the 3rd point of my original post is related to caret, not nb, based on this post. Would you please take a look?
Could @topepo or someone please take a look? I think the error is due to caret, not naiveBayes from klaR. Thanks a lot!
I have gene expression data with 834 predictors: 172 samples in training and 56 in testing. I hope someone can help with a few questions:
```
Error in colnames(tempUnkProb)[tempUnkPred] : invalid subscript type 'list'
```
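For what it's worth, that message is base R's complaint about using a list to subscript an atomic vector, so something inside caret is producing a list where an index is expected. A minimal reproduction of just the error (the names `tempUnkProb`/`tempUnkPred` belong to caret's internals; this toy vector is unrelated):

```r
# Subscripting an atomic vector with a list raises this exact error:
x <- c("normal", "tumor")
res <- try(x[list(1)], silent = TRUE)
cat(attr(res, "condition")$message)  # invalid subscript type 'list'
```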
Here is some of the related code:
Thanks a lot!