Error with nb from klaR package #339

Open
blueskypie opened this issue Dec 8, 2015 · 8 comments
@blueskypie

I have gene expression data with 834 predictors, 172 samples in the training set, and 56 in the test set. I hope someone can help with a few questions:

  1. Among the 172 samples in the training set, the predictions for five are NA, but the predictor variances of those five samples look normal.
  2. predict.train on the testing set produces class predictions but no probabilities if I use all predictors except GABRQ.
  3. If I include GABRQ, predict.train on the testing set produces the following error, but I don't see anything wrong with the variance or class-conditional distribution of that variable.
    Error in colnames(tempUnkProb)[tempUnkPred] : invalid subscript type 'list'

Here is the relevant code:

trainCtrl <- trainControl(method = "repeatedcv",
                          repeats = 3,
                          number = 10,
                          classProbs = TRUE,
                          savePredictions = "final",
                          summaryFunction = twoClassSummary)

mFit <- train(x = training[, -classIndex], y = training[, classIndex],
              method = "nb",
              trControl = trainCtrl,
              metric = "ROC",  # for classification
              preProc = c("center", "scale", "conditionalX"),
              tuneLength = 10)

testPredProb <- predict(mFit, newdata = testing[, -classIndex], type = "prob")

Thanks a lot!

@topepo
Owner

topepo commented Dec 8, 2015

What happens if:

  • you use the NaiveBayes function from klaR directly? Do you still get the error?
  • you try KNN imputation as an additional pre-processing measure?

@blueskypie
Author

Thanks for the help! There is no error if I use the finalModel directly:

testPredProb <- predict(mFit$finalModel, newdata = testing[, -classIndex], type = "prob")

I'm also attaching the training and testing files; classIndex = 1.
testing.txt
training.txt

@blueskypie
Author

Also, there are no missing values, so why the imputation?

@blueskypie
Author

Can anyone help, please?

@topepo
Owner

topepo commented Dec 11, 2015

So let's walk through this. The error that you are getting is "Numerical 0 probability for all classes with observation X" for almost every test set value. Naive Bayes works from Bayes' theorem:

Pr[Y|X] =   Pr[Y] x Pr[X|Y]
            ---------------
                 Pr[X]

Your data have even odds for the prior, so we can ignore that.

The naive bit treats every predictor independently, so you have the product of these

     Pr[X_i|Y]
  ---------------
      Pr[X_i]                 

but the bottom is not a function of the outcome Y, so we normally just have to worry about Pr[X_i|Y]. The problem is that you have ~800 predictors and will be multiplying together a lot of values in [0, 1], which can cause numerical underflow. We might want to look at this proportion instead:

             Pr[X_i|Y]
  -------------------------------
   Pr[X_i|Y = t] + Pr[X_i|Y = n]                

Here is some code to do the computations manually for the nonparametric density estimate (it is easy to substitute the Gaussian density functions, but your predictors are clearly not normally distributed):

library(klaR)  # for dkernel()

all_tumor <- all_normal <- unconditional <- matrix(NA, nrow = nrow(testing), ncol = ncol(testing) - 1)
colnames(all_tumor) <- colnames(all_normal) <- colnames(unconditional) <- names(training)[-1]

for(i in 2:ncol(training)) {
  ## kernel density estimates: unconditional and per class
  dx <- density(training[, i])
  dx_yn <- density(training[training$sampleType == "normal", i])
  dx_yt <- density(training[training$sampleType == "tumor", i])

  ## evaluate the densities at the test set values
  px <- dkernel(testing[, i], kernel = dx)
  px_yn <- dkernel(testing[, i], kernel = dx_yn)
  px_yt <- dkernel(testing[, i], kernel = dx_yt)

  all_tumor[, i - 1] <- px_yt
  all_normal[, i - 1] <- px_yn
  unconditional[, i - 1] <- px

  ## plot the two class-conditional densities with the test points overlaid
  yl <- extendrange(c(dx_yn$y, dx_yt$y, px_yn, px_yt))
  xl <- extendrange(c(dx_yn$x, dx_yt$x, testing[, i]))

  plot(dx_yn$x, dx_yn$y, col = rgb(1, 0, 0, .5),
       xlim = xl, ylim = yl, type = "l", lwd = 2,
       ylab = "", xlab = names(testing)[i])
  points(dx_yt$x, dx_yt$y, col = rgb(0, 0, 1, .5), type = "l", lwd = 2)
  points(testing[, i], px_yn, col = rgb(1, 0, 0, .5), pch = 16)
  points(testing[, i], px_yt, col = rgb(0, 0, 1, .5), pch = 16)
}

prob_sum <- all_tumor + all_normal
tumor_frac <- all_tumor / prob_sum
normal_frac <- all_normal / prob_sum

So, we can approximate the posterior probability (under a balanced prior) as:

> apply(tumor_frac, 1, prod)
 [1]  0.000000e+00 2.676536e-256 2.255684e-242 1.094915e-281  0.000000e+00
 [6]  0.000000e+00 2.462139e-230 2.704752e-228  0.000000e+00 3.988369e-225
[11] 6.340585e-250  0.000000e+00 1.595304e-241  0.000000e+00  0.000000e+00
[16]  0.000000e+00 4.797126e-236  0.000000e+00  0.000000e+00  0.000000e+00
[21] 5.724403e-274 4.681493e-243 1.667817e-271  0.000000e+00  0.000000e+00
[26] 3.505219e-257  0.000000e+00 1.080127e-260 9.364458e-269  0.000000e+00
[31] 1.678362e-268  0.000000e+00  0.000000e+00 3.639862e-293  0.000000e+00
[36]  0.000000e+00 1.870002e-250  0.000000e+00 1.775739e-253 8.250657e-236
[41] 1.578156e-233  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
[46] 1.030652e-254 1.221540e-231  0.000000e+00 3.158808e-261 4.535165e-244
[51] 4.304656e-222  0.000000e+00  0.000000e+00  0.000000e+00 1.315211e-233
[56] 6.908419e-237
> apply(normal_frac, 1, prod)
 [1] 2.297612e-212  0.000000e+00  0.000000e+00 1.467317e-281 1.972531e-208
 [6] 1.145059e-200  0.000000e+00  0.000000e+00 5.526592e-207  0.000000e+00
[11]  0.000000e+00 3.260158e-204  0.000000e+00 1.811132e-220 4.347907e-187
[16] 1.867480e-244  0.000000e+00 1.626627e-213 6.456660e-201 3.014360e-211
[21] 8.399097e-307  0.000000e+00 3.312943e-301 2.126332e-208 2.566658e-215
[26] 8.641432e-309 1.693245e-202  0.000000e+00  0.000000e+00 3.037738e-196
[31] 2.567129e-307 1.874569e-223 2.683573e-220 8.547862e-273 2.856175e-217
[36] 6.860483e-190 1.998496e-320 7.515750e-232 4.940656e-324  0.000000e+00
[41]  0.000000e+00 1.458379e-220 3.035225e-236 1.186740e-194 2.773529e-218
[46] 1.483024e-315  0.000000e+00 1.282923e-204 2.241090e-306  0.000000e+00
[51]  0.000000e+00 2.070366e-218 1.359433e-192 1.084135e-207  0.000000e+00
[56]  0.000000e+00

So, my guess as to the problem is that Naive Bayes isn't really the right tool for the job here (e.g., large numbers of predictors). Numerically, the probabilities are all going toward zero.

You might be able to get around it by summing the logs and choosing the class with the largest value:

> apply(normal_frac, 1, function(x) sum(log10(x)))
 [1] -211.6387 -324.8646 -350.7398 -280.8335 -207.7050 -199.9412 -378.7986
 [8] -406.5969 -206.2575 -423.2121 -346.4341 -203.4868 -359.1892 -219.7420
[15] -186.3617 -243.7287 -414.1802 -212.7887 -200.1900 -210.5208 -306.0758
[22] -336.3935 -300.4798 -207.6724 -214.5906 -308.0634 -201.7713 -339.9645
[29] -345.0583 -195.5174 -306.5906 -222.7271 -219.5713 -272.0681 -216.5442
[36] -189.1636 -319.6993 -231.1240 -323.1415 -376.3364 -365.2225 -219.8361
[43] -235.5178 -193.9256 -217.5570 -314.8289 -374.2555 -203.8918 -305.6495
[50] -349.1757 -392.5578 -217.6840 -191.8666 -206.9649 -359.8696 -368.2657
> apply(tumor_frac, 1, function(x) sum(log10(x)))
 [1] -353.6756 -255.5724 -241.6467 -280.9606 -358.4289 -375.7746 -229.6087
 [8] -227.5679 -363.3898 -224.3992 -249.1979 -374.2010 -240.7972 -340.0716
[15] -392.4513 -337.2000 -235.3190 -352.9489 -362.4628 -362.8660 -273.2423
[22] -242.3296 -270.7779 -376.8879 -348.4819 -256.4553 -363.7962 -259.9665
[29] -268.0285 -389.4792 -267.7751 -343.3558 -342.0412 -292.4389 -353.5113
[36] -393.9689 -249.7282 -362.4821 -252.7506 -235.0835 -232.8019 -347.1579
[43] -362.1826 -387.0755 -359.5564 -253.9869 -230.9131 -367.7321 -260.5005
[50] -243.3434 -221.3661 -345.3825 -383.3747 -361.1584 -232.8810 -236.1606

I don't know that it is reasonable to compare 10^-211.6 to 10^-353.7, though (but I've seen people do it).
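To see the log-sum trick in isolation, here is a minimal, self-contained sketch. The matrix names mirror tumor_frac and normal_frac from the code above, but the values are simulated (not the gene expression data), so it only demonstrates the numerical behavior:

```r
## Hypothetical illustration: simulated per-predictor class fractions,
## not the real gene expression data.
set.seed(1)
n_samp <- 5
n_pred <- 2000  # enough terms that a direct product underflows

## fabricate per-predictor fractions in (0.05, 0.95) for one class;
## the other class gets the complement
tumor_frac  <- matrix(runif(n_samp * n_pred, 0.05, 0.95), nrow = n_samp)
normal_frac <- 1 - tumor_frac

## multiplying thousands of values in (0, 1) underflows to exactly zero
direct <- apply(tumor_frac, 1, prod)
all(direct == 0)  # TRUE

## summing log10 probabilities stays finite, so the classes can be compared
log_tumor  <- apply(tumor_frac, 1, function(x) sum(log10(x)))
log_normal <- apply(normal_frac, 1, function(x) sum(log10(x)))
pred <- ifelse(log_tumor > log_normal, "tumor", "normal")
```

The direct products are all identically zero, while the log sums remain finite and comparable.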

@topepo
Owner

topepo commented Dec 11, 2015

It is also worth pointing out this from ?.Machine:

double.xmin: the smallest non-zero normalized floating-point number, a power of the radix, i.e., 'double.base ^ double.min.exp'. Normally '2.225074e-308'.
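That limit is easy to hit directly. A generic illustration (not specific to this data set):

```r
## smallest normalized double, about 2.225074e-308
.Machine$double.xmin

## 1e-400 is below even the denormalized range (~4.9e-324), so the
## product underflows to exactly zero
1e-200 * 1e-200        # 0

## the same product in log space is perfectly representable
log10(1e-200) + log10(1e-200)   # -400
```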

@blueskypie
Author

Thanks so much for the detailed explanation! I also dug up one of your old replies from ten years ago. Both are very helpful; I really appreciate it!

Now I understand the warning message and the NA predictions on some samples. But I think the error in the 3rd point of my original post is related to caret, not nb, based on this post. Would you please take a look?

@blueskypie
Author

Could @topepo or someone please take a look? I think the error is due to caret, not NaiveBayes from klaR.

Thanks a lot!
