Error with nb from klaR package #339

Open
blueskypie opened this issue Dec 8, 2015 · 8 comments
@blueskypie

I have gene expression data with 834 predictors, 172 samples in the training set, and 56 in the test set. I hope someone can help with a few questions:

  1. Among the 172 samples in the training set, the predictions for five are NA, but the predictor variances of those five samples look normal.
  2. predict.train on the testing set produces class predictions but no probabilities if I use all predictors except GABRQ.
  3. If I include GABRQ, predict.train on the testing set produces the following error, but I don't see anything wrong with the variance or class-conditional distribution of that variable.
    Error in colnames(tempUnkProb)[tempUnkPred] : invalid subscript type 'list'

Here is the relevant code:

trainCtrl <- trainControl(method = "repeatedcv",
                          repeats = 3,
                          number = 10,
                          classProbs = TRUE,
                          savePredictions = "final",
                          summaryFunction = twoClassSummary)

mFit <- train(x = training[, -classIndex], y = training[, classIndex],
              method = "nb",
              trControl = trainCtrl,
              metric = "ROC",  # for classification
              preProc = c("center", "scale", "conditionalX"),
              tuneLength = 10)

testPredProb <- predict(mFit, newdata = testing[, -classIndex], type = "prob")

Thanks a lot!

@topepo
Owner

topepo commented Dec 8, 2015

What happens if:

  • you use the NaiveBayes function from klaR directly? Do you still get the error?
  • you try KNN imputation as an additional pre-processing measure?

@blueskypie
Author

Thanks for the help! There is no error if I use the finalModel directly:

testPredProb <- predict(mFit$finalModel, newdata = testing[, -classIndex], type = "prob")

I'm also attaching the training and testing files; classIndex = 1.
testing.txt
training.txt

@blueskypie
Author

Also, there are no missing values, so why the imputation?

@blueskypie
Author

Can anyone help, please?

@topepo
Owner

topepo commented Dec 11, 2015

So let's walk through this. The error that you are getting is "Numerical 0 probability for all classes with observation X" for almost every test set value. Naive Bayes works from Bayes' theorem:

Pr[Y|X] =   Pr[Y] x Pr[X|Y]
            ---------------
                 Pr[X]

Your data have even odds for the prior, so we can ignore that.

The naive bit treats every predictor independently, so you have the product of these

     Pr[X_i|Y]
  ---------------
      Pr[X_i]                 

but the bottom is not a function of the outcome Y, so we normally just have to worry about Pr[X_i|Y]. The problem is that you have ~800 predictors and will be multiplying together a lot of values in [0, 1], which can cause numerical underflow. We might want to look at this proportion instead:

             Pr[X_i|Y]
  -------------------------------
   Pr[X_i|Y = t] + Pr[X_i|Y = n]                

Here is some code to do the computations manually for the nonparametric density estimate (it is easy to substitute the Gaussian density functions, but your predictors are clearly not normally distributed):

library(klaR)  # for dkernel()

all_tumor <- all_normal <- unconditional <- matrix(NA, nrow = nrow(testing), ncol = ncol(testing) - 1)
colnames(all_tumor) <- colnames(all_normal) <- colnames(unconditional) <- names(training)[-1]

for(i in 2:ncol(training)) {
  ## kernel density estimates: unconditional and per class
  dx <- density(training[, i])
  dx_yn <- density(training[training$sampleType == "normal", i])
  dx_yt <- density(training[training$sampleType == "tumor", i])

  ## evaluate the densities at the test set values
  px <- dkernel(testing[, i], kernel = dx)
  px_yn <- dkernel(testing[, i], kernel = dx_yn)
  px_yt <- dkernel(testing[, i], kernel = dx_yt)

  all_tumor[, i - 1] <- px_yt
  all_normal[, i - 1] <- px_yn
  unconditional[, i - 1] <- px

  ## plot the two class-conditional densities with the test points overlaid
  yl <- extendrange(c(dx_yn$y, dx_yt$y, px_yn, px_yt))
  xl <- extendrange(c(dx_yn$x, dx_yt$x, testing[, i]))

  plot(dx_yn$x, dx_yn$y, col = rgb(1, 0, 0, .5),
       xlim = xl, ylim = yl, type = "l", lwd = 2,
       ylab = "", xlab = names(testing)[i])
  points(dx_yt$x, dx_yt$y, col = rgb(0, 0, 1, .5), type = "l", lwd = 2)
  points(testing[, i], px_yn, col = rgb(1, 0, 0, .5), pch = 16)
  points(testing[, i], px_yt, col = rgb(0, 0, 1, .5), pch = 16)
}

prob_sum <- all_tumor + all_normal
tumor_frac <- all_tumor / prob_sum
normal_frac <- all_normal / prob_sum

So, we can approximate the posterior probability (under a balanced prior) as:

> apply(tumor_frac, 1, prod)
 [1]  0.000000e+00 2.676536e-256 2.255684e-242 1.094915e-281  0.000000e+00
 [6]  0.000000e+00 2.462139e-230 2.704752e-228  0.000000e+00 3.988369e-225
[11] 6.340585e-250  0.000000e+00 1.595304e-241  0.000000e+00  0.000000e+00
[16]  0.000000e+00 4.797126e-236  0.000000e+00  0.000000e+00  0.000000e+00
[21] 5.724403e-274 4.681493e-243 1.667817e-271  0.000000e+00  0.000000e+00
[26] 3.505219e-257  0.000000e+00 1.080127e-260 9.364458e-269  0.000000e+00
[31] 1.678362e-268  0.000000e+00  0.000000e+00 3.639862e-293  0.000000e+00
[36]  0.000000e+00 1.870002e-250  0.000000e+00 1.775739e-253 8.250657e-236
[41] 1.578156e-233  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
[46] 1.030652e-254 1.221540e-231  0.000000e+00 3.158808e-261 4.535165e-244
[51] 4.304656e-222  0.000000e+00  0.000000e+00  0.000000e+00 1.315211e-233
[56] 6.908419e-237
> apply(normal_frac, 1, prod)
 [1] 2.297612e-212  0.000000e+00  0.000000e+00 1.467317e-281 1.972531e-208
 [6] 1.145059e-200  0.000000e+00  0.000000e+00 5.526592e-207  0.000000e+00
[11]  0.000000e+00 3.260158e-204  0.000000e+00 1.811132e-220 4.347907e-187
[16] 1.867480e-244  0.000000e+00 1.626627e-213 6.456660e-201 3.014360e-211
[21] 8.399097e-307  0.000000e+00 3.312943e-301 2.126332e-208 2.566658e-215
[26] 8.641432e-309 1.693245e-202  0.000000e+00  0.000000e+00 3.037738e-196
[31] 2.567129e-307 1.874569e-223 2.683573e-220 8.547862e-273 2.856175e-217
[36] 6.860483e-190 1.998496e-320 7.515750e-232 4.940656e-324  0.000000e+00
[41]  0.000000e+00 1.458379e-220 3.035225e-236 1.186740e-194 2.773529e-218
[46] 1.483024e-315  0.000000e+00 1.282923e-204 2.241090e-306  0.000000e+00
[51]  0.000000e+00 2.070366e-218 1.359433e-192 1.084135e-207  0.000000e+00
[56]  0.000000e+00

So, my guess as to the problem is that Naive Bayes isn't really the right tool for the job here (e.g., large numbers of predictors). Numerically, the probabilities are all going toward zero.

You might be able to get around it by summing the logs and choosing the class with the largest value:

> apply(normal_frac, 1, function(x) sum(log10(x)))
 [1] -211.6387 -324.8646 -350.7398 -280.8335 -207.7050 -199.9412 -378.7986
 [8] -406.5969 -206.2575 -423.2121 -346.4341 -203.4868 -359.1892 -219.7420
[15] -186.3617 -243.7287 -414.1802 -212.7887 -200.1900 -210.5208 -306.0758
[22] -336.3935 -300.4798 -207.6724 -214.5906 -308.0634 -201.7713 -339.9645
[29] -345.0583 -195.5174 -306.5906 -222.7271 -219.5713 -272.0681 -216.5442
[36] -189.1636 -319.6993 -231.1240 -323.1415 -376.3364 -365.2225 -219.8361
[43] -235.5178 -193.9256 -217.5570 -314.8289 -374.2555 -203.8918 -305.6495
[50] -349.1757 -392.5578 -217.6840 -191.8666 -206.9649 -359.8696 -368.2657
> apply(tumor_frac, 1, function(x) sum(log10(x)))
 [1] -353.6756 -255.5724 -241.6467 -280.9606 -358.4289 -375.7746 -229.6087
 [8] -227.5679 -363.3898 -224.3992 -249.1979 -374.2010 -240.7972 -340.0716
[15] -392.4513 -337.2000 -235.3190 -352.9489 -362.4628 -362.8660 -273.2423
[22] -242.3296 -270.7779 -376.8879 -348.4819 -256.4553 -363.7962 -259.9665
[29] -268.0285 -389.4792 -267.7751 -343.3558 -342.0412 -292.4389 -353.5113
[36] -393.9689 -249.7282 -362.4821 -252.7506 -235.0835 -232.8019 -347.1579
[43] -362.1826 -387.0755 -359.5564 -253.9869 -230.9131 -367.7321 -260.5005
[50] -243.3434 -221.3661 -345.3825 -383.3747 -361.1584 -232.8810 -236.1606

I don't know that it is reasonable to compare 10^-211.6 to 10^-353.7, though (but I've seen people do it).
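To see the log-sum trick in isolation, here is a minimal, self-contained sketch. The matrix names mirror tumor_frac and normal_frac from the code above, but the values are simulated (not the gene expression data), so it only demonstrates the numerical behavior:

```r
## Hypothetical illustration: simulated per-predictor class fractions,
## not the real gene expression data.
set.seed(1)
n_samp <- 5
n_pred <- 2000  # enough terms that a direct product underflows

## fabricate per-predictor fractions in (0.05, 0.95) for one class;
## the other class gets the complement
tumor_frac  <- matrix(runif(n_samp * n_pred, 0.05, 0.95), nrow = n_samp)
normal_frac <- 1 - tumor_frac

## multiplying thousands of values in (0, 1) underflows to exactly zero
direct <- apply(tumor_frac, 1, prod)
all(direct == 0)  # TRUE

## summing log10 probabilities stays finite, so the classes can be compared
log_tumor  <- apply(tumor_frac, 1, function(x) sum(log10(x)))
log_normal <- apply(normal_frac, 1, function(x) sum(log10(x)))
pred <- ifelse(log_tumor > log_normal, "tumor", "normal")
```

The direct products are all identically zero, while the log sums remain finite and comparable.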

@topepo
Owner

topepo commented Dec 11, 2015

It is also worth pointing out this from ?.Machine:

double.xmin: the smallest non-zero normalized floating-point number, a power of the radix, i.e., 'double.base ^ double.min.exp'. Normally '2.225074e-308'.
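That limit is easy to hit directly. A generic illustration (not specific to this data set):

```r
## smallest normalized double, about 2.225074e-308
.Machine$double.xmin

## 1e-400 is below even the denormalized range (~4.9e-324), so the
## product underflows to exactly zero
1e-200 * 1e-200        # 0

## the same product in log space is perfectly representable
log10(1e-200) + log10(1e-200)   # -400
```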

@blueskypie
Author

Thanks so much for the detailed explanation! I also dug up one of your old replies from ten years ago. Both are very helpful; I really appreciate it!

Now I understand the warning message and the NA predictions on some samples. But I think the error in the 3rd point of my original post is related to caret, not nb, based on this post. Would you please take a look?

@blueskypie
Author

Could @topepo or someone please take a look? I think the error is due to caret, not NaiveBayes from klaR.

Thanks a lot!
