createDataPartition creates approximate splits #284

tobigithub · 2015-10-19T19:11:38Z

Hi,
createDataPartition creates exact splits for p = 100%, 80% and 0%, but only approximate (inaccurate) splits for 70%, 50% and 10%. I did not test all values of p with apply, but with 100 rows I am sure p = 0.70 should return 70 rows instead of 72. Unless that is a feature and not a bug.

library(caret) 
library(mlbench)

# create list of simulated regression data 
# y = 10*sin(PI*x1*x2) + 20*(x3 - 0.5)^2 + 10*x4 + 5*x5 + N(0, s^2)
# x1..x5 informative, x6..x10 non-informative

set.seed(123)
simReg <- mlbench.friedman1(100, sd = 1)
# conversion to data frame as suggested in book Applied ML
simReg$x <- data.frame(simReg$x)


inTrain <- createDataPartition(y=simReg$y, p=1.0, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.80, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.70, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.50, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.1, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.0, list=FALSE)
dim(inTrain)
str(simReg)

This produces:

> inTrain <- createDataPartition(y=simReg$y, p=1.0, list=FALSE)
> dim(inTrain)
[1] 100   1
> inTrain <- createDataPartition(y=simReg$y, p=0.80, list=FALSE)
> dim(inTrain)
[1] 80  1
> inTrain <- createDataPartition(y=simReg$y, p=0.70, list=FALSE)
> dim(inTrain)
[1] 72  1
> inTrain <- createDataPartition(y=simReg$y, p=0.50, list=FALSE)
> dim(inTrain)
[1] 52  1
> inTrain <- createDataPartition(y=simReg$y, p=0.1, list=FALSE)
> dim(inTrain)
[1] 12  1
> inTrain <- createDataPartition(y=simReg$y, p=0.0, list=FALSE)
> dim(inTrain)
[1] 0 1

Is there a way to create accurate splits?

Cheers
Tobias

khotilov · 2016-04-15T00:11:42Z

Is there a way to create accurate splits?

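Yes, if you turn off the default stratified sampling by setting the number of y quantile breaks (the groups argument) to two or less, e.g., createDataPartition(simReg$y, p = 0.10, list = FALSE, groups = 2). For a numeric y, the outcome is binned by quantiles and each bin is sampled separately, so the per-bin rounding adds up to more than p * n rows.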
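
The sizes 72, 52 and 12 in the original report are consistent with rounding up inside each quantile bin. The following is a minimal base R sketch of that arithmetic, not caret's actual code: expected_size is a hypothetical helper, and it assumes that a numeric y is cut at groups quantile breaks (default 5, giving 4 bins) and that ceiling(bin size * p) rows are drawn from each bin. rnorm is used as a stand-in for simReg$y so that the example does not need mlbench.

```r
# Hypothetical helper (not part of caret): predicted size of the training set
# under the assumption of quantile binning plus per-bin ceiling().
expected_size <- function(y, p, groups = 5) {
  # groups quantile breaks give (groups - 1) bins
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups)))
  bins <- cut(y, breaks, include.lowest = TRUE)
  # each bin contributes ceiling(bin size * p) rows
  sum(ceiling(table(bins) * p))
}

set.seed(123)
y <- rnorm(100)  # stand-in for simReg$y: 100 distinct numeric values

# Default groups = 5: four bins of 25 rows each
sapply(c(1, 0.8, 0.7, 0.5, 0.1, 0), function(p) expected_size(y, p))
# 100 80 72 52 12 0

# groups = 2: a single bin of 100 rows, so only one rounding step
expected_size(y, 0.1, groups = 2)
# 10
```

With four bins of 25 rows, p = 0.7 gives ceiling(17.5) = 18 rows per bin, hence 72 in total. p = 0.8 gives exactly 20 per bin, which is why that split looked correct.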

tobigithub · 2016-04-20T01:47:19Z

Thank you Vadim.
Tobias

VectorPosse · 2018-02-10T23:13:27Z

I, too, was just bitten by this. I understand the rationale for the splitting procedure to respect the structure in the outcome variable, but it's counterintuitive to a new user that p = 0.5 does not give a 50/50 split without setting another argument away from its default value.
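
For readers who only need an exact count and do not care about stratification, a plain random split is another option. This is a sketch in base R, not a caret feature: exact_partition is a hypothetical helper, and unlike createDataPartition it ignores the distribution of the outcome entirely.

```r
# Hypothetical helper (not part of caret): draw exactly round(n * p) row
# indices at random, without any stratification on the outcome.
exact_partition <- function(n, p) {
  sort(sample.int(n, size = round(n * p)))
}

set.seed(123)
inTrain <- exact_partition(100, 0.5)
length(inTrain)  # 50
```

The trade-off is that the training and test sets are no longer guaranteed to cover the range of y evenly, which is what the default binning is meant to provide.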

anadiedrichs mentioned this issue Jul 30, 2018

bug al utilizar createDataPartition (bug when using createDataPartition) anadiedrichs/diedrichs2017prediction-frost-experiments#7

Closed
