createDataPartition creates approximate splits #284

Open
tobigithub opened this issue Oct 19, 2015 · 3 comments

@tobigithub

Hi,
createDataPartition creates exact splits for p = 1.0, 0.80 and 0.0, but only approximate (inaccurate) splits for p = 0.70, 0.50 and 0.10. I did not test every value of p, but with 100 observations I am sure p = 0.70 should return 70 rows instead of 72, unless that is a feature and not a bug.

library(caret) 
library(mlbench)

# create list of simulated regression data (Friedman 1 benchmark)
# y = 10*sin(pi*x1*x2) + 20*(x3 - 0.5)^2 + 10*x4 + 5*x5 + N(0, s^2)
# x1..x5 are informative; x6..x10 are non-informative

set.seed(123)
simReg <- mlbench.friedman1(100, sd = 1)
# conversion to data frame as suggested in book Applied ML
simReg$x <- data.frame(simReg$x)


inTrain <- createDataPartition(y=simReg$y, p=1.0, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.80, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.70, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.50, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.1, list=FALSE)
dim(inTrain)
inTrain <- createDataPartition(y=simReg$y, p=0.0, list=FALSE)
dim(inTrain)
str(simReg)

This produces:

> inTrain <- createDataPartition(y=simReg$y, p=1.0, list=FALSE)
> dim(inTrain)
[1] 100   1
> inTrain <- createDataPartition(y=simReg$y, p=0.80, list=FALSE)
> dim(inTrain)
[1] 80  1
> inTrain <- createDataPartition(y=simReg$y, p=0.70, list=FALSE)
> dim(inTrain)
[1] 72  1
> inTrain <- createDataPartition(y=simReg$y, p=0.50, list=FALSE)
> dim(inTrain)
[1] 52  1
> inTrain <- createDataPartition(y=simReg$y, p=0.1, list=FALSE)
> dim(inTrain)
[1] 12  1
> inTrain <- createDataPartition(y=simReg$y, p=0.0, list=FALSE)
> dim(inTrain)
[1] 0 1


Is there a way to create accurate splits?

Cheers
Tobias

@khotilov
Contributor

Is there a way to create accurate splits?

Yes, if you turn off the default stratified sampling by setting the number of y quantile breaks to two or fewer, e.g., createDataPartition(simReg$y, p=0.10, list=FALSE, groups=2).
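
A rough base-R sketch of why the reported sizes come out as 72, 52 and 12 rather than 70, 50 and 10. It assumes that the default settings place the 100 outcome values into four equal quantile bins of 25, and that the sample size within each bin is rounded up before the bins are combined. The helper name expected_size and the equal bin sizes are illustrative assumptions for this thread, not caret's actual code.

```r
# Hypothetical helper: total training-set size when every stratum
# is sampled separately and its size is rounded up.
expected_size <- function(n_per_bin, p) {
  sum(ceiling(n_per_bin * p))
}

# Assumption: 100 observations split into four quantile bins of 25 each.
bins <- rep(25, 4)

# The same p values as in the original report.
p_values <- c(1.0, 0.80, 0.70, 0.50, 0.10, 0.0)
sizes <- sapply(p_values, function(p) expected_size(bins, p))

print(data.frame(p = p_values, size = sizes))
```

Under these assumptions the sizes are 100, 80, 72, 52, 12 and 0, which matches the output in the report. With a single bin of 100 the rounding has nothing to inflate, which is consistent with groups=2 giving the requested counts.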

@tobigithub
Author

Thank you Vadim.
Tobias

@VectorPosse

I, too, was just bitten by this. I understand the rationale for the splitting procedure to respect the structure in the outcome variable, but it's counterintuitive to a new user that p = 0.5 does not give a 50/50 split without setting another argument away from its default value.
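
If an exact row count matters more than respecting the distribution of the outcome, a plain random split in base R is one alternative. This is a minimal sketch that uses sample() instead of createDataPartition, so it gives up stratification entirely; the vector y is a stand-in for simReg$y, and the names n_train and in_train are made up for the example.

```r
set.seed(123)

# Stand-in outcome vector; in the thread this would be simReg$y.
y <- rnorm(100)

p <- 0.5
n_train <- round(p * length(y))  # requested number of training rows

# Draw row indices without replacement, ignoring the values of y.
in_train <- sort(sample(seq_along(y), size = n_train))

length(in_train)  # 50
```

The remaining rows, seq_along(y)[-in_train], form the hold-out set.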
