I was wondering about calculate_distance() and the after_cuts object.
How will it behave if one calls calculate_distance() on factor vectors with different numbers of levels?
For example, suppose the train and test samples are provided from outside R. If one of the factor variables in the test sample (variable_new) is missing some of its values, it won't have the same number of levels as the corresponding variable in the train sample (variable_old). Then, if I'm not mistaken, the c() call that creates after_cuts will encode them differently than it should.
Example:
library(DALEX)  # assumption: apartments and apartments_test come from DALEX
library(dplyr)  # assumption: filter() is dplyr::filter()
variable_old <- apartments[, 6]
variable_new <- filter(apartments_test, district != "Praga")[, 6]
variable_new2 <- droplevels(variable_new)
length(levels(variable_new))
[1] 10
length(levels(variable_new2))
[1] 9
calculate_distance(variable_old, variable_new)
[1] 0.092
calculate_distance(variable_old, variable_new2)
[1] 0.097
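A minimal sketch of why the two results might differ, using a toy factor instead of the apartments data (the names toy_old and toy_new and the values are made up for illustration). droplevels() renumbers the remaining levels, so the same label can get a different integer code in each vector. Concatenating the raw codes, which is what c() did with factors before R 4.1.0, then mixes up the labels:

```r
# Toy data; names and values are illustrative only.
toy_old <- factor(c("Bemowo", "Praga", "Wola", "Wola"))
toy_new <- droplevels(toy_old[toy_old != "Praga"])

levels(toy_old)      # "Bemowo" "Praga" "Wola"
levels(toy_new)      # "Bemowo" "Wola"

as.integer(toy_old)  # 1 2 3 3
as.integer(toy_new)  # 1 2 2   ("Wola" is now code 2, not 3)

# Concatenate the raw integer codes, mimicking c() on factors in R < 4.1.0.
codes <- c(as.integer(toy_old), as.integer(toy_new))

# Decoding with the levels of toy_old turns the new "Wola" values into "Praga".
decoded <- levels(toy_old)[codes]
decoded
# "Bemowo" "Praga" "Wola" "Wola" "Bemowo" "Praga" "Praga"
```

Whether calculate_distance() is affected in exactly this way depends on the R version and on how after_cuts is used afterwards, so this only illustrates the suspected mechanism.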
If that's indeed a problem, then maybe the change proposed below would solve it?
after_cuts <- as.factor(c(as.character(variable_old), as.character(variable_new)))
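As a rough check of this idea on toy data (the helper name combine_as_labels and the toy vectors are hypothetical, not part of the package): combining by label gives the same result whether or not the unused level was dropped from the new vector.

```r
# Toy data; names and values are illustrative only.
toy_old  <- factor(c("Bemowo", "Praga", "Wola", "Wola"))
toy_new  <- toy_old[toy_old != "Praga"]  # still carries the unused "Praga" level
toy_new2 <- droplevels(toy_new)          # unused level removed

# Hypothetical helper wrapping the proposed change:
# combine on labels, then build one factor with a shared set of levels.
combine_as_labels <- function(a, b) {
  as.factor(c(as.character(a), as.character(b)))
}

cuts1 <- combine_as_labels(toy_old, toy_new)
cuts2 <- combine_as_labels(toy_old, toy_new2)

identical(cuts1, cuts2)  # TRUE
levels(cuts1)            # "Bemowo" "Praga" "Wola"
as.integer(cuts1)        # 1 2 3 3 1 3 3
```

One caveat of this approach: a level that appears in neither vector will not be present in the combined factor, which may or may not matter for the distance calculation.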