-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR #19551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #82958 has finished for PR 19551 at commit
|
ffe751b
to
145b60f
Compare
Test build #82959 has finished for PR 19551 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this PR.
@@ -499,6 +499,12 @@ test_that("create DataFrame with different data types", { | |||
expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE)) | |||
}) | |||
|
|||
test_that("SPARK-17902: collect() with stringsAsFactors enabled", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you please verify that factor orders are identical. I wonder if expect_equal
really verifies order of values in a factor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> # Ordered vs unordered
> or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"), ordered=TRUE)
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"), ordered=FALSE)
> expect_equal(or, or1)
error: `or` not equal to `or1`.
Attributes: < Component “class”: Lengths (2, 1) differ (string compare on first 1) >
Attributes: < Component “class”: 1 string mismatch >
> # level order mismatch
> or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"))
> expect_equal(or, or1)
error: `or` not equal to `or1`.
Attributes: < Component “levels”: 3 string mismatches >
# Data order mismatch
> or <- factor(c("Lo", "Hi", "Med", "Med", "Hi"), levels=c("Hi", "Lo", "Med"))
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> expect_equal(or, or1)
error: `or` not equal to `or1`.
4 string mismatches
> or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> expect_equal(or, or1)
Would this test address your concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW: I think iris data frame all Species values are clustered together. That is why the test is passing (the new factor order ends up being identical to the existing order).
R/pkg/R/DataFrame.R
Outdated
@@ -1191,6 +1191,9 @@ setMethod("collect", | |||
vec <- do.call(c, col) | |||
stopifnot(class(vec) != "list") | |||
class(vec) <- PRIMITIVE_TYPES[[colType]] | |||
if (stringsAsFactors && is.character(vec)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For performance maybe it is better to reverse the order of checks: is.character(vec) && stringsAsFactors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thanks.
Test build #82982 has finished for PR 19551 at commit
|
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, added this to SPARK-21616
## What changes were proposed in this pull request? This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in 71a138c. Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion. ## How was this patch tested? Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19551 from HyukjinKwon/SPARK-17902. (cherry picked from commit a83d8d5) Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
## What changes were proposed in this pull request? This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in 71a138c. Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion. ## How was this patch tested? Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19551 from HyukjinKwon/SPARK-17902. (cherry picked from commit a83d8d5) Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
Merged to master, branch-2.2 and branch-2.1. |
Thank you for review @falaki and @felixcheung. |
## What changes were proposed in this pull request? This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in apache@71a138c. Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion. ## How was this patch tested? Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`. Author: hyukjinkwon <gurwls223@gmail.com> Closes apache#19551 from HyukjinKwon/SPARK-17902. (cherry picked from commit a83d8d5) Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
What changes were proposed in this pull request?
This PR proposes to revive
stringsAsFactors
option in collect API, which was mistakenly removed in 71a138c.Simply, it casts
charactor
tofactor
if it meets the condition,stringsAsFactors && is.character(vec)
in primitive type conversion.How was this patch tested?
Unit test in
R/pkg/tests/fulltests/test_sparkSQL.R
.