[SPARK-17608][SPARKR] Long type has incorrect serialization/deserialization #17640
Conversation
Overall, this looks like a sensible approach to a messy problem.
Test build #75815 has finished for PR 17640 at commit
I will add some bounds checking and error handling. I think there should be some generic handling approach on the Scala side, because …
cc @felixcheung
If I use a very big number, the SparkR shell gets the following output:
So the overflow problem has been taken care of on the Scala side. We don't have to add additional handling on the R side.
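For context, a short R sketch (my illustration, not from the PR) of why range handling has to live on the Scala side: R integers are 32-bit and saturate to NA on coercion, so any value this large is already a double before it ever reaches the backend.

.Machine$integer.max   # 2147483647, the largest R integer
as.integer(2^31)       # NA, with a coercion warning: out of integer range
2^53 + 1 == 2^53       # TRUE - doubles lose exactness past 2^53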
What's the number you used for this big number?
Thanks @wangmiao1981. Does this address SPARK-12360 too?
For 138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240
Based on my understanding, it does not directly solve SPARK-12360. This one just fixes the serialization of a specific type.
+1 on what @felixcheung said -- it'll be good to have more tests in test_Serde.R. Other than that, the change looks fine.
I am adding more tests right now.
These are great tests to add... Thinking more about this, I think these are actually testing large numeric values, not large integer values, because in R I'm not sure we could specify them as integer, since R's integer is only 32-bit. So we need a way to get a bigint back from the JVM side - and since we can't specify bigint on the R side, should there be a SQL function test that creates such a value (and a new column with the bigint type) from within the JVM?
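One way to get at what's suggested here, sketched under the assumption of an active SparkR session: have the JVM itself produce the bigint column through a SQL CAST, so the value never needs to exist as an R integer.

library(SparkR)
sparkR.session()

# The CAST runs in the JVM, so "big" is a genuine Spark bigint column even
# though R has no 64-bit integer type of its own.
df <- sql("SELECT CAST(1380742793415240 AS BIGINT) AS big")
printSchema(df)   # big: long
collect(df)$big   # comes back to R as a numeric (double)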
Test build #75862 has finished for PR 17640 at commit
@wangmiao1981 do you want to get this into 2.2?
@felixcheung I just came back from vacation. I will make changes now. Thanks!
Test build #76121 has finished for PR 17640 at commit
Let me know if I'm mistaken - I think the tests in here are only testing doubles/numerics instead of large ints?
df <- createDataFrame(
  list(list(1380742793415240, 1, "1"), list(1380742793415240, 2, "2"),
       list(1380742793415240, 3, "3")), c("a", "b", "c"))
schema <- structType(structField("a", "bigint"), structField("b", "bigint"),
This one tests bigint
Actually, I'm not sure. Walking through the code, createDataFrame calls parallelize, which eventually calls R's serialize. Since these big values are actually numeric, and not integer, serialize writes them in that way.
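This is easy to confirm at the R prompt (my own illustration): the large literal is a double before createDataFrame ever sees it, so R's serialize has nothing but a double to write.

x <- 1380742793415240
is.integer(x)   # FALSE - R has no 64-bit integer type
is.numeric(x)   # TRUE  - the literal is stored as a double
# base R's serialize round-trips it as a double as well
identical(unserialize(serialize(x, connection = NULL)), x)   # TRUE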
@@ -28,6 +28,10 @@ test_that("SerDe of primitive types", {
  expect_equal(x, 1)
  expect_equal(class(x), "numeric")

  x <- callJStatic("SparkRHandler", "echo", 1380742793415240)
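The hunk is cut off above; judging from the assertions on the neighboring lines, the added checks presumably continue in this style (a sketch, not the verbatim patch):

x <- callJStatic("SparkRHandler", "echo", 1380742793415240)
expect_equal(x, 1380742793415240)
expect_equal(class(x), "numeric")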
I don't know how to enforce the bigint type from the R console.
I did some Google searching: R can't specify a bigint type, so we can't directly test bigint. We can remove the tests above, since we added schema tests and Scala API tests.
I think we might need to step through this a bit more?
@@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) {
  Date = writeDate(con, object),
  POSIXlt = writeTime(con, object),
  POSIXct = writeTime(con, object),
  bigint = writeDouble(con, object),
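For reference, a minimal sketch of what the writeDouble branch amounts to on the wire, assuming SparkR's usual SerDe convention of big-endian primitives: the bigint value is sent as a plain 8-byte IEEE double.

# sketch of the double writer: 8 bytes, big-endian, as the JVM side expects
writeDouble <- function(con, value) {
  writeBin(as.double(value), con, endian = "big")
}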
And another thing: there is no bigint in R, so I'm not sure how we would hit this path.
When specifying a schema with bigint, we will hit the bigint path. Without this change, it would throw a type-mismatch error. But as you said, we can't specify a bigint type from the R console.
If R doesn't have a bigint type, we should remove all the bigint-related logic. I don't know the history of the bigint mapping in the types.R file. Why should we have it, since every big number is numeric (Double in the backend)?
@felixcheung Any thoughts? Thanks!
If you are referring to https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L25 - like it says, "names(PRIMITIVE_TYPES) are Scala types whereas values are equivalent R types." So the bigint there is a Scala type, not an R type.
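An abridged sketch of that mapping (paraphrased from types.R, not the full list): the names are Scala/Catalyst type strings and the values are the R classes they deserialize to.

PRIMITIVE_TYPES <- c(
  "int"     = "integer",
  "bigint"  = "numeric",   # 64-bit long on the Scala side, double in R
  "double"  = "numeric",
  "string"  = "character",
  "boolean" = "logical")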
I see. But as you mentioned, we don't know how to trigger the write path on the R side, because both bigint and double are numeric in R. I think we can just remove the test on the R side.
I think this is different, though: PRIMITIVE_TYPES is used when you create a schema with structField in R. In this case you can definitely define a column as bigint and then pass an R numeric value to it.
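For example (a sketch assuming an active session), the bigint branch is selected by the schema, not by the R class of the data, which stays a plain double:

schema <- structType(structField("a", "bigint"), structField("b", "string"))
# "a" holds an R double; the schema string "bigint" is what routes it
df <- createDataFrame(list(list(1380742793415240, "x")), schema = schema)
printSchema(df)   # a: long, b: string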
When using createDataFrame, R uses serialize to send data to the backend. When taking an action, say collect, the Scala-side logic refers to the schema field and calls readTypedObjects, where the newly added read logic kicks in. When the result returns to the R side, the newly added write logic (on the Scala side) kicks in, and the R side can interpret it thanks to the R-side read logic. It seems the write logic on the R side is never called, because we don't have a specific bigint type in R. Right?
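A round trip exercising both directions, sketched under the same session assumption: the schema makes the JVM read the double as a long, and collect sends it back, where the R-side read logic turns it into a double again.

df <- createDataFrame(list(list(1380742793415240)), "a")
df$a <- cast(df$a, "bigint")   # force the long/bigint path in the JVM
out <- collect(df)
class(out$a)                   # "numeric" - back to an R double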
For completeness, I think we can keep the write logic on the R side.
Test build #76306 has finished for PR 17640 at commit
Test build #76307 has finished for PR 17640 at commit
@wangmiao1981 Are you still working on this?
@jiangxb1987 The original PR has some issues that are not correctly handled. I will open a new PR when I figure out the right fix. I intended to close this PR. Thanks for closing it.
What changes were proposed in this pull request?
bigint is not supported in schema, and the serialization is not Double. Add bigint support in schema, serialized and deserialized as Double.
This fix is orthogonal to the precision problem in
https://issues.apache.org/jira/browse/SPARK-12360
How was this patch tested?
Add a new unit test.