
[SPARK-17608][SPARKR]: Long type has incorrect serialization/deserialization #17640


Closed
wants to merge 3 commits

Conversation

wangmiao1981
Contributor

What changes were proposed in this pull request?

bigint is not supported in schemas, and its serialization does not use Double.

This change adds bigint support in schemas, with values serialized and deserialized as Double.

This fix is orthogonal to the precision problem in
https://issues.apache.org/jira/browse/SPARK-12360
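
For context, a small R snippet showing why the Double mapping is inherently lossy for large integers (illustrative only; 2^53 is the exact-integer limit of an IEEE-754 double):

```r
# R "numeric" is an IEEE-754 double: integers are exact only up to 2^53.
x <- 1380742793415240   # parsed as numeric (double), not integer
is.integer(x)           # FALSE
x == x + 1              # FALSE: x is below 2^53, so adding 1 is still exact
2^53 == 2^53 + 1        # TRUE: beyond 2^53, adjacent integers collapse
```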

How was this patch tested?

Add a new unit test.

@frreiss
Contributor

frreiss commented Apr 14, 2017

Overall, this looks like a sensible approach to a messy problem.
You might want to think about adding some overflow handling to the SQL-->R translation. That is, if a DataFrame contains a bigint value that cannot be expressed as a Double, it would be safer to convert that value to NaN instead of stripping the lower-order bits off the bigint. The bigint column in the source DataFrame could hold a unique identifier or a hash value.
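
A minimal sketch of the guard suggested here (hypothetical helper, not part of this PR; the real check would live on the JVM side of the serde):

```r
# Hypothetical guard: a bigint that cannot be represented exactly in a
# double becomes NaN instead of being silently rounded.
safe_bigint_to_double <- function(v) {
  if (abs(v) > 2^53) NaN else as.double(v)
}
safe_bigint_to_double(1380742793415240)  # fits: returned unchanged
safe_bigint_to_double(2^60)              # too large to be exact: NaN
```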

@SparkQA

SparkQA commented Apr 14, 2017

Test build #75815 has finished for PR 17640 at commit 03b82ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Contributor Author

wangmiao1981 commented Apr 14, 2017

I will add some bounds checks and error handling. I think there should be a generic handling approach on the Scala side, because Double will also have the overflow problem. Anyway, I will check the source code to understand the handling.

@wangmiao1981
Contributor Author

cc @felixcheung

@wangmiao1981
Contributor Author

If I use a very big number, the sparkR shell gives the following output:

collect(df1)
a b c d
1 Inf 1 1 Inf

So the overflow problem has already been taken care of on the Scala side. We don't have to add additional handling on the R side.
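
The same saturation can be seen in plain R, which is why no extra R-side handling is needed for this case (illustrative snippet):

```r
1e308 * 10           # Inf: beyond the double range (~1.8e308)
as.numeric("1e999")  # Inf: parsing an oversized literal also saturates
```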

@felixcheung
Member

felixcheung commented Apr 15, 2017 via email

@felixcheung
Member

Thanks @wangmiao1981
cc @shivaram @sun-rui

Does this address SPARK-12360 too?
Since this changes the serde, does this affect more than dapply? Could you please add more tests?

@wangmiao1981
Contributor Author

For the Inf case, I used a very large number:

138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240

@wangmiao1981
Contributor Author

Based on my understanding, it does not directly solve SPARK-12360. This change only addresses the serialization of the specific bigint type in struct fields.

@shivaram
Contributor

+1 on what @felixcheung said -- It'll be good to have more tests in test_Serde.R. Other than that the change looks fine

@wangmiao1981
Contributor Author

I am adding more tests right now.

@felixcheung
Member

these are great tests to add...

thinking more about this, I think these are actually testing large numeric values, not large integer values, because in R, 1380742793415240 defaults to numeric, whereas 1380742793415240L is integer.

though I'm not sure we could specify that as an integer, since R integers are only 32-bit. So we need a way to get a bigint back from the JVM side, and since we can't specify bigint on the R side, should there be a SQL function test that creates such a value (and a new column with the bigint type) from within the JVM?
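
For reference, a short R session illustrating the literal distinction described above:

```r
class(1380742793415240)  # "numeric": a double, despite looking integral
class(10L)               # "integer": the L suffix marks a 32-bit integer
.Machine$integer.max     # 2147483647, so 1380742793415240L cannot fit;
                         # R would warn and fall back to numeric
```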

@SparkQA

SparkQA commented Apr 17, 2017

Test build #75862 has finished for PR 17640 at commit da39bcf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

@wangmiao1981 do you want to get this into 2.2?

@wangmiao1981
Contributor Author

@felixcheung I just came back from vacation. I will make changes now. Thanks!

@SparkQA

SparkQA commented Apr 25, 2017

Test build #76121 has finished for PR 17640 at commit 331b781.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

let me know if I'm mistaken, but I think the tests here are only testing doubles/numerics instead of large ints?

df <- createDataFrame(
  list(list(1380742793415240, 1, "1"), list(1380742793415240, 2, "2"),
       list(1380742793415240, 3, "3")), c("a", "b", "c"))
schema <- structType(structField("a", "bigint"), structField("b", "bigint"),
Contributor Author

This one tests bigint

Member

@felixcheung felixcheung Apr 26, 2017

actually, I'm not sure.
Walking through the code, createDataFrame calls parallelize, which eventually calls R's serialize. Since these big values are actually numeric, not integer, serialize writes them out that way.
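
A base-R sketch of that point (illustrative, not the actual SparkR code path; SparkR's writeDouble ultimately uses a writeBin call like this):

```r
# The value is an R double, so it goes over the wire as 8 bytes of
# IEEE-754 double, regardless of any bigint declared in the schema.
con <- rawConnection(raw(0), "wb")
writeBin(1380742793415240, con, endian = "big")
length(rawConnectionValue(con))  # 8
close(con)
```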

@@ -28,6 +28,10 @@ test_that("SerDe of primitive types", {
expect_equal(x, 1)
expect_equal(class(x), "numeric")

x <- callJStatic("SparkRHandler", "echo", 1380742793415240)
Contributor Author

I don't know how to enforce the bigint type from the R console.

Contributor Author

I did some Google searching. R can't specify a bigint type, so we can't directly test the bigint type.

We can remove the tests above, since we added schema tests and Scala API tests.

Member

@felixcheung felixcheung left a comment

I think we might need to step through this a bit more?


@@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) {
Date = writeDate(con, object),
POSIXlt = writeTime(con, object),
POSIXct = writeTime(con, object),
bigint = writeDouble(con, object),
Member

and another thing: there is no bigint in R, so I'm not sure how we would hit this path

Contributor Author

When specifying a schema with bigint, we will hit the bigint path. Without this change, it would throw a type mismatch error. But as you said, we can't specify the bigint type from the R console.

Contributor Author

If R doesn't have a bigint type, we should remove all bigint-related logic. I don't know the history of the bigint mapping in the types.R file. Why should we have it, since every big number is numeric (Double in the backend)?

Contributor Author

@felixcheung Any thoughts? Thanks!

Member

@felixcheung felixcheung Apr 28, 2017

if you are referring to https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L25, as it says there:

names(PRIMITIVE_TYPES) are Scala types whereas values are equivalent R types.

so bigint there is a Scala type, not an R type
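
The shape of that mapping, abridged and illustrative (see types.R for the actual entries):

```r
# names are Scala/SQL types; values are the R types they map to
PRIMITIVE_TYPES <- c(
  "integer" = "integer",
  "double"  = "numeric",
  "bigint"  = "numeric",   # the mapping under discussion
  "string"  = "character"
)
PRIMITIVE_TYPES[["bigint"]]  # "numeric"
```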

Contributor Author

I see. But as you mentioned, we don't know how to trigger the write path on the R side, because both bigint and double are numeric. I think we can just remove the test on the R side.

Member

I think this is different, though: PRIMITIVE_TYPES is used when you create a schema with structField in R. In this case you can definitely define a column as bigint and then pass an R numeric value to it.

Contributor Author

When using createDataFrame, R uses serialize to send data to the backend. When taking an action, say collect, the Scala-side logic refers to the schema field and calls readTypedObjects, where the newly added read logic kicks in. When the result comes back to the R side, the newly added write logic kicks in (on the JVM side), and the R side can interpret it thanks to the R-side read logic. It seems the write logic on the R side is never called, because we don't have a specific bigint type in R. Right?
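
The round trip described here, as a runnable SparkR sketch (assumes a local Spark install with this change applied; without it, the bigint schema raises a type mismatch):

```r
library(SparkR)
sparkR.session(master = "local[1]")

# R numerics go out via serialize; the JVM reads them into the bigint
# column; collect() brings them back to R as numerics again.
schema <- structType(structField("a", "bigint"), structField("b", "string"))
df <- createDataFrame(list(list(1380742793415240, "x")), schema = schema)
class(collect(df)$a)  # "numeric"

sparkR.session.stop()
```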

Contributor Author

For completeness, I think we can keep the write logic on the R side.

@SparkQA

SparkQA commented Apr 29, 2017

Test build #76306 has finished for PR 17640 at commit 26d7750.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2017

Test build #76307 has finished for PR 17640 at commit 38fef1d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

@wangmiao1981 Are you still working on this?

@wangmiao1981
Contributor Author

@jiangxb1987 The original PR has some issues that were not handled correctly. I will open a new PR when I figure out the right fix. I intended to close this PR; thanks for closing it.
