
[SPARK-17608][SPARKR]: Long type has incorrect serialization/deserialization #17640


Closed
wants to merge 3 commits

Conversation

wangmiao1981
Contributor

What changes were proposed in this pull request?

bigint is not supported in schemas, and its serialization does not use Double.

This change adds bigint support in schemas, with values serialized and deserialized as Double.

This fix is orthogonal to the precision problem in
https://issues.apache.org/jira/browse/SPARK-12360
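
For context, a small R snippet showing why the Double mapping is inherently lossy for large integers (illustrative only; 2^53 is the exact-integer limit of an IEEE-754 double):

```r
# R "numeric" is an IEEE-754 double: integers are exact only up to 2^53.
x <- 1380742793415240   # parsed as numeric (double), not integer
is.integer(x)           # FALSE
x == x + 1              # FALSE: x is below 2^53, so adding 1 is still exact
2^53 == 2^53 + 1        # TRUE: beyond 2^53, adjacent integers collapse
```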

How was this patch tested?

Add a new unit test.

@frreiss
Contributor

frreiss commented Apr 14, 2017

Overall, this looks like a sensible approach to a messy problem.
You might want to think about adding some overflow handling to the SQL-->R translation. That is, if a DataFrame contains a bigint value that cannot be expressed as a Double, it would be safer to convert that value to NaN instead of stripping the lower-order bits off the bigint. The bigint column in the source DataFrame could hold a unique identifier or a hash value.
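
A minimal sketch of the guard suggested here (hypothetical helper, not part of this PR; the real check would live on the JVM side of the serde):

```r
# Hypothetical guard: a bigint that cannot be represented exactly in a
# double becomes NaN instead of being silently rounded.
safe_bigint_to_double <- function(v) {
  if (abs(v) > 2^53) NaN else as.double(v)
}
safe_bigint_to_double(1380742793415240)  # fits: returned unchanged
safe_bigint_to_double(2^60)              # too large to be exact: NaN
```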

@SparkQA

SparkQA commented Apr 14, 2017

Test build #75815 has finished for PR 17640 at commit 03b82ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Contributor Author

wangmiao1981 commented Apr 14, 2017

I will add some bounds checks and error handling. I think there should be a generic handling approach on the Scala side, because Double will also have the overflow problem. Anyway, I will check the source code to understand the handling.

@wangmiao1981
Contributor Author

cc @felixcheung

@wangmiao1981
Contributor Author

If I use a very big number, the sparkR shell gives the following output:

collect(df1)
a b c d
1 Inf 1 1 Inf

So the overflow problem has already been taken care of on the Scala side. We don't have to add additional handling on the R side.
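
The same saturation can be seen in plain R, which is why no extra R-side handling is needed for this case (illustrative snippet):

```r
1e308 * 10           # Inf: beyond the double range (~1.8e308)
as.numeric("1e999")  # Inf: parsing an oversized literal also saturates
```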

@felixcheung
Member

felixcheung commented Apr 15, 2017 via email

@felixcheung
Member

Thanks @wangmiao1981
cc @shivaram @sun-rui

Does this address SPARK-12360 too?
Since this changes the serde, does this affect more than dapply? Could you please add more tests?

@wangmiao1981
Contributor Author

For the Inf case, I used a very large number:

138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240

@wangmiao1981
Contributor Author

Based on my understanding, it does not directly solve SPARK-12360. This change only addresses the serialization of the specific bigint type in struct fields.

@shivaram
Contributor

+1 on what @felixcheung said -- It'll be good to have more tests in test_Serde.R. Other than that the change looks fine

@wangmiao1981
Contributor Author

I am adding more tests right now.

@felixcheung
Member

these are great tests to add...

thinking more about this, I think these are actually testing large numeric values, not large integer values, because in R, 1380742793415240 defaults to numeric, whereas 1380742793415240L is integer.

though I'm not sure we could specify that as an integer, since R integers are only 32-bit. So we need a way to get a bigint back from the JVM side, and since we can't specify bigint on the R side, should there be a SQL function test that creates such a value (and a new column with the bigint type) from within the JVM?
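
For reference, a short R session illustrating the literal distinction described above:

```r
class(1380742793415240)  # "numeric": a double, despite looking integral
class(10L)               # "integer": the L suffix marks a 32-bit integer
.Machine$integer.max     # 2147483647, so 1380742793415240L cannot fit;
                         # R would warn and fall back to numeric
```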

@SparkQA

SparkQA commented Apr 17, 2017

Test build #75862 has finished for PR 17640 at commit da39bcf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

@wangmiao1981 do you want to get this into 2.2?

@wangmiao1981
Contributor Author

@felixcheung I just came back from vacation. I will make changes now. Thanks!

@SparkQA

SparkQA commented Apr 25, 2017

Test build #76121 has finished for PR 17640 at commit 331b781.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

let me know if I'm mistaken, but I think the tests here are only testing doubles/numerics instead of large ints?

df <- createDataFrame(
  list(list(1380742793415240, 1, "1"), list(1380742793415240, 2, "2"),
       list(1380742793415240, 3, "3")), c("a", "b", "c"))
schema <- structType(structField("a", "bigint"), structField("b", "bigint"),
Contributor Author

This one tests bigint

Member

@felixcheung felixcheung Apr 26, 2017

actually, I'm not sure.
Walking through the code, createDataFrame calls parallelize, which eventually calls R's serialize. Since these big values are actually numeric, not integer, serialize writes them out that way.
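
A base-R sketch of that point (illustrative, not the actual SparkR code path; SparkR's writeDouble ultimately uses a writeBin call like this):

```r
# The value is an R double, so it goes over the wire as 8 bytes of
# IEEE-754 double, regardless of any bigint declared in the schema.
con <- rawConnection(raw(0), "wb")
writeBin(1380742793415240, con, endian = "big")
length(rawConnectionValue(con))  # 8
close(con)
```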

@@ -28,6 +28,10 @@ test_that("SerDe of primitive types", {
expect_equal(x, 1)
expect_equal(class(x), "numeric")

x <- callJStatic("SparkRHandler", "echo", 1380742793415240)
Contributor Author

I don't know how to enforce the bigint type from the R console.

Contributor Author

I did some Google searching. R can't specify a bigint type, so we can't directly test the bigint type.

We can remove the tests above, since we added schema tests and Scala API tests.

Member

@felixcheung felixcheung left a comment

I think we might need to step through this a bit more?


@@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) {
Date = writeDate(con, object),
POSIXlt = writeTime(con, object),
POSIXct = writeTime(con, object),
bigint = writeDouble(con, object),
Member

and another thing: there is no bigint in R, so I'm not sure how we would hit this path

Contributor Author

When specifying a schema with bigint, we will hit the bigint path. Without this change, it would throw a type mismatch error. But as you said, we can't specify the bigint type from the R console.

Contributor Author

If R doesn't have a bigint type, we should remove all bigint-related logic. I don't know the history of the bigint mapping in the types.R file. Why should we have it, since every big number is numeric (Double in the backend)?

Contributor Author

@felixcheung Any thoughts? Thanks!

Member

@felixcheung felixcheung Apr 28, 2017

if you are referring to https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L25, as it says there:

names(PRIMITIVE_TYPES) are Scala types whereas values are equivalent R types.

so bigint there is a Scala type, not an R type
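
The shape of that mapping, abridged and illustrative (see types.R for the actual entries):

```r
# names are Scala/SQL types; values are the R types they map to
PRIMITIVE_TYPES <- c(
  "integer" = "integer",
  "double"  = "numeric",
  "bigint"  = "numeric",   # the mapping under discussion
  "string"  = "character"
)
PRIMITIVE_TYPES[["bigint"]]  # "numeric"
```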

Contributor Author

I see. But as you mentioned, we don't know how to trigger the write path on the R side, because both bigint and double are numeric. I think we can just remove the test on the R side.

Member

I think this is different, though: PRIMITIVE_TYPES is used when you create a schema with structField in R. In this case you can definitely define a column as bigint and then pass an R numeric value to it.

Contributor Author

When using createDataFrame, R uses serialize to send data to the backend. When taking an action, say collect, the Scala-side logic refers to the schema field and calls readTypedObjects, where the newly added read logic kicks in. When the result comes back to the R side, the newly added write logic kicks in (on the JVM side), and the R side can interpret it thanks to the R-side read logic. It seems the write logic on the R side is never called, because we don't have a specific bigint type in R. Right?
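
The round trip described here, as a runnable SparkR sketch (assumes a local Spark install with this change applied; without it, the bigint schema raises a type mismatch):

```r
library(SparkR)
sparkR.session(master = "local[1]")

# R numerics go out via serialize; the JVM reads them into the bigint
# column; collect() brings them back to R as numerics again.
schema <- structType(structField("a", "bigint"), structField("b", "string"))
df <- createDataFrame(list(list(1380742793415240, "x")), schema = schema)
class(collect(df)$a)  # "numeric"

sparkR.session.stop()
```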

Contributor Author

For completeness, I think we can keep the write logic on the R side.

@SparkQA

SparkQA commented Apr 29, 2017

Test build #76306 has finished for PR 17640 at commit 26d7750.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2017

Test build #76307 has finished for PR 17640 at commit 38fef1d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

@wangmiao1981 Are you still working on this?

@wangmiao1981
Contributor Author

@jiangxb1987 The original PR has some issues that were not handled correctly. I will open a new PR when I figure out the right fix. I intended to close this PR; thanks for closing it.
