[SPARK-10048][SPARKR] Support arbitrary nested Java array in serde. #8276

sun-rui · 2015-08-18T12:57:31Z

This PR:

1. supports transferring arbitrarily nested arrays from the JVM to the R side in SerDe;
2. based on 1, improves the collect() implementation. It can now collect data of complex types from a DataFrame.
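
For illustration only, a minimal sketch of how arbitrarily nested arrays can be written recursively to a stream. The object name, the method names, and the one-byte type tags are hypothetical; this is not Spark's actual SerDe code or wire format.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Hypothetical, simplified writer. NOT Spark's actual SerDe wire format.
object NestedArraySketch {
  // Write a one-byte type tag followed by the payload.
  // Arrays write their length and then recurse into each element,
  // which is what allows any nesting depth.
  def writeValue(dos: DataOutputStream, value: Any): Unit = value match {
    case null => dos.writeByte('n')
    case i: Int => dos.writeByte('i'); dos.writeInt(i)
    case d: Double => dos.writeByte('d'); dos.writeDouble(d)
    case b: Boolean => dos.writeByte('b'); dos.writeBoolean(b)
    case s: String => dos.writeByte('c'); dos.writeUTF(s)
    case arr: Array[_] =>
      dos.writeByte('l')
      dos.writeInt(arr.length)
      arr.foreach(elem => writeValue(dos, elem))
    case other =>
      throw new IllegalArgumentException(s"Unsupported type: ${other.getClass.getName}")
  }

  // Convenience helper: serialize a value into a byte array.
  def encode(value: Any): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val dos = new DataOutputStream(bos)
    writeValue(dos, value)
    dos.flush()
    bos.toByteArray
  }
}
```

A call such as NestedArraySketch.encode(Array(Array(1, 2), Array(3))) walks both levels of nesting with the same code path, so no per-depth special cases are needed.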

SparkQA · 2015-08-18T15:34:46Z

Test build #41127 has finished for PR 8276 at commit 6293b2c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2015-08-18T17:57:37Z

Thanks @sun-rui I'll take a look at this today

cc @davies

davies · 2015-08-18T18:11:59Z

core/src/main/scala/org/apache/spark/api/r/SerDe.scala

@@ -210,22 +213,31 @@ private[spark] object SerDe {
      writeType(dos, "void")
    } else {
      value.getClass.getName match {
+        case "java.lang.Character" =>


Is this needed?

Not sure. It's just for completeness in handling primitive types.

davies · 2015-08-18T18:35:54Z

@sun-rui Generally, the changes look good to me. Could you add unit tests for ArrayType? Do we want to support creating a DataFrame from ArrayType (could be another PR)?

sun-rui · 2015-08-19T00:57:50Z

@davies, that will be done in another PR.

shivaram · 2015-08-19T02:20:08Z

@sun-rui We can do createDataFrame in another PR, but for this PR can we add a test case using a JSON file which has an ArrayType in it?

SparkQA · 2015-08-19T04:11:05Z

Test build #41188 has finished for PR 8276 at commit 7dea6fb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2015-08-19T04:17:22Z

BTW, does this fix SPARK-9302 as well, or does that require struct support?

sun-rui · 2015-08-19T07:38:10Z

@shivaram, ArrayType in a DataFrame is still not supported at this point. Because the runtime class of an ArrayType value is something like scala.collection.mutable.WrappedArray$ofRef, it is passed to the R side as a jobj. Some conversion is needed to convert it to a Java array so that SerDe can pass its content to the R side. I planned to do it in another PR as https://issues.apache.org/jira/browse/SPARK-10049. Do you want me to do it in this PR?

As for SPARK-9302, it needs support for both ArrayType and StructType. Once these two types are supported, it should be fixed.
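
As a rough illustration of the kind of conversion mentioned above, here is a minimal sketch that recursively turns Seq values (which include WrappedArray) into arrays. The object and method names are hypothetical and this is not the code from SPARK-10049.

```scala
// Hypothetical helper; not part of Spark.
object SeqToArraySketch {
  // Recursively convert any Seq (e.g. a WrappedArray) into an Array[Any],
  // so that nested sequences become nested arrays. Other values pass through.
  def toJavaArray(obj: Any): Any = obj match {
    case seq: Seq[_] => seq.map(toJavaArray).toArray[Any]
    case other => other
  }
}
```

With this shape, SeqToArraySketch.toJavaArray(Seq(Seq(1, 2), Seq(3))) yields an outer array whose elements are themselves arrays, while a plain value such as a String or an Int is returned unchanged.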

shivaram · 2015-08-19T08:34:29Z

I see. I think a separate PR for ArrayType is fine; it's just harder to review this change if we can't test it. It's OK, I will take one more closer look tomorrow.

SparkQA · 2015-08-19T09:20:33Z

Test build #41220 timed out for PR 8276 at commit 63def4c after a configured wait of 175m.

sun-rui · 2015-08-19T12:20:22Z

@shivaram, I tried to support ArrayType by adding code like:

    // Convert Seq[Any] to Array[Any]
    val value =
      if (obj.isInstanceOf[Seq[Any]]) {
        obj.asInstanceOf[Seq[Any]].toArray
      } else {
        obj
      }

With this, I can successfully collect ArrayType on the R side.
However, this conflicts with the listToSeq() R util function, because listToSeq() expects to get a jobj referencing the Seq, while SerDe passes back the content of the Seq instead of a jobj. So I think we need some mechanism to tell the RBackend whether we want a jobj or its content.

Adding support for ArrayType would make this PR big and hard to review. It would be better if you could look at this PR and merge it.

I know you have a concern as we don't have test cases for this PR. Maybe I can add test cases for SerDe:
Add a function called Echo() in RBackend for test purposes only. It passes its args back to the R side. The args could be an array, a nested array, etc.
What do you think?

shivaram · 2015-08-19T18:03:40Z

It's fine. Let's not complicate this PR with the listToSeq thing, as that has its own issues like you mention.
I think just adding unit tests to SerDe is a good idea in general, and we can either have this Echo function or just try to use identity in Scala if that works.

shivaram · 2015-08-19T20:42:43Z

R/pkg/R/DataFrame.R

+              if (nrow <= 0) {
+                df <- data.frame()
+              } else {
+                df <- data.frame(row.names = c(1 : nrow))                


I guess 1:nrow is enough here? (No need for c().)

you are correct

shivaram · 2015-08-19T21:02:31Z

I took a more detailed look at the code and I only had some minor comments inline. So I think it looks pretty good but I think taking this opportunity to add some tests to SerDe is a good idea.

sun-rui · 2015-08-20T02:12:21Z

will add test cases.

SparkQA · 2015-08-20T05:47:49Z

Test build #41301 has finished for PR 8276 at commit edc9652.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2015-08-20T07:21:30Z

sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala

-      val obj: Object = row(idx).asInstanceOf[Object]
-      SerDe.writeObject(dos, obj)
-    }
+    val cols = (0 until row.length).map { idx => row(idx).asInstanceOf[Object]}.toArray


We should have one extra space after [Object] and before }. In fact, if the whole map fits on one line, you can just use map(idx => ...).toArray.

sun-rui · 2015-08-24T07:00:51Z

@shivaram, test cases for SerDe added. Currently SerDe does not support transferring a list with elements of different types from the R side to the JVM side. Let's leave that for a future PR.

SparkQA · 2015-08-24T09:12:36Z

Test build #41445 has finished for PR 8276 at commit 9bfa62d.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2015-08-24T09:53:27Z

Test build #41446 has finished for PR 8276 at commit 3c56872.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

sun-rui · 2015-08-24T14:06:17Z

rebased to master

SparkQA · 2015-08-24T17:09:05Z

Test build #41453 timed out for PR 8276 at commit fd0d086 after a configured wait of 175m.

shivaram · 2015-08-24T17:17:05Z

Jenkins, retest this please

SparkQA · 2015-08-24T19:44:15Z

Test build #41462 has finished for PR 8276 at commit fd0d086.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2015-08-24T20:00:29Z

Jenkins, retest this please

SparkQA · 2015-08-24T22:49:15Z

Test build #41469 has finished for PR 8276 at commit fd0d086.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2015-08-24T23:28:54Z

@sun-rui looks like the test failures are related to this PR

sun-rui · 2015-08-25T01:47:58Z

The tests passed on my machine, so I don't know the reason for the failure. Anyway, I added Spark context initialization to test_Serde to see if it passes on Jenkins.

shivaram · 2015-08-25T03:21:20Z

R/pkg/inst/tests/test_Serde.R

+  x <- list(list(1L, 2L, 3L), list(1, 2, 3),
+            list(TRUE, FALSE), list("a", "b", "c"))
+  y <- callJStatic("SparkRHandler", "echo", x)
+  expect_equal(x, y)


Could we also add some tests with empty columns / empty lists (as we have some code paths just to handle these)?

SparkQA · 2015-08-25T04:48:45Z

Test build #41504 timed out for PR 8276 at commit 9025c9f after a configured wait of 175m.

SparkQA · 2015-08-25T08:01:07Z

Test build #41518 has finished for PR 8276 at commit 0d82eae.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2015-08-25T17:03:22Z

Jenkins, retest this please

shivaram · 2015-08-25T17:03:41Z

Thanks @sun-rui -- Change LGTM.

@davies Any other comments?

davies · 2015-08-25T17:59:41Z

LGTM

SparkQA · 2015-08-25T20:07:50Z

Test build #41537 has finished for PR 8276 at commit eae3341.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2015-08-25T20:12:15Z

Alright, I'm merging this to master. Note that I'm not porting this to branch-1.5, as I think this is a relatively big change and it doesn't have any immediate feature benefits.

davies reviewed Aug 18, 2015

shivaram reviewed Aug 19, 2015

shivaram reviewed Aug 20, 2015

Sun Rui added 7 commits August 24, 2015 21:43

[SPARK-10048][SPARKR] Support arbitrary nested Java array in serde.

9f15f24

Improve collect() to hold data of complex types.

a5c11d9

Remove unuseful readCol() function.

f3239eb

Improve SerDe for conversion between DataFrame and RDD.

0ddc8e2

Address comments.

42286d1

Add test cases for SerDe.

d6b739a

Forgot the test file for SerDe.

fd0d086

sun-rui force-pushed the SPARK-10048 branch from 3c56872 to fd0d086 Compare August 24, 2015 14:04

Add spark context initialization into test_Serde.R.

9025c9f

shivaram reviewed Aug 25, 2015

Add test cases for empty lists in test_Serde.

0d82eae

Fix coding style.

eae3341

asfgit closed this in 71a138c Aug 25, 2015
