Skip to content

[SPARK-10048][SPARKR] Support arbitrary nested Java array in serde. #8276

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

sun-rui
Copy link
Contributor

@sun-rui sun-rui commented Aug 18, 2015

This PR:

  1. supports transferring arbitrary nested array from JVM to R side in SerDe;
  2. based on 1, collect() implemenation is improved. Now it can support collecting data of complex types
    from a DataFrame.

@SparkQA
Copy link

SparkQA commented Aug 18, 2015

Test build #41127 has finished for PR 8276 at commit 6293b2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

Thanks @sun-rui I'll take a look at this today

cc @davies

@@ -210,22 +213,31 @@ private[spark] object SerDe {
writeType(dos, "void")
} else {
value.getClass.getName match {
case "java.lang.Character" =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. Just for completeness of handling of primitive types.

@davies
Copy link
Contributor

davies commented Aug 18, 2015

@sun-rui Generally, the changes looks good to me, could you add unit tests for ArrayType? Do we want to support create create DataFrame from ArrayType (could be another PR).

@sun-rui
Copy link
Contributor Author

sun-rui commented Aug 19, 2015

@davies , that will be done in another PR.

@shivaram
Copy link
Contributor

@sun-rui We can do the createDataFrame in another PR, but for this PR can we add a test case using a JSON file which has an ArrayType in it ?

@SparkQA
Copy link

SparkQA commented Aug 19, 2015

Test build #41188 has finished for PR 8276 at commit 7dea6fb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

BTW does this fix SPARK-9302 as well or does that require struct support ?

@sun-rui
Copy link
Contributor Author

sun-rui commented Aug 19, 2015

@shivaram, now ArrayType in a DataFrame is still not supported, as ArrayType's class is something like scala.collection.mutable.WrappedArray$ofRef, it will be passed as a jobj to R side. So some conversion needs to be done to covert it to Java Array, so that SerDe can pass its content to R side. I planned to do it in another PR as https://issues.apache.org/jira/browse/SPARK-10049, do you want me to do it in this PR?

As for SPARK-9302, it needs support for both ArrayType and StructType. Once these two types are supported, it should be fixed.

@shivaram
Copy link
Contributor

I see. I think a separate PR for ArrayType is fine, its just harder to review this change if we can't test it. Its ok, I will take one more closer look tomorrow

@SparkQA
Copy link

SparkQA commented Aug 19, 2015

Test build #41220 timed out for PR 8276 at commit 63def4c after a configured wait of 175m.

@sun-rui
Copy link
Contributor Author

sun-rui commented Aug 19, 2015

@shivaram, I tried to support ArrayType. By adding code like:
// Convert Seq[Any] to Array[Any]
val value =
if (obj.isInstanceOf[Seq[Any]]) {
obj.asInstanceOf[Seq[Any]].toArray
} else {
obj
}
I can collect successfully ArrayType on R side.
However, this confilicts with listToSeq() R util functions, because listToSeq() expects to get a jobj to the seq, while SerDe passes back the content of seq instead of jobj. So I think we need some mechanism telling the RBackend what we want a jobj or its content?

Adding support for ArrayType would make this PR big and hard to review. I would be better you could look at this PR and merge it.

I know you have a concern as we don't have test cases for this PR. Maybe I can add test cases for SerDe:
Add a function called Echo() in RBackend, it is for test-only purpose. It can pass args back to R side. args could be array, nested array, etc...
What do you think?

@shivaram
Copy link
Contributor

Its fine. Lets not complicate this PR with the listToSeq thing as that has its own issues like you mention.
I think just adding unit tests to SerDe is a good idea in general and we can either have this Echo function or just try to use identity in Scala if that works.

if (nrow <= 0) {
df <- data.frame()
} else {
df <- data.frame(row.names = c(1 : nrow))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess 1:nrow is enough here ? (no need for c())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are correct

@shivaram
Copy link
Contributor

I took a more detailed look at the code and I only had some minor comments inline. So I think it looks pretty good but I think taking this opportunity to add some tests to SerDe is a good idea.

@sun-rui
Copy link
Contributor Author

sun-rui commented Aug 20, 2015

will add test cases.

@SparkQA
Copy link

SparkQA commented Aug 20, 2015

Test build #41301 has finished for PR 8276 at commit edc9652.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val obj: Object = row(idx).asInstanceOf[Object]
SerDe.writeObject(dos, obj)
}
val cols = (0 until row.length).map { idx => row(idx).asInstanceOf[Object]}.toArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have one extra space after [Object] before }. In fact if the whole map fits on one line you can just use map(idx => ...).toArray

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@sun-rui
Copy link
Contributor Author

sun-rui commented Aug 24, 2015

@shivaram , test cases for SerDe added. Now the SerDe does not support transferring a list of different element types from R side to JVM side. Let's leave it for future's PR.

@SparkQA
Copy link

SparkQA commented Aug 24, 2015

Test build #41445 has finished for PR 8276 at commit 9bfa62d.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 24, 2015

Test build #41446 has finished for PR 8276 at commit 3c56872.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@sun-rui
Copy link
Contributor Author

sun-rui commented Aug 24, 2015

rebased to master

@SparkQA
Copy link

SparkQA commented Aug 24, 2015

Test build #41453 timed out for PR 8276 at commit fd0d086 after a configured wait of 175m.

@shivaram
Copy link
Contributor

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Aug 24, 2015

Test build #41462 has finished for PR 8276 at commit fd0d086.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Aug 24, 2015

Test build #41469 has finished for PR 8276 at commit fd0d086.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

@sun-rui looks like the test failures are related to this PR

@sun-rui
Copy link
Contributor Author

sun-rui commented Aug 25, 2015

The test passed on my machine, I don't know the reason. Anyway, add spark context initialization into test_Serde to see if it can pass on Jenkins.

x <- list(list(1L, 2L, 3L), list(1, 2, 3),
list(TRUE, FALSE), list("a", "b", "c"))
y <- callJStatic("SparkRHandler", "echo", x)
expect_equal(x, y)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add some tests with empty columns / empty lists (as we have some code paths just to handle these)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

@SparkQA
Copy link

SparkQA commented Aug 25, 2015

Test build #41504 timed out for PR 8276 at commit 9025c9f after a configured wait of 175m.

@SparkQA
Copy link

SparkQA commented Aug 25, 2015

Test build #41518 has finished for PR 8276 at commit 0d82eae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

Jenkins, retest this please

@shivaram
Copy link
Contributor

Thanks @sun-rui -- Change LGTM.

@davies Any other comments ?

@davies
Copy link
Contributor

davies commented Aug 25, 2015

LGTM

@SparkQA
Copy link

SparkQA commented Aug 25, 2015

Test build #41537 has finished for PR 8276 at commit eae3341.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

Alright I'm merging this to master. Note that I'm not porting this branch-1.5 as I think this is a relatively big change and it doesn't have any immediate feature benefits.

@asfgit asfgit closed this in 71a138c Aug 25, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants