[SPARK-29347][SQL] Add JSON serialization for external Rows #26013
Conversation
This is to be used for observable metrics where the `StreamingQueryProgress` contains a map of observed metrics rows which need to be serialized in some cases. Added a new test suite: `RowJsonSuite` that should test this.
Test build #111740 has finished for PR 26013 at commit
@@ -501,4 +513,88 @@ trait Row extends Serializable {
  private def getAnyValAs[T <: AnyVal](i: Int): T =
    if (isNullAt(i)) throw new NullPointerException(s"Value at index $i is null")
    else getAs[T](i)

  /** The compact JSON representation of this row. */
  def json: String = compact(jsonValue)
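For context, a minimal sketch of how the new method might be called, assuming a running SparkSession named `spark`; the column names and printed output are illustrative only, not taken from the PR:

import org.apache.spark.sql.Row

// Rows collected from a DataFrame carry their schema, which json needs.
val rows: Array[Row] = spark.range(3)
  .selectExpr("id", "concat('user_', id) AS name")
  .collect()

rows.foreach { row =>
  println(row.json)        // e.g. {"id":0,"name":"user_0"}
  println(row.prettyJson)  // same content, pretty-printed
}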
@hvanhovell, how about reusing JacksonGenerator in our JSON datasource?
There's a pretty option for prettyJson too.
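For reference, both variants render the same json4s AST; roughly, using the json4s-jackson methods Spark already depends on:

import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods.{compact, pretty, render}

val value = ("id" -> 1) ~ ("name" -> "a")
compact(render(value))  // {"id":1,"name":"a"}
pretty(render(value))   // the same JSON, spread over indented lines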
Ah, right, schema can be unknown ..
Well, you still need the schema. The main reason for not using JacksonGenerator is that we would need to convert back to an internal row, and this is super slow.
Hm, this API already looks pretty slow though, and I suspect it should not be called in a critical path ..?
If it's supposed to be used in a critical path, we might rather have to provide an API that makes a conversion function given a schema (so that we avoid type dispatch for every row).
One rather minor concern is that the JSON representation for a row seems different compared to the JSON datasource, e.g. https://github.com/apache/spark/pull/26013/files#r331463832 and https://github.com/apache/spark/pull/26013/files#diff-78ce4e47d137bbb0d4350ad732b48d5bR576-R578, and the code here duplicates it a bit ..
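A rough sketch of the kind of schema-bound API being suggested here; makeJsonConverter and everything inside it are hypothetical names, not part of the PR, and only a few DataTypes are handled for brevity:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods.compact

// Hypothetical: resolve the value -> JValue conversion per field up front,
// so the DataType dispatch happens once per schema instead of once per row.
def makeJsonConverter(schema: StructType): Row => String = {
  val fieldConverters: Array[Any => JValue] = schema.fields.map { f =>
    f.dataType match {
      case StringType  => (v: Any) => JString(v.asInstanceOf[String])
      case IntegerType => (v: Any) => JInt(BigInt(v.asInstanceOf[Int]))
      case LongType    => (v: Any) => JLong(v.asInstanceOf[Long])
      case DoubleType  => (v: Any) => JDouble(v.asInstanceOf[Double])
      case BooleanType => (v: Any) => JBool(v.asInstanceOf[Boolean])
      case _           => (v: Any) => JString(String.valueOf(v)) // fallback for brevity
    }
  }
  val names = schema.fieldNames

  (row: Row) => {
    val fields = names.indices.map { i =>
      names(i) -> (if (row.isNullAt(i)) JNull else fieldConverters(i)(row.get(i)))
    }
    compact(JObject(fields.toList))
  }
}

A caller could then build the converter once per schema and reuse it for every row it needs to serialize.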
So two things to consider here.
I want to use this in StreamingQueryProgress right? All the JSON serialization there is based on the json4s AST and not strings (which is what JacksonGenerator produces).
There is a difference between it being slow and what you are suggesting; the latter would be crazy inefficient. Let's break that down:
- Row to InternalRow conversion. You will need to create a converter per row because there is currently no way we can safely cache a converter. You can either use ScalaReflection or RowEncoder here; the latter is particularly bad because it uses code generation (which takes on the order of milliseconds and which is weakly cached on the driver).
- Setting up the JacksonGenerator; again, this is uncached and we need to set up the same thing for each tuple.
- Generating the string.
Do you see my point here? Or shall I write a benchmark?
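To make that cost concrete, the rejected path would look roughly like the sketch below (using CatalystTypeConverters in place of ScalaReflection/RowEncoder for brevity). These are internal Catalyst classes, several of them private[sql], so this only compiles inside Spark's own org.apache.spark.sql packages, and their constructors have shifted between Spark versions; treat it as an illustration of where the per-row work goes, not as user code:

import java.io.CharArrayWriter

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JSONOptions}
import org.apache.spark.sql.types.StructType

def rowToJsonViaJackson(row: Row, schema: StructType): String = {
  // 1. Row -> InternalRow: the converter is rebuilt on every call because
  //    there is no safe place on Row to cache it.
  val toCatalyst = CatalystTypeConverters.createToCatalystConverter(schema)
  val internalRow = toCatalyst(row).asInstanceOf[InternalRow]

  // 2. JacksonGenerator setup: also repeated for every single tuple.
  val writer = new CharArrayWriter()
  val gen = new JacksonGenerator(schema, writer, new JSONOptions(Map.empty, "UTC"))

  // 3. Generate the string.
  gen.write(internalRow)
  gen.flush()
  writer.toString
}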
There's one API case where we dropped a performance improvement in Row, as an example (see #23271).
@deprecated("This method is deprecated and will be removed in future versions.", "3.0.0")
def merge(rows: Row*): Row = {
  // TODO: Improve the performance of this if used in performance critical part.
  new GenericRow(rows.flatMap(_.toSeq).toArray)
}
Do you mind if I ask to add @Unstable or @Private to these new APIs, just in case of future improvements, with @since in the Scaladoc? Row itself is marked as @Stable, so it might be better to explicitly note that this can be changed in the future. With this, LGTM.
to be used for observable metrics where the StreamingQueryProgress ...
Is it the only purpose of the new methods? If so, maybe those methods should be put into a separate utils object outside of the general Row? How many Spark users are interested in those functions?
@MaxGekk while the immediate reason is observable metrics, there surely is a different use case to be found here. I prefer not to hide these things somewhere if we can also add them to the class itself.
      iteratorToJsonArray(a.iterator, elementType)
    case (s: Seq[_], ArrayType(elementType, _)) =>
      iteratorToJsonArray(s.iterator, elementType)
    case (m: Map[String @unchecked, _], MapType(StringType, valueType, _)) =>
Is it really worth having a special format for string-keyed maps?
The reason would be that it emits more readable JSON. This is similar to the way StreamingQueryProgress renders maps. I can revert if you feel strongly about this.
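To illustrate the readability argument, a small json4s sketch of the two renderings; the fallback shape shown for non-string keys is an assumption about what a generic representation would look like, not a quote of the PR:

import org.json4s._
import org.json4s.jackson.JsonMethods.compact

// String-keyed map: keys can become JSON object fields directly.
val stringKeyed = Map("min" -> 1L, "max" -> 10L)
compact(JObject(stringKeyed.toList.map { case (k, v) => k -> JLong(v) }))
// roughly: {"min":1,"max":10}

// Non-string keys cannot be object fields, so a generic rendering has to fall
// back to something like an array of key/value entries.
val intKeyed = Map(1 -> "a", 2 -> "b")
compact(JArray(intKeyed.toList.map { case (k, v) =>
  JObject("key" -> JInt(BigInt(k)), "value" -> JString(v))
}))
// roughly: [{"key":1,"value":"a"},{"key":2,"value":"b"}]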
Do we need to convert the JSON string back to a Row? If we do, then I think it's better to keep the ser/de simple. If not, I'm fine with the code here.
In its current form it is not really meant to be converted back.
Can other primitive types like Int be good for this format too?
.add("c1", "string") | ||
.add("c2", IntegerType) | ||
|
||
private def testJson(name: String, value: Any, dt: DataType, expected: JValue): Unit = { |
nit: wrong indentation
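For readers without the diff expanded, a guess at how a helper with that signature might be written and used inside the suite; the body is an assumption about intent rather than the actual test code, and GenericRowWithSchema and jsonValue are internal names only visible from Spark's own sql packages:

import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._
import org.json4s._

private def testJson(name: String, value: Any, dt: DataType, expected: JValue): Unit = {
  test(name) {
    // Build a one-column row carrying its schema, serialize it, and compare
    // the resulting json4s AST against the expected value.
    val schema = new StructType().add("a", dt)
    val row = new GenericRowWithSchema(Array(value), schema)
    assert(row.jsonValue === JObject("a" -> expected))
  }
}

// e.g.
testJson("int column", 1, IntegerType, JInt(1))
testJson("string column", "spark", StringType, JString("spark"))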
LGTM, we probably need to wait a few days until Jenkins is back online.
retest this please
Test build #112026 has finished for PR 26013 at commit
Test build #112039 has finished for PR 26013 at commit
Merging to master
Thanks @hvanhovell. LGTM too.
What changes were proposed in this pull request?
This PR adds JSON serialization for Spark external Rows.
Why are the changes needed?
This is to be used for observable metrics where the StreamingQueryProgress contains a map of observed metrics rows which need to be serialized in some cases.

Does this PR introduce any user-facing change?
Yes, a user can call toJson on rows returned when collecting a DataFrame to the driver.

How was this patch tested?
Added a new test suite: RowJsonSuite that should test this.