
[SPARK-41993][SQL] Move RowEncoder to AgnosticEncoders #39517


Closed
hvanhovell wants to merge 16 commits into master from hvanhovell/SPARK-41993

Conversation

hvanhovell
Contributor

What changes were proposed in this pull request?

This PR makes RowEncoder produce an AgnosticEncoder. The expression generation for these encoders is moved to ScalaReflection (this will be moved out in a subsequent PR).

The generated serializer and deserializer expressions will change slightly for both schema and type based encoders. These are not semantically different from the old expressions. Concretely, the following changes have been introduced:

  • There is more type validation in maps/arrays/seqs for type based encoders. This should be a positive change, since it prevents users from passing wrong data through erasure hacks.
  • Array/Seq serialization is a bit stricter. Previously it was possible to pass in sequences/arrays with the wrong element type and/or nullability.
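
A minimal, hypothetical sketch of what "RowEncoder produces an AgnosticEncoder" means in practice: the schema is first translated into a tree of encoder descriptors, and a separate component (ScalaReflection in this PR) turns those descriptors into serializer/deserializer expressions. The names below (MiniEncoder, fromDataType, etc.) are illustrative only, not the actual Spark API.

```scala
import org.apache.spark.sql.types._

// Illustrative descriptor tree; the real AgnosticEncoder hierarchy is richer.
sealed trait MiniEncoder { def dataType: DataType }
case class PrimitiveEnc(dataType: DataType) extends MiniEncoder
case class ArrayEnc(element: MiniEncoder, containsNull: Boolean) extends MiniEncoder {
  def dataType: DataType = ArrayType(element.dataType, containsNull)
}
case class MapEnc(key: MiniEncoder, value: MiniEncoder, valueContainsNull: Boolean) extends MiniEncoder {
  def dataType: DataType = MapType(key.dataType, value.dataType, valueContainsNull)
}
case class RowEnc(fields: Seq[(StructField, MiniEncoder)]) extends MiniEncoder {
  def dataType: DataType = StructType(fields.map(_._1))
}

// Recursively derive a descriptor tree from a schema; turning the tree into
// serializer/deserializer expressions is a separate step.
def fromDataType(dt: DataType): MiniEncoder = dt match {
  case s: StructType =>
    RowEnc(s.fields.toSeq.map(f => f -> fromDataType(f.dataType)))
  case ArrayType(element, containsNull) =>
    ArrayEnc(fromDataType(element), containsNull)
  case MapType(key, value, valueContainsNull) =>
    MapEnc(fromDataType(key), fromDataType(value), valueContainsNull)
  case other =>
    PrimitiveEnc(other)
}
```

Keeping the descriptor tree free of expression logic is what allows other components (such as the Spark Connect Scala client mentioned below) to reuse the same encoders.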

Why are the changes needed?

For the Spark Connect Scala Client we also want to be able to use Row based results.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This is a refactoring, existing tests should be sufficient.

@hvanhovell hvanhovell requested a review from cloud-fan January 11, 2023 20:34
@github-actions github-actions bot added the SQL label Jan 11, 2023
@hvanhovell
Contributor Author

@cloud-fan can you take a look?

@hvanhovell
Contributor Author

A note for the reviewers. I know that Catalyst tests pass. I have not run other tests, so there might still be a few things to iron out.

@@ -306,7 +330,7 @@ object ScalaReflection extends ScalaReflection {
* input object is located at ordinal 0 of a row, i.e., `BoundReference(0, _)`.
*/
def serializerFor(enc: AgnosticEncoder[_]): Expression = {
Contributor Author

TODO check the generated code for boxed primitives. We might be doing double conversions there.

propagateNull = false,
returnNullable = false)
exprs.If(
check,
Contributor Author

We can widen this to arrays where the element is allowed to be null. In that case we do need to make sure the element type is sound.
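A hypothetical, non-catalyst sketch of the check being discussed (String is an assumed element type, and validateElements is made up for illustration): if the check is widened to arrays whose elements may be null, every non-null element still has to be validated against the expected element type.

```scala
// Hypothetical helper, not the generated catalyst expression tree.
def validateElements(values: Seq[Any], elementNullable: Boolean): Unit = values.foreach {
  case null if elementNullable => () // allowed once the check is widened to nullable elements
  case null =>
    throw new NullPointerException("null element in a non-nullable array")
  case _: String => ()               // element type is sound
  case other =>
    throw new RuntimeException(s"${other.getClass.getName} is not a valid element type")
}
```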

@@ -125,7 +125,7 @@ class RowEncoderSuite extends CodegenInterpretedPlanTest {
new StructType()
.add("mapOfIntAndString", MapType(IntegerType, StringType))
.add("mapOfStringAndArray", MapType(StringType, arrayOfString))
.add("mapOfArrayAndInt", MapType(arrayOfString, IntegerType))
Contributor

arrayOfString doesn't work anymore inside a map?

Contributor Author

That is a mistake. My bad :)

walkedTypePath)
expressionWithNullSafety(deserializer, enc.nullable, walkedTypePath)
enc match {
case RowEncoder(fields) =>
Contributor

what encoder do we create for the inner struct? how is it different from the root RowEncoder?

Contributor Author
@hvanhovell hvanhovell Jan 13, 2023

We create one with null checks. This one does not need them because we always return a Row (the top-level row always exists).
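
A hypothetical illustration of the difference, written as plain Scala over Row rather than the generated catalyst expressions (the helper names are made up): the top-level Row always exists, so its deserializer needs no null guard, whereas a nested struct value can be null and must be guarded.

```scala
import org.apache.spark.sql.Row

// Assumed helper that turns a struct row into some external value.
def deserializeStruct(row: Row): Map[String, Any] =
  (0 until row.length).map(i => s"col$i" -> row.get(i)).toMap

// Root encoder: the incoming row is never null, so deserialize directly.
def deserializeTopLevel(row: Row): Map[String, Any] =
  deserializeStruct(row)

// Nested encoder: the struct field may be null, so check before deserializing.
def deserializeNested(parent: Row, ordinal: Int): Map[String, Any] =
  if (parent.isNullAt(ordinal)) null
  else deserializeStruct(parent.getStruct(ordinal))
```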

elementNullable: Boolean,
input: Expression,
lenientSerialization: Boolean): Expression = {
// Default serializer for Seq and generic Arrays. This does not work for primitive arrays.
Contributor

hmm, why is the name createSerializerForMapObjects?

Contributor Author

I don't know. It was like that before this.

element: AgnosticEncoder[E])
element: AgnosticEncoder[E],
containsNull: Boolean,
override val lenientSerialization: Boolean)
Contributor

what does lenient mean for an IterableEncoder?

Contributor Author

It means we allow a Seq, a generic Array, or a primitive array as input for serialization.

Contributor

Can we leave a code comment to mention it? It's not that obvious compared to DateEncoder.

Contributor Author

yeah will do. TBH I was quite surprised by it.
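
To make the lenient behaviour concrete, a rough sketch (the normalize helper is hypothetical, not Spark code): with lenient serialization the same collection field accepts a Seq, a generic object array, or a primitive array as input.

```scala
// Hypothetical helper; illustrates the three accepted input shapes.
def normalize(input: Any): Seq[Any] = input match {
  case s: Seq[_]   => s        // Seq input, e.g. List(1, 2, 3)
  case a: Array[_] => a.toSeq  // generic Array[AnyRef] or a primitive array such as Array[Int]
  case other =>
    throw new IllegalArgumentException(s"Unsupported input type: ${other.getClass.getName}")
}

normalize(Seq("a", "b"))   // Seq
normalize(Array("a", "b")) // generic (object) array
normalize(Array(1, 2, 3))  // primitive Int array
```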

externalDataTypeFor(enc, lenientSerialization = false)
}

private[catalyst] def lenientExternalDataTypeFor(enc: AgnosticEncoder[_]): DataType =
Contributor

Suggested change
private[catalyst] def lenientExternalDataTypeFor(enc: AgnosticEncoder[_]): DataType =
private[catalyst] def lenientExternalDataTypeFor(enc: AgnosticEncoder[_]): DataType =

assert(serializer.isInstanceOf[NewInstance])
assert(serializer.asInstanceOf[NewInstance]
.cls.isAssignableFrom(classOf[org.apache.spark.sql.catalyst.util.GenericArrayData]))
assert(serializer.isInstanceOf[MapObjects])
Contributor

is MapObjects better than NewInstance for creating List[Int]?

@cloud-fan
Contributor

Only 2 minor comments, thanks, merging to master!

@cloud-fan cloud-fan closed this in 2d4be52 Jan 16, 2023
cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request Jan 16, 2023

Closes apache#39517 from hvanhovell/SPARK-41993.

Authored-by: Herman van Hovell <herman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Member

@dongjoon-hyun dongjoon-hyun left a comment


@hvanhovell and @cloud-fan.

Unfortunately, this breaks the Scala 2.13 Kafka SQL module with a SparkRuntimeException during encoding. I verified that the failures are gone after reverting.

Cause: org.apache.spark.SparkRuntimeException:
Error while encoding: java.lang.RuntimeException:
scala.collection.mutable.ArraySeq$ofRef is not a valid external type for schema of array<struct<key:string,value:binary>>
     [info] *** 1 SUITE ABORTED ***
     [info] *** 8 TESTS FAILED ***
     [error] Failed tests:
     [error] 	org.apache.spark.sql.kafka010.KafkaRelationSuiteV1
     [error] 	org.apache.spark.sql.kafka010.KafkaMicroBatchV2SourceSuite
     [error] 	org.apache.spark.sql.kafka010.KafkaMicroBatchV1SourceWithAdminSuite
     [error] 	org.apache.spark.sql.kafka010.KafkaMicroBatchV1SourceSuite
     [error] 	org.apache.spark.sql.kafka010.KafkaRelationSuiteWithAdminV1
     [error] 	org.apache.spark.sql.kafka010.KafkaSinkBatchSuiteV1
     [error] 	org.apache.spark.sql.kafka010.KafkaMicroBatchV2SourceWithAdminSuite
     [error] Error during tests:
     [error] 	org.apache.spark.sql.kafka010.KafkaContinuousSourceSuite
     [error] (sql-kafka-0-10 / Test / test) sbt.TestsFailedException: Tests unsuccessful
     [error] Total time: 2513 s (41:53), completed Jan 16, 2023, 10:07:56 PM

The failures are massive across multiple suites and look tricky. Let me revert this first because we have a branch cut scheduled today.

cc @HyukjinKwon, @xinrong-meng (3.4.0 release manager).
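
For context, a short sketch of where this class typically comes from on Scala 2.13 (whether this matches the exact Kafka code path is an assumption): wrapping an Array as a scala.collection.Seq goes through Predef.wrapRefArray, which yields a mutable ArraySeq, and a stricter external-type check for array<...> columns can reject it.

```scala
// Scala 2.13: implicit wrapping of an object array as scala.collection.Seq
// produces a mutable ArraySeq (it was WrappedArray on 2.12).
val records = Array(("key1", "value1".getBytes("UTF-8")))
val wrapped: scala.collection.Seq[(String, Array[Byte])] = records
println(wrapped.getClass.getName) // scala.collection.mutable.ArraySeq$ofRef on 2.13
```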

@hvanhovell
Contributor Author

@dongjoon-hyun I am looking at it now. I am not sure "massive" is the qualification I would use; all of these are likely to be caused by the same thing. Feel free to revert if you have to.

@dongjoon-hyun
Member

Thank you, @hvanhovell .
