[SPARK-12477][SQL] - Tungsten projection fails for null values in array fields #10429

pierre-borckmans · 2015-12-22T08:48:59Z

Accessing null elements in an array field fails when Tungsten is enabled.
It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled.

This PR fixes the failure by adding a null check on the accessed array element in the generated code.

Example:

// Array of String
case class AS(as: Seq[String])
val dfAS = sc.parallelize(Seq(AS(Seq("a", null, "b")))).toDF
dfAS.registerTempTable("T_AS")
for (i <- 0 to 2) {
  println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))
}

With Tungsten disabled:

0 = [a]
1 = [null]
2 = [b]

With Tungsten enabled:

0 = [a]
15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
    at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
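
For readers who want to see the shape of the check described above, here is a minimal sketch in plain Scala with no Spark dependency. It is not the code Spark generates: `ArrayValue`, `Eval`, and `getArrayItem` are hypothetical names made up for illustration. The idea is that the element lookup reports null when the index is out of range or the element itself is null, so later steps never dereference a null value.

```scala
object NullSafeArrayAccess {
  // Hypothetical stand-in for an array column value; this is not a Spark class.
  final case class ArrayValue(elems: Vector[Any]) {
    def numElements: Int = elems.length
    def isNullAt(i: Int): Boolean = elems(i) == null
    def get(i: Int): Any = elems(i)
  }

  // An evaluation result carries an explicit null flag next to the value.
  final case class Eval(isNull: Boolean, value: Any)

  // The bounds test runs first, so isNullAt is only called on a valid index.
  def getArrayItem(arr: ArrayValue, index: Int): Eval =
    if (index < 0 || index >= arr.numElements || arr.isNullAt(index)) {
      Eval(isNull = true, value = null)
    } else {
      Eval(isNull = false, value = arr.get(index))
    }

  def main(args: Array[String]): Unit = {
    val arr = ArrayValue(Vector("a", null, "b"))
    for (i <- 0 to 2) {
      val r = getArrayItem(arr, i)
      val shown = if (r.isNull) "null" else r.value.toString
      println(i + " = [" + shown + "]")
    }
  }
}
```

Running `main` on the array `("a", null, "b")` prints the same three lines as the "Tungsten disabled" output above, because the null element is flagged instead of being passed on to a writer.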

rxin · 2015-12-22T08:51:00Z

Can you also add a unit test?

Also the title should say SPARK-12477 (with hyphen). Thanks.

pierre-borckmans · 2015-12-22T08:51:48Z

@rxin Sure!

pierre-borckmans · 2015-12-22T08:52:07Z

@rxin Where should it go to be sure?

rxin · 2015-12-22T08:54:08Z

Maybe DataFrameComplexTypeSuite?

pierre-borckmans · 2015-12-22T09:11:45Z

@rxin I added a small test, let me know if more should be added.

pierre-borckmans · 2015-12-22T09:16:25Z

@rxin This PR incidentally also fixes another issue. Accessing a null element in an array of IntegerType erroneously returned 0:

scala> val df = sc.parallelize(Seq((Seq("val1",null,"val2"),Seq(Some(1),None,Some(2))))).toDF("s","i")
df: org.apache.spark.sql.DataFrame = [s: array<string>, i: array<int>]

scala> df.selectExpr("i[1]").collect()(0)
res0: org.apache.spark.sql.Row = [0]

It now correctly returns null:

scala> val df = sc.parallelize(Seq((Seq("val1",null,"val2"),Seq(Some(1),None,Some(2))))).toDF("s","i")
df: org.apache.spark.sql.DataFrame = [s: array<string>, i: array<int>]

scala> df.selectExpr("i[1]").collect()(0)
res0: org.apache.spark.sql.Row = [null]
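
One way to picture why a null integer could surface as 0 is a simplified model, assumed here purely for illustration and not taken from Spark's source: each element has a primitive slot plus a separate null flag, and a read that skips the flag returns whatever the slot holds. `IntArrayData`, `readIgnoringNulls`, and `readNullAware` are hypothetical names.

```scala
object NullFlagDemo {
  // Simplified, hypothetical model of a fixed-width int array: one primitive
  // slot per element plus one null flag per element.
  final case class IntArrayData(values: Array[Int], nulls: Array[Boolean])

  // A missing element leaves the slot at the default value 0 and sets its flag.
  def fromOptions(xs: Seq[Option[Int]]): IntArrayData =
    IntArrayData(xs.map(_.getOrElse(0)).toArray, xs.map(_.isEmpty).toArray)

  // Reads the slot only: a null element comes back as the slot's default, 0.
  def readIgnoringNulls(d: IntArrayData, i: Int): Int = d.values(i)

  // Consults the null flag first, so a null element stays null (None).
  def readNullAware(d: IntArrayData, i: Int): Option[Int] =
    if (d.nulls(i)) None else Some(d.values(i))
}
```

With data built from `Seq(Some(1), None, Some(2))`, `readIgnoringNulls(d, 1)` yields `0` while `readNullAware(d, 1)` yields `None`, which mirrors the before and after results shown above.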

rxin · 2015-12-22T18:42:15Z

cc @nongli can you review this?

nongli · 2015-12-22T19:11:22Z

LGTM

rxin · 2015-12-22T19:20:43Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala

@@ -43,4 +43,12 @@ class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext {
    val df = sparkContext.parallelize(Seq((1, 1))).toDF("a", "b")
    df.select(array($"a").as("s")).select(f(expr("s[0]"))).collect()
  }
+
+  test("Accessing null element in array field") {


Best to add the JIRA ticket here, i.e. "SPARK-12477 accessing null element in array field".

@rxin You mean as the test title or as a comment?

SparkQA · 2015-12-22T19:26:45Z

Test build #2249 has finished for PR 10429 at commit 3c8a795.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

Missing spaces after commas.
Line length exceeds 100 characters.

pierre-borckmans · 2015-12-22T21:40:58Z

@rxin I fixed the test title, and the scala style issues.
I ran dev/scalastyle successfully.

SparkQA · 2015-12-22T21:59:50Z

Test build #2250 has finished for PR 10429 at commit b1fc7e5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2015-12-23T00:08:25Z

test this please

SparkQA · 2015-12-23T01:52:48Z

Test build #48223 has finished for PR 10429 at commit 64f95ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…ay fields

Accessing null elements in an array field fails when tungsten is enabled.
It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled.

This PR solves this by checking if the accessed element in the array field is null, in the generated code.

Example:

```
// Array of String
case class AS( as: Seq[String] )
val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF
dfAS.registerTempTable("T_AS")
for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))}
```

With Tungsten disabled:

```
0 = [a]
1 = [null]
2 = [b]
```

With Tungsten enabled:

```
0 = [a]
15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
    at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
```

Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com>

Closes #10429 from pierre-borckmans/SPARK-12477_Tungsten-Projection-Null-Element-In-Array.

(cherry picked from commit 43b2a63)
Signed-off-by: Reynold Xin <rxin@databricks.com>

rxin · 2015-12-23T07:01:16Z

Thanks - I've merged this in master and branch-1.6 and branch-1.5.

JoshRosen · 2015-12-25T23:09:34Z

It looks like this accidentally broke test compilation in branch-1.5; I'm hotfixing in #10478.

This fixes a test compilation break in branch-1.5; the break was introduced by #10429.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10478 from JoshRosen/SPARK-12477-branch-1.5-compile-fix.

CHECK if element in array field is null

b6a79e7

ADD unit test for accessing null elements in array fields

3c8a795

pierre-borckmans changed the title ~~[SPARK 12477][SQL] Tungsten projection fails for null values in array fields~~ [SPARK-12477][SQL] Tungsten projection fails for null values in array fields Dec 22, 2015

pierre-borckmans changed the title ~~[SPARK-12477][SQL] Tungsten projection fails for null values in array fields~~ [SPARK-12477][SQL] - Tungsten projection fails for null values in array fields Dec 22, 2015

rxin reviewed Dec 22, 2015

pierre-borckmans added 2 commits December 22, 2015 22:36

ADD Jira ticket number in test title

3519ace

FIX scalastyle in test

b1fc7e5

Missing spaces after commas.
Line length exceeds 100 characters.

FIX test sc => sparkContext

64f95ec

asfgit closed this in 43b2a63 Dec 23, 2015

pierre-borckmans deleted the SPARK-12477_Tungsten-Projection-Null-Element-In-Array branch December 23, 2015 13:26

JoshRosen mentioned this pull request Dec 25, 2015

[SPARK-12477][HOTIFX] Fix test compilation in branch-1.5 #10478

Closed
