-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-12477][SQL] - Tungsten projection fails for null values in array fields #10429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12477][SQL] - Tungsten projection fails for null values in array fields #10429
Conversation
Can you also add a unit test? Also the title should say SPARK-12477 (with hyphen). Thanks. |
@rxin Sure! |
@rxin Where should it go to be sure? |
Maybe DataFrameComplexTypeSuite? |
@rxin I added a small test, let me know if more should be added. |
@rxin This PR incidentally also fixes another issue. Accessing a null element in an array of IntegerType erroneously returned 0:
It now correctly returns null:
|
cc @nongli can you review this? |
LGTM |
@@ -43,4 +43,12 @@ class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext { | |||
val df = sparkContext.parallelize(Seq((1, 1))).toDF("a", "b") | |||
df.select(array($"a").as("s")).select(f(expr("s[0]"))).collect() | |||
} | |||
|
|||
test("Accessing null element in array field") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
best to add the JIRA ticket here, i.e. "SPARK-12477 accessing null element in array field"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin You mean as the test title or as a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
title
Test build #2249 has finished for PR 10429 at commit
|
Missing spaces after commas Line length exceeds 100 characters
@rxin I fixed the test title, and the scala style issues. |
Test build #2250 has finished for PR 10429 at commit
|
test this please |
Test build #48223 has finished for PR 10429 at commit
|
…ay fields Accessing null elements in an array field fails when tungsten is enabled. It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled. This PR solves this by checking if the accessed element in the array field is null, in the generated code. Example: ``` // Array of String case class AS( as: Seq[String] ) val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF dfAS.registerTempTable("T_AS") for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))} ``` With Tungsten disabled: ``` 0 = [a] 1 = [null] 2 = [b] ``` With Tungsten enabled: ``` 0 = [a] 15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90) at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ``` Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com> Closes #10429 from pierre-borckmans/SPARK-12477_Tungsten-Projection-Null-Element-In-Array. (cherry picked from commit 43b2a63) Signed-off-by: Reynold Xin <rxin@databricks.com>
…ay fields Accessing null elements in an array field fails when tungsten is enabled. It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled. This PR solves this by checking if the accessed element in the array field is null, in the generated code. Example: ``` // Array of String case class AS( as: Seq[String] ) val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF dfAS.registerTempTable("T_AS") for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))} ``` With Tungsten disabled: ``` 0 = [a] 1 = [null] 2 = [b] ``` With Tungsten enabled: ``` 0 = [a] 15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90) at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ``` Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com> Closes #10429 from pierre-borckmans/SPARK-12477_Tungsten-Projection-Null-Element-In-Array. (cherry picked from commit 43b2a63) Signed-off-by: Reynold Xin <rxin@databricks.com>
Thanks - I've merged this in master and branch-1.6 and branch-1.5. |
It looks like this accidentally broke test compilation in branch-1.5; I'm hotfixing in #10478. |
This fixes a test compilation break in branch-1.5; the break was introduced by #10429. Author: Josh Rosen <joshrosen@databricks.com> Closes #10478 from JoshRosen/SPARK-12477-branch-1.5-compile-fix.
Accessing null elements in an array field fails when tungsten is enabled.
It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled.
This PR solves this by checking if the accessed element in the array field is null, in the generated code.
Example:
With Tungsten disabled:
With Tungsten enabled: