Skip to content

[SPARK-12477][SQL] - Tungsten projection fails for null values in array fields #10429

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

pierre-borckmans
Copy link

Accessing null elements in an array field fails when tungsten is enabled.
It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled.

This PR solves this by checking if the accessed element in the array field is null, in the generated code.

Example:

// Array of String
case class AS( as: Seq[String] )
val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF
dfAS.registerTempTable("T_AS")
for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))}

With Tungsten disabled:

0 = [a]
1 = [null]
2 = [b]

With Tungsten enabled:

0 = [a]
15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
    at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

@rxin
Copy link
Contributor

rxin commented Dec 22, 2015

Can you also add a unit test?

Also the title should say SPARK-12477 (with hyphen). Thanks.

@pierre-borckmans
Copy link
Author

@rxin Sure!

@pierre-borckmans
Copy link
Author

@rxin Where should it go to be sure?

@rxin
Copy link
Contributor

rxin commented Dec 22, 2015

Maybe DataFrameComplexTypeSuite?

@pierre-borckmans
Copy link
Author

@rxin I added a small test, let me know if more should be added.

@pierre-borckmans
Copy link
Author

@rxin This PR incidentally also fixes another issue. Accessing a null element in an array of IntegerType erroneously returned 0:

scala> val df = sc.parallelize(Seq((Seq("val1",null,"val2"),Seq(Some(1),None,Some(2))))).toDF("s","i")
df: org.apache.spark.sql.DataFrame = [s: array<string>, i: array<int>]

scala> df.selectExpr("i[1]").collect()(0)
res0: org.apache.spark.sql.Row = [0]

It now correctly returns null:

scala> val df = sc.parallelize(Seq((Seq("val1",null,"val2"),Seq(Some(1),None,Some(2))))).toDF("s","i")
df: org.apache.spark.sql.DataFrame = [s: array<string>, i: array<int>]

scala> df.selectExpr("i[1]").collect()(0)
res0: org.apache.spark.sql.Row = [null]

@pierre-borckmans pierre-borckmans changed the title [SPARK 12477][SQL] Tungsten projection fails for null values in array fields [SPARK-12477][SQL] Tungsten projection fails for null values in array fields Dec 22, 2015
@pierre-borckmans pierre-borckmans changed the title [SPARK-12477][SQL] Tungsten projection fails for null values in array fields [SPARK-12477][SQL] - Tungsten projection fails for null values in array fields Dec 22, 2015
@rxin
Copy link
Contributor

rxin commented Dec 22, 2015

cc @nongli can you review this?

@nongli
Copy link
Contributor

nongli commented Dec 22, 2015

LGTM

@@ -43,4 +43,12 @@ class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext {
val df = sparkContext.parallelize(Seq((1, 1))).toDF("a", "b")
df.select(array($"a").as("s")).select(f(expr("s[0]"))).collect()
}

test("Accessing null element in array field") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

best to add the JIRA ticket here, i.e. "SPARK-12477 accessing null element in array field"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rxin You mean as the test title or as a comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

title

@SparkQA
Copy link

SparkQA commented Dec 22, 2015

Test build #2249 has finished for PR 10429 at commit 3c8a795.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

pierre-borckmans added 2 commits December 22, 2015 22:36
Missing spaces after commas
Line  length exceeds 100 characters
@pierre-borckmans
Copy link
Author

@rxin I fixed the test title, and the scala style issues.
I ran dev/scalastyle successfully.

@SparkQA
Copy link

SparkQA commented Dec 22, 2015

Test build #2250 has finished for PR 10429 at commit b1fc7e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor

yhuai commented Dec 23, 2015

test this please

@SparkQA
Copy link

SparkQA commented Dec 23, 2015

Test build #48223 has finished for PR 10429 at commit 64f95ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Dec 23, 2015
…ay fields

Accessing null elements in an array field fails when tungsten is enabled.
It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled.

This PR solves this by checking if the accessed element in the array field is null, in the generated code.

Example:
```
// Array of String
case class AS( as: Seq[String] )
val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF
dfAS.registerTempTable("T_AS")
for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))}
```

With Tungsten disabled:
```
0 = [a]
1 = [null]
2 = [b]
```

With Tungsten enabled:
```
0 = [a]
15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
	at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
```

Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com>

Closes #10429 from pierre-borckmans/SPARK-12477_Tungsten-Projection-Null-Element-In-Array.

(cherry picked from commit 43b2a63)
Signed-off-by: Reynold Xin <rxin@databricks.com>
asfgit pushed a commit that referenced this pull request Dec 23, 2015
…ay fields

Accessing null elements in an array field fails when tungsten is enabled.
It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled.

This PR solves this by checking if the accessed element in the array field is null, in the generated code.

Example:
```
// Array of String
case class AS( as: Seq[String] )
val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF
dfAS.registerTempTable("T_AS")
for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))}
```

With Tungsten disabled:
```
0 = [a]
1 = [null]
2 = [b]
```

With Tungsten enabled:
```
0 = [a]
15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
	at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
```

Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com>

Closes #10429 from pierre-borckmans/SPARK-12477_Tungsten-Projection-Null-Element-In-Array.

(cherry picked from commit 43b2a63)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@rxin
Copy link
Contributor

rxin commented Dec 23, 2015

Thanks - I've merged this in master and branch-1.6 and branch-1.5.

@asfgit asfgit closed this in 43b2a63 Dec 23, 2015
@pierre-borckmans pierre-borckmans deleted the SPARK-12477_Tungsten-Projection-Null-Element-In-Array branch December 23, 2015 13:26
@JoshRosen
Copy link
Contributor

It looks like this accidentally broke test compilation in branch-1.5; I'm hotfixing in #10478.

asfgit pushed a commit that referenced this pull request Dec 26, 2015
This fixes a test compilation break in branch-1.5; the break was introduced by #10429.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10478 from JoshRosen/SPARK-12477-branch-1.5-compile-fix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants