
[SPARK-17618] Fix invalid comparisons between UnsafeRow and other row formats #15185


Closed
wants to merge 3 commits

Conversation

JoshRosen
Contributor

What changes were proposed in this pull request?

This patch addresses a correctness bug in Spark 1.6.x where coalesce() declares that it can process UnsafeRows but mis-declares that it always outputs safe rows. If UnsafeRow and other Row types are compared for equality, then we will get spurious false comparisons, leading to wrong answers in operators which perform whole-row comparison (such as distinct() or except()). An example of a query impacted by this bug is given in the JIRA ticket.
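
For illustration only, here is a minimal sketch (not code from this patch) of the kind of cross-format comparison involved, assuming Spark's internal catalyst APIs (InternalRow, UnsafeProjection). Before this fix the comparison silently evaluates to false; with this patch, UnsafeRow.equals() throws instead:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object RowFormatMismatchExample {
  def main(args: Array[String]): Unit = {
    val schema = StructType(Seq(StructField("id", LongType)))
    // The same logical row in the "safe" (object-based) format...
    val safeRow: InternalRow = InternalRow(1L)
    // ...and converted into Tungsten's binary UnsafeRow format.
    val unsafeRow = UnsafeProjection.create(schema).apply(safeRow)
    // Mixing the two formats in a whole-row comparison is the bug described
    // above: it quietly evaluated to false before this patch, and throws after.
    println(unsafeRow == safeRow)
  }
}
```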

The problem is that the validity of our row format conversion rules depends on operators which handle UnsafeRows (signalled by overriding canProcessUnsafeRows) correctly reporting their output row format (which is done by overriding outputsUnsafeRows). In #9024, we overrode canProcessUnsafeRows but forgot to override outputsUnsafeRows, leading to the incorrect equals() comparison.
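
As a hedged sketch (not the exact diff in this PR), this is the shape of the Spark 1.6 physical Coalesce operator with the missing declaration added; the surrounding SparkPlan flags (canProcessSafeRows, outputsUnsafeRows) are assumed from the 1.6 codebase:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.{SparkPlan, UnaryNode}

// Hedged sketch of a pass-through physical operator in Spark 1.6.
case class Coalesce(numPartitions: Int, child: SparkPlan) extends UnaryNode {
  override def output: Seq[Attribute] = child.output

  // Declared in #9024: this operator can consume rows in either format.
  override def canProcessUnsafeRows: Boolean = true
  override def canProcessSafeRows: Boolean = true

  // The override #9024 forgot: without it the planner assumes safe output,
  // skips the format conversion, and UnsafeRows leak downstream.
  override def outputsUnsafeRows: Boolean = child.outputsUnsafeRows

  protected override def doExecute(): RDD[InternalRow] =
    child.execute().coalesce(numPartitions, shuffle = false)
}
```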

Our interface design is flawed because correctness depends on operators correctly overriding multiple methods; this problem could have been prevented by a design which coupled the row format methods / metadata into a single method / class so that all three methods had to be overridden at the same time.
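
Purely as a hypothetical illustration of that point (none of these names exist in Spark), such a coupled design might expose one descriptor instead of three independently overridable flags:

```scala
// Hypothetical API (not in Spark): one descriptor instead of three flags.
sealed trait RowFormatSupport {
  def canProcessUnsafeRows: Boolean
  def canProcessSafeRows: Boolean
  def outputsUnsafeRows: Boolean
}

// A row-at-a-time operator that never touches Tungsten buffers.
case object SafeRowsOnly extends RowFormatSupport {
  val canProcessUnsafeRows = false
  val canProcessSafeRows = true
  val outputsUnsafeRows = false
}

// A pass-through operator: whatever format the child emits, it emits.
final case class PassThrough(childOutputsUnsafeRows: Boolean) extends RowFormatSupport {
  val canProcessUnsafeRows = true
  val canProcessSafeRows = true
  val outputsUnsafeRows = childOutputsUnsafeRows
}

// An operator would then override a single method, e.g.
//   override def rowFormatSupport: RowFormatSupport = PassThrough(child.outputsUnsafeRows)
// making it impossible to declare input handling without declaring output format.
```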

This patch addresses this issue by adding missing outputsUnsafeRows overrides. In order to ensure that bugs in this logic are uncovered sooner, I have modified UnsafeRow.equals() to throw an IllegalArgumentException if it is called with an object that is not an UnsafeRow.
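
A minimal sketch of the strict check described here, written in Scala for consistency with the snippets above (the real method lives in the Java class UnsafeRow, so this shows the shape of the guard rather than the actual diff):

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

object UnsafeRowEqualsSketch {
  // Byte-for-byte equality for two UnsafeRows, and a loud failure when the
  // other side is in a different row format.
  def strictRowEquals(row: UnsafeRow, other: Any): Boolean = other match {
    case o: UnsafeRow =>
      row.getSizeInBytes == o.getSizeInBytes &&
        java.util.Arrays.equals(row.getBytes, o.getBytes)
    case _ =>
      // A comparison across row formats means a safe/unsafe conversion was
      // skipped somewhere in the plan, so surface it instead of returning false.
      throw new IllegalArgumentException(
        "Cannot compare UnsafeRow to " +
          (if (other == null) "null" else other.getClass.getName))
  }
}
```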

How was this patch tested?

I believe that the stronger misuse-checking in UnsafeRow.equals() is sufficient to detect and prevent this class of bug.

@SparkQA

SparkQA commented Sep 21, 2016

Test build #65733 has finished for PR 15185 at commit 9d4cf44.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

LGTM

@SparkQA

SparkQA commented Sep 21, 2016

Test build #65734 has finished for PR 15185 at commit 1319e82.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Sep 21, 2016

Test build #65741 has finished for PR 15185 at commit 1319e82.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Sep 22, 2016

Test build #65750 has finished for PR 15185 at commit 1319e82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Sep 27, 2016
… formats

## What changes were proposed in this pull request?

This patch addresses a correctness bug in Spark 1.6.x where `coalesce()` declares that it can process `UnsafeRows` but mis-declares that it always outputs safe rows. If UnsafeRow and other Row types are compared for equality, then we will get spurious `false` comparisons, leading to wrong answers in operators which perform whole-row comparison (such as `distinct()` or `except()`). An example of a query impacted by this bug is given in the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-17618).

The problem is that the validity of our row format conversion rules depends on operators which handle `UnsafeRows` (signalled by overriding `canProcessUnsafeRows`) correctly reporting their output row format (which is done by overriding `outputsUnsafeRows`). In #9024, we overrode `canProcessUnsafeRows` but forgot to override `outputsUnsafeRows`, leading to the incorrect `equals()` comparison.

Our interface design is flawed because correctness depends on operators correctly overriding multiple methods; this problem could have been prevented by a design which coupled the row format methods / metadata into a single method / class so that all three methods had to be overridden at the same time.

This patch addresses this issue by adding missing `outputsUnsafeRows` overrides. In order to ensure that bugs in this logic are uncovered sooner, I have modified `UnsafeRow.equals()` to throw an `IllegalArgumentException` if it is called with an object that is not an `UnsafeRow`.

## How was this patch tested?

I believe that the stronger misuse-checking in `UnsafeRow.equals()` is sufficient to detect and prevent this class of bug.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15185 from JoshRosen/SPARK-17618.
@hvanhovell
Contributor

Merging to master. Thanks!

@JoshRosen JoshRosen closed this Sep 27, 2016
@JoshRosen JoshRosen deleted the SPARK-17618 branch September 27, 2016 18:19
asfgit pushed a commit that referenced this pull request Sep 27, 2016
… other formats

This patch ports changes from #15185 to Spark 2.x. That patch fixed a correctness bug in Spark 1.6.x which was caused by an invalid `equals()` comparison between an `UnsafeRow` and another row of a different format. Spark 2.x is not affected by that specific correctness bug, but it can still reap the error-prevention benefits of that patch's changes, which modify `UnsafeRow.equals()` to throw an `IllegalArgumentException` if it is called with an object that is not an `UnsafeRow`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15265 from JoshRosen/SPARK-17618-master.
asfgit pushed a commit that referenced this pull request Sep 27, 2016
… other formats

This patch ports changes from #15185 to Spark 2.x. That patch fixed a correctness bug in Spark 1.6.x which was caused by an invalid `equals()` comparison between an `UnsafeRow` and another row of a different format. Spark 2.x is not affected by that specific correctness bug, but it can still reap the error-prevention benefits of that patch's changes, which modify `UnsafeRow.equals()` to throw an `IllegalArgumentException` if it is called with an object that is not an `UnsafeRow`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15265 from JoshRosen/SPARK-17618-master.

(cherry picked from commit 2f84a68)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Sep 28, 2016
… formats

## What changes were proposed in this pull request?

This patch addresses a correctness bug in Spark 1.6.x where `coalesce()` declares that it can process `UnsafeRows` but mis-declares that it always outputs safe rows. If UnsafeRow and other Row types are compared for equality, then we will get spurious `false` comparisons, leading to wrong answers in operators which perform whole-row comparison (such as `distinct()` or `except()`). An example of a query impacted by this bug is given in the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-17618).

The problem is that the validity of our row format conversion rules depends on operators which handle `UnsafeRows` (signalled by overriding `canProcessUnsafeRows`) correctly reporting their output row format (which is done by overriding `outputsUnsafeRows`). In apache#9024, we overrode `canProcessUnsafeRows` but forgot to override `outputsUnsafeRows`, leading to the incorrect `equals()` comparison.

Our interface design is flawed because correctness depends on operators correctly overriding multiple methods; this problem could have been prevented by a design which coupled the row format methods / metadata into a single method / class so that all three methods had to be overridden at the same time.

This patch addresses this issue by adding missing `outputsUnsafeRows` overrides. In order to ensure that bugs in this logic are uncovered sooner, I have modified `UnsafeRow.equals()` to throw an `IllegalArgumentException` if it is called with an object that is not an `UnsafeRow`.

## How was this patch tested?

I believe that the stronger misuse-checking in `UnsafeRow.equals()` is sufficient to detect and prevent this class of bug.

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#15185 from JoshRosen/SPARK-17618.

(cherry picked from commit e2ce0ca)