[SPARK-7269] [SQL] Incorrect analysis for aggregation #5798

chenghao-intel · 2015-04-30T02:36:34Z

https://issues.apache.org/jira/browse/SPARK-7269
In a case-insensitive system, two AttributeReference objects may not be literally equal because their names differ in capitalization; semantically, however, they should be identical, since after resolution the exprId is exactly the same.

For example:

SELECT kEy + 1 FROM src GROUP BY key+ 1

This is actually a legal query in HiveContext (case-insensitive); however, it fails in CheckAnalysis because
Add(AttributeReference("key"), Literal(1)) and
Add(AttributeReference("kEy"), Literal(1)) are not literally identical. The failure comes from this code:

case e if groupingExprs.contains(e) => // OK

in CheckAnalysis.scala for Aggregate, as well as in patterns.scala for partial aggregation.

To avoid confusing people by overriding AttributeReference.equals(), we provide a utility class for the equality check.

In the long term, we probably need to refactor Expression a little to support semanticEquals() in place of equals() (literal equality checking).
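
To make the failure concrete, here is a minimal sketch built on simplified stand-ins rather than the real Catalyst classes. The types `AttributeRef`, `Literal`, and `Add` below are toy case classes assumed for illustration only. The sketch shows case-class equality rejecting two expressions that differ only in the capitalization of a name, even though their `exprId` values agree.

```scala
// Toy stand-ins for Catalyst expressions; these shapes are assumptions, not Spark's API.
case class AttributeRef(name: String, exprId: Long)
case class Literal(value: Int)
case class Add(left: AttributeRef, right: Literal)

object LiteralMismatch {
  // From: SELECT kEy + 1
  val selected: Add = Add(AttributeRef("kEy", 1L), Literal(1))
  // From: GROUP BY key + 1 (same resolved column, hence the same exprId)
  val groupingExprs: Seq[Add] = Seq(Add(AttributeRef("key", 1L), Literal(1)))

  // Literal (case-class) equality compares the name too, so this lookup misses.
  val accepted: Boolean = groupingExprs.contains(selected)

  // The resolved identity of the column is nevertheless the same.
  val sameExprId: Boolean = groupingExprs.head.left.exprId == selected.left.exprId
}
```

In this toy model `accepted` is false while `sameExprId` is true, which mirrors the mismatch described above.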

AmplabJenkins · 2015-04-30T02:37:09Z

Merged build triggered.

AmplabJenkins · 2015-04-30T02:37:15Z

Merged build started.

SparkQA · 2015-04-30T02:38:55Z

Test build #31372 has started for PR 5798 at commit 1280cda.

viirya · 2015-04-30T03:23:49Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala

@@ -81,7 +81,7 @@ class HiveResolutionSuite extends HiveComparisonTest {
      .toDF().registerTempTable("caseSensitivityTest")

    val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "B", "a", "B", "a", "B", "a", "B"),


Why don't we preserve the case of the query?

I have a unit test that explains this. This is actually a workaround for the bug fix; we should normalize the attribute names during analysis, but let's leave that for a further improvement.

I meant that it looks like we preserved the case before; why don't we want to preserve it now?
This test is meant to check that the case of the query is preserved, so if you modify it like that, the test is no longer meaningful.

Oh, yes, I will update the code.

SparkQA · 2015-04-30T03:39:15Z

Test build #31372 has finished for PR 5798 at commit 1280cda.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

AmplabJenkins · 2015-04-30T03:39:19Z

Merged build finished. Test FAILed.

AmplabJenkins · 2015-04-30T03:39:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31372/
Test FAILed.

AmplabJenkins · 2015-04-30T04:57:09Z

Merged build triggered.

AmplabJenkins · 2015-04-30T04:57:16Z

Merged build started.

SparkQA · 2015-04-30T04:57:51Z

Test build #31383 has started for PR 5798 at commit c00f1ad.

scwf · 2015-04-30T05:03:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala

+
+  val caseInsensitiveResolution = new Resolver {
+    override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
+    override def apply(a: String): String = a.toLowerCase // as Hive does


How about renaming the first apply to resolve and the second to normalize?

I'd like to keep the first apply as it was, because I don't want to impact a lot of existing code. I agree we should rename the second apply to normalize.

/cc @rxin, who may have concerns about this.

If we want to add this, I think we should call it normalize. Maybe change the first apply to something else in the future.

I'm not sure if we need to add this though. I will let @marmbrus comment on that.
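
For readers following the naming discussion, here is a rough sketch of what a resolver with a separate `normalize` method could look like. The trait shape, the `Resolvers` object, and the `caseSensitive` instance are assumptions made for illustration; they are not taken from the patch beyond the two method bodies quoted in the diff above.

```scala
import java.util.Locale

// Hypothetical shape of the trait under discussion; not a confirmed Spark API.
trait Resolver {
  // Do the two names refer to the same thing under this resolver's rules?
  def apply(a: String, b: String): Boolean
  // Canonical form of a name under this resolver's rules.
  def normalize(a: String): String
}

object Resolvers {
  val caseInsensitive: Resolver = new Resolver {
    def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
    // Locale.ROOT avoids locale-specific lowercasing surprises.
    def normalize(a: String): String = a.toLowerCase(Locale.ROOT)
  }

  val caseSensitive: Resolver = new Resolver {
    def apply(a: String, b: String): Boolean = a == b
    def normalize(a: String): String = a
  }
}
```

Keeping `apply` for the two-argument comparison leaves existing call sites such as `resolver(attribute.name, nameParts.head)` unchanged, which is the compatibility concern raised above.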

SparkQA · 2015-04-30T06:39:45Z

Test build #31383 has finished for PR 5798 at commit c00f1ad.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait Resolver
This patch does not change any dependencies.

AmplabJenkins · 2015-04-30T06:39:49Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-04-30T06:39:49Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31383/
Test PASSed.

cloud-fan · 2015-04-30T07:44:54Z

It looks to me that the ostensible reason for this failure is that groupingExprs.contains(e) mistakenly returns false. Why not simply change the equals method in AttributeReference to not compare name? AttributeReference.hashCode doesn't use name either. Sorry if I missed something.

chenghao-intel · 2015-04-30T07:50:16Z

@cloud-fan I was thinking that too, but I don't think it's a good idea to override the equals method of a case class like that. That's why we have the helper class AttributeEquals.

chenghao-intel · 2015-04-30T08:10:10Z

Thank you for the comments. I've updated the code to preserve the attribute name. Attribute name normalization still seems to require some discussion, so let's keep it for a future improvement.

AmplabJenkins · 2015-04-30T08:12:11Z

Merged build triggered.

AmplabJenkins · 2015-04-30T08:12:17Z

Merged build started.

SparkQA · 2015-04-30T08:12:40Z

Test build #31403 has started for PR 5798 at commit 1f0ed92.

cloud-fan · 2015-04-30T08:21:11Z

@chenghao-intel, how about changing groupingExprs.contains(e) to use AttributeEquals? That way we don't need to touch AttributeReference.equals.

cloud-fan · 2015-04-30T08:32:18Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala

@@ -81,9 +81,11 @@ class HiveResolutionSuite extends HiveComparisonTest {
      .toDF().registerTempTable("caseSensitivityTest")

    val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "B", "a", "B", "a", "b", "A", "B"),


I'm not sure what we really want here. When a user runs SELECT b FROM t and t has a column B, which one should we use in the result schema: b or B? cc @marmbrus

Does that matter for a case-insensitive system?
But we do need to keep the attribute name identical along the reference chain. This is a workaround for the bug fix; in the long term, we probably need to refactor AttributeReference equality with respect to name (or take the Resolver into account?).

SparkQA · 2015-04-30T10:17:27Z

Test build #31403 has finished for PR 5798 at commit 1f0ed92.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Param[T] (val parent: Params, val name: String, val doc: String, val isValid: T => Boolean)
- class DoubleParam(parent: Params, name: String, doc: String, isValid: Double => Boolean)
- class IntParam(parent: Params, name: String, doc: String, isValid: Int => Boolean)
- class FloatParam(parent: Params, name: String, doc: String, isValid: Float => Boolean)
- class LongParam(parent: Params, name: String, doc: String, isValid: Long => Boolean)
- class BooleanParam(parent: Params, name: String, doc: String) // No need for isValid
- case class ParamPair[T](param: Param[T], value: T)
- class KMeansModel (
- trait PMMLExportable
- case class Sample(
- case class Sample(
This patch adds the following new dependencies:
- jaxb-api-2.2.7.jar
- jaxb-core-2.2.7.jar
- jaxb-impl-2.2.7.jar
- pmml-agent-1.1.15.jar
- pmml-model-1.1.15.jar
- pmml-schema-1.1.15.jar
This patch removes the following dependencies:
- activation-1.1.jar
- jaxb-api-2.2.2.jar
- jaxb-impl-2.2.3-1.jar

AmplabJenkins · 2015-04-30T10:17:31Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-04-30T10:17:32Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31403/
Test PASSed.

marmbrus · 2015-04-30T21:30:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

@@ -158,7 +158,7 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
      resolver: Resolver,
      attribute: Attribute): Option[(Attribute, List[String])] = {
    if (resolver(attribute.name, nameParts.head)) {
-      Option((attribute.withName(nameParts.head), nameParts.tail.toList))
+      Option((attribute, nameParts.tail.toList))


This is incorrect. Spark SQL is case insensitive but case preserving. This behavior is important because we interface with systems that are case sensitive (think DataFrames in python) and otherwise it is very confusing to the user.
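
A small sketch of "case insensitive but case preserving", assuming a toy `Column` type and a hypothetical `resolve` helper (neither is Spark code): the lookup ignores case, but the returned attribute carries the spelling used in the query, which is what the removed `withName(nameParts.head)` call did.

```scala
// Toy attribute; `Column` and `CasePreserving` are illustrative names, not Spark's API.
case class Column(name: String, exprId: Long) {
  def withName(newName: String): Column = copy(name = newName)
}

object CasePreserving {
  // Match ignoring case, then keep the capitalization the user typed.
  def resolve(queryName: String, schema: Seq[Column]): Option[Column] =
    schema
      .find(_.name.equalsIgnoreCase(queryName))
      .map(_.withName(queryName))
}
```

With a schema containing column `b`, resolving `B` yields an attribute named `B` with the same `exprId`, so downstream case-sensitive consumers see the name as written in the query.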

marmbrus · 2015-04-30T21:36:58Z

Definitely do not change the equality function for AttributeReference. I did this in an early version of catalyst and the result can be quite confusing. equals() should always be exact and consider all properties of a case class.

Instead, use an AttributeSet whenever you are looking for reference equals or contains operations. Really it would be awesome if we could add a linter rule that warned for Seq/Set[Attribute].contains(), since this is often incorrect.
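
The idea behind a dedicated set type can be sketched as follows. `ExprIdSet` and the toy `Attribute` class are hypothetical and far smaller than Catalyst's AttributeSet; the sketch only illustrates membership decided by `exprId` while `equals` on the case class stays exact.

```scala
// Toy attribute with exact, compiler-generated equality over all fields.
case class Attribute(name: String, exprId: Long)

// Hypothetical helper: membership is decided by exprId only.
final class ExprIdSet private (ids: Set[Long]) {
  def contains(a: Attribute): Boolean = ids.contains(a.exprId)
}

object ExprIdSet {
  def apply(attrs: Seq[Attribute]): ExprIdSet =
    new ExprIdSet(attrs.map(_.exprId).toSet)
}
```

A plain `Seq(Attribute("key", 1L)).contains(Attribute("kEy", 1L))` is false because the names differ, whereas `ExprIdSet(Seq(Attribute("key", 1L))).contains(Attribute("kEy", 1L))` is true.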

AmplabJenkins · 2015-05-10T16:41:29Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32348/
Test FAILed.

chenghao-intel · 2015-05-11T02:11:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

            partialEvaluations.values.flatMap(_.partialEvaluations)).toSeq

-        val namedGroupingAttributes = namedGroupingExpressions.values.map(_.toAttribute).toSeq


namedGroupingExpressions.values probably comes back in arbitrary order, which is wrong relative to the order of groupingExpressions.
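
One way to avoid depending on map iteration order is to walk the original sequence and look each entry up, as in this sketch. Strings stand in for expressions, and the `PartialGroup#i` alias names are made up for illustration.

```scala
object GroupingOrder {
  // Grouping expressions in the order written in the query (toy strings, not real expressions).
  val groupingExpressions: Seq[String] = Seq("e", "d", "c", "b", "a", "f")

  // Lookup table from each expression to a generated alias name.
  val named: Map[String, String] =
    groupingExpressions.zipWithIndex.map { case (e, i) => e -> s"PartialGroup#$i" }.toMap

  // Order here follows the Map implementation and is not guaranteed to match the query.
  val fromValues: Seq[String] = named.values.toSeq

  // Order here follows groupingExpressions, because the sequence drives the traversal.
  val inQueryOrder: Seq[String] = groupingExpressions.map(named)
}
```

Both sequences contain the same aliases, but only `inQueryOrder` is guaranteed to line up position by position with `groupingExpressions`.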

AmplabJenkins · 2015-05-11T02:12:12Z

Merged build triggered.

AmplabJenkins · 2015-05-11T02:12:19Z

Merged build started.

SparkQA · 2015-05-11T02:13:57Z

Test build #32361 has started for PR 5798 at commit e00d0bc.

SparkQA · 2015-05-11T04:43:58Z

Test build #32361 timed out for PR 5798 at commit e00d0bc after a configured wait of 150m.

AmplabJenkins · 2015-05-11T04:44:02Z

Merged build finished. Test FAILed.

AmplabJenkins · 2015-05-11T04:44:02Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32361/
Test FAILed.

chenghao-intel · 2015-05-11T05:10:32Z

retest this please.

AmplabJenkins · 2015-05-11T05:12:12Z

Merged build triggered.

AmplabJenkins · 2015-05-11T05:12:20Z

Merged build started.

SparkQA · 2015-05-11T05:14:05Z

Test build #32374 has started for PR 5798 at commit e00d0bc.

SparkQA · 2015-05-11T07:10:54Z

Test build #32374 has finished for PR 5798 at commit e00d0bc.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- sealed class ExpressionMap[A] extends Serializable
- sealed class ExpressionSet extends Serializable

AmplabJenkins · 2015-05-11T07:10:59Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-11T07:10:59Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32374/
Test PASSed.

marmbrus · 2015-05-11T22:40:33Z

The QA phase is not a reason to rush this patch in half finished. This isn't a regression and there is a trivial workaround (use consistent capitalization).

If we are going to add ExpressionMap and ExpressionSet, they should be complete and have tests. It is also pretty weird for a Set to not return the same elements that are put into it. Also, I don't think the current method of normalization is sufficient. It will fail if other unimportant properties (such as scope) differ.

chenghao-intel · 2015-05-12T00:29:17Z

Would it be simpler to refactor AttributeReference so that it compares not nullable, dataType, or name, but only exprId? In that case, we might be able to remove AttributeSet, AttributeMap, etc.

Say:

case class AttributeReference(exprId:String=ExprId)(name: String ...)

I don't think we can stop people from writing code like:

    val exprs: Seq[Expression] = ...
    exprs.toSet.contains(otherExpression)

marmbrus · 2015-05-12T01:20:49Z

If you break equality you will break the transform function.

You can't stop people from using expression equality incorrectly, but you also can't stop them from doing a1.name == a2.name, and it's equally invalid (and happened in the code in quite a few places before we added AttributeSet). I'm not sure there is a way to make the compiler understand what type of equality you are looking for. I think the best solution is awareness of the sharp edges amongst people reviewing code and nice helper classes for dealing with the various types of equality that we care about.
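
One way loosened equality can interfere with tree transformation is sketched below. It assumes a transform step that keeps the original node whenever the rule's result compares equal to it; `LooseAttr` and `TransformDemo` are toy names, and the real transform logic in Catalyst is more involved.

```scala
// Toy attribute whose equals ignores the name (the change being argued against).
final class LooseAttr(val name: String, val exprId: Long) {
  override def equals(other: Any): Boolean = other match {
    case that: LooseAttr => that.exprId == exprId
    case _               => false
  }
  override def hashCode: Int = exprId.hashCode
}

object TransformDemo {
  // Simplified transform step: if the rule's output "equals" the input, keep the input.
  def transform(node: LooseAttr)(rule: LooseAttr => LooseAttr): LooseAttr = {
    val after = rule(node)
    if (node == after) node else after
  }

  // A rule that only changes the name is silently discarded under loose equality.
  val renamed: LooseAttr =
    transform(new LooseAttr("key", 1L))(a => new LooseAttr("kEy", a.exprId))
}
```

In this toy model `renamed.name` is still `key`, because the rewritten node is treated as unchanged.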

chenghao-intel · 2015-05-12T02:49:48Z

Thank you @marmbrus for the explanation. I was wondering whether there is a simple way to make Expression behave more like a general class that can be used with common collection operations such as Map.get, Set.contains, etc. Anyway, we can keep that for a future improvement.

For this PR, I am not sure we really need a full Map or Seq as we did for AttributeMap and AttributeSet. Methods such as ++, --, subsetOf, etc. are used widely in ColumnPruning, ExtractEquiJoinKey, Expression.references, etc., but that is not the case for ExpressionMap/Set. That's why I put very few methods there.

I can add more methods/tests if you feel we should do it in this PR.

cloud-fan · 2015-05-12T04:37:43Z

How about adding semanticEquals method to Expression like you suggested before? And we can choose semanticEquals as equality function for Set.contains etc. when we need.

chenghao-intel · 2015-05-12T04:52:52Z

Set.contains only accepts a concrete element as its parameter, not a function.
Supporting semanticEquals would impact lots of Expressions, and we still couldn't make it integrate seamlessly with Scala collection methods like Set.contains.

cloud-fan · 2015-05-12T09:45:16Z

Hmm... How about using Set.find here instead of contains? find is slower than contains, but we don't care much about performance here, right?

marmbrus · 2015-05-12T18:18:18Z

Using .find seems like a pretty reasonable solution to me.
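
A minimal sketch of the `.find` approach, assuming a `semanticEquals` method on a toy expression hierarchy. `Expr`, `Attr`, `Lit`, `Add`, and `GroupingCheck` are illustrative names and this is not the code that was eventually merged.

```scala
// Toy expression tree; semanticEquals identifies attributes by exprId and ignores names.
sealed trait Expr {
  def semanticEquals(other: Expr): Boolean
}

case class Attr(name: String, exprId: Long) extends Expr {
  def semanticEquals(other: Expr): Boolean = other match {
    case Attr(_, id) => id == exprId
    case _           => false
  }
}

case class Lit(value: Int) extends Expr {
  def semanticEquals(other: Expr): Boolean = other == this
}

case class Add(left: Expr, right: Expr) extends Expr {
  def semanticEquals(other: Expr): Boolean = other match {
    case Add(l, r) => left.semanticEquals(l) && right.semanticEquals(r)
    case _         => false
  }
}

object GroupingCheck {
  // Linear scan with a predicate instead of contains, which can only use equals.
  def findGrouping(groupingExprs: Seq[Expr], e: Expr): Option[Expr] =
    groupingExprs.find(_.semanticEquals(e))
}
```

The scan is O(n) in the number of grouping expressions, which matches the trade-off noted above: slower than a hash lookup, but the grouping list is normally short.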

chenghao-intel · 2015-05-13T03:12:14Z

OK, let's target the bug fix for now. I will update the code soon.

chenghao-intel · 2015-05-13T05:50:07Z

I've updated the code at #6110, but I don't think that's a better solution.

chenghao-intel mentioned this pull request Apr 30, 2015

[SPARK-7235] [SQL] Refactor the grouping sets #5780

Closed

viirya reviewed Apr 30, 2015

scwf reviewed Apr 30, 2015

chenghao-intel force-pushed the analysis branch from c00f1ad to 1f0ed92 Compare April 30, 2015 08:08

cloud-fan reviewed Apr 30, 2015

marmbrus reviewed Apr 30, 2015

map.values change the order of partial aggregate expression

e00d0bc

chenghao-intel reviewed May 11, 2015

chenghao-intel mentioned this pull request May 13, 2015

[SPARK-7269] [SQL] Incorrect analysis for aggregation (new implementation) #6110

Closed

chenghao-intel mentioned this pull request May 17, 2015

[SPARK-7269] [SQL] Incorrect analysis for aggregation(use semanticEquals) #6173

Closed

chenghao-intel closed this May 19, 2015

chenghao-intel deleted the analysis branch July 2, 2015 08:32
