-
Notifications
You must be signed in to change notification settings - Fork 28.8k
[SPARK-7269] [SQL] Incorrect analysis for aggregation #5798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Merged build triggered. |
Merged build started. |
Test build #31372 has started for PR 5798 at commit |
@@ -81,7 +81,7 @@ class HiveResolutionSuite extends HiveComparisonTest { | |||
.toDF().registerTempTable("caseSensitivityTest") | |||
|
|||
val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest") | |||
assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"), | |||
assert(query.schema.fields.map(_.name) === Seq("a", "B", "a", "B", "a", "B", "a", "B"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we preserve the case of the query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have unit test for explain this. Actually this is a workaround for the bug fixing, and, we should normalize the attribute names during the analysis. But leave it for the further improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that looks we preserve the case before, why do we now don't want to preserve it?
This test is used to test preserving the case of the query. So if you modified it like that, the test is not meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yes, I will update the code.
Test build #31372 has finished for PR 5798 at commit
|
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build triggered. |
Merged build started. |
Test build #31383 has started for PR 5798 at commit |
|
||
val caseInsensitiveResolution = new Resolver { | ||
override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b) | ||
override def apply(a: String): String = a.toLowerCase // as Hive does |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about rename the first apply -> resolve and the second rename to normalize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like keep the first apply
as it was, because I don't want to impact a lots of existed code. I agree we should rename the second apply
=> normalize
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/cc @rxin may has concern about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to add this, I think we should call it normalize. Maybe change the first apply to something else in the future.
I'm not sure if we need to add this though. I will let @marmbrus comment on that.
Test build #31383 has finished for PR 5798 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
It looks to me that the ostensible reason of this failure is |
@cloud-fan I was thinking that also, but I don't think it's a good idea to override the |
c00f1ad
to
1f0ed92
Compare
Thank you for the comments, I've updated the code for preserving the attribute name. Attribute name normalization seems still require some discussion, let's keep it for the future improvement. |
Merged build triggered. |
Merged build started. |
Test build #31403 has started for PR 5798 at commit |
@chenghao-intel , how about changing |
@@ -81,9 +81,11 @@ class HiveResolutionSuite extends HiveComparisonTest { | |||
.toDF().registerTempTable("caseSensitivityTest") | |||
|
|||
val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest") | |||
assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"), | |||
assert(query.schema.fields.map(_.name) === Seq("a", "B", "a", "B", "a", "b", "A", "B"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what we really want here. When user SELECT b FROM t
and t
has a column B
, which one should we used in the result schema? b
or B
? cc @marmbrus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that matter for a case-insensitive system?
But we do need keep the attribute name identical in the references chain. This is a workaround approach for the bug fixing, in long term, we probably need to refactor the AttributeReference equality
for name (or take the Resolver in?).
Test build #31403 has finished for PR 5798 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
@@ -158,7 +158,7 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging { | |||
resolver: Resolver, | |||
attribute: Attribute): Option[(Attribute, List[String])] = { | |||
if (resolver(attribute.name, nameParts.head)) { | |||
Option((attribute.withName(nameParts.head), nameParts.tail.toList)) | |||
Option((attribute, nameParts.tail.toList)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect. Spark SQL is case insensitive but case preserving. This behavior is important because we interface with systems that are case sensitive (think DataFrames in python) and otherwise it is very confusing to the user.
Definitely do not change the equality function for Instead, use an |
Test FAILed. |
partialEvaluations.values.flatMap(_.partialEvaluations)).toSeq | ||
|
||
val namedGroupingAttributes = namedGroupingExpressions.values.map(_.toAttribute).toSeq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
namedGroupingExpressions.values
probably come with arbitrary order, which is not right compare to the groupingExpressions
.
Merged build triggered. |
Merged build started. |
Test build #32361 has started for PR 5798 at commit |
Test build #32361 timed out for PR 5798 at commit |
Merged build finished. Test FAILed. |
Test FAILed. |
retest this please. |
Merged build triggered. |
Merged build started. |
Test build #32374 has started for PR 5798 at commit |
Test build #32374 has finished for PR 5798 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
The QA phase is not a reason to rush this patch in half finished. This isn't a regression and there is a trivial workaround (use consistent capitalization). If we are going to add |
Will it be simpler if we refactor the Says: case class AttributeReference(exprId:String=ExprId)(name: String ...) I don't think we can stop people to write code like: val exprs: Seq[Expression] = ...
exprs.toSet.contains(otherExpression) |
If you break equality you will break the You can't stop people from using expression equality incorrectly, but you also can't stop them from doing |
Thank you @marmbrus for the explanation. I was thinking if there is a simple way to make the For this PR, I am not sure if we really need a real I can add more methods/tests if you feel we should do it in this PR. |
How about adding |
|
Hmm...How about using |
Using |
OK, let's targeting the bug fixing for now, I will update the code soon. |
I've updated the code at #6110, but I don't think that's a better solution. |
https://issues.apache.org/jira/browse/SPARK-7269
In a case in-sensitive system, the
AttributeReference
object may not the same literally, as thename
in different capital, however, in semantic it's should be identical, as after being resolved, theexprId
is exactly the same.For example:
It's actually a legal query in HiveContext(case insensitive), however, it will fail in
CheckAnalysis
asAdd(AttributeReference("key"), Literal(1))
andAdd(AttributeReference("kEy"), Literal(1))
are not identical in literal. As we have codein
CheckAnalysis.scala
forAggregate
, as well as inpatterns.scala
for partial aggregation.In order not to confusing people by overwriting the method
AttributeReference.equals()
, we provide a utility classes for the equality checking purpose.In long term, we probably need to refactor the
Expression
a little bit for supporting thesemanticEquals()
instead ofequals()
(literally equality checking)