Skip to content

Conversation

chenghao-intel
Copy link
Contributor

https://issues.apache.org/jira/browse/SPARK-7269
In a case in-sensitive system, the AttributeReference object may not the same literally, as the name in different capital, however, in semantic it's should be identical, as after being resolved, the exprId is exactly the same.

For example:

SELECT kEy + 1 FROM src GROUP BY key+ 1

It's actually a legal query in HiveContext(case insensitive), however, it will fail in CheckAnalysis as
Add(AttributeReference("key"), Literal(1)) and
Add(AttributeReference("kEy"), Literal(1)) are not identical in literal. As we have code

case e if groupingExprs.contains(e) => // OK

in CheckAnalysis.scala for Aggregate, as well as in patterns.scala for partial aggregation.

In order not to confusing people by overwriting the method AttributeReference.equals(), we provide a utility classes for the equality checking purpose.

In long term, we probably need to refactor the Expression a little bit for supporting the semanticEquals() instead of equals()(literally equality checking)

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented Apr 30, 2015

Test build #31372 has started for PR 5798 at commit 1280cda.

@@ -81,7 +81,7 @@ class HiveResolutionSuite extends HiveComparisonTest {
.toDF().registerTempTable("caseSensitivityTest")

val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
assert(query.schema.fields.map(_.name) === Seq("a", "B", "a", "B", "a", "B", "a", "B"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we preserve the case of the query?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have unit test for explain this. Actually this is a workaround for the bug fixing, and, we should normalize the attribute names during the analysis. But leave it for the further improvement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that looks we preserve the case before, why do we now don't want to preserve it?
This test is used to test preserving the case of the query. So if you modified it like that, the test is not meaningful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, I will update the code.

@SparkQA
Copy link

SparkQA commented Apr 30, 2015

Test build #31372 has finished for PR 5798 at commit 1280cda.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31372/
Test FAILed.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented Apr 30, 2015

Test build #31383 has started for PR 5798 at commit c00f1ad.


val caseInsensitiveResolution = new Resolver {
override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
override def apply(a: String): String = a.toLowerCase // as Hive does
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about rename the first apply -> resolve and the second rename to normalize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like keep the first apply as it was, because I don't want to impact a lots of existed code. I agree we should rename the second apply => normalize.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @rxin may has concern about this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to add this, I think we should call it normalize. Maybe change the first apply to something else in the future.

I'm not sure if we need to add this though. I will let @marmbrus comment on that.

@SparkQA
Copy link

SparkQA commented Apr 30, 2015

Test build #31383 has finished for PR 5798 at commit c00f1ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Resolver
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31383/
Test PASSed.

@cloud-fan
Copy link
Contributor

It looks to me that the ostensible reason of this failure is groupingExprs.contains(e) mistakenly return false. Why not simply change the equals method in AttributeReference to not compare name? The AttributeReference.hashCode didn't use name either. Sorry if I missed something.

@chenghao-intel
Copy link
Contributor Author

@cloud-fan I was thinking that also, but I don't think it's a good idea to override the equals method for a case class like that. And that's why we have the helper class AttributeEquals.

@chenghao-intel
Copy link
Contributor Author

Thank you for the comments, I've updated the code for preserving the attribute name. Attribute name normalization seems still require some discussion, let's keep it for the future improvement.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented Apr 30, 2015

Test build #31403 has started for PR 5798 at commit 1f0ed92.

@cloud-fan
Copy link
Contributor

@chenghao-intel , how about changing groupingExprs.contains(e) to using AttributeEquals? Thus we don't need to touch AttributeReference.equals.

@@ -81,9 +81,11 @@ class HiveResolutionSuite extends HiveComparisonTest {
.toDF().registerTempTable("caseSensitivityTest")

val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
assert(query.schema.fields.map(_.name) === Seq("a", "B", "a", "B", "a", "b", "A", "B"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what we really want here. When user SELECT b FROM t and t has a column B, which one should we used in the result schema? b or B? cc @marmbrus

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that matter for a case-insensitive system?
But we do need keep the attribute name identical in the references chain. This is a workaround approach for the bug fixing, in long term, we probably need to refactor the AttributeReference equality for name (or take the Resolver in?).

@SparkQA
Copy link

SparkQA commented Apr 30, 2015

Test build #31403 has finished for PR 5798 at commit 1f0ed92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Param[T] (val parent: Params, val name: String, val doc: String, val isValid: T => Boolean)
    • class DoubleParam(parent: Params, name: String, doc: String, isValid: Double => Boolean)
    • class IntParam(parent: Params, name: String, doc: String, isValid: Int => Boolean)
    • class FloatParam(parent: Params, name: String, doc: String, isValid: Float => Boolean)
    • class LongParam(parent: Params, name: String, doc: String, isValid: Long => Boolean)
    • class BooleanParam(parent: Params, name: String, doc: String) // No need for isValid
    • case class ParamPair[T](param: Param[T], value: T)
    • class KMeansModel (
    • trait PMMLExportable
    • case class Sample(
    • case class Sample(
  • This patch adds the following new dependencies:
    • jaxb-api-2.2.7.jar
    • jaxb-core-2.2.7.jar
    • jaxb-impl-2.2.7.jar
    • pmml-agent-1.1.15.jar
    • pmml-model-1.1.15.jar
    • pmml-schema-1.1.15.jar
  • This patch removes the following dependencies:
    • activation-1.1.jar
    • jaxb-api-2.2.2.jar
    • jaxb-impl-2.2.3-1.jar

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31403/
Test PASSed.

@@ -158,7 +158,7 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
resolver: Resolver,
attribute: Attribute): Option[(Attribute, List[String])] = {
if (resolver(attribute.name, nameParts.head)) {
Option((attribute.withName(nameParts.head), nameParts.tail.toList))
Option((attribute, nameParts.tail.toList))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. Spark SQL is case insensitive but case preserving. This behavior is important because we interface with systems that are case sensitive (think DataFrames in python) and otherwise it is very confusing to the user.

@marmbrus
Copy link
Contributor

Definitely do not change the equality function for AttributeReference. I did this in an early version of catalyst and the result can be quite confusing. equals() should always be exact and consider all properties of a case class.

Instead, use an AttributeSet whenever you are looking for reference equals or contains operations. Really it would be awesome if we could add a linter rule that warned for Seq/Set[Attribute].contains(), since this is often incorrect.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32348/
Test FAILed.

partialEvaluations.values.flatMap(_.partialEvaluations)).toSeq

val namedGroupingAttributes = namedGroupingExpressions.values.map(_.toAttribute).toSeq
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

namedGroupingExpressions.values probably come with arbitrary order, which is not right compare to the groupingExpressions.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 11, 2015

Test build #32361 has started for PR 5798 at commit e00d0bc.

@SparkQA
Copy link

SparkQA commented May 11, 2015

Test build #32361 timed out for PR 5798 at commit e00d0bc after a configured wait of 150m.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32361/
Test FAILed.

@chenghao-intel
Copy link
Contributor Author

retest this please.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 11, 2015

Test build #32374 has started for PR 5798 at commit e00d0bc.

@SparkQA
Copy link

SparkQA commented May 11, 2015

Test build #32374 has finished for PR 5798 at commit e00d0bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sealed class ExpressionMap[A] extends Serializable
    • sealed class ExpressionSet extends Serializable

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32374/
Test PASSed.

@marmbrus
Copy link
Contributor

The QA phase is not a reason to rush this patch in half finished. This isn't a regression and there is a trivial workaround (use consistent capitalization).

If we are going to add ExpressionMap and ExpressionSet, they should be complete and have tests. It is also pretty weird for a Set to not return the same elements that are put into it. Also, I don't think the current method of normalization is sufficient. It will fail if other unimportant properties (such as scope) differ.

@chenghao-intel
Copy link
Contributor Author

Will it be simpler if we refactor the AttributeReference that not to compare the nullable, dataType or name, but exprId? If in that case, then we might remove both AttributeSet, AttributeMap etc.

Says:

case class AttributeReference(exprId:String=ExprId)(name: String ...)

I don't think we can stop people to write code like:

    val exprs: Seq[Expression] = ...
    exprs.toSet.contains(otherExpression)

@marmbrus
Copy link
Contributor

If you break equality you will break the transform function.

You can't stop people from using expression equality incorrectly, but you also can't stop them from doing a1.name == a2.name and it's equally invalid (and happened in the code in quite a few places before we added AttributeSet. I'm not sure there is a way to make the compiler understand what type of equality you are looking for. I think the best solution is awareness of the sharp edges amongst people reviewing code and nice helper classes for dealing the the various types of equality that we care about.

@chenghao-intel
Copy link
Contributor Author

Thank you @marmbrus for the explanation. I was thinking if there is a simple way to make the Expression more like a general class, which can be used with common collections like Map.get, Set.contains etc. Anyway, we can keep it for the future improvement.

For this PR, I am not sure if we really need a real Map or Seq as we did for AttributeMap and AttributeSet, the methods such as ++, --, subsetOf etc. are used widely in ColumnPruning, ExtractEquiJoinKey, Expression.references etc., but ExpressionMap/Set is not the case. That's why put very few methods there.

I can add more methods/tests if you feel we should do it in this PR.

@cloud-fan
Copy link
Contributor

How about adding semanticEquals method to Expression like you suggested before? And we can choose semanticEquals as equality function for Set.contains etc. when we need.

@chenghao-intel
Copy link
Contributor Author

Set.contains only support the concrete object parameter, not a function.
Supporting the semanticEquals will impact lots of Expressions, and still we couldn't make it seamlessly integrate with the scala collections like Set.contains.

@cloud-fan
Copy link
Contributor

Hmm...How about using Set.find here instead of contains? find is slower than contains but we don't care much about performance here, right?

@marmbrus
Copy link
Contributor

Using .find seems like a pretty reasonable solution to me.

@chenghao-intel
Copy link
Contributor Author

OK, let's targeting the bug fixing for now, I will update the code soon.

@chenghao-intel
Copy link
Contributor Author

I've updated the code at #6110, but I don't think that's a better solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants