
[SPARK-28189][SQL] Use semanticEquals in Dataset drop method for attributes comparison #25055

Closed · wants to merge 3 commits
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (1 addition, 1 deletion)
@@ -2322,7 +2322,7 @@ class Dataset[T] private[sql](
    }
    val attrs = this.logicalPlan.output
    val colsAfterDrop = attrs.filter { attr =>
-     attr != expression
+     !attr.semanticEquals(expression)
Member: Is there no other place having the same issue?

Contributor Author: I went through Dataset.scala and didn't find a similar issue. However, there might be the same problem in other places in our SQL code.

Contributor: There are other places. Please see #21449.

Member: OK, thanks. I think it's fine to only target Dataset.drop in this PR.

Member: +1 for @maropu's opinion.

nice

Member:

  def semanticEquals(other: Expression): Boolean =
    deterministic && other.deterministic && canonicalized == other.canonicalized

What is the reason the comparison should depend on determinism when we just want to drop the column?

Member: nvm, the output only contains Attribute instances.

    }.map(attr => Column(attr))
    select(colsAfterDrop : _*)
  }
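For reference, a minimal sketch of the user-visible effect of the change, mirroring the test added below. It assumes a Spark shell, where spark and sql are predefined, and is an illustration rather than part of the PR.

scala> import spark.implicits._
scala> val testData = Seq(("a", 1), ("b", 2)).toDF("key", "value")

scala> sql("SET spark.sql.caseSensitive=false")

// The column reference uses a different case than the schema. With the previous
// `attr != expression` comparison the column could survive the drop; with
// semanticEquals the underlying attribute is matched and removed.
scala> testData.drop(testData("KEY")).printSchema()
root
 |-- value: integer (nullable = false)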
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala (23 additions, 0 deletions)
@@ -572,6 +572,29 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
    assert(df.schema.map(_.name) === Seq("value"))
  }

test("SPARK-28189 drop column using drop with column reference with case-insensitive names") {
// With SQL config caseSensitive OFF, case insensitive column name should work
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
val col1 = testData("KEY")
val df1 = testData.drop(col1)
checkAnswer(df1, testData.selectExpr("value"))
assert(df1.schema.map(_.name) === Seq("value"))

val col2 = testData("Key")
val df2 = testData.drop(col2)
checkAnswer(df2, testData.selectExpr("value"))
assert(df2.schema.map(_.name) === Seq("value"))
}

// With SQL config caseSensitive ON, AnalysisException should be thrown
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
val e = intercept[AnalysisException] {
testData("KEY")
Member (@maropu, Jul 7, 2019): Wait, it seems this test is not related to this PR? Did you just want to do something like this?

  test("SPARK-28189 drop column using drop with column reference with case-insensitive names") {
    var caseInsensitiveCol: Column = null
    withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
      caseInsensitiveCol = testData("KEY")
    }

    Seq("true", "false").foreach { isCaseSensitive =>
      withSQLConf(SQLConf.CASE_SENSITIVE.key -> isCaseSensitive) {
        val df = testData.drop(caseInsensitiveCol)
        checkAnswer(df, testData.selectExpr("value"))
        assert(df.schema.map(_.name) === Seq("value"))
      }
    }
  }

Member: ? @maropu. In your example, the following will fail in the same way, because testData only has key and value.

      withSQLConf(SQLConf.CASE_SENSITIVE.key -> isCaseSensitive) {
        caseInsensitiveCol = testData("KEY")

For me, the test case looked okay because caseSensitive mode doesn't allow that kind of unmatched column from the beginning.

Member: Oh, sorry, my bad. That line wasn't needed, and I updated the example above.
Anyway, I'm just a bit worried about the behaviour change in the code flow below:

scala> val testData = Seq(("a", 1)).toDF("key", "value")

scala> sql("SET spark.sql.caseSensitive=false")
scala> val caseInsensitiveCol = testData("KEY")

scala> sql("SET spark.sql.caseSensitive=true")
scala> testData.drop(caseInsensitiveCol).show()
// v2.4.3
+---+-----+
|key|value|
+---+-----+
|  a|    1|
+---+-----+

// master w/this pr
+-----+
|value|
+-----+
|    1|
+-----+

Member (@dongjoon-hyun, Jul 7, 2019):

  • First, it's not a correct(?) use case to switch the case sensitivity between declaring the variable and using the variable.
  • Second, I merged this to master with only a similar concern. :)
    Since this is a bug fix, I don't think we need documentation for this. And 3.0.0 is a good place for this behavior change due to the bug fix.

Member: ok, thanks for the check!

Member: Do you have any suggestions for your concern? If so, please share them with us~ You're welcome.

Member: NVM. I'm just a bit worried, not about the use cases, but about the test coverage.

Member: Yes. The second part of this test case (SQLConf.CASE_SENSITIVE.key -> "true") is weird to me.

Member: Anyway, we can keep it unchanged. Ideally, we do not need the second part.

Member: OK, I made a PR to drop that part: #25216. Anyway, thanks for the check!

      }.getMessage
      assert(e.contains("Cannot resolve column name"))
    }
  }

test("drop unknown column (no-op) with column reference") {
val col = Column("random")
val df = testData.drop(col)