[SPARK-41708][SQL][FOLLOWUP] Add a new replaceAll to SQLQueryTestHelper#replaceNotIncludedMsg to remove @hashCode #39598
Conversation
Override equals and hashCode of TableIdentifier to make CatalogFileIndex print the same results when using Scala 2.12 and 2.13
@@ -104,6 +104,13 @@ case class TableIdentifier(table: String, database: Option[String], catalog: Opt

def this(table: String) = this(table, None, None)
def this(table: String, database: Option[String]) = this(table, database, None)

override def equals(obj: Any): Boolean = obj match {
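The diff is truncated here. As a rough illustration of the idea only, here is a hypothetical Java sketch (not the actual Spark code; the class and field names are stand-ins): making equality and hashing value-based over the identifier's fields produces a hash that is reproducible for equal values, unlike the identity hash that shows up as @7d811218.

```java
import java.util.Objects;

// Hypothetical stand-in for TableIdentifier(table, database, catalog).
final class TableId {
    final String table;
    final String database; // may be null, standing in for Option[String]
    final String catalog;  // may be null

    TableId(String table, String database, String catalog) {
        this.table = table;
        this.database = database;
        this.catalog = catalog;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof TableId)) return false;
        TableId o = (TableId) obj;
        return Objects.equals(table, o.table)
                && Objects.equals(database, o.database)
                && Objects.equals(catalog, o.catalog);
    }

    @Override
    public int hashCode() {
        // String.hashCode is fixed by the Java spec, so this value is
        // reproducible for equal field values, unlike the default
        // identity hash.
        return Objects.hash(table, database, catalog);
    }
}
```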
cc @cloud-fan @HyukjinKwon @dongjoon-hyun @ulysses-you
The daily build GA Master Scala 2.13 + Hadoop 3 + JDK 8 failed after #39468 merged:
- https://github.com/apache/spark/actions/runs/3924877161
- https://github.com/apache/spark/actions/runs/3919886270
- https://github.com/apache/spark/actions/runs/3914162890
- https://github.com/apache/spark/actions/runs/3905252989
- https://github.com/apache/spark/actions/runs/3895875210
Are there any suggestions for better solutions? For example, override toString of CatalogFileIndex? What is better to print?
The current printout is org.apache.spark.sql.execution.datasources.CatalogFileIndex@hashCode
The better solution would be to normalize the output by removing the CatalogFileIndex@xxx part.
just print org.apache.spark.sql.execution.datasources.CatalogFileIndex?
We can replace CatalogFileIndex@xxx with CatalogFileIndex in SQLQueryTestHelper#replaceNotIncludedMsg to recover the test.
it had already been used
Thank you for making a PR, @LuciferYang .
Could you make a separate test PR to scrub the hashCode instead of this follow-up, @LuciferYang ?
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
Lines 843 to 845 in 2062c74

private def scrubOutIds(string: String): String =
  string.replaceAll("#\\d+", "#x")
    .replaceAll("operator id = \\d+", "operator id = #x")
I believe you can do that here in SQLQueryTestSuite.
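For a concrete feel of what scrubOutIds does, here is a minimal Java sketch (Scala's String.replaceAll delegates to java.util.regex, so the behavior is identical; the plan string is an invented example):

```java
public class ScrubIdsDemo {
    public static void main(String[] args) {
        String plan = "Project [key#123, val#45] operator id = 7";
        // First replace expression IDs like #123 with #x,
        // then normalize operator IDs.
        String scrubbed = plan
                .replaceAll("#\\d+", "#x")
                .replaceAll("operator id = \\d+", "operator id = #x");
        System.out.println(scrubbed); // Project [key#x, val#x] operator id = #x
    }
}
```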
ok, let me try this
@@ -1086,20 +1086,20 @@ struct<plan:string>
+- 'UnresolvedRelation [explain_temp4], [], false

== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218, [key, val]
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@5c30f6b5, [key, val]
let's not include the hashCode in the golden files.
changed the title: equals and hashCode of TableIdentifier to make CatalogFileIndex print the same results when using Scala 2.12 and 2.13 → SQLQueryTestHelper#replaceNotIncludedMsg to remove hashCode
changed the title: SQLQueryTestHelper#replaceNotIncludedMsg to remove hashCode → replaceAll to SQLQueryTestHelper#replaceNotIncludedMsg to remove hashCode
@@ -47,6 +47,7 @@ trait SQLQueryTestHelper {
.replaceAll("Last Access.*", s"Last Access $notIncludedMsg")
.replaceAll("Partition Statistics\t\\d+", s"Partition Statistics\t$notIncludedMsg")
.replaceAll("\\*\\(\\d+\\) ", "*") // remove the WholeStageCodegen codegenStageIds
.replaceAll("@[0-9a-z]+,", "@x,") // remove hashCode
Add , to avoid userinfo@spark.apache.org being replaced by mistake.
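The effect of that rule, including the comma guard, can be sketched as follows (a minimal Java sketch; Scala's String.replaceAll delegates to java.util.regex, so the behavior matches, and the plan strings are invented examples):

```java
public class HashCodeScrubDemo {
    public static void main(String[] args) {
        String rule = "@[0-9a-z]+,";
        // Identity hashes appear as lowercase hex followed by a comma
        // in the plan output, so they match the pattern.
        String plan = "CatalogFileIndex@7d811218, [key, val]";
        System.out.println(plan.replaceAll(rule, "@x,")); // CatalogFileIndex@x, [key, val]
        // The trailing comma in the pattern keeps email-like tokens
        // intact, because "@spark" is followed by ".", not ",".
        String email = "userinfo@spark.apache.org in a plan";
        System.out.println(email.replaceAll(rule, "@x,")); // unchanged
    }
}
```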
@cloud-fan @dongjoon-hyun @ulysses-you Changed to add a new replaceAll action to SQLQueryTestHelper#replaceNotIncludedMsg to replace @hashCode with @x.
or maybe just remove it all? since @x is useless anyway.
changed this one to remove it all
@@ -1086,20 +1086,20 @@ struct<plan:string>
+- 'UnresolvedRelation [explain_temp4], [], false

== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218, [key, val]
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@x, [key, val]
shall we simply override toString in FileIndex? One idea is to produce className(rootPaths.mkString(","))
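A hypothetical sketch of that toString idea (in Java for self-containment; the real FileIndex is a Scala trait, and the class name here is a stand-in):

```java
import java.util.List;

// Hypothetical stand-in for a FileIndex implementation.
class FileIndexLike {
    private final List<String> rootPaths;

    FileIndexLike(List<String> rootPaths) {
        this.rootPaths = rootPaths;
    }

    @Override
    public String toString() {
        // className(rootPaths joined by ","), instead of the default
        // className@identityHash.
        return getClass().getSimpleName() + "(" + String.join(",", rootPaths) + ")";
    }
}
```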
-InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218, [key, val]
+InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5), [key, val]
After overriding toString in FileIndex, the part , Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218 disappeared from the result.
So, it means we cannot do that. Did I understand correctly? Given that, I'm +1 with the AS-IS PR.
@cloud-fan className(rootPaths.mkString(",")) will print out a path tied to the running environment, like
org.apache.spark.sql.execution.datasources.CatalogFileIndex(file:/Users/${userName}/spark-source/sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite/explain_temp5)
That seems unfriendly to developers.
So, it means we cannot do that. Did I understand correctly? Given that, I'm +1 with the AS-IS PR.
Only overriding the toString method cannot achieve the goal, and as I said above, rootPaths.mkString(",") needs further cleaning to avoid differences in test execution paths, so I prefer the current way.
changed the title: replaceAll to SQLQueryTestHelper#replaceNotIncludedMsg to remove hashCode → replaceAll to SQLQueryTestHelper#replaceNotIncludedMsg to remove @hashCode
Since the PR builder only tests Scala 2.12, I tested manually with Scala 2.13. Merged to master to recover Scala 2.13 test coverage. Thank you, @LuciferYang, @HyukjinKwon, @ulysses-you, @cloud-fan.
@@ -47,6 +47,7 @@ trait SQLQueryTestHelper {
.replaceAll("Last Access.*", s"Last Access $notIncludedMsg")
.replaceAll("Partition Statistics\t\\d+", s"Partition Statistics\t$notIncludedMsg")
.replaceAll("\\*\\(\\d+\\) ", "*") // remove the WholeStageCodegen codegenStageIds
.replaceAll("@[0-9a-z]+,", ",") // remove hashCode
My point is we will change it very soon. It's meaningless to put a default object string in the EXPLAIN result, as it's not user-facing. Having a path string is fine, and if you look at this method, we already normalized location strings for other places.
Oh, sorry. It seems that I merged too early. Could you make a PR if this is not enough?
@ulysses-you can you help to fix this EXPLAIN issue?
.replaceAll(s"file:.*$clsName", s"Location $notIncludedMsg/{warehouse_dir}")
It seems that the problem is here. The current replaceAll greedily matches the longest possible string, for example:
== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand file:/Users/yangjie01/SourceCode/git/spark-mine-12/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/explain_temp5, false, [val#x], Parquet, [path=file:/Users/yangjie01/SourceCode/git/spark-mine-12/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex(file:/Users/yangjie01/SourceCode/git/spark-mine-12/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/explain_temp5), [key, val]
+- Project [key#x, val#x]
+- SubqueryAlias spark_catalog.default.explain_temp4
+- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
After the replaceAll, it becomes
== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5), [key, val]
+- Project [key#x, val#x]
+- SubqueryAlias spark_catalog.default.explain_temp4
+- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
instead of
== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5, false, [val#x], Parquet, [path=Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex(fLocation [not included in comparison]/{warehouse_dir}/explain_temp5), [key, val]
+- Project [key#x, val#x]
+- SubqueryAlias spark_catalog.default.explain_temp4
+- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
seems we can use non-greedy mode, by using file:.*?$clsName
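The greedy vs non-greedy difference can be demonstrated in isolation (a minimal Java sketch; both Scala and Java use java.util.regex, where .*? is the reluctant form, and the path strings are invented examples):

```java
public class GreedyDemo {
    public static void main(String[] args) {
        String s = "file:/a/Suite/x, path=file:/b/Suite/y";
        // Greedy .* swallows everything up to the LAST "Suite",
        // collapsing both paths into a single match.
        System.out.println(s.replaceAll("file:.*Suite", "LOC")); // LOC/y
        // Reluctant .*? stops at the FIRST "Suite", so each path
        // is normalized independently.
        System.out.println(s.replaceAll("file:.*?Suite", "LOC")); // LOC/x, path=LOC/y
    }
}
```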
Seems OK. Another PR: #39610
@dongjoon-hyun Very sorry for giving misleading suggestions
### What changes were proposed in this pull request?
The main change of this PR is to fix suggestions of #39598 (comment):
- revert the change of #39598
- override `toString` method of `FileIndex` to print className with `rootPaths`
- change the replacement rule of `SQLQueryTestHelper#replaceNotIncludedMsg` method for `file://xxx/clsName/` path to replace one by one instead of replacing the longest match

### Why are the changes needed?
Fix suggestions of #39598 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Manually test `build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" -Pscala-2.13`, which runs successfully

Closes #39610 from LuciferYang/SPARK-41708-FOLLOWUP-n.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
The daily build GA Master Scala 2.13 + Hadoop 3 + JDK 8 failed after #39468 merged, and the failure error message is similar to the following:
The reason for the failure is that the result of CatalogFileIndex printed when using Scala 2.12 is different from Scala 2.13:
- When using Scala 2.12, the hashCode of CatalogFileIndex is 7d811218
- When using Scala 2.13, the hashCode of CatalogFileIndex is cdfa8472
So this adds a new replaceAll action to SQLQueryTestHelper#replaceNotIncludedMsg to remove @hashCode, to make CatalogFileIndex print the same results when using Scala 2.12 and 2.13.
Why are the changes needed?
Make the daily build GA Master Scala 2.13 + Hadoop 3 + JDK 8 run successfully.
Does this PR introduce any user-facing change?
No
How was this patch tested?