[SPARK-41708][SQL][FOLLOWUP] Add a new replaceAll to SQLQueryTestHelper#replaceNotIncludedMsg to remove @hashCode
#39598
Conversation
Changed the title from "equals and hashCode to make CatalogFileIndex print the same results when using Scala 2.12 and 2.13" to "equals and hashCode of TableIdentifier to make CatalogFileIndex print the same results when using Scala 2.12 and 2.13"
    def this(table: String) = this(table, None, None)
    def this(table: String, database: Option[String]) = this(table, database, None)

    override def equals(obj: Any): Boolean = obj match {
cc @cloud-fan @HyukjinKwon @dongjoon-hyun @ulysses-you
The daily GitHub Actions build Master Scala 2.13 + Hadoop 3 + JDK 8 failed after #39468 was merged:
- https://github.com/apache/spark/actions/runs/3924877161
- https://github.com/apache/spark/actions/runs/3919886270
- https://github.com/apache/spark/actions/runs/3914162890
- https://github.com/apache/spark/actions/runs/3905252989
- https://github.com/apache/spark/actions/runs/3895875210
Are there any suggestions for a better solution? For example, overriding toString of CatalogFileIndex? What would be better to print?
Currently it prints org.apache.spark.sql.execution.datasources.CatalogFileIndex@hashCode
A better solution would be to normalize the output by removing the CatalogFileIndex@xxx part.
just print org.apache.spark.sql.execution.datasources.CatalogFileIndex?
We can replace CatalogFileIndex@xxx with CatalogFileIndex in SQLQueryTestHelper#replaceNotIncludedMsg to recover the test.
It had already been used.
dongjoon-hyun left a comment
Thank you for making a PR, @LuciferYang .
Could you make a separate test PR to scrub the hashCode instead of this follow-up, @LuciferYang ?
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
Lines 843 to 845 in 2062c74
    private def scrubOutIds(string: String): String =
      string.replaceAll("#\\d+", "#x")
        .replaceAll("operator id = \\d+", "operator id = #x")
I believe you can do that here in SQLQueryTestSuite.
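The scrub helper quoted above is small enough to check standalone. A minimal reproduction of what that normalization does (the plan string below is illustrative, not taken from Spark's golden files):

```scala
// Standalone sketch of the scrubOutIds idea quoted above: normalize
// auto-generated expression IDs and operator IDs so that plan strings
// compare stably across test runs.
def scrubOutIds(string: String): String =
  string.replaceAll("#\\d+", "#x")
    .replaceAll("operator id = \\d+", "operator id = #x")

val plan = "Project [key#42, val#137], operator id = 3"
println(scrubOutIds(plan)) // Project [key#x, val#x], operator id = #x
```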
ok, let me try this

== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218, [key, val]
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@5c30f6b5, [key, val]
let's not include the hashCode in the golden files.
Changed the title from "equals and hashCode of TableIdentifier to make CatalogFileIndex print the same results when using Scala 2.12 and 2.13" to "SQLQueryTestHelper#replaceNotIncludedMsg to remove hashCode"
Changed the title from "SQLQueryTestHelper#replaceNotIncludedMsg to remove hashCode" to "replaceAll to SQLQueryTestHelper#replaceNotIncludedMsg to remove hashCode"
    .replaceAll("Last Access.*", s"Last Access $notIncludedMsg")
    .replaceAll("Partition Statistics\t\\d+", s"Partition Statistics\t$notIncludedMsg")
    .replaceAll("\\*\\(\\d+\\) ", "*") // remove the WholeStageCodegen codegenStageIds
    .replaceAll("@[0-9a-z]+,", "@x,") // remove hashCode
Added the trailing , to avoid userinfo@spark.apache.org being replaced by mistake:
    userinfo@spark.apache.org
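The effect of the trailing comma can be checked in isolation. A sketch with illustrative strings: the hashCode in plan output is followed by a comma, while the e-mail address's [0-9a-z]+ run is stopped by a dot before any comma appears:

```scala
// "@hex," in plan output is rewritten, but the e-mail address is not:
// after "@spark" comes ".", which is neither [0-9a-z] nor ",".
val pattern = "@[0-9a-z]+,"

val plan  = "org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218, [key, val]"
val email = "Report issues to userinfo@spark.apache.org, please"

println(plan.replaceAll(pattern, "@x,"))  // ...CatalogFileIndex@x, [key, val]
println(email.replaceAll(pattern, "@x,")) // unchanged
```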
@cloud-fan @dongjoon-hyun @ulysses-you
Changed to add a new replaceAll action to SQLQueryTestHelper#replaceNotIncludedMsg that replaces @hashCode with @x.
Or maybe just remove it entirely? @x is useless anyway.
Changed this one to remove it entirely.
== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218, [key, val]
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@x, [key, val]
Shall we simply override toString in FileIndex? One idea is to produce className(rootPaths.mkString(","))
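The idea can be sketched with stand-in types; FileIndexLike and CatalogFileIndexStub below are illustrative names, not Spark's real FileIndex API:

```scala
// Illustrative stand-ins for the suggestion above: a toString that prints
// the class name plus the root paths, instead of the default Object@hashCode.
trait FileIndexLike {
  def rootPaths: Seq[String]
  override def toString: String =
    s"${getClass.getSimpleName}(${rootPaths.mkString(",")})"
}

final class CatalogFileIndexStub(val rootPaths: Seq[String]) extends FileIndexLike

val idx = new CatalogFileIndexStub(Seq("file:/warehouse/explain_temp5"))
println(idx) // CatalogFileIndexStub(file:/warehouse/explain_temp5)
```

Note that the printed root path still depends on the local environment, which is the objection raised below in the thread.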
-InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218, [key, val]
+InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5), [key, val]

After overriding toString in FileIndex, "Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218" disappeared from the result.
So, it means we cannot do that. Did I understand correctly? Given that, I'm +1 with the AS-IS PR.
@cloud-fan className(rootPaths.mkString(",")) will print out a path tied to the running environment, like
org.apache.spark.sql.execution.datasources.CatalogFileIndex(file:/Users/${userName}/spark-source/sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite/explain_temp5)
That does not seem friendly to developers.
So, it means we cannot do that. Did I understand correctly? Given that, I'm +1 with the AS-IS PR.
Only overriding the toString method cannot achieve the goal, and, as I said above, rootPaths.mkString(",") needs further cleaning to avoid differences between test execution paths, so I prefer the current approach.
Changed the title from "replaceAll to SQLQueryTestHelper#replaceNotIncludedMsg to remove hashCode" to "replaceAll to SQLQueryTestHelper#replaceNotIncludedMsg to remove @hashCode"
Since the PR builder only tests Scala 2.12, I tested manually with Scala 2.13. Merged to master to recover Scala 2.13 test coverage. Thank you, @LuciferYang, @HyukjinKwon, @ulysses-you, @cloud-fan.
    .replaceAll("Last Access.*", s"Last Access $notIncludedMsg")
    .replaceAll("Partition Statistics\t\\d+", s"Partition Statistics\t$notIncludedMsg")
    .replaceAll("\\*\\(\\d+\\) ", "*") // remove the WholeStageCodegen codegenStageIds
    .replaceAll("@[0-9a-z]+,", ",") // remove hashCode
My point is we will change it very soon. It's meaningless to put a default object string in the EXPLAIN result, as it's not user-facing. Having a path string is fine, and if you look at this method, we already normalized location strings for other places.
Oh, sorry. It seems that I merged too early. Could you make a PR if this is not enough?
@ulysses-you can you help to fix this EXPLAIN issue?
    .replaceAll(s"file:.*$clsName", s"Location $notIncludedMsg/{warehouse_dir}")

It seems that the problem is here. The current replaceAll matches greedily (the longest possible string), as in the following:
== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand file:/Users/yangjie01/SourceCode/git/spark-mine-12/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/explain_temp5, false, [val#x], Parquet, [path=file:/Users/yangjie01/SourceCode/git/spark-mine-12/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex(file:/Users/yangjie01/SourceCode/git/spark-mine-12/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/explain_temp5), [key, val]
+- Project [key#x, val#x]
+- SubqueryAlias spark_catalog.default.explain_temp4
+- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
After the replaceAll, it becomes
== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5), [key, val]
+- Project [key#x, val#x]
+- SubqueryAlias spark_catalog.default.explain_temp4
+- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
instead of
== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5, false, [val#x], Parquet, [path=Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex(fLocation [not included in comparison]/{warehouse_dir}/explain_temp5), [key, val]
+- Project [key#x, val#x]
+- SubqueryAlias spark_catalog.default.explain_temp4
+- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
Seems we can use non-greedy mode, by using file:.*?$clsName
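The greedy vs. non-greedy difference is easy to demonstrate; the paths and suite name below are illustrative:

```scala
// A greedy .* backtracks from the end of the string, so the match runs to
// the LAST occurrence of the class name, swallowing everything between two
// locations. The reluctant .*? stops at the first occurrence instead.
val clsName = "SQLQuerySuite"
val input   = s"file:/a/$clsName/t5, file:/b/$clsName/t5"

println(input.replaceAll(s"file:.*$clsName", "LOC"))  // LOC/t5
println(input.replaceAll(s"file:.*?$clsName", "LOC")) // LOC/t5, LOC/t5
```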
Seems OK. Opened another PR: #39610
@dongjoon-hyun Very sorry for giving misleading suggestions
### What changes were proposed in this pull request?

The main change of this pr is to fix suggestions of #39598 (comment):
- revert the change of #39598
- override `toString` method of `FileIndex` to print className with `rootPaths`
- change the replacement rule of `SQLQueryTestHelper#replaceNotIncludedMsg` method for `file://xxx/clsName/` path to replace one by one instead of replacing the longest match

### Why are the changes needed?

Fix suggestions of #39598 (comment)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Pass GitHub Actions
- Manually test `build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" -Pscala-2.13`, run successful

Closes #39610 from LuciferYang/SPARK-41708-FOLLOWUP-n.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?

The daily build GA Master Scala 2.13 + Hadoop 3 + JDK 8 failed after #39468 was merged; the failed tests and the failure error message are similar to the following:

The reason for the failure is that the result of CatalogFileIndex printed when using Scala 2.12 is different from Scala 2.13:
- When using Scala 2.12, the hashCode of CatalogFileIndex is 7d811218
- When using Scala 2.13, the hashCode of CatalogFileIndex is cdfa8472

So this adds a new replaceAll action to SQLQueryTestHelper#replaceNotIncludedMsg to remove @hashCode, to make CatalogFileIndex print the same results when using Scala 2.12 and 2.13.

Why are the changes needed?

Make the daily build GA Master Scala 2.13 + Hadoop 3 + JDK 8 run successfully.

Does this PR introduce any user-facing change?

No

How was this patch tested?