
[SPARK-41708][SQL][FOLLOWUP] Add a new replaceAll to SQLQueryTestHelper#replaceNotIncludedMsg to remove @hashCode #39598


Closed
wants to merge 6 commits

Conversation

LuciferYang
Contributor

@LuciferYang LuciferYang commented Jan 16, 2023

What changes were proposed in this pull request?

The daily build GA Master Scala 2.13 + Hadoop 3 + JDK 8 failed after #39468 was merged; the failing tests include:

  • org.apache.spark.sql.SQLQueryTestSuite
  • org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite

and the failure message is similar to the following:

2023-01-11T01:03:46.3478323Z 01:03:46.347 ERROR org.apache.spark.sql.SQLQueryTestSuite: Error using configs: 
2023-01-11T01:03:46.3677880Z [info]- explain.sql *** FAILED *** (348 milliseconds)
2023-01-11T01:03:46.3678710Z [info]  explain.sql
2023-01-11T01:03:46.3679479Z [info]  Expected "...es.CatalogFileIndex@[7d811218, [key, val]
2023-01-11T01:03:46.3680509Z [info]  +- Project [key#x, val#x]
2023-01-11T01:03:46.3681033Z [info]     +- SubqueryAlias spark_catalog.default.explain_temp4
2023-01-11T01:03:46.3684259Z [info]        +- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
2023-01-11T01:03:46.3684922Z [info]   
2023-01-11T01:03:46.3685766Z [info]  == Optimized Logical Plan ==
2023-01-11T01:03:46.3687590Z [info]  InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218, [key, val]
2023-01-11T01:03:46.3688465Z [info]  +- WriteFiles
2023-01-11T01:03:46.3690929Z [info]     +- Sort [val#x ASC NULLS FIRST], false
2023-01-11T01:03:46.3691387Z [info]        +- Project [key#x, empty2null(val#x) AS val#x]
2023-01-11T01:03:46.3692078Z [info]           +- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
2023-01-11T01:03:46.3692549Z [info]   
2023-01-11T01:03:46.3693443Z [info]  == Physical Plan ==
2023-01-11T01:03:46.3695233Z [info]  Execute InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218], [key, val]
2023-01-11T01:03:46.3696100Z [info]  +- Writ...", but got "...es.CatalogFileIndex@[cdfa8472, [key, val]
2023-01-11T01:03:46.3698327Z [info]  +- Project [key#x, val#x]
2023-01-11T01:03:46.3698881Z [info]     +- SubqueryAlias spark_catalog.default.explain_temp4
2023-01-11T01:03:46.3699680Z [info]        +- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
2023-01-11T01:03:46.3704986Z [info]   
2023-01-11T01:03:46.3705457Z [info]  == Optimized Logical Plan ==
2023-01-11T01:03:46.3717140Z [info]  InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@cdfa8472, [key, val]
2023-01-11T01:03:46.3718309Z [info]  +- WriteFiles
2023-01-11T01:03:46.3718964Z [info]     +- Sort [val#x ASC NULLS FIRST], false
2023-01-11T01:03:46.3719752Z [info]        +- Project [key#x, empty2null(val#x) AS val#x]
2023-01-11T01:03:46.3723046Z [info]           +- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
2023-01-11T01:03:46.3723598Z [info]   
2023-01-11T01:03:46.3726955Z [info]  == Physical Plan ==
2023-01-11T01:03:46.3728111Z [info]  Execute InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@cdfa8472], [key, val]
2023-01-11T01:03:46.3898445Z [info]  +- Writ..." Result did not match for query #21
2023-01-11T01:03:46.3902948Z [info]  EXPLAIN EXTENDED INSERT INTO TABLE explain_temp5 SELECT * FROM explain_temp4 (SQLQueryTestSuite.scala:495)
2023-01-11T01:03:46.3903881Z [info]  org.scalatest.exceptions.TestFailedException:
2023-01-11T01:03:46.3904492Z [info]  at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
2023-01-11T01:03:46.3905449Z [info]  at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
2023-01-11T01:03:46.3906493Z [info]  at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
2023-01-11T01:03:46.3907683Z [info]  at org.scalatest.Assertions.assertResult(Assertions.scala:847)
2023-01-11T01:03:46.3908243Z [info]  at org.scalatest.Assertions.assertResult$(Assertions.scala:842)
2023-01-11T01:03:46.3908812Z [info]  at org.scalatest.funsuite.AnyFunSuite.assertResult(AnyFunSuite.scala:1564)
2023-01-11T01:03:46.3910011Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runQueries$11(SQLQueryTestSuite.scala:495)
2023-01-11T01:03:46.3910611Z [info]  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
2023-01-11T01:03:46.3911163Z [info]  at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
2023-01-11T01:03:46.3912094Z [info]  at scala.collection.AbstractIterable.foreach(Iterable.scala:926)
2023-01-11T01:03:46.3912781Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runQueries$9(SQLQueryTestSuite.scala:486)
2023-01-11T01:03:46.3913371Z [info]  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
2023-01-11T01:03:46.3914237Z [info]  at org.scalatest.Assertions.withClue(Assertions.scala:1065)
2023-01-11T01:03:46.3915165Z [info]  at org.scalatest.Assertions.withClue$(Assertions.scala:1052)
2023-01-11T01:03:46.3915725Z [info]  at org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1564)
2023-01-11T01:03:46.3916341Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.runQueries(SQLQueryTestSuite.scala:462)
2023-01-11T01:03:46.3917485Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$35(SQLQueryTestSuite.scala:364)
2023-01-11T01:03:46.3918517Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$35$adapted(SQLQueryTestSuite.scala:362)
2023-01-11T01:03:46.3919102Z [info]  at scala.collection.immutable.List.foreach(List.scala:333)
2023-01-11T01:03:46.3919675Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.runTest(SQLQueryTestSuite.scala:362)
2023-01-11T01:03:46.3921754Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$createScalaTestCase$6(SQLQueryTestSuite.scala:269)
2023-01-11T01:03:46.3922358Z [info]  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
2023-01-11T01:03:46.3923784Z [info]  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
2023-01-11T01:03:46.3924473Z [info]  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
2023-01-11T01:03:46.3925286Z [info]  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
2023-01-11T01:03:46.3926199Z [info]  at org.scalatest.Transformer.apply(Transformer.scala:22)
2023-01-11T01:03:46.3927071Z [info]  at org.scalatest.Transformer.apply(Transformer.scala:20)
2023-01-11T01:03:46.3928583Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
2023-01-11T01:03:46.3929225Z [info]  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207)
2023-01-11T01:03:46.3930091Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
2023-01-11T01:03:46.3933329Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
2023-01-11T01:03:46.3933893Z [info]  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
2023-01-11T01:03:46.3934875Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
2023-01-11T01:03:46.3935479Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
2023-01-11T01:03:46.3936453Z [info]  at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:66)
2023-01-11T01:03:46.3937318Z [info]  at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
2023-01-11T01:03:46.3940707Z [info]  at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
2023-01-11T01:03:46.3941350Z [info]  at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:66)
2023-01-11T01:03:46.3941962Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
2023-01-11T01:03:46.3943332Z [info]  at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
2023-01-11T01:03:46.3944504Z [info]  at scala.collection.immutable.List.foreach(List.scala:333)
2023-01-11T01:03:46.3950194Z [info]  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
2023-01-11T01:03:46.3950748Z [info]  at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
2023-01-11T01:03:46.3951912Z [info]  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
2023-01-11T01:03:46.3952515Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
2023-01-11T01:03:46.3953476Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
2023-01-11T01:03:46.3954069Z [info]  at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
2023-01-11T01:03:46.3966445Z [info]  at org.scalatest.Suite.run(Suite.scala:1114)
2023-01-11T01:03:46.3967583Z [info]  at org.scalatest.Suite.run$(Suite.scala:1096)
2023-01-11T01:03:46.3968377Z [info]  at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
2023-01-11T01:03:46.3969537Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
2023-01-11T01:03:46.3970510Z [info]  at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
2023-01-11T01:03:46.3971298Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
2023-01-11T01:03:46.3972182Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
2023-01-11T01:03:46.3973529Z [info]  at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:66)
2023-01-11T01:03:46.3974433Z [info]  at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
2023-01-11T01:03:46.3977778Z [info]  at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
2023-01-11T01:03:46.3984781Z [info]  at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
2023-01-11T01:03:46.3985521Z [info]  at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:66)
2023-01-11T01:03:46.3986684Z [info]  at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
2023-01-11T01:03:46.3987264Z [info]  at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
2023-01-11T01:03:46.3987774Z [info]  at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
2023-01-11T01:03:46.3988269Z [info]  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2023-01-11T01:03:46.3989260Z [info]  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2023-01-11T01:03:46.3996895Z [info]  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2023-01-11T01:03:46.3997550Z [info]  at java.lang.Thread.run(Thread.java:750)

The reason for the failure is that the CatalogFileIndex string printed under Scala 2.12 differs from the one printed under Scala 2.13:

When using Scala 2.12, the hashCode of CatalogFileIndex is 7d811218

InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218, [key, val]

When using Scala 2.13, the hashCode of CatalogFileIndex is cdfa8472

InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@cdfa8472, [key, val]

So this PR adds a new replaceAll rule to SQLQueryTestHelper#replaceNotIncludedMsg that removes the @hashCode suffix, so that CatalogFileIndex prints the same result under Scala 2.12 and Scala 2.13.
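
For illustration, a minimal self-contained Scala sketch of the normalization (the object and method names here are hypothetical; the regex is the rule this PR merges):

// Hypothetical standalone demo of the hashCode scrubbing this PR adds.
object ScrubHashCodeDemo {
  // Drop an identity hash like "@7d811218" when it is followed by a comma.
  def scrub(plan: String): String = plan.replaceAll("@[0-9a-z]+,", ",")

  def main(args: Array[String]): Unit = {
    val scala212 = "...datasources.CatalogFileIndex@7d811218, [key, val]"
    val scala213 = "...datasources.CatalogFileIndex@cdfa8472, [key, val]"
    // Both normalize to "...datasources.CatalogFileIndex, [key, val]".
    assert(scrub(scala212) == scrub(scala213))
  }
}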

Why are the changes needed?

Make the daily build GA Master Scala 2.13 + Hadoop 3 + JDK 8 run successfully.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Pass GitHub Actions
  • Manual test with this PR:
gh pr checkout 39598
dev/change-scala-version.sh 2.13
build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z explain.sql" -Pscala-2.13
build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z explain-aqe.sql" -Pscala-2.13
build/sbt -Phive-thriftserver "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite -- -z explain.sql" -Pscala-2.13
build/sbt -Phive-thriftserver "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite -- -z explain-aqe.sql" -Pscala-2.13

@github-actions github-actions bot added the SQL label Jan 16, 2023
@LuciferYang LuciferYang changed the title [SPARK-41708][SQL][FOLLOWUP] Override equals and hashCode to make CatalogFileIndex print the same results when using Scala 2.12 and 2.13 [SPARK-41708][SQL][FOLLOWUP] Override equals and hashCode of TableIdentifier to make CatalogFileIndex print the same results when using Scala 2.12 and 2.13 Jan 16, 2023
@@ -104,6 +104,13 @@ case class TableIdentifier(table: String, database: Option[String], catalog: Opt

def this(table: String) = this(table, None, None)
def this(table: String, database: Option[String]) = this(table, database, None)

override def equals(obj: Any): Boolean = obj match {
Contributor Author

Currently it prints org.apache.spark.sql.execution.datasources.CatalogFileIndex@hashCode

Member

The better solution would be to normalize the output by removing the CatalogFileIndex@xxx part.

Contributor Author

just print org.apache.spark.sql.execution.datasources.CatalogFileIndex?

Contributor

We can replace CatalogFileIndex@xxx with CatalogFileIndex in SQLQueryTestHelper#replaceNotIncludedMsg to recover the test.

Contributor Author

@ulysses-you do you mean

changing to use SQLQueryTestHelper.replaceNotIncludedMsg(output)?

Contributor

it's already used there

Member
@dongjoon-hyun dongjoon-hyun left a comment

Thank you for making a PR, @LuciferYang .

Member
@dongjoon-hyun dongjoon-hyun left a comment

Could you make a separate test PR to scrub the hashCode instead of this follow-up, @LuciferYang ?

private def scrubOutIds(string: String): String =
  string.replaceAll("#\\d+", "#x")
    .replaceAll("operator id = \\d+", "operator id = #x")

I believe you can do that here in SQLQueryTestSuite.

@LuciferYang
Contributor Author

Could you make a separate test PR to scrub the hashCode instead of this follow-up, @LuciferYang ?

private def scrubOutIds(string: String): String =
  string.replaceAll("#\\d+", "#x")
    .replaceAll("operator id = \\d+", "operator id = #x")

I believe you can do that here in SQLQueryTestSuite.

ok, let me try this

@@ -1086,20 +1086,20 @@ struct<plan:string>
+- 'UnresolvedRelation [explain_temp4], [], false

== Analyzed Logical Plan ==
-InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218, [key, val]
+InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@5c30f6b5, [key, val]
Contributor

let's not include the hashCode in the golden files.

@LuciferYang LuciferYang changed the title [SPARK-41708][SQL][FOLLOWUP] Override equals and hashCode of TableIdentifier to make CatalogFileIndex print the same results when using Scala 2.12 and 2.13 [SPARK-41708][SQL][FOLLOWUP] Add a new replaceAll to SQLQueryTestHelper#replaceNotIncludedMsg to remove hashCode Jan 16, 2023
@@ -47,6 +47,7 @@ trait SQLQueryTestHelper {
.replaceAll("Last Access.*", s"Last Access $notIncludedMsg")
.replaceAll("Partition Statistics\t\\d+", s"Partition Statistics\t$notIncludedMsg")
.replaceAll("\\*\\(\\d+\\) ", "*") // remove the WholeStageCodegen codegenStageIds
.replaceAll("@[0-9a-z]+,", "@x,") // remove hashCode
Contributor Author

Added the trailing , so that userinfo@spark.apache.org is not replaced by mistake.
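
A small sketch (with illustrative strings) of why the trailing comma in the pattern matters:

val pattern = "@[0-9a-z]+,"
// The identity hash in the plan is followed by a comma, so it matches:
println("CatalogFileIndex@7d811218, [key, val]".replaceAll(pattern, "@x,"))
// -> CatalogFileIndex@x, [key, val]
// No comma follows the host in an address-like token, so it is left intact:
println("userinfo@spark.apache.org".replaceAll(pattern, "@x,"))
// -> userinfo@spark.apache.org (unchanged)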

Contributor Author
@LuciferYang LuciferYang Jan 16, 2023

@cloud-fan @dongjoon-hyun @ulysses-you

Changed to add a new replaceAll rule to SQLQueryTestHelper#replaceNotIncludedMsg that replaces @hashCode with @x.

Member

or maybe just remove it all? since @x is useless anyway.

Contributor Author

changed this one to remove it entirely

@@ -1086,20 +1086,20 @@ struct<plan:string>
+- 'UnresolvedRelation [explain_temp4], [], false

== Analyzed Logical Plan ==
-InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218, [key, val]
+InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@x, [key, val]
Contributor

shall we simply override toString in FileIndex? One idea is to produce className(rootPaths.mkString(","))
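
For reference, a minimal sketch of what such an override could look like (trait and types simplified; the real FileIndex exposes rootPaths: Seq[org.apache.hadoop.fs.Path]):

// Simplified stand-in for the FileIndex trait, for illustration only.
trait FileIndexLike {
  def rootPaths: Seq[String]
  // Print "ClassName(path1,path2)" instead of the default "ClassName@hashCode".
  override def toString: String =
    s"${getClass.getSimpleName}(${rootPaths.mkString(",")})"
}

class ToyIndex extends FileIndexLike {
  def rootPaths: Seq[String] = Seq("file:/warehouse/explain_temp5")
}
// new ToyIndex().toString == "ToyIndex(file:/warehouse/explain_temp5)"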

Contributor Author

-InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218, [key, val]
+InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5), [key, val]

After overriding toString in FileIndex, the segment Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex@7d811218 disappeared from the result

Member

So, it means we cannot do that. Did I understand correctly? Given that, I'm +1 with the AS-IS PR.

Contributor Author
@LuciferYang LuciferYang Jan 16, 2023

@cloud-fan className(rootPaths.mkString(",")) will print out a path tied to the running environment, like

org.apache.spark.sql.execution.datasources.CatalogFileIndex(file:/Users/${userName}/spark-source/sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite/explain_temp5)

That doesn't seem friendly to developers.

Contributor Author

So, it means we cannot do that. Did I understand correctly? Given that, I'm +1 with the AS-IS PR.

Overriding only the toString method cannot achieve the goal, and as I said above, rootPaths.mkString(",") needs further cleaning to avoid differences across test execution paths, so I prefer the current approach.

@LuciferYang LuciferYang changed the title [SPARK-41708][SQL][FOLLOWUP] Add a new replaceAll to SQLQueryTestHelper#replaceNotIncludedMsg to remove hashCode [SPARK-41708][SQL][FOLLOWUP] Add a new replaceAll to SQLQueryTestHelper#replaceNotIncludedMsg to remove @hashCode Jan 16, 2023
@dongjoon-hyun
Member

dongjoon-hyun commented Jan 16, 2023

Since the PR builder is only testing Scala 2.12, I tested manually with Scala 2.13. Merged to master to recover Scala 2.13 test coverage. Thank you, @LuciferYang , @HyukjinKwon , @ulysses-you, @cloud-fan .

@@ -47,6 +47,7 @@ trait SQLQueryTestHelper {
.replaceAll("Last Access.*", s"Last Access $notIncludedMsg")
.replaceAll("Partition Statistics\t\\d+", s"Partition Statistics\t$notIncludedMsg")
.replaceAll("\\*\\(\\d+\\) ", "*") // remove the WholeStageCodegen codegenStageIds
.replaceAll("@[0-9a-z]+,", ",") // remove hashCode
Contributor
@cloud-fan cloud-fan Jan 16, 2023

My point is that we will change it very soon. It's meaningless to put a default object string in the EXPLAIN result, as it's not user-facing. Having a path string is fine, and if you look at this method, we already normalize location strings in other places.

Member

Oh, sorry. It seems that I merged too early. Could you make a PR if this is not enough?

Contributor

@ulysses-you can you help to fix this EXPLAIN issue?

Contributor Author
@LuciferYang LuciferYang Jan 16, 2023

.replaceAll(s"file:.*$clsName", s"Location $notIncludedMsg/{warehouse_dir}")

It seems that the problem is here. The current replaceAll matches greedily, taking the longest possible match, as in the following:

== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand file:/Users/yangjie01/SourceCode/git/spark-mine-12/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/explain_temp5, false, [val#x], Parquet, [path=file:/Users/yangjie01/SourceCode/git/spark-mine-12/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex(file:/Users/yangjie01/SourceCode/git/spark-mine-12/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/explain_temp5), [key, val]
+- Project [key#x, val#x]
   +- SubqueryAlias spark_catalog.default.explain_temp4
      +- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet

After the replaceAll, it becomes

== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5), [key, val]
+- Project [key#x, val#x]
   +- SubqueryAlias spark_catalog.default.explain_temp4
      +- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet

instead of

== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5, false, [val#x], Parquet, [path=Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex(fLocation [not included in comparison]/{warehouse_dir}/explain_temp5), [key, val]
+- Project [key#x, val#x]
   +- SubqueryAlias spark_catalog.default.explain_temp4
      +- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet

Contributor
@cloud-fan cloud-fan Jan 16, 2023

seems we can use non-greedy mode, by using file:.*?$clsName
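
A quick sketch (with simplified, hypothetical paths) of the greedy vs. non-greedy difference:

val plan = "file:/a/explain_temp5, [path=file:/b/explain_temp5], CatalogFileIndex(file:/c/explain_temp5)"
// Greedy ".*" runs to the LAST "explain_temp5", collapsing everything in between:
println(plan.replaceAll("file:.*explain_temp5", "LOC"))
// -> LOC)
// Non-greedy ".*?" stops at the first "explain_temp5" after each "file:",
// so each location is replaced individually:
println(plan.replaceAll("file:.*?explain_temp5", "LOC"))
// -> LOC, [path=LOC], CatalogFileIndex(LOC)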

Contributor Author

Seems OK. Another PR: #39610

Contributor Author

@dongjoon-hyun Very sorry for giving misleading suggestions

dongjoon-hyun pushed a commit that referenced this pull request Jan 16, 2023
### What changes were proposed in this pull request?
The main change of this PR is to address the suggestions in #39598 (comment):

- revert the change of #39598
- override `toString` method of `FileIndex` to print className with `rootPaths`
- change the replacement rule in the `SQLQueryTestHelper#replaceNotIncludedMsg` method for `file://xxx/clsName/` paths to replace each occurrence (non-greedy matching) instead of the longest match

### Why are the changes needed?
Address the suggestions in #39598 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Manually tested with `build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" -Pscala-2.13`; it ran successfully

Closes #39610 from LuciferYang/SPARK-41708-FOLLOWUP-n.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>