[SPARK-17073] [SQL] [FOLLOWUP] generate column-level statistics #15360
Conversation
Test build #66378 has finished for PR 15360 at commit.

Thank you! Will review it tonight or tomorrow morning.
// non ascii characters are not allowed in the source code, so we disable the scalastyle.
val columnGroups: Seq[(String, String)] = Seq(("c1", "C1"), ("列c", "列C"))
// scalastyle:on
columnGroups.foreach { case (column1, column2) =>
Could you create a separate function for the following checking logic? Then you can have two test cases without duplicated code.
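The suggested refactoring could look like the sketch below (the helper name and its checks are hypothetical, not the PR's actual code): the shared checking logic moves into one function, so the ASCII and non-ASCII column pairs can become two separate test cases.

```scala
object CaseSensitiveColumnsSketch extends App {
  // Hypothetical extracted helper holding the shared checking logic.
  def checkColumnPair(column1: String, column2: String): Unit = {
    // Stand-in check: the two names must differ only by case.
    assert(column1 != column2)
    assert(column1.toLowerCase == column2.toLowerCase)
  }

  checkColumnPair("c1", "C1")   // ASCII pair
  checkColumnPair("列c", "列C") // non-ASCII pair
}
```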
@@ -62,7 +62,7 @@ case class AnalyzeColumnCommand(
   val statistics = Statistics(
     sizeInBytes = newTotalSize,
     rowCount = Some(rowCount),
-    colStats = columnStats ++ catalogTable.stats.map(_.colStats).getOrElse(Map()))
+    colStats = catalogTable.stats.map(_.colStats).getOrElse(Map()) ++ columnStats)
Is this a bug exposed by the newly added test case?
yes
:) Improving the test case coverage is important.
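For context on why the reordering fixes the bug: Scala's `Map#++` keeps the right-hand operand's value when keys collide, so the newly computed column stats must appear on the right to override previously stored ones. A minimal sketch with placeholder stat values:

```scala
object MergeOrderSketch extends App {
  // Placeholder stats; in the real code these are ColumnStat values.
  val existingStats = Map("c1" -> "oldStat", "c2" -> "oldStat")
  val computedStats = Map("c1" -> "newStat")

  // Buggy order: the stale entry for "c1" wins.
  val buggy = computedStats ++ existingStats
  // Fixed order: newly computed stats override the existing ones.
  val fixed = existingStats ++ computedStats

  assert(buggy("c1") == "oldStat")
  assert(fixed("c1") == "newStat")
  assert(fixed("c2") == "oldStat") // untouched stats are preserved
}
```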
@@ -90,8 +90,9 @@ case class AnalyzeColumnCommand(
     }
   }
   if (duplicatedColumns.nonEmpty) {
-    logWarning(s"Duplicated columns ${duplicatedColumns.mkString("(", ", ", ")")} detected " +
-      s"when analyzing columns ${columnNames.mkString("(", ", ", ")")}, ignoring them.")
+    logWarning("Duplicate column names were detected in `ANALYZE TABLE` statement. " +
`detected` -> `deduplicated`
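For illustration, the deduplication that precedes this warning can be sketched as a single fold that keeps the first occurrence of each requested name (exact-match only here; the real command also has to honor case sensitivity):

```scala
object DedupColumnsSketch extends App {
  val columnNames = Seq("key", "value", "key")

  // Keep the first occurrence of each column name; collect repeats so they
  // can be reported in the warning message.
  val (unique, duplicated) =
    columnNames.foldLeft((Vector.empty[String], Vector.empty[String])) {
      case ((u, d), name) =>
        if (u.contains(name)) (u, d :+ name) else (u :+ name, d)
    }

  assert(unique == Vector("key", "value"))
  assert(duplicated == Vector("key"))
}
```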
  }
}

test("test refreshing statistics of cached data source table") {
Please leave comments explaining which DDL commands trigger the refresh; otherwise, reviewers might be confused about what this test case is doing.
  rsd = spark.sessionState.conf.ndvMaxError)

sql(s"INSERT INTO $tableName SELECT 2")
sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
What is the purpose of this DDL?
sql(s"INSERT INTO $tableName SELECT 2")
sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS key")
Both of the above DDL commands will call `refreshTable` with the same table name, right? If the source code removed either `refreshTable` call, the test case would still pass, right?
Yeah, I'll split these two commands into two separate test cases, for table stats and column stats respectively.
Test build #66503 has finished for PR 15360 at commit.

We need a test case for Hive serde tables. So far, I still have not found any test case that covers them.

@gatorsmile Oh, I thought by "hive serde tables" you meant tables stored in the Hive metastore.

Test build #66581 has finished for PR 15360 at commit.

retest this please

Test build #66586 has finished for PR 15360 at commit.

Will review this tonight. Thanks!
val column1 = columnName.toLowerCase
val column2 = columnName.toUpperCase
withSQLConf("spark.sql.caseSensitive" -> "true") {
  sql(s"CREATE TABLE $tableName (`$column1` int, `$column2` double) USING PARQUET")
We hit a bug here... not caused by your PR, but this test case exposes it. No need to worry about it; I will fix it.
What bug? Please let me know when that bug-fix PR is sent. :)
We should not attempt to create a Hive-compatible table in this case. It always fails because of column names.
Does it cause any problems? The logic to create a Hive-compatible table is quite conservative: we try to save into the Hive metastore first and, if that fails, fall back to the Spark-specific format.
It outputs a warning including an exception, and the test can complete successfully.
WARN org.apache.spark.sql.hive.HiveExternalCatalog: Could not persist `default`.`tbl` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name c1 in the table definition.
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
...
  }
}

private def checkCaseSensitiveColStats(columnName: String): Unit = {
Please add a comment to briefly explain the test case scenario. Thanks!
@@ -62,7 +62,7 @@ case class AnalyzeColumnCommand(
   val statistics = Statistics(
     sizeInBytes = newTotalSize,
     rowCount = Some(rowCount),
-    colStats = columnStats ++ catalogTable.stats.map(_.colStats).getOrElse(Map()))
+    colStats = catalogTable.stats.map(_.colStats).getOrElse(Map()) ++ columnStats)
Could you leave a code comment here to emphasize it? I am just afraid this might be modified without notice. Newly computed stats should override the existing stats.
@@ -358,50 +358,180 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils
   }
 }

-  test("generate column-level statistics and load them from hive metastore") {
+  test("test refreshing table stats of cached data source table by `ANALYZE TABLE` statement") {
Could you deduplicate the two test cases, `refreshing table stats` and `refreshing column stats`, by calling the same common function?
@gatorsmile rebased and updated.
@@ -358,53 +358,189 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils
   }
 }

-  test("generate column-level statistics and load them from hive metastore") {
+  private def statsBeforeAfterUpdate(isAnalyzeTable: Boolean): (Statistics, Statistics) = {
`statsBeforeAfterUpdate` -> `getStatsBeforeAfterAnalyzeCommand`
`ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS` is also an `ANALYZE TABLE` statement, so the input param name is confusing. How about `isAnalyzeTable` -> `isAnalyzeColumns`?
@gatorsmile OK
cc @cloud-fan I do not have any more comments. Could you check this please? Thanks!
Test build #66799 has finished for PR 15360 at commit.

Test build #66807 has finished for PR 15360 at commit.
Force-pushed from d782c14 to 30ac539.
resolve conflicts

Test build #66935 has finished for PR 15360 at commit.

Test build #66937 has finished for PR 15360 at commit.
val (statsBeforeUpdate, statsAfterUpdate) = getStatsBeforeAfterUpdate(isAnalyzeColumns = false)

assert(statsBeforeUpdate.sizeInBytes > 0)
assert(statsBeforeUpdate.rowCount.contains(1))
nit: we should not use `Option` as a collection; compare it more explicitly: `statsBeforeUpdate.rowCount == Some(1)`. BTW, `Option.contains` is not in Scala 2.10.
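The two styles side by side, plus the portability point (`Option.contains` only exists since Scala 2.11):

```scala
object OptionStyleSketch extends App {
  val rowCount: Option[BigInt] = Some(BigInt(1))

  // Explicit comparison: works on Scala 2.10 and states the intent directly.
  assert(rowCount == Some(BigInt(1)))

  // Collection-style access: `Option.contains` was only added in Scala 2.11.
  assert(rowCount.contains(BigInt(1)))
}
```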
  rsd = spark.sessionState.conf.ndvMaxError)
}

private def dataAndColStats(): (DataFrame, Seq[(StructField, ColumnStat)]) = {
this method doesn't take any parameters, so its result is static. Can we just create 2 fields for them? e.g.

private lazy val testDataFrame = ...
private lazy val expectedStats = ...
They share some common values e.g. intSeq, stringSeq... so I put them in a single method.
then can we

private lazy val (testDataFrame, expectedStats) = {
  ...
}
that's good, thanks!
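The agreed-upon pattern, sketched with placeholder data: a lazily evaluated tuple pattern lets both fields share the common intermediate values (the suite's `intSeq`, `stringSeq`, etc.) while the block still runs only once, on first access.

```scala
object LazyPairSketch {
  // Both fields come from one block, so intermediates are shared; nothing
  // runs until one of the fields is first accessed.
  lazy val (testData, expectedStats) = {
    val intSeq = Seq(1, 2, 3) // placeholder for the suite's shared test data
    (intSeq, Map("min" -> intSeq.min, "max" -> intSeq.max))
  }
}

object LazyPairDemo extends App {
  assert(LazyPairSketch.testData == Seq(1, 2, 3))
  assert(LazyPairSketch.expectedStats("max") == 3)
}
```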
Test build #66955 has finished for PR 15360 at commit.

LGTM, merging to master!
What changes were proposed in this pull request?

This PR adds some test cases for statistics: case-sensitive column names, non-ASCII column names, and refreshing tables; it also improves some documentation.

How was this patch tested?

Added test cases.

Author: wangzhenhua <wangzhenhua@huawei.com>

Closes apache#15360 from wzhfy/colStats2.