
[SPARK-17073] [SQL] [FOLLOWUP] generate column-level statistics #15360


Closed
wzhfy wants to merge 2 commits into apache:master from wzhfy:colStats2

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Oct 5, 2016

What changes were proposed in this pull request?

This PR adds test cases for statistics: case-sensitive column names, non-ASCII column names, and refreshing tables; it also improves some documentation.

How was this patch tested?

Added test cases.
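
For reference, a hedged sketch of the kind of case-sensitivity test added here. The table name, data, and the way statistics are read back are illustrative assumptions rather than the PR's exact code, and the snippet assumes the usual Spark SQL test harness with sql, withTable, and withSQLConf in scope:

import org.apache.spark.sql.catalyst.TableIdentifier

// Illustrative sketch only -- names and assertions are assumptions, not the PR's code.
test("column-level statistics with case-sensitive column names (sketch)") {
  val tableName = "case_sensitive_tbl"  // hypothetical table name
  withSQLConf("spark.sql.caseSensitive" -> "true") {
    withTable(tableName) {
      sql(s"CREATE TABLE $tableName (`col` INT, `COL` DOUBLE) USING PARQUET")
      sql(s"INSERT INTO $tableName SELECT 1, 2.0")
      // Analyze only the lower-case column; with case sensitivity enabled,
      // the upper-case column should not receive statistics.
      sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS `col`")
      val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName))
      assert(table.stats.exists(_.colStats.keySet == Set("col")))
    }
  }
}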

@wzhfy
Contributor Author

wzhfy commented Oct 5, 2016

cc @cloud-fan @gatorsmile

@SparkQA

SparkQA commented Oct 5, 2016

Test build #66378 has finished for PR 15360 at commit 0ad7c88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thank you! Will review it tonight or tomorrow morning.

// non ascii characters are not allowed in the source code, so we disable the scalastyle.
val columnGroups: Seq[(String, String)] = Seq(("c1", "C1"), ("列c", "列C"))
// scalastyle:on
columnGroups.foreach { case (column1, column2) =>
Member

Could you create a separate function for the following checking logic? Then you can have two test cases without duplicated code.

@@ -62,7 +62,7 @@ case class AnalyzeColumnCommand(
     val statistics = Statistics(
       sizeInBytes = newTotalSize,
       rowCount = Some(rowCount),
-      colStats = columnStats ++ catalogTable.stats.map(_.colStats).getOrElse(Map()))
+      colStats = catalogTable.stats.map(_.colStats).getOrElse(Map()) ++ columnStats)
Member

Is this a bug exposed by the newly added test case?

Contributor Author

yes

Member

:) Improving the test case coverage is important.
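
For context: Scala's Map ++ is right-biased, so values from the right operand win on duplicate keys. A tiny illustration (not from the PR) of why the new operand order lets freshly computed column stats override previously stored ones:

// Right operand wins on key collisions.
val storedStats = Map("c1" -> "oldStat")
val computedStats = Map("c1" -> "newStat")
assert((storedStats ++ computedStats) == Map("c1" -> "newStat"))  // fixed order: new stats win
assert((computedStats ++ storedStats) == Map("c1" -> "oldStat"))  // old order: stored stats win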

@@ -90,8 +90,9 @@ case class AnalyzeColumnCommand(
         }
       }
       if (duplicatedColumns.nonEmpty) {
-        logWarning(s"Duplicated columns ${duplicatedColumns.mkString("(", ", ", ")")} detected " +
-          s"when analyzing columns ${columnNames.mkString("(", ", ", ")")}, ignoring them.")
+        logWarning("Duplicate column names were detected in `ANALYZE TABLE` statement. " +
Member

detected -> deduplicated

}
}

test("test refreshing statistics of cached data source table") {
Member

Please leave comments explaining which DDL commands trigger the refresh; otherwise, reviewers might be confused about what this test case is doing.

rsd = spark.sessionState.conf.ndvMaxError)

sql(s"INSERT INTO $tableName SELECT 2")
sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
Member

What is the purpose of this DDL?


sql(s"INSERT INTO $tableName SELECT 2")
sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS key")
Member

Both of the above DDL statements will call refreshTable with the same table name, right? If the source code removed either refreshTable call, the test case would still pass, right?

Contributor Author

Yeah, I'll split these two commands into two separate test cases, for table stats and column stats respectively.

@SparkQA

SparkQA commented Oct 7, 2016

Test build #66503 has finished for PR 15360 at commit e7979c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

We need a test case for Hive serde tables. So far, I still haven't found any test case that covers them.

@wzhfy
Contributor Author

wzhfy commented Oct 8, 2016

@gatorsmile Oh, I thought by "hive serde tables" you meant tables stored in the Hive metastore.
Let me create test cases for both data source tables and Hive serde tables in sql/hive.

@SparkQA

SparkQA commented Oct 8, 2016

Test build #66581 has finished for PR 15360 at commit 2ee4252.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Oct 8, 2016

retest this please

@SparkQA

SparkQA commented Oct 9, 2016

Test build #66586 has finished for PR 15360 at commit 2ee4252.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Will review this tonight. Thanks!

val column1 = columnName.toLowerCase
val column2 = columnName.toUpperCase
withSQLConf("spark.sql.caseSensitive" -> "true") {
sql(s"CREATE TABLE $tableName (`$column1` int, `$column2` double) USING PARQUET")
Member

We hit a bug here... It's not caused by your PR, but this test case exposes it. No need to worry about it; I will fix it.

Contributor Author

What bug? Please let me know when the bug-fix PR is sent. :)

Member

We should not attempt to create a Hive-compatible table in this case. It always fails because of column names.

Contributor

Does it cause any problems? The logic to create a Hive-compatible table is quite conservative: we try to save into the Hive metastore first, and if that fails, we fall back to the Spark-specific format.

Contributor Author

@wzhfy wzhfy Oct 14, 2016

It outputs a warning including an exception, but the test still completes successfully.

WARN org.apache.spark.sql.hive.HiveExternalCatalog: Could not persist `default`.`tbl` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name c1 in the table definition.
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
...
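
A hedged sketch of the conservative try-then-fall-back behavior described above; the real logic lives in HiveExternalCatalog and is more involved, so the function and parameter names below are assumptions:

import scala.util.control.NonFatal

// Hypothetical sketch, assuming a class that mixes in org.apache.spark.internal.Logging.
def persistTable(saveHiveCompatible: () => Unit, saveSparkSqlSpecific: () => Unit): Unit = {
  try {
    saveHiveCompatible()  // try the Hive-compatible representation first
  } catch {
    case NonFatal(e) =>
      // e.g. Hive rejects "Duplicate column name c1 in the table definition".
      logWarning("Could not persist the table in a Hive compatible way. " +
        "Persisting it into Hive metastore in Spark SQL specific format.", e)
      saveSparkSqlSpecific()  // fall back to the Spark SQL specific format
  }
}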

}
}

private def checkCaseSensitiveColStats(columnName: String): Unit = {
Member

Please add a comment to briefly explain the test case scenario. Thanks!

@@ -62,7 +62,7 @@ case class AnalyzeColumnCommand(
     val statistics = Statistics(
       sizeInBytes = newTotalSize,
       rowCount = Some(rowCount),
-      colStats = columnStats ++ catalogTable.stats.map(_.colStats).getOrElse(Map()))
+      colStats = catalogTable.stats.map(_.colStats).getOrElse(Map()) ++ columnStats)
Member

Could you leave a code comment here to emphasize it? I am just afraid this might be modified without notice. Newly computed stats should override the existing stats.

@@ -358,50 +358,180 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils
     }
   }
 
-  test("generate column-level statistics and load them from hive metastore") {
+  test("test refreshing table stats of cached data source table by `ANALYZE TABLE` statement") {
Member

Could you deduplicate the two test cases (refreshing table stats and refreshing column stats) by calling the same common function?

Contributor Author

@gatorsmile rebased and updated.

@@ -358,53 +358,189 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils
     }
   }
 
-  test("generate column-level statistics and load them from hive metastore") {
+  private def statsBeforeAfterUpdate(isAnalyzeTable: Boolean): (Statistics, Statistics) = {
Member

statsBeforeAfterUpdate -> getStatsBeforeAfterAnalyzeCommand

Member

ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS is also an ANALYZE TABLE statement, so the input parameter name is confusing. How about isAnalyzeTable -> isAnalyzeColumns?
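
To make the naming discussion concrete, a hedged sketch of the shared helper's shape; the table name, the key column, and the stats-reading helper are assumptions, and the table is assumed to be created by the caller:

// Hypothetical shape of the shared helper; the PR's actual code differs in detail.
private def getStatsBeforeAfterUpdate(isAnalyzeColumns: Boolean): (Statistics, Statistics) = {
  val tableName = "update_test_tbl"  // assumed table name with a single `key` column
  def analyze(): Unit = {
    if (isAnalyzeColumns) {
      sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS key")
    } else {
      sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
    }
  }
  sql(s"INSERT INTO $tableName SELECT 1")
  analyze()
  val statsBeforeUpdate = currentStats(tableName)  // assumed helper that reads back stats
  sql(s"INSERT INTO $tableName SELECT 2")
  analyze()
  val statsAfterUpdate = currentStats(tableName)   // assumed helper that reads back stats
  (statsBeforeUpdate, statsAfterUpdate)
}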


@gatorsmile
Member

cc @cloud-fan I don't have any more comments. Could you check this please? Thanks!

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66799 has finished for PR 15360 at commit 1e64163.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66807 has finished for PR 15360 at commit d93d082.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy wzhfy force-pushed the colStats2 branch 2 times, most recently from d782c14 to 30ac539 on October 14, 2016 02:23
@wzhfy
Contributor Author

wzhfy commented Oct 14, 2016

resolve conflicts

@SparkQA

SparkQA commented Oct 14, 2016

Test build #66935 has finished for PR 15360 at commit d782c14.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class AnalyzeColumnCommand(

@SparkQA

SparkQA commented Oct 14, 2016

Test build #66937 has finished for PR 15360 at commit 30ac539.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val (statsBeforeUpdate, statsAfterUpdate) = getStatsBeforeAfterUpdate(isAnalyzeColumns = false)

assert(statsBeforeUpdate.sizeInBytes > 0)
assert(statsBeforeUpdate.rowCount.contains(1))
Contributor

nit: we should not use Option as a collection; it's more explicit to write statsBeforeUpdate.rowCount == Some(1). BTW, Option.contains is not available in Scala 2.10.
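
The two styles side by side (rowCount is modeled here as Option[BigInt], matching Statistics.rowCount):

val rowCount: Option[BigInt] = Some(1)

// Scala 2.11+ only: treats the Option like a collection.
// rowCount.contains(1)

// Works on Scala 2.10 as well, and states the expected value explicitly.
assert(rowCount == Some(BigInt(1)))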

rsd = spark.sessionState.conf.ndvMaxError)
}

private def dataAndColStats(): (DataFrame, Seq[(StructField, ColumnStat)]) = {
Contributor

This method doesn't take any parameters, so its result is static. Can we just create two fields for them? e.g.

private lazy val testDataFrame = ...
private lazy val expectedStats = ...

Contributor Author

They share some common values, e.g. intSeq, stringSeq..., so I put them in a single method.

Contributor

then can we

private lazy val (testDataFrame, expectedStats) = {
  ...
}

Contributor Author

that's good, thanks!
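
A minimal sketch of this pattern with assumed sample values: the shared inputs are computed once inside the block, while the two names are still exposed as lazily initialized fields:

// Sketch only; in the suite the two values would be the test DataFrame and the expected column stats.
private lazy val (testData, expectedStats) = {
  val intSeq = Seq(1, 2, 3)              // shared sample values (assumed)
  val stringSeq = Seq("a", "bb", "ccc")  // shared sample values (assumed)
  val data = intSeq.zip(stringSeq)
  val stats = Map(
    "distinctInts" -> intSeq.distinct.size,
    "maxStringLength" -> stringSeq.map(_.length).max)
  (data, stats)
}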

@SparkQA

SparkQA commented Oct 14, 2016

Test build #66955 has finished for PR 15360 at commit 6cf23ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in 7486442 Oct 14, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016

Author: wangzhenhua <wangzhenhua@huawei.com>

Closes apache#15360 from wzhfy/colStats2.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017