
[SPARK-17073] [SQL] [FOLLOWUP] generate column-level statistics #15360


Closed
wzhfy wants to merge 2 commits into apache:master from wzhfy:colStats2

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Oct 5, 2016

What changes were proposed in this pull request?

This PR adds test cases for statistics: case-sensitive column names, non-ASCII column names, and refreshing tables; it also improves some documentation.

How was this patch tested?

Added test cases.
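
For reference, a hedged sketch of the kind of case-sensitivity test added here. The table name, data, and the way statistics are read back are illustrative assumptions rather than the PR's exact code, and the snippet assumes the usual Spark SQL test harness with sql, withTable, and withSQLConf in scope:

import org.apache.spark.sql.catalyst.TableIdentifier

// Illustrative sketch only -- names and assertions are assumptions, not the PR's code.
test("column-level statistics with case-sensitive column names (sketch)") {
  val tableName = "case_sensitive_tbl"  // hypothetical table name
  withSQLConf("spark.sql.caseSensitive" -> "true") {
    withTable(tableName) {
      sql(s"CREATE TABLE $tableName (`col` INT, `COL` DOUBLE) USING PARQUET")
      sql(s"INSERT INTO $tableName SELECT 1, 2.0")
      // Analyze only the lower-case column; with case sensitivity enabled,
      // the upper-case column should not receive statistics.
      sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS `col`")
      val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName))
      assert(table.stats.exists(_.colStats.keySet == Set("col")))
    }
  }
}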

@wzhfy
Contributor Author

wzhfy commented Oct 5, 2016

cc @cloud-fan @gatorsmile

@SparkQA

SparkQA commented Oct 5, 2016

Test build #66378 has finished for PR 15360 at commit 0ad7c88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thank you! Will review it tonight or tomorrow morning.

// non ascii characters are not allowed in the source code, so we disable the scalastyle.
val columnGroups: Seq[(String, String)] = Seq(("c1", "C1"), ("列c", "列C"))
// scalastyle:on
columnGroups.foreach { case (column1, column2) =>
Member

Could you create a separate function for the following checking logic? Then you can have two test cases without duplicated code.

@@ -62,7 +62,7 @@ case class AnalyzeColumnCommand(
     val statistics = Statistics(
       sizeInBytes = newTotalSize,
       rowCount = Some(rowCount),
-      colStats = columnStats ++ catalogTable.stats.map(_.colStats).getOrElse(Map()))
+      colStats = catalogTable.stats.map(_.colStats).getOrElse(Map()) ++ columnStats)
Member

Is this a bug exposed by the newly added test case?

Contributor Author

yes

Member

:) Improving the test case coverage is important.
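
For context: Scala's Map ++ is right-biased, so values from the right operand win on duplicate keys. A tiny illustration (not from the PR) of why the new operand order lets freshly computed column stats override previously stored ones:

// Right operand wins on key collisions.
val storedStats = Map("c1" -> "oldStat")
val computedStats = Map("c1" -> "newStat")
assert((storedStats ++ computedStats) == Map("c1" -> "newStat"))  // fixed order: new stats win
assert((computedStats ++ storedStats) == Map("c1" -> "oldStat"))  // old order: stored stats win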

@@ -90,8 +90,9 @@ case class AnalyzeColumnCommand(
         }
       }
       if (duplicatedColumns.nonEmpty) {
-        logWarning(s"Duplicated columns ${duplicatedColumns.mkString("(", ", ", ")")} detected " +
-          s"when analyzing columns ${columnNames.mkString("(", ", ", ")")}, ignoring them.")
+        logWarning("Duplicate column names were detected in `ANALYZE TABLE` statement. " +
Member

detected -> deduplicated

}
}

test("test refreshing statistics of cached data source table") {
Member

Please leave comments explaining which DDL commands trigger the refresh; otherwise, reviewers might be confused about what this test case is doing.

rsd = spark.sessionState.conf.ndvMaxError)

sql(s"INSERT INTO $tableName SELECT 2")
sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
Member

What is the purpose of this DDL?


sql(s"INSERT INTO $tableName SELECT 2")
sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS key")
Member

Both of the above DDL statements will call refreshTable with the same table name, right? If the source code removed either refreshTable call, the test case would still pass, right?

Contributor Author

Yeah, I'll split these two commands into two separate test cases, for table stats and column stats respectively.

@SparkQA

SparkQA commented Oct 7, 2016

Test build #66503 has finished for PR 15360 at commit e7979c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

We need a test case for Hive serde tables. So far, I still haven't found any test case that covers them.

@wzhfy
Contributor Author

wzhfy commented Oct 8, 2016

@gatorsmile Oh, I thought by "hive serde tables" you meant tables stored in the Hive metastore.
Let me create test cases for both data source tables and Hive serde tables in sql/hive.

@SparkQA

SparkQA commented Oct 8, 2016

Test build #66581 has finished for PR 15360 at commit 2ee4252.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Oct 8, 2016

retest this please

@SparkQA

SparkQA commented Oct 9, 2016

Test build #66586 has finished for PR 15360 at commit 2ee4252.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Will review this tonight. Thanks!

val column1 = columnName.toLowerCase
val column2 = columnName.toUpperCase
withSQLConf("spark.sql.caseSensitive" -> "true") {
sql(s"CREATE TABLE $tableName (`$column1` int, `$column2` double) USING PARQUET")
Member

We hit a bug here... It's not caused by your PR, but this test case exposes it. No need to worry about it; I will fix it.

Contributor Author

What bug? Please let me know when the bug-fix PR is sent. :)

Member

We should not attempt to create a Hive-compatible table in this case. It always fails because of column names.

Contributor

Does it cause any problems? The logic to create a Hive-compatible table is quite conservative: we try to save into the Hive metastore first, and if that fails, we fall back to the Spark-specific format.

Contributor Author

@wzhfy wzhfy Oct 14, 2016

It outputs a warning including an exception, but the test still completes successfully.

WARN org.apache.spark.sql.hive.HiveExternalCatalog: Could not persist `default`.`tbl` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name c1 in the table definition.
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
...
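
A hedged sketch of the conservative try-then-fall-back behavior described above; the real logic lives in HiveExternalCatalog and is more involved, so the function and parameter names below are assumptions:

import scala.util.control.NonFatal

// Hypothetical sketch, assuming a class that mixes in org.apache.spark.internal.Logging.
def persistTable(saveHiveCompatible: () => Unit, saveSparkSqlSpecific: () => Unit): Unit = {
  try {
    saveHiveCompatible()  // try the Hive-compatible representation first
  } catch {
    case NonFatal(e) =>
      // e.g. Hive rejects "Duplicate column name c1 in the table definition".
      logWarning("Could not persist the table in a Hive compatible way. " +
        "Persisting it into Hive metastore in Spark SQL specific format.", e)
      saveSparkSqlSpecific()  // fall back to the Spark SQL specific format
  }
}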

}
}

private def checkCaseSensitiveColStats(columnName: String): Unit = {
Member

Please add a comment to briefly explain the test case scenario. Thanks!

@@ -62,7 +62,7 @@ case class AnalyzeColumnCommand(
     val statistics = Statistics(
       sizeInBytes = newTotalSize,
       rowCount = Some(rowCount),
-      colStats = columnStats ++ catalogTable.stats.map(_.colStats).getOrElse(Map()))
+      colStats = catalogTable.stats.map(_.colStats).getOrElse(Map()) ++ columnStats)
Member

Could you leave a code comment here to emphasize it? I am just afraid this might be modified without notice. Newly computed stats should override the existing stats.

@@ -358,50 +358,180 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils
     }
   }
 
-  test("generate column-level statistics and load them from hive metastore") {
+  test("test refreshing table stats of cached data source table by `ANALYZE TABLE` statement") {
Member

Could you deduplicate the two test cases (refreshing table stats and refreshing column stats) by calling the same common function?

Contributor Author

@gatorsmile rebased and updated.

@@ -358,53 +358,189 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils
     }
   }
 
-  test("generate column-level statistics and load them from hive metastore") {
+  private def statsBeforeAfterUpdate(isAnalyzeTable: Boolean): (Statistics, Statistics) = {
Member

statsBeforeAfterUpdate -> getStatsBeforeAfterAnalyzeCommand

Member

ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS is also an ANALYZE TABLE statement, so the input parameter name is confusing. How about isAnalyzeTable -> isAnalyzeColumns?
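
To make the naming discussion concrete, a hedged sketch of the shared helper's shape; the table name, the key column, and the stats-reading helper are assumptions, and the table is assumed to be created by the caller:

// Hypothetical shape of the shared helper; the PR's actual code differs in detail.
private def getStatsBeforeAfterUpdate(isAnalyzeColumns: Boolean): (Statistics, Statistics) = {
  val tableName = "update_test_tbl"  // assumed table name with a single `key` column
  def analyze(): Unit = {
    if (isAnalyzeColumns) {
      sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS key")
    } else {
      sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
    }
  }
  sql(s"INSERT INTO $tableName SELECT 1")
  analyze()
  val statsBeforeUpdate = currentStats(tableName)  // assumed helper that reads back stats
  sql(s"INSERT INTO $tableName SELECT 2")
  analyze()
  val statsAfterUpdate = currentStats(tableName)   // assumed helper that reads back stats
  (statsBeforeUpdate, statsAfterUpdate)
}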


@gatorsmile
Member

cc @cloud-fan I don't have any more comments. Could you check this please? Thanks!

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66799 has finished for PR 15360 at commit 1e64163.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66807 has finished for PR 15360 at commit d93d082.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy wzhfy force-pushed the colStats2 branch 2 times, most recently from d782c14 to 30ac539 on October 14, 2016 02:23
@wzhfy
Contributor Author

wzhfy commented Oct 14, 2016

resolve conflicts

@SparkQA

SparkQA commented Oct 14, 2016

Test build #66935 has finished for PR 15360 at commit d782c14.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class AnalyzeColumnCommand(

@SparkQA

SparkQA commented Oct 14, 2016

Test build #66937 has finished for PR 15360 at commit 30ac539.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val (statsBeforeUpdate, statsAfterUpdate) = getStatsBeforeAfterUpdate(isAnalyzeColumns = false)

assert(statsBeforeUpdate.sizeInBytes > 0)
assert(statsBeforeUpdate.rowCount.contains(1))
Contributor

nit: we should not use Option as a collection; it's more explicit to write statsBeforeUpdate.rowCount == Some(1). BTW, Option.contains is not available in Scala 2.10.
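
The two styles side by side (rowCount is modeled here as Option[BigInt], matching Statistics.rowCount):

val rowCount: Option[BigInt] = Some(1)

// Scala 2.11+ only: treats the Option like a collection.
// rowCount.contains(1)

// Works on Scala 2.10 as well, and states the expected value explicitly.
assert(rowCount == Some(BigInt(1)))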

rsd = spark.sessionState.conf.ndvMaxError)
}

private def dataAndColStats(): (DataFrame, Seq[(StructField, ColumnStat)]) = {
Contributor

This method doesn't take any parameters, so its result is static. Can we just create two fields for them? e.g.

private lazy val testDataFrame = ...
private lazy val expectedStats = ...

Contributor Author

They share some common values, e.g. intSeq, stringSeq..., so I put them in a single method.

Contributor

then can we

private lazy val (testDataFrame, expectedStats) = {
  ...
}

Contributor Author

that's good, thanks!
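
A minimal sketch of this pattern with assumed sample values: the shared inputs are computed once inside the block, while the two names are still exposed as lazily initialized fields:

// Sketch only; in the suite the two values would be the test DataFrame and the expected column stats.
private lazy val (testData, expectedStats) = {
  val intSeq = Seq(1, 2, 3)              // shared sample values (assumed)
  val stringSeq = Seq("a", "bb", "ccc")  // shared sample values (assumed)
  val data = intSeq.zip(stringSeq)
  val stats = Map(
    "distinctInts" -> intSeq.distinct.size,
    "maxStringLength" -> stringSeq.map(_.length).max)
  (data, stats)
}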

@SparkQA

SparkQA commented Oct 14, 2016

Test build #66955 has finished for PR 15360 at commit 6cf23ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in 7486442 Oct 14, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016

Author: wangzhenhua <wangzhenhua@huawei.com>

Closes apache#15360 from wzhfy/colStats2.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017