
[SPARK-17073] [SQL] generate column-level statistics #15090


Closed
wants to merge 22 commits into from

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Sep 14, 2016

What changes were proposed in this pull request?

Generate basic column statistics for all the atomic types:

  • numeric types: max, min, num of nulls, ndv (number of distinct values)
  • date/timestamp types: they are also represented as numbers internally, so they have the same stats as above.
  • string: avg length, max length, num of nulls, ndv
  • binary: avg length, max length, num of nulls
  • boolean: num of nulls, num of trues, num of falses

Also support storing and loading these statistics.
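
A rough sketch of how these per-type statistics could be grouped into one structure (the field names mirror the BasicColStats case class discussed in the review below; treat this as an illustration, not the exact final shape):

    import org.apache.spark.sql.types.DataType

    // Sketch only: everything except numNulls is optional because only a subset
    // of the statistics applies to each data type.
    case class BasicColStats(
        dataType: DataType,
        numNulls: Long,
        max: Option[Any] = None,           // numeric/date/timestamp
        min: Option[Any] = None,           // numeric/date/timestamp
        ndv: Option[Long] = None,          // number of distinct values
        avgColLen: Option[Double] = None,  // string/binary: average length
        maxColLen: Option[Long] = None,    // string/binary: max length
        numTrues: Option[Long] = None,     // boolean
        numFalses: Option[Long] = None)    // boolean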

One thing to notice:
We support analyzing columns independently, e.g.:
sql1: ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key;
sql2: ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS value;
When running sql2 to collect column stats for value, we don't remove the stats of column key, which was analyzed in sql1 but not in sql2. As a result, users need to guarantee consistency between sql1 and sql2. If the table has changed between sql1 and sql2, users should re-analyze column key when they analyze column value:
ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key, value;

How was this patch tested?

add unit tests

@SparkQA

SparkQA commented Sep 14, 2016

Test build #65347 has finished for PR 15090 at commit 59ae3df.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BasicColStats(
    • case class AnalyzeColumnCommand(
    • trait StatsAggFunc

table_no_cols,
isDataSourceTable = true,
hasSizeInBytes = true,
expectedRowCounts = Some(10))
}
}

private def checkColStats(
Member

This test suite is getting bigger and bigger. Shall we create a new file for the column stats tests?

Contributor Author

I used checkTableStats in some cases for column stats, so maybe put all test cases for table/column stats into a separate file?

Member
@gatorsmile gatorsmile Sep 14, 2016

Maybe move these utility functions to org.apache.spark.sql.QueryTest? I am not 100% sure whether this is the best home, but we definitely need to split this suite into smaller suites.

Contributor Author

I'm planning to create a new suite for the column stats test cases, put all the common methods into a new trait called StatisticsTest, and let the two suites extend it. How about that?

Member

Sounds good to me.

@@ -98,8 +98,12 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
ctx.identifier != null &&
ctx.identifier.getText.toLowerCase == "noscan") {
AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier).toString)
} else {
} else if (ctx.identifierSeq() == null) {
Member

Since this PR changes the Parser, please update the comment of this function to reflect the latest changes.

In addition, please add test cases in DDLCommandSuite to verify the Parser's behavior.

Member

This has a bug. It will jump to this branch if users input:

ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS

Contributor Author
@wzhfy wzhfy Sep 14, 2016

For the analyze column command, users should know exactly what they want to do, so they need to specify the columns; otherwise, we don't compute statistics for columns. AFAIK, Hive will generate stats for all columns in this case, but I don't think we should do that. At least, we could provide another command like FOR ALL COLUMNS for that.

Member

Then, issue an exception here.

Contributor Author

Yeah, I'm also thinking of doing this :)

Contributor Author

@gatorsmile The tests for parsing analyze commands are in StatisticsSuite, so I'll add the verification tests there instead of in DDLCommandSuite.

Contributor Author
@wzhfy wzhfy Sep 18, 2016

As mentioned in the comment, we are going to change the "ANALYZE" syntax in SqlBase.g4, i.e. make the identifierSeq non-optional, which is different from Hive. Is this ok? @rxin @hvanhovell

@gatorsmile
Member

gatorsmile commented Sep 14, 2016

Like Hive, I think we should implement a built-in function, compute_stats. Then, the implementation of AnalyzeColumnCommand will be much cleaner. cc @hvanhovell @cloud-fan

@SparkQA

SparkQA commented Sep 14, 2016

Test build #65349 has finished for PR 15090 at commit 027bdcc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* which will be used in query optimizations.
*/
case class AnalyzeColumnCommand(
tableName: String,
Contributor

Please use a TableIdentifier here.

Member

Looks like AnalyzeTableCommand is also using String instead of TableIdentifier. Do you want to change that as well?

Contributor Author

@viirya Yeah, I'll fix that too.


def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit = {
val lowerCaseNames = columnNames.map(_.toLowerCase)
val attributes =
Contributor

It is nicer to resolve them using the available resolution functionality in LogicalPlan.
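
For illustration, a minimal sketch of that approach (assuming relation is the resolved LogicalPlan of the table and columnNames comes from the command; the helper names are illustrative, not the exact code in this patch):

    import org.apache.spark.sql.AnalysisException

    // Let the plan's own resolution handle case sensitivity via the session's
    // Resolver, instead of lower-casing the column names manually.
    val resolver = sparkSession.sessionState.analyzer.resolver
    val attributes = columnNames.map { col =>
      relation.resolveQuoted(col, resolver).getOrElse(
        throw new AnalysisException(s"Invalid column name: $col"))
    }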

Contributor Author
@wzhfy wzhfy Sep 15, 2016

Yeah, then we can let the resolve function deal with case sensitivity. Thanks.


case class BasicColStats(
dataType: DataType,
numNulls: Long,
Member

Any reason why numNulls is the only one not wrapped in Option here?

Member

nvm. I got it.

s"${otherRelation.nodeName}.")
}

def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit = {
Member

private?

Contributor Author

It's defined inside def run.

Member

Can this then be defined outside? I know this is a matter of personal taste, but it has caused confusion for at least two of the reviewers here.

Contributor Author

Then we will need to pass many parameters to this method.

}
}

object ColumnStats extends Enumeration {
Member

private[sql]?

Member
@HyukjinKwon HyukjinKwon Sep 16, 2016

I guess it is not encouraged to add private[sql] in this case. Please see 511f52f.

Member

I would think it is more like the case of SparkPlanInfo?

Member

Anyway, no strong opinion on this.

@SparkQA

SparkQA commented Sep 16, 2016

Test build #65475 has finished for PR 15090 at commit 761a9e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AnalyzeTableCommand(

}.toMap

val statistics =
Statistics(sizeInBytes = newTotalSize, rowCount = Some(rowCount), basicColStats = colStats)
Member

One question: it seems we overwrite all existing column statistics, no matter whether we collected statistics for other columns before. Would it be better to replace only the statistics of the columns being analyzed and keep the statistics of the other columns? @wzhfy What do you think?

Contributor

I think we should document the policy for (partial) statistics updates very well. It is important that an end user understands what is going on. The current policy is to drop them as soon as we touch the table stats.

In this case I think it is important to consider a few elements:

  1. Relative column statistics typically change less than table statistics, so they do not need to be changed as often.
  2. When do we consider column statistics to be stale (less trustworthy)? If the size of the table has changed by x%? If the statistics are older than x time?

Contributor Author

I think this is an open question. IMO, for a computing framework like Spark, it is difficult to decide when to recollect the stats, because that depends on how often and how the data changes, i.e. how customers use Spark. This can be quite different across applications. So I think we should leave the decision to users, or someone like DBAs who know their applications well, and let them decide when to run the "ANALYZE" commands and refresh the stats.

Contributor

I am just not so sure that we should follow an all-or-nothing collection approach. Currently ANALYZE TABLE discards column-level statistics, and ANALYZE COLUMN discards statistics on any column that has not been included in the given analyze command.

In the databases I have worked with, you could typically refresh column stats independently. I would be a big proponent of doing so.

Contributor Author

@hvanhovell Agreed. If we want to support refreshing column stats independently, we should do as @viirya said. But the consistency between the kept column stats and the refreshed column stats still needs to be guaranteed by users.

Contributor

For the "all or nothing" statistics approach, we can better maintain statistics consistency. However, from the viewpoint of usability, a user may want to collect column statistics for column A and column B of a given table first. After a while, he may want to collect column statistics for column C and column D because of the need for new queries. A user may intuitively specify only columns C and D in the new ANALZYE command because he expects the statistics of column A and B to be there as well. Hence, refreshing column stats independently can better match a user's expectation.

@wzhfy
Contributor Author

wzhfy commented Sep 16, 2016

Yes, we can keep the stats of other columns, but we can't be sure they are still correct between two analyze column commands unless we recollect them. It's the same consistency issue again, like between analyze table and analyze column.

@SparkQA

SparkQA commented Sep 16, 2016

Test build #65477 has finished for PR 15090 at commit 9cdc722.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor
@hvanhovell hvanhovell left a comment

Looks pretty solid. I have left some more comments.

avgColLen = if (row.isNullAt(4)) None else Some(row.getDouble(4)),
maxColLen = if (row.isNullAt(5)) None else Some(row.getLong(5)),
numTrues = if (row.isNullAt(6)) None else Some(row.getLong(6)),
numFalses = if (row.isNullAt(7)) None else Some(row.getLong(7) - row.getLong(0)))
Contributor

It might be easier to write the source expression as Sum(If(Not(e), one, zero)).
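
A small sketch of that suggestion, where one and zero stand for Literal(1L) and Literal(0L) as in the boolean aggregation sequence quoted further down:

    import org.apache.spark.sql.catalyst.expressions.{Expression, If, Literal, Not}
    import org.apache.spark.sql.catalyst.expressions.aggregate.Sum

    // Count trues and falses directly in the aggregation instead of deriving
    // numFalses from other counters afterwards.
    def booleanCounters(e: Expression): (Expression, Expression) = {
      val one = Literal(1L)
      val zero = Literal(0L)
      (Sum(If(e, one, zero)), Sum(If(Not(e), one, zero)))  // (numTrues, numFalses)
    }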


def apply(e: Expression, relativeSD: Double): CreateStruct = {
var statistics = e.dataType match {
case n: NumericType =>
Contributor

Why not combine NumericType with TimestampType & DateType? e.g. _: NumericType | TimestampType | DateType?

Contributor Author

This will cause a compilation error: illegal variable in pattern alternative.

Member

@wzhfy Hm, are you sure that causes a compilation error? I just checked out your PR and it seems to compile fine.

      case _: NumericType | TimestampType | DateType =>
        Seq(Max(e), Min(e), HyperLogLogPlusPlus(e, relativeSD), nullDouble, nullLong, nullLong,
          nullLong)

If you keep the n variable it will cause a compilation error, but if we don't bind it, it's fine.

Contributor Author

Oh, I made a mistake, thanks! @HyukjinKwon

Seq(nullBoolean, nullBoolean, two, nullDouble, nullLong, Sum(If(e, one, zero)),
Sum(If(e, zero, one)))
case otherType =>
throw new AnalysisException("ANALYZE command is not supported for data type: " +
Contributor

This can be a bit clearer for the end user. Please add the name of the column & mention that you are analyzing columns.

BasicColStats(
dataType = e.dataType,
numNulls = row.getLong(0),
max = if (row.isNullAt(1)) None else Some(row.get(1, e.dataType)),
Contributor

It might be nicer to add a helper method here.

@@ -32,67 +34,25 @@ import org.apache.spark.sql.execution.datasources.LogicalRelation
* Analyzes the given table in the current database to generate statistics, which will be
* used in query optimizations.
*/
case class AnalyzeTableCommand(tableName: String, noscan: Boolean = true) extends RunnableCommand {
case class AnalyzeTableCommand(
tableIdent: TableIdentifier,
Contributor

+1

@@ -563,6 +563,13 @@ object SQLConf {
.timeConf(TimeUnit.MILLISECONDS)
.createWithDefault(10L)

val NDV_MAX_ERROR =
SQLConfigBuilder("spark.sql.ndv.maxError")
Contributor

Could you place this under spark.sql.statistics?

Contributor Author

OK

assert(colStats.min == expectedColStats.min)
if (expectedColStats.ndv.isDefined) {
// ndv is an approximate value, so we just make sure we have the value
assert(colStats.ndv.get >= 0)
Contributor

You can also check whether the value is within 3 standard deviations.

Contributor Author

How do we get the standard deviation?

Member

It should be:

  1. Estimate the column statistics several times.
  2. Calculate the mean and standard deviation of ndv across the multiple estimates.
  3. Check if the real ndv value is within (mean - 3 * sd, mean + 3 * sd).

Please correct me if I am wrong.

Contributor Author
@wzhfy wzhfy Sep 19, 2016

Why 3 standard deviations?
I think for the tests in this suite, we just need to make sure we get the stats; we should leave the accuracy tests to HyperLogLogPlusPlusSuite, right?

Member
@viirya viirya Sep 19, 2016

assert(colStats.ndv.get >= 0) might be too loose. Can we check that it is greater than or equal to 0 and less than or equal to the number of values in the column?

Member

Because the check here for the estimated ndv value is not really related to the actual ndv value, we can't be sure we really got the estimated ndv back, especially since we only check ndv > 0. In an extreme case, if we replaced the ndv expression with a literal(0) in the aggregation, the test would not catch it.

Member

Of course my previous suggestion 0 <= ndv <= count(col) also fails by this standard. Using the sd to check it is more reliable. I am not strongly against this, but it should be better. Let's see what @hvanhovell and @cloud-fan think?

Contributor Author
@wzhfy wzhfy Sep 19, 2016

I think I may know what @hvanhovell meant. The relative standard deviation (rsd) is the parameter we pass to HyperLogLogPlusPlus, so the code may look like this:

        // ndv is an approximate value, so we make sure we have the value, and it should be
        // within 3*SD's of the given rsd.
        assert(colStats.ndv.get >= 0)
        if (expectedColStats.ndv.get == 0) {
          assert(colStats.ndv.get == 0)
        } else if (expectedColStats.ndv.get > 0) {
          val rsd = spark.sessionState.conf.ndvMaxError
          val error = math.abs((colStats.ndv.get / expectedColStats.ndv.get.toDouble) - 1.0d)
          assert(error <= rsd * 3.0d, "Error should be within 3 std. errors.")
        }

What do you think?

Contributor

I just think it is better to test this a bit more thoroughly. The HLL++ tests use a similar test.

Contributor
@hvanhovell hvanhovell Sep 19, 2016

BTW: sorry for the very cryptic initial comment.

@@ -401,7 +401,10 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
var statsProperties: Map[String, String] =
Map(STATISTICS_TOTAL_SIZE -> stats.sizeInBytes.toString())
if (stats.rowCount.isDefined) {
statsProperties += (STATISTICS_NUM_ROWS -> stats.rowCount.get.toString())
statsProperties += STATISTICS_NUM_ROWS -> stats.rowCount.get.toString()
Contributor

General question: Is this how Hive stores this?

Contributor Author

Yes, we just give it a different name in Spark.

*/
override def visitAnalyze(ctx: AnalyzeContext): LogicalPlan = withOrigin(ctx) {
if (ctx.partitionSpec == null &&
ctx.identifier != null &&
ctx.identifier.getText.toLowerCase == "noscan") {
AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier).toString)
AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier))
} else if (ctx.identifierSeq() == null) {
Contributor

Why not make the identifierSeq non-optional in the grammar? It saves a lot of typing :)

Contributor Author

Yeah, good idea

StructField(name = "c3", dataType = IntegerType, nullable = true) ::
StructField(name = "c4", dataType = LongType, nullable = true) :: Nil)
val expectedBasicStats = BasicColStats(
dataType = ByteType, numNulls = 2, max = Some(3), min = Some(1), ndv = Some(3))
Contributor

It is weird that this works. The Min and the Max value should not be equal for Short and Long types.

Contributor Author

Can you explain more about this?

@rxin
Contributor

rxin commented Sep 17, 2016

What's "basic" about? Are we going to have something that's not basic in the future (e.g. histogram)? If yes, should those go into a separate class or just in ColumnStats?

@wzhfy
Contributor Author

wzhfy commented Sep 17, 2016

@rxin Yeah, I think it's better to move histograms into ColumnStats than to maintain two members like BasicColStats and Histograms. Let me rename BasicColStats as ColumnStats so that all the column-level stats go into this structure.

}
}

case class BasicColStats(
Member

Add a section of comments explaining the meaning of each statistic? E.g., readers may not know what ndv means.

Contributor Author

@viirya thanks for the advice.

case o => o
}
assert(operators.size == 1)
if (operators.head.getClass != c) {
Contributor

I don't think this is good style; parsing the SQL string, constructing an expected logical plan, and then comparing the two looks much clearer.
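
A sketch of that test style, using the existing parser and plan-comparison test helpers (names here are illustrative):

    // Parse the SQL string and compare it against an explicitly constructed plan.
    val parsed = spark.sessionState.sqlParser.parsePlan(
      "ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS key, value")
    val expected = AnalyzeColumnCommand(TableIdentifier("t"), Seq("key", "value"))
    comparePlans(parsed, expected)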

Contributor Author

OK, then I'll fix it.

@SparkQA

SparkQA commented Sep 30, 2016

Test build #66131 has finished for PR 15090 at commit 3335af6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 30, 2016

Test build #66150 has finished for PR 15090 at commit 06819dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 30, 2016

Test build #66188 has finished for PR 15090 at commit 95c2d2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val relation = spark.sessionState.catalog.lookupRelation(tableIdent)
val columnStats =
AnalyzeColumnCommand(tableIdent, columns.map(_.name)).computeColStats(spark, relation)._2
expectedColStatsSeq.foreach { expected =>
Contributor

How about ...foreach { case (field, expectedStat) => ? Then we can use field.name instead of expected._1.name.

Contributor Author

Yeah, that's better.

*/
case class ColumnStat(statRow: InternalRow) {

def forNumeric[T <: AtomicType](dataType: T): NumericColumnStat[T] = {
Contributor

We have NumericType; can we use that?

Contributor Author
@wzhfy wzhfy Oct 1, 2016

But we also need to include DateType and TimestampType, so I use AtomicType.

dataType = IntegerType,
colStat = colStat,
expectedColStat = ColumnStat(InternalRow.fromSeq(
Seq(0L, values.max, values.min, values.distinct.length.toLong))),
Contributor

I think we should write InternalRow(0L, values.max, ...) directly

df.write.format("json").saveAsTable(tmpTable)

sql(s"CREATE TABLE $table (c1 int) USING PARQUET")
sql(s"INSERT INTO $table SELECT * FROM $tmpTable")
Contributor

We can just write INSERT INTO $table SELECT 1, so we don't need the temp table.

val totalSize = BigInt(catalogTable.properties.get(STATISTICS_TOTAL_SIZE).get)
// TODO: we will compute "estimatedSize" when we have column stats:
// average size of row * number of rows
if (catalogTable.properties.filterKeys(_.startsWith(STATISTICS_PREFIX)).nonEmpty) {
Contributor

Should we make catalogTable.properties.filterKeys(_.startsWith(STATISTICS_PREFIX)) a variable? It's used twice.

@cloud-fan
Contributor

LGTM except for some minor comments; thanks for working on it!

if (!attributesToAnalyze.contains(expr)) {
attributesToAnalyze += expr
} else {
logWarning(s"Duplicated column: $col")
Member

You need to explain the context. Otherwise, the log message becomes useless.

val tableIdent = TableIdentifier(table, Some("default"))
val relation = spark.sessionState.catalog.lookupRelation(tableIdent)
val columnStats =
AnalyzeColumnCommand(tableIdent, columns.map(_.name)).computeColStats(spark, relation)._2
Member
@gatorsmile gatorsmile Oct 1, 2016

Nit:

val (_, columnStats) =
  AnalyzeColumnCommand(tableIdent, columnsToAnalyze).computeColStats(spark, relation)

val tableIdent = TableIdentifier(table, Some("default"))
val relation = spark.sessionState.catalog.lookupRelation(tableIdent)
val columnStats =
AnalyzeColumnCommand(tableIdent, columnsToAnalyze).computeColStats(spark, relation)._2
Member

The same here.

@SparkQA

SparkQA commented Oct 1, 2016

Test build #66226 has finished for PR 15090 at commit 734abad.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Oct 1, 2016

retest this please

Seq(numNulls(e), avgLength(e), maxLength(e))
}

def booleanColumnStat(e: Expression): Seq[Expression] = {
Member

All the above 13 functions should be private, right?

}
}

def apply(e: Attribute, relativeSD: Double): CreateStruct = e.dataType match {
Member

Nit: e -> attr

@SparkQA

SparkQA commented Oct 1, 2016

Test build #66227 has finished for PR 15090 at commit 734abad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
if (duplicatedColumns.nonEmpty) {
logWarning(s"Duplicated columns ${duplicatedColumns.mkString("(", ", ", ")")} detected " +
s"when analyzing columns ${columnNames.mkString("(", ", ", ")")}, ignoring them.")
Member

How about this?

      logWarning("A duplicate column name was detected in `ANALYZE TABLE` statement. " +
        s"Input columns: ${columnNames.mkString("(", ", ", ")")}. " +
        s"Duplicate columns: ${duplicatedColumns.mkString("(", ", ", ")")}.")

@gatorsmile
Member

gatorsmile commented Oct 2, 2016

Could you add a positive test case with case sensitivity turned on? The scenario is like:

    withTable(table) {
      withSQLConf("spark.sql.caseSensitive" -> "true") {
        sql(s"CREATE TABLE $table (c1 int, C1 double) STORED AS PARQUET")
        sql(s"INSERT INTO $table SELECT 1, 3.0")
        sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS c1, C1")
        ...
      }
    }

Note: the above is just the SQL statements; we still need the verification logic.

This is to verify whether the case of column names is preserved in ANALYZE COLUMN.
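
One possible shape for that verification, reusing the computeColStats pattern from the test code quoted earlier (a sketch only; the exact assertions depend on the final test helpers):

    // With case sensitivity on, "c1" and "C1" must resolve to two different columns,
    // each with its own stats entry under the case-preserved name.
    val tableIdent = TableIdentifier(table, Some("default"))
    val relation = spark.sessionState.catalog.lookupRelation(tableIdent)
    val (_, columnStats) =
      AnalyzeColumnCommand(tableIdent, Seq("c1", "C1")).computeColStats(spark, relation)
    assert(columnStats.size == 2)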

@gatorsmile
Member

gatorsmile commented Oct 2, 2016

Another test case for Unicode column names in ANALYZE COLUMN:

    // scalastyle:off
    // non ascii characters are not allowed in the source code, so we disable the scalastyle.
    val colName1 = "`列1`"
    val colName2 = "`列2`"
    // scalastyle:on
    withTable(table) {
      sql(s"CREATE TABLE $table ($colName1 int, $colName2 double) STORED AS PARQUET")
      sql(s"INSERT INTO $table SELECT 1, 3.0")
      sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS $colName2, $colName1")
      ...
    }

@gatorsmile
Member

gatorsmile commented Oct 2, 2016

LGTM except the above minor comments.

The test cases mentioned above need to be added to sql/hive/, since the correctness could be affected by the behavior of the Hive metastore. So far, your test cases have not covered Hive serde tables, so the above two scenarios create Hive-serde tables.

If possible, also add a test case for refreshTable in sql/hive. That is also not verified.

@viirya
Member

viirya commented Oct 3, 2016

LGTM except for one minor comment regarding the ndv config documentation.

@rxin
Contributor

rxin commented Oct 3, 2016

Alright, looking at the outstanding comments, it looks like none of the issues are structural anymore. I'm going to merge this in order to speed up the work. @wzhfy please make sure you fix the remaining issues in follow-up pull requests.

Merging into master.

@asfgit asfgit closed this in 7bf9212 Oct 3, 2016
@wzhfy
Contributor Author

wzhfy commented Oct 4, 2016

@rxin Thanks, I'll fix them in the follow-up PR.

@wzhfy
Contributor Author

wzhfy commented Oct 5, 2016

@gatorsmile Hive tables don't support case-sensitive column names, so I used data source tables in the added test cases. The PR has already been sent: #15360.
