
[SPARK-17073] [SQL] generate column-level statistics #15090


Closed
wants to merge 22 commits into from

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Sep 14, 2016

What changes were proposed in this pull request?

Generate basic column statistics for all the atomic types:

  • numeric types: max, min, num of nulls, ndv (number of distinct values)
  • date/timestamp types: they are also represented as numbers internally, so they have the same stats as above.
  • string: avg length, max length, num of nulls, ndv
  • binary: avg length, max length, num of nulls
  • boolean: num of nulls, num of trues, num of falses

Also support storing and loading these statistics.
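
A rough sketch of how these per-type statistics could be grouped into one structure (the field names mirror the BasicColStats case class discussed in the review below; treat this as an illustration, not the exact final shape):

    import org.apache.spark.sql.types.DataType

    // Sketch only: everything except numNulls is optional because only a subset
    // of the statistics applies to each data type.
    case class BasicColStats(
        dataType: DataType,
        numNulls: Long,
        max: Option[Any] = None,           // numeric/date/timestamp
        min: Option[Any] = None,           // numeric/date/timestamp
        ndv: Option[Long] = None,          // number of distinct values
        avgColLen: Option[Double] = None,  // string/binary: average length
        maxColLen: Option[Long] = None,    // string/binary: max length
        numTrues: Option[Long] = None,     // boolean
        numFalses: Option[Long] = None)    // boolean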

One thing to notice:
We support analyzing columns independently, e.g.:
sql1: ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key;
sql2: ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS value;
When running sql2 to collect column stats for value, we don't remove the stats of column key, which was analyzed in sql1 but not in sql2. As a result, users need to guarantee consistency between sql1 and sql2. If the table has changed between sql1 and sql2, users should re-analyze column key when they analyze column value:
ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key, value;

How was this patch tested?

add unit tests

@SparkQA

SparkQA commented Sep 14, 2016

Test build #65347 has finished for PR 15090 at commit 59ae3df.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BasicColStats(
    • case class AnalyzeColumnCommand(
    • trait StatsAggFunc

table_no_cols,
isDataSourceTable = true,
hasSizeInBytes = true,
expectedRowCounts = Some(10))
}
}

private def checkColStats(
Member

This test suite is getting bigger and bigger. Shall we create a new file for the column stats tests?

Contributor Author

I used checkTableStats in some cases for column stats, so maybe put all test cases for table/column stats into a separate file?

Member
@gatorsmile gatorsmile Sep 14, 2016

Maybe move these utility functions to org.apache.spark.sql.QueryTest? I am not 100% sure whether this is the best home, but we definitely need to split this suite into smaller suites.

Contributor Author

I'm planning to create a new suite for the column stats test cases, put all the common methods into a new trait called StatisticsTest, and let the two suites extend it. How about that?

Member

Sounds good to me.

@@ -98,8 +98,12 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
ctx.identifier != null &&
ctx.identifier.getText.toLowerCase == "noscan") {
AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier).toString)
} else {
} else if (ctx.identifierSeq() == null) {
Member

Since this PR changes the Parser, please update the comment of this function to reflect the latest changes.

In addition, please add test cases in DDLCommandSuite to verify the Parser's behavior.

Member

This has a bug. It will jump to this branch if users input:

ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS

Contributor Author
@wzhfy wzhfy Sep 14, 2016

For the analyze column command, users should know exactly what they want to do, so they need to specify the columns; otherwise, we don't compute statistics for columns. AFAIK, Hive will generate stats for all columns in this case, but I don't think we should do that. At least, we could provide another command like FOR ALL COLUMNS for that.

Member

Then, issue an exception here.

Contributor Author

Yeah, I'm also thinking of doing this :)

Contributor Author

@gatorsmile The tests for parsing analyze commands are in StatisticsSuite, so I'll add the verification tests there instead of in DDLCommandSuite.

Contributor Author
@wzhfy wzhfy Sep 18, 2016

As mentioned in the comment, we are going to change the "ANALYZE" syntax in SqlBase.g4, i.e. make the identifierSeq non-optional, which is different from Hive. Is this ok? @rxin @hvanhovell

@gatorsmile
Member

gatorsmile commented Sep 14, 2016

Like Hive, I think we should implement a built-in function, compute_stats. Then, the implementation of AnalyzeColumnCommand will be much cleaner. cc @hvanhovell @cloud-fan

@SparkQA

SparkQA commented Sep 14, 2016

Test build #65349 has finished for PR 15090 at commit 027bdcc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* which will be used in query optimizations.
*/
case class AnalyzeColumnCommand(
tableName: String,
Contributor

Please use a TableIdentifier here.

Member

Looks like AnalyzeTableCommand is also using String instead of TableIdentifier. Do you want to change that as well?

Contributor Author

@viirya Yeah, I'll fix that too.


def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit = {
val lowerCaseNames = columnNames.map(_.toLowerCase)
val attributes =
Contributor

It is nicer to resolve them using the available resolution functionality in LogicalPlan.
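
For illustration, a minimal sketch of that approach (assuming relation is the resolved LogicalPlan of the table and columnNames comes from the command; the helper names are illustrative, not the exact code in this patch):

    import org.apache.spark.sql.AnalysisException

    // Let the plan's own resolution handle case sensitivity via the session's
    // Resolver, instead of lower-casing the column names manually.
    val resolver = sparkSession.sessionState.analyzer.resolver
    val attributes = columnNames.map { col =>
      relation.resolveQuoted(col, resolver).getOrElse(
        throw new AnalysisException(s"Invalid column name: $col"))
    }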

Contributor Author
@wzhfy wzhfy Sep 15, 2016

Yeah, then we can let the resolve function deal with case sensitivity. Thanks.


case class BasicColStats(
dataType: DataType,
numNulls: Long,
Member

Any reason why numNulls is the only one not wrapped in Option here?

Member

nvm. I got it.

s"${otherRelation.nodeName}.")
}

def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit = {
Member

private?

Contributor Author

It's defined inside def run.

Member

Can this then be defined outside? I know this is a matter of personal taste, but it has caused confusion for at least two of the reviewers here.

Contributor Author

Then we will need to pass many parameters to this method.

}
}

object ColumnStats extends Enumeration {
Member

private[sql]?

Member
@HyukjinKwon HyukjinKwon Sep 16, 2016

I guess it is not encouraged to add private[sql] in this case. Please see 511f52f.

Member

I would think it is more like the case of SparkPlanInfo?

Member

Anyway, no strong opinion on this.

@SparkQA

SparkQA commented Sep 16, 2016

Test build #65475 has finished for PR 15090 at commit 761a9e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AnalyzeTableCommand(

}.toMap

val statistics =
Statistics(sizeInBytes = newTotalSize, rowCount = Some(rowCount), basicColStats = colStats)
Member

One question: it seems we overwrite all existing column statistics, no matter whether we collected statistics for other columns before. Would it be better to replace only the statistics of the columns being analyzed and keep the statistics of the other columns? @wzhfy What do you think?

Contributor

I think we should document the policy for (partial) statistics updates very well. It is important that an end user understands what is going on. The current policy is to drop them as soon as we touch the table stats.

In this case I think it is important to consider a few elements:

  1. Relative column statistics typically change less than table statistics, so they do not need to be changed as often.
  2. When do we consider column statistics to be stale (less trustworthy)? If the size of the table has changed by x%? If the statistics are older than x time?

Contributor Author

I think this is an open question. IMO, for a computing framework like Spark, it is difficult to decide when to recollect the stats, because that depends on how often and how the data changes, i.e. how customers use Spark. This can be quite different across applications. So I think we should leave the decision to users, or someone like DBAs who know their applications well, and let them decide when to run the "ANALYZE" commands and refresh the stats.

Contributor

I am just not so sure that we should follow an all-or-nothing collection approach. Currently ANALYZE TABLE discards column-level statistics, and ANALYZE COLUMN discards statistics on any column that has not been included in the given analyze command.

In the databases I have worked with, you could typically refresh column stats independently. I would be a big proponent of doing so.

Contributor Author

@hvanhovell Agreed. If we want to support refreshing column stats independently, we should do as @viirya said. But the consistency between the kept column stats and the refreshed column stats still needs to be guaranteed by users.

Contributor

For the "all or nothing" statistics approach, we can better maintain statistics consistency. However, from the viewpoint of usability, a user may want to collect column statistics for column A and column B of a given table first. After a while, he may want to collect column statistics for column C and column D because of the need for new queries. A user may intuitively specify only columns C and D in the new ANALZYE command because he expects the statistics of column A and B to be there as well. Hence, refreshing column stats independently can better match a user's expectation.

@wzhfy
Contributor Author

wzhfy commented Sep 16, 2016

Yes, we can keep the stats of other columns, but we can't be sure they are still correct between two analyze column commands unless we recollect them. It's the same consistency issue again, like between analyze table and analyze column.

@SparkQA

SparkQA commented Sep 16, 2016

Test build #65477 has finished for PR 15090 at commit 9cdc722.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor
@hvanhovell hvanhovell left a comment

Looks pretty solid. I have left some more comments.

avgColLen = if (row.isNullAt(4)) None else Some(row.getDouble(4)),
maxColLen = if (row.isNullAt(5)) None else Some(row.getLong(5)),
numTrues = if (row.isNullAt(6)) None else Some(row.getLong(6)),
numFalses = if (row.isNullAt(7)) None else Some(row.getLong(7) - row.getLong(0)))
Contributor

It might be easier to write the source expression as Sum(If(Not(e), one, zero)).
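
A small sketch of that suggestion, where one and zero stand for Literal(1L) and Literal(0L) as in the boolean aggregation sequence quoted further down:

    import org.apache.spark.sql.catalyst.expressions.{Expression, If, Literal, Not}
    import org.apache.spark.sql.catalyst.expressions.aggregate.Sum

    // Count trues and falses directly in the aggregation instead of deriving
    // numFalses from other counters afterwards.
    def booleanCounters(e: Expression): (Expression, Expression) = {
      val one = Literal(1L)
      val zero = Literal(0L)
      (Sum(If(e, one, zero)), Sum(If(Not(e), one, zero)))  // (numTrues, numFalses)
    }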


def apply(e: Expression, relativeSD: Double): CreateStruct = {
var statistics = e.dataType match {
case n: NumericType =>
Contributor

Why not combine NumericType with TimestampType & DateType? e.g. _: NumericType | TimestampType | DateType?

Contributor Author

This will cause a compilation error: illegal variable in pattern alternative.

Member

@wzhfy Hm, are you sure that causes a compilation error? I just checked out your PR and it seems to compile fine.

      case _: NumericType | TimestampType | DateType =>
        Seq(Max(e), Min(e), HyperLogLogPlusPlus(e, relativeSD), nullDouble, nullLong, nullLong,
          nullLong)

If you keep the n variable it will cause a compilation error, but if we don't bind it, it's fine.

Contributor Author

Oh, I made a mistake, thanks! @HyukjinKwon

Seq(nullBoolean, nullBoolean, two, nullDouble, nullLong, Sum(If(e, one, zero)),
Sum(If(e, zero, one)))
case otherType =>
throw new AnalysisException("ANALYZE command is not supported for data type: " +
Contributor

This can be a bit clearer for the end user. Please add the name of the column & mention that you are analyzing columns.

BasicColStats(
dataType = e.dataType,
numNulls = row.getLong(0),
max = if (row.isNullAt(1)) None else Some(row.get(1, e.dataType)),
Contributor

It might be nicer to add a helper method here.

@@ -32,67 +34,25 @@ import org.apache.spark.sql.execution.datasources.LogicalRelation
* Analyzes the given table in the current database to generate statistics, which will be
* used in query optimizations.
*/
case class AnalyzeTableCommand(tableName: String, noscan: Boolean = true) extends RunnableCommand {
case class AnalyzeTableCommand(
tableIdent: TableIdentifier,
Contributor

+1

@@ -563,6 +563,13 @@ object SQLConf {
.timeConf(TimeUnit.MILLISECONDS)
.createWithDefault(10L)

val NDV_MAX_ERROR =
SQLConfigBuilder("spark.sql.ndv.maxError")
Contributor

Could you place this under spark.sql.statistics?

Contributor Author

OK

assert(colStats.min == expectedColStats.min)
if (expectedColStats.ndv.isDefined) {
// ndv is an approximate value, so we just make sure we have the value
assert(colStats.ndv.get >= 0)
Contributor

You can also check whether the value is within 3 standard deviations.

Contributor Author

How do we get the standard deviation?

Member

It should be:

  1. Estimate the column statistics several times.
  2. Calculate the mean and standard deviation of ndv across the multiple estimates.
  3. Check if the real ndv value is within (mean - 3 * sd, mean + 3 * sd).

Please correct me if I am wrong.

Contributor Author
@wzhfy wzhfy Sep 19, 2016

Why 3 standard deviations?
I think for the tests in this suite, we just need to make sure we get the stats; we should leave the accuracy tests to HyperLogLogPlusPlusSuite, right?

Member
@viirya viirya Sep 19, 2016

assert(colStats.ndv.get >= 0) might be too loose. Can we check that it is greater than or equal to 0 and less than or equal to the number of values in the column?

Member

Because the check here for the estimated ndv value is not really related to the actual ndv value, we can't be sure we really got the estimated ndv back, especially since we only check ndv > 0. In an extreme case, if we replaced the ndv expression with a literal(0) in the aggregation, the test would not catch it.

Member

Of course my previous suggestion 0 <= ndv <= count(col) also fails by this standard. Using the sd to check it is more reliable. I am not strongly against this, but it should be better. Let's see what @hvanhovell and @cloud-fan think?

Contributor Author
@wzhfy wzhfy Sep 19, 2016

I think I may know what @hvanhovell meant. The relative standard deviation (rsd) is the parameter we pass to HyperLogLogPlusPlus, so the code may look like this:

        // ndv is an approximate value, so we make sure we have the value, and it should be
        // within 3*SD's of the given rsd.
        assert(colStats.ndv.get >= 0)
        if (expectedColStats.ndv.get == 0) {
          assert(colStats.ndv.get == 0)
        } else if (expectedColStats.ndv.get > 0) {
          val rsd = spark.sessionState.conf.ndvMaxError
          val error = math.abs((colStats.ndv.get / expectedColStats.ndv.get.toDouble) - 1.0d)
          assert(error <= rsd * 3.0d, "Error should be within 3 std. errors.")
        }

What do you think?

Contributor

I just think it is better to test this a bit more thoroughly. The HLL++ tests use a similar test.

Contributor
@hvanhovell hvanhovell Sep 19, 2016

BTW: sorry for the very cryptic initial comment.

@@ -401,7 +401,10 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
var statsProperties: Map[String, String] =
Map(STATISTICS_TOTAL_SIZE -> stats.sizeInBytes.toString())
if (stats.rowCount.isDefined) {
statsProperties += (STATISTICS_NUM_ROWS -> stats.rowCount.get.toString())
statsProperties += STATISTICS_NUM_ROWS -> stats.rowCount.get.toString()
Contributor

General question: Is this how Hive stores this?

Contributor Author

Yes, we just give it a different name in Spark.

*/
override def visitAnalyze(ctx: AnalyzeContext): LogicalPlan = withOrigin(ctx) {
if (ctx.partitionSpec == null &&
ctx.identifier != null &&
ctx.identifier.getText.toLowerCase == "noscan") {
AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier).toString)
AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier))
} else if (ctx.identifierSeq() == null) {
Contributor

Why not make the identifierSeq non-optional in the grammar? It saves a lot of typing :)

Contributor Author

Yeah, good idea

StructField(name = "c3", dataType = IntegerType, nullable = true) ::
StructField(name = "c4", dataType = LongType, nullable = true) :: Nil)
val expectedBasicStats = BasicColStats(
dataType = ByteType, numNulls = 2, max = Some(3), min = Some(1), ndv = Some(3))
Contributor

It is weird that this works. The Min and the Max value should not be equal for Short and Long types.

Contributor Author

Can you explain more about this?

@rxin
Contributor

rxin commented Sep 17, 2016

What's "basic" about? Are we going to have something that's not basic in the future (e.g. histogram)? If yes, should those go into a separate class or just in ColumnStats?

@wzhfy
Contributor Author

wzhfy commented Sep 17, 2016

@rxin Yeah, I think it's better to move histograms into ColumnStats than to maintain two members like BasicColStats and Histograms. Let me rename BasicColStats as ColumnStats so that all the column-level stats go into this structure.

}
}

case class BasicColStats(
Member

Add a section of comments explaining the meaning of each statistic? E.g., readers may not know what ndv means.

Contributor Author

@viirya thanks for the advice.

case o => o
}
assert(operators.size == 1)
if (operators.head.getClass != c) {
Contributor

I don't think this is good style; parsing the SQL string, constructing an expected logical plan, and then comparing the two looks much clearer.
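
A sketch of that test style, using the existing parser and plan-comparison test helpers (names here are illustrative):

    // Parse the SQL string and compare it against an explicitly constructed plan.
    val parsed = spark.sessionState.sqlParser.parsePlan(
      "ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS key, value")
    val expected = AnalyzeColumnCommand(TableIdentifier("t"), Seq("key", "value"))
    comparePlans(parsed, expected)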

Contributor Author

OK, then I'll fix it.

@SparkQA

SparkQA commented Sep 30, 2016

Test build #66131 has finished for PR 15090 at commit 3335af6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 30, 2016

Test build #66150 has finished for PR 15090 at commit 06819dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 30, 2016

Test build #66188 has finished for PR 15090 at commit 95c2d2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val relation = spark.sessionState.catalog.lookupRelation(tableIdent)
val columnStats =
AnalyzeColumnCommand(tableIdent, columns.map(_.name)).computeColStats(spark, relation)._2
expectedColStatsSeq.foreach { expected =>
Contributor

How about ...foreach { case (field, expectedStat) => ? Then we can use field.name instead of expected._1.name.

Contributor Author

Yeah, that's better.

*/
case class ColumnStat(statRow: InternalRow) {

def forNumeric[T <: AtomicType](dataType: T): NumericColumnStat[T] = {
Contributor

We have NumericType; can we use that?

Contributor Author
@wzhfy wzhfy Oct 1, 2016

But we also need to include DateType and TimestampType, so I use AtomicType.

dataType = IntegerType,
colStat = colStat,
expectedColStat = ColumnStat(InternalRow.fromSeq(
Seq(0L, values.max, values.min, values.distinct.length.toLong))),
Contributor

I think we should write InternalRow(0L, values.max, ...) directly

df.write.format("json").saveAsTable(tmpTable)

sql(s"CREATE TABLE $table (c1 int) USING PARQUET")
sql(s"INSERT INTO $table SELECT * FROM $tmpTable")
Contributor

We can just write INSERT INTO $table SELECT 1, so we don't need the temp table.

val totalSize = BigInt(catalogTable.properties.get(STATISTICS_TOTAL_SIZE).get)
// TODO: we will compute "estimatedSize" when we have column stats:
// average size of row * number of rows
if (catalogTable.properties.filterKeys(_.startsWith(STATISTICS_PREFIX)).nonEmpty) {
Contributor

Should we make catalogTable.properties.filterKeys(_.startsWith(STATISTICS_PREFIX)) a variable? It's used twice.

@cloud-fan
Contributor

LGTM except for some minor comments; thanks for working on it!

if (!attributesToAnalyze.contains(expr)) {
attributesToAnalyze += expr
} else {
logWarning(s"Duplicated column: $col")
Member

You need to explain the context. Otherwise, the log message becomes useless.

val tableIdent = TableIdentifier(table, Some("default"))
val relation = spark.sessionState.catalog.lookupRelation(tableIdent)
val columnStats =
AnalyzeColumnCommand(tableIdent, columns.map(_.name)).computeColStats(spark, relation)._2
Member
@gatorsmile gatorsmile Oct 1, 2016

Nit:

val (_, columnStats) =
  AnalyzeColumnCommand(tableIdent, columnsToAnalyze).computeColStats(spark, relation)

val tableIdent = TableIdentifier(table, Some("default"))
val relation = spark.sessionState.catalog.lookupRelation(tableIdent)
val columnStats =
AnalyzeColumnCommand(tableIdent, columnsToAnalyze).computeColStats(spark, relation)._2
Member

The same here.

@SparkQA

SparkQA commented Oct 1, 2016

Test build #66226 has finished for PR 15090 at commit 734abad.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Oct 1, 2016

retest this please

Seq(numNulls(e), avgLength(e), maxLength(e))
}

def booleanColumnStat(e: Expression): Seq[Expression] = {
Member

All the above 13 functions should be private, right?

}
}

def apply(e: Attribute, relativeSD: Double): CreateStruct = e.dataType match {
Member

Nit: e -> attr

@SparkQA

SparkQA commented Oct 1, 2016

Test build #66227 has finished for PR 15090 at commit 734abad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
if (duplicatedColumns.nonEmpty) {
logWarning(s"Duplicated columns ${duplicatedColumns.mkString("(", ", ", ")")} detected " +
s"when analyzing columns ${columnNames.mkString("(", ", ", ")")}, ignoring them.")
Member

How about this?

      logWarning("A duplicate column name was detected in `ANALYZE TABLE` statement. " +
        s"Input columns: ${columnNames.mkString("(", ", ", ")")}. " +
        s"Duplicate columns: ${duplicatedColumns.mkString("(", ", ", ")")}.")

@gatorsmile
Member

gatorsmile commented Oct 2, 2016

Could you add a positive test case with case sensitivity turned on? The scenario is like:

    withTable(table) {
      withSQLConf("spark.sql.caseSensitive" -> "true") {
        sql(s"CREATE TABLE $table (c1 int, C1 double) STORED AS PARQUET")
        sql(s"INSERT INTO $table SELECT 1, 3.0")
        sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS c1, C1")
        ...
      }
    }

Note: the above is just the SQL statements; we still need the verification logic.

This is to verify whether the case of column names is preserved in ANALYZE COLUMN.
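
One possible shape for that verification, reusing the computeColStats pattern from the test code quoted earlier (a sketch only; the exact assertions depend on the final test helpers):

    // With case sensitivity on, "c1" and "C1" must resolve to two different columns,
    // each with its own stats entry under the case-preserved name.
    val tableIdent = TableIdentifier(table, Some("default"))
    val relation = spark.sessionState.catalog.lookupRelation(tableIdent)
    val (_, columnStats) =
      AnalyzeColumnCommand(tableIdent, Seq("c1", "C1")).computeColStats(spark, relation)
    assert(columnStats.size == 2)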

@gatorsmile
Member

gatorsmile commented Oct 2, 2016

Another test case for Unicode column names in ANALYZE COLUMN:

    // scalastyle:off
    // non ascii characters are not allowed in the source code, so we disable the scalastyle.
    val colName1 = "`列1`"
    val colName2 = "`列2`"
    // scalastyle:on
    withTable(table) {
      sql(s"CREATE TABLE $table ($colName1 int, $colName2 double) STORED AS PARQUET")
      sql(s"INSERT INTO $table SELECT 1, 3.0")
      sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS $colName2, $colName1")
      ...
    }

@gatorsmile
Member

gatorsmile commented Oct 2, 2016

LGTM except the above minor comments.

The test cases mentioned above need to be added to sql/hive/, since the correctness could be affected by the behavior of the Hive metastore. So far, your test cases have not covered Hive serde tables, so the above two scenarios create Hive-serde tables.

If possible, also add a test case for refreshTable in sql/hive. That is also not verified.

@viirya
Member

viirya commented Oct 3, 2016

LGTM except for one minor comment regarding the ndv config documentation.

@rxin
Contributor

rxin commented Oct 3, 2016

Alright, looking at the outstanding comments, it looks like none of the issues are structural anymore. I'm going to merge this in order to speed up the work. @wzhfy please make sure you fix the remaining issues in follow-up pull requests.

Merging into master.

@asfgit asfgit closed this in 7bf9212 Oct 3, 2016
@wzhfy
Contributor Author

wzhfy commented Oct 4, 2016

@rxin Thanks, I'll fix them in the follow-up PR.

@wzhfy
Contributor Author

wzhfy commented Oct 5, 2016

@gatorsmile Hive tables don't support case-sensitive column names, so I used data source tables in the added test cases. The PR has already been sent: #15360.
